Senior Site Reliability Engineer
UJET • Austin, TX, US; Remote, US; San Francisco, CA, US
Posted: April 17, 2026
Job Description
Position Overview
We’re looking for a Senior Site Reliability Engineer to help build and scale a high-impact SRE function. You’ll be a technical leader on a team responsible for improving system reliability, reducing operational toil, and establishing best practices across engineering. In this position, you’ll design how reliability works at UJET, influence engineering decisions, and build the tooling and processes that make production safer and more predictable.
Responsibilities
- Lead efforts to improve system reliability, scalability, and performance across critical services
- Define and implement SLIs/SLOs and error budgets, and use them to guide engineering priorities
- Design and develop observability systems (metrics, logging, tracing, alerting) that produce actionable alerts and data with minimal noise
- Lead complex incident response, acting as incident commander when needed
- Conduct postmortems focused on systemic causes rather than individual fault, and ensure the resulting corrective actions are completed
- Identify and eliminate toil through automation, tooling, and improved workflows
- Partner with product and platform teams on architecture decisions, production readiness, and designing systems that recover from failure
- Build reusable systems and “paved roads” that make it easier for teams to operate their services reliably
- Mentor other engineers and raise the overall operational maturity of the organization
Qualifications
- 6 to 10+ years of experience in SRE, infrastructure, or backend systems engineering
- Demonstrated experience owning reliability outcomes for complex, distributed systems
- Strong experience with cloud infrastructure (AWS, GCP, or Azure) and production-scale systems
- Deep understanding of observability, incident management, and system performance
- Proficiency in at least one programming language (e.g., Go, Python, Java) with a focus on automation and tooling
- Ability to change how other teams work without having managerial authority over them
- Ability to make clear, calm decisions during incidents by following a defined process
Nice to Have
- Experience building or scaling SRE practices (SLOs, incident frameworks, on-call models)
- Kubernetes/container orchestration experience
- Infrastructure as Code (Terraform, etc.)
- Experience with high-growth or scaling systems
- Background in performance engineering or capacity planning
Success Criteria
- Critical services have clear, meaningful SLOs that drive engineering decisions
- Alerts are actionable, irrelevant alerts are reduced, and on-call workload is manageable
- Incidents are handled efficiently, and repeat issues decline over time
- Engineering teams adopt reliability best practices with minimal friction
- Toil is actively reduced through automation and better system design
Position Context
This is an early position in the company's SRE function. You will have direct input into how reliability standards and practices are established, which forms the foundation on which product engineering builds. UJET is changing how companies deliver customer experience, and product engineering is building the platform that makes that possible. The reliability of that infrastructure is what allows it to operate at the scale and consistency that transformation requires.
Annual US Hiring Range: $100,000 - $120,000
*A candidate’s actual placement within this range will depend on geographic location, work experience, education, and/or skill level.
#LI-Remote
#LI-Hybrid