Senior Staff Production Engineer
Zscaler • San Jose, California, USA
Posted: May 29, 2026
Job Description
Role
We are looking for a Sr. Staff Production Engineer to join our team. This role is available as a hybrid opportunity, 3 days a week in San Jose, CA, or as a remote position, reporting to Production Engineering in the Cloud Infrastructure & Operations department. Join Zscaler to be a force multiplier for the reliability of a global platform protecting over 15 million users.
In this role, you will provide the technical vision and hands-on execution to drive an "automation-first" culture across the company. By maturing our observability and architectural standards, you will directly reduce our Mean Time to Mitigate (MTTM) and shape the scalability of our globally distributed, multi-cloud infrastructure.
What You’ll Do (Role Expectations)
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
- Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
- Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
- Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
- Collaborate with Engineering and partner teams to conduct operability reviews
Who You Are (Success Profile)
- You act like an owner with a bias for action and integrity.
- You are a pragmatic builder obsessed with creating, iterating, and shipping.
- You champion simplicity by distilling complex problems into clear, actionable plans.
- You are data-driven, valuing evidence over assumptions.
- You think at scale, building solutions and processes that last in a high-growth global organization.
What We’re Looking for (Minimum Qualifications)
- 8+ years of experience managing reliability, scalability, and availability for large-scale production services
- Deep expertise in programming (e.g., Python, Go, or C/C++)
- Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
- Experience in high-stakes incident management and participation in a 24/7 on-call rotation
- Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews
What Will Make You Stand Out (Preferred Qualifications)
- Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform)
- Experience with chaos engineering and disaster recovery planning at scale
- Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec), with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals