Replit · 3 months ago

Staff Site Reliability Engineer

Foster City, CA

Full-time

Onsite

Lead/Staff

$220K/yr - $325K/yr

8+ years exp

Replit is a software creation platform that enables users to build applications using natural language. They are seeking a Staff Site Reliability Engineer to ensure the reliability, scalability, and performance of their infrastructure while implementing automation and best practices. The role involves leading incident management, driving automation, and mentoring the engineering team in maintaining high reliability standards.

Artificial Intelligence (AI) · Cloud Computing · Developer Tools · Information Technology · Software

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Architect and Implement Observability: Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions. Create dashboards and metrics that provide real-time visibility into system health and performance, enabling proactive issue detection

Define and Drive Reliability Standards: Work with product and engineering teams to define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to monitor and report on these metrics, holding teams accountable and ensuring we maintain high reliability standards while balancing innovation speed

Lead Incident Management and Response: Act as a senior leader during high-impact incidents, guiding the team to rapid resolution. Conduct thorough, blameless post-mortems and drive the implementation of preventative measures. Develop and refine runbooks and build automation to reduce Mean Time To Recovery (MTTR)

Drive Automation and Infrastructure as Code: Architect, build, and improve automation to eliminate toil and operational work. Design and maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios

Optimize Performance on Kubernetes: Collaborate with core infrastructure and product teams to performance-tune and optimize our large-scale cloud deployments, with a deep focus on Kubernetes, Docker, and GCP. Identify and resolve performance bottlenecks, implement capacity planning strategies, and reduce latency across global regions

Debug and Harden Distributed Systems: Dive deep into debugging extremely difficult technical problems across the stack. Use your findings to design and implement long-term fixes that make our systems and products more robust, operable, and easier to diagnose

Provide Staff-Level Guidance: Review feature and system designs from across the company, acting as a key owner for the reliability, scalability, security, and operational integrity of those designs

Educate and Mentor: Educate, mentor, and hold accountable the broader engineering team to improve the reliability of our systems, making reliability a core value of the Replit engineering culture

Build and Integrate: Write high-quality, well-tested code in Python or Go to meet the needs of your customers, whether it's building new internal tools or integrating with third-party vendors

Qualifications

Site Reliability Engineering · Kubernetes · Python · Observability solutions · Infrastructure as Code · Incident management · Distributed systems · Go · Google Cloud Platform · Communication skills · Mentoring

Required

8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)

Strong programming skills in languages like Python or Go. You write high-quality, well-tested code

Deep understanding of distributed systems. You've designed, built, scaled, and maintained production services and know how to compose a service-oriented architecture

Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies

Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions (e.g., metrics, logging, tracing)

Strong incident management skills with extensive experience leading incident response for complex systems and demonstrated critical thinking under pressure

Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools

Excellent written and verbal communication skills, with an ability to explain complex technical concepts clearly and simply and a bias toward open, transparent cultural practices

Strong interpersonal skills, with experience working with and mentoring engineers from junior to principal levels

A willingness to dive into understanding, debugging, and improving any layer of the stack

You're passionate about making software creation accessible and empowering the next generation of builders

Preferred

Deep experience with Google Cloud Platform (GCP) services and tools

Expert-level knowledge of modern observability platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry)

Experience designing and building reliable systems capable of handling high throughput and low latency

Significant experience with Go and Terraform

Familiarity with working in rapid-growth, startup environments

Experience writing company-facing blog posts and training materials

Benefits

Competitive Salary & Equity

401(k) Program

Health, Dental, Vision and Life Insurance

Short Term and Long Term Disability

Paid Parental, Medical, Caregiver Leave

Commuter Benefits

Monthly Wellness Stipend

Autonomous Work Environment

In Office Set-Up Reimbursement

Flexible Time Off (FTO) + Holidays

Quarterly Team Gatherings

In Office Amenities

Company

Replit

Replit is the most secure agentic platform for production-ready apps.

Founded in 2016

Foster City, California, USA

51-200 employees

https://replit.com

H1B Sponsorship

Replit has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

[Chart: Distribution of Different Job Fields Receiving Sponsorship]

Trends of Total Sponsorships

2025 (8)

2024 (5)

2023 (2)

2022 (2)

Funding

Current Stage

Growth Stage

Total Funding

$472.02M

Key Investors

Prysm Capital · Craft Ventures · Andreessen Horowitz

2025-07-30 · Series C · $250M

2023-11-06 · Series B · $20M

2023-04-25 · Series B · $97.4M

Leadership Team

Amjad Masad

CEO

Haya Odeh

Co-Founder

Recent News

Benzinga.com

Replit CEO Says 'Anyone Who Has Ideas Should Potentially Be Wealthy…That's The True Promise Of Capitalism'

2026-01-25

WebProNews

Replit’s AI Vibe Coding Revolutionizes iOS App Development by 2026

2026-01-19

36kr.com

After laying off 50% of the staff, he achieved a remarkable comeback through AI programming. His Annual Recurring Revenue (ARR) exceeded 100 million, and the company's valuation soared to 60 billion.

2026-01-17

Company data provided by Crunchbase