Senior Manager, Site Reliability Engineering (SRE) jobs in United States

GHX · 1 day ago

Senior Manager, Site Reliability Engineering (SRE)

Global Healthcare Exchange (GHX) is a healthcare business and data automation company that empowers healthcare organizations to enhance patient care and maximize savings. The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization in delivering reliable, scalable, and resilient platforms and services, overseeing the strategy and implementation of a unified observability platform while driving a culture of reliability within the engineering teams.

Hospital & Health Care
H1B Sponsor Likely

Responsibilities

Hire, lead, and mentor a high-performing SRE team across geographies
Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives
Establish a healthy 24x7 on-call model, ensuring coverage while promoting team well-being
Drive a blameless culture through structured postmortems and RCA follow-up actions
Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry
Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end-user experience
Implement APM (Application Performance Monitoring) to trace performance across distributed systems
Establish dashboards, metrics, and proactive alerting to identify anomalies early
Drive adoption of AIOps and predictive analytics for proactive reliability improvements
Define and manage SLIs, SLOs, SLAs, and Error Budgets across services
Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets
Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) through automation, faster detection, and better instrumentation
Perform capacity planning, scalability reviews, and resiliency testing
Lead major incident response, coordinating communications with executives and stakeholders
Drive root cause analysis (RCA) and implement long-term fixes
Partner with ITSM teams to align with incident, problem, and change management processes
Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices
Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices
Provide guidance on instrumentation, reliability design, and operational readiness for new services
Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness
Champion reliability as a shared responsibility across development and operations

Qualifications

Incident response, SLIs, SLOs, SLAs, AWS experience, Containers & orchestration, Unified observability, Observability tools, Python, Leadership skills, Stakeholder management, Networking fundamentals, ITIL/ITSM processes, Communication skills

Required

12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles
Proven expertise in unified observability, monitoring, and alerting across infra, apps, APM, and databases
Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds
Hands-on with incident response, RCA, MTTR/MTTD reduction, and on-call management
Deep understanding of SLIs, SLOs, SLAs, and Error Budgets
Strong AWS experience (EC2, ECS, EKS, networking, scaling groups)
Hands-on with containers & orchestration (Docker, Kubernetes)
Proficiency in Python, Java, C#, and shell scripting for automation
Knowledge of networking fundamentals, distributed systems, and high-availability architectures
Familiarity with ITIL/ITSM processes (incident, problem, change)
Strong leadership, stakeholder management, and communication skills

Preferred

Experience in large-scale SaaS or product-driven environments
Hands-on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle
Experience with chaos engineering, resiliency testing, and disaster recovery planning
Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA
Experience managing global SRE teams across time zones
Proven ability to embed reliability into engineering culture via SLOs and Error Budgets

Benefits

Health, vision, and dental insurance
Accident and life insurance
401k matching
Paid time off
Education reimbursement

Company

GHX is a software-as-a-service company that’s reducing the cost of doing business in healthcare by automating supply chain processes and improving visibility into the products used in patient care.

H1B Sponsorship

GHX has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
Trends of Total Sponsorships
2025 (5)
2024 (9)
2023 (9)
2022 (3)
2021 (13)
2020 (2)

Funding

Current Stage
Late Stage

Leadership Team

Tina Vatanka Murphy
President & CEO
CJ Singh
Chief Technology Officer
Company data provided by Crunchbase