GHX · 1 day ago

Senior Manager, Site Reliability Engineering (SRE)

United States

Full-time

Remote

Senior Level, Lead/Staff, Director/Executive

$143K/yr - $191K/yr

12+ years exp

Global Healthcare Exchange (GHX) is a healthcare business and data automation company that empowers healthcare organizations to enhance patient care and maximize savings. The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization in delivering reliable, scalable, and resilient platforms and services, overseeing the strategy and implementation of a unified observability platform while driving a culture of reliability within the engineering teams.

Hospital & Health Care

H1B Sponsor Likely

Responsibilities

Hire, lead, and mentor a high-performing SRE team across geographies

Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives

Establish a healthy 24x7 on-call model, ensuring coverage while promoting team well-being

Drive a blameless culture through structured postmortems and RCA follow-up actions

Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry

Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end-user experience

Implement APM (Application Performance Monitoring) to trace performance across distributed systems

Establish dashboards, metrics, and proactive alerting to identify anomalies early

Drive adoption of AIOps and predictive analytics for proactive reliability improvements

Define and manage SLIs, SLOs, SLAs, and Error Budgets across services

Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets

Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) through automation, faster detection, and better instrumentation

Perform capacity planning, scalability reviews, and resiliency testing

Lead major incident response, coordinating communications with executives and stakeholders

Drive root cause analysis (RCA) and implement long-term fixes

Partner with ITSM teams to align with incident, problem, and change management processes

Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices

Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices

Provide guidance on instrumentation, reliability design, and operational readiness for new services

Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness

Champion reliability as a shared responsibility across development and operations

Qualifications

Incident response, SLIs, SLOs, SLAs, AWS experience, Containers & orchestration, Unified observability, Observability tools, Python, Leadership skills, Stakeholder management, Networking fundamentals, ITIL/ITSM processes, Communication skills

Required

12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles

Proven expertise in unified observability, monitoring, and alerting across infrastructure, applications, APM, and databases

Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds

Hands-on experience with incident response, RCA, MTTR/MTTD reduction, and on-call management

Deep understanding of SLIs, SLOs, SLAs, and Error Budgets

Strong AWS experience (EC2, ECS, EKS, networking, Auto Scaling groups)

Hands-on experience with containers & orchestration (Docker, Kubernetes)

Proficiency in Python, Java, C#, and shell scripting for automation

Knowledge of networking fundamentals, distributed systems, and high-availability architectures

Familiarity with ITIL/ITSM processes (incident, problem, change)

Strong leadership, stakeholder management, and communication skills

Preferred

Experience in large-scale SaaS or product-driven environments

Hands-on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle

Experience with chaos engineering, resiliency testing, and disaster recovery planning

Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA

Experience managing global SRE teams across time zones

Proven ability to embed reliability into engineering culture via SLOs and Error Budgets

Benefits

Health, vision, and dental insurance

Accident and life insurance

401k matching

Paid time off

Education reimbursement

Company

GHX

Glassdoor rating: 3.6

GHX is a software-as-a-service company that’s reducing the cost of doing business in healthcare by automating supply chain processes and improving visibility into the products used in patient care.

Founded in 2000

Louisville, CO, US

1001-5000 employees

http://www.ghx.com

H1B Sponsorship

GHX has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data from the US Department of Labor).

[Chart: Distribution of Different Job Fields Receiving Sponsorship]

Trends of Total Sponsorships

2025 (5)

2024 (9)

2023 (9)

2022 (3)

2021 (13)

2020 (2)