NVIDIA · 9 hours ago

Senior Site Reliability Engineer, Observability

Santa Clara, CA

Full-time

Onsite

Senior Level, Lead/Staff

$184K/yr - $288K/yr

10+ years exp

NVIDIA sits at the center of the AI revolution and is seeking a Senior Site Reliability Engineer to work on its data and observability platforms. The role involves architecting and operating large-scale observability systems, designing resilient telemetry pipelines, and collaborating with platform, infrastructure, and application teams to improve operational reliability.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Architecting and operating large-scale observability systems that span global regions and support AI, data, and platform services

Designing resilient pipelines for metrics, logs, traces, profiling, and events that keep critical systems visible and debuggable

Working closely with platform, infrastructure, and application teams to establish telemetry standards, instrumentation patterns, and integration workflows

Automating deployments, scaling workflows, and maintenance tasks to cut down toil and level up operational maturity across the stack

Defining and maintaining SLOs, SLIs, error budgets, dashboards, and alerting models that guide reliability decisions company-wide

Building self-service tooling and frameworks that make observability easy to adopt for engineers across NVIDIA

Studying real system behavior to uncover bottlenecks, scaling limits, failure modes, and long-term architecture risks

Running day-to-day operations including upgrades, performance tuning, break/fix, and rotations that keep the platform healthy

Leading incident response and root-cause investigations, then driving the follow-through to eliminate repeat failures

Guiding engineers through design reviews, operational best practices, and reliability-focused decision making

Qualifications

Observability platforms, Distributed systems, Python programming, Go programming, Linux internals, Telemetry pipelines, SLOs, SLIs, Incident response, Clear communication, Collaboration

Required

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience

10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer, and 5+ years designing, building, and running observability platforms at scale

Deep hands-on experience with open-source observability stacks, including Prometheus/Thanos/Mimir for metrics, Loki or Elasticsearch/OpenSearch for logs, and Tempo/Jaeger/OpenTelemetry for tracing and profiling

Strong programming ability in Python and Go, with Java experience considered a plus

Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering

Experience architecting multi-region, multi-tenant telemetry pipelines with high availability and strong durability guarantees

Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies

Strong understanding of SLOs, SLIs, error budgets, incident response, and the operational processes that support reliable systems

Ability to analyze complex distributed systems, pinpoint failure modes, and drive data-informed debugging and root cause analysis

Clear communicator who can collaborate effectively across product, platform, infrastructure, and application engineering teams

Preferred

Designed or led the architecture of a global observability platform supporting thousands of services with strict reliability and performance requirements

Contributed meaningfully to OpenTelemetry, Prometheus, Grafana, or other major observability open-source projects

Built high-throughput ingestion pipelines and long-term storage systems, with a strong focus on cost efficiency, retention strategy, and query performance

Specialized in high-cardinality telemetry, multi-tenant isolation, and advanced retention or tiered storage models

Worked hands-on with Kafka, Spark, Flink, or large-scale collectors in ultra-high-scale production environments where observability is mission critical

Benefits

Equity

Company

NVIDIA

Glassdoor: 4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)

Trends of Total Sponsorships

2025 (1877)

2024 (1355)

2023 (976)

2022 (835)

2021 (601)

2020 (529)

Funding

Current Stage

Public Company

Total Funding

$4.09B

Key Investors

ARPA-E, ARK Investment Management, SoftBank Vision Fund

2023-05-09: Grant, $5M

2022-08-09: Post-IPO Equity, $65M

2021-02-18: Post-IPO Equity

Leadership Team

Jensen Huang

Founder and CEO

Michael Kagan

Chief Technology Officer

Company data provided by Crunchbase