MTS - Site Reliability Engineer jobs in United States

Microsoft · 4 days ago

MTS - Site Reliability Engineer

Microsoft is a leading technology company focused on innovation and accessibility in artificial intelligence. They are seeking an experienced Site Reliability Engineer (SRE) to join their infrastructure team, responsible for maintaining the reliability and efficiency of large-scale distributed AI systems.

Agentic AI, Application Performance Management, Artificial Intelligence (AI), Business Development, DevOps, Information Services, Information Technology, Management Information Systems, Network Security, Software
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems
Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure
Analyze system performance and scalability, and optimize resource utilization (compute, GPU clusters, storage, networking)
Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Ensure data privacy, compliance, and secure operations across model training and serving environments
Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

Qualifications

Kubernetes, Docker, Cloud platforms, Monitoring tools, Python, CI/CD pipelines, Distributed systems, GPU clusters, Incident management, Collaboration

Required

4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles

Preferred

Strong proficiency in Kubernetes, Docker, and container orchestration
Knowledge of CI/CD pipelines for inference and ML model deployment
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Strong programming/scripting skills in Python, Go, or Bash
Solid knowledge of distributed systems, networking, and storage
Experience running large-scale GPU clusters for ML/AI workloads
Familiarity with ML training/inference pipelines
Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators)
Background in capacity planning & cost optimization for GPU-heavy environments

Benefits

Work on cutting-edge infrastructure that powers the future of Generative AI
Collaborate with world-class researchers and engineers
Impact millions of users through reliable and responsible AI deployments
Competitive compensation
Equity options
Comprehensive benefits

Company

Microsoft

Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.

H1B Sponsorship

Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025: 9,192
2024: 9,343
2023: 7,677
2022: 11,403
2021: 7,210
2020: 7,852

Funding

Current Stage: Public Company
Total Funding: $1M
Key Investors: Technology Venture Investors
2022-12-09: Post-IPO Equity
1986-03-13: IPO
1981-09-01: Series Unknown, $1M

Leadership Team

Satya Nadella, Chairman and CEO
Vukani Mngxati, Chief Executive Officer - Microsoft South Africa
Company data provided by Crunchbase