Member of Technical Staff, HPC Operations Engineering Manager - MAI SuperIntelligence Team jobs in United States

Microsoft · 12 hours ago

Member of Technical Staff, HPC Operations Engineering Manager - MAI SuperIntelligence Team

Microsoft AI is seeking an experienced HPC Operations Engineering Manager to join our Infrastructure Team. In this role, you’ll lead a team of Site Reliability Engineers to ensure the reliability and efficiency of our large-scale distributed AI infrastructure.

Artificial Intelligence (AI) · Enterprise Software · Cloud Computing · Cyber Security · Software · Professional Services · Information Technology · Agentic AI · Application Performance Management · Business Development · DevOps · Information Services · Management Information Systems · Network Security
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems
Observability: Design and help maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure
Automation & Tooling: Lead the development of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

Qualifications

Kubernetes · Docker · Public cloud platforms · Site Reliability Engineering · Programming/scripting skills · Monitoring tools · High-performance computing · CI/CD pipelines · People management · Capacity planning · Collaboration

Required

Bachelor's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering leadership roles AND 8+ years of experience with Kubernetes, Docker, and container orchestration AND 8+ years of experience with public cloud platforms such as Azure/AWS/GCP and infrastructure-as-code AND 6+ years of experience with programming/scripting skills, including but not limited to Python, Go, or Bash, OR equivalent experience

Preferred

Master's Degree in Computer Science or related technical field AND 12+ years of technical engineering experience AND 10+ years of experience with Kubernetes, Docker, and container orchestration AND 10+ years of experience with public cloud platforms such as Azure/AWS/GCP and infrastructure-as-code, OR equivalent experience
6+ years people management experience
Experience with monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Experience running large-scale GPU clusters for ML/AI workloads
Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators)
Knowledge of CI/CD pipelines for inference and ML model deployment
Solid knowledge of distributed systems, networking, and storage
Familiarity with ML training/inference pipelines
Background in capacity planning & cost optimization for GPU-heavy environments

Company

Microsoft

Microsoft is a software corporation that develops, manufactures, licenses, supports, and sells a range of software products and services.

H1B Sponsorship

Microsoft has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Chart: Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (9192)
2024 (9343)
2023 (7677)
2022 (11403)
2021 (7210)
2020 (7852)

Funding

Current Stage
Public Company
Total Funding
$1M
Key Investors
Technology Venture Investors
2022-12-09 · Post-IPO Equity
1986-03-13 · IPO
1981-09-01 · Series Unknown · $1M

Leadership Team

Satya Nadella
Chairman and CEO
Vukani Mngxati
Chief Executive Officer - Microsoft South Africa
Company data provided by Crunchbase