Machine Learning Systems Administrator - HPC Infrastructure jobs in United States

Zyphra · 4 hours ago

Machine Learning Systems Administrator - HPC Infrastructure

Zyphra is an artificial intelligence company based in San Francisco, California. The role involves maintaining and developing core infrastructure for machine learning research and production, ensuring smooth operations and scalable workflows.

Artificial Intelligence (AI), Cloud Computing, Machine Learning, Software
H1B Sponsored

Responsibilities

Administration and automation of our Linux-based cluster environments
Managing user onboarding/offboarding, security auditing, and access control
Monitoring system resources and job scheduling
Supporting and improving developer workflows (e.g., VSCode compatibility, Docker)
Enabling and supporting AI/ML workloads, including large-scale training jobs
Operating across a wide range of infrastructure concerns, and owning and improving critical systems
You’ll have a significant impact on both developer productivity and training and inference performance

Qualifications

Linux system administration, Scripting languages, Containerized environments, Job scheduling systems, Infrastructure as code, Cloud platforms, Debugging skills, Communication skills

Required

Strong experience with Linux system administration, user and access management, and automation
Demonstrated expertise in scripting languages for system tooling and automation (Bash, Python, etc.)
Familiarity with containerized environments (e.g., Docker) and job scheduling systems like Slurm
Experience building tooling for cluster validation and reliability (GPU, networking, storage health checks)
Experience setting up and managing developer tools and third-party services (e.g., cloud storage providers, Docker Hub, Slack, Gmail, Telegraf, experiment trackers)
Excellent debugging and troubleshooting skills across compute, storage, and networking
Strong communication skills and ability to collaborate across technical and non-technical teams

Preferred

Experience with infrastructure as code (e.g., Ansible, Terraform)
Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
Experience working with cloud platforms such as AWS, Azure, or GCP
Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Benefits

Comprehensive medical, dental, vision, and FSA plans
Competitive compensation and 401(k)
Relocation and immigration support on a case-by-case basis
On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

Company

Zyphra

Zyphra is a superintelligence research and product company based in San Francisco, California.

H1B Sponsorship

Zyphra has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Growth Stage
Total Funding: $100M
2025-06-09: Series A, $100M
2023-06-09: Seed
2021-11-18: Pre-Seed
Company data provided by Crunchbase