Infra Engineer - SRE(Kubernetes) jobs in United States

GMI Cloud · 20 hours ago

Infra Engineer - SRE(Kubernetes)

GMI Cloud is a fast-growing AI infrastructure startup based in Silicon Valley, working on cutting-edge technologies that power the future of artificial intelligence. They are seeking a dynamic and hands-on Site Reliability Engineer to ensure the stability, efficiency, and reliability of large-scale AI/ML clusters in their data center.
Artificial Intelligence (AI), Cloud Computing, Information Technology, Data Center, Service Industry
H1B Sponsor Likely
Hiring Manager
Peggy Zhou

Responsibilities

Design, implement and maintain scalable AI/ML infrastructure solutions
Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems
Automate deployment, configuration and management of infrastructure resources
Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades, and decommissioning
Implement CI/CD pipelines for infrastructure deployment and orchestration
Ensure security, compliance and best practices across infrastructure
Manage incident response for infrastructure resources (GPU, CPU, storage, network)
Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction
Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate
Travel regionally and internationally to GMI data center locations

Qualifications

Kubernetes, Infrastructure automation, Linux administration, Python, Troubleshooting skills, Communication skills, Teamwork abilities

Required

Bachelor's degree in Computer Science or related field
3+ years of experience in data center operations, infrastructure, or systems engineering
Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform)
Familiarity with container orchestration platforms and their ecosystem (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm)
Familiarity with Linux system administration and scripting (Python, Bash)
Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki
Strong troubleshooting skills and ability to analyze system logs and performance metrics
Excellent communication and teamwork abilities

Preferred

Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks

Company

GMI Cloud

GMI Cloud provides GPU cloud access for generative AI applications.

H1B Sponsorship

GMI Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship: [chart not available]
Trends of Total Sponsorships: 2025 (2)

Funding

Current Stage: Growth Stage
Total Funding: $82M
Key Investors: Headline Asia (formerly Infinity Ventures), Banpu NEXT
2024-10-29: Series A · $15M
2024-10-29: Debt Financing · $67M
2024-07-16: Corporate Round

Leadership Team

Alex Yeh, Founder & CEO
Tim Chen, Chief Financial Officer
Company data provided by Crunchbase