AMD · 2 weeks ago
AI Infrastructure Engineer – Slurm Platform
AMD builds innovative products that accelerate next-generation computing experiences. The company is seeking a senior technical contributor to drive the delivery of software solutions for AI software development and machine learning model training, collaborating with various teams to optimize their use cases.
Responsibilities
Design, deploy, and operate Slurm clusters across on-prem and multi-cloud GPU environments (Azure, OCI, Vultr, DigitalOcean, etc.)
Integrate Slurm with the broader orchestration ecosystem, enabling hybrid scheduling, unified authentication, and telemetry pipelines
Build platform features that improve developer experience — e.g., job submission APIs, automated environment setup, and metrics dashboards
Optimize cluster utilization and scheduling for GPU and CPU workloads; develop fair-share, QoS, and preemption policies
Monitor cluster health and performance, implementing observability pipelines using Prometheus, Grafana, and custom exporters
Collaborate with internal developers (framework, compiler, and application teams) to understand workload needs and translate them into scalable Slurm features
Contribute to storage and network integration, ensuring performant I/O (e.g., NFS, Lustre, Weka) and high-speed interconnect configuration (InfiniBand, NVLink)
Support the job lifecycle — from image builds and environment modules to debugging and performance tuning of Slurm jobs
Qualifications
Required
8+ years of experience managing and automating HPC or Slurm clusters in production environments
Deep understanding of Linux systems, job schedulers (Slurm), and resource management for GPU-accelerated workloads
Strong troubleshooting skills across compute, storage, and network layers
Proven ability to collaborate with developers and researchers to design scalable HPC solutions
Preferred
Experience integrating Slurm with Kubernetes or other control planes
Experience with HPC storage and I/O technologies (Lustre, ZFS, WekaFS, NFS)
Familiarity with metrics collection and visualization using Prometheus, Grafana, and Thanos
Exposure to CI/CD pipelines and DevOps practices for scientific or ML workloads
Understanding of machine learning workflows and frameworks (PyTorch, vLLM, SGLang)
Experience with infrastructure automation (e.g., Ansible, Terraform) and scripting (Python, Bash)
Bachelor's or Master's degree in related discipline preferred
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
H1B Sponsorship
AMD has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not reproduced)
Trends of Total Sponsorships
2025 (836)
2024 (770)
2023 (551)
2022 (739)
2021 (519)
2020 (547)
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase