AI Systems Engineer - HPC jobs in United States

AMD · 2 hours ago

AI Systems Engineer - HPC

AMD is a company focused on building products that accelerate next-generation computing experiences. The AI Systems Engineer will design, develop, and administer High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers, while collaborating with cross-functional teams to support AI-related projects.

Artificial Intelligence (AI), Cloud Computing, Computer, Embedded Systems, GPU, Hardware, Semiconductor
Growth Opportunities
No H1B
Hiring Manager
Matthew Fesl

Responsibilities

Develop, implement, and maintain GPU-based clusters, ensuring optimal performance
Administer ML/AI platforms (distributed ML services, LLMs, and AI inferencing) by managing deployments, resource allocation, monitoring, and security
Automate system provisioning and cluster management end to end
Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise
Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards
Use AI/ML to continuously improve the internal processes and tools the team uses for end-to-end delivery of its services

Qualifications

HPC infrastructure engineering, GPU cluster management, AI workload schedulers, Python, Kubernetes management, Automation tools, Problem-solving skills, Communication skills

Required

Design, development, and administration of High-Performance Computing (HPC) infrastructure, GPU clusters, and AI workload schedulers
Passion for learning and for large-scale distributed computing in AI and HPC workloads
Ownership of the end-to-end outcomes of your work
Desire to build scalable and highly performant HPC/AI/Data services with AMD hardware, software, people and processes
Curiosity to learn and improve scalable HPC systems
Significant experience working across a globally distributed organization

Preferred

Experience developing Python-based AI applications and UIs
HPC infrastructure engineering for AI/HPC domain
SLURM and Kubernetes management
Managing GPU clusters and optimizing GPU-based services, tools, and software
Experience creating web services with an HPC backend (such as AI)
Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, GPU drivers, and cluster interconnects with 400G networking
Demonstrated experience with AI workload schedulers and allocation optimization
Automation and monitoring tools: Ansible/SaltStack, Terraform, Prometheus, Grafana
Strong organizational, problem-solving, and troubleshooting skills, with the ability to manage multiple projects simultaneously
Excellent verbal and written communication skills, with the ability to collaborate effectively with team members and stakeholders at all levels of the organization

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by crunchbase