HPC/AI Linux Administrator (HPC Engineer 2/3) jobs in United States

Los Alamos National Laboratory · 16 hours ago

HPC/AI Linux Administrator (HPC Engineer 2/3)

Los Alamos National Laboratory is a multidisciplinary research institution engaged in strategic science on behalf of national security. They are seeking an HPC/AI Linux Administrator to operate and maintain high-performance computing systems, focusing on AI/ML infrastructure, while providing strategic design and mentoring to junior staff.

Security
Growth Opportunities
No H1B
Security Clearance Required
U.S. Citizen Only

Responsibilities

Join the High Performance Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world
Designing, operating and maintaining these systems requires highly skilled personnel who specialize in both the hardware and software aspects of High Performance Computing
Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment
This team is currently building on-premises cloud-like infrastructure to support the AI/ML/LLM needs of the laboratory
The Platforms Team is seeking to add highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL
This person will be an expert Linux Administrator who will help design, build and run our production NVIDIA DGX/HGX pods optimized for our environment and workflow
They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools
The successful candidate will participate in periodic on-call responsibilities managing HPC clusters and AI infrastructure, while actively growing their technical skills and staying up to date with the latest technologies in the field
In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports, to communicate findings internally and at conferences
The selected HPC/AI Linux Administrator (HPC Engineer 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads
Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines

Qualifications

Linux Administration, Configuration Management, Kubernetes, HPC Experience, Computer Networking, Scripting Bash, Scripting Python, AI/ML Workflows, Troubleshooting, Git, Cloud Technologies, Communication Skills, Mentoring

Required

Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Python, or similar languages
Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools
Troubleshooting and Technical Analysis: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems
Computer Networking Expertise: Working knowledge of networking concepts and practices
Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations)
Troubleshooting skills: Demonstrated ability to troubleshoot hardware and software errors, prioritizing problems and assessing impact to stakeholders, documenting problems and solutions
Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters
Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment
Computer Networking Expertise: High performance interconnects, preferably NVLink and InfiniBand networks
Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management
Filesystems: Knowledge of or demonstrated experience with parallel and distributed storage systems (e.g. Lustre); knowledge of file systems such as ZFS, EXT, XFS; working knowledge of file system structures and algorithms; and/or experience with Object storage and RESTful storage interfaces. Experience administering cluster storage technologies such as Ceph
HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern image building and provisioning tools
Mentoring: Ability to mentor and lead individual junior team members and students
Education/Experience at HPC Engineer 2: The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 3 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields
Education/Experience at HPC Engineer 3: The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 6 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields

Preferred

Experience running NVIDIA DGX/HGX/NVL72 systems or pods in a production environment
Experience using the NVIDIA Base Command Manager for system administration of NVL72 clusters
Strong understanding of AI/ML workflows and experience setting up and maintaining user-facing AI/ML tools and services (such as JupyterHub)
Experience writing and debugging Kubernetes microservices in Go
Knowledge of Cloud technologies
Experience integrating operational metrics into a monitoring system such as Splunk
Demonstrated effective communication skills, including demonstrated ability to work productively with customers and vendors
High attention to detail, including excellent organizational, analytical, observational, and problem-solving skills. Proven ability to independently multitask and adjust to the workings of a dynamic and fast-paced environment
Experience with Git, creating issues, branches, merge requests and using CI/CD pipelines
Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules)
Practical experience with Splunk or other monitoring tools
Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities
Experience managing both SCIF and SAPF environments and HPC computing resources
Clearance: Active DOE Q or DoD Top Secret clearance with SCI eligibility

Benefits

PPO or High Deductible medical insurance with the same large nationwide network
Dental and vision insurance
Free basic life and disability insurance
Paid childbirth and parental leave
Award-winning 401(k) (6% matching plus 3.5% annually)
Learning opportunities and tuition assistance
Flexible schedules and time off (PTO and holidays)
Onsite gyms and wellness programs
Extensive relocation packages (outside a 50 mile radius)

Company

Los Alamos National Laboratory

Los Alamos National Laboratory, a multidisciplinary research institution engaged in strategic science on behalf of national security, is

Funding

Current Stage
Late Stage
Total Funding
unknown
Key Investors
U.S. Department of Energy
U.S. Department of Homeland Security
2023-08-16 Grant
2023-05-19 Grant
2023-01-17 Grant

Leadership Team

Alex Delaney
R&D Engineer, Detonation Science and Technology
Company data provided by crunchbase