Los Alamos National Laboratory · 16 hours ago

HPC/AI Linux Administrator (HPC Engineer 2/3)

Los Alamos, NM

Full-time

Hybrid

Mid, Senior Level

$104K/yr - $172K/yr

3+ years exp

Los Alamos National Laboratory is a multidisciplinary research institution engaged in strategic science on behalf of national security. They are seeking an HPC/AI Linux Administrator to operate and maintain high-performance computing systems, focusing on AI/ML infrastructure, while providing strategic design and mentoring to junior staff.

Security

Growth Opportunities

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Join the High Performance Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world

Designing, operating and maintaining these systems requires highly skilled personnel that specialize in both the hardware and software aspects of High Performance Computing

Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment

This team is currently building on-premise cloud-like infrastructure to support the AI/ML/LLM needs of the laboratory

The Platforms Team is seeking to add highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL

This person will be an expert Linux Administrator who will help design, build and run our production Nvidia DGX/HGX pods optimized for our environment and workflow

They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools

The successful candidate will participate in periodic on-call responsibilities managing HPC clusters and AI infrastructure, while actively growing their technical skills and staying up to date with the latest technologies in the field

In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports, to communicate findings internally and at conferences

The selected HPC/AI Linux Administrator (HPC Engineer 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads

Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines

Qualification

Linux Administration, Configuration Management, Kubernetes, HPC Experience, Computer Networking, Scripting (Bash), Scripting (Python), AI/ML Workflows, Troubleshooting, Git, Cloud Technologies, Communication Skills, Mentoring

Required

Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Python, or similar languages

Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools

Troubleshooting and Technical Analysis: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems

Computer Networking Expertise: Working knowledge of networking concepts and practices

Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations)

Troubleshooting skills: Demonstrated ability to troubleshoot hardware and software errors, prioritizing problems and assessing impact to stakeholders, documenting problems and solutions

Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters

Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment

Computer Networking Expertise: High performance interconnects, preferably NVLink and InfiniBand networks

Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management

Filesystems: Knowledge of or demonstrated experience with parallel and distributed storage systems (e.g. Lustre); knowledge of file systems such as ZFS, EXT, XFS; working knowledge of file system structures and algorithms; and/or experience with Object storage and RESTful storage interfaces. Experience administering cluster storage technologies such as Ceph

HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern image building and provisioning tools

Mentoring: Ability to mentor and lead individual junior team members and students

Education/Experience at HPC Engineer 2: The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 3 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields

Education/Experience at HPC Engineer 3: The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 6 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields

Preferred

Experience running Nvidia DGX/HGX/NVL72 systems or pods in a production environment

Experience using the Nvidia Base Command Manager for system administration of NVL72 clusters

Strong understanding of AI/ML workflows and experience setting up and maintaining user-facing AI/ML tools and services (such as JupyterHub)

Experience writing and debugging Kubernetes microservices in Go

Knowledge of Cloud technologies

Experience integrating operational metrics into a monitoring system such as Splunk

Demonstrated effective communication skills, including demonstrated ability to work productively with customers and vendors

High attention to detail, including excellent organizational, analytical, observational, and problem-solving skills. Proven ability to multi-task independently and adjust to a dynamic, fast-paced environment

Experience with Git, creating issues, branches, merge requests and using CI/CD pipelines

Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules)

Practical experience with Splunk or other monitoring tools

Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities

Experience managing both SCIF and SAPF environments and HPC computing resources

Clearance: Active DOE Q or DoD Top Secret clearance with SCI eligibility

Benefits

PPO or High Deductible medical insurance with the same large nationwide network

Dental and vision insurance

Free basic life and disability insurance

Paid childbirth and parental leave

Award-winning 401(k) (6% matching plus 3.5% annually)

Learning opportunities and tuition assistance

Flexible schedules and time off (PTO and holidays)

Onsite gyms and wellness programs

Extensive relocation packages (outside a 50 mile radius)

Company

Los Alamos National Laboratory

Glassdoor 4.0

Los Alamos National Laboratory is a multidisciplinary research institution engaged in strategic science on behalf of national security.

Founded in 1943

Los Alamos, New Mexico, USA

10001+ employees

http://www.lanl.gov

Funding

Current Stage

Late Stage

Total Funding

unknown

Key Investors

U.S. Department of Energy, U.S. Department of Homeland Security

2023-08-16 Grant

2023-05-19 Grant

2023-01-17 Grant

Leadership Team

Alex Delaney

R&D Engineer, Detonation Science and Technology

Recent News

EIN Presswire

U.S. Department of Energy to Provide HALEU for Hermes Demonstration Reactor

2026-01-21

Tech Xplore

New technique could facilitate faster nuclear forensics

2026-01-13

Medical Xpress

Study finds higher hantavirus risk in drier, underdeveloped areas

2026-01-11

Company data provided by Crunchbase