
Oak Ridge National Laboratory · 1 day ago

HPC Linux Storage Engineer

Oak Ridge National Laboratory (ORNL) is seeking highly skilled professionals to support large-scale storage systems and high-speed parallel file systems critical to advancing scientific discovery. The role involves designing, deploying, optimizing, and maintaining infrastructure that powers cutting-edge research across diverse scientific domains.

Advanced Materials, Clean Energy, Energy, Energy Management, Manufacturing, Nuclear, Renewable Energy
No H1B, Security Clearance Required, U.S. Citizen Only

Responsibilities

Design and Management of Infrastructure: Architect, deploy, and manage large-scale storage systems and HPC platforms to support research, scientific, and enterprise workloads. Develop and implement solutions for structured, unstructured, and archival data storage, focusing on scalability, reliability, and performance.
Systems Analysis and Development: Apply systems analysis techniques to consult with users/customers, determine functional requirements, and design, test, or optimize storage and computational solutions tailored to their needs. Develop, document, and modify solutions, including system prototypes and automated workflows, to enhance operational efficiency.
Performance, Optimization, and Troubleshooting: Ensure the performance, availability, scalability, and security of diverse infrastructure environments. Diagnose and resolve complex operational challenges quickly and effectively, applying advanced performance optimization techniques for a wide range of workloads.
Collaboration and Best Practices: Work closely with stakeholders from research, technical, and operational teams to understand workflows, identify opportunities for improvement, and deliver effective solutions. Define, implement, and enforce best practices, standards, and procedures across projects and teams.
Automation and Innovation: Automate system configuration, provisioning, monitoring, and maintenance to reduce manual effort and downtime. Evaluate emerging technologies and tools to continuously improve system capabilities, adapt to changing needs, and plan for future advancements.
Support and Maintenance: Support critical infrastructure through participation in a 24/7 on-call rotation and off-hours maintenance windows. Resolve hardware and software issues in coordination with vendors, ensuring minimal impact on operations.

Qualifications

HPC storage systems, Linux/UNIX systems, Scripting languages, Configuration management tools, High-performance parallel file systems, Performance monitoring tools, Virtualization platforms, Communication skills, Collaboration skills, Problem-solving skills

Required

Bachelor's degree in computer science, engineering, information technology, or a related field, and at least 5 years of professional experience managing Linux/UNIX systems in heterogeneous environments. An equivalent combination of education and experience will be considered.
Demonstrated experience with high-performance computing (HPC) storage systems and enterprise storage platforms (e.g., Lustre, GPFS, BeeGFS, or WEKA)
Proficiency in scripting languages (e.g., Python, Bash, Perl) and configuration management, automation, and version control tools (e.g., Ansible, Puppet, Git)
Strong communication, collaboration, and problem-solving skills with the ability to design and implement solutions independently

Preferred

Active DOE Q, DoD Top Secret, or TS/SCI clearance
Hands-on experience with HPC cluster technologies, including job schedulers (e.g., Slurm) and system deployment tools (e.g., Warewulf, PXE boot, Bright Cluster Manager)
Expertise in high-performance parallel file systems, tape library systems, and storage networking technologies (e.g., RAID, ZFS, NVMe-oF, InfiniBand)
Familiarity with performance monitoring tools (e.g., Grafana, Nagios), benchmarking systems, and I/O optimization techniques
Experience with virtualization and containerization platforms (e.g., VMware, KVM, Podman, Apptainer)
Background in open-source development, including submitting patches upstream and building custom Linux packages (e.g., RPM for RHEL)
Demonstrated ability to troubleshoot and optimize high-performance storage, compute, and networking systems in HPC environments
Experience documenting technical processes and contributing to complex technical projects in government, scientific, or highly technical settings

Benefits

Professional development and leadership opportunities

Company

Oak Ridge National Laboratory

Oak Ridge National Laboratory holds a range of R&D assignments, from fundamental nuclear physics to applied R&D on advanced energy systems.

Funding

Current Stage: Late Stage
Total Funding: $9.8M
Key Investors: US Department of Energy
2023-09-21 · Grant · $4.8M
2023-07-27 · Grant
2022-03-14 · Grant · $5M

Leadership Team

Arjun Shankar
Division Director, National Center for Computational Sciences, Oak Ridge National Laboratory
Brett Ellis
Division Director - Research Computing Support
Company data provided by Crunchbase