Site Reliability Engineer - US Government jobs in United States

xAI · 8 hours ago

Site Reliability Engineer - US Government

xAI is a company focused on creating AI systems that understand the universe and aid humanity. They are seeking a Senior Infrastructure Engineer to design, build, and operate secure, scalable infrastructure for critical government projects, ensuring compliance with federal regulations.

Artificial Intelligence (AI), Generative AI, Information Technology, Machine Learning
Growth Opportunities
No H1B, Security Clearance Required, U.S. Citizen Only

Responsibilities

Develop and optimize software to provision and manage xAI’s infrastructure across on-premises, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives
Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings
Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols
Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling
Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs, while maintaining security and compliance

Qualifications

Kubernetes, Infrastructure-as-Code, GPU hardware management, Incident management, Security compliance, Communication, Problem-solving, Ownership

Required

Active Top Secret (TS) security clearance
5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role, with a focus on building and maintaining reliable, scalable systems, preferably in secure or government environments
Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible
Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components
Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs
Excellent communication and documentation skills, with the ability to handle sensitive information concisely and accurately

Preferred

Deep familiarity with installing and using GPU hardware, including setting up drivers, debugging issues, and ensuring reliability
Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments in classified or federal settings
Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience in government projects
Proficiency with tools such as Kyverno or ArgoCD, or with Go programming, for infrastructure automation
Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges in secure environments
Passion for problem-solving and a proactive drive to deliver impactful results while adhering to security protocols
Certifications in security-related fields (e.g., CISSP) or experience in secure federal environments

Benefits

Equity
Comprehensive medical, vision, and dental coverage
Access to a 401(k) retirement plan
Short & long-term disability insurance
Life insurance
Various other discounts and perks

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

Funding

Current Stage: Growth Stage
Total Funding: $22.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B
2025-07-01 · Debt Financing · $5B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by Crunchbase