xAI · 23 hours ago

Member of Technical Staff

Palo Alto, CA

Full-time

Onsite

Senior Level

5+ years of experience

xAI is focused on creating AI systems that enhance human understanding of the universe. The Member of Technical Staff will manage and enhance reliability in a multi-data center environment, automating processes and implementing observability solutions to ensure seamless operations for mission-critical AI infrastructure.

Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Machine Learning

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning

Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers; the team is open to innovative stacks beyond traditional ones like ELK

Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration)

Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs

Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation

Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation

Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios

Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies

Qualifications

Site Reliability Engineering, Python, Linux Systems Administration, Containerization, Orchestration, Networking Fundamentals, Troubleshooting Complex Issues, Observability Solutions, Collaboration Skills, Mentoring, Documentation Skills

Required

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience)

5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments

Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential

Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments

Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems)

Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards

Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors

Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments

Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs

Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams)

Preferred

7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups

Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high availability

Proficiency in Rust for systems programming and performance-critical components

Direct experience integrating software reliability tools with physical data center infrastructure (e.g., power, cooling, environmental monitoring, facility controls) and automating responses to physical events

Exposure to advanced or innovative observability stacks beyond traditional tools (e.g., exploring cutting-edge alternatives for metrics, logs, and tracing)

Experience building automated remediation, fault tolerance, disaster recovery, capacity planning, or predictive failure detection systems

Background in optimizing Linux-based systems for AI workloads, GPU clusters, or high-throughput compute environments

Demonstrated success reducing downtime, MTTR, or improving resource efficiency (e.g., through automation or observability) in high-stakes production settings

Prior work with bare-metal provisioning, data center interconnects, or hybrid/multi-site failover mechanisms

Mentoring experience, strong documentation skills, and a track record of fostering knowledge sharing and automation culture

Comfort with rapid technology adaptation in fast-evolving domains like AI infrastructure

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

Founded in 2023

Palo Alto, California, USA

1001-5000 employees

https://x.ai

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not shown)

Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage

Late Stage

Total Funding

$42.73B

Key Investors

Valor Equity Partners, Neptune Digital Assets, SpaceX

2026-02-02 · Acquired

2026-01-06 · Series E · $20B

2025-12-11 · Secondary Market · $0.3M

Leadership Team

Greg Yang

Co-Founder

Yuhuai Wu

Co-Founder

Recent News

Axios

AI arms race approaches IPO reckoning

2026-02-06

MIT Technology Review

The Download: squeezing more metal out of aging mines, and AI’s truth crisis

2026-02-06

Daily Sabah

Musk's SpaceX acquires his xAI ahead of expected IPO this year

2026-02-06

Company data provided by Crunchbase