Software Engineer, Supercomputing jobs in United States

Thinking Machines Lab · 2 months ago

Software Engineer, Supercomputing

Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. The role involves designing, building, and operating the GPU supercomputing environment to deliver high-performance, reliable, and cost-efficient compute for large-scale training and inference.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Product Research · Software
H1B Sponsored

Responsibilities

Operate and automate large GPU clusters including provisioning, imaging, and capacity planning
Write software that abstracts cluster management and presents a unified interface for training and inference
Extend scheduling/orchestration (Kubernetes, Slurm, or similar) for topology‑aware placement, preemption, quotas, and fair‑share multi‑tenancy
Monitor and improve operational metrics of speed, reliability, and error recovery
Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage
Partner with researchers to unblock scale runs and advise on parallelism and performance trade‑offs

Qualifications

GPU supercomputing · Kubernetes · Python · Rust · Linux · Deep learning frameworks · Collaborative environment · Initiative

Required

Bachelor's degree or equivalent experience in computer science, engineering, or similar
Proficiency in at least one backend language (we use Python or Rust)
Experience operating large‑scale clusters and container orchestration systems (e.g., Kubernetes or Slurm)
Comfort operating across the stack and owning projects end-to-end
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts
A bias for action: taking the initiative to work across stacks and teams wherever you spot an opportunity to make sure something ships

Preferred

Strong systems background: Linux, networking, and infrastructure‑as‑code
Familiarity with CUDA/NCCL and performance profiling for distributed training/inference
Prior work supporting large‑scale model training or inference environments
Understanding of deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and their underlying system architectures
Track record of working in fast-paced environments while balancing care with urgency

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support as needed

Company

Thinking Machines Lab

Thinking Machines Lab is an AI research and product company that aims to increase understanding and customization of AI systems.

H1B Sponsorship

Thinking Machines Lab has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships: 2025 (9)

Funding

Current Stage: Early Stage
Total Funding: $2.01B
Key Investors: Andreessen Horowitz; Ministry of Economy, Culture and Innovation
2025-06-20 · Seed · $2B
2025-05-05 · Grant · $9.95M

Leadership Team

Mira Murati · Co-Founder and Chief Executive Officer
Jonathan Lachman · Founding Head of Operations
Company data provided by Crunchbase