Thinking Machines Lab · 1 month ago

Research Engineer, Infrastructure, Training Systems

Thinking Machines Lab is focused on advancing collaborative general intelligence and making AI accessible for unique needs. The role of Research Engineer, Infrastructure, Training Systems involves designing and building scalable systems for efficient training of large models, ensuring reliable experimentation and training to support research teams.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Product Research · Software
H1B Sponsored

Responsibilities

Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for large-scale training workloads
Develop high-performance optimizations to maximize throughput and efficiency
Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures
Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration
Collaborate with researchers and engineers to build scalable infrastructure
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure

Qualifications

Distributed training systems · Deep learning frameworks · Performance optimization · Open-source contributions · Collaborative environment · Bachelor’s degree · Engineering skills · Debugging complex codebases · Initiative mindset

Required

Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar
Strong engineering skills, including the ability to contribute performant, maintainable code and to debug complex codebases
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts
A bias for action, with the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships

Preferred

Past experience working on distributed training for the world's largest models to make them stable, reliable, and performant
Track record of improving research productivity through infrastructure design or process improvements
Contributions to open-source ML infrastructure such as PyTorch, XLA, Megatron-LM, or DeepSpeed

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support as needed

Company

Thinking Machines Lab

Thinking Machines Lab is an AI research and product company that aims to increase understanding and customization of AI systems.

H1B Sponsorship

Thinking Machines Lab has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships: 2025 (9)

Funding

Current Stage: Early Stage
Total Funding: $2.01B
Key Investors: Andreessen Horowitz; Ministry of Economy, Culture and Innovation
2025-06-20 · Seed · $2B
2025-05-05 · Grant · $9.98M

Leadership Team

Mira Murati
Co-Founder and Chief Executive Officer

Company data provided by Crunchbase