Basis Research Institute · 1 month ago
ML Systems Engineer, Infrastructure & Cloud
Basis Research Institute is a nonprofit applied AI research organization focused on understanding and building intelligence while addressing complex societal problems. The ML Systems Engineer will manage and optimize the training and evaluation infrastructure, ensuring it is fast, reliable, and scalable to support researchers in their work with complex models.
Artificial Intelligence (AI) · Charity · Database · Information Technology · Non Profit
Responsibilities
Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale
Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions
Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, and communication overhead, and by implementing solutions that improve step time
Manage cloud infrastructure and costs including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage
Implement security and compliance measures, including access controls, data encryption, and audit logging, and ensure infrastructure meets requirements for handling sensitive data
Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets
Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues
Maintain development environments including containerization, dependency management, and tools that ensure researchers can reproduce results across different systems
Document and share knowledge through runbooks, post-mortems, and training materials that help the team understand and operate ML infrastructure effectively
Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals
Qualifications
Required
Have demonstrated expertise in ML systems engineering. Examples include managing distributed training jobs across hundreds of GPUs, debugging and fixing numerical instabilities in large-scale training, building infrastructure for reproducible ML experiments, and optimizing training throughput and resource utilization
Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems
Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements
Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines
Be skilled at debugging complex failures across the stack: GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems
Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge
Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise
Preferred
Experience at organizations training large models (OpenAI, Anthropic, Google, Meta)
Background in both ML research and production systems
Contributions to ML frameworks or distributed training libraries
Experience with on-premise GPU cluster management
Knowledge of optimization theory and numerical methods
Understanding of robotics-specific infrastructure requirements
Company
Basis Research Institute
Basis is a nonprofit applied research organization with two mutually reinforcing goals. The first is to understand and build intelligence.
H1B Sponsorship
Basis Research Institute has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2024 (2)
2022 (5)
2020 (1)
Funding
Current Stage
Early Stage
Company data provided by Crunchbase