UniversalAGI · 2 months ago
Machine Learning Cloud Infrastructure Engineer
UniversalAGI, an AI startup in San Francisco, is building the OpenAI for Physics. The role involves building and owning the entire ML infrastructure stack to power AI for physics at scale, working directly with the CEO and founding team to develop robust systems for training and deploying models in production.
Research
Responsibilities
Build and scale fine-tuning and training infrastructure for foundation models, including distributed training across multi-GPU and multi-node clusters, optimizing for throughput, cost, and iteration speed
Design and implement model serving systems with low latency, high reliability, and the ability to handle complex physics workloads in production
Build fine-tuning pipelines that let customers adapt our foundation models to their specific use cases, data, and workflows without compromising model quality or security
Build deployment and serving infrastructure for on-premise and cloud environments, working through customer security requirements and compliance constraints
Create robust data pipelines that can ingest, validate, and preprocess massive CFD datasets from diverse sources and formats
Instrument everything: Build observability, monitoring, and debugging tools that give our team and customers full visibility into model performance, data quality, and system health
Work directly with customers on deployment, integration, and scaling challenges, turning their infrastructure pain points into product improvements
Move fast and ship: Take infrastructure from prototype to production in weeks, iterating based on real customer needs and research team feedback
Qualifications
Required
3+ years of hands-on experience building and scaling ML infrastructure for fine-tuning, training, serving, or deployment
Deep experience with cloud platforms (AWS, GCP, Azure), infrastructure-as-code, and container tooling (Terraform, Kubernetes, Docker)
Deep expertise in distributed training frameworks (PyTorch Distributed, DeepSpeed, Ray, etc.) and multi-GPU/multi-node orchestration
Strong foundation in ML serving: Experience building low-latency inference systems, model optimization, and production deployment
Expert-level coding skills in Python and infrastructure tools, comfortable diving deep into ML frameworks and optimizing performance
Understanding of ML workflows: Training pipelines, experiment tracking, model versioning, and the full lifecycle from research to production
Strong communicator capable of bridging customers, engineers, and researchers, translating infrastructure constraints into product decisions
Outstanding execution velocity: Ships fast, debugs quickly, and thrives in ambiguity
Exceptional problem-solving ability: Willing to dive deep into unfamiliar systems and figure out what's actually broken
Comfortable in high-intensity startup environments with evolving priorities and tight deadlines
Preferred
Experience with computer-aided engineering (CAE) software
Experience deploying ML in enterprise environments with strict security, compliance, and air-gapped requirements
Built fine-tuning infrastructure for foundation models
Experience with model optimization techniques
Deep understanding of GPU programming and performance optimization (CUDA, Triton, etc.)
Experience with large-scale data engineering for ML, ETL pipelines, and data validation systems
Built MLOps platforms or developer tools for ML teams
Experience at high-growth AI startups (Seed to Series C) or leading AI labs
Forward-deployed experience working directly with customers on complex integrations
Open-source contributions to ML infrastructure or training frameworks
Benefits
Competitive compensation and equity.
Competitive health, dental, and vision benefits paid by the company.
401(k) plan offering.
Flexible vacation.
Team Building & Fun Activities.
Great scope, ownership and impact.
AI tools stipend.
Monthly commute stipend.
Monthly wellness / fitness stipend.
Daily office lunch & dinner covered by the company.
Immigration support.
Company
UniversalAGI
UniversalAGI is automating physical systems engineering across the entire product lifecycle with artificial intelligence.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase