CoreWeave
Staff Engineer - Perf and Benchmarking
CoreWeave is The Essential Cloud for AI™, delivering a platform of technology and expertise for AI innovators. The Staff Engineer will lead the Benchmarking & Performance team, responsible for managing performance data across global infrastructure and driving performance benchmarking initiatives.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers
Build, lead, and mentor a high-performing team of performance engineers and data analysts
Establish governance for published performance claims: documented methodologies, versioning, reproducibility, and audit trails
Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication
Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed
Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines
Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types (an illustrative sketch of this aggregation appears after this list)
Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations
Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses
Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures)
Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements
Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data
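For illustration only, the sketch below shows one way the per-request latency and throughput metrics named above (p50/p95/p99 latency, jitter, time-to-first-token, tokens/s, cost-per-token) might be aggregated. It is a minimal Python sketch under assumed inputs; the RequestSample fields, GPU price, and wall-clock window are hypothetical and do not describe CoreWeave's actual benchmarking service.

```python
"""Minimal, hypothetical sketch of per-request benchmark aggregation.

All field names, prices, and sample data below are illustrative assumptions,
not a description of CoreWeave's benchmarking pipeline.
"""

from dataclasses import dataclass
from statistics import mean, pstdev, quantiles


@dataclass
class RequestSample:
    """One completed inference request (fields assumed for illustration)."""
    ttft_s: float           # time to first token, seconds
    total_latency_s: float  # end-to-end request latency, seconds
    output_tokens: int      # tokens generated by the request


def percentile(values: list[float], pct: int) -> float:
    """pct-th percentile via 100 inclusive cut points (valid pct: 1..99)."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def summarize(samples: list[RequestSample],
              gpu_hour_usd: float,
              wall_clock_s: float) -> dict[str, float]:
    """Aggregate latency percentiles, jitter, throughput, and a rough cost figure."""
    latencies = [s.total_latency_s for s in samples]
    tokens = sum(s.output_tokens for s in samples)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "p99_latency_s": percentile(latencies, 99),
        "latency_jitter_s": pstdev(latencies),          # jitter proxy: std dev
        "mean_ttft_s": mean(s.ttft_s for s in samples),
        "tokens_per_s": tokens / wall_clock_s,
        # Assumed single-GPU run: prorate the hourly GPU price over the window.
        "cost_per_1k_tokens_usd": gpu_hour_usd * (wall_clock_s / 3600) / tokens * 1000,
    }


if __name__ == "__main__":
    # Fabricated samples purely to exercise the aggregation logic.
    samples = [
        RequestSample(ttft_s=0.05 + 0.001 * i,
                      total_latency_s=0.80 + 0.01 * i,
                      output_tokens=256)
        for i in range(100)
    ]
    print(summarize(samples, gpu_hour_usd=4.25, wall_clock_s=60.0))
```

In practice, aggregation of this kind would run inside the Kubernetes-native benchmarking service described above, with results exported to the observability stack (Prometheus, Grafana, OpenTelemetry) and results warehouses rather than printed.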
Qualifications
Required
10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads
Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines)
Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM)
Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments
Excellent communicator able to interface with executives, customers, auditors, and OSS communities
Preferred
Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development
Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale
Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects
Experience benchmarking multi-region fleets and large clusters (thousands of GPUs)
Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $24.87B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
2025-08-20 · Post-IPO Secondary
Company data provided by crunchbase