CoreWeave · 11 hours ago

Staff Engineer - Perf and Benchmarking

CoreWeave is The Essential Cloud for AI™, delivering a platform of technology and expertise for AI innovators. The Staff Engineer will lead the Benchmarking & Performance team, responsible for managing performance data across global infrastructure and driving performance benchmarking initiatives.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
No H1B · U.S. Citizen Only

Responsibilities

Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers
Build, lead, and mentor a high-performing team of performance engineers and data analysts
Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails
Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication
Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed
Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines
Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types (a sketch of these metrics follows this list)
Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations
Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses
Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures)
Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements
Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data
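For concreteness, the latency and throughput metrics named in the reporting bullet above could be summarized roughly as in the Python sketch below. This is illustrative only, not CoreWeave tooling; RequestSample, summarize, and the field names are hypothetical.

# Illustrative sketch: aggregating per-request benchmark samples into the
# headline metrics (p50/p95/p99 latency, time-to-first-token, tokens/s).
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    latency_s: float                # end-to-end request latency, seconds
    time_to_first_token_s: float    # TTFT for streaming requests, seconds
    output_tokens: int              # tokens generated for this request


def summarize(samples: list[RequestSample], wall_clock_s: float) -> dict[str, float]:
    """Aggregate benchmark samples into summary metrics."""
    latencies = sorted(s.latency_s for s in samples)
    # statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
    pct = quantiles(latencies, n=100)
    total_tokens = sum(s.output_tokens for s in samples)
    return {
        "p50_latency_s": pct[49],
        "p95_latency_s": pct[94],
        "p99_latency_s": pct[98],
        "mean_ttft_s": sum(s.time_to_first_token_s for s in samples) / len(samples),
        # Throughput is computed against wall-clock run time, since requests
        # overlap under concurrent load.
        "tokens_per_s": total_tokens / wall_clock_s,
    }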

Qualifications

Distributed systems · Large-scale ML training · GPU performance · Kubernetes · MLPerf submissions · Data systems architecture · Performance benchmarking · Communicator · Team leadership · Cross-functional collaboration

Required

10+ years building distributed systems or HPC/cloud services, with deep expertise in large-scale ML training or similar high-performance workloads
Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines)
Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM)
Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments
Excellent communicator able to interface with executives, customers, auditors, and OSS communities

Preferred

Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development
Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale
Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects
Experience benchmarking multi-region fleets and large clusters (thousands of GPUs)
Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology

Benefits

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short- and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

Company

CoreWeave

CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.

Funding

Current Stage: Public Company
Total Funding: $24.87B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
2025-08-20 · Post-IPO Secondary

Leadership Team

Michael Intrator
Chief Executive Officer
Brannin McBee
Founder & CDO
Company data provided by Crunchbase