CoreWeave
Staff Engineer - Perf and Benchmarking
CoreWeave is The Essential Cloud for AI™, delivering a platform of technology and expertise for AI innovators. The Staff Engineer will lead the Benchmarking & Performance team, responsible for managing performance data across global infrastructure and driving performance benchmarking initiatives.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Information Technology · Machine Learning
Responsibilities
Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers
Build, lead, and mentor a high-performing team of performance engineers and data analysts
Establish governance for published performance claims: documented methodologies, versioning, reproducibility, and audit trails
Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication
Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed
Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines
Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types (an illustrative sketch of this aggregation appears after this list)
Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations
Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses
Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures)
Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements
Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data
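For illustration only, the sketch below shows one way the per-request latency and throughput metrics named above (p50/p95/p99 latency, jitter, time-to-first-token, tokens/s, cost-per-token) might be aggregated. It is a minimal Python sketch under assumed inputs; the RequestSample fields, GPU price, and wall-clock window are hypothetical and do not describe CoreWeave's actual benchmarking service.

```python
"""Minimal, hypothetical sketch of per-request benchmark aggregation.

All field names, prices, and sample data below are illustrative assumptions,
not a description of CoreWeave's benchmarking pipeline.
"""

from dataclasses import dataclass
from statistics import mean, pstdev, quantiles


@dataclass
class RequestSample:
    """One completed inference request (fields assumed for illustration)."""
    ttft_s: float           # time to first token, seconds
    total_latency_s: float  # end-to-end request latency, seconds
    output_tokens: int      # tokens generated by the request


def percentile(values: list[float], pct: int) -> float:
    """pct-th percentile via 100 inclusive cut points (valid pct: 1..99)."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def summarize(samples: list[RequestSample],
              gpu_hour_usd: float,
              wall_clock_s: float) -> dict[str, float]:
    """Aggregate latency percentiles, jitter, throughput, and a rough cost figure."""
    latencies = [s.total_latency_s for s in samples]
    tokens = sum(s.output_tokens for s in samples)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "p99_latency_s": percentile(latencies, 99),
        "latency_jitter_s": pstdev(latencies),          # jitter proxy: std dev
        "mean_ttft_s": mean(s.ttft_s for s in samples),
        "tokens_per_s": tokens / wall_clock_s,
        # Assumed single-GPU run: prorate the hourly GPU price over the window.
        "cost_per_1k_tokens_usd": gpu_hour_usd * (wall_clock_s / 3600) / tokens * 1000,
    }


if __name__ == "__main__":
    # Fabricated samples purely to exercise the aggregation logic.
    samples = [
        RequestSample(ttft_s=0.05 + 0.001 * i,
                      total_latency_s=0.80 + 0.01 * i,
                      output_tokens=256)
        for i in range(100)
    ]
    print(summarize(samples, gpu_hour_usd=4.25, wall_clock_s=60.0))
```

In practice, aggregation of this kind would run inside the Kubernetes-native benchmarking service described above, with results exported to the observability stack (Prometheus, Grafana, OpenTelemetry) and results warehouses rather than printed.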
Qualifications
Required
10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads
Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines)
Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM)
Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments
Excellent communicator able to interface with executives, customers, auditors, and OSS communities
Preferred
Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development
Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale
Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects
Experience benchmarking multi-region fleets and large clusters (thousands of GPUs)
Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology
Benefits
Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption
Company
CoreWeave
CoreWeave is a cloud-based AI infrastructure company offering GPU cloud services to simplify AI and machine learning workloads.
Funding
Current Stage: Public Company
Total Funding: $24.87B
Key Investors: Jane Street Capital, Stack Capital, Coatue
2025-12-08 · Post-IPO Debt · $2.54B
2025-11-12 · Post-IPO Debt · $2.5B
2025-08-20 · Post-IPO Secondary
Company data provided by crunchbase