NVIDIA · 1 hour ago
Solutions Architect, AI Cloud Partner Performance
NVIDIA is a pioneering company in computer graphics and accelerated computing, now leveraging AI for the next era of computing. They are seeking a Solutions Architect to guide partners in adopting Reference Architectures, ensuring high performance, reliability, and security while creating training materials and sharing knowledge with internal teams.
AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Responsibilities
Work closely with NVIDIA Cloud Partners (NCPs) as a compute and networking performance specialist, ensuring they meet high performance standards and accomplish their business goals
Enable NCPs to achieve Exemplar Cloud status through demonstration of performance capabilities with respect to reference benchmarks
Accelerate NCP onboarding time by resolving deviations from reference performance targets
Improve NVIDIA Cloud Partner cluster manageability and reliability by advising customers on the application of available solutions
Scale knowledge, reach, and opportunities by educating internal teams and communities on NVIDIA Reference Architectures and the Exemplar Cloud program
Communicate feedback from the field to teams creating and maintaining Reference Architectures
Qualifications
Required
Strong foundational expertise, from a BS, MS, or Ph.D. degree in Engineering, Mathematics, Physics, Computer Science, or Data Science (or equivalent experience)
5+ years of proven experience with one or more Cloud Service Providers (AWS, Azure, GCP, or OCI), NCPs (CoreWeave, Lambda Labs, Crusoe, etc.), and cloud-native architectures and software
Experience leading joint debugging and optimization sessions with partners, driving the resolution of distributed training bottlenecks and fabric anomalies
Expertise in performance tuning of RDMA-enabled GPU clusters, including running performance benchmarks and diagnosing performance issues with compute and network tracing tools
Strong coding and outstanding debugging skills. Proven expertise in the following areas: LLM training and inference workloads, Slurm, Kubernetes, MPI, NCCL
Linux-based configuration, management, monitoring, and system administration with proficiency in problem-solving in both bare metal and virtual environments
Understanding of networking fundamentals (e.g., routers, firewalls, load balancers, DNS, VPN) for high-performance infrastructure
Preferred
Ability to perform root cause analysis on distributed training failures using Nsight Systems and NCCL-tests, applying a detailed divide-and-conquer approach to isolate network/fabric issues
Experience running LLM benchmarks and NCCL-tests, and automating RDMA diagnostic tools
Background in deploying and configuring observability tooling, including Grafana, Prometheus, W&B, Nagios, and Zabbix
Ability to take ownership when resolving cluster downtime or degraded performance with customers
Benefits
Equity
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase