Runpod · 16 hours ago
Manager, Datacenter Network Engineering
Runpod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full‑stack AI applications. They are seeking an Engineering Manager, Datacenter Network Engineering to lead a team responsible for designing, deploying, and operating the global datacenter and backbone network. The role involves managing engineers, setting technical direction, and ensuring operational excellence in network architecture.
AI Infrastructure · Artificial Intelligence (AI) · Cloud Infrastructure · GPU
Responsibilities
Lead the Datacenter Networking Team: Manage and grow a team of network engineers responsible for datacenter fabrics, interconnects, and global WAN connectivity. Provide mentorship, technical guidance, and clear ownership boundaries
Own Datacenter Network Architecture: Define and evolve network designs for GPU-heavy clusters, including spine-leaf topologies, ECMP routing, and high-bandwidth east-west traffic patterns
High-Performance GPU Networking: Oversee design and operation of InfiniBand and RoCE-based fabrics supporting distributed training and inference workloads. Ensure performance, loss characteristics, and congestion control meet AI workload requirements
Encapsulation & Overlay Protocols: Guide implementation and operations of encapsulation and overlay technologies such as VXLAN, EVPN, Geneve, or similar, enabling scalable multi-tenant isolation and flexible network provisioning
Global WAN & Backbone Connectivity: Lead strategy and execution for global WAN connectivity, including private backbone links, IX connectivity, and hybrid connectivity with cloud providers and partners
Reliability & Operations: Establish operational best practices for monitoring, capacity planning, change management, incident response, and post-mortems across the network stack
Cross-Functional Collaboration: Partner closely with Infrastructure, SRE, Hardware, and Product Engineering teams to ensure network capabilities align with platform and customer requirements
Vendor & Partner Management: Work with hardware vendors, colocation providers, and transit partners on network design, procurement, deployment timelines, and escalations
Security & Segmentation: Ensure network designs support secure isolation, DDoS resilience, and compliance requirements without compromising performance
Qualifications
Required
3+ years managing network or infrastructure engineering teams, with experience scaling teams and systems in production environments
8+ years designing and operating large-scale datacenter networks, including spine-leaf architectures, BGP-based routing, and high-throughput fabrics
Strong hands-on experience with VXLAN/EVPN or equivalent encapsulation protocols, including control-plane and data-plane considerations
Proven experience with InfiniBand and/or RoCE, including congestion management, lossless Ethernet concepts, and performance tuning for GPU workloads
Deep familiarity with global WAN technologies, including private backbone design, inter-region connectivity, routing policy, and traffic engineering
Comfortable working with Linux-based systems, network operating systems, and automation tooling
Strong background in network observability, incident management, capacity forecasting, and change control
Clear written and verbal communication skills, with the ability to align stakeholders and lead teams through complex technical challenges
Successful completion of a background check
Preferred
Experience operating networks for GPU clusters, HPC environments, or AI/ML platforms
Familiarity with RDMA tuning, NCCL traffic patterns, and distributed training communication models
Experience with automation frameworks and network-as-code (e.g., Terraform, Ansible, internal tooling)
Background in multi-region or multi-cloud networking architectures
Experience working in high-growth or hyperscale infrastructure environments
Benefits
Meaningful equity in a fast-growing company: everyone on the team receives stock options. Your impact drives our growth, and you share in the upside.
Generous medical, dental & vision plans: we cover 100% for all employees and a portion for dependents.
Flexible PTO: take the time you need to recharge.
Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.
Company
Runpod
Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications.
Funding
Current Stage: Growth Stage
Total Funding: $22M
2024-05-08 · Seed · $20M
2023-03-30 · Pre Seed · $2M
Recent News
PR Newswire · 2026-01-20
Crunchbase News · 2025-12-26
Company data provided by Crunchbase