Network Reliability Engineer - Decentralized High-Performance Computing Leader jobs in United States

Andiamo · 3 days ago

Network Reliability Engineer - Decentralized High-Performance Computing Leader

Andiamo is a globally recognized staffing and consulting firm specializing in technology and go-to-market professionals. They are seeking a Senior Network Reliability Engineer to architect and optimize high-performance network fabrics for AI and HPC workloads, ensuring seamless and efficient operation of advanced compute infrastructure.

Consulting · Human Resources · Information Technology · Staffing Agency
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Engineer next-generation network performance: Fine-tune TCP/IP, RDMA (RoCE), kernel-bypass technologies (DPDK, XDP, eBPF), and NIC offloads to push latency and throughput to their physical limits for high-performance computing workloads
Deploy and scale at massive capacity: Roll out and optimize large-scale network fabrics across datacenters using top-tier hardware (Arista, NVIDIA/Mellanox, Juniper, and more). Configure advanced BGP/EVPN topologies, spine-leaf architectures, and congestion management for lossless transport
Automate network intelligence: Build telemetry pipelines and automated systems for real-time performance monitoring, packet-loss detection, and predictive congestion analysis across complex environments
Debug at the deepest levels: Lead investigations into packet loss, latency anomalies, and congestion hot spots — diving into kernel traces, switch firmware, and flow control mechanisms to pinpoint and resolve issues
Collaborate with the industry’s best: Work directly with hardware and silicon vendors to debug firmware, optimize RDMA and RoCE paths, validate optics, and integrate emerging technologies like 800G+ links and CPO/LPO networking
Design for resilience and reliability: Simulate large-scale network failures, run game-day exercises, and turn lessons learned into robust automation, playbooks, and SLOs that drive measurable reliability improvements

Qualifications

Network engineering · Linux networking stack · Low-latency networking · Python programming · Infrastructure-as-Code · DPDK · XDP · Automation · Collaboration · Problem-solving

Required

7+ years of experience in network engineering, SRE, or performance infrastructure roles — ideally within AI, HPC, or large-scale cloud environments
Deep understanding of the Linux networking stack, including kernel-level debugging, TCP/IP, InfiniBand, and RoCE
Proven hands-on experience managing multi-layer datacenter networks, network overlays (VXLAN, Geneve), and multi-vendor environments (Arista, NVIDIA/Mellanox, Juniper, etc.)
Strong programming proficiency in Python, Go, or Rust, and experience with Infrastructure-as-Code and modern CI/CD practices
Practical knowledge of DPDK, XDP, eBPF, and hardware acceleration frameworks used in low-latency networking
Demonstrated success in building and scaling high-performance, low-latency network architectures for data-intensive systems or compute clusters

Company

The Talent Partners for the AI Revolution.

H1B Sponsorship

Andiamo has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships: 2022 (2), 2021 (1)

Funding

Current Stage: Growth Stage

Leadership Team

Patrick McAdams, CEO & Co-Founder
Steven Kottler, CFO
Company data provided by Crunchbase