Andiamo · 3 days ago
Network Reliability Engineer - Decentralized High-Performance Computing Leader
Andiamo is a globally recognized staffing and consulting firm specializing in technology and go-to-market professionals. They are seeking a Senior Network Reliability Engineer to architect and optimize high-performance network fabrics for AI and HPC workloads, ensuring seamless and efficient operation of advanced compute infrastructure.
Consulting · Human Resources · Information Technology · Staffing Agency
Responsibilities
Engineer next-generation network performance: Fine-tune TCP/IP, RDMA (RoCE), kernel-bypass technologies (DPDK, XDP, eBPF), and NIC offloads to push latency and throughput to their physical limits for high-performance computing workloads
Deploy and scale at massive capacity: Roll out and optimize large-scale network fabrics across datacenters using top-tier hardware (Arista, NVIDIA/Mellanox, Juniper, and more). Configure advanced BGP/EVPN topologies, spine-leaf architectures, and congestion management for lossless transport
Automate network intelligence: Build telemetry pipelines and automated systems for real-time performance monitoring, packet-loss detection, and predictive congestion analysis across complex environments
Debug at the deepest levels: Lead investigations into packet loss, latency anomalies, and congestion hot spots — diving into kernel traces, switch firmware, and flow control mechanisms to pinpoint and resolve issues
Collaborate with the industry’s best: Work directly with hardware and silicon vendors to debug firmware, optimize RDMA and RoCE paths, validate optics, and integrate emerging technologies like 800G+ links and CPO/LPO networking
Design for resilience and reliability: Simulate large-scale network failures, run game-day exercises, and turn lessons learned into robust automation, playbooks, and SLOs that drive measurable reliability improvements
Qualifications
Required
7+ years of experience in network engineering, SRE, or performance infrastructure roles — ideally within AI, HPC, or large-scale cloud environments
Deep understanding of the Linux networking stack, including kernel-level debugging, TCP/IP, InfiniBand, and RoCE
Proven hands-on experience managing multi-layer datacenter networks, network overlays (VXLAN, Geneve), and multi-vendor environments (Arista, NVIDIA/Mellanox, Juniper, etc.)
Strong programming proficiency in Python, Go, or Rust, and experience with Infrastructure-as-Code and modern CI/CD practices
Practical knowledge of DPDK, XDP, eBPF, and hardware acceleration frameworks used in low-latency networking
Demonstrated success in building and scaling high-performance, low-latency network architectures for data-intensive systems or compute clusters
Company
Andiamo
The Talent Partners for the AI Revolution.
H1B Sponsorship
Andiamo has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is presented below
for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship, highlighting job fields similar to this job]
Trends of Total Sponsorships
2022 (2)
2021 (1)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase