AMD · 1 week ago
Principal Networking QoS Development Engineer
AMD is a company focused on building innovative products that enhance computing experiences across various domains including AI and data centers. The Principal Networking QoS Development Engineer will lead the development and implementation of quality of service strategies across data center SmartNICs and DPUs, ensuring optimal performance for various workloads.
Responsibilities
Own QoS architecture across network tiers (host → NIC/DPU), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane)
Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP
Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell)
RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles
Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness
Instrumentation & observability: define SLI/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts
Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions
Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, Netconf/YANG), including pre‑deployment simulation and automated canary/rollback
Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements
Qualifications
Required
Strong experience in data center networking or systems engineering, with direct ownership of QoS on switches and/or SmartNICs/DPUs
Deep knowledge of QoS mechanisms: classification/marking (DSCP/PCP), policing, shaping, queueing (PRIO, WRR/WFQ/DRR), scheduling hierarchies, and buffer management
Hands‑on with PFC, ETS, ECN/WRED, explicit buffer tuning, and RDMA/RoCE performance/correctness in production
Experience configuring merchant switch silicon (e.g., Broadcom Trident/Tomahawk, NVIDIA Spectrum, Marvell Teralynx) via NOS CLIs/SDKs (e.g., SONiC, Cumulus, NX‑OS, EOS, Onyx)
SmartNIC/DPU experience (e.g., NVIDIA BlueField, Intel IPU, AMD Pensando, Netronome/Agilio): queue configuration, rate limiting, hardware offloads, and host‑NIC QoS mapping
Proficiency with Linux networking (TC, qdisc, mqprio, XDP/eBPF), ethtool, RDMA tools (perftest, rdma-core utilities), and packet/flow analysis (tcpdump, Wireshark, INT/sFlow)
Strong automation skills: Python and/or Go for network automation, telemetry pipelines, and CI/CD integration; Git‑based workflows
Demonstrated ability to debug low‑level performance issues (NIC queues, IRQ affinity, NUMA, PCIe/xGMI topology, driver/firmware interactions)
Excellent written/verbal communication; strong design documentation and cross‑team leadership
Preferred
Large‑scale operations experience (10K+ servers or multi‑region fabrics) with QoS at fleet scale and multi‑tenant isolation
Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe‑oF, NFS, SMB, object) QoS trade‑offs
Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT, gNMI, in‑band telemetry, and P4 concepts
Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/Netconf, gNOI
Vendor engagement/bring‑up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps
Security awareness for multi‑tenant environments (DSCP abuse, QoS starvation, control‑plane protection, CoPP/ACL integration)
Benefits
AMD benefits at a glance.
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase