Principal Networking QoS Development Engineer jobs in United States

AMD · 1 week ago

Principal Networking QoS Development Engineer

AMD is a company focused on building innovative products that enhance computing experiences across various domains including AI and data centers. The Principal Networking QoS Development Engineer will lead the development and implementation of quality of service strategies across data center SmartNICs and DPUs, ensuring optimal performance for various workloads.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Computer · Embedded Systems · GPU · Hardware · Semiconductor
Growth Opportunities
No H1B
Hiring Manager
Brenda Wilson, CDR

Responsibilities

Own QoS architecture across network tiers (host → NIC/DPU), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane)
Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP
Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell)
RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles
Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness
Instrumentation & observability: define SLIs/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts
Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions
Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, NETCONF/YANG), including pre‑deployment simulation and automated canary/rollback
Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements

Qualifications

QoS strategy · L2/L3/L4 QoS · SmartNIC/DPU experience · Linux networking · Network automation · Traffic engineering · Debugging performance issues · Communication skills

Required

Strong experience in data center networking or systems engineering, with direct ownership of QoS on switches and/or SmartNICs/DPUs
Deep knowledge of QoS mechanisms: classification/marking (DSCP/PCP), policing, shaping, queueing (PRIO, WRR/WFQ/DRR), scheduling hierarchies, and buffer management
Hands‑on with PFC, ETS, ECN/WRED, explicit buffer tuning, and RDMA/RoCE performance/correctness in production
Experience configuring merchant switch silicon (e.g., Broadcom Trident/Tomahawk, NVIDIA Spectrum, Marvell Teralynx) via NOS CLIs/SDKs (e.g., SONiC, Cumulus, NX‑OS, EOS, Onyx)
SmartNIC/DPU experience (e.g., NVIDIA BlueField, Intel IPU, AMD Pensando, Netronome/Agilio): queue configuration, rate limiting, hardware offloads, and host‑NIC QoS mapping
Proficiency with Linux networking (tc, qdisc, mqprio, XDP/eBPF), ethtool, RDMA tools (perftest, rdma-core utilities), and packet/flow analysis (tcpdump, Wireshark, INT/sFlow)
Strong automation skills: Python and/or Go for network automation, telemetry pipelines, and CI/CD integration; Git‑based workflows
Demonstrated ability to debug low‑level performance issues (NIC queues, IRQ affinity, NUMA, PCIe/xGMI topology, driver/firmware interactions)
Excellent written/verbal communication; strong design documentation and cross‑team leadership

Preferred

Large‑scale operations experience (10K+ servers or multi‑region fabrics) with QoS at fleet scale and multi‑tenant isolation
Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe‑oF, NFS, SMB, object) QoS trade‑offs
Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT, gNMI, in‑band telemetry, and P4 concepts
Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/NETCONF, gNOI
Vendor engagement/bring‑up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps
Security awareness for multi‑tenant environments (DSCP abuse, QoS starvation, control‑plane protection, CoPP/ACL integration)

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06 · Post-IPO Equity
2023-03-02 · Post-IPO Equity
2021-06-29 · Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP
Company data provided by Crunchbase