AMD · 1 week ago
Principal Network Software Development Engineer - QoS
AMD is a company focused on building innovative products that enhance computing experiences across various domains. The role involves owning the end-to-end QoS strategy and implementation for data center SmartNICs/DPUs, ensuring predictable performance for AI/ML and latency-sensitive services.
Responsibilities
Own QoS architecture across network tiers (host → NIC/DPU), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane)
Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP
Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell)
RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles
Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness
Instrumentation & observability: define SLIs/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts
Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions
Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, NETCONF/YANG), including pre‑deployment simulation and automated canary/rollback
Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements
Qualifications
Required
Strong object-oriented programming background, C/C++ preferred
Ability to write high quality code with a keen attention to detail
Experience with modern concurrent programming and threading APIs
Experience with Linux or similar operating system development
Experience with software development processes and tools such as debuggers, source code control systems (GitHub) and profilers is a plus
Effective communication and problem-solving skills
Bachelor's degree in Computer Science, Computer Engineering, or related field
Preferred
Large-scale operations experience (10K+ servers or multi-region fabrics) with QoS at fleet scale and multi-tenant isolation
Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe-oF, NFS, SMB, object) QoS trade-offs
Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT (in-band network telemetry), gNMI, and P4 concepts
Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/NETCONF, gNOI
Vendor engagement/bring-up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps
Security awareness for multi-tenant environments (DSCP abuse, QoS starvation, control-plane protection, CoPP/ACL integration)
Master's degree preferred
Benefits
AMD benefits at a glance.
Company
AMD
Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.
Funding
Current Stage: Public Company
Total Funding: unknown
Key Investors: OpenAI, Daniel Loeb
2025-10-06: Post-IPO Equity
2023-03-02: Post-IPO Equity
2021-06-29: Post-IPO Equity
Company data provided by Crunchbase