Principal Network Software Development Engineer - QoS jobs in United States

AMD · 1 week ago

Principal Network Software Development Engineer - QoS

AMD is a company focused on building innovative products that enhance computing experiences across various domains. The role involves owning the end-to-end QoS strategy and implementation for data center SmartNICs/DPUs, ensuring predictable performance for AI/ML and latency-sensitive services.

Embedded Software, Artificial Intelligence (AI), Semiconductor, Cloud Computing, Electronics, Hardware, AI Infrastructure, Computer, Embedded Systems, GPU
Growth Opportunities
No H1B
Hiring Manager
Brenda Wilson, CDR

Responsibilities

Own QoS architecture across network tiers (host → NIC/DPU), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane)
Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP
Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell)
RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles
Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness
Instrumentation & observability: define SLIs/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts
Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions
Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, NETCONF/YANG), including pre‑deployment simulation and automated canary/rollback
Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements

Qualification

L2/L3/L4 QoS, SmartNIC/DPU, Traffic classification, Congestion control, C/C++ programming, Linux development, AI/ML workloads, Performance engineering, Effective communication, Problem-solving skills

Required

Strong object-oriented programming background, C/C++ preferred
Ability to write high quality code with a keen attention to detail
Experience with modern concurrent programming and threading APIs
Experience with Linux or similar operating system development
Experience with software development processes and tools such as debuggers, source code control systems (GitHub) and profilers is a plus
Effective communication and problem-solving skills
Bachelor's degree in Computer Science, Computer Engineering, or related field

Preferred

Large-scale operations experience (10K+ servers or multi-region fabrics) with QoS at fleet scale and multi-tenant isolation
Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe-oF, NFS, SMB, object) QoS trade-offs
Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT, gNMI, in-band telemetry, and P4 concepts
Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/NETCONF, gNOI
Vendor engagement/bring-up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps
Security awareness for multi-tenant environments (DSCP abuse, QoS starvation, control-plane protection, CoPP/ACL integration)
Master's preferred

Benefits

AMD benefits at a glance.

Company

Advanced Micro Devices is a semiconductor company that designs and develops graphics units, processors, and media solutions.

Funding

Current Stage
Public Company
Total Funding
unknown
Key Investors
OpenAI, Daniel Loeb
2025-10-06 Post-IPO Equity
2023-03-02 Post-IPO Equity
2021-06-29 Post-IPO Equity

Leadership Team

Lisa Su
Chair & CEO
Mark Papermaster
CTO and EVP

Recent News

Tech Startups - Tech News, Tech Trends & Startup Funding
Company data provided by crunchbase