Advanced Microdevices Pvt. Ltd. (India) · 4 weeks ago
AI/ML and GPU Performance QA Engineer
Advanced Micro Devices, Inc. is a company focused on building products that enhance next-generation computing experiences. They are seeking a Senior Technical Validation Engineer to lead validation and performance engineering for Machine Learning and High-Performance Computing frameworks, ensuring the delivery of high-quality software for AI and HPC workloads.
Biopharma, Biotechnology, Industrial, Manufacturing
Responsibilities
Lead validation for ML/AI models: accuracy testing, performance benchmarking, regression, drift detection, A/B testing
Test ML frameworks: PyTorch, Hugging Face, MLflow experiment tracking
Validate a wide variety of AI models to ensure correctness in distributed training or inference
Perform GPU testing & profiling: ROCm/CUDA validation, performance profiling, memory/thermal analysis, multi-GPU scaling
Validate HPC frameworks, distributed runtimes, compilers, and GPU libraries
Build scalable CI/CD workflows for ML/HPC validation. Develop automated test pipelines using Docker, Kubernetes, GitHub Actions, Jenkins
Validate cloud-based AI workloads on AWS SageMaker, Lambda, and S3
Test benchmarks in containerized and virtualized GPU environments
Design and implement automated validation pipelines for ML frameworks (e.g., PyTorch, TensorFlow, JAX) across GPU platforms
Develop and maintain benchmarking suites for AI models and HPC workloads, focusing on performance, scalability, and regression detection
Lead multi-node validation efforts using orchestration tools (e.g., Slurm, MPI, Kubernetes) to simulate real-world distributed training and inference
Collaborate with hardware and software teams to validate GPU hardware platforms (NVIDIA CUDA, AMD ROCm) for ML and HPC readiness
Analyze performance metrics using profiling tools (e.g., rocprof) and provide actionable insights
Drive test content development for emerging AI workloads, including LLMs, vision models, and scientific computing benchmarks
Perform bottleneck analysis, hyperparameter validation, and competitive benchmarking
Mentor junior engineers and contribute to validation strategy, tooling, and best practices
Qualifications
Required
Good understanding of and experience with ROCm, CUDA, GPU architecture, ML frameworks, CI/CD systems, benchmarking, and competitive analysis
Preferred
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field
8+ years of experience in validation engineering, ML infrastructure, or HPC performance testing
Strong hands-on experience with GPU platforms (NVIDIA CUDA, AMD ROCm) and their software ecosystems
Deep understanding of AI model architectures, training/inference workflows, and ML performance bottlenecks
Proven experience with CI/CD systems, Git, Docker, and automated test frameworks
Expertise in multi-node orchestration and distributed system validation
Familiarity with HPC benchmarks (e.g., HPL, HPCG, MLPerf) and AI model benchmarking methodologies
Proficiency in scripting and automation (Python, Bash, YAML) in Linux environments
Strong communication, documentation, and cross-functional collaboration skills
Benefits
AMD benefits at a glance.
Company
Advanced Microdevices Pvt. Ltd. (India)
Advanced Microdevices (mdi) is a leader in innovative membrane technologies.
Funding
Current Stage
Late Stage
Leadership Team
Nalini Kant Gupta
Founder & Managing Director
Company data provided by Crunchbase