
ChatGPT Jobs · 1 day ago

Senior AI Infrastructure Engineer

WEX is seeking a Senior AI Infrastructure Engineer to build and maintain a high-performance compute foundation for its generative AI and machine learning initiatives. The role involves architecting a platform for production AI workloads that is resilient, secure, and cost-effective, while optimizing the compute layer and deployment pipelines.
Computer Software

Responsibilities

Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving
Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token
Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot-instance management to control costs
Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning
Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure
Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure
Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively
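To make the cost lens in the inference-optimization bullet concrete, here is a minimal sketch of how cost per token falls out of instance price and sustained serving throughput. The dollar and throughput figures below are illustrative assumptions, not numbers from this posting:

```python
# Illustrative only: relates GPU instance price and serving throughput
# to cost per million generated tokens. All figures are assumptions.

def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost to generate 1M tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $32/hr multi-GPU node serving 2,400 tokens/s total.
baseline = cost_per_million_tokens(32.0, 2400)

# Doubling throughput at the same instance price (e.g., via continuous
# batching in an optimized serving engine) halves the cost per token.
optimized = cost_per_million_tokens(32.0, 4800)

print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
```

This is why serving-engine tuning (batching, KV-cache management, quantization) shows up as a cost lever and not just a latency one: throughput gains translate directly into lower cost per token on the same hardware.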

Qualifications

Kubernetes, Python, AWS, Infrastructure as Code, GPU architectures, Model Serving, Automation, Observability, Go, Docker, Terraform, Prometheus, Grafana

Required

5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure
Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems
Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads
Experience deploying and scaling open-source LLMs and embedding models using containerized solutions
Strong belief in "Everything as Code": you automate toil wherever possible using Python, Go, or Bash
Expert proficiency in Python and Go; comfortable digging into lower-level system performance
Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes
Advanced skills with Terraform, CloudFormation, or Pulumi
Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe
Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking
Proficiency with Prometheus, Grafana, DataDog, and tracing tools (OpenTelemetry)
Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC)
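As a small illustration of the observability side of the role, percentile latency (p50/p99) is the usual summary a Prometheus histogram exposes for model serving. This stdlib-only sketch uses synthetic latency samples; the distribution parameters are invented for the example:

```python
# Illustrative sketch: computing p50/p99 latency from request samples,
# the kind of summary a serving dashboard would track per model.
import random

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, rank))]

random.seed(42)  # deterministic synthetic data
# Fake per-request latencies in ms: mostly fast, with a slow tail,
# which is why p99 (not the mean) is the SLO-relevant number.
latencies = [random.gauss(120, 15) for _ in range(950)] + \
            [random.gauss(600, 80) for _ in range(50)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.0f}ms p99={p99:.0f}ms")
```

The tail dominates the p99 even though it is only 5% of traffic, which is the usual argument for alerting on high percentiles rather than averages for inference endpoints.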

Preferred

Experience with Ray or Slurm is a huge plus

Benefits

Medical
Dental
Vision
Life
Retirement
PTO

Company

ChatGPT Jobs

We find the best job offers for experts in ChatGPT and related technologies.
