
ChatGPT Jobs · 1 day ago

Senior AI Infrastructure Engineer

WEX is seeking a Senior AI Infrastructure Engineer to build and maintain a high-performance compute foundation for its generative AI and machine learning initiatives. The role involves architecting a platform for production AI workloads that is resilient, secure, and cost-effective, while optimizing the compute layer and deployment pipelines.
Computer Software

Responsibilities

Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving
Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token
Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot-instance management to control costs
Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning
Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure
Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure
Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively
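To make the cost lens in the inference-optimization bullet concrete, here is a minimal sketch of how cost per token falls out of instance price and sustained serving throughput. The dollar and throughput figures below are illustrative assumptions, not numbers from this posting:

```python
# Illustrative only: relates GPU instance price and serving throughput
# to cost per million generated tokens. All figures are assumptions.

def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost to generate 1M tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $32/hr multi-GPU node serving 2,400 tokens/s total.
baseline = cost_per_million_tokens(32.0, 2400)

# Doubling throughput at the same instance price (e.g., via continuous
# batching in an optimized serving engine) halves the cost per token.
optimized = cost_per_million_tokens(32.0, 4800)

print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
```

This is why serving-engine tuning (batching, KV-cache management, quantization) shows up as a cost lever and not just a latency one: throughput gains translate directly into lower cost per token on the same hardware.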

Qualifications

Kubernetes, Python, AWS, Infrastructure as Code, GPU architectures, Model Serving, Automation, Observability, Go, Docker, Terraform, Prometheus, Grafana

Required

5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure
Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems
Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads
Experience deploying and scaling open-source LLMs and embedding models using containerized solutions
Strong belief in "Everything as Code": you automate toil wherever possible using Python, Go, or Bash
Expert proficiency in Python and Go; comfortable digging into lower-level system performance
Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes
Advanced skills with Terraform, CloudFormation, or Pulumi
Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe
Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking
Proficiency with Prometheus, Grafana, DataDog, and tracing tools (OpenTelemetry)
Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC)
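As a small illustration of the observability side of the role, percentile latency (p50/p99) is the usual summary a Prometheus histogram exposes for model serving. This stdlib-only sketch uses synthetic latency samples; the distribution parameters are invented for the example:

```python
# Illustrative sketch: computing p50/p99 latency from request samples,
# the kind of summary a serving dashboard would track per model.
import random

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, rank))]

random.seed(42)  # deterministic synthetic data
# Fake per-request latencies in ms: mostly fast, with a slow tail,
# which is why p99 (not the mean) is the SLO-relevant number.
latencies = [random.gauss(120, 15) for _ in range(950)] + \
            [random.gauss(600, 80) for _ in range(50)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.0f}ms p99={p99:.0f}ms")
```

The tail dominates the p99 even though it is only 5% of traffic, which is the usual argument for alerting on high percentiles rather than averages for inference endpoints.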

Preferred

Experience with Ray or Slurm is a huge plus

Benefits

Medical
Dental
Vision
Life
Retirement
PTO

Company

ChatGPT Jobs

We find the best job offers for experts in ChatGPT and related technologies.
