Hydra Host
AI Infrastructure Engineer
Hydra Host is a Founders Fund–backed NVIDIA cloud partner building the infrastructure platform that powers AI at scale. As an AI Infrastructure Engineer, you will work directly with AI platform customers to streamline their infrastructure on Hydra, ensuring production readiness and automating onboarding processes.
Artificial Intelligence (AI) · Cloud Computing · Software · Cloud Infrastructure · Developer APIs · Web Hosting
Responsibilities
Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware
Own the bare-metal-to-platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and the MLOps tooling that customers actually use
Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations. Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types
Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken
Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding
Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads
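The cost-per-token trade-off in the last responsibility can be sketched in a few lines. This is an illustrative calculation only; the hourly rates and throughput figures below are assumptions, not Hydra pricing or real benchmark numbers.

```python
# Illustrative cost-per-token arithmetic: a GPU with a higher hourly rate can
# still win on cost per token if its sustained throughput is proportionally
# higher. All rates and throughputs below are hypothetical examples.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Hypothetical comparison of two GPU classes:
big_gpu = cost_per_million_tokens(hourly_rate_usd=2.50, tokens_per_second=1500)
small_gpu = cost_per_million_tokens(hourly_rate_usd=1.20, tokens_per_second=600)
print(f"big: ${big_gpu:.3f}/1M tok, small: ${small_gpu:.3f}/1M tok")
```

Under these assumed numbers the pricier card is cheaper per token, which is exactly the kind of non-obvious trade-off the role would walk customers through.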
Qualifications
Required
Bare-metal Linux depth: you've administered GPU servers at the metal (driver stacks, kernel tuning, firmware, storage configuration), not just managed K8s
NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance
Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them
Networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed AI workloads
Customer-facing communication: you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team
Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps one
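The nvidia-smi profiling mentioned above often means scripting against its machine-readable output (e.g. `nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu --format=csv,noheader,nounits`). A minimal sketch of a fleet health check over that CSV format; the sample output and the 85 °C threshold are assumptions for illustration:

```python
# Parse nvidia-smi CSV output and flag GPUs running hotter than a threshold.
# SAMPLE stands in for output captured via subprocess on a real node.
import csv
import io

SAMPLE = """\
0, NVIDIA H100 80GB HBM3, 550.54.15, 81559, 64
1, NVIDIA H100 80GB HBM3, 550.54.15, 81559, 93
"""

def flag_hot_gpus(smi_csv: str, max_temp_c: int = 85) -> list[int]:
    """Return indices of GPUs whose reported temperature exceeds max_temp_c."""
    hot = []
    for row in csv.reader(io.StringIO(smi_csv)):
        index, name, driver, mem_mib, temp = [field.strip() for field in row]
        if int(temp) > max_temp_c:
            hot.append(int(index))
    return hot

print(flag_hot_gpus(SAMPLE))  # → [1]
```

The same pattern extends to ECC error counts, clock throttling reasons, and memory utilization, which is the kind of validation the role runs before handing a node to a customer.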
Preferred
HPC or large-scale distributed training environments
AI workload experience (vLLM, PyTorch, inference frameworks)
Storage systems (NVMe, distributed filesystems, CEPH, WEKA)
IaC and provisioning tools (Terraform, Ansible, Cloud-init, MaaS)
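The provisioning tools in the last bullet typically meet GPU nodes through cloud-init. A minimal sketch, assuming an Ubuntu image with distro-packaged NVIDIA drivers; the driver version and reboot policy are illustrative choices, not a prescribed setup:

```yaml
#cloud-config
# Hypothetical cloud-init fragment for first boot of a GPU node.
# Real deployments usually pin driver/CUDA versions per customer workload.
package_update: true
packages:
  - nvidia-driver-550-server
  - nvidia-utils-550-server
runcmd:
  - [sh, -c, "nvidia-smi || echo 'driver not loaded yet; reboot pending'"]
power_state:
  mode: reboot
  condition: true
```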
Benefits
Equity ownership — meaningful stake in what we're building
Healthcare — medical, dental, vision for you and your family
Remote-first — with hubs in Phoenix, Boulder, and Miami
Company
Hydra Host
Hydra offers a bare metal GPU platform, connecting businesses to a variety of independent but standardized AI Factory Franchises.
Funding
Current Stage: Early Stage
Total Funding: $10M
Key Investors: Flume Ventures, Founders Fund
2025-02-10 · Seed
2024-09-12 · Seed
2022-04-06 · Seed · $10M
Company data provided by Crunchbase