
Assail · 1 week ago

Senior AI Infrastructure Engineer

Assail is seeking a Senior AI Infrastructure Engineer to manage the uptime, performance, cost control, and security of its autonomous AI penetration testing platform. The role involves deploying and operating AI workloads, optimizing GPU utilization, and ensuring reliable, scalable infrastructure for AI operations.
Artificial Intelligence (AI) · Cloud Computing · Software · Cloud Data Services · Developer APIs

Responsibilities

Deploy and operate our vLLM-based inference stack serving a custom fine-tuned 14B+ parameter security model
Optimize Time to First Token (TTFT) and tail latency under concurrent load from our 145-agent swarm
Manage multi-model routing across specialized functions (code analysis, deobfuscation, reasoning)
Ensure OpenAI-compatible API availability with <100ms p99 latency targets
Manage and optimize NVIDIA RTX GPU utilization (RTX 5090 / CUDA 12.1+) within our Kubernetes clusters
Configure GPU passthrough, tensor parallelism, and memory optimization for vLLM inference
Design scheduling and autoscaling strategies to minimize idle GPU spend while supporting burst agent workloads
Forecast GPU capacity needs as the agent swarm scales (currently 145 agents across 10 types)
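For illustration, a minimal sketch of the engine knobs these items refer to, using the vLLM Python API; the model path, parallelism degree, context length, and memory fraction are placeholder assumptions, and production serving would use the equivalent flags on vLLM's OpenAI-compatible server:

```python
# Illustrative only: engine configuration for a fine-tuned ~14B model on RTX-class GPUs.
# The model path, tensor_parallel_size, and memory fraction are assumptions, not Assail's values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/security-14b-merged",   # hypothetical path to the merged fine-tuned weights
    tensor_parallel_size=2,                # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,           # VRAM reserved for weights + KV cache
    max_model_len=16384,                   # cap context to keep the KV cache within VRAM
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the attack surface of this APK manifest: ..."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism trades inter-GPU communication for fitting larger weights per replica, while the memory-utilization fraction bounds KV-cache headroom and therefore concurrency and TTFT under swarm load.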
Own our K3s-based Kubernetes infrastructure running on RHEL 10
Manage StatefulSets for stateful services (Neo4j, Qdrant, Kafka, Zookeeper, Android emulators)
Configure HPA (Horizontal Pod Autoscaler) for agent deployments (1-20 replicas per agent type)
Operate Podman rootless containers with Buildah for secure image builds
Maintain local container registry and image lifecycle
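As a sketch of the HPA item above, expressed through the official Kubernetes Python client; the deployment name, namespace, and CPU target are placeholder assumptions, not Assail's actual manifests:

```python
# Illustrative only: an autoscaling/v2 HPA for one agent type (1-20 replicas, per the posting),
# built as a dict and applied with the Kubernetes Python client. Names are placeholders.
from kubernetes import client, config, utils

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api_client = client.ApiClient()

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "recon-agent-hpa", "namespace": "tenant-demo"},  # hypothetical names
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "recon-agent"},
        "minReplicas": 1,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
        }],
    },
}

utils.create_from_dict(api_client, hpa)
```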
Operate our Apache Kafka 3.6 backbone handling 10,000+ messages/sec for agent coordination
Monitor consumer lag, partition health, and message throughput across tenant-scoped topics
Ensure exactly-once delivery semantics for mission-critical agent task distribution
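A minimal sketch of the two halves of exactly-once task distribution described above, using confluent-kafka-python; the broker address, topic, group, and transactional.id are placeholder assumptions:

```python
# Illustrative only: a transactional producer plus a read_committed consumer.
# A full exactly-once pipeline would also commit consumer offsets inside the producer transaction.
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,            # no duplicates on broker-side retries
    "transactional.id": "mission-dispatcher-1",
})
producer.init_transactions()
producer.begin_transaction()
producer.produce("tenant-demo-missions", key=b"mission-42", value=b'{"task": "deobfuscate"}')
producer.commit_transaction()              # call abort_transaction() on failure instead

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "deobfuscation-agents",
    "isolation.level": "read_committed",   # never surface messages from aborted transactions
    "enable.auto.commit": False,           # commit offsets only after the task completes
})
consumer.subscribe(["tenant-demo-missions"])
```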
PostgreSQL 16: Tenant data with Row-Level Security, PgBouncer connection pooling
Neo4j 5.x: Graph database for attack chains and knowledge graphs
Qdrant 1.7: Vector database for semantic search and pattern matching
Redis 7: Short-term memory, caching, and pub/sub (24hr TTL patterns)
MinIO: S3-compatible object storage for APK artifacts and reports
Build CI/CD pipelines that support the following (a LoRA-merge sketch follows this list):
Model weight deployment and LoRA adapter merges
Configuration and prompt updates
Automated testing and canary releases for AI features
Integrate with our custom deployment tooling (ares-cc.py) and Helm charts
Enable fast rollback when model behavior or inference performance regresses
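For the LoRA adapter merge step above, a minimal sketch using Hugging Face transformers and peft; the base model, adapter path, and output directory are placeholder assumptions, not the actual ares-cc.py tooling:

```python
# Illustrative only: merge a LoRA adapter into the base weights so vLLM serves a single
# merged checkpoint. Paths and dtype are assumptions, not Assail's pipeline values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "/models/base-14b"              # hypothetical base checkpoint
ADAPTER = "/models/adapters/sec-lora"  # hypothetical LoRA adapter from fine-tuning
OUT = "/models/security-14b-merged"    # what the inference stack actually loads

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)  # keep the tokenizer alongside the weights
```

Merging ahead of deployment keeps serving simple (one set of weights per model version) and makes rollback a matter of pointing the inference stack back at the previous merged checkpoint.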
Implement end-to-end tracing with Jaeger across gRPC services, Kafka, and model inference
Extend our Prometheus metrics (a sketch follows this list) for:
Per-agent task duration and failure rates
vLLM request latency and GPU utilization
Kafka consumer lag and throughput
Database query performance (Neo4j graph traversals, Qdrant vector searches)
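A minimal sketch of custom metrics of this shape using prometheus_client; metric names, labels, buckets, and the port are placeholder assumptions rather than the platform's actual schema:

```python
# Illustrative only: agent-level metrics exported for Prometheus to scrape.
import time
from prometheus_client import Counter, Histogram, start_http_server

TASK_DURATION = Histogram(
    "agent_task_duration_seconds", "Per-agent task duration",
    labelnames=["agent_type"],
    buckets=(0.5, 1, 2, 5, 10, 30, 60, 120),
)
TASK_FAILURES = Counter(
    "agent_task_failures_total", "Per-agent task failures",
    labelnames=["agent_type", "reason"],
)

def run_task():
    time.sleep(0.1)  # stand-in for the agent's actual work

start_http_server(9100)  # Prometheus scrapes http://<pod>:9100/metrics

try:
    with TASK_DURATION.labels(agent_type="deobfuscation").time():
        run_task()
except TimeoutError:
    TASK_FAILURES.labels(agent_type="deobfuscation", reason="model_timeout").inc()
```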
Maintain Grafana dashboards for Agent Swarm Overview, Mission Performance, Infrastructure Health, and DAST Metrics
Design graceful degradation when models time out or agents fail
Enforce multi-tenant isolation at every layer (a sketch follows this list):
Kubernetes namespaces and NetworkPolicies
Tenant-scoped Kafka topics (tenant-{id}-missions, tenant-{id}-agent-events)
Row-Level Security in PostgreSQL
Filtered vector search in Qdrant by tenant_id
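A minimal sketch of two of these isolation layers (PostgreSQL Row-Level Security and tenant-filtered Qdrant search); the table, collection, DSN, and vector size are placeholder assumptions:

```python
# Illustrative only: tenant isolation enforced in the data layer. Schema names, the DSN,
# and the embedding size are assumptions, not the platform's actual configuration.
import psycopg
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

TENANT_ID = "11111111-2222-3333-4444-555555555555"  # hypothetical tenant UUID

with psycopg.connect("dbname=ares") as conn:
    # One-time migration: rows become invisible unless the session's tenant matches.
    conn.execute("ALTER TABLE missions ENABLE ROW LEVEL SECURITY")
    conn.execute("""
        CREATE POLICY tenant_isolation ON missions
        USING (tenant_id = current_setting('app.tenant_id')::uuid)
    """)
    # Per-session: scope every subsequent query to one tenant.
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (TENANT_ID,))
    rows = conn.execute("SELECT id, status FROM missions").fetchall()

# Qdrant: vector search is always filtered by tenant_id so embeddings never cross tenants.
qdrant = QdrantClient(url="http://qdrant:6333")
hits = qdrant.search(
    collection_name="finding_patterns",
    query_vector=[0.0] * 768,  # placeholder embedding
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=TENANT_ID))]),
    limit=10,
)
```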
Manage secrets via HashiCorp Vault (30+ secrets across 6 categories)
Maintain mTLS for all service-to-service communication (gRPC, database connections)
Operate container security scanning with Trivy, Cosign (image signing), and Syft (SBOM generation)
Enforce SELinux in production (RHEL 10)
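For the Vault item above, a minimal sketch of reading a credential from a KV v2 engine with hvac; the address, auth method, mount point, and secret path are placeholder assumptions:

```python
# Illustrative only: fetching a database credential from Vault's KV v2 engine.
import hvac

vault = hvac.Client(url="https://vault.internal:8200")
vault.auth.approle.login(role_id="...", secret_id="...")  # or Kubernetes auth inside the cluster

secret = vault.secrets.kv.v2.read_secret_version(
    mount_point="ares",          # hypothetical KV v2 mount
    path="database/postgres",    # hypothetical secret path
)
db_password = secret["data"]["data"]["password"]
```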

Qualifications

Kubernetes · vLLM · NVIDIA GPU operations · Apache Kafka · Terraform · PostgreSQL · Neo4j · Qdrant · Redis · Prometheus · Grafana · Jaeger · HashiCorp Vault · gRPC · Soft skills

Required

7+ years in DevOps, SRE, or Infrastructure Engineering
1–2+ years owning production AI workloads, including GPU-backed inference or large-scale model serving
Demonstrated ownership of production systems under load, not just 'set it up once and move on'
Kubernetes (Expert) — K3s, EKS, or equivalent
Podman and Buildah (rootless container runtime)
Helm for chart management
Experience with GPU scheduling, node pools, and StatefulSets
Infrastructure as Code: Terraform (we deploy on AWS)
AWS: EKS, IAM, VPC, or equivalent cloud experience
Experience with local/on-prem GPU environments alongside cloud
vLLM (required) — production deployment and optimization
NVIDIA GPU operations (CUDA, driver management, memory optimization)
Familiarity with quantization (INT4/INT8 via BitsAndBytes) and model optimization
Apache Kafka — production operations, consumer groups, exactly-once semantics
PostgreSQL — Row-Level Security, connection pooling, replication
Neo4j — graph database operations and Cypher queries
Qdrant or similar vector database — HNSW indexing, filtered search
Redis — pub/sub, caching patterns, Streams
Prometheus — custom metrics, alerting rules
Grafana — dashboard creation and maintenance
Jaeger or OpenTelemetry for distributed tracing
Experience with gRPC observability
gRPC and Protobuf — service mesh patterns, load balancing
VPC design, private networking, Kubernetes NetworkPolicies
HashiCorp Vault for secrets management
TLS/mTLS configuration for databases and services
Operating AI systems in adversarial environments (security product context)
Preventing data leakage across multi-tenant boundaries
Supporting reproducible, auditable AI outputs for security findings
Understanding the blast radius of misconfigured AI infrastructure in a pentest platform

Preferred

Experience with Android emulator farms or mobile device infrastructure at scale
Prior work supporting security products or regulated data environments
Familiarity with Frida, mitmproxy, or similar runtime instrumentation tools
Background in gRPC-based microservices and event-driven architectures
Experience with supply chain security (Trivy, Cosign, SBOM generation)

Company

Assail

Assail AI develops Ares, an agentic AI platform for autonomous offensive security testing targeting mobile apps, APIs, and web infrastructure.

Funding

Current Stage: Early Stage
Total Funding: $0.25M
Key Investors: Squared Circle Ventures
2026-01-13 · Pre Seed · $0.25M

Leadership Team

Melissa Knight
Co-Founder, Chief Revenue Officer & Customer Champion
Company data provided by Crunchbase