
Assail · 1 week ago

Senior AI Infrastructure Engineer

Assail is seeking a Senior AI Infrastructure Engineer to manage the uptime, performance, cost control, and security of its autonomous AI penetration testing platform. The role involves deploying and operating AI workloads, optimizing GPU utilization, and ensuring reliable, scalable infrastructure for AI operations.
Artificial Intelligence (AI) · Cloud Computing · Software · Cloud Data Services · Developer APIs

Responsibilities

Deploy and operate our vLLM-based inference stack serving a custom fine-tuned 14B+ parameter security model
Optimize Time to First Token (TTFT) and tail latency under concurrent load from our 145-agent swarm
Manage multi-model routing across specialized functions (code analysis, deobfuscation, reasoning)
Ensure OpenAI-compatible API availability with <100ms p99 latency targets
Manage and optimize NVIDIA RTX GPU utilization (RTX 5090 / CUDA 12.1+) within our Kubernetes clusters
Configure GPU passthrough, tensor parallelism, and memory optimization for vLLM inference
Design scheduling and autoscaling strategies to minimize idle GPU spend while supporting burst agent workloads
Forecast GPU capacity needs as the agent swarm scales (currently 145 agents across 10 types)
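For illustration, a minimal sketch of the engine knobs these items refer to, using the vLLM Python API; the model path, parallelism degree, context length, and memory fraction are placeholder assumptions, and production serving would use the equivalent flags on vLLM's OpenAI-compatible server:

```python
# Illustrative only: engine configuration for a fine-tuned ~14B model on RTX-class GPUs.
# The model path, tensor_parallel_size, and memory fraction are assumptions, not Assail's values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/security-14b-merged",   # hypothetical path to the merged fine-tuned weights
    tensor_parallel_size=2,                # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,           # VRAM reserved for weights + KV cache
    max_model_len=16384,                   # cap context to keep the KV cache within VRAM
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the attack surface of this APK manifest: ..."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism trades inter-GPU communication for fitting larger weights per replica, while the memory-utilization fraction bounds KV-cache headroom and therefore concurrency and TTFT under swarm load.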
Own our K3s-based Kubernetes infrastructure running on RHEL 10
Manage StatefulSets for stateful services (Neo4j, Qdrant, Kafka, Zookeeper, Android emulators)
Configure HPA (Horizontal Pod Autoscaler) for agent deployments (1-20 replicas per agent type)
Operate Podman rootless containers with Buildah for secure image builds
Maintain local container registry and image lifecycle
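As a sketch of the HPA item above, expressed through the official Kubernetes Python client; the deployment name, namespace, and CPU target are placeholder assumptions, not Assail's actual manifests:

```python
# Illustrative only: an autoscaling/v2 HPA for one agent type (1-20 replicas, per the posting),
# built as a dict and applied with the Kubernetes Python client. Names are placeholders.
from kubernetes import client, config, utils

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api_client = client.ApiClient()

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "recon-agent-hpa", "namespace": "tenant-demo"},  # hypothetical names
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "recon-agent"},
        "minReplicas": 1,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
        }],
    },
}

utils.create_from_dict(api_client, hpa)
```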
Operate our Apache Kafka 3.6 backbone handling 10,000+ messages/sec for agent coordination
Monitor consumer lag, partition health, and message throughput across tenant-scoped topics
Ensure exactly-once delivery semantics for mission-critical agent task distribution
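A minimal sketch of the two halves of exactly-once task distribution described above, using confluent-kafka-python; the broker address, topic, group, and transactional.id are placeholder assumptions:

```python
# Illustrative only: a transactional producer plus a read_committed consumer.
# A full exactly-once pipeline would also commit consumer offsets inside the producer transaction.
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,            # no duplicates on broker-side retries
    "transactional.id": "mission-dispatcher-1",
})
producer.init_transactions()
producer.begin_transaction()
producer.produce("tenant-demo-missions", key=b"mission-42", value=b'{"task": "deobfuscate"}')
producer.commit_transaction()              # call abort_transaction() on failure instead

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "deobfuscation-agents",
    "isolation.level": "read_committed",   # never surface messages from aborted transactions
    "enable.auto.commit": False,           # commit offsets only after the task completes
})
consumer.subscribe(["tenant-demo-missions"])
```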
PostgreSQL 16: Tenant data with Row-Level Security, PgBouncer connection pooling
Neo4j 5.x: Graph database for attack chains and knowledge graphs
Qdrant 1.7: Vector database for semantic search and pattern matching
Redis 7: Short-term memory, caching, and pub/sub (24hr TTL patterns)
MinIO: S3-compatible object storage for APK artifacts and reports
Build CI/CD pipelines that support the following (a LoRA-merge sketch follows this list):
Model weight deployment and LoRA adapter merges
Configuration and prompt updates
Automated testing and canary releases for AI features
Integrate with our custom deployment tooling (ares-cc.py) and Helm charts
Enable fast rollback when model behavior or inference performance regresses
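For the LoRA adapter merge step above, a minimal sketch using Hugging Face transformers and peft; the base model, adapter path, and output directory are placeholder assumptions, not the actual ares-cc.py tooling:

```python
# Illustrative only: merge a LoRA adapter into the base weights so vLLM serves a single
# merged checkpoint. Paths and dtype are assumptions, not Assail's pipeline values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "/models/base-14b"              # hypothetical base checkpoint
ADAPTER = "/models/adapters/sec-lora"  # hypothetical LoRA adapter from fine-tuning
OUT = "/models/security-14b-merged"    # what the inference stack actually loads

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)  # keep the tokenizer alongside the weights
```

Merging ahead of deployment keeps serving simple (one set of weights per model version) and makes rollback a matter of pointing the inference stack back at the previous merged checkpoint.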
Implement end-to-end tracing with Jaeger across gRPC services, Kafka, and model inference
Extend our Prometheus metrics (a sketch follows this list) for:
Per-agent task duration and failure rates
vLLM request latency and GPU utilization
Kafka consumer lag and throughput
Database query performance (Neo4j graph traversals, Qdrant vector searches)
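A minimal sketch of custom metrics of this shape using prometheus_client; metric names, labels, buckets, and the port are placeholder assumptions rather than the platform's actual schema:

```python
# Illustrative only: agent-level metrics exported for Prometheus to scrape.
import time
from prometheus_client import Counter, Histogram, start_http_server

TASK_DURATION = Histogram(
    "agent_task_duration_seconds", "Per-agent task duration",
    labelnames=["agent_type"],
    buckets=(0.5, 1, 2, 5, 10, 30, 60, 120),
)
TASK_FAILURES = Counter(
    "agent_task_failures_total", "Per-agent task failures",
    labelnames=["agent_type", "reason"],
)

def run_task():
    time.sleep(0.1)  # stand-in for the agent's actual work

start_http_server(9100)  # Prometheus scrapes http://<pod>:9100/metrics

try:
    with TASK_DURATION.labels(agent_type="deobfuscation").time():
        run_task()
except TimeoutError:
    TASK_FAILURES.labels(agent_type="deobfuscation", reason="model_timeout").inc()
```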
Maintain Grafana dashboards for Agent Swarm Overview, Mission Performance, Infrastructure Health, and DAST Metrics
Design graceful degradation when models time out or agents fail
Enforce multi-tenant isolation at every layer (a sketch follows this list):
Kubernetes namespaces and NetworkPolicies
Tenant-scoped Kafka topics (tenant-{id}-missions, tenant-{id}-agent-events)
Row-Level Security in PostgreSQL
Filtered vector search in Qdrant by tenant_id
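A minimal sketch of two of these isolation layers (PostgreSQL Row-Level Security and tenant-filtered Qdrant search); the table, collection, DSN, and vector size are placeholder assumptions:

```python
# Illustrative only: tenant isolation enforced in the data layer. Schema names, the DSN,
# and the embedding size are assumptions, not the platform's actual configuration.
import psycopg
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

TENANT_ID = "11111111-2222-3333-4444-555555555555"  # hypothetical tenant UUID

with psycopg.connect("dbname=ares") as conn:
    # One-time migration: rows become invisible unless the session's tenant matches.
    conn.execute("ALTER TABLE missions ENABLE ROW LEVEL SECURITY")
    conn.execute("""
        CREATE POLICY tenant_isolation ON missions
        USING (tenant_id = current_setting('app.tenant_id')::uuid)
    """)
    # Per-session: scope every subsequent query to one tenant.
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (TENANT_ID,))
    rows = conn.execute("SELECT id, status FROM missions").fetchall()

# Qdrant: vector search is always filtered by tenant_id so embeddings never cross tenants.
qdrant = QdrantClient(url="http://qdrant:6333")
hits = qdrant.search(
    collection_name="finding_patterns",
    query_vector=[0.0] * 768,  # placeholder embedding
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=TENANT_ID))]),
    limit=10,
)
```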
Manage secrets via HashiCorp Vault (30+ secrets across 6 categories)
Maintain mTLS for all service-to-service communication (gRPC, database connections)
Operate container security scanning with Trivy, Cosign (image signing), and Syft (SBOM generation)
Enforce SELinux in production (RHEL 10)
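For the Vault item above, a minimal sketch of reading a credential from a KV v2 engine with hvac; the address, auth method, mount point, and secret path are placeholder assumptions:

```python
# Illustrative only: fetching a database credential from Vault's KV v2 engine.
import hvac

vault = hvac.Client(url="https://vault.internal:8200")
vault.auth.approle.login(role_id="...", secret_id="...")  # or Kubernetes auth inside the cluster

secret = vault.secrets.kv.v2.read_secret_version(
    mount_point="ares",          # hypothetical KV v2 mount
    path="database/postgres",    # hypothetical secret path
)
db_password = secret["data"]["data"]["password"]
```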

Qualifications

Kubernetes · vLLM · NVIDIA GPU operations · Apache Kafka · Terraform · PostgreSQL · Neo4j · Qdrant · Redis · Prometheus · Grafana · Jaeger · HashiCorp Vault · gRPC · Soft skills

Required

7+ years in DevOps, SRE, or Infrastructure Engineering
1–2+ years owning production AI workloads, including GPU-backed inference or large-scale model serving
Demonstrated ownership of production systems under load, not just 'set it up once and move on'
Kubernetes (Expert) — K3s, EKS, or equivalent
Podman and Buildah (rootless container runtime)
Helm for chart management
Experience with GPU scheduling, node pools, and StatefulSets
Infrastructure as Code: Terraform (we deploy on AWS)
AWS: EKS, IAM, VPC, or equivalent cloud experience
Experience with local/on-prem GPU environments alongside cloud
vLLM (required) — production deployment and optimization
NVIDIA GPU operations (CUDA, driver management, memory optimization)
Familiarity with quantization (INT4/INT8 via BitsAndBytes) and model optimization
Apache Kafka — production operations, consumer groups, exactly-once semantics
PostgreSQL — Row-Level Security, connection pooling, replication
Neo4j — graph database operations and Cypher queries
Qdrant or similar vector database — HNSW indexing, filtered search
Redis — pub/sub, caching patterns, Streams
Prometheus — custom metrics, alerting rules
Grafana — dashboard creation and maintenance
Jaeger or OpenTelemetry for distributed tracing
Experience with gRPC observability
gRPC and Protobuf — service mesh patterns, load balancing
VPC design, private networking, Kubernetes NetworkPolicies
HashiCorp Vault for secrets management
TLS/mTLS configuration for databases and services
Operating AI systems in adversarial environments (security product context)
Preventing data leakage across multi-tenant boundaries
Supporting reproducible, auditable AI outputs for security findings
Understanding the blast radius of misconfigured AI infrastructure in a pentest platform

Preferred

Experience with Android emulator farms or mobile device infrastructure at scale
Prior work supporting security products or regulated data environments
Familiarity with Frida, mitmproxy, or similar runtime instrumentation tools
Background in gRPC-based microservices and event-driven architectures
Experience with supply chain security (Trivy, Cosign, SBOM generation)

Company

Assail

Assail AI develops Ares, an agentic AI platform for autonomous offensive security testing targeting mobile apps, APIs, and web infrastructure.

Funding

Current Stage: Early Stage
Total Funding: $0.25M
Key Investors: Squared Circle Ventures
2026-01-13 · Pre Seed · $0.25M

Leadership Team

Melissa Knight
Co-Founder, Chief Revenue Officer & Customer Champion
Company data provided by Crunchbase