Corelight · 1 day ago
Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)
Corelight is a cybersecurity company that transforms network and cloud activity into evidence for threat detection and response. The Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE) will ensure the stability, performance, and security of the cloud platform, focusing on availability, performance optimization, and compliance with FedRAMP standards.
Analytics · Cyber Security · Network Security · Security · Software
Responsibilities
Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region's infrastructure
Design, deploy, and scale AI/ML/LLM infrastructure across cloud platforms (AWS, Azure, or GCP) ensuring high reliability and performance
Manage and optimize Kubernetes environments (EKS, AKS, GKE) for AI services, data pipelines, and model operations
Build and automate end-to-end data and model pipelines for fine-tuning, inference, and RAG workloads using Terraform, Python, and CI/CD tooling
Use GitOps workflows, CI/CD pipelines, and containerization technologies (Docker, Kubernetes) to automate ML/LLM tasks across the Large Language Model lifecycle
Implement monitoring, observability, and reliability best practices using Prometheus, Grafana, ELK/EFK, Langfuse, and SLI/SLO/SLA frameworks
Participate in 24x7 on-call rotations, leading incident response, performance tuning, and cost optimization across the SaaS platform and production workloads
Own infrastructure end to end, leading scaling initiatives, deployments, and automation, and providing technical leadership across the team
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
8+ years in SRE, DevOps, Platform Engineering, MLOps, or Cloud Infrastructure roles
4+ years of production experience with Kubernetes (EKS, GKE, AKS) and containerization tools like Docker
Strong programming skills in Python and proficiency in Bash, Go, or PowerShell
Proficiency with Infrastructure-as-Code tools (Terraform, CloudFormation)
Experience with Kubernetes Operators, Helm, GitOps (ArgoCD, Flux), or Service Mesh (Istio, Linkerd)
Exposure to serverless compute (AWS Lambda, Azure Functions)
Experience building or automating data and model pipelines for AI/ML/LLM workloads (e.g., RAG, fine-tuning, inference)
Strong understanding of observability and monitoring using Prometheus, Grafana, ELK/EFK, Langfuse, or similar platforms
Familiarity with SLI/SLO/SLA practices, incident response, and reliability engineering in production environments
U.S. citizenship at the time of hire
Residence within the contiguous United States
Willingness to undergo a Single Scope Background Investigation, if required
Preferred
Cloud certifications (AWS, Azure, or GCP – e.g., Solutions Architect, DevOps Engineer)
Experience with agentic AI frameworks (CrewAI, LangGraph, AutoGen)
Background in hybrid or on-prem AI deployments, including OpenShift or Rancher
Familiarity with configuration management (Ansible, Chef, Puppet)
Contributions to open-source AI/ML, DevOps, or platform tooling
Experience with multimodal AI or model observability platforms (RAGAS, AgentOps, Langtrace), distributed tracing, and OpenTelemetry
Knowledge of performance tuning, cost efficiency, or capacity planning for AI/LLM infrastructure
Understanding of security controls and FedRAMP compliance for cloud infrastructure and workloads
Benefits
Equity
Additional benefits are also offered
Company
Corelight
Corelight is a cybersecurity company specializing in network traffic analysis (NTA) solutions.
Funding
Current Stage: Late Stage
Total Funding: $309.2M
Key Investors: Accel, Energy Impact Partners, General Catalyst
2024-04-30 · Series E · $150M
2021-09-02 · Series D · $75M
2019-10-17 · Series C · $50M
Company data provided by Crunchbase