
Kellton · 1 hour ago

Site Reliability Engineering Manager

Kellton is seeking a Site Reliability Engineering Architect who will be responsible for designing and evolving automation-first, AI-augmented reliability platforms for large-scale cloud environments. The role focuses on establishing architectural standards, reducing operational risk, and enhancing system reliability through automation and intelligent decision-making.

Artificial Intelligence (AI) · Cloud Data Services · Enterprise Software · Information Services · Mobile Apps
H1B Sponsor Likely
Hiring Manager
Shams Naveed

Responsibilities

Design reliability architectures that prioritize automation and intelligent decision-making over manual processes
Define patterns for fault isolation, graceful degradation, and recovery that assume automated and AI-assisted execution
Ensure reliability, security, and governance requirements are embedded directly into operational systems and workflows
Establish architectural standards that reduce complexity, human dependency, and operational risk
Architect event-driven automation platforms that span detection, decisioning, and execution
Design and implement workflow orchestration systems capable of handling both low-risk autonomous actions and higher-risk human-approved operations
Replace ticket-driven and static runbook processes with executable, testable automation
Standardize automation patterns across incident response, change execution, and platform operations
Ensure automation systems are resilient, observable, and auditable
Design and own internal AI-driven operational platforms that act as a centralized interface for reliability and automation workflows
Build systems that allow intelligent components to retrieve operational context, reason over signals, and invoke controlled actions across infrastructure and services
Establish architectures for agent coordination, capability discovery, and safe execution in production environments
Define guardrails, approval paths, observability, and auditability for AI-initiated actions
Integrate AI-driven decisioning directly into operational workflows rather than treating it as an external enhancement
Architect observability systems that feed automation and intelligent decision-making rather than static dashboards
Design signal pipelines that correlate metrics, logs, traces, and events into actionable context
Reduce alert fatigue through context-aware, noise-resistant detection and prioritization
Ensure every operational signal has a defined automated or AI-assisted response path
Drive continuous improvement through trend analysis and systemic remediation
Define governance-backed use of enterprise low-code automation platforms to accelerate operational workflows
Enable secure, scalable automation for approvals, communications, enrichment, and orchestration while preventing platform sprawl
Establish clear boundaries between low-code automation and code-first systems
Integrate enterprise automation tools with cloud-native automation and AI-driven operational platforms
Serve as the architectural authority for reliability, automation, and AI-driven operations
Mentor senior engineers and raise organizational maturity in automation and intelligent systems
Partner with engineering, security, and compliance teams to deliver safe, scalable operational platforms
Own reference architectures, operational standards, and long-term technical direction
Challenge designs that increase operational risk, toil, or manual dependency

Qualifications

Site Reliability Engineering · Automation Platforms · AI-Driven Operations · Workflow Orchestration · Programming Skills · Agent-Based Systems · Cloud Certifications · Risk Management · Leadership

Required

5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering supporting complex distributed systems
Proven experience designing and operating automation-heavy or autonomous operational platforms
Strong programming and automation skills using modern languages and frameworks
Hands-on experience with workflow orchestration and event-driven systems
Practical experience integrating AI or intelligent decision systems into production operations
Deep understanding of failure modes, blast radius management, and risk-aware automation

Preferred

Experience designing or implementing agent-based or AI-assisted operational systems
Familiarity with modern AI platforms and model integration for operational use cases
Experience with control-plane architectures for automation and intelligent systems
Enterprise automation and governance experience
Knowledge of cost-aware reliability design, FinOps principles, and zero-trust security models
Relevant cloud or platform certifications

Company

Kellton is a technology company that offers digital transformation solutions and services across strategy, consulting, digital, and technology.

H1B Sponsorship

Kellton has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (25)
2024 (31)
2023 (20)
2022 (14)
2021 (2)
2020 (7)

Funding

Current Stage: Public Company
Total Funding: unknown
Key Investors: AloStar
2018-11-08: Post-IPO Debt
2016-03-03: IPO

Leadership Team

Niranjan Chintam
President
Company data provided by Crunchbase