Brilliant® · 6 hours ago
Site Reliability Engineer - Incident Response
Brilliant® is seeking a Senior Site Reliability Engineer focused on improving incident management practices. The role involves partnering closely with engineering teams during incidents, delivering executive-level summaries, and driving systemic reliability improvements.
Staffing & Recruiting
Responsibilities
Draft and deliver executive-level post-incident summaries
Develop and coach teams on blameless postmortem practices
Create templates and guide structured root cause analysis methods such as 5 Whys or fishbone diagrams
Maintain a centralized library of learnings and cross-cutting incident themes
Support engineering teams during incidents by assisting with rapid diagnosis and resolution
Analyze data from observability platforms to form informed conclusions about root causes
Evaluate incident response effectiveness to identify systemic reliability gaps
Standardize incident response workflows, including roles, communication, and escalation paths
Create or refine runbooks, incident command frameworks, and severity classification guidelines
Build dashboards tracking incident frequency, MTTR, MTTA, and recurrence rates
Use incident data to inform reliability-focused OKRs and engineering investment decisions
Identify repetitive or high-impact tasks suitable for automation
Develop and enhance scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage
Evaluate and integrate emerging AI and ML technologies to improve detection, root cause analysis, and reporting
Ensure tools and automations meet security, maintainability, and best practice standards
Document and share new tools and solutions to support adoption across teams
Work with engineering managers and incident leaders to gather and validate incident data
Partner with product, infrastructure, and leadership teams to promote reliability best practices
Act as a reliability consultant to teams experiencing significant or recurring incidents
Recommend improvements to monitoring, alerting, and response processes to reduce future incident impact
Qualifications
Required
Bachelor's degree in a related discipline with 4 years of relevant experience, or an equivalent combination of education and experience
Must be authorized to work in the United States without current or future sponsorship
Strong ability to design, build, and maintain engineering solutions and tooling that improve reliability, automate incident response, and reduce operational toil
Skilled in interpreting logs, metrics, and traces to identify root causes during live incidents
Experience with observability platforms such as Datadog, Splunk, New Relic, or similar tools
Strong programming background in Python, Java, or C#, with experience supporting production-grade services and automation
Proven ability to design reliable, scalable, and highly available systems using sound software engineering practices
Experience developing automation to improve incident response, monitoring, deployment, and recovery processes
Ability to collaborate closely with software engineering teams to influence architecture and operational readiness
Experience leveraging AI and machine learning tools to enhance incident response, automation, and daily engineering workflows
Strong analytical skills with attention to detail in validating incident data and identifying trends
Solid understanding of DevOps concepts, including CI/CD pipelines, cloud-native infrastructure, caching, and scaling
Experience calculating and interpreting metrics such as MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve)
Benefits
Healthcare
PTO
401k
Company
Brilliant®
About Brilliant®
Brilliant is an award-winning staffing and recruiting firm that provides direct-hire and contract staffing services in accounting, finance, technology, and business operations, serving businesses across the continental U.S.