
Oracle · 1 month ago

Senior Principal Site Reliability Engineer | Oracle Health Federal Operations Team

United States

Full-time

Remote

Senior Level, Lead/Staff

Oracle is a technology leader that’s changing how the world does business, and it is seeking a Senior Principal Site Reliability Engineer to join its Oracle Health Federal Operations Team. The role involves defining and deploying key services, with a focus on architecture, production operations, and performance management, while ensuring reliability and performance across multiple cross-functional teams.

Data Governance, Data Management, Enterprise Software, Information Technology, SaaS, Software

No H1B

Security Clearance Required

Responsibilities

Own the full service lifecycle: design, implementation, deployment, on-call, and continuous improvement—maintaining high code and reliability standards

Define and meet service-level objectives (availability, latency, durability) while reducing toil through automation, observability, and self-healing mechanisms

Lead architecture, analysis, design, implementation, and production operations for Core System Framework solutions, with strong documentation and runbooks

Create and maintain clear, version-controlled documentation—architectural diagrams, SOPs, runbooks, and incident playbooks—to ensure repeatable operations, auditability, and fast onboarding

Design, write, and deploy software that improves the availability, scalability, and efficiency of platform services

Develop designs, architectures, standards, and methods for large-scale distributed systems

Build automation to prevent problem recurrence; drive real-time monitoring, alerting, and self-healing into production systems

Conduct capacity planning and demand forecasting; perform software performance analysis, system tuning, and optimization

Contribute to and support platform services across architecture, provisioning, configuration, deployment, and ongoing operations

Partner with distributed teams to prototype and launch new platform services

Stay current on emerging technologies and introduce innovations that improve reliability, security, and developer productivity

Mentor and guide engineers in distributed systems design, high-scale data processing, and operational excellence

Set and raise engineering standards across multiple teams; model best practices in reliability, security, and automation

Collaborate closely with storage, networking, observability, and security teams to deliver platform features and secure-by-default designs

Participate in an on-call rotation; lead incident response, postmortems, and follow-through on corrective actions to drive continuous improvement

Qualifications

Site Reliability Engineering, DevOps, Distributed Systems, Automation, Performance Management, Incident Response, Mentoring, Collaboration, Documentation

Required

Experience in defining and deploying key services with deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering

Ability to own the full service lifecycle: design, implementation, deployment, on-call, and continuous improvement

Experience in defining and meeting service-level objectives (availability, latency, durability) while reducing toil through automation, observability, and self-healing mechanisms

Strong documentation skills, including creating and maintaining clear, version-controlled documentation: architectural diagrams, SOPs, runbooks, and incident playbooks

Experience in designing, writing, and deploying software that improves the availability, scalability, and efficiency of platform services

Ability to develop designs, architectures, standards, and methods for large-scale distributed systems

Experience in building automation to prevent problem recurrence; driving real-time monitoring, alerting, and self-healing into production systems

Experience in conducting capacity planning and demand forecasting; performing software performance analysis, system tuning, and optimization

Experience in contributing to and supporting platform services across architecture, provisioning, configuration, deployment, and ongoing operations

Ability to partner with distributed teams to prototype and launch new platform services

Ability to stay current on emerging technologies and introduce innovations that improve reliability, security, and developer productivity

Experience in mentoring and guiding engineers in distributed systems design, high-scale data processing, and operational excellence

Experience in setting and raising engineering standards across multiple teams; modeling best practices in reliability, security, and automation

Ability to collaborate closely with storage, networking, observability, and security teams to deliver platform features and secure-by-default designs

Willingness to participate in an on-call rotation; experience in leading incident response, postmortems, and follow-through on corrective actions to drive continuous improvement