GCP Site Reliability Engineer jobs in United States

TechLine Consulting · 2 hours ago

GCP Site Reliability Engineer

TechLine Consulting is seeking a Senior Site Reliability Engineer with deep experience operating production systems on Google Cloud Platform (GCP). The role involves owning CI/CD pipelines, operating and scaling AI infrastructure, and building internal platforms to improve developer velocity and production safety.

Hiring Manager
Meagan Palermo

Responsibilities

Own the reliability, uptime, and scalability of production services running on GCP by designing and implementing self-healing, fault-tolerant architectures using managed cloud services where appropriate
Architect and operate GCP-centric CI/CD pipelines, ensuring safe, repeatable, and reversible deployments across environments using GitLab CI and infrastructure-as-code
Build and maintain GCP infrastructure using Terraform and CloudFormation, with a focus on consistency, security, cost efficiency, and long-term maintainability
Partner closely with application engineers to improve the developer experience on GCP, making the preferred deployment paths, tooling, and patterns the easiest and most reliable options
Design and manage specialized GCP pipelines supporting AI workloads and human-in-the-loop systems, including compute orchestration, data pipelines, database operations, performance tuning, and compliance considerations
Lead production incident response for GCP-hosted systems, conducting post-mortems that emphasize root cause analysis and code-driven prevention over manual process changes
Implement comprehensive observability across GCP services by building metrics, logging, dashboards, and alerting using tools such as Prometheus, Grafana, and Datadog
Influence GCP architecture decisions, balancing reliability, performance, cost, and developer productivity as the platform scales

Qualifications

Google Cloud Platform (GCP), CI/CD, Infrastructure as Code, Python, Terraform, Observability, Java, GitLab CI, Prometheus, Grafana, Datadog

Required

Deep experience operating production systems on Google Cloud Platform (GCP)
Designed and operated high-availability, cloud-native systems on GCP
Increased deployment frequency through automation
Proactively reduced operational risk through strong observability and infrastructure-as-code practices
Own the reliability, uptime, and scalability of production services running on GCP
Design and implement self-healing, fault-tolerant architectures using managed cloud services
Architect and operate GCP-centric CI/CD pipelines
Ensure safe, repeatable, and reversible deployments across environments using GitLab CI and infrastructure-as-code
Build and maintain GCP infrastructure using Terraform and CloudFormation
Focus on consistency, security, cost efficiency, and long-term maintainability
Partner closely with application engineers to improve the developer experience on GCP
Design and manage specialized GCP pipelines supporting AI workloads and human-in-the-loop systems
Lead production incident response for GCP-hosted systems
Conduct post-mortems that emphasize root cause analysis and code-driven prevention
Implement comprehensive observability across GCP services
Build metrics, logging, dashboards, and alerting using tools such as Prometheus, Grafana, and Datadog
Influence GCP architecture decisions, balancing reliability, performance, cost, and developer productivity

Preferred

Experience with Python and Java
Familiarity with Prometheus, Grafana, and Datadog

Company

TechLine Consulting

At TechLine Consulting, people are our purpose and precision is our practice for Technology and Engineering.

Funding

Current Stage
Early Stage
Company data provided by Crunchbase