
Aalyria

Staff Site Reliability Engineer - Spacetime

United States

Full-time

Remote

Senior Level, Lead/Staff

$160K/yr - $200K/yr

7+ years of experience

Aalyria is a leading technology company specializing in laser communications and software-defined networking for the aerospace industry. The Staff Site Reliability Engineer will be responsible for building and managing the observability platform that ensures the reliability of systems critical to satellite operations and deep space missions.

Internet, Network Security, Satellite Communication, Telecommunications

No H1B

U.S. Citizen Only

Responsibilities

Design, build, and own the technical roadmap for Aalyria's centralized observability platform, integrating and scaling tools for metrics (Prometheus), logging (Loki), and distributed tracing (Tempo/OpenTelemetry)

Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready

Establish and evangelize observability best practices, providing standards, documentation, and tooling (e.g., OpenTelemetry libraries) to empower our Go and Java application teams to instrument their services effectively

Partner with core software engineers to provide the tools and insights needed to debug performance, optimize computational pipelines (including CPU/GPU workloads), and ensure the reliability of large-scale distributed systems

Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (Terraform) and GitOps principles (ArgoCD)

Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments

Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems

Qualifications

Observability platforms, Google Cloud Platform, Infrastructure as Code, Kubernetes, Service Level Objectives, GitOps principles, Systems programming, Soft skills

Required

7+ years of experience in an SRE or platform engineering role, with a focus on observability for large-scale, distributed compute or network systems

Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb), with proven experience using these tools to support performance analysis and debugging of complex distributed systems

Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes

Proven mastery of Infrastructure as Code (IaC) with Terraform and GitOps principles (e.g., ArgoCD)

Proficiency in a systems programming language, with a strong preference for Go and Python for debugging and writing tooling

Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services

Preferred

Experience operating a multi-cloud environment, specifically GCP and AWS

Hands-on experience with GitLab CI for CI/CD pipelines

Working knowledge of service mesh technologies such as Istio or Linkerd

Experience with high-performance computing (HPC) environments and instrumenting numerical optimization workloads

Familiarity with instrumenting applications written in Go and C++

Experience with JVM observability (tuning, monitoring) for Java-based applications

Benefits

401(k)

Dental

Vision

Health

Life insurance

Paid time off

Equity options

Company

Aalyria

Aalyria is a telecom startup that specializes in keeping data connectivity reliable through weather and atmospheric conditions. It was spun out of Alphabet, Google's parent company.

Founded in 2021

Livermore, California, USA

51-200 employees