Site Reliability Engineer - Spacetime jobs in United States

Aalyria · 2 months ago

Site Reliability Engineer - Spacetime

Aalyria is a leading technology company that supplies laser communications technology and temporospatial software-defined networking platforms to the aerospace industry. The Site Reliability Engineer will build a centralized observability platform and ensure the reliability of systems critical to satellite operations, collaborating with software engineers and infrastructure teams.

Internet · Network Security · Satellite Communication · Telecommunications
No H1B · U.S. Citizen Only

Responsibilities

Help design and build Aalyria's centralized observability platform, integrating and scaling tools for metrics (e.g. Prometheus), logging (e.g. Loki), and distributed tracing (e.g. Tempo/OpenTelemetry)
Define, implement, and manage a robust framework of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for our core products, ensuring we are launch-ready
Partner with SWEs to implement observability best practices, develop standard templates and documentation, and configure tooling (e.g., OpenTelemetry libraries)
Automate the deployment, scaling, and management of the entire observability stack using Infrastructure as Code (e.g. Terraform) and GitOps principles (e.g. ArgoCD)
Partner closely with the core infrastructure team to ensure deep visibility into our Kubernetes clusters and underlying GCP and AWS environments
Develop and lead the company's monitoring, alerting, and incident response strategy, driving a culture of proactive reliability and blameless post-mortems

Qualifications

Observability platforms · Google Cloud Platform · Kubernetes · Infrastructure as Code · Systems programming · Service Level Objectives · GitOps principles · Multi-cloud environment · CI/CD pipelines · Service mesh technologies · Instrumenting applications · JVM observability

Required

4+ years of experience in an SRE or platform engineering role, with a focus on observability for large-scale, distributed compute or network systems
Deep, hands-on expertise building, scaling, and managing observability platforms (e.g., Prometheus, Grafana, Loki/ELK, OpenTelemetry, Tempo/Jaeger, Honeycomb, etc.). You have proven experience using these tools to support performance analysis and debugging of complex distributed systems
Strong production-level experience with Google Cloud Platform (GCP) and Kubernetes
Experience using Infrastructure as Code (IaC) and GitOps principles (e.g., ArgoCD)
Proficiency in a systems programming language, with a strong preference for Go and Python for debugging and writing tooling
Demonstrable experience defining, implementing, and managing SLOs, SLIs, and error budgets for production services in high-availability distributed systems

Preferred

Experience operating a multi-cloud environment, specifically GCP and AWS
Hands-on experience with GitLab CI for CI/CD pipelines
Working knowledge of service mesh technologies such as Istio or Linkerd
Familiarity with instrumenting applications written in Go and C++
An active Secret clearance, or higher, is preferred for this position
Experience with JVM observability (tuning, monitoring) for Java-based applications

Benefits

401(k)
Dental
Vision
Health
Life insurance
Paid time off
Equity options

Company

Aalyria

Aalyria is a telecom startup that specializes in keeping data connectivity reliable through weather and atmospheric conditions. It is a sub-organization of Google.

Funding

Current Stage: Growth Stage
Total Funding: unknown
2022-09-01 · Series A

Leadership Team

Maria Hedden
Chief Operating Officer
Company data provided by Crunchbase