OpsMill · 12 hours ago

Product Reliability Engineer

United States

Full-time

Remote

Mid, Senior Level

4+ years of experience

OpsMill is building Infrahub, a schema-driven infrastructure source of truth that helps teams unify data and scale automation reliably. The Product Reliability Engineer will be responsible for diagnosing issues in customer environments and building tools and processes to ensure reliability in on-prem deployments.

Software Development

Responsibilities

Partner directly with customers and with our Solution Architecture/Customer Success teams on L2/L3 escalations—communicating findings, driving root-cause analysis, and resolving complex packaging, deployment, upgrade, and runtime issues across heterogeneous Kubernetes environments

Drive issues to resolution by reproducing problems locally, isolating root causes, and coordinating fixes with engineering—then documenting learnings in crisp RCAs that become actionable improvements

Build and maintain diagnostics tooling including support bundles, health checks, environment validators, and "what changed?" helpers that make future troubleshooting 10x faster

Own the test automation infrastructure roadmap, improving CI stability, reducing flaky tests, and creating reproducible integration/e2e environments that catch issues before customers do

Establish and maintain performance baselines and regression tests that serve as actionable gates, helping teams catch scale and latency issues early

Improve installation and upgrade robustness by identifying recurring failure modes and eliminating them through product changes, automation, and guardrails

Write production-quality code in Python, Go, or Rust for internal tooling and product improvements that directly enhance reliability

Close the reliability feedback loop by systematically turning field issues into better tests, observability, documentation, and product defaults—measuring success through reduced time-to-resolution and fewer repeat incidents

Qualifications

Kubernetes, Python, Go, Rust, Production engineering, Debugging, Testing, Observability, Problem decomposition, Self-directed work, Communication skills, Collaborative mindset

Required

4-7 years of experience in production engineering, SRE, platform engineering, or similar roles where you've owned reliability and customer escalations

Strong software engineering fundamentals including design, debugging, testing, code review, and a focus on maintainable, production-quality code

Practical Kubernetes expertise sufficient to debug real deployments: troubleshooting resources, networking, storage, RBAC, and platform-specific quirks across different distributions

Deep troubleshooting instincts and observability experience using logs, metrics, and traces to diagnose issues quickly in complex, distributed systems

Experience with at least one of: Python, Go, or Rust for building tooling and contributing to product code (you don't need to be expert in all three)

Excellent problem decomposition and communication skills—you can break down messy, ambiguous issues and clearly explain your findings and recommendations

Self-directed remote work capability with strong async communication skills and the ability to operate independently in a fast-moving environment where priorities shift based on customer needs

Collaborative mindset with experience partnering across product, engineering, and customer-facing teams to drive systematic improvements

Preferred

Experience with packaging and distribution systems (containers, Helm charts, installers) and managing upgrade/migration flows

Background running CI/CD at scale including test parallelization, hermetic environments, and artifact management

Familiarity with performance tooling such as profiling, load generation, and benchmark harnesses

Previous experience in customer-facing technical roles like escalation engineering, support engineering, or solutions engineering

Contributions to open source projects, especially in infrastructure, observability, or reliability tooling

Company

OpsMill

Simplify Infrastructure Automation

Founded in 2023

Paris, Ile-de-France, FRA

2-10 employees

https://www.opsmill.com/

Funding

Current Stage

Early Stage

Total Funding

unknown

Key Investors

Serena

2024-05-28 Seed

Recent News

thefastmode.com

OpsMill Launches Infrahub Enterprise to Revolutionize Infrastructure Automation

2024-11-20

Company data provided by Crunchbase