bedrock · 11 hours ago
Cloud - Staff Site Reliability Engineer
Bedrock Robotics is seeking an experienced Staff Site Reliability Engineer to own and evolve its cloud infrastructure, focusing on scalable design and system reliability. The role involves designing, building, and operating reliable systems while applying best practices in production engineering and observability.
Construction · Real Estate · Software
Responsibilities
Design, build, and operate highly scalable, reliable systems used by all Bedrock engineering teams
Take full ownership of Bedrock’s cloud infrastructure (AWS, GCP, Azure), ensuring best-in-class security, performance, and cost efficiency
Design, implement, and maintain Bedrock’s end-to-end observability stack (including monitoring, logging, and tracing)
Develop and implement best practices for system reliability, security, on-call rotation, and effective incident response
Continuously identify and implement improvements to enhance system performance and optimize cloud resource consumption
Qualifications
Required
A deep passion for building and maintaining reliable, fault-tolerant distributed systems
Strong proficiency in major cloud platforms (such as AWS, GCP, or Azure) and Infrastructure as Code (IaC) tools like Terraform
Proven experience with container technologies and orchestration platforms, particularly Kubernetes
Hands-on experience with observability tools and techniques (e.g., Datadog, Prometheus, Splunk)
Strong understanding of distributed systems, networking concepts, database technologies, and compute infrastructure
Strong understanding of, and experience implementing, security best practices in cloud environments
Ability to work in a fast-paced, high-growth environment, deal effectively with ambiguity, and take decisive ownership of challenging problems