Walmart Canada · 3 months ago
Software Engineer II - Site Reliability Operations Engineer
Walmart Inc. is seeking a Site Reliability Operations Engineer within their Global Technology Platforms Command and Control Center Team. The role involves maintaining mission-critical infrastructure and ensuring high availability and reliability of Walmart’s technology stack, while collaborating with cross-functional engineering teams.
Delivery, Retail, Shopping
Responsibilities
Acquire in-depth technical knowledge of omnichannel cloud platforms, web traffic flows, micro-services, and service dependencies for major incident resolution
Provide support for Unix and Linux systems from Kernel to Shell and beyond, taking into consideration system libraries, file systems, and client-server protocols
Leverage knowledge of network technologies and concepts, including protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, CDNs, OSI layers, firewalls, gateways, proxies, and load balancers
Provide L1 and L2 production support for multiple cloud technologies such as OpenStack, Cloud Native Platform, Microsoft Azure, and Google Cloud Platform, triaging critical issues using various internal and vendor-related tools
Analyze monitoring graphs and alerts to identify the systems causing production impact, using tools such as Grafana, Prometheus, MMS, Kibana, Graphite, ServiceNow, Jira, Dynatrace, New Relic, Omniture, Splunk, and CDN logs [Reduce MTTD – Mean Time to Detect]
Triage site-impacting production issues by quantifying impact, severity, and urgency, analyzing systems for quick remediation, engaging the right teams for recovery [Reduce MTTE – Mean Time to Engage], and focusing on immediate restoration [Reduce MTTR – Mean Time to Restore] of large-scale enterprise systems
Develop enterprise monitoring and use tooling solutions such as Grafana, Kibana, Splunk, Graphite, and New Relic to improve visibility, proactively detect issues, and restore system availability
Design and implement JavaScript integrations between alerting tools and service API endpoints, working with tools such as ServiceNow, Spotlight, and xMatters
Design and develop solutions for widespread internal communications covering cloud application support and infrastructure availability workflows, integrating with various internal applications and using languages and technologies such as Java, JavaScript (React, Node.js), Python, shell scripting, Prometheus, and database query languages
Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments
Qualifications
Required
2+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems
Bachelor's Degree in Computer Science or a related field, or relevant work experience
Strong and demonstrable incident management skills with relevant experience in an enterprise organization
Experience and exposure working in a 24/7 operations support environment
Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative and drive
Experience investigating, analyzing and troubleshooting large scale enterprise systems
Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
Experience administering Unix/Linux in a production environment
Experience working with and developing enterprise monitoring, tooling, and logging solutions such as Grafana, Kibana, Splunk, OpenObserve, Graphite, Nagios, New Relic, Dynatrace, and Prometheus
Working knowledge of one or more cloud technologies such as Azure, GCP, or OpenStack
Experience with a distributed version control system such as Git
Experience designing and implementing JavaScript integrations between alerting tools and service API endpoints, working with tools such as ServiceNow, Spotlight, Splunk, and xMatters
Programming experience in one or more languages such as Go, Java, Python, or shell scripting
Experience in data science/machine learning would be advantageous
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area
Option 2: 3 years' experience in software engineering or related area
Preferred
We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly
The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart's accessibility standards and guidelines for supporting an inclusive culture
Benefits
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.
Company
Walmart Canada
Walmart Canada is a subsidiary of Walmart that operates a chain of more than 400 stores nationwide.
Funding
Current Stage
Late Stage
Recent News
Canada NewsWire, 2025-12-18
Canada NewsWire, 2025-12-03
Company data provided by crunchbase