
Ardent · 2 months ago

Reliability Engineer

Ardent is committed to solving customers’ most difficult problems while ensuring the well-being and professional development of its employees. The Reliability Engineer will enhance Production Monitoring and ensure optimal service delivery for applications by proactively identifying issues, resolving incidents, and optimizing system health in a 24/7 operational environment.

Responsibilities

Proactive and early notification of potential and actual issues impacting service delivery
Frequent and succinct communication to PSPD leadership during and after incidents
Identification of trends and corrective measures
Provide needed metrics to PSPD leadership team
The enhanced Production Monitoring Services Branch will provide resources to staff the operation 24x7x365. These resources should provide additional technical support and diagnosis.
Build monitoring and production support solutions that give customers visibility into our services
Manage ITIL engineers
Triage and resolve production incidents related to the cloud platform and participate in root cause analysis and postmortem discussions
Function as a solution manager in support of the Manager, Production Support by leading the implementation of short-term and long-term solutions, automating manual processes, and building alerts to monitor the operation of services
Assess initial severity, gather impacts, create tickets, engage support teams, and escalate issues properly as they arrive
Participate in the creation and maintenance of technical and knowledge base documentation
Troubleshoot production issues and collaborate in developing simple technical solutions
Use diagnostic tools to maintain, troubleshoot and restore standard service or data to systems
Lead implementation of production support activities in an Amazon Web Services environment
Lead technical and design discussions with IT to help enterprises speed their adoption of new technologies and practices
Perform system health monitoring and optimize performance
Define and establish processes and tooling for monitoring and for routine system health checks to keep the application optimized and stable
Work as a technical leader alongside business, development, and infrastructure teams
Work effectively with IT and business teams, as well as external customers, to lead the resolution of production incidents and provide communication during outages
Collaborate with other members of IT and business in streamlining production support processes
Work closely with other teams and recommend solutions to improve current production support processes so that they reflect the business needs, security, and SLAs of our production services
Work closely with the Infrastructure team and other support staff to identify and resolve incidents and to create and implement long-term remediation techniques and fixes
Provide support and coach other members of the Production Support team
Communicate clearly and effectively across IT, business process owners, and customers at all levels of the organization
Communicate progress and any challenges to management
Communicate overall status and health of the application to business and application support teams

Qualifications

AWS, Production Monitoring, Incident Management, ITIL Processes, Root Cause Analysis, Technical Documentation, Leadership Skills, Effective Communication

Required

Experience in Production Monitoring & Support within a 24x7x365 operational environment
Strong expertise in incident management, root cause analysis, and problem resolution for cloud-based applications
Hands-on experience with Amazon Web Services (AWS) and cloud-based monitoring tools
Proficiency in ITIL processes and managing ITIL engineers for efficient service delivery
Ability to build and implement monitoring solutions, automate manual processes, and create alerts to ensure system stability
Experience with system health monitoring, performance optimization, and troubleshooting production issues
Strong leadership skills to collaborate with IT, business, and infrastructure teams to improve production support processes
Effective communication skills to provide incident reports and status updates to leadership and stakeholders
Ability to develop and maintain technical documentation and knowledge base resources for production support
Experience in triaging and resolving production incidents, assessing severity, and properly escalating issues
Active CBP/BI or Top Secret clearance is highly desired
Must be open to working 2nd or 3rd shift in a 24/7/365 environment
All candidates under consideration for this role must be U.S. citizens willing to undergo the government-issued background investigation process

Benefits

Highly competitive benefits
Professional development opportunities
Flexibility
Innovation
Collaboration
Career growth

Company

Ardent

For nearly 20 years, Ardent has served this country by delivering award-winning security and defense technology solutions.

Funding

Current Stage
Growth Stage

Leadership Team

Jesus Jackson
Chief Technology Officer
Mireille Estephan, MBA
Chief Technology Officer
Company data provided by Crunchbase