
Nightwing · 5 months ago

Site Reliability Engineer

Nightwing provides technically advanced full-spectrum cyber, data operations, systems integration and intelligence mission support services. The Site Reliability Engineer (SRE) will work closely with various teams to automate IT operations and enhance the reliability of systems, ensuring efficient deployment and operations support.

Information Technology & Services
No H1B · Security Clearance Required · U.S. Citizen Only

Responsibilities

Work closely with contract leadership, platform teams, and the sponsor to refine the operational and technical strategy, automate key portions of IT operations, and enable the product team (platform) to bring new software or features to production as quickly as possible
Execute and analyze manual IT operations/admin tasks (log analysis, performance tuning, patch management, testing, and incident response) and convert them to automated tasks
Work with the platform, network and data operations teams to assist in deployment planning and onboard systems
Assist with monitoring, system analysis, and IT operations support
Work with sponsor, mission partners, and technical personnel to deliver robust scalable operations architecture that meets the customer goals for the enterprise
Analyze, define, and document requirements for data, workflow, logical processes, hardware and operating system environment, network connectivity, other system interfaces, internal and external checks and controls, and outputs
Monitor and track metrics, logs and traces across all services in the system/network and provide context for identifying root causes in the event of an incident, performance degradation, or availability issue
Perform network/cloud optimization and resilience planning
Develop capabilities to automate hardware/software provisioning, monitoring, patching, and troubleshooting
Collaborate with and assist the platform team and leadership on network and security health, including intrusions or inappropriate activities
Optimize business processes, workflows, and service operations by building efficient on-call processes and streamlining alerting workflows
Leverage operational data to automate systems administration, operations, and incident response processes, improving enterprise reliability and managing IT environment complexity
Work with LSA, Lab Manager, and CM to compose technical documents including design, deployment, system specifications and host nation baselines, updates, user's manuals, training materials, installation guides, proposals, and reports
Work with the OM to implement ITSM best practices for ICA/service discrepancy and reporting, issue resolution and operations support to include Tier 2/3 escalation

Qualifications

Python, Linux/Unix Administration, Networking, Automation Tools, Monitoring Tools, Cloud Technologies, Containerization, DevOps Principles, Database Management, Data Analysis, SLOs, SLAs, GSDC SRE Certification, AWS Certified SysOps, Google Cloud Certified, Azure Certified Solutions Architect, Problem-Solving, Communication, Collaboration

Required

Programming: Proficiency in at least one programming language (e.g., Python, Go, Java, or JavaScript) is essential for automating tasks and developing tools
Linux/Unix Systems Administration: Strong knowledge of Linux/Unix operating systems, including command-line tools and system administration tasks
Networking: Understanding of network protocols, infrastructure, and troubleshooting techniques
Database Management: Familiarity with database technologies and principles
Automation: Experience with automation tools and techniques, such as configuration management (e.g., Ansible, Puppet, Chef) and orchestration (e.g., Kubernetes)
Monitoring and Logging: Experience with monitoring tools and logging systems
Problem-Solving: Strong analytical and problem-solving skills to diagnose and resolve system issues
Communication: Ability to communicate technical information clearly and concisely to both technical and non-technical audiences
Collaboration: Ability to work effectively with cross-functional teams, including software developers and operations personnel

Preferred

Cloud Technologies: Experience with cloud platforms (e.g., AWS, Google Cloud, Azure)
Containerization: Knowledge of containerization technologies (e.g., Docker, Kubernetes)
DevOps Principles: Understanding of DevOps principles and practices
Service Level Objectives (SLOs) and Service Level Agreements (SLAs): Experience with defining, tracking, and managing SLOs and SLAs
Data Analysis: Experience with data analysis and visualization tools

Company

Nightwing

We are the intelligence services company that continually redefines the edge of the possible to keep advancing our national security interests.

Funding

Current Stage: Late Stage
Company data provided by Crunchbase