Nightwing · 5 months ago
Site Reliability Engineer
Nightwing provides technically advanced full-spectrum cyber, data operations, systems integration and intelligence mission support services. The Site Reliability Engineer (SRE) will work closely with various teams to automate IT operations and enhance the reliability of systems, ensuring efficient deployment and operations support.
Information Technology & Services
Responsibilities
Work closely with contract leadership, platform teams, and the sponsor to refine the operational and technical strategy to automate key portions of IT operations and enable the product team (platform) to bring new software or new features to production as quickly as possible
Execute and analyze manual IT operations/admin tasks (log analysis, performance tuning, patch management, testing, and incident response) and convert them to automated tasks
Work with the platform, network and data operations teams to assist in deployment planning and onboard systems
Assist with monitoring, system analysis, and IT operations support
Work with the sponsor, mission partners, and technical personnel to deliver a robust, scalable operations architecture that meets the customer's goals for the enterprise
Analyze, define, and document requirements for data, workflow, logical processes, hardware and operating system environment, network connectivity, other system interfaces, internal and external checks and controls, and outputs
Monitor and track metrics, logs and traces across all services in the system/network and provide context for identifying root causes in the event of an incident, performance degradation, or availability issue
Perform network/cloud optimization and resilience planning
Develop capabilities to automate hardware/software provisioning, monitoring, patching, and troubleshooting
Collaborate with and assist the platform team and leadership on network and security health, including intrusions or inappropriate activities
Optimize business processes, workflows, and service operations by building efficient on-call processes and streamlining alerting workflows
Leverage operational data to automate systems administration, operations, and incident response processes, improving enterprise reliability and managing IT environment complexity
Work with LSA, Lab Manager, and CM to compose technical documents including design, deployment, system specifications and host nation baselines, updates, user's manuals, training materials, installation guides, proposals, and reports
Work with the OM to implement ITSM best practices for ICA/service discrepancy and reporting, issue resolution and operations support to include Tier 2/3 escalation
Qualifications
Required
Programming: Proficiency in at least one programming language (e.g., Python, Go, Java, or JavaScript) is essential for automating tasks and developing tools
Linux/Unix Systems Administration: Strong knowledge of Linux/Unix operating systems, including command-line tools and system administration tasks
Networking: Understanding of network protocols, infrastructure, and troubleshooting techniques
Database Management: Familiarity with database technologies and principles
Automation: Experience with automation tools and techniques, such as configuration management (e.g., Ansible, Puppet, Chef) and orchestration (e.g., Kubernetes)
Monitoring and Logging: Experience with monitoring tools and logging systems
Problem-Solving: Strong analytical and problem-solving skills to diagnose and resolve system issues
Communication: Ability to communicate technical information clearly and concisely to both technical and non-technical audiences
Collaboration: Ability to work effectively with cross-functional teams, including software developers and operations personnel
Preferred
Cloud Technologies: Experience with cloud platforms (e.g., AWS, Google Cloud, Azure)
Containerization: Knowledge of containerization technologies (e.g., Docker, Kubernetes)
DevOps Principles: Understanding of DevOps principles and practices
Service Level Objectives (SLOs) and Service Level Agreements (SLAs): Experience with defining, tracking, and managing SLOs and SLAs
Data Analysis: Experience with data analysis and visualization tools
Company
Nightwing
We are the intelligence services company that continually redefines the edge of the possible to keep advancing our national security interests.
Funding
Current Stage
Late Stage
Company data provided by Crunchbase