Network Engineer, Operations & Repair jobs in United States

Fluidstack · 4 hours ago

Network Engineer, Operations & Repair

Fluidstack is building the infrastructure for abundant intelligence, partnering with top AI labs and enterprises to unlock compute at the speed of light. The Network Engineer, Operations & Repair will focus on improving the quality and reliability of AI networks through process development, data collection, and reliability metrics.

Cloud Computing, Cloud Storage, Generative AI, GPU, Information Technology, Machine Learning, Private Cloud, Software

Responsibilities

Ownership of Quality Assurance: Design, develop, and support the QA process for network hardware and networks
Pipelines: Develop and deploy serverless, server-based, and manually triggered data pipelines that produce network quality and reliability observability for internal and external customers
Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs
Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that produce and consume data with machine learning to fulfill our mission
Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers
Subject Matter Expert: Deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power

Qualifications

Network Engineering, Datacenter Fabric Expertise, Incident Response, Reliability Engineering, Observability & Monitoring, Software Development, Operational Pragmatism, Self-Driven, Hardware Repair Experience, AI/HPC Fabric Operations, Cross-Team Collaboration

Required

5+ years in network engineering
3+ years in operations with significant hands-on operational experience
Experience running production networks or compute
Ability to respond to incidents at all hours
Experience debugging complex failures under pressure
Understanding of the difference between 'working' and 'production-ready'
Deep experience operating modern datacenter networks including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching
Comfortable troubleshooting Layer 2/3 issues, BGP routing problems, fabric misconfigurations, and physical media failures
Proven ability to lead incident response
Ability to perform systematic troubleshooting and drive issues to resolution
Ability to remain calm during outages and communicate clearly with stakeholders
Understanding of when to escalate versus when to dig deeper
Experience building relationships with onsite teams and coordinating physical infrastructure work
Ability to represent network engineering in a field environment
Ability to get things done in operational settings with many internal and external teams and stakeholders
Ability to balance perfection with progress
Ability to troubleshoot with imperfect information and make pragmatic decisions under time pressure
Ability to prioritize based on business impact
Ability to document as you go and continuously improve operational processes
Self-driven with the ability to embrace complex challenges with undefined process and key results
Ability to dive in to learn and build Objectives and Key Results
Ability to build a software development project and pipeline in Jira solo
Ability to switch hats and begin coding

Preferred

Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking
Experience with observability and reliability engineering in network operations or manufacturing quality
Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work
Understanding of datacenter logistics, vendor escalation processes, and how to work effectively with onsite technicians
Familiarity with network monitoring platforms, alerting systems, and telemetry collection
Experience using monitoring tools to diagnose issues proactively and tune alerting to reduce noise
Experience with SQL, MySQL, and building operations dashboards
Experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects
Experience building hyperscale platforms in Go with supporting tools in Python or Rust

Benefits

Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.

Company

Fluidstack

Fluidstack is an AI cloud platform for frontier labs and startups.

Funding

Current Stage: Growth Stage
Total Funding: unknown
Key Investors: Seedcamp
2025-06-01: Undisclosed
2024-10-01: Private Equity
2018-02-01: Pre Seed

Leadership Team

Gary Wu
CEO, Co-Founder
Company data provided by Crunchbase