Fluidstack · 4 hours ago

Network Engineer, Operations & Repair

San Francisco, CA

Full-time

Onsite

Senior Level

$150K/yr - $250K/yr

5+ years exp

Fluidstack is building the infrastructure for abundant intelligence, partnering with top AI labs and enterprises to unlock compute at the speed of light. The Network Engineer, Operations & Repair will focus on improving the quality and reliability of AI networks through process development, data collection, and reliability metrics.

Cloud Computing · Cloud Storage · Generative AI · GPU · Information Technology · Machine Learning · Private Cloud · Software

Responsibilities

Ownership of Quality Assurance: Design, develop, and support the QA process for network hardware and networks

Pipelines: Develop and deploy serverless, server-based, and manually triggered data pipelines that produce network quality and reliability observability for internal and external customers

Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs

Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that both produce data and consume it with machine learning to fulfill our mission

Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers

Subject Matter Expert: Deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power

Qualifications

Network Engineering · Datacenter Fabric Expertise · Incident Response · Reliability Engineering · Observability & Monitoring · Software Development · Operational Pragmatism · Self Driven · Hardware Repair Experience · AI/HPC Fabric Operations · Cross-Team Collaboration

Required

5+ years in network engineering

3+ years in operations with significant hands-on operational experience

Experience running production networks or compute

Ability to respond to incidents at all hours

Experience debugging complex failures under pressure

Understanding of the difference between 'working' and 'production-ready'

Deep experience operating modern datacenter networks including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching

Comfortable troubleshooting Layer 2/3 issues, BGP routing problems, fabric misconfigurations, and physical media failures

Proven ability to lead incident response

Ability to perform systematic troubleshooting and drive issues to resolution

Ability to remain calm during outages and communicate clearly with stakeholders

Understanding of when to escalate versus when to dig deeper

Experience building relationships with onsite teams and coordinating physical infrastructure work

Ability to represent network engineering in a field environment

Ability to get things done in operational settings with many internal and external teams and stakeholders

Ability to balance perfection with progress

Ability to troubleshoot with imperfect information and make pragmatic decisions under time pressure

Ability to prioritize based on business impact

Ability to document as you go and continuously improve operational processes

Self-driven with the ability to embrace complex challenges with undefined processes and key results

Ability to dive in, learn, and build Objectives and Key Results

Ability to build a software development project and pipeline in Jira solo

Ability to switch hats and begin coding

Preferred

Experience operating AI/ML or HPC fabrics with RDMA (RoCEv2), lossless Ethernet (PFC, ECN), or high-performance networking

Experience with observability and reliability engineering gained in network operations or manufacturing quality

Hands-on experience coordinating hardware repairs, RMAs, and physical infrastructure work

Understanding of datacenter logistics, vendor escalation processes, and how to work effectively with onsite technicians

Familiarity with network monitoring platforms, alerting systems, and telemetry collection

Experience using monitoring tools to diagnose issues proactively and tune alerting to reduce noise

Experience with SQL, MySQL, and building operations dashboards

Experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects

Experience building hyperscale platforms in Go with supporting tools in Python or Rust

Benefits

Retirement or pension plan, in line with local norms.

Health, dental, and vision insurance.

Generous PTO policy, in line with local norms.

Company

Fluidstack

Fluidstack is an AI cloud platform for frontier labs and startups.

Founded in 2017

London, England, GBR

51-200 employees

https://www.fluidstack.io

Funding

Current Stage

Growth Stage

Total Funding

unknown

Key Investors

Seedcamp

2025-06-01: Undisclosed

2024-10-01: Private Equity

2018-02-01: Pre Seed

Leadership Team

Gary Wu

CEO, Co-Founder

Company data provided by Crunchbase