Denvr · 1 month ago

AI Platform Engineer

United States

Full-time

Remote

Mid Level

3+ years exp

Denvr is a vertically integrated AI Platform Services company headquartered in Calgary, Canada, focused on providing foundational compute infrastructure and services for the AI ecosystem. The AI Platform Engineer will be responsible for designing, implementing, and operating AI compute architectures, collaborating with cross-functional teams to deliver exceptional customer products and experiences.

Artificial Intelligence (AI), Cloud Computing, Cloud Data Services, Cloud Infrastructure, Generative AI, Machine Learning, Natural Language Processing, Private Cloud

Responsibilities

Architect and optimize high-performance AI Platform solutions for AI training and inferencing, leveraging NVIDIA systems (H200/A100/GH200) and distributed training optimizations (NCCL, RDMA/InfiniBand)

Administer RKE Kubernetes clusters, including custom operator development (KOPF), CNIs (Kube-OVN), and KubeVirt, alongside managing traditional virtualization (VMware ESXi/vCenter) and bare-metal provisioning (Metal3, Ironic)

Perform advanced OS management (Ubuntu), including kernel parameter optimization and hardware-level troubleshooting on Supermicro/Dell platforms

Manage high-throughput network fabrics using BGP EVPN, SONiC, and leaf/spine topologies, while maintaining network security via firewalls, VPNs, internet gateways, and granular policy management

Deploy and maintain scalable, high-performance storage fabrics for data-intensive workloads using technologies such as WEKA, Ceph (Rook), Qumulo, and Dell PowerStore

Design and build critical backend APIs and microservices using Python (FastAPI, asyncio, Pydantic) or Golang, including the development of Kubernetes Operators and integration with relational/NoSQL databases

Drive infrastructure consistency and repeatability through Terraform, CloudFormation, and Ansible, integrated within robust CI/CD pipelines

Adhere to change/release management, incident/problem management, and documentation standards, and take part in cross-team architectural reviews, post-sales L3 support, and customer-facing technical engagement for both public cloud and private platform deployments

Work cross-functionally with vendors, engineering, and platform operations to define requirements, document processes, and continuously improve platform reliability and performance

Support business development and customer success teams by providing clear technical guidance, translating complex concepts, and aligning solutions to customer requirements

Meet directly with customers to design and review complex platform integrations, custom architectures, and workload-optimized AI solutions

Collaborate with vendors to evaluate and validate new GPU and ASIC hardware, firmware, and system architectures, providing feedback for integration and improvement

Provide L3 engineering support for advanced troubleshooting, root-cause analysis, and performance evaluation across compute, storage, networking, and AI systems

Stay up to date with industry trends and attend workshops, seminars, and conferences

Pursue relevant certifications and continuous learning in cloud, AI/ML infrastructure, networking, storage, and security domains

Engage in internal knowledge sharing through documentation, demos, tech talks, and mentorship of peers

Qualifications

AI & HPC Infrastructure, Cloud & Virtualization, Linux & Systems Engineering, Advanced Networking, Storage & Data Platforms, Systems Integration, Automation & IaC, Customer empathy, Analytical skills, Communication, Creative problem-solver, Teamwork & Collaboration

Required

Post-secondary education in Computer Science, Engineering, Information Technology, or a related technical discipline

3+ years of experience with AI/ML solutions engineering, cloud infrastructure, or a related field (preferred)

Background in software development, system design, or technical consulting is highly valued

Excellent written and verbal communication, with the ability to simplify and explain complex technical concepts

Strong customer empathy and discovery skills to uncover real needs and guide solution direction

Confident presenter who can engage both technical and non-technical audiences

Highly organized, able to manage multiple priorities, and comfortable shifting focus as business needs evolve

Creative problem-solver with a structured approach to diagnosing issues and designing solutions

Strong sense of ownership, accountability, and alignment with company vision and direction

Familiarity with AI industry trends, cloud and data center infrastructure, and secure, reliable operations at scale

Understanding of AI/ML workflows (training, multi-GPU/multi-node scaling, inferencing) and distributed storage fundamentals

General awareness of competitive landscape and emerging technologies in AI infrastructure and cloud services

Effective collaborator across cross-functional teams, including Sales, Marketing, Product, and Engineering

Comfortable working in customer-facing technical roles where clarity, empathy, and responsiveness are critical

Strong analytical mindset for evaluating complex systems and diagnosing issues across compute, storage, and networking

Ability to design, articulate, and innovate technical solutions aligned with customer and business requirements

Company

Denvr

Denvr AI Platforms provide foundational AI services for the AI ecosystem and end users of AI. They comprise cloud-enabled services for inferencing, computing, and data processing and storage, along with software toolsets for the accelerated development, operations, adoption, and integration of AI technologies. These are delivered through the public Denvr AI Cloud and through Denvr AI Platform Services for private, fully dedicated, sovereign, and highly secure AI services, including private platform infrastructure deployments that consist of advanced data centers, compute architectures, and data processing and storage fabrics with integrated platform operations software.

Founded in 2017

Calgary, Alberta, CAN

51-200 employees