Datadog · 2 days ago
Senior AI Engineer - APM Experiences
Datadog is a global SaaS business focused on delivering solutions for cloud monitoring and digital transformation. The Senior AI Engineer will lead the development of AI-powered capabilities for Application Performance Monitoring, focusing on debugging, performance optimization, and creating intelligent monitors.
Analytics, Cloud Computing, Cloud Data Services, Cloud Infrastructure, Data Management, DevOps, Productivity Tools, SaaS
Responsibilities
Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes
Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability
Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human‑in‑the‑loop paths
Integrate with Datadog’s platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end‑to‑end value in the APM UI
Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences
Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high‑throughput, low‑latency data systems
Qualifications
Required
4+ years building backend or real-time ML systems; you value simplicity, correctness, and performance
Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails)
Comfortable owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics
Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals)
Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model
Fluency with offline/online evals for AI systems; can build reliable golden sets and automatic regressions
Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns
Proficient in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on‑call rotation)
Preferred
Hands‑on with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines
Exposure to planning/agent frameworks, tool‑use orchestration, RAG, and retrieval/indexing for observability data
Familiarity with SLO/SLA practices and incident response
Benefits
Healthcare
Dental
Parental planning
Mental health benefits
A 401(k) plan and match
Paid time off
Fitness reimbursements
A discounted employee stock purchase plan
Company
Datadog
Datadog is an observability and security platform that offers infrastructure, applications, software development, and monitoring services.
H-1B Sponsorship
Datadog has a track record of offering H-1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data powered by the US Department of Labor).
Trends of Total Sponsorships
2025 (123)
2024 (66)
2023 (45)
2022 (53)
2021 (31)
2020 (29)
Funding
Current Stage: Public Company
Total Funding: $1.02B
Key Investors: ICONIQ Growth, Index Ventures, OpenView
2024-12-09: Post-IPO Debt · $870M
2020-05-28: Post-IPO Debt
2019-09-19: IPO
Company data provided by Crunchbase