A product-minded engineer who ships AI to production
5+ years of experience with backend systems and microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns
Proficient in a modern programming language; strong API/service design; production ops (monitoring, alerting, on-call rotation)
Proven experience delivering LLM/agent features to production
Comfortable owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics
End-to-end AI implementation owner: understands the full LLM product lifecycle
Fluent with offline and online evals for AI systems
What you'll be doing
Build AI-driven deployment gates: Design and ship decision systems that evaluate customer deployments using CI/CD context and Datadog telemetry, producing safe, explainable allow/block outcomes
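A gate like this can be sketched as a pure decision function that returns an explainable verdict. This is an illustrative assumption only: the names (`GateDecision`, `evaluate_deployment`) and thresholds are hypothetical, not Datadog's actual API or signals.

```python
from dataclasses import dataclass

@dataclass
class GateDecision:
    allow: bool
    reasons: list  # explainability: why the gate allowed or blocked

# Hypothetical inputs; real signals would come from CI/CD context and telemetry.
def evaluate_deployment(error_rate_delta: float,
                        latency_p99_delta_ms: float,
                        telemetry_complete: bool) -> GateDecision:
    reasons = []
    if not telemetry_complete:
        # Conservative default: incomplete data never auto-blocks on its own;
        # the deployment proceeds but is flagged for human review.
        return GateDecision(allow=True,
                            reasons=["telemetry incomplete: flagged for review"])
    if error_rate_delta > 0.05:
        reasons.append(f"error rate up {error_rate_delta:.0%} vs. baseline")
    if latency_p99_delta_ms > 250:
        reasons.append(f"p99 latency up {latency_p99_delta_ms:.0f} ms vs. baseline")
    return GateDecision(allow=not reasons,
                        reasons=reasons or ["all signals within thresholds"])
```

Keeping the decision a deterministic function of its inputs makes every allow/block outcome reproducible and auditable.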
Own evals and rollout: Define precision, recall, and trust metrics; build offline and online evals; validate changes in shadow mode; and safely promote improvements to enforcement
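An offline eval for a gate reduces to scoring its verdicts against labeled historical deployments. A minimal sketch, assuming hypothetical labels and a blocked/faulty convention (the function name and data are illustrative):

```python
def gate_metrics(decisions, labels):
    """decisions: True = gate blocked; labels: True = deployment was actually faulty."""
    tp = sum(d and l for d, l in zip(decisions, labels))       # faulty and blocked
    fp = sum(d and not l for d, l in zip(decisions, labels))   # good but blocked
    fn = sum(l and not d for d, l in zip(decisions, labels))   # faulty but allowed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Shadow mode: run the candidate gate alongside production without enforcing,
# log its verdicts, and promote only once precision/recall clear agreed bars.
decisions = [True, True, False, False]   # candidate gate verdicts
labels    = [True, False, True, False]   # ground truth, e.g. from postmortems
p, r = gate_metrics(decisions, labels)   # p = 0.5, r = 0.5
```

Precision protects customer trust (few good deployments blocked); recall measures how many genuinely faulty deployments the gate catches.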
Design for robustness and safety: Implement conservative defaults, guardrails, fallbacks, and human-in-the-loop paths so gates behave predictably under noisy or incomplete data
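One common shape for these guardrails is a wrapper that catches evaluator failures and low-confidence verdicts, applying a conservative fallback and routing to human review. A sketch under stated assumptions: the `(allow, confidence)` contract and the 0.8 threshold are hypothetical.

```python
import logging

def gated_decision(evaluate, deployment, fallback_allow=True):
    """Wrap a gate evaluator with guardrails.

    `evaluate` is assumed to return (allow: bool, confidence: float).
    On error or low confidence, fall back to a conservative default
    and flag the deployment for human review.
    """
    try:
        allow, confidence = evaluate(deployment)
    except Exception:
        logging.exception("gate evaluation failed; applying fallback")
        return fallback_allow, "fallback: evaluator error, flagged for review"
    if confidence < 0.8:  # assumed threshold; tune via offline evals
        return fallback_allow, "fallback: low confidence, flagged for review"
    return allow, "gate decision enforced"
```

The key property is predictability: noisy or incomplete data degrades to a known default path rather than an arbitrary verdict.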
Partner closely with Product: Work hand-in-hand with the Product Manager to translate customer problems, adoption signals, and roadmap goals into concrete technical decisions and iterations
Integrate across the Datadog platform: Partner with internal AI teams building the Faulty Deployment Detection pipeline, as well as teams working on LLMs and AI agents
Own production systems: Build and operate reliable backend services that run in the critical path of customer deployments, and be on-call for those services