4+ years building backend or real-time ML systems; value simplicity, correctness, and performance
Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails)
Comfortable owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics
What You'll Be Doing:
Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.
Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.
Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human‑in‑the‑loop paths.
Integrate with Datadog’s platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end‑to‑end value in the APM UI.
Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences.
Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high-throughput, low-latency data systems.
Nice to Have:
Hands-on with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines
Exposure to planning/agent frameworks, tool-use orchestration, RAG, and retrieval/indexing for observability data
Familiarity with SLO/SLA practices and incident response
Perks and Benefits:
Build tools for software engineers like yourself, and use the tools we build to accelerate our own development.
Have a lot of influence on product direction and impact on the business.
Work with skilled, knowledgeable, and kind teammates who are happy to teach and learn.