You have strong software engineering skills with experience in domains such as observability, SRE, or security
You have depth in distributed computing and ML systems for training and inference at scale; experience with Ray, Slurm, or similar frameworks is a plus
You are proficient in Python, familiar with a systems language (e.g., Rust, C++, or Go), and you are comfortable with modern cloud and data infrastructure
You have practical experience implementing and operating ML training and inference systems (e.g., PyTorch or JAX), including containerization, orchestration, and GPU acceleration
You are familiar with efficient training, fine-tuning, and inference techniques for large foundation models
You can explain design and performance trade-offs clearly to both technical and non-technical audiences
You have a strong interest in open-science and open-source contributions, including establishing rigorous benchmarks and sharing artifacts with the community
What You'll Be Doing:
Build and operate datasets, training and evaluation pipelines, benchmarks, and internal tooling
Implement models, run experiments at scale, and profile for reliability, performance, and cost
Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery
Make the research stack observable, reproducible, and easier to use
Establish rigorous automated benchmarks and regression tests for forecasting, anomaly detection, multi-modal analysis, agents, and code repair tasks
Collaborate with Research Scientists, Product, and Engineering to integrate advanced AI capabilities into Datadog’s product ecosystem and to harden prototypes into reliable services
Contribute high-quality code, documentation, and open-source artifacts that enable the community and internal teams to reproduce, extend, and evaluate results
Nice to Haves (Bonus Points):
You have a demonstrated ability to bridge cutting-edge research prototypes and real-world product applications, ideally with large foundation models, generative AI agents, or domain-specific LLM deployments
You are passionate about pushing the boundaries of AI while maintaining a strong focus on customer impact, scalability, and responsible deployment of new technologies
You have hands-on experience with GPU programming and optimization, including CUDA
You have experience writing production data pipelines and applications
You have experience supporting or contributing to research publications
Perks and Benefits:
Competitive global benefits
New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
Opportunity to collaborate closely with colleagues across the Datadog offices in New York City and Paris
Opportunity to attend and present at conferences and meetups
Intra-departmental mentor and buddy program for in-house networking
An inclusive company culture and the ability to join our Community Guilds (Datadog employee resource groups)