Site Reliability Engineer

Requirements

Master’s degree in Computer Science, Engineering or a related field
7+ years of experience in a DevOps/SRE role
Strong experience with cloud computing and highly available distributed systems
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
Experience working against reliability KPIs (observability, alerting, SLAs)
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
Strong understanding of networking, security, and system administration concepts
Excellent problem-solving and communication skills
Self-motivated and able to work well in a fast-paced startup environment

Nice to Haves

Experience in an AI/ML environment
Experience with high-performance computing (HPC) systems and workload managers (Slurm)
Experience working with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)

What you will be doing

Operations (50%)
- Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads
- Keep our platform, inference, and model training environments highly available, and enable seamless replication of work environments across several HPC clusters
- Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.)
- Implement and improve monitoring, alerting, and incident response systems
- Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems)
- Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis

Development (50%)
- Drive continuous improvement in infrastructure automation, deployment, and orchestration
- Collaborate with AI/ML researchers to implement solutions
- Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
- Design and develop new workflows and tooling to improve system performance
- Collaborate with the security team to ensure best security practices
- Contribute to open-source projects, research publications, blog articles and conferences

Perks and Benefits

Remote - Europe location

Mistral AI

France, UK

Remote

Experience: Senior

Posted: April 2, 2024

Last seen: 2 hours ago

Docker

Golang

Kubernetes

Python

Terraform

sitereliability

Why we track Mistral AI

Mistral AI is a Paris-based AI company building frontier large language models. Founded by former DeepMind and Meta researchers, they're one of Europe's most important AI companies. If you want to work on cutting-edge AI research and infrastructure in Europe, Mistral is the one to watch.
