Site Reliability Engineer

Requirements

  • Master’s degree in Computer Science, Engineering or a related field
  • 7+ years of experience in a DevOps/SRE role
  • Strong experience with cloud computing and highly available distributed systems
  • Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
  • Experience working against reliability KPIs (observability, alerting, SLAs)
  • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
  • Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
  • Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
  • Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
  • Strong understanding of networking, security, and system administration concepts
  • Excellent problem-solving and communication skills
  • Self-motivated and able to work well in a fast-paced startup environment

Nice to Have

  • Experience in an AI/ML environment
  • Experience with high-performance computing (HPC) systems and workload managers (Slurm)
  • Experience working with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)

What you will be doing

  • Operations (50%)
    • Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads
    • Keep our platform, inference and model training environments highly available, and enable seamless replication of work environments across several HPC clusters
    • Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.)
    • Implement and improve monitoring, alerting, and incident response systems
    • Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems)
    • Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis
  • Development (50%)
    • Drive continuous improvement in infrastructure automation, deployment, and orchestration
    • Collaborate with AI/ML researchers to implement solutions
    • Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
    • Design and develop new workflows and tooling to improve system performance
    • Collaborate with the security team to ensure best security practices
    • Contribute to open-source projects, research publications, blog articles and conferences

Perks and Benefits

  • Remote - Europe location

Mistral AI

France, UK

Remote
Experience: Senior
Posted: April 2, 2024
Last seen: 2 hours ago
Docker
Golang
Kubernetes
Python
Terraform
sitereliability

Why we track Mistral AI

Mistral AI is a Paris-based AI company building frontier large language models. Founded by former DeepMind and Meta researchers, they're one of Europe's most important AI companies. If you want to work on cutting-edge AI research and infrastructure in Europe, Mistral is the one to watch.
