Site Reliability Engineer - AI Agents

Requirements

  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
  • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
  • Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Experience with containerization and orchestration, particularly Kubernetes and Docker
  • Solid understanding of cloud infrastructure, preferably AWS
  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
  • Experience designing and operating observability, monitoring, and alerting systems
  • Experience implementing incident response procedures and participating in on-call rotations
  • Strong collaboration skills working across data, AI, and engineering teams
  • High ownership mindset in a fast-moving, high-stakes production environment

Nice to Haves

  • Experience building or operating infrastructure for agent-based or LLM-powered systems
  • Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
  • Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
  • Experience with CI/CD pipelines and deployment automation for AI/ML workloads
  • Exposure to evaluation frameworks and model performance monitoring at scale
  • Experience working in fast-moving 0→1 environments or platform-building teams
  • Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
  • Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions

What You'll Be Doing

  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production
  • Ensure reliability, scalability, and observability of agentic systems across internal and external products
  • Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
  • Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
  • Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
  • Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage AWS cloud infrastructure
  • Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
  • Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
  • Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
  • Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
  • Implement access controls and security best practices across AI infrastructure environments
  • Document architecture, runbooks, and best practices to support knowledge sharing across the team

Perks and Benefits

  • Opportunity to contribute to building the future of open finance
  • Work with a globally recognized crypto platform
  • Collaborate with diverse and talented colleagues from around the world
  • Continuous learning and growth opportunities in a dynamic industry
  • Equal opportunity employer fostering a culture of meritocracy and diversity
  • Potential for career advancement and skill development

Kraken

UK, Spain, Czech Republic, Sweden, Cyprus, Ireland, Poland, Portugal, Hungary, Lithuania, Switzerland, Bulgaria, Romania

Remote
Experience: Senior
Posted: June 11, 2026
Skills: AWS, Docker, Kubernetes, Python, Terraform, Site Reliability

Why we track Kraken

Kraken is one of the oldest and most established crypto exchanges. It takes security seriously and has a strong engineering reputation. It is remote-friendly and offers EU roles, making it one of the more serious options in the crypto space.
