Requirements

5+ years of infrastructure engineering experience, with significant time spent on GPU compute, ML infrastructure, distributed systems, high-performance computing, or large-scale production platforms
Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production or production-like environments, including scheduling, orchestration, utilization monitoring, and cost optimization
Strong systems engineering fundamentals across Linux, networking, storage, containers, Kubernetes, distributed runtimes, and production debugging
Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, Ray Serve, or equivalent systems
Proficiency in Python for infrastructure automation, tooling, debugging, integration, and operational workflows
Practical understanding of performance tradeoffs across batching, concurrency, memory usage, GPU utilization, model size, latency, throughput, availability, and cost
Track record of optimizing compute costs while maintaining clear performance, reliability, and availability expectations
Experience building observable systems with useful metrics, logs, traces, dashboards, alerts, and incident workflows
Comfortable working in high-stakes, always-on environments where uptime, throughput, correctness, and operational discipline are critical
Clear communicator who can translate infrastructure tradeoffs for researchers, product teams, platform engineers, security stakeholders, and engineering leadership

Nice to Haves

Experience at a frontier AI lab, hyperscaler, high-frequency trading firm, research platform, or high-scale ML organization
Familiarity with custom silicon or specialized accelerators such as TPUs, AWS Trainium, Gaudi, or similar platforms
Background in capacity planning, procurement input, reserved capacity strategy, cloud accelerator economics, or GPU fleet cost management
Experience with distributed training frameworks such as DeepSpeed, Megatron-LM, FSDP, Ray, or equivalent systems
Experience debugging CUDA, NCCL, kernel, driver, runtime, memory, networking, or low-level performance issues
Experience with Rust, C++, Go, CUDA, or other systems languages used for performance-critical infrastructure
Experience in crypto, financial services, trading infrastructure, or security-sensitive production infrastructure

What You'll Be Doing

Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation
Design infrastructure that enables running models locally on GPUs
Build and improve scheduling, orchestration, and utilization systems
Optimize inference pipelines for latency, throughput, and cost
Partner with ML engineers to remove bottlenecks
Build observability for GPU utilization and related infrastructure signals
Drive reliability and incident response
Evaluate and integrate new hardware and cloud instance families
Build tooling for GPU usage visibility
Contribute to long-term architecture decisions

Perks and Benefits

Global team with diverse talents and backgrounds
Equal opportunity employer that does not discriminate on the basis of protected characteristics
Celebration of all Krakenites for their unique perspectives
Encouragement to apply even if you do not meet every requirement, especially if you are passionate about crypto
Ongoing acceptance of applications
Job-related skills or work-style assessments as part of the hiring process

Senior AI Compute Infrastructure Engineer

Kraken
