Senior HPC AI Cluster Engineer

AI Summary ✨

Requirements:

  • Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience

  • 5+ years of experience

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software

  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)

  • Excellent knowledge of Windows and Linux networking and internals, ACLs, OS-level security protection, and common protocols

  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS

  • Python programming and Bash scripting experience

  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef

  • Deep knowledge of networking technologies such as InfiniBand and Ethernet

  • Deep understanding of and experience with virtualization platforms (e.g. VMware, Hyper-V, KVM, or Citrix)

  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Nice to haves:

  • Knowledge of CPU and/or GPU architecture

  • Knowledge of Kubernetes and container-related microservice technologies

  • Experience with GPU-focused hardware/software (DGX, CUDA)

  • Background with RDMA (InfiniBand or RoCE) fabrics

What you will be doing:

  • Designing, implementing, and maintaining large-scale HPC/AI clusters with monitoring, logging, and alerting

  • Managing Linux job/workload schedulers and orchestration tools

  • Developing and maintaining continuous integration and delivery pipelines

  • Developing tooling to automate the deployment and management of large-scale infrastructure environments, to support operational monitoring and alerting, and to enable self-service consumption of resources

  • Deploying monitoring solutions for the servers, network, and storage

  • Troubleshooting and fixing issues at every level, from bare metal and operating system to software stack and application

  • Serving as a technical resource: developing, redefining, and documenting standard methodologies to share with internal teams

  • Supporting Research & Development activities and engaging in POCs/POVs for future improvements

Perks and benefits:

  • Highly competitive salaries

  • Extensive benefits package

  • Work environment that promotes diversity, inclusion, and flexibility

NVIDIA

Remote - Germany

Experience: Senior
Posted: May 26, 2025
AWS
Azure
GCP
Jenkins
Kubernetes
Python
backend
