Senior HPC AI Cluster Engineer

Requirements:

  • Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience

  • 5+ years of experience

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software

  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)

  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark) and internals, ACLs, OS-level security protection and common protocols such as TCP, DHCP and DNS

  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS, and familiarity with newer and emerging storage technologies

  • Python programming and Bash scripting experience

  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef

  • Deep knowledge of networking technologies such as InfiniBand and Ethernet

  • Deep understanding of and experience with virtualization platforms (for example VMware, Hyper-V, KVM, or Citrix)

  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Nice to haves:

  • Knowledge of CPU and/or GPU architecture

  • Knowledge of Kubernetes and container-related microservice technologies

  • Experience with GPU-focused hardware/software (DGX, CUDA)

  • Background with RDMA (InfiniBand or RoCE) fabrics

What you will be doing:

  • Designing, implementing and maintaining large-scale HPC/AI clusters with monitoring, logging and alerting

  • Managing Linux job/workload schedulers and orchestration tools

  • Developing and maintaining continuous integration and delivery pipelines

  • Developing tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources

  • Deploying monitoring solutions for the servers, network and storage

  • Troubleshooting and fixing issues from the bottom up, across the bare metal, operating system, software stack and application levels

  • Serving as a technical resource: developing, redefining and documenting standard methodologies to share with internal teams

  • Supporting Research & Development activities and engaging in POCs/POVs for future improvements

Perks and benefits:

  • Highly competitive salaries

  • Extensive benefits package

  • Work environment that promotes diversity, inclusion, and flexibility

NVIDIA

Remote - Germany (Remote)

Experience: Senior
Posted: May 26, 2025
Tags: AWS, Azure, GCP, Jenkins, Kubernetes, Python, backend
