Senior HPC AI Cluster Engineer

Requirements:

  • Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience
  • 5+ years of experience
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) internals and networking (sockets, firewalls, iptables, Wireshark, etc.), ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS
  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS, plus familiarity with newer and emerging storage technologies
  • Python programming and bash scripting experience
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
  • Deep knowledge of networking protocols such as InfiniBand and Ethernet
  • Deep understanding of and experience with virtual systems (for example, VMware, Hyper-V, KVM, or Citrix)
  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Nice to Haves:

  • Knowledge of CPU and/or GPU architecture
  • Knowledge of Kubernetes and container-related microservice technologies
  • Experience with GPU-focused hardware/software (DGX, CUDA)
  • Background with RDMA (InfiniBand or RoCE) fabrics

What you'll be doing:

  • Designing, implementing, and maintaining large-scale HPC/AI clusters with monitoring, logging, and alerting
  • Managing Linux job/workload schedulers and orchestration tools
  • Developing and maintaining continuous integration and delivery pipelines
  • Developing tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
  • Deploying monitoring solutions for servers, network, and storage
  • Troubleshooting and fixing issues from the bottom up: bare metal, operating system, software stack, and application level
  • Being a technical resource: developing, redefining, and documenting standard methodologies to share with internal teams
  • Supporting Research & Development activities and engaging in POCs/POVs for future improvements

Perks and Benefits:

  • Highly competitive salaries
  • Extensive benefits package
  • Work environment that promotes diversity, inclusion, and flexibility

NVIDIA

Remote - Germany (Remote)

Experience: Senior
Posted: May 26, 2025
Tags: AWS, Azure, GCP, Jenkins, Kubernetes, Python, backend
