Senior HPC DevOps Engineer

Requirements:

  • B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience.
  • Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
  • Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
  • Familiarity with Jenkins, Ansible, Puppet/Chef.
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security.
  • Deep understanding of networking protocols such as InfiniBand and Ethernet.
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes.
  • Experience with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
  • Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
  • Familiarity with cloud platforms (AWS, Azure, Google Cloud).

Nice to Haves:

  • Architectural Insight: Knowledge of CPU and/or GPU architecture.
  • Container Expertise: Understanding of Kubernetes and container-related microservice technologies.
  • GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA).
  • RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics.

What you'll be doing:

  • Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
  • Infrastructure as Code (IaC): Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
  • Streamline CI/CD Pipelines: Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
  • Automate Everything: Develop scripts and tools to automate deployment, configuration management, and operational monitoring.
  • Enhance Monitoring: Deploy advanced monitoring solutions for servers, networks, and storage to ensure seamless operations.
  • Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
  • Lead and Educate: Serve as a technical resource, developing and sharing best practices with internal teams.
  • Drive Innovation: Support R&D activities and engage in proofs of concept (PoCs) and proofs of value (PoVs) for future improvements.

Perks and Benefits:

  • NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization.
  • Highly competitive salaries.
  • An extensive benefits package.
  • A work environment that promotes diversity, inclusion, and flexibility.
  • Equal opportunity employer committed to fostering a supportive and empowering workplace for all.

NVIDIA

Location: Germany (Remote)

Experience: Senior
Posted: January 3, 2025
Tags: AWS, Azure, GCP, Jenkins, Kubernetes, backend
