Senior HPC AI Engineer
Requirements:
- A degree in Computer Science, Engineering, or a related field and 5+ years of experience
- Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
- Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking and internals
- Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS
- Python programming and Bash scripting experience
- Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
- Deep knowledge of networking technologies such as InfiniBand and Ethernet
- Deep understanding and experience with virtual systems (e.g., VMware, Hyper-V, KVM, or Citrix)
- Familiarity with cloud computing platforms (e.g., AWS, Azure, Google Cloud)
What you'll be doing:
- Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
- Manage Linux job/workload schedulers and orchestration tools
- Develop and maintain continuous integration and delivery pipelines
- Develop tooling to automate deployment and management of large-scale infrastructure environments
- Deploy monitoring solutions for the servers, network, and storage
- Troubleshoot from the bottom up, from bare metal to the application level
- Develop, refine, and document standard methodologies to share with internal teams
- Support Research & Development activities and engage in POCs/POVs for future improvements
Nice to haves:
- Knowledge of CPU and/or GPU architecture
- Knowledge of Kubernetes and container-based microservice technologies
- Experience with GPU-focused hardware/software (DGX, CUDA)
- Background with RDMA (InfiniBand or RoCE) fabrics