Bachelor's, Master's, or Ph.D. in Computer Science, Computer Engineering, or a related field (or equivalent experience)
2+ years of relevant work or research experience in performance analysis and compiler optimizations.
Ability to work independently, define project goals and scope, and lead your own development effort while adopting clean software engineering and testing practices.
Excellent C/C++ programming and software design skills, including debugging, performance analysis, and test design.
Strong foundation in CPU and/or GPU architecture. Knowledge of high-performance computing and distributed programming. CUDA or OpenCL programming experience is desired but not required.
Experience with compiler technologies such as XLA, TVM, MLIR, LLVM, and OpenAI Triton, as well as with deep learning models, algorithms, and framework design.
Strong interpersonal skills and ability to work in a dynamic product-oriented team. A history of mentoring junior engineers and interns is a bonus.
What you'll be doing:
Crafting and implementing compiler optimization techniques for deep learning network graphs
Designing novel graph partitioning and tensor sharding techniques for distributed training and inference
Performance tuning and analysis
Code generation for NVIDIA GPU backends using open-source compilers such as MLIR, LLVM, and OpenAI Triton
Defining APIs in JAX and related libraries, and other general software engineering work
Ways to stand out from the crowd:
Experience working on a deep learning framework such as JAX, PyTorch, or TensorFlow
Experience with CUDA or with GPUs
Proficient with open-source compilers such as LLVM and MLIR