BS, MS, or PhD in Computer Science, Engineering, or a related field.
4+ years of software development and architecture experience, with at least 2 years focused on machine learning platforms.
Expertise in designing and building ML platforms, including model training pipelines and inference services.
Strong programming skills in Python, Java, or Scala, and experience with frameworks like TensorFlow or PyTorch.
Familiarity with containerization and orchestration technologies (Docker, Kubernetes) and big data technologies (Spark, Hadoop, Kafka).
What you'll be doing:
Design and implement robust, scalable, and secure systems for the machine learning platform, supporting diverse use cases such as large-scale training and real-time inference.
Evaluate and adopt emerging technologies and frameworks to enhance platform capabilities and optimize performance and cost efficiency.
Collaborate with cross-functional teams to align platform development with business needs.
Identify bottlenecks and optimize the use of GPUs for large-scale data processing and high-throughput model serving.
Define and implement automated ML workflows, including data preprocessing, model training, hyperparameter tuning, deployment, and monitoring.
Ensure adherence to enterprise-level governance, compliance, and security standards.
Nice to have:
Experience with MLOps practices, CI/CD pipelines, and tools like MLflow or Kubeflow; familiarity with Ray is a plus.