Availability for meetings and impromptu communication during Quora's “coordination hours” (Mon-Fri: 9am-3pm Pacific Time)
Experience with building and owning end-to-end machine learning or data science-related systems
Experience instrumenting ML workloads for performance monitoring/efficiency
Experience with high performance, large-scaled distributed systems
4+ years of industry experience in Machine Learning, Infrastructure or related fields
4+ years of experience writing production code in Python, C++, or similar language
BS or MS in Computer Science, Engineering or a related technical field
Preferred Requirements:
Strong communication and inter-personal skills, experience working with ML teams is a plus
Experience working with Kubernetes, Docker, Terraform, or other forms of containerized infrastructure
Hands-on experience with AWS technologies like EC2, EBS, S3, EKS
Responsibilities:
Design, develop, and maintain the core infrastructure that powers Quora's machine learning platform, ensuring high availability, scalability, and performance
Build scalable and reliable distributed systems for serving machine learning models
Optimize infrastructure performance across the ML platform, identifying and resolving bottlenecks
Collaborate with ML engineers to provide solutions for building and deploying models efficiently
Contribute to the design and implementation of next-generation ML infrastructure
Develop services on top of open source technologies like Kubernetes, Tensorflow, and PyTorch
Own business-critical infrastructure, participate in the on-call rotation, and resolve production issues
Collaborate with ML engineers to enhance their impact