Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Nice to Haves
Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python
OR Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Ability and passion for designing and building highly available distributed systems at scale
Ability to exercise sound judgment in ambiguous situations
Experience with agile methodologies and willingness to adopt them
What You'll Be Doing
Partners with appropriate stakeholders across teams and orgs to determine project requirements. Leads the design and architecture of change management features and services in Azure Compute
Identifies dependencies and authors design documents for features and services. Leverages expertise with appropriate stakeholders to develop project plans, release plans, and work items
Develops high-quality, extensible, maintainable code and coaches others to do the same. Supports livesite as a Designated Responsible Individual (DRI): mentors engineers across products and solutions and works on-call to monitor systems, products, and services for degradation, downtime, or interruptions
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products. Drives consistency in monitoring and operations at scale, and shares knowledge with other engineers
Collaborates with data scientists and ML engineers to design and integrate predictive models that proactively detect hardware anomalies and trigger live migrations, improving VM uptime and SLA compliance
Leads initiatives to embed AI-driven diagnostics and root cause analysis into availability services, reducing time-to-resolution for incidents and improving operational efficiency. Drives the adoption of generative AI tools to automate documentation, incident summaries, and engineering workflows, enhancing team productivity and knowledge sharing
Partners with platform teams to build intelligent observability pipelines that leverage anomaly detection and trend analysis for early warning systems. Evaluates and integrates large-scale AI models into control plane services to enable smarter, context-aware repair decisions across millions of Azure VMs
Perks and Benefits
Microsoft is an equal opportunity employer providing various benefits
Benefits/perks listed may vary depending on the nature of your employment with Microsoft and the country where you work