Site Reliability Engineer (L4/L5)

Requirements

  • 3+ years of experience as a Site Reliability Engineer or in a similar role
  • Strong scripting and programming skills (Python, Go, Java or JavaScript/Node.js)
  • Experience with complex sociotechnical systems and their successful operations at scale
  • Experience with incident management and response
  • Experience with infrastructure as code (e.g., Terraform) and with containers and container orchestration (e.g., Docker, Kubernetes)
  • Experience with cloud platforms such as AWS, microservices architecture, and enterprise software such as Slack and GSuite
  • Excellent communication & collaboration skills and a continuous improvement mindset
  • Proven ability to cultivate relationships through influence
  • Proven ability to troubleshoot complex issues and implement effective solutions
  • Familiarity with Human Factors Engineering
  • Ability to grow your own expertise and to influence and educate others

What You'll Be Doing

  • Design, implement, and maintain scalable and reliable infrastructure to support our services.
  • Collaborate with engineering and product teams to integrate observability, reliability, and security considerations into the entire software development lifecycle.
  • Develop and implement automation tools for monitoring, deployment, and incident response to ensure efficient and reliable operations.
  • Conduct or participate in capacity planning, performance analysis, and system tuning to optimize system reliability.
  • Participate in on-call rotations and contribute to incident response, diagnosis, and resolution.
  • Implement and improve monitoring and alerting systems to proactively identify and address potential issues.
  • Implement and maintain robust disaster recovery and business continuity plans.
  • Continuously evaluate and recommend improvements to enhance system observability and reliability.
  • Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective.
  • Engage with product teams to diagnose operational surprises and drive improvements.
  • Implement and maintain a robust incident response framework, including blame-aware incident reviews to learn from operational surprises.
  • Champion a growth mindset and continuous learning culture, encouraging proactive innovation and ongoing skill development.

Nice to Haves

  • N/A

Perks and Benefits

  • Inclusion is a core value at Netflix
  • Equal-opportunity employer promoting diversity and inclusion

Netflix

Warsaw, Poland

Experience: Mid-level
Posted: May 7, 2025
Tags: AWS, Docker, Go, Java, JavaScript, Kubernetes, Node.js, Python, Terraform, Site Reliability
