3 years of experience in large-scale internet/server environments, including cloud computing and multi-tier architectures.
Strong incident management and leadership skills, with excellent technical triage and troubleshooting abilities, especially during crises. (for TDO)
Hands-on software engineering skills in languages such as Java, Python, and Go.
Expert knowledge of large-scale web operations, including web-based Java/J2EE architectures and JVM configuration, plus a deep understanding of UNIX, Linux, networking (TCP/IP), and databases (both relational and NoSQL).
Experience debugging Android and iOS applications.
Experience with observability tools such as Grafana and Prometheus, and skills in documenting procedures for knowledge management.
What you'll be doing:
Lead Incident Management: Act as the Incident Commander to drive resolution of major incidents, manage alarms, and ensure effective communication with leadership and partner teams.
Proactive Monitoring: Continuously monitor the health of eBay's critical services to identify and address potential issues before they escalate.
Collaborative Problem Solving: Work closely with partner teams to resolve recurring technical issues, onboard new alerts, and develop high-quality Standard Operating Procedures (SOPs).
Automation and Process Enhancement: Identify and implement opportunities to enhance automation and reduce manual workload, improving overall efficiency.
Solution Development: Collaborate with Architecture, Engineering, and Operations teams to develop solutions that ensure high site availability, reliability, and performance.
Enhance Monitoring Tools: Improve tools for monitoring and mitigating site incidents, and conduct reliability audits and tests to strengthen eBay's reliability and incident management capabilities.
Perks and Benefits:
Fixed day shift (7:00 AM to 5:00 PM) at the Dublin, Ireland location, with no on-call responsibilities.
Team members work four consecutive days per week in 10-hour shifts.