Senior Site Reliability Engineer, Database Operations: ClickHouse
Requirements:
Advanced database platform management experience, preferably using PostgreSQL and ClickHouse at scale
Advanced experience with cloud infrastructure automation and management, preferably using Ansible, Chef, Terraform, Helm charts, Operators, and Kubernetes
Solid experience with at least one programming language: Go, Ruby or Python
Advanced experience with Linux
Extensive on-call experience as an SRE supporting mission-critical systems
Solid incident management experience across all phases: analysis, remediation, RCA, and corrective actions
Solid experience implementing monitoring at scale (preferably Prometheus and Grafana)
What You'll Be Doing:
Design, build, and maintain ClickHouse and PostgreSQL clusters to support high-demand, enterprise-scale workloads
Provision and orchestrate cloud infrastructure in GCP using configuration management tools (Ansible, Chef), IaC (Terraform), the Kubernetes ecosystem (Helm charts, Operators), and distributed consensus (etcd)
Design and implement enterprise-grade, high-availability ClickHouse solutions with ClickHouse Keeper, sharding, and replication, optimized for large-scale and dynamic datasets
Optimize and scale high-transaction PostgreSQL clusters with Patroni and streaming replication for GitLab’s core applications on GCP
Build and maintain early warning systems, monitoring, and alerting tools (e.g., Prometheus/Grafana) to predict capacity needs, monitor query latency and replication lag, and ensure resource optimization across platforms
Enable cross-database integrations and workflows, such as ClickHouse-to-PostgreSQL data federation, CDC, and logical replication, to support hybrid analytics
Respond to platform alerts, user emergencies, and support requests while ensuring strict adherence to SLOs, including during SRE on-call rotations
Enhance infrastructure security by implementing and updating measures that protect GitLab’s systems and ensure compliance with regulatory requirements (e.g., GDPR, FedRAMP, SOC2, ISO)
Partner with internal and external compliance assessors as Subject Matter Experts during certifications and recertifications
Collaborate with engineering teams to address architectural bottlenecks, plan service rollouts and migrations, and shape the future roadmap while maintaining strong operational readiness
Nice to Haves:
Willingness and ability to live and promote GitLab's unique CREDIT Values in one's day-to-day work and interactions with teammates
Superior verbal and written communication skills
Cool, collected and composed under pressure
Comfortable and productive working asynchronously across time zones and cultures, at the speed and scale of business
Enable others to excel
Be a Leader of One
Act Like an Owner with GitLab's resources
Perks and Benefits:
Benefits to support your health, finances, and well-being
All remote, asynchronous work environment
Flexible Paid Time Off
Team Member Resource Groups
Equity Compensation & Employee Stock Purchase Plan