Requirements

Proven expertise in building and scaling observability systems (e.g., logging platforms, metrics pipelines, tracing infrastructure, or profiling tools).
Lead technical execution for major components of Twilio’s observability overhaul, including our shift to centralized S3-based data lakes, OpenTelemetry instrumentation, and ClickHouse-backed query engines.
Deep proficiency in at least one modern programming language (e.g., Go, Python, Java).
Familiarity with high-cardinality data challenges and telemetry correlation techniques.
Experience designing high-scale telemetry systems (e.g., Prometheus, ClickHouse, OpenTelemetry, Kafka, or equivalent).
Solid understanding of distributed systems and the challenges of observability in complex, microservice-based environments.
Experience with AWS, Kubernetes, and infrastructure-as-code tools.
Provide architectural guidance and thought leadership across teams, helping to establish clear telemetry standards, efficient usage patterns, and scalable platform abstractions.
Ability to make forward-looking technical decisions and lead others through ambiguity and change.

Nice to Haves

Familiarity with ClickHouse, Grafana Loki, Athena, or equivalent systems for log and metrics querying.
Contributions to open-source observability tools or communities.
Experience building cost visibility or FinOps tooling for cloud compute and telemetry pipelines.

Lead the end-to-end architecture and delivery of key observability platform components, with a focus on reliability, scalability, and usability.
Drive consistency and quality across all observability signals—logs, metrics, traces, and continuous profiling—building intuitive workflows for engineers.
Serve as a technical advisor and mentor across the platform org, guiding design decisions and aligning cross-team efforts with long-term architectural goals.
Go deep in one or more problem areas (e.g., high-cardinality telemetry, distributed tracing correlation, compute cost insights), while ensuring the platform scales horizontally.
Collaborate with product teams, SREs, and developer experience groups to deeply understand telemetry needs and integrate observability into core engineering workflows.
Design and build developer-friendly tooling and APIs to support incident response, performance analysis, and platform debugging at scale.
Leverage (and optionally contribute to) open-source standards like OpenTelemetry to ensure interoperability and extensibility.
Champion a pragmatic approach to observability—balancing performance, cost, and user value across diverse engineering teams.