## Introduction
OpenTelemetry has emerged as a critical tool for achieving observability in modern distributed systems. As organizations scale and integrate multiple services, the need for standardized telemetry data collection, processing, and export becomes paramount. Delivery Hero, a global food delivery company, faced significant challenges in unifying its observability infrastructure across 13 subsidiaries, each using different vendors and tools. This article explores their journey with OpenTelemetry, highlighting its benefits, challenges, and the trade-offs involved in adopting a vendor-agnostic approach.
## Background and Migration Motivation
### The Problem
Delivery Hero’s fragmented observability landscape stemmed from several key issues:
- **Diverse Vendors**: Each subsidiary used different observability tools, leading to siloed data and integration difficulties.
- **Data Fragmentation**: Telemetry data was stored in disparate backends, complicating unified analysis.
- **Cost Pressures**: A CTO mandate to cut $6 million in annual costs drove the consolidation of platforms and observability tooling.
### The Goal
The migration aimed to achieve three core objectives:
1. **Vendor Neutrality**: Eliminate vendor lock-in by adopting an open standard.
2. **Standardized Semantics**: Implement consistent semantic conventions for telemetry data.
3. **Centralized Tracing**: Use tracing as the core signal that unifies all telemetry data (traces, metrics, and logs).
## Migration Strategy and Architecture Design
### Ideal Architecture
The envisioned architecture centered around:
- A **centralized OpenTelemetry Collector Agent** to manage data ingestion.
- **OpenTelemetry SDKs** across all applications, routing data through the collector to a single vendor.
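To make the application side concrete, the following is a minimal Go sketch of the SDK wiring, assuming OTLP over gRPC; the collector endpoint and service name are illustrative assumptions, not details from Delivery Hero's setup.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// Export spans via OTLP/gRPC to the central collector agent.
	// The endpoint and insecure transport are illustrative assumptions.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// The tracer provider batches spans and stamps them with resource
	// attributes that follow OpenTelemetry semantic conventions.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("checkout-service"), // hypothetical service name
		)),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Any instrumented code now emits spans through the collector.
	_, span := otel.Tracer("example").Start(ctx, "handle-order")
	span.End()
}
```

With wiring like this in place, swapping the backend vendor only requires changing the collector's exporter configuration, not the application code, which is the essence of the vendor-neutral design.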
### Actual Implementation
While the ideal architecture was the target, the implementation involved multiple layers:
- **Central Datadog Collector**: Translated Datadog-formatted telemetry and forwarded it to the new vendor (a simplified translation sketch follows this list).
- **Central OTLP Collector**: Handled OTLP-formatted data.
- **Federated Collectors**: Accepted multiple formats (OTLP, Prometheus, JSON) and exported them to the new vendor.
- **Subsidiary-Specific Collectors**: Units like Glovo and Logistics maintained independent collectors.
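As an illustration of the translation work the central Datadog collector performs, here is a deliberately simplified Go sketch; `ddSpan` and `otelSpan` are hypothetical shapes rather than the actual Datadog wire format or the OTLP data model.

```go
package main

import "fmt"

// ddSpan and otelSpan are hypothetical, simplified shapes; the real
// collector translates Datadog's wire format into OTLP pdata.
type ddSpan struct {
	TraceID  uint64
	Resource string
	Meta     map[string]string
}

type otelSpan struct {
	TraceID    string // 32 hex characters in OTLP
	Name       string
	Attributes map[string]string
}

// toOTel sketches the mapping step: rename fields, re-encode IDs, and
// carry tags across as attributes.
func toOTel(d ddSpan) otelSpan {
	return otelSpan{
		// Pad the 64-bit Datadog ID into a 128-bit-style hex ID (simplified).
		TraceID:    fmt.Sprintf("%032x", d.TraceID),
		Name:       d.Resource,
		Attributes: d.Meta,
	}
}

func main() {
	s := toOTel(ddSpan{
		TraceID:  42,
		Resource: "GET /orders",
		Meta:     map[string]string{"env": "prod"},
	})
	fmt.Printf("%+v\n", s)
}
```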
### Migration Steps
The process was phased:
1. **Infrastructure Migration**: First, establish centralized data pipelines.
2. **Reinstrumentation**: Later, re-instrument applications to align with OpenTelemetry standards.
## Challenges and Issues
### Custom Components and Upstream Contributions
- **Forked Components**: Custom receivers, such as the forked Datadog Collector, were necessary to support all telemetry types (traces, metrics, logs). However, these forks diverged from upstream, creating compatibility issues with newer versions and making rollbacks to upstream impossible.
- **Custom Logic**: Additional components such as the Span Metrics Connector, Prometheus Exporter, and Delta to Cumulative Processor were developed in-house. These required ongoing maintenance and continual re-alignment with upstream OpenTelemetry releases (the span-metrics idea is sketched after this list).
- **Upstream Contribution Barriers**: Slow upstream updates and the need to apply patches for custom components increased operational complexity.
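To illustrate why a component like the Span Metrics Connector is both useful and stateful, here is a simplified Go sketch of the underlying idea, aggregating finished spans into per-operation counts and latencies; it is not the connector's actual implementation, and the dimensions are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// spanKey groups spans the way a span-metrics component would:
// by service and operation name (a simplified stand-in for the
// real connector's configurable dimensions).
type spanKey struct {
	service, operation string
}

type spanStats struct {
	count int
	total time.Duration
}

// aggregate turns finished spans into RED-style metrics. Because these
// accumulators live in memory, related spans must keep reaching the same
// instance; this is the statefulness discussed in the next subsection.
func aggregate(stats map[spanKey]*spanStats, svc, op string, d time.Duration) {
	k := spanKey{svc, op}
	s, ok := stats[k]
	if !ok {
		s = &spanStats{}
		stats[k] = s
	}
	s.count++
	s.total += d
}

func main() {
	stats := map[spanKey]*spanStats{}
	aggregate(stats, "checkout", "handle-order", 120*time.Millisecond)
	aggregate(stats, "checkout", "handle-order", 80*time.Millisecond)
	for k, s := range stats {
		fmt.Printf("%s/%s: count=%d avg=%s\n",
			k.service, k.operation, s.count, s.total/time.Duration(s.count))
	}
}
```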
### Performance Overhead
- **Serialization/Deserialization Costs**: Data conversion within the collector involved multiple marshalling/unmarshalling steps, driving up CPU usage and memory consumption. This was likened to the inefficiency of repeatedly packing and unpacking boxes during a move (a toy cost sketch follows this list).
- **GC and CPU Impact**: Flame graphs revealed that a significant portion of CPU cycles were consumed by data transformation, affecting overall system performance.
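The cost pattern can be illustrated with a toy Go sketch: every format hop in a pipeline pays a marshal/unmarshal pair. Real collectors work with protobuf/pdata rather than JSON, but the round-trip structure, and hence the CPU and GC pressure, follows the same shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// span is a toy stand-in for telemetry data; real collectors work with
// protobuf/pdata, but the round-trip cost pattern is the same.
type span struct {
	TraceID    string            `json:"trace_id"`
	Name       string            `json:"name"`
	Attributes map[string]string `json:"attributes"`
}

func main() {
	s := span{
		TraceID:    "abc123",
		Name:       "handle-order",
		Attributes: map[string]string{"http.method": "GET"},
	}

	start := time.Now()
	for i := 0; i < 100_000; i++ {
		// Every format hop in a pipeline pays this pair of costs:
		b, _ := json.Marshal(s)   // serialize on the way out
		_ = json.Unmarshal(b, &s) // deserialize on the way back in
	}
	fmt.Println("100k marshal/unmarshal round-trips took", time.Since(start))
}
```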
### Stateful Component Requirements
- **Span Metrics Connector**: Required all spans of a trace to be processed by the same collector pod, which necessitated a Load Balancing Exporter in front of it (see the routing sketch after this list).
- **Delta to Cumulative Processor**: Required custom routing logic based on Datadog headers, leading to the development of a proxy application.
- **Operational Burden**: Maintaining these stateful components added to operational overhead, with past incidents highlighting the risks of unstable custom solutions.
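The routing idea behind the Load Balancing Exporter can be sketched in a few lines of Go: hash the trace ID so that every span of a trace lands on the same collector pod. The real exporter is more sophisticated; this only illustrates the principle.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend hashes the trace ID so that every span of a trace is routed
// to the same collector pod, which is what stateful components like the
// span metrics connector depend on.
func pickBackend(traceID string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	pods := []string{"collector-0", "collector-1", "collector-2"}
	// Spans sharing a trace ID always resolve to the same pod.
	for _, id := range []string{"trace-a", "trace-b", "trace-a"} {
		fmt.Printf("%s -> %s\n", id, pickBackend(id, pods))
	}
}
```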
## Conclusion and Reflection
### Ongoing Challenges
- **Customization vs. Standardization**: Balancing the need for enterprise-specific customizations with OpenTelemetry’s upstream standards remains a challenge.
- **Stateful Infrastructure**: Managing stateful components requires additional infrastructure and monitoring.
- **Performance Trade-offs**: The overhead of data serialization and deserialization must be mitigated through optimization.
### Long-Term Value
- **Vendor Agnosticism**: OpenTelemetry’s standardization is critical for avoiding vendor lock-in and enabling flexible observability.
- **Strategic Investment**: Despite upfront costs, the long-term benefits of a unified observability platform align with Delivery Hero’s strategic goals.
### Final Thoughts
While the migration process was fraught with challenges, OpenTelemetry’s flexibility and standardization offer significant value. Continuous optimization of custom components and upstream contributions will be essential to sustain this approach. The journey underscores the importance of OpenTelemetry as a cornerstone for modern observability in complex, multi-vendor environments.