# CNCF

OpenTelemetry at Delivery Hero: The Good, the Bad, and the Vendor

OpenTelemetry, observability, telemetry, ingestion, vendor, CNCF

## Introduction

OpenTelemetry has emerged as a critical tool for achieving observability in modern distributed systems. As organizations scale and integrate multiple services, the need for standardized telemetry data collection, processing, and export becomes paramount. Delivery Hero, a global food delivery company, faced significant challenges in unifying its observability infrastructure across 13 subsidiaries, each using different vendors and tools. This article explores their journey with OpenTelemetry, highlighting its benefits, challenges, and the trade-offs involved in adopting a vendor-agnostic approach.

## Background and Migration Motivation

### The Problem

Delivery Hero’s fragmented observability landscape stemmed from several key issues:

- **Diverse Vendors**: Each subsidiary used different observability tools, leading to siloed data and integration difficulties.
- **Data Fragmentation**: Telemetry data was stored in disparate backends, complicating unified analysis.
- **Cost Pressures**: The CTO mandated annual savings of $6 million, driving the consolidation of platforms and observability tools.

### The Goal

The migration aimed to achieve three core objectives:

1. **Vendor Neutrality**: Eliminate vendor lock-in by adopting an open standard.
2. **Standardized Semantics**: Implement consistent semantic conventions for telemetry data.
3. **Centralized Tracing**: Use tracing as the core that unifies all telemetry data (traces, metrics, logs).

## Migration Strategy and Architecture Design

### Ideal Architecture

The envisioned architecture centered around:

- A **centralized OpenTelemetry Collector agent** to manage data ingestion.
- **OpenTelemetry SDKs** across all applications, routing data through the collector to a single vendor.

### Actual Implementation

While the ideal architecture was the target, the implementation involved multiple layers:

- **Central Datadog Collector**: Converted Datadog-formatted data into the new vendor's format.
- **Central OTLP Collector**: Handled OTLP-formatted data.
- **Federated Collectors**: Supported multiple formats (OTLP, Prometheus, JSON) and fed them into the new vendor (a configuration sketch follows this list).
- **Subsidiary-Specific Collectors**: Units like Glovo and Logistics maintained independent collectors.
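To make the federated layer concrete, the sketch below shows one possible shape of such a collector's configuration: OTLP and Prometheus receivers feeding a single OTLP/HTTP export toward the new vendor. This is a minimal illustration, not Delivery Hero's actual configuration; the endpoint, scrape target, and API-key header are placeholders.

```yaml
# Hypothetical federated collector: accepts OTLP and Prometheus data
# and forwards everything to one vendor endpoint (placeholder values).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: apps
          scrape_interval: 30s
          static_configs:
            - targets: ["app:9090"]        # placeholder scrape target

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://ingest.example-vendor.com   # placeholder vendor endpoint
    headers:
      api-key: ${env:VENDOR_API_KEY}              # placeholder credential

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [otlphttp]
```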
### Migration Steps

The process was phased:

1. **Infrastructure Migration**: First, establish centralized data pipelines.
2. **Reinstrumentation**: Later, re-instrument applications to align with OpenTelemetry standards.

## Challenges and Issues

### Custom Components and Upstream Contributions

- **Forked Components**: Custom receivers such as the forked Datadog collector were necessary to support all telemetry types (traces, metrics, logs). However, these forks created compatibility issues with upstream versions, making rollbacks impossible.
- **Custom Logic**: Additional components such as the Span Metrics Connector, Prometheus Exporter, and Delta-to-Cumulative Processor were developed in-house. These required ongoing maintenance and integration with OpenTelemetry.
- **Upstream Contribution Barriers**: Slow upstream releases and the need to keep re-applying patches to custom components increased operational complexity.

### Performance Overhead

- **Serialization/Deserialization Costs**: Data conversion within the collector involved multiple marshalling and unmarshalling steps, leading to high CPU usage and memory consumption, an inefficiency likened to repeatedly packing and unpacking boxes during a move.
- **GC and CPU Impact**: Flame graphs revealed that a significant portion of CPU cycles was consumed by data transformation, affecting overall system performance.

### Stateful Component Requirements

- **Span Metrics Connector**: Required all spans of a trace to be processed by the same collector pod, necessitating a Load Balancing Exporter (a two-tier configuration sketch appears at the end of this write-up).
- **Delta-to-Cumulative Processor**: Required custom routing logic based on Datadog headers, leading to the development of a proxy application.
- **Operational Burden**: Maintaining these stateful components added operational overhead, and past incidents highlighted the risks of unstable custom solutions.

## Conclusion and Reflection

### Ongoing Challenges

- **Customization vs. Standardization**: Balancing enterprise-specific customizations against OpenTelemetry’s upstream standards remains a challenge.
- **Stateful Infrastructure**: Managing stateful components requires additional infrastructure and monitoring.
- **Performance Trade-offs**: The overhead of serialization and deserialization must be mitigated through optimization.

### Long-Term Value

- **Vendor Agnosticism**: OpenTelemetry’s standardization is critical for avoiding vendor lock-in and enabling flexible observability.
- **Strategic Investment**: Despite the upfront costs, the long-term benefits of a unified observability platform align with Delivery Hero’s strategic goals.

### Final Thoughts

While the migration was fraught with challenges, OpenTelemetry’s flexibility and standardization offer significant value. Continuous optimization of custom components and ongoing upstream contributions will be essential to sustain this approach. The journey underscores the importance of OpenTelemetry as a cornerstone for modern observability in complex, multi-vendor environments.
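To illustrate the stateful-component setup described above, here is a minimal two-tier sketch using components available in the OpenTelemetry Collector contrib distribution: a stateless first tier shards spans by trace ID with the `loadbalancing` exporter, and a stateful second tier runs the `spanmetrics` connector so each trace is aggregated on a single pod. Service names and endpoints are placeholders; this is a pattern sketch, not Delivery Hero's actual configuration.

```yaml
# Tier 1 collector (its own config file): stateless, shards by trace ID so
# every span of a given trace reaches the same tier-2 pod.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true    # TLS between tiers is out of scope for this sketch
    resolver:
      dns:
        hostname: otel-tier2-headless.observability.svc.cluster.local  # placeholder
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# Tier 2 collector (separate config file): stateful, derives metrics
# from spans and exports both signals to the vendor.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
connectors:
  spanmetrics: {}
exporters:
  otlphttp:
    endpoint: https://ingest.example-vendor.com   # placeholder vendor endpoint
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlphttp]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp]
```

The two-tier split keeps the stateful tier consistent, but it is also precisely the additional infrastructure the conclusion above warns about.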

Securing OpenTelemetry Telemetry in Transit with TLS

OpenTelemetry, Telemetry in Transit, Encryption, CNCF

## Introduction

OpenTelemetry has become a cornerstone for observability in modern distributed systems, enabling the collection and analysis of telemetry data such as traces, metrics, and logs. However, transmitting this data without encryption poses significant risks, including exposure of sensitive information and potential exploitation by attackers. This article explores the critical need for encrypting telemetry data in transit using TLS, relates this to regulatory requirements, and provides a practical guide to implementing secure configurations within the OpenTelemetry ecosystem.

## Problem Overview

### Information Exposure Risks

Unencrypted telemetry data transmitted via OpenTelemetry contains sensitive details such as hostnames, operating systems, Java versions, database types, connection strings, and SQL queries. These details can be extracted easily with tools like Wireshark, enabling attackers to exploit vulnerabilities or target specific systems.

### Regulatory Compliance

Multiple regulations mandate encryption during data transmission:

- **FedRAMP SC-8** (United States)
- **GDPR, UK GDPR, ENISA NIS2** (European Union/UK)
- **HIPAA** (healthcare sector)

## TLS Configuration for OpenTelemetry

### TLS Features

TLS provides encrypted communication channels with certificate-based authentication, ensuring data confidentiality and integrity: clients can verify the collector's identity, and data remains readable only to the communicating endpoints.

### OpenTelemetry Collector Setup

Configure TLS in `collector.yaml` to enforce secure communication:

- **Minimum TLS version**: TLS 1.3 is recommended for enhanced security.
- **Server certificates**: Use `.crt` files for certificates and `.key` files for private keys.
- **FIPS compliance**: Use OpenSSL builds with the FIPS module and enable `-provider fips` to restrict the available algorithms.

## Certificate Generation and Configuration

### Self-Signed Certificate Creation

Generate a self-signed certificate using OpenSSL:

```bash
openssl req -new -x509 -nodes -days 365 -out collector.crt -keyout collector.key
```

- **Certificate fields**: Fill in the subject fields (e.g., `O`, `L`, `C`, `CN`) so that the certificate matches the collector's DNS name.

### Trust Chain Validation

- **Public certificates**: Rely on trusted certificate authorities (e.g., Let's Encrypt) for chain validation.
- **Self-signed certificates**: Manually add the root certificate to client truststores to avoid `unable to find valid certification path` errors in Java environments.

## Testing and Trust Chain Challenges

### TLS Connection Verification

Test TLS 1.3 connectivity to the OpenTelemetry Collector's port 4318 using `openssl s_client`. Self-signed certificates may trigger trust chain failures in Java clients.

### Trust Chain Management

Ensure certificates are validated through a trusted chain. In isolated networks, verify certificate authenticity to prevent spoofing attacks.

## Configuration Challenges and Best Practices

### Java Client Configuration

- Transition endpoints from `http` to `https`.
- Address certificate trust issues by explicitly managing truststores.

### Network Isolation Considerations

- Validate certificate sources to prevent spoofing.
- Regularly rotate certificates and store private keys securely.

## Technical Implementation Details

### Port and TLS Configuration

- **Default port**: OTLP over HTTP defaults to port 4318, which carries plaintext unless TLS is configured. Consider a clearly designated TLS port (e.g., 465) so encrypted and unencrypted endpoints are easy to tell apart.
- **TLS overhead**: Expect some additional latency and resource usage from encryption in production environments.

### mTLS Recommendations

- Implement mutual TLS (mTLS) for stronger authentication: clients present their own certificates, as outlined in the OpenTelemetry Collector documentation.
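Tying the points above together, here is a minimal `collector.yaml` sketch that enforces TLS 1.3 on the OTLP/HTTP receiver. The certificate paths are placeholders for wherever the files generated earlier are deployed, and the commented line shows where mTLS client verification would be enabled.

```yaml
# Minimal sketch: OTLP/HTTP receiver with TLS 1.3 enforced.
# Certificate paths are placeholders for the files created with OpenSSL above.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otelcol/certs/collector.crt
          key_file: /etc/otelcol/certs/collector.key
          min_version: "1.3"
          # For mutual TLS, additionally require and verify client certificates:
          # client_ca_file: /etc/otelcol/certs/clients-ca.crt

exporters:
  debug: {}    # stand-in exporter so the sketch forms a complete pipeline

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

Clients then send to `https://<collector-host>:4318`; with a self-signed certificate, Java clients additionally need `collector.crt` imported into their truststore, as discussed above.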
## Security Considerations

### Default Security Mechanisms

- Advocate for TLS being enabled by default in OpenTelemetry to mitigate plaintext risks.

### Trust Chain Challenges

- Self-signed certificates lack a full trust chain and therefore require manual CA management.

### Development vs. Production Environments

- **Development**: Use self-signed certificates for rapid validation.
- **Production**: Adhere to strict certificate management protocols, including CA infrastructure and automated renewal processes.

## Conclusion

Encrypting telemetry data in transit is essential to comply with regulatory standards and to protect sensitive information. Implementing TLS with proper certificate management, trust chain validation, and mTLS configuration ensures secure OpenTelemetry operations. Addressing challenges such as Java truststore management and network isolation is critical for robust security. By prioritizing encryption and adhering to best practices, organizations can safeguard their observability infrastructure against potential threats.

From Sampling To Full Visibility: Scaling Tracing To Trillions of Spans

tracing, sampling, SNMP, dashboards, spans, CNCF

## Introduction

In modern software systems, observability has evolved from basic network monitoring into a comprehensive practice that integrates logs, metrics, traces, and advanced analytics. As systems scale to trillions of spans, balancing data volume against diagnostic accuracy becomes critical. This article traces the journey from sampling-based tracing to full visibility, highlighting the technical innovations that make observability scalable.

## The Evolution of Observability

### 1. Network Monitoring Phase

- **SNMP (Simple Network Management Protocol)** was used to monitor hardware metrics like CPU usage and memory.
- Administrators polled network devices and relied on threshold-based alerts.

### 2. Service Monitoring Phase

- Host and service checks were introduced to track service health.
- Red/green-light dashboards provided visual indicators of service status.

### 3. Monitoring Phase

- More advanced dashboards and deeper insights emerged.
- Metrics were used to monitor service health over time.

### 4. Observability Phase

- The integration of **Logs**, **Metrics**, and **Traces** formed the three pillars of observability.
- **Logs** describe *what happened*, **Metrics** describe *when it happened*, and **Traces** describe *where it happened*.

### 5. Expansion to the Six Pillars Model

- Additional pillars such as **Profiling**, **Real User Monitoring (RUM)**, and **Synthetic Testing** were added.
- This enhanced end-to-end visibility and proactive monitoring capabilities.

### 6. AI-Driven Observability

- AI algorithms detect anomalies that humans might miss.
- **OpenTelemetry** standardizes data collection for consistent tracing.

## Debugging Workflow and Tracing Value

### Typical Debugging Steps

1. **Receive alerts** (user complaints, monitoring notifications).
2. **Identify abnormal services** via dashboards (peaks, drops, pattern changes).
3. **Analyze trace data** for the affected services.
4. **Trace request paths** to identify root causes.

### Core Value of Tracing

- **End-to-end visibility** of request paths.
- **Time-series analysis** (duration over time) for rapid anomaly detection.
- **Visual differentiation** using red dots (errors) and purple dots (normal traces).
- **Intuitive comparison** of long traces (large dots) and short traces (small dots).

## Sampling Challenges and Risks

### The Sampling Dilemma

- A **common 1% sampling rate** (1 trace kept per 100 requests) risks discarding the very error traces needed for debugging.
- Example: at 10 billion daily transactions, 1% sampling still keeps 100 million traces, yet the few failing requests may not be among them.

### Root Causes of Sampling Issues

- **Data volume explosion** (the arithmetic is worked out below):
  - 100 billion daily transactions → 30 trillion spans.
  - Each span ~2 KB → 60 PB of data.
  - Storage requirement: roughly 60 SSD servers per day.
- **Processing overhead**:
  - Cross-cloud coordination and data structure management.
  - Impact on application performance.
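Spelling out the arithmetic behind those figures (the roughly 300 spans per transaction is implied by the stated totals rather than given explicitly):

$$
10^{11}\ \text{transactions/day} \times 300\ \text{spans/transaction} = 3 \times 10^{13}\ \text{spans/day}
$$

$$
3 \times 10^{13}\ \text{spans/day} \times 2\ \text{KB/span} = 6 \times 10^{16}\ \text{bytes/day} \approx 60\ \text{PB/day}
$$

This matches the quoted 60 SSD servers per day if each server holds roughly 1 PB of usable capacity.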
### Limitations of Tail Sampling

- Keeping error/slow traces requires tail-based decisions: every span of a request must be collected and held until the request completes.
- Most vendors either do not support tail sampling or charge high fees for it.
- Technical challenges include cross-cloud coordination and the data structures needed to buffer in-flight traces.

## Technical Solutions and Optimization

### 1. Columnar Storage

- **Compression techniques** tailored to similar data types (timestamps, strings, numbers).
- **Compression ratio**: a 10–20x improvement.
- **Examples**:
  - Timestamps: delta encoding.
  - Strings: dictionary encoding.
  - Numbers: bit packing.

### 2. Probabilistic Filters

- **Bloom filters** accelerate trace-ID existence checks.
- They reduce lookup overhead and act as a lightweight index.

### 3. Partitioning Strategies

- **Time-based partitioning**: fast access to recent data.
- **Trace-ID partitioning**: group spans by trace ID for efficient querying.
- Both enhance query performance and enable parallel processing.

### 4. Special Sharding

- **Separate error/slow spans** into dedicated storage.
- **Query optimization**: trace ID table → span table → direct error-span retrieval.
- Query efficiency improved by 50–100x.

### 5. Tiered Storage Architecture

- **Hot storage (0–7 days)**: high-performance SSDs.
- **Warm storage (8–30 days)**: cost-effective HDDs.
- **Cold storage (>30 days)**: object storage (e.g., S3/GCS).
- **Automated data migration**:
  - Classify data by age.
  - Extract error/slow spans into dedicated storage.

### 6. Cost Optimization Strategies

- **Focus on the ~1% of data containing errors** when optimizing storage.
- **Dedicated storage structures** reduce both storage and query costs.
- The result: faster root-cause diagnosis and improved **MTTR (Mean Time to Repair)**.

## Conclusion

### Core Takeaways

- **Tracing provides high-fidelity root-cause diagnosis** and full visibility into system state.
- **Sampling carries risks**, including blind spots and debugging dead ends; balance data volume against diagnostic accuracy.
- **Technical solutions** such as columnar storage, intelligent sharding, and tiered architecture reduce cost and latency while enhancing observability.

### Recommendations

- Implement **tail sampling** with dedicated storage for error/slow traces (a collector-side sketch follows this list).
- Leverage **columnar storage** and **probabilistic filters** for efficient data management.
- Adopt **tiered storage** to optimize costs and query performance.
- Integrate **AI-driven observability** for anomaly detection and proactive monitoring.
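On the collector side, the tail-sampling recommendation above maps onto the `tail_sampling` processor from the OpenTelemetry Collector contrib distribution. The sketch below keeps every error trace, every trace slower than two seconds, and a 1% probabilistic baseline of everything else; the thresholds, buffer sizes, and backend endpoint are illustrative choices, not values taken from this article.

```yaml
# Illustrative tail-based sampling: keep errors, keep slow traces,
# and retain a 1% probabilistic baseline of the rest.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 30s        # hold spans until the trace is (likely) complete
    num_traces: 100000        # in-flight traces buffered in memory
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

exporters:
  otlphttp:
    endpoint: https://traces.example-backend.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
```

Because the sampling decision needs every span of a trace on one collector instance, this setup is typically paired with trace-ID-aware load balancing in front of the sampling tier, which is exactly the coordination cost the article highlights.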