Observability Day 2024: CNCF Ecosystem Advancements and Community Growth

Observability has become a cornerstone of modern cloud-native systems, enabling teams to monitor, debug, and optimize complex distributed architectures. The Cloud Native Computing Foundation (CNCF) continues to drive innovation in this space, with its ecosystem of projects evolving rapidly. This article highlights key updates from major observability tools within the CNCF landscape, emphasizing their technical advancements, community engagement, and integration capabilities.

Prometheus: Scaling Observability with Enhanced Features

Prometheus, a foundational metrics collection and monitoring system, has seen significant progress in 2024. The past 12 months were the project’s most active to date, with activity on par with other major CNCF projects, and it ranks second or third among them by GitHub stars. Key updates include:

  • Governance Overhaul: A new governance model has been introduced to enhance scalability and inclusivity, with 25 new team members joining and maintainers nearly doubling.
  • Version 3 Release: Prometheus 3.0 ships a redesigned UI, removes most feature flags by making the behavior they gated the default, and comes with migration guides. Histograms now support custom bucket boundaries, which aligns with OpenTelemetry’s histogram model while retaining native histogram support.
  • Protocol Enhancements: Prometheus Remote Write 2.0 is now available and offers improved performance and efficiency, though it remains experimental. OpenMetrics v2 is under development and incorporates lessons learned from v1.
  • Integration with OpenTelemetry: Prometheus can now act as an OTLP receiver, ingesting OpenTelemetry metrics directly. The delta-to-cumulative converter handles delta metrics on ingestion without requiring a separate OTEL Collector.
  • Flexibility and Compatibility: Metric and label names may now contain UTF-8 characters, and explicit unit and type handling improves the processing of resource attributes. Delta temporality support is slated for imminent release.
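
To make the delta-to-cumulative idea concrete, the following Python sketch keeps a running total per series and emits cumulative samples. It is a minimal illustration under stated assumptions, not Prometheus’s implementation: the Sample tuple, the series keys, and the function name delta_to_cumulative are hypothetical, and the sketch ignores counter resets, out-of-order data, and staleness.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A sample is (series_key, timestamp, value). These names are illustrative only.
Sample = Tuple[str, int, float]


def delta_to_cumulative(samples: Iterable[Sample]) -> List[Sample]:
    """Turn delta samples into cumulative samples, one running total per series."""
    totals: Dict[str, float] = defaultdict(float)
    out: List[Sample] = []
    # Process in timestamp order so each total reflects all earlier deltas.
    for key, ts, delta in sorted(samples, key=lambda s: s[1]):
        totals[key] += delta
        out.append((key, ts, totals[key]))
    return out


deltas = [
    ('http_requests{job="api"}', 10, 5.0),
    ('http_requests{job="api"}', 20, 3.0),
    ('http_requests{job="web"}', 10, 2.0),
    ('http_requests{job="api"}', 30, 4.0),
]
cumulative = delta_to_cumulative(deltas)
```

Here the "api" series ends at a cumulative value of 12.0 (5 + 3 + 4), while the "web" series is tracked independently. A real converter would also need to bound the state it keeps per series.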

Fluent Bit: Enhanced Data Processing and Deployment Flexibility

Fluent Bit, a lightweight log processor, has introduced features to improve data sampling, configuration, and deployment:

  • Sampling Mechanisms: Head Sampling and Tail Sampling allow conditional retention or discarding of trace data, optimizing resource usage.
  • Conditional Processing: Log processors now support conditional logic to filter logs based on predefined criteria.
  • TLS Configuration: Users can specify TLS versions and cipher suites, and variables can be set from files on the filesystem.
  • Language Support: Zig language support has been added, with a new SDK that enhances interoperability with C.
  • Deployment Flexibility: Fluent Bit supports cloud, on-premises, and edge environments, with Helm charts and Kubernetes operators available.
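
The difference between head and tail sampling can be sketched in a few lines of Python. This is a conceptual illustration, not Fluent Bit’s configuration or code: the Span class, the function names, and the thresholds are all assumptions made for the example. Head sampling decides from the trace ID alone, before any spans are seen; tail sampling decides after the trace is complete, so it can look at errors and latency.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start from the trace ID only, so every span of a trace agrees."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # deterministic 64-bit value per trace
    return value < int(ratio * 2**64)  # keep roughly `ratio` of all traces


def tail_sample(spans: List[Span], slow_ms: float) -> bool:
    """Decide after the trace completes: keep it if any span failed or was slow."""
    return any(s.error or s.duration_ms >= slow_ms for s in spans)


trace = [Span("a1", 12.0), Span("a1", 900.0)]
keep_by_tail = tail_sample(trace, slow_ms=500.0)  # True: one span took 900 ms
```

Head sampling is cheap because nothing has to be buffered, but it cannot favor interesting traces. Tail sampling can, at the cost of holding spans in memory until the trace finishes.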

Jaeger: Deep Integration with OpenTelemetry and Advanced Visualization

Jaeger, a distributed tracing platform, has undergone architectural changes to align with OpenTelemetry:

  • OpenTelemetry Integration: Jaeger v2 integrates fully with the OpenTelemetry ecosystem, using a customized build of the OTEL Collector for data ingestion.
  • Storage API Refactor: The storage API now handles OTEL data natively while remaining compatible with legacy v1 data.
  • UI Enhancements: New topology views and visualization tools, including flame graphs and timelines, improve observability. ClickHouse and OpenSearch are now supported, with real-time metric calculations in OpenSearch.
  • Kafka Integration: Native Kafka support enhances data ingestion performance.
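
As an illustration of the kind of metrics that can be derived from trace data, the Python sketch below computes request rate, error ratio, and 95th-percentile latency per service from a batch of spans. It is a sketch under assumptions, not how Jaeger or OpenSearch perform these calculations: SpanRecord, red_metrics, the fixed time window, and the nearest-rank percentile method are all choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SpanRecord:
    service: str
    duration_ms: float
    error: bool


def red_metrics(spans: List[SpanRecord], window_s: float) -> Dict[str, Dict[str, float]]:
    """Compute rate, error ratio, and p95 duration per service over one window."""
    by_service: Dict[str, List[SpanRecord]] = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    result: Dict[str, Dict[str, float]] = {}
    for service, group in by_service.items():
        durations = sorted(s.duration_ms for s in group)
        rank = max(1, math.ceil(0.95 * len(durations)))  # nearest-rank percentile
        result[service] = {
            "rate_per_s": len(group) / window_s,
            "error_ratio": sum(s.error for s in group) / len(group),
            "p95_ms": durations[rank - 1],
        }
    return result


spans = [
    SpanRecord("checkout", 10.0, False),
    SpanRecord("checkout", 20.0, False),
    SpanRecord("checkout", 30.0, True),
    SpanRecord("checkout", 400.0, False),
]
metrics = red_metrics(spans, window_s=60.0)
```

With only four spans, the nearest-rank p95 is simply the slowest span. A production system would compute these values incrementally over sliding windows rather than over an in-memory list.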

OpenTelemetry: Expanding Ecosystem and Community Reach

OpenTelemetry, a critical observability framework, has made strides in standardization and tooling:

  • SDK 2.0: The JavaScript SDK 2.0 introduces tree shaking and optimizations, with migration guides available.
  • Semantic Conventions: Version 2 of the database semantic conventions is in development, with a stable release expected next month.
  • Community: 50% of contributors are now based outside the US, and European contributors are calling for increased regional engagement.
  • Compile-Time Instrumentation: A Go SIG for compile-time instrumentation aims to improve observability for Go applications. Continuous profiling, integrated with eBPF, is under development for unified instrumentation.
  • Generative AI Integration: Growing adoption of OpenTelemetry in generative AI frameworks highlights its expanding relevance.
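
A change in semantic conventions typically means renaming attribute keys on existing telemetry. The Python sketch below shows one way such a migration could look for database attributes. The rename table is an illustrative subset and should be treated as an assumption to verify against the published specification; the function name migrate_db_attributes is hypothetical and not part of any OpenTelemetry SDK.

```python
from typing import Any, Dict

# Illustrative subset of old-to-new database attribute names. This table is an
# assumption for the example; check it against the published conventions.
DB_ATTRIBUTE_RENAMES: Dict[str, str] = {
    "db.system": "db.system.name",
    "db.name": "db.namespace",
    "db.statement": "db.query.text",
    "db.operation": "db.operation.name",
}


def migrate_db_attributes(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with old keys renamed; an existing new-style key wins."""
    migrated: Dict[str, Any] = {}
    for key, value in attrs.items():
        new_key = DB_ATTRIBUTE_RENAMES.get(key, key)
        if key != new_key and new_key in migrated:
            continue  # the new-style attribute is already present, keep it
        migrated[new_key] = value
    return migrated


old_attrs = {"db.system": "postgresql", "db.statement": "SELECT 1", "net.peer.port": 5432}
new_attrs = migrate_db_attributes(old_attrs)
```

Attributes outside the rename table pass through unchanged, which lets the same function run safely over telemetry that has already been migrated.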

Technical Summary

The CNCF observability ecosystem is advancing rapidly, with key projects like Prometheus, Fluent Bit, Jaeger, and OpenTelemetry driving innovation. Prometheus’s V3 release and OpenMetrics v2 development, Fluent Bit’s enhanced sampling and deployment flexibility, Jaeger’s OTEL integration, and OpenTelemetry’s semantic conventions and SDK updates collectively strengthen the observability landscape. These tools are designed to work together, enabling scalable, interoperable, and efficient monitoring solutions.

Community and CNCF Engagement

The CNCF community continues to grow, with 50% of OpenTelemetry contributors based outside the US. Initiatives like the OpenTelemetry Certified Associate program and free training courses aim to democratize access to observability expertise. Events such as the 2024 Observability Day, planned for June and potentially August in Amsterdam, will further foster collaboration. The community’s emphasis on inclusivity and regional engagement ensures sustained innovation and adoption.

Cortex, OpenSearch, Thanos, and Cilium: Expanding the Ecosystem

  • Cortex: Offers horizontally scalable, Prometheus-compatible APIs and low-latency queries, and is stabilizing previously experimental features. Upcoming features include native histogram ingestion.
  • OpenSearch: A fork of Elasticsearch, now hosted under the Linux Foundation, focuses on distributed search, vector database capabilities, and real-time analytics. Version 3.0 plans include Lucene 10 integration and improved data processing.
  • Thanos: Enhances long-term storage and global query capabilities. Version 0.38 introduces OTLP protocol support, native histograms, and distributed query execution.
  • Cilium: Advances eBPF-based networking and security with Windows support for ebpf-go, dynamic metrics configuration, and Hubble’s improved event filtering.

Conclusion

The CNCF observability ecosystem is evolving rapidly, driven by community collaboration and technical innovation. Projects like Prometheus, OpenTelemetry, and Thanos are setting new standards for scalability, interoperability, and ease of use. As organizations adopt these tools, the focus on community-driven development ensures that observability remains a cornerstone of cloud-native infrastructure. By leveraging these advancements, teams can build more resilient, transparent, and efficient systems.