From Sampling to Full Visibility: Scaling Tracing to Trillions of Spans

Introduction

In modern software systems, observability has evolved from basic network monitoring into a comprehensive practice that integrates logs, metrics, traces, and advanced analytics. As systems scale to handle trillions of spans, balancing data volume against diagnostic accuracy becomes critical. This article follows the journey from sampling-based tracing to full visibility and highlights the technical innovations that make observability scalable at that volume.

The Evolution of Observability

1. Network Monitoring Phase

  • SNMP (Simple Network Management Protocol) was used to monitor hardware metrics like CPU usage and memory.
  • Administrators relied on periodic polling and threshold-based alerts to catch network issues proactively.

2. Service Monitoring Phase

  • Host and service checks were introduced to track service health.
  • Red-green light dashboards provided visual indicators of service status.

3. Metrics-Based Monitoring Phase

  • Advanced dashboards and deeper insights emerged.
  • Metrics were used to monitor service health over time.

4. Observability Phase

  • Integration of Logs, Metrics, and Traces formed the three pillars of observability.
  • Logs describe what happened, Metrics describe when it happened, and Traces describe where it happened.

5. Expansion to the Six Pillars Model

  • Additional pillars like Profiling, Real User Monitoring (RUM), and Synthetic Testing were added.
  • This enhanced end-to-end visibility and proactive monitoring capabilities.

6. AI-Driven Observability

  • AI algorithms detect anomalies that humans might miss.
  • OpenTelemetry standardizes data collection for consistent tracing.
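
As a concrete example of that standardization, the sketch below emits a single span with the OpenTelemetry Python SDK and prints it to the console. It assumes the `opentelemetry-sdk` package is installed, and the service, span, and attribute names are made up for illustration.

```python
# Minimal OpenTelemetry setup: one tracer, one span, exported to the console
# so the example stays self-contained. All names below are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # business context rides along with the span
```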

Debugging Workflow and Tracing Value

Typical Debugging Steps

  1. Receive alerts (user complaints, monitoring notifications).
  2. Identify abnormal services via dashboards (peaks, drops, pattern changes).
  3. Analyze trace data for specific services.
  4. Trace request paths to identify root causes.

Core Value of Tracing

  • End-to-end visibility of request paths.
  • Time-series analysis (Duration over Time) for rapid anomaly detection.
  • Visual differentiation using red dots (errors) and purple dots (normal traces).
  • Intuitive comparison of long (large dots) and short (small dots) traces.

Sampling Challenges and Risks

The Sampling Dilemma

  • Common 1% sampling rate (1 trace per 100 requests) risks losing critical error traces.
  • Example: 10 billion daily transactions → 100 million sampled traces, and the sample may still contain none of the error traces needed for diagnosis.
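
A quick back-of-the-envelope sketch makes the risk concrete. With head-based sampling, the keep/drop decision is made before anything has gone wrong, so error traces are dropped at the same 99% rate as everything else; the error count below is hypothetical.

```python
# Probability that head-based 1% sampling keeps none of the day's error traces.
# The number of error traces is a made-up illustration.
sampling_rate = 0.01      # 1 trace kept per 100 requests
error_traces = 50         # suppose 50 requests failed today

p_all_errors_dropped = (1 - sampling_rate) ** error_traces
print(f"P(every error trace was dropped) = {p_all_errors_dropped:.1%}")  # ~60.5%
```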

Root Causes of Sampling Issues

  • Data Volume Explosion:
    • 100 billion daily transactions → 30 trillion spans.
    • Each span ~2 KB → ~60 PB of data per day.
    • Storage requirement: roughly 60 SSD servers per day (the volume arithmetic is sketched after this list).
  • Processing Overhead:
    • Cross-cloud coordination, data structure management.
    • Impact on application performance.
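
The volume figures above can be sanity-checked with simple arithmetic. The spans-per-transaction value below is implied by the stated totals rather than measured, so treat it as an assumption.

```python
# Back-of-the-envelope check of the data-volume figures in the list above.
transactions_per_day = 100e9          # 100 billion transactions
spans_per_transaction = 300           # implied by the 30-trillion-span figure
bytes_per_span = 2_000                # ~2 KB per span

spans_per_day = transactions_per_day * spans_per_transaction
petabytes_per_day = spans_per_day * bytes_per_span / 1e15

print(f"{spans_per_day:.0e} spans/day")   # 3e+13, i.e. 30 trillion
print(f"{petabytes_per_day:.0f} PB/day")  # 60 PB
```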

Limitations of Tail Sampling

  • Tail sampling is what reliably retains error/slow traces, but the keep/drop decision can only be made after the whole trace completes.
  • Most vendors either do not support it or charge high fees for it.
  • Technical challenges:
    • Full span collection until request completion.
    • Cross-cloud coordination and data structure management.
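
The buffering burden is easier to see in code. Below is a minimal, single-process tail-sampling sketch, not any vendor's implementation; the span fields (`trace_id`, `is_root`, `error`, `duration_ms`) and the 500 ms slow threshold are assumptions.

```python
# Minimal tail-sampling sketch: buffer every span of every trace until the
# root span finishes, then keep the trace only if it erred or ran slow.
from collections import defaultdict

SLOW_THRESHOLD_MS = 500  # illustrative threshold

class TailSampler:
    def __init__(self):
        self.buffers = defaultdict(list)  # trace_id -> spans seen so far

    def on_span(self, span: dict) -> None:
        # Every span must be held in memory until the whole trace is known.
        self.buffers[span["trace_id"]].append(span)
        if span.get("is_root"):
            self._decide(span["trace_id"])

    def _decide(self, trace_id: str) -> None:
        spans = self.buffers.pop(trace_id)
        keep = any(s.get("error") for s in spans) or \
               any(s["duration_ms"] > SLOW_THRESHOLD_MS for s in spans)
        if keep:
            self._export(spans)  # ship to long-term storage
        # otherwise the entire trace is discarded after the fact

    def _export(self, spans: list) -> None:
        print(f"keeping trace {spans[0]['trace_id']} ({len(spans)} spans)")
```

Every span of every in-flight trace sits in memory until its root span completes, and in a cross-cloud deployment those spans first have to be routed to the same decision point, which is where the coordination cost comes from.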

Technical Solutions and Optimization

1. Columnar Storage

  • Compression techniques for similar data types (timestamps, strings, numbers).
  • Compression ratio: 10–20x improvement.
  • Examples:
    • Timestamps using delta encoding.
    • Strings using dictionary encoding.
    • Numbers using bit packing.
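
The toy functions below illustrate two of these encodings, delta encoding for timestamps and dictionary encoding for repetitive strings. They show why columns of similar values compress well; they are not a stand-in for a real columnar engine, and the sample values are invented.

```python
# Toy column encodings: store small deltas instead of full timestamps, and
# small integer codes plus a dictionary instead of repeated strings.
def delta_encode(timestamps: list[int]) -> list[int]:
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def dict_encode(values: list[str]) -> tuple[dict[str, int], list[int]]:
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

ts = [1_700_000_000_000, 1_700_000_000_003, 1_700_000_000_007]
print(delta_encode(ts))  # [1700000000000, 3, 4] -- deltas fit in a few bits

services = ["checkout", "checkout", "payment", "checkout"]
print(dict_encode(services))  # ({'checkout': 0, 'payment': 1}, [0, 0, 1, 0])
```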

2. Probabilistic Filters

  • Bloom Filters accelerate trace ID existence checks.
  • Serves as a lightweight index with minimal storage overhead: a negative result means the trace ID is definitely absent, so queries can skip whole blocks.
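
A minimal Bloom filter sketch makes the "could this block contain trace ID X?" check concrete. A production system would use a tuned library; the bit-array size and hash count here are arbitrary.

```python
# Tiny Bloom filter: a negative answer means "definitely not present", so
# queries can skip storage blocks without scanning them.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, trace_id: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{trace_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, trace_id: str) -> None:
        for pos in self._positions(trace_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, trace_id: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(trace_id))

bf = BloomFilter()
bf.add("4bf92f3577b34da6a3ce929d0e0e4736")
print(bf.might_contain("4bf92f3577b34da6a3ce929d0e0e4736"))  # True
print(bf.might_contain("00000000000000000000000000000000"))  # almost certainly False
```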

3. Partitioning Strategies

  • Time-based partitioning: Fast access to recent data.
  • Trace ID partitioning: Group spans by trace ID for efficient querying.
  • Enhances query performance and parallel processing.
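
One way to combine the two schemes is to key each span by an hour bucket plus a hash of its trace ID, as in the illustrative sketch below; the hourly granularity and 64-shard count are assumptions, not prescriptions.

```python
# Combined partition key: time bucket for recency queries, trace-ID shard so
# all spans of one trace land together.
from datetime import datetime, timezone

NUM_TRACE_SHARDS = 64  # illustrative shard count

def partition_for(span_start_unix_ms: int, trace_id: str) -> str:
    hour_bucket = datetime.fromtimestamp(
        span_start_unix_ms / 1000, tz=timezone.utc).strftime("%Y%m%d%H")
    trace_shard = int(trace_id[:8], 16) % NUM_TRACE_SHARDS
    return f"spans/{hour_bucket}/shard={trace_shard:02d}"

print(partition_for(1_700_000_000_000, "4bf92f3577b34da6a3ce929d0e0e4736"))
# -> spans/2023111422/shard=53
```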

4. Special Sharding

  • Separate error/slow spans into dedicated storage.
  • Query optimization:
    • Trace ID table → Span table → Direct error span retrieval.
    • Query efficiency improved by 50–100x.
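
A minimal sketch of that write path and lookup is shown below, with in-memory lists standing in for the dedicated error/slow store and the bulk span table; all names and the 500 ms threshold are illustrative.

```python
# Route error/slow spans into a small dedicated store at write time so that
# error queries never touch the bulk span table.
SLOW_THRESHOLD_MS = 500

error_slow_store: list[dict] = []   # dedicated storage, roughly 1% of spans
main_store: list[dict] = []         # bulk columnar storage for everything else
trace_index: dict[str, str] = {}    # trace_id -> which store holds its spans of interest

def write_span(span: dict) -> None:
    if span.get("error") or span["duration_ms"] > SLOW_THRESHOLD_MS:
        error_slow_store.append(span)
        trace_index[span["trace_id"]] = "error_slow"
    else:
        main_store.append(span)

def find_error_spans(trace_id: str) -> list[dict]:
    # Trace ID index -> dedicated store -> direct retrieval.
    if trace_index.get(trace_id) != "error_slow":
        return []
    return [s for s in error_slow_store if s["trace_id"] == trace_id]
```

Because the spans that matter for diagnosis sit in a store that is a small fraction of the total, an error query reads a tiny table instead of scanning everything.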

5. Tiered Storage Architecture

  • Hot storage (0–7 days): High-performance SSDs.
  • Warm storage (8–30 days): Cost-effective HDDs.
  • Cold storage (>30 days): Object storage (e.g., S3/GCS).
  • Automated data migration:
    • Classify data by age.
    • Extract error/slow spans to dedicated storage.
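
The migration policy reduces to a simple age classification, sketched below with the tier boundaries listed above; the function and tier names are illustrative.

```python
# Age-based tier classification for automated migration.
from datetime import datetime, timedelta, timezone

def tier_for(span_time: datetime, now: datetime) -> str:
    age = now - span_time
    if age <= timedelta(days=7):
        return "hot"   # high-performance SSDs
    if age <= timedelta(days=30):
        return "warm"  # cost-effective HDDs
    return "cold"      # object storage such as S3/GCS

now = datetime.now(timezone.utc)
for days_old in (1, 10, 90):
    print(f"{days_old:>2} days old -> {tier_for(now - timedelta(days=days_old), now)}")
```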

6. Cost Optimization Strategies

  • Focus storage optimization on the ~1% of data that is error/slow, since that is what diagnosis actually needs.
  • Dedicated storage structures reduce costs for storage and querying.
  • Accelerate root-cause diagnosis and reduce MTTR (Mean Time to Repair).

Conclusion

Core Takeaways

  • Tracing provides high-fidelity root-cause diagnosis and full system state visibility.
  • Sampling risks include blind spots and debugging challenges; balance data volume with diagnostic accuracy.
  • Technical solutions like columnar storage, intelligent sharding, and tiered architecture reduce costs and latency while enhancing observability.

Recommendations

  • Implement tail sampling with dedicated storage for error/slow traces.
  • Leverage columnar storage and probabilistic filters for efficient data management.
  • Adopt tiered storage to optimize costs and query performance.
  • Integrate AI-driven observability for anomaly detection and proactive monitoring.