Beyond the Ephemeral: Mastering Serverless Metrics at Scale With Shopify

Introduction

As serverless architectures gain traction, the need for robust metrics instrumentation becomes critical. Shopify, a global e-commerce platform, faced unique challenges in scaling observability to meet the demands of its distributed infrastructure. This article explores how Shopify leveraged serverless technologies, CNCF standards, and advanced metric platforms to achieve scalable observability, addressing the complexities of metrics ingestion, routing, and cardinality control.

Globalization and Observability Needs

Shopify processes cross-regional traffic and ensures service availability for millions of users. From 2020 to 2024, its Black Friday/Cyber Monday GMV doubled, necessitating a more resilient observability architecture. The system must handle peak loads of 100 GB of logs per second, 12 million spans, and 350 million metric samples, demanding a scalable and efficient metric platform.

Service Architecture and Deployment Model

Shopify’s infrastructure runs entirely on Kubernetes within Google Cloud. Engineers create "runtime manifests" that are converted into Kubernetes manifests by the production platform. Observability is automatically integrated for logs, traces, and metrics. Serverless support via Cloud Run enables self-contained applications with independent databases and Kafka consumers, auto-scaling to zero replicas and zero cost. However, challenges like network isolation, data transmission efficiency, and node unpredictability emerged.

Metric Processing Architecture Design

Metric Protocol Selection

Existing Ruby applications used StatsD (a text-based protocol), while new services adopted OpenTelemetry (OTLP). To bridge the gap, Shopify developed a custom otel-collector branch for StatsD-to-OTLP conversion, later merged into upstream versions.
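The core of such a conversion can be pictured with a minimal sketch. This is illustrative only (the field names and the subset of StatsD syntax handled here are assumptions); the real otel-collector StatsD receiver covers many more cases.

```python
# Minimal sketch of StatsD-to-OTLP-style conversion (illustrative only;
# the real otel-collector StatsD receiver is far more complete).

def parse_statsd(line: str) -> dict:
    """Parse a StatsD line like 'checkout.latency:320|ms|#shop:123'."""
    name, rest = line.split(":", 1)
    fields = rest.split("|")
    value = float(fields[0])
    metric_type = fields[1]  # c = counter, g = gauge, ms = timer, h = histogram
    attributes = {}
    for field in fields[2:]:
        if field.startswith("#"):  # DogStatsD-style tags become OTLP attributes
            for tag in field[1:].split(","):
                key, _, val = tag.partition(":")
                attributes[key] = val
    return {
        "name": name,
        "value": value,
        "type": {"c": "sum", "g": "gauge", "ms": "histogram", "h": "histogram"}.get(metric_type, "gauge"),
        "attributes": attributes,
    }

print(parse_statsd("checkout.latency:320|ms|#shop:123,region:us-east"))
```

The key idea is the mapping table: StatsD's counters, gauges, and timers land on OTLP's sum, gauge, and histogram data points, with tags carried over as attributes.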

Sidecar Architecture and Data Transmission

Cloud Run applications use sidecars to collect metrics. StatsD data is transmitted via UDP, while OTLP uses loopback interfaces. Data is batch-processed and aggregated by sidecars before being sent over public networks using Protobuf and compression. Metrics enter the cluster through a public ingress.

Metric Routing and Transformation

A metric router within Kubernetes converts StatsD to Prometheus-compatible formats. DDSketch is employed for cross-source histogram aggregation, reducing cardinality issues. OpenTelemetry mapping libraries convert OTLP to DDSketch. Delta storage ensures temporal accuracy, avoiding data loss during container termination.
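What makes a sketch attractive for cross-source aggregation is that merging two sketches is just bucket-wise addition. The toy version below captures that property under stated assumptions (logarithmic buckets with a fixed relative accuracy); the real DDSketch also handles zeros, negative values, and bucket-count limits.

```python
import math
from collections import Counter

# Toy DDSketch-style quantile sketch: logarithmic buckets with bounded
# relative error. Illustrative only -- not the production implementation.
class MiniSketch:
    def __init__(self, relative_accuracy: float = 0.01):
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.buckets: Counter = Counter()
        self.count = 0

    def add(self, value: float) -> None:
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def merge(self, other: "MiniSketch") -> None:
        # Merging is bucket-wise addition -- this is what makes aggregating
        # histograms from many ephemeral containers cheap.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        total = 0
        for key in sorted(self.buckets):
            total += self.buckets[key]
            if total > rank:
                return 2 * self.gamma ** key / (self.gamma + 1)
        raise ValueError("empty sketch")

a, b = MiniSketch(), MiniSketch()
for v in (100, 110, 120):
    a.add(v)
for v in (130, 140, 150):
    b.add(v)
a.merge(b)  # two sources collapse into one sketch
print(a.quantile(0.5))
```

Because each source ships a fixed-size sketch rather than raw samples, the router can combine latency distributions from any number of short-lived containers without per-sample cost.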

Challenges and Solutions

Network and Data Transmission

Network isolation in Cloud Run (non-peered networks) posed challenges. High StatsD data volumes (6.5MB/s) required optimization. Data loss risks during container shutdowns were mitigated by adjusting grace periods and using delta storage.
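The delta-storage idea can be sketched in a few lines: once the sidecar converts cumulative counters to deltas, each flush is a self-contained increment, so a container killed after its final flush loses nothing. The class below is a minimal illustration of that conversion, not the collector's actual processor.

```python
# Sketch of cumulative-to-delta conversion as a sidecar stage. With deltas,
# a terminating container's last flush is a self-contained increment rather
# than a running total that could be lost on shutdown.
class CumulativeToDelta:
    def __init__(self):
        self._last: dict[tuple, float] = {}

    def convert(self, series_key: tuple, cumulative: float) -> float:
        prev = self._last.get(series_key)
        self._last[series_key] = cumulative
        if prev is None or cumulative < prev:  # first point, or counter reset
            return cumulative
        return cumulative - prev

conv = CumulativeToDelta()
key = ("http_requests_total", ("region", "us-east"))
print(conv.convert(key, 10))  # first point -> 10
print(conv.convert(key, 25))  # delta -> 15
print(conv.convert(key, 3))   # counter reset detected -> 3
```

Note the reset check: when a restarted container reports a cumulative value below the previous one, the new value is emitted as-is rather than producing a negative delta.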

System Stability

Reducing otel-collector refresh intervals, controlling container startup order, and optimizing label counts in data aggregation improved stability. Sensitivity to label cardinality was addressed through dynamic label pruning.
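Dynamic label pruning can take several forms; one common pattern, sketched below under assumed label names and budgets (none of this is Shopify's actual schema), combines an allow-list with a per-label distinct-value budget.

```python
# Sketch of dynamic label pruning: an allow-list plus a per-label budget
# for distinct values. Label names and limits here are illustrative.
ALLOWED_LABELS = {"service", "region", "status"}
MAX_LABEL_VALUES = 1000  # distinct-value budget per label

seen_values: dict[str, set] = {}

def prune_labels(labels: dict[str, str]) -> dict[str, str]:
    pruned = {}
    for name, value in labels.items():
        if name not in ALLOWED_LABELS:
            continue  # drop high-cardinality labels like request IDs outright
        values = seen_values.setdefault(name, set())
        if value not in values and len(values) >= MAX_LABEL_VALUES:
            value = "overflow"  # collapse the long tail instead of exploding series
        values.add(value)
        pruned[name] = value
    return pruned

print(prune_labels({"service": "checkout", "request_id": "abc-123", "region": "us-east"}))
```

Dropping a label like `request_id` is what actually bounds the number of time series: every distinct label combination is a separate series in the backend.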

Next Steps and Improvements

Expanding Observability

Integrating a centralized logging platform (observe) into Cloud Run, adopting mTLS for authentication, and optimizing metric pipelines are priorities. Cardinality control will be handled directly in otel-collector, simplifying OTLP-to-Prometheus conversions.

System Scalability

Enhancing backpressure mechanisms, improving event-driven auto-scaling (KEDA), and establishing a unified pipeline for logs/traces/metrics will ensure long-term scalability.

Technical Details

  • DDSketch: An in-memory quantile-sketch structure for cross-source histogram aggregation, reducing cardinality.
  • Cardinality Control: Labels are dynamically pruned in otel-collector to eliminate redundancy.
  • Delta Storage: Ensures temporal accuracy by tracking changes rather than cumulative values.
  • OTLP Conversion: OpenTelemetry mapping libraries handle format translation.
  • Sidecar Configuration: Traces are processed via OTLP input → resource detection → data cleaning → batch processing → OTLP output. Metrics are handled via StatsD/OTLP input → resource detection → label mapping → cumulative-to-delta conversion → batch processing → output.
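The metrics pipeline in the last bullet can be sketched as a composition of stages, each taking and returning a batch of metric records. The stage names mirror the bullet above, but the function bodies are illustrative assumptions, not the collector's actual processors.

```python
from functools import reduce

# Each stage maps a list of metric dicts to a list of metric dicts,
# mirroring the sidecar pipeline: resource detection -> label mapping ->
# cumulative-to-delta -> batching. Bodies are illustrative only.

def resource_detection(metrics):
    # Attach resource attributes describing where the metric came from.
    return [{**m, "resource": {"service": "checkout"}} for m in metrics]

def label_mapping(metrics):
    rename = {"shop": "shop_id"}  # hypothetical mapping rule
    return [
        {**m, "attributes": {rename.get(k, k): v for k, v in m["attributes"].items()}}
        for m in metrics
    ]

def run_pipeline(metrics, stages):
    # Fold the batch through each stage in order.
    return reduce(lambda acc, stage: stage(acc), stages, metrics)

out = run_pipeline(
    [{"name": "orders", "attributes": {"shop": "123"}}],
    [resource_detection, label_mapping],
)
print(out)
```

Modeling the pipeline as pure stage functions is also roughly how the otel-collector is configured: receivers, an ordered list of processors, then exporters.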

Metric Routing and Cardinality Control

The current architecture uses metric routers within clusters to manage cardinality by limiting label counts. Future improvements include handling cardinality control at the otel-collector level, with dynamic label adjustments that retain only essential labels while removing redundant ones.

Metric Processing Pipeline and Backpressure Management

Event-driven auto-scaling (KEDA) has successfully managed high-traffic scenarios like Black Friday. Implementing backpressure mechanisms across the pipeline will enhance stability during anomalies. Whether metrics arrive as deltas or cumulative totals does not affect final results, since values are aggregated during transmission.
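The temporality claim is easy to verify with a toy example: summing per-interval deltas reproduces the final cumulative reading, so after aggregation both representations yield the same totals.

```python
# Toy check: aggregating deltas reproduces the cumulative total, so either
# temporality yields the same end result once values are summed downstream.
cumulative = [10, 25, 25, 40, 60]  # successive counter readings

# Derive deltas, as a cumulative-to-delta stage would.
deltas = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

assert deltas == [10, 15, 0, 15, 20]
assert sum(deltas) == cumulative[-1] == 60
print(deltas, sum(deltas))
```

What temporality does change is failure behavior: with deltas, a lost point loses one interval's increment; with cumulative values, the next point silently repairs the gap.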

Technical Architecture and Tool Integration

  • Metric Collection Layer: OpenTelemetry Collector serves as the core, enabling custom label processing and aggregation.
  • Storage and Monitoring: Integration with Prometheus ensures real-time metric analysis.
  • CNCF Compatibility: Adherence to CNCF standards ensures scalability and maintainability, aligning with cloud-native principles.

Conclusion

Shopify’s approach to serverless metrics highlights the importance of adaptive observability in distributed systems. By leveraging CNCF tools, optimizing metric protocols, and addressing cardinality challenges, the platform achieves scalability and reliability. Prioritizing backpressure management, dynamic label pruning, and unified metric pipelines will further solidify its observability infrastructure. This case study underscores the necessity of a robust metric platform in modern serverless architectures.