As serverless architectures gain traction, the need for robust metrics instrumentation becomes critical. Shopify, a global e-commerce platform, faced unique challenges in scaling observability to meet the demands of its distributed infrastructure. This article explores how Shopify leveraged serverless technologies, CNCF standards, and advanced metric platforms to achieve scalable observability, addressing the complexities of metrics ingestion, routing, and cardinality control.
Shopify processes cross-regional traffic and ensures service availability for millions of users. From 2020 to 2024, its gross merchandise volume (GMV) during Black Friday/Cyber Monday doubled, necessitating a more resilient observability architecture. The system must handle peak loads of 100 GB of logs per second, 12 million spans, and 350 million metric samples, demanding a scalable and efficient metric platform.
Shopify’s infrastructure runs entirely on Kubernetes within Google Cloud. Engineers create "runtime manifests" that are converted into Kubernetes manifests by the production platform. Observability is automatically integrated for logs, traces, and metrics. Serverless support via Cloud Run enables self-contained applications with independent databases and Kafka consumers, auto-scaling to zero replicas and zero cost. However, challenges like network isolation, data transmission efficiency, and node unpredictability emerged.
Existing Ruby applications used StatsD (a text-based protocol), while newer services adopted OpenTelemetry (OTLP). To bridge the gap, Shopify developed a custom otel-collector branch for StatsD-to-OTLP conversion, later merged upstream.
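To make the protocol gap concrete, here is a minimal sketch of the kind of translation such a bridge performs. The `Metric` shape and function names are illustrative, not the actual otel-collector API, and StatsD sample rates (`|@0.5`) are omitted for brevity.

```python
# Illustrative StatsD-to-OTLP-style translation. The real conversion lives in
# the otel-collector's statsd receiver; this only shows the parsing step.
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    value: float
    kind: str                       # "counter", "gauge", or "timer"
    attributes: dict = field(default_factory=dict)

def parse_statsd(line: str) -> Metric:
    """Parse a DogStatsD-style line, e.g. 'checkout.count:1|c|#shop:123'."""
    body, _, tags = line.partition("|#")
    name_value, type_code = body.split("|", 1)
    name, value = name_value.split(":", 1)
    kinds = {"c": "counter", "g": "gauge", "ms": "timer"}
    attrs = dict(t.split(":", 1) for t in tags.split(",")) if tags else {}
    return Metric(name, float(value), kinds[type_code], attrs)
```

The tag dictionary maps naturally onto OTLP attributes, which is what lets StatsD and OTLP sources share one downstream pipeline.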
Cloud Run applications use sidecars to collect metrics: StatsD data arrives over UDP, while OTLP uses the loopback interface. Sidecars batch and aggregate the data before sending it over the public network using Protobuf with compression; metrics then enter the cluster through a public ingress.
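The pre-aggregation step is what makes public-network transmission affordable: counters sharing a name and label set are summed locally, so each flush sends one sample per series instead of one packet per increment. A toy sketch of that behavior (class and field names are hypothetical):

```python
# Sidecar-style pre-aggregation sketch: many increments collapse into one
# sample per (name, labels) series per flush interval.
from collections import defaultdict

class AggregatingBatcher:
    def __init__(self):
        self._counters = defaultdict(float)

    def record(self, name: str, value: float, labels: dict):
        # Sort labels so logically equal series share one key.
        key = (name, tuple(sorted(labels.items())))
        self._counters[key] += value

    def flush(self) -> list:
        batch = [
            {"name": name, "value": value, "labels": dict(labels)}
            for (name, labels), value in self._counters.items()
        ]
        self._counters.clear()
        return batch
```

In production the flushed batch would be Protobuf-encoded and compressed before leaving the sidecar.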
A metric router within Kubernetes converts StatsD into Prometheus-compatible formats. DDSketch is employed for cross-source histogram aggregation, reducing cardinality issues, and OpenTelemetry mapping libraries convert OTLP histograms to DDSketch. Delta storage ensures temporal accuracy, avoiding data loss during container termination.
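The reason DDSketch-style histograms merge cleanly across sources is that values map to logarithmic buckets with bounded relative error, so merging two sketches is just summing bucket counts. The toy class below illustrates the principle only; real use would rely on the DDSketch or OpenTelemetry exponential-histogram libraries.

```python
# Toy DDSketch-like histogram: log-spaced buckets, lossless merge by
# summing counts, quantiles with bounded relative error.
import math
from collections import Counter

class TinySketch:
    def __init__(self, rel_err: float = 0.01):
        self.gamma = (1 + rel_err) / (1 - rel_err)
        self.buckets = Counter()
        self.count = 0

    def add(self, value: float):
        assert value > 0, "log-spaced buckets need positive values"
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def merge(self, other: "TinySketch"):
        # Merging never loses accuracy: bucket counts simply add up.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen > rank:
                return 2 * self.gamma ** idx / (1 + self.gamma)
        raise ValueError("empty sketch")
```

Two sidecars can each keep a small sketch and the router can merge them without ever seeing raw samples, which is what keeps cross-source aggregation cheap.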
Network isolation in Cloud Run (non-peered networks) posed challenges, and high StatsD volumes (6.5 MB/s) required optimization. The risk of data loss during container shutdown was mitigated by extending termination grace periods and using delta storage.
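The shutdown mitigation follows a common pattern: Cloud Run stops an instance by sending SIGTERM, and any buffered delta samples must be flushed before the grace period ends. A sketch of that pattern, with stand-in buffer and exporter names:

```python
# Shutdown-flush sketch: flush buffered samples on SIGTERM so deltas are not
# lost with the container. The buffer and exporter here are stand-ins.
import signal

buffered_samples = [("requests_total", 42.0)]
exported = []

def flush():
    # In production this would be the sidecar's batched network export.
    exported.extend(buffered_samples)
    buffered_samples.clear()

def on_sigterm(signum, frame):
    flush()  # must complete within the configured grace period

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the platform stopping the instance.
signal.raise_signal(signal.SIGTERM)
```

Because the samples are deltas, a flush that lands late simply adds to the next aggregation window rather than corrupting a cumulative series.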
Stability improved by reducing the otel-collector refresh interval, controlling container startup order, and limiting label counts during aggregation. Sensitivity to label cardinality was addressed through dynamic label pruning.
Near-term priorities include integrating a centralized logging platform (Observe) into Cloud Run, adopting mTLS for authentication, and optimizing the metric pipelines. Cardinality control will be handled directly in the otel-collector, simplifying OTLP-to-Prometheus conversion.
Enhancing backpressure mechanisms, improving event-driven auto-scaling with KEDA, and establishing a unified pipeline for logs, traces, and metrics will ensure long-term scalability.
The current architecture uses metric routers within clusters to manage cardinality by limiting label counts. Future improvements will move cardinality control to the otel-collector level, with dynamic label adjustments that retain only essential labels while removing redundant ones.
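One simple way to implement the router-side limit described above is to track how many distinct values each label has produced and drop labels that exceed a budget. The class and parameter names below are hypothetical, not Shopify's actual implementation:

```python
# Hypothetical cardinality guard: a label that produces too many distinct
# values is dropped from all series, keeping the total series count bounded.
from collections import defaultdict

class LabelPruner:
    def __init__(self, max_values_per_label: int = 100):
        self.max_values = max_values_per_label
        self.seen = defaultdict(set)

    def prune(self, labels: dict) -> dict:
        kept = {}
        for key, value in labels.items():
            self.seen[key].add(value)
            # Once a label blows its budget, it is dropped entirely,
            # collapsing all its series into one.
            if len(self.seen[key]) <= self.max_values:
                kept[key] = value
        return kept
```

Dropping the whole label (rather than individual values) keeps the remaining series aggregatable: every sample still lands in a well-defined, lower-cardinality series.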
Event-driven auto-scaling with KEDA has successfully handled high-traffic scenarios such as Black Friday, and implementing backpressure across the pipeline will further improve stability during anomalies. Whether metrics use delta or cumulative temporality does not affect final results, because samples are aggregated during transmission.
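The temporality equivalence is easy to verify: summing delta samples reproduces the cumulative total, and differencing cumulative samples recovers the deltas, so an aggregating pipeline ends at the same number either way.

```python
# Delta vs. cumulative temporality: both representations carry the same
# information, so aggregation yields identical final results.
from itertools import accumulate

deltas = [5, 3, 7, 2]
cumulative = list(accumulate(deltas))   # running sums of the deltas

# Sum of deltas equals the last cumulative value.
total_from_deltas = sum(deltas)
total_from_cumulative = cumulative[-1]

# Differencing cumulative samples recovers the original deltas.
recovered = [c - p for p, c in zip([0] + cumulative, cumulative)]
```

The practical consequence is that sidecars can emit whichever temporality is cheapest for them, and the router normalizes downstream.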
Shopify’s approach to serverless metrics highlights the importance of adaptive observability in distributed systems. By leveraging CNCF tools, optimizing metric protocols, and addressing cardinality challenges, the platform achieves scalability and reliability. Prioritizing backpressure management, dynamic label pruning, and unified metric pipelines will further solidify its observability infrastructure. This case study underscores the necessity of a robust metric platform in modern serverless architectures.