In the era of distributed systems and microservices, observability has become a cornerstone for maintaining reliability and performance. As systems scale to unprecedented levels—such as eBay’s infrastructure with 190 markets, 2.3 billion active items, and 4,600 microservices—the complexity of troubleshooting and monitoring grows exponentially. Traditional observability tools, while powerful, often fall short in providing actionable insights amid the deluge of data. This is where AI-enabled observability explainers come into play, combining the strengths of artificial intelligence (AI) and explainability to transform raw telemetry into meaningful narratives. This article explores how AI, when paired with structured engineering practices, addresses the challenges of observability in large-scale systems.
Observability refers to the ability to understand the internal state of a system through external outputs, such as logs, metrics, and traces. However, raw data alone is insufficient for diagnosing complex issues. Explainability bridges this gap by providing context, causality, and actionable insights. In AI-driven observability, explainers act as interpreters, translating vast datasets into human-readable stories that highlight anomalies, root causes, and trends.
AI, particularly large language models (LLMs) and machine learning (ML), enhances observability by automating pattern recognition, anomaly detection, and root-cause analysis. For example, eBay’s initial ML-based solutions used tools like Groot for anomaly detection and automated pod repair. However, early applications faced limitations, such as hallucinations due to large data volumes and context window constraints in LLMs. These challenges underscore the need for a hybrid approach that combines AI with engineering rigor.
To manage the overwhelming scale of data, eBay employs a key path algorithm to focus on critical components. By pruning non-essential spans from traces (an approach informed by Uber's CRISP paper), the system prioritizes high-impact elements such as resource-intensive spans or those returning 5xx errors. This reduces noise and accelerates troubleshooting.
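A minimal sketch of what such pruning could look like, assuming each span carries a duration and an HTTP status code (the span structure, thresholds, and helper below are illustrative assumptions, not eBay's actual algorithm):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float
    status_code: int = 0  # HTTP status reported by the span, if any

def key_path(spans: List[Span], slow_ms: float = 100.0) -> List[Span]:
    """Keep only spans likely to matter: resource-intensive ones and 5xx errors."""
    by_id = {s.span_id: s for s in spans}
    keep = {
        s.span_id
        for s in spans
        if s.duration_ms >= slow_ms or 500 <= s.status_code < 600
    }
    # Retain ancestors of kept spans so the pruned trace stays connected.
    for span_id in list(keep):
        parent = by_id[span_id].parent_id
        while parent and parent in by_id and parent not in keep:
            keep.add(parent)
            parent = by_id[parent].parent_id
    return [s for s in spans if s.span_id in keep]
```

Feeding only this pruned path to a model keeps prompts small while preserving the spans most likely to explain a regression.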
Effective AI models require clean, structured data. eBay leverages techniques such as mapping recurring tokens, like operation names, to compact identifiers (e.g., checkout → 1) to reduce dimensionality.
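As a hedged illustration of that kind of preprocessing (the encoder below is an assumption for the example, not eBay's published pipeline), recurring string tokens such as operation names can be swapped for small integer identifiers before the data reaches a model:

```python
class TokenEncoder:
    """Map recurring string tokens (e.g., operation names) to compact
    integer identifiers so the model sees fewer distinct values."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def encode(self, token: str) -> int:
        if token not in self._ids:
            self._ids[token] = len(self._ids) + 1
        return self._ids[token]

encoder = TokenEncoder()
print(encoder.encode("checkout"))  # -> 1
print(encoder.encode("search"))    # -> 2
print(encoder.encode("checkout"))  # -> 1 again; the mapping is stable
```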
AI is not a replacement for engineering but a complement. eBay integrates LLMs via APIs for simple inference and summarization, while standardizing on tools like OpenTelemetry and query languages to ensure consistency. For instance, PromQL is used to generate alerts, reducing reliance on LLMs for repetitive tasks.
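For example, a repetitive class of alerts can be templated deterministically instead of asking an LLM each time; the sketch below assumes a conventional http_requests_total metric with service and code labels (an illustrative assumption, not eBay's actual rule set):

```python
def error_rate_alert(service: str, threshold: float = 0.05) -> dict:
    """Render a Prometheus alerting rule for a service's 5xx error rate."""
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
        f' > {threshold}'
    )
    return {
        "alert": f"{service.capitalize()}HighErrorRate",
        "expr": expr,
        "for": "10m",
        "labels": {"severity": "page"},
        "annotations": {"summary": f"High 5xx rate on {service}"},
    }

# One deterministic template covers every service; no model call is needed
# for this repetitive task.
rule = error_rate_alert("checkout")
```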
When a database query delay impacted performance, the trace explainer localized the issue to a specific span, while the log explainer identified timeout patterns. Combining these insights enabled rapid resolution, demonstrating how explainers streamline root-cause analysis.
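A toy sketch of how the two findings might be joined, assuming the trace explainer reports the slow span's service and time window and the log explainer reports timeout patterns with timestamps (both data shapes are assumptions for illustration):

```python
from datetime import datetime, timedelta

def correlate(slow_span: dict, log_findings: list[dict]) -> str:
    """Combine a trace finding with log findings that overlap it in time."""
    start = slow_span["start"]
    end = start + timedelta(milliseconds=slow_span["duration_ms"])
    related = [
        f for f in log_findings
        if f["service"] == slow_span["service"] and start <= f["timestamp"] <= end
    ]
    if related:
        patterns = ", ".join(f["pattern"] for f in related)
        return (f"Slow span '{slow_span['name']}' on {slow_span['service']} "
                f"overlaps log patterns: {patterns}")
    return f"Slow span '{slow_span['name']}' has no matching log patterns."

summary = correlate(
    {"name": "db.query", "service": "checkout",
     "start": datetime(2024, 1, 1, 12, 0), "duration_ms": 4000},
    [{"service": "checkout", "pattern": "connection timeout",
      "timestamp": datetime(2024, 1, 1, 12, 0, 2)}],
)
```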
By integrating multiple explainers, eBay’s dashboards now provide holistic views of system health. For example, a single dashboard might combine trace, log, and metric data to auto-generate a troubleshooting workflow, reducing manual effort.
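As a hedged sketch of that composition, assuming each explainer exposes a simple callable that turns a time window into a short finding (the interface and example findings below are hypothetical):

```python
from typing import Callable, Dict

# Each explainer takes a time window and returns a short natural-language finding.
Explainer = Callable[[str], str]

def build_report(window: str, explainers: Dict[str, Explainer]) -> str:
    """Run every registered explainer over the same window and stitch the
    findings into one troubleshooting summary for the dashboard."""
    sections = [f"## {name}\n{fn(window)}" for name, fn in explainers.items()]
    return f"# System health: {window}\n\n" + "\n\n".join(sections)

report = build_report(
    "2024-01-01T12:00/PT15M",
    {
        "traces": lambda w: "checkout latency driven by db.query span",
        "logs": lambda w: "spike in 'connection timeout' messages",
        "metrics": lambda w: "p99 latency above SLO for checkout",
    },
)
```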
Early AI applications faced issues like high randomness and hallucinations. While LLMs excel at pattern recognition, their probabilistic nature makes them unsuitable for deterministic tasks like emergency debugging. This necessitates a hybrid model where AI handles pattern detection and engineering methods manage execution.
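One way to picture that split is the toy sketch below, where a hypothetical llm_summarize callable stands in for the model and a runbook table stands in for the engineering side; only pre-approved, deterministic actions ever execute:

```python
def hybrid_triage(telemetry: str, llm_summarize, runbooks: dict) -> str:
    """Let the model describe the pattern, but only run actions that map to
    a pre-approved, deterministic runbook."""
    finding = llm_summarize(telemetry)          # probabilistic: pattern detection
    for pattern, action in runbooks.items():    # deterministic: execution
        if pattern in finding.lower():
            return action()
    return f"No approved runbook matched; escalating with summary: {finding}"

# Hypothetical usage: the runbook table, not the model, decides what runs.
result = hybrid_triage(
    telemetry="...raw logs and spans...",
    llm_summarize=lambda t: "Pods in checkout are crash-looping after deploy",
    runbooks={"crash-looping": lambda: "rolled back latest checkout deploy"},
)
```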
AI-driven observability explainers represent a paradigm shift in how we interpret complex systems. By combining AI’s analytical power with engineering precision, organizations like eBay can transform raw telemetry into actionable insights. The key lies in balancing automation with human oversight, ensuring that explainability remains both accurate and interpretable. As the field evolves, standardization and hybrid approaches will be critical to unlocking the full potential of AI in observability.