From Legacy Vendor Tooling to OpenTelemetry: Scaling Observability at MSCI

Introduction

In distributed, cloud-native systems, observability is central to maintaining reliability and performance. For organizations like MSCI, which manage vast amounts of financial data across multiple cloud environments, moving from legacy vendor tooling to open standards such as OpenTelemetry is critical. This article explores MSCI’s adoption of OpenTelemetry: the challenges posed by fragmented tooling, data silos, and operational inefficiencies, and how the company used the CNCF ecosystem to build a scalable and flexible observability framework.

The Challenge of Legacy Vendor Tooling

MSCI, a global financial services firm with $44 billion in assets under management and a workforce of 6,000, operates in a highly complex environment. Its legacy infrastructure, built over 55 years through organic growth and acquisitions, resulted in a fragmented tooling landscape. Key challenges included:

Vendor Lock-in: Proprietary tools limited flexibility and increased costs.
Data Silos: Dispersed logs, metrics, and traces across systems hindered cross-team collaboration.
Inefficient Data Processing: Legacy systems struggled with high-volume data ingestion, leading to delayed issue detection.
Limited Visibility: Critical performance insights were often obscured by incomplete or fragmented telemetry data.

These issues underscored the need for a unified, open-standard observability solution that could integrate with existing systems while enabling future scalability.

OpenTelemetry: A Flexible and Open-Standard Solution

OpenTelemetry (OTEL) emerged as the ideal solution for MSCI’s needs. As a CNCF project, it provides a vendor-neutral framework for collecting and exporting telemetry data, including traces, metrics, and logs. Key advantages of adopting OTEL include:

Unified Data Collection: OTEL’s ability to ingest diverse data types from applications, services, and infrastructure ensures comprehensive visibility.
Interoperability: Compatibility with existing tools like Elasticsearch, Prometheus, and Jaeger allows seamless integration without overhauling legacy systems.
Scalability: The architecture supports high-throughput data processing, essential for MSCI’s 1GB/s telemetry volume.
Cost Efficiency: By reducing reliance on proprietary tools, MSCI minimized licensing costs and operational overhead.
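
To put the 1GB/s figure in perspective, the short calculation below converts a sustained ingest rate into a daily volume. It is only a back-of-the-envelope sketch: it assumes decimal units and a constant rate over the whole day, neither of which the figures above specify. If 1GB/s is a peak rather than an average, the real daily total would be lower.

```python
# Back-of-the-envelope conversion of a sustained ingest rate into daily volume.
# Assumptions: decimal units (1 TB = 1000 GB) and a constant rate all day.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Return the daily volume in terabytes for a sustained rate in GB/s."""
    gb_per_day = rate_gb_per_s * SECONDS_PER_DAY
    return gb_per_day / 1000  # convert GB to TB


print(daily_volume_tb(1.0))  # 86.4
```

Numbers of this size are why filtering and routing at the collection layer matter as much as raw storage capacity.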

Technical Architecture and Implementation

MSCI’s observability stack leverages OpenTelemetry as the core abstraction layer, with the following components:

1. OpenTelemetry Collector

The Collector acts as the central hub for data ingestion, processing, and routing. It aggregates telemetry from distributed applications, applies filters and transformations, and exports the data to downstream systems such as Elasticsearch, Prometheus, and Jaeger.
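
The filter, transform, and route flow can be illustrated with a small in-memory model. This is a conceptual sketch in plain Python, not the Collector's actual implementation or configuration: the Record shape, the two processors, the "environment" attribute, and the routing table are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """A telemetry record reduced to the fields this sketch needs (hypothetical)."""
    signal: str  # "logs", "metrics", or "traces"
    body: str
    attributes: Dict[str, str] = field(default_factory=dict)


# A processor returns a (possibly modified) record, or None to drop it.
Processor = Callable[[Record], Optional[Record]]


def drop_debug_logs(record: Record) -> Optional[Record]:
    """Filter step: discard log records marked as debug."""
    if record.signal == "logs" and record.attributes.get("level") == "debug":
        return None
    return record


def add_environment(record: Record) -> Optional[Record]:
    """Transform step: stamp each record with an environment attribute."""
    record.attributes.setdefault("environment", "prod")
    return record


class Pipeline:
    """Runs processors in order, then routes each surviving record by signal."""

    def __init__(self, processors: List[Processor], routes: Dict[str, List[str]]):
        self.processors = processors
        self.routes = routes
        self.exported: Dict[str, List[Record]] = {}  # backend name -> records

    def consume(self, record: Record) -> None:
        current: Optional[Record] = record
        for process in self.processors:
            current = process(current)
            if current is None:
                return  # dropped by a filter
        for backend in self.routes.get(current.signal, []):
            self.exported.setdefault(backend, []).append(current)


pipeline = Pipeline(
    processors=[drop_debug_logs, add_environment],
    routes={"logs": ["elasticsearch"], "metrics": ["prometheus"], "traces": ["jaeger"]},
)
pipeline.consume(Record("logs", "cache miss", {"level": "debug"}))
pipeline.consume(Record("logs", "order rejected", {"level": "error"}))
pipeline.consume(Record("traces", "GET /prices"))
print({backend: len(records) for backend, records in pipeline.exported.items()})
# {'elasticsearch': 1, 'jaeger': 1}
```

The debug log is dropped before export, while the remaining log and the trace are enriched and sent to their respective backends. Keeping this logic in one place means applications do not need to know which backend receives their data.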

2. Data Storage and Processing

Elasticsearch: Serves as the primary storage for logs and traces, handling 1GB/s of data with efficient indexing and querying capabilities.
Prometheus: Collects and stores metrics, enabling real-time analysis and alerting on performance anomalies.
Jaeger: Provides distributed tracing, offering end-to-end visibility into service interactions.
Grafana: Unifies visualization, allowing teams to monitor logs, traces, and metrics in a single interface.

3. Deployment and Automation

Kubernetes Integration: The OpenTelemetry Collector is deployed via Helm charts, enabling automated scaling and versioned, repeatable releases.
Onboarding Scorecard: A standardized process ensures consistent instrumentation across applications, reducing manual effort.
Locality Principle: Telemetry is processed and stored close to where it is produced in each cloud environment, minimizing latency and storage costs.
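
As an illustration of how an onboarding scorecard might be expressed, the sketch below scores an application against a handful of instrumentation checks. The criteria, their equal weighting, and the application name are assumptions made for this example; the actual scorecard criteria are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnboardingStatus:
    """Hypothetical scorecard criteria for a single application."""
    app: str
    emits_traces: bool
    emits_metrics: bool
    emits_structured_logs: bool
    has_dashboard: bool

    def score(self) -> int:
        """Percentage of criteria met, each weighted equally."""
        checks = [
            self.emits_traces,
            self.emits_metrics,
            self.emits_structured_logs,
            self.has_dashboard,
        ]
        return round(100 * sum(checks) / len(checks))


status = OnboardingStatus("pricing-api", True, True, False, True)
print(status.app, status.score())  # pricing-api 75
```

A single comparable number per application makes it easier to track onboarding progress across teams and to spot which signal is most often missing.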

4. Phased Implementation

MSCI adopted a three-phase approach:

Exploration (2021): Pilot projects validated OTEL’s capabilities, focusing on JVM metrics and log decoding.
Adoption (2022): Cross-team collaboration expanded instrumentation to 80% of applications, establishing standardized workflows.
Optimization (2023): Automated pipelines and localized data strategies improved scalability and reduced operational complexity.

Business and Technical Outcomes

Technical Benefits

High-Volume Data Handling: The system processes 1GB/s of telemetry, supporting MSCI’s global financial operations.
Unified Observability: Centralized data reduces maintenance costs and enhances cross-team collaboration.
Automated Instrumentation: New applications are instrumented automatically, accelerating deployment cycles.

Business Impact

Reduced Mean Time to Detect (MTTD): Detection time improved by 30%, enabling faster incident resolution.
Cost Savings: CNCF-based tools eliminated vendor lock-in, reducing licensing and operational expenses.
Scalable Monitoring: 80% of applications are instrumented, ensuring comprehensive visibility across MSCI’s infrastructure.
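
For readers who want to track a similar detection metric, the sketch below shows one common way to compute mean time to detect and the relative improvement between two periods. The incident timestamps are invented for illustration and chosen to produce a 30% improvement; they are not MSCI data.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_detect(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average gap between when an incident started and when it was detected."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((detected - started for started, detected in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents as (started, detected) pairs.
before = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 20)),
    (datetime(2022, 1, 9, 14, 0), datetime(2022, 1, 9, 14, 40)),
]
after = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 9, 14)),
    (datetime(2023, 1, 11, 14, 0), datetime(2023, 1, 11, 14, 28)),
]

mttd_before = mean_time_to_detect(before)  # 30 minutes
mttd_after = mean_time_to_detect(after)    # 21 minutes
improvement = 1 - mttd_after / mttd_before
print(f"{improvement:.0%}")  # 30%
```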

Challenges and Mitigation Strategies

Legacy System Integration: Data synchronization and developer resistance were addressed through phased onboarding and clear ROI demonstrations.
Instrumentation Overhead: Existing dashboards were restructured to align with OTEL’s standardized format, minimizing code changes.
Data Management: Advanced anomaly detection and elastic storage strategies ensured efficient handling of 2TB+ daily data volumes.
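
The anomaly detection method itself is not described here. As a deliberately simple stand-in, the sketch below flags values that deviate sharply from a rolling window using a z-score. The window size, threshold, and sample volumes are assumptions for illustration, and a production system would likely use something more robust.

```python
from statistics import mean, stdev
from typing import List


def zscore_anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: no scale to compare against
        if abs(values[i] - mean(history)) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute ingest volumes in GB, with one spike.
volumes = [60, 61, 59, 60, 62, 61, 120, 60, 61]
print(zscore_anomalies(volumes))  # [6]
```

Only the spike at index 6 is flagged. The values immediately after it are not, because the spike inflates the standard deviation of the windows that contain it, which is one known limitation of this simple approach.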

Future Directions and Innovations

MSCI continues to explore enhancements to its observability framework, including:

eBPF for Cluster Analysis: Leveraging eBPF for deeper insights into Kubernetes environments.
Predictive Monitoring: Integrating AI to anticipate performance issues and optimize resource allocation.
Multi-Observability Testing: Evaluating parallel observability stacks to ensure flexibility without application modifications.

Conclusion

MSCI’s transition from legacy vendor tooling to OpenTelemetry exemplifies the power of open standards in modernizing observability. By adopting OTEL, the company achieved a scalable, cost-effective, and unified monitoring framework that supports its global operations. This approach not only enhances system reliability but also positions MSCI to adapt to future technological advancements. For organizations facing similar challenges, the lessons from MSCI’s journey highlight the importance of flexibility, automation, and a long-term vision in building observability at scale.