Oxia: A Scalable Alternative to ZooKeeper for Distributed Systems

Introduction

ZooKeeper has long been a cornerstone of distributed systems, providing coordination and metadata management. However, its limitations in horizontal and vertical scalability have become increasingly problematic as systems grow in complexity and scale. Oxia, a new distributed metadata storage and coordination system, addresses these challenges by introducing a novel architecture that overcomes ZooKeeper's inherent bottlenecks. This article explores Oxia's design, features, and how it serves as a modern solution for scalable distributed systems.

Technical Overview

Definition and Core Concepts

Oxia is a horizontally scalable key-value store designed to replace ZooKeeper in distributed systems. It supports coordination patterns such as distributed locks, naming, and configuration management, while eliminating the single-leader bottleneck and performance ceilings inherent in ZooKeeper. Unlike ZooKeeper, which keeps its entire dataset in an in-memory database, Oxia stores data in log-structured merge (LSM) trees on disk while serving hot reads from memory, enabling efficient storage of metadata at scale.

Key Features

  1. Horizontal Scalability: Oxia's architecture allows for seamless expansion by adding storage pods, each responsible for a specific shard. This design avoids ZooKeeper's single-leader bottleneck, enabling linear scalability.

  2. Non-Consensus Protocol: Oxia replaces consensus protocols such as Raft or Paxos with leader-driven, push-based replication of a write-ahead log (WAL). This reduces complexity and improves performance by avoiding the coordination overhead of consensus rounds.

  3. Kubernetes Integration: Oxia is deeply integrated with Kubernetes, using Operators and custom resource definitions (CRDs) for dynamic configuration and scaling. This makes it ideal for cloud-native environments.

  4. High Throughput and Low Latency: Oxia sustains high throughput (on the order of 70,000 transactions per second in the scenarios described here) with low latency (under 100 ms for reads and under 300 ms for writes), well beyond ZooKeeper's practical limits.

  5. Data Persistence and Recovery: Data is written to disk via a write-ahead log (WAL), ensuring durability. Fast recovery is enabled by keeping leader and follower nodes consistently replicated.
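The replication flow described in features 2 and 5 can be sketched as follows. This is a minimal illustration of leader-driven, push-based write-ahead-log replication under majority acknowledgment; all class and method names are hypothetical, not Oxia's actual implementation.

```python
# Minimal sketch of leader-driven, push-based WAL replication.
# All names here are hypothetical, not Oxia's actual API.

class Follower:
    def __init__(self):
        self.log = []  # stands in for a durable write-ahead log

    def append(self, entry):
        # A real follower would fsync the entry before acknowledging.
        self.log.append(entry)
        return len(self.log) - 1  # acknowledged log offset

class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers
        self.commit_index = -1

    def write(self, entry):
        # 1. Append to the leader's own write-ahead log.
        self.log.append(entry)
        offset = len(self.log) - 1
        # 2. Push the entry to every follower; no election or proposal round.
        acks = 1  # the leader's own copy counts
        for follower in self.followers:
            if follower.append(entry) == offset:
                acks += 1
        # 3. Commit once a majority of replicas hold the entry.
        if acks > (len(self.followers) + 1) // 2:
            self.commit_index = offset
        return self.commit_index == offset

leader = Leader([Follower(), Follower()])
committed = leader.write(("put", "/config/flag", "on"))
```

The key contrast with a consensus protocol is step 2: the leader simply pushes entries to known followers rather than running election or proposal rounds per write.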

Addressing ZooKeeper's Limitations

Horizontal Scaling Bottlenecks

ZooKeeper's single-leader architecture creates a serialization point, so performance degrades as load and cluster size increase. Oxia eliminates this by distributing leadership across shards: each storage pod acts as the leader for some shards and a follower for others. This spreads load evenly and avoids ZooKeeper's multi-round-trip write path, in which every write is forwarded to the single leader and then proposed, acknowledged, and committed across the ensemble.
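The shard-per-leader idea can be sketched with a stable hash that maps each key to a shard, each owned by a different pod. The hash scheme and pod names below are illustrative assumptions; in Oxia, shard assignment is managed by its coordinator rather than hard-coded like this.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    # Stable hash of the key picks the shard; purely illustrative,
    # Oxia's real shard assignment is driven by its coordinator.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each shard has its own leader pod (names are made up), so writes to
# different keys fan out instead of funnelling through a single leader.
shard_leaders = {0: "pod-0", 1: "pod-1", 2: "pod-2", 3: "pod-3"}
leader_pod = shard_leaders[shard_for("/tenants/acme/topic-1")]
```

Because different keys hash to different shards, write load spreads across all leader pods instead of serializing through one.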

Vertical Scaling Constraints

ZooKeeper's reliance on an in-memory database caps the practical dataset size at roughly 2 GB. Oxia bypasses this by using LSM trees backed by persistent storage, enabling hundreds of gigabytes of metadata without performance degradation. It also avoids the periodic full snapshots that are a major source of latency spikes in ZooKeeper.
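The LSM write path can be illustrated with a toy store: writes land in an in-memory memtable, which is flushed to immutable sorted runs when full, so the full dataset never has to fit in memory or be rewritten as one snapshot. This is a teaching sketch under simplified assumptions; Oxia builds on a production LSM engine, not code like this.

```python
import bisect

class TinyLSM:
    # Toy LSM-style store: an in-memory memtable flushed to immutable
    # sorted runs. Illustrative only; not Oxia's actual storage engine.

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # newest run first; each run is sorted (key, value) pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Writing out one small sorted run avoids rewriting the whole
        # dataset, unlike a full in-memory snapshot.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # freshest data first
            return self.memtable[key]
        for run in self.runs:             # then newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # hits the limit, flushes the memtable to a sorted run
db.put("a", 3)   # newer value shadows the flushed one
```

Reads consult the memtable first and then runs from newest to oldest, which is why the rewritten value of "a" wins over the flushed one.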

Use Cases and Integration

Apache Pulsar Integration

Oxia is designed to serve as the metadata store for Apache Pulsar, replacing ZooKeeper entirely. This integration allows Pulsar to support an effectively unlimited number of topics. Within Pulsar's three-tier structure (Broker, BookKeeper, metadata store), Oxia takes ZooKeeper's place, removing the ZooKeeper dependency from the deployment.

Cloud-Native Deployment

Oxia's Kubernetes-native design enables dynamic scaling and cloud elasticity. Through ConfigMaps and CRDs, administrators declare storage pods, partition counts, and replication factors once, and the Operator applies that configuration without per-node manual intervention, making Oxia well suited to modern cloud environments.

Performance and Scalability

Benchmarking

Oxia outperforms ZooKeeper in both throughput and latency. In the benchmarks cited here, ZooKeeper struggles beyond roughly 30,000 TPS, while Oxia sustains around 100,000 TPS under mixed workloads. Because it scales linearly as nodes are added, performance stays consistent even as data volumes grow.

Dynamic Partitioning

Oxia supports dynamic shard splitting, allowing it to scale to an effectively unlimited number of partitions. This is a critical improvement over ZooKeeper, which does not partition its namespace at all: a ZooKeeper dataset that outgrows one ensemble must be manually split across separate ensembles.
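Shard splitting can be pictured as dividing a shard's hash range at its midpoint, after which keys route to the new halves with no manual remapping. The range layout and function names below are illustrative assumptions; Oxia coordinates real splits server-side.

```python
# Illustrative sketch of hash-range shard splitting; names and layout
# are assumptions, not Oxia's actual mechanism.

def route(key_hash, shards):
    # shards: list of (lo, hi) inclusive hash ranges, one per shard.
    for i, (lo, hi) in enumerate(shards):
        if lo <= key_hash <= hi:
            return i
    raise ValueError("hash outside all shard ranges")

def split(shards, i):
    # Split shard i at the midpoint of its hash range.
    lo, hi = shards[i]
    mid = (lo + hi) // 2
    return shards[:i] + [(lo, mid), (mid + 1, hi)] + shards[i + 1:]

shards = [(0, 2**32 - 1)]   # start with a single shard covering everything
shards = split(shards, 0)   # now two shards; routing adapts automatically
```

Because routing is derived from the range table, clients need no reconfiguration when a shard splits: the same lookup simply lands in a narrower range.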

Challenges and Considerations

While Oxia offers significant advantages, its non-consensus approach may raise concerns about fault tolerance relative to traditional consensus protocols. The project, however, describes its recovery protocol as mathematically proven, ensuring data consistency and minimal downtime during failures. Additionally, its reliance on LSM trees requires careful tuning of compaction strategies to sustain performance.
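The compaction work that needs tuning amounts to merging sorted runs while discarding stale values. The helper below is a minimal sketch of that merge step, assuming two runs of (key, value) pairs; real LSM engines layer many tuning knobs (run sizes, level counts, triggers) on top of this core operation.

```python
def compact(newer, older):
    # Merge two sorted runs of (key, value) pairs into one, keeping the
    # value from the newer run when a key appears in both. Illustrative
    # only; real LSM compaction adds leveling, size triggers, and
    # tombstone handling.
    merged = dict(older)
    merged.update(newer)   # newer values win on duplicate keys
    return sorted(merged.items())

run = compact([("a", 2), ("c", 9)], [("a", 1), ("b", 1)])
```

Tuning decides how often and across how many runs this merge happens: too rarely and reads scan many runs, too aggressively and write amplification grows.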

Conclusion

Oxia represents a paradigm shift in distributed metadata management by addressing the fundamental limitations of ZooKeeper. Its horizontally scalable architecture, non-consensus replication, and Kubernetes integration make it a compelling choice for modern cloud-native systems. By replacing ZooKeeper in projects like Apache Pulsar, Oxia demonstrates its potential to revolutionize how distributed systems handle coordination and metadata storage. For organizations seeking to build scalable, resilient infrastructure, Oxia offers a robust and future-proof solution.