Introduction
In distributed storage systems, hardware failures are inevitable. Apache Ozone, a distributed storage system that exposes both Hadoop-compatible file system and S3 interfaces, excels at fault tolerance through its layered architecture and automated recovery mechanisms. This article delves into Ozone's fault tolerance strategies, highlighting how it maintains data availability and consistency even under hardware failures.
System Architecture Overview
Apache Ozone's architecture is built around three core components:
Ozone Manager (OM): Manages namespace metadata (e.g., volumes, buckets, and key/file names) and uses the Raft consensus protocol for high availability; the metadata is replicated across three OM nodes for redundancy.
Storage Container Manager (SCM): Manages storage container metadata, also using the Raft protocol to maintain consistency.
Data Node: Stores actual data blocks, managed by SCM for replication and distribution.
This layered design ensures that Ozone can isolate failures and maintain system integrity.
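To make the division of responsibilities concrete, here is a minimal sketch of how a key read could flow through the three layers. The interfaces and method names are illustrative assumptions, not the real Ozone client API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

/** Simplified model of a key read flowing through the three Ozone layers. */
final class ReadPathSketch {

    /** Namespace layer: resolves a key to the blocks that store it. */
    interface OzoneManager {
        List<Long> lookupKeyBlocks(String volume, String bucket, String key);
    }

    /** Container layer: maps a block to the DataNodes holding its replicas. */
    interface StorageContainerManager {
        List<String> locateReplicas(long blockId);
    }

    /** Data layer: serves raw block bytes from one DataNode. */
    interface DataNode {
        byte[] readBlock(String address, long blockId) throws IOException;
    }

    static byte[] readKey(OzoneManager om, StorageContainerManager scm, DataNode dn,
                          String volume, String bucket, String key) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long blockId : om.lookupKeyBlocks(volume, bucket, key)) {
            boolean served = false;
            IOException lastError = new IOException("no replicas for block " + blockId);
            // Any replica can serve the read; failures fall through to the next one.
            for (String replica : scm.locateReplicas(blockId)) {
                try {
                    out.write(dn.readBlock(replica, blockId));
                    served = true;
                    break;
                } catch (IOException e) {
                    lastError = e;
                }
            }
            if (!served) {
                throw lastError;
            }
        }
        return out.toByteArray();
    }
}
```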
Fault Types and Handling Mechanisms
Network Failures
- Leader Partition: When the Leader loses connectivity with its Followers, the Followers time out and elect a new Leader; the isolated Leader steps down to a Follower.
- Automatic Recovery: Upon network restoration, the former Leader rejoins the cluster and synchronizes its state.
- Raft Consensus: Every operation must be acknowledged by a quorum (2 of 3 nodes) before it commits, which keeps the replicas consistent (as sketched below).
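A minimal sketch of the majority rule behind this: with three nodes, two acknowledgements form a quorum, so a partitioned leader that can only reach itself cannot commit anything.

```java
/** Raft-style majority check: an operation commits only once a quorum acknowledges it. */
final class Quorum {

    /** A quorum is a strict majority, e.g. 2 of 3 nodes for a three-node OM/SCM group. */
    static boolean hasQuorum(int acknowledgements, int clusterSize) {
        return acknowledgements >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(2, 3)); // true: 2 of 3 is a majority
        System.out.println(hasQuorum(1, 3)); // false: a partitioned leader alone cannot commit
    }
}
```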
Node Failures
- Node Crash: The remaining nodes continue serving requests; clients automatically retry against the new Leader (see the failover sketch after this list).
- Node Replacement: Faulty nodes require Bootstrapping to rejoin the cluster and synchronize data.
- Redundancy Recommendation: A RAID 1 configuration is advised to add disk-level redundancy.
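The client-side retry behaviour can be sketched as a simple failover loop; the class and method names are assumptions for illustration, not the actual Ozone client code.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Function;

/**
 * Sketch of client-side failover: if the node a request lands on has crashed or is
 * no longer the leader, the client retries the same request against the next
 * candidate node until one succeeds.
 */
final class FailoverClient {

    static <T> T callWithFailover(List<String> nodes, Function<String, T> call)
            throws IOException {
        IOException lastFailure = null;
        for (String node : nodes) {
            try {
                return call.apply(node); // succeeds only on a live leader
            } catch (RuntimeException e) {
                lastFailure = new IOException("Request to " + node + " failed", e);
            }
        }
        throw lastFailure != null ? lastFailure : new IOException("No nodes configured");
    }
}
```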
Disk Failures
Ozone Manager Disk Failure
- Database Failure: If the metadata disk becomes unreadable or unwritable, the node logs the error and shuts down, which triggers a Leader election among the remaining nodes.
- Data Recovery: Administrators replace the disk and rejoin the node to the cluster, after which it synchronizes the latest state.
Data Node Disk Failure
Fault Handling Details
- Data Replication Strategy: When a data node stops heartbeating, SCM first marks it as Stale and, once the failure is confirmed (the node is declared Dead), triggers re-replication of the affected containers (see the sketch after this list).
- Performance Consideration: Accurate failure detection prevents unnecessary, costly over-replication.
- Monitoring: The Recon service periodically collects node status and displays it in its Web UI.
- Error Handling: For intermittent I/O errors, multiple checks are performed before a disk is declared failed, avoiding false positives.
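A minimal sketch of heartbeat-based detection, assuming illustrative Stale/Dead thresholds rather than Ozone's shipped defaults:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Sketch of heartbeat-based failure detection; thresholds here are assumptions. */
final class NodeStateTracker {

    enum NodeState { HEALTHY, STALE, DEAD }

    private final Map<String, Instant> lastHeartbeat = new HashMap<>();
    private final Duration staleAfter;
    private final Duration deadAfter;

    NodeStateTracker(Duration staleAfter, Duration deadAfter) {
        this.staleAfter = staleAfter;
        this.deadAfter = deadAfter;
    }

    void recordHeartbeat(String node, Instant now) {
        lastHeartbeat.put(node, now);
    }

    /** A node drifts from HEALTHY to STALE to DEAD as heartbeats stay absent. */
    NodeState stateOf(String node, Instant now) {
        Instant last = lastHeartbeat.getOrDefault(node, Instant.EPOCH);
        Duration silence = Duration.between(last, now);
        if (silence.compareTo(deadAfter) >= 0) {
            return NodeState.DEAD;   // only now is re-replication scheduled
        }
        if (silence.compareTo(staleAfter) >= 0) {
            return NodeState.STALE;  // suspected, but no data movement yet
        }
        return NodeState.HEALTHY;
    }
}
```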
Fault Recovery Process
- Network Partition: Automatic Leader election and state synchronization occur.
- Node Failure: Remaining nodes continue operations; clients retry requests on the new Leader.
- Disk Failure: Volume Scanner and Container Scanner mark faults and trigger data replication.
- Cluster Recovery: Replace faulty hardware, rejoin the cluster, and synchronize the latest state.
Data Consistency and Validation
Disk Health Check Mechanism
Check Process:
- Write test data to the disk, read it back, and compare it with the in-memory copy.
- Delete the test data if verification succeeds.
- A single failed check does not immediately fail the disk; results are evaluated across multiple checks.
- Default sliding window (last 3 checks):
  - Two successful checks keep the disk marked healthy.
  - Two failures mark the disk as failed and trigger re-replication.
Abnormal Handling:
- Intermittent I/O errors should not trigger high-cost re-replication on their own.
- Persistently unstable disks are taken out of service so that clients stop retrying against them and performance does not degrade.
- Re-replication is triggered only when 2 of the last 3 checks fail (see the sketch below).
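A minimal sketch of such a volume check, assuming a window of 3 and a failure threshold of 2 as described above; the probe file name and class structure are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/**
 * Sketch of a disk health check: write a small probe file, read it back, compare,
 * and track the last few results in a sliding window so a single intermittent I/O
 * error does not fail the whole volume.
 */
final class VolumeHealthCheck {

    private static final int WINDOW = 3;           // keep the last 3 results
    private static final int FAILURES_TO_FAIL = 2; // 2 of 3 failures => volume failed

    private final Deque<Boolean> recentResults = new ArrayDeque<>();

    /** One probe: write, read back, compare with the in-memory bytes, then clean up. */
    boolean probe(Path volumeDir) {
        byte[] expected = "disk-check-probe".getBytes(StandardCharsets.UTF_8);
        Path probeFile = volumeDir.resolve("disk-check.tmp");
        try {
            Files.write(probeFile, expected);
            byte[] actual = Files.readAllBytes(probeFile);
            Files.deleteIfExists(probeFile);       // delete the test data on success
            return Arrays.equals(expected, actual);
        } catch (IOException e) {
            return false;                          // any I/O error counts as one failed check
        }
    }

    /** Records a result and reports whether the volume should now be marked failed. */
    boolean recordAndEvaluate(boolean checkPassed) {
        recentResults.addLast(checkPassed);
        if (recentResults.size() > WINDOW) {
            recentResults.removeFirst();
        }
        long failures = recentResults.stream().filter(ok -> !ok).count();
        return failures >= FAILURES_TO_FAIL;       // triggers marking + re-replication
    }
}
```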
Container Scanner
Two-Stage Check:
- Metadata Check: Validate container directory existence, readability, and metadata file integrity.
- Data Check: Traverse all blocks, comparing each block's on-disk length and checksum with the metadata recorded in RocksDB. Scan frequency is throttled to limit the impact on foreground I/O.
Abnormal Handling:
- Mark the entire container as unhealthy if block data is damaged.
- Replicate the entire container rather than individual blocks: block-level damage (e.g., bit rot, cosmic rays) is rare, so container-granularity replication keeps the recovery path simple (see the sketch below).
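The two stages can be sketched as follows; the metadata layout and record type are simplifications of what Ozone actually keeps in RocksDB.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.CRC32;

/**
 * Sketch of the two-stage container check: a cheap metadata stage first, then a
 * data stage that walks every block and compares its length and checksum with the
 * values recorded in the container's metadata.
 */
final class ContainerCheckSketch {

    record BlockMeta(long length, long crc32) { }

    /** Stage 1: the container directory and metadata file must exist and be readable. */
    static boolean metadataStage(Path containerDir, Path metadataFile) {
        return Files.isDirectory(containerDir)
                && Files.isReadable(containerDir)
                && Files.isReadable(metadataFile);
    }

    /** Stage 2: every block on disk must match the recorded length and checksum. */
    static boolean dataStage(Map<Path, BlockMeta> expectedBlocks) throws IOException {
        for (Map.Entry<Path, BlockMeta> entry : expectedBlocks.entrySet()) {
            byte[] data = Files.readAllBytes(entry.getKey());
            CRC32 crc = new CRC32();
            crc.update(data);
            BlockMeta expected = entry.getValue();
            if (data.length != expected.length() || crc.getValue() != expected.crc32()) {
                return false;   // one damaged block marks the whole container unhealthy
            }
        }
        return true;
    }
}
```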
On-Demand Scans
Trigger Conditions:
- Client read/write anomalies.
- Container problems that the periodic background scans have not yet detected.
Process:
- A client detects a container anomaly during a read.
- The data node marks the affected disk and container for immediate scanning.
- SCM coordinates re-replication (sketched below):
  - Copy the data from healthy replicas.
  - Delete the damaged copy on the faulty node.
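A minimal sketch of the on-demand trigger: a read error enqueues the container for an immediate scan instead of waiting for the next background pass. The queue-based design and names here are assumptions for illustration, not Ozone internals.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of an on-demand scan trigger fed by client read errors. */
final class OnDemandScanTrigger {

    private final BlockingQueue<Long> scanQueue = new LinkedBlockingQueue<>();
    private final Set<Long> alreadyQueued = ConcurrentHashMap.newKeySet();

    /** Called from the read path when a client hits a checksum or I/O error. */
    void onReadError(long containerId) {
        if (alreadyQueued.add(containerId)) {   // de-duplicate repeated errors
            scanQueue.add(containerId);
        }
    }

    /** The scanner thread drains the queue; confirmed damage is reported to SCM. */
    Long nextContainerToScan() throws InterruptedException {
        Long id = scanQueue.take();
        alreadyQueued.remove(id);
        return id;
    }
}
```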
Client Data Consistency Validation
Write Process:
- Generate checksum.
- Concurrently transmit to data nodes.
- Nodes persist data after verifying checksum matches.
Read Process:
- Nodes return data and checksum.
- The client recomputes the checksum and confirms success if it matches.
- On a mismatch, the client retries the read against another replica (see the sketch below).
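A minimal sketch of this end-to-end validation using CRC32; the method names are illustrative, and Ozone's actual checksum type and chunking are not modeled.

```java
import java.util.zip.CRC32;

/**
 * Sketch of end-to-end checksum validation: the writer computes a checksum that
 * travels with the data, the node verifies it before persisting, and the reader
 * verifies it again before accepting the bytes.
 */
final class ChecksumValidation {

    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Write side: the node persists the data only if the client's checksum matches. */
    static boolean acceptWrite(byte[] data, long clientChecksum) {
        return checksumOf(data) == clientChecksum;
    }

    /** Read side: accept the bytes only if the stored checksum matches the recomputed one. */
    static boolean acceptRead(byte[] data, long storedChecksum) {
        return checksumOf(data) == storedChecksum;   // on mismatch, retry another replica
    }
}
```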
Future Improvements
- Checksum Optimization: Reduce checksum validation overhead. Consider storing checksums in block headers for zero-copy transmission.
- Soft Fault Handling: Detect slow nodes and slow disks through SCM and Recon monitoring, and dynamically route data transfers away from them.
- Checksum Storage: Store checksums in block headers rather than separately in RocksDB, so data and checksums stay together and extra lookups and round trips are avoided.
Configuration and Monitoring
Tunable Parameters (an illustrative configuration sketch follows these lists):
- Scan frequency and bandwidth allocation.
- Error detection thresholds (e.g., sliding window length).
- Data replication strategies.
Monitoring Metrics:
- Bandwidth and frequency of scans.
- Number and types of errors detected.
- Disk/container failure rates.
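A sketch of how such tunables might be grouped, assuming hypothetical property names; these are placeholders, not Ozone's real configuration keys.

```java
import java.util.Properties;

/** Illustrative grouping of scanner tunables; all property names are placeholders. */
final class ScannerTuning {

    final long scanIntervalMinutes;
    final long scanBandwidthBytesPerSec;
    final int slidingWindowSize;
    final int failuresToMarkDisk;

    ScannerTuning(Properties props) {
        // Hypothetical keys with defaults matching the behavior described above.
        scanIntervalMinutes = Long.parseLong(props.getProperty("scanner.interval.minutes", "60"));
        scanBandwidthBytesPerSec = Long.parseLong(props.getProperty("scanner.bandwidth.bytes", "5242880"));
        slidingWindowSize = Integer.parseInt(props.getProperty("disk.check.window", "3"));
        failuresToMarkDisk = Integer.parseInt(props.getProperty("disk.check.failures", "2"));
    }
}
```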
Disk Failure Handling Details
Volume Failure Handling Mechanism
When a storage volume fails, the system re-replicates the affected data from the remaining replicas onto available space elsewhere in the cluster. Metadata is updated alongside the data migration, preserving data availability and cluster consistency. A simplified planning sketch follows.
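A simplified sketch of the re-replication planning step, assuming a flat list of candidate target nodes and ignoring real placement policies (racks, load, topology):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of re-replication planning after a volume failure: for every container
 * that lost a replica, pick a healthy source and a new target so the replica
 * count returns to the desired factor.
 */
final class ReplicationPlanner {

    record CopyTask(long containerId, String sourceNode, String targetNode) { }

    static List<CopyTask> plan(Map<Long, List<String>> healthyReplicas,
                               List<String> candidateTargets,
                               int replicationFactor) {
        List<CopyTask> tasks = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : healthyReplicas.entrySet()) {
            List<String> replicas = new ArrayList<>(entry.getValue());
            if (replicas.isEmpty()) {
                continue;   // no healthy source left; needs manual recovery
            }
            for (String target : candidateTargets) {
                if (replicas.size() >= replicationFactor) {
                    break;  // back to the desired replica count
                }
                if (!replicas.contains(target)) {
                    tasks.add(new CopyTask(entry.getKey(), replicas.get(0), target));
                    replicas.add(target);
                }
            }
        }
        return tasks;
    }
}
```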
Data Node Maintenance Modes
Data nodes offer two maintenance modes, Decommission and Maintenance; a simplified state sketch follows the two lists below.
Decommission Mode:
- Purpose: Permanently remove a node, triggering re-replication of its data.
- Behavior:
- The node is marked for removal; data is migrated from other replicas.
- The node cannot be re-enabled.
- Suitable for long-term node removal.
Maintenance Mode:
- Purpose: Temporarily disable a node to avoid triggering data re-replication.
- Behavior:
- The node is marked as maintenance; the system does not initiate data re-replication.
- The node can be brought back into service predictably once the work is done.
- Suitable for OS upgrades or short-term maintenance.
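A simplified sketch of the two paths as a state model; the state names are illustrative and deliberately smaller than Ozone's actual node state machine.

```java
/**
 * Sketch of the two removal paths: decommission drains a node and triggers
 * re-replication, while maintenance parks it temporarily without moving data.
 */
final class NodeAdminState {

    enum State { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

    /** Decommission: replicas are copied off first, then the node leaves the cluster. */
    static State decommission(State current) {
        return current == State.IN_SERVICE ? State.DECOMMISSIONING : current;
    }

    /** Maintenance: the node is parked; no re-replication is started for its data. */
    static State enterMaintenance(State current) {
        return current == State.IN_SERVICE ? State.IN_MAINTENANCE : current;
    }

    /** Only a node in maintenance comes straight back; decommission is treated as one-way. */
    static State returnToService(State current) {
        return current == State.IN_MAINTENANCE ? State.IN_SERVICE : current;
    }
}
```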
Metadata Node Decommission Mechanism
Metadata nodes (OM) support only Decommission mode because their cluster membership is static (a fixed set of three nodes). The process is as follows:
- Use Decommission CLI to stop the target node.
- Raft protocol updates the cluster state, marking the node as disabled.
- A new leader is elected; the cluster continues operation.
- After upgrades, use Recommission to rejoin the cluster.
- Repeat this process to rotate maintenance among the three nodes.
This design ensures metadata node stability and cluster high availability.