Introduction
In distributed storage systems, hardware failures are inevitable. Apache Ozone, a distributed storage system that exposes both Hadoop-compatible file system and S3 interfaces, excels at fault tolerance through its layered architecture and automated recovery mechanisms. This article delves into Ozone's fault tolerance strategies, highlighting how it maintains data availability and consistency even under hardware failures.
System Architecture Overview
Apache Ozone's architecture is built around three core components:
Ozone Manager (OM): Manages namespace metadata (e.g., volumes, buckets, and key/file names) and uses the Raft consensus protocol for high availability; the metadata is replicated across three OM nodes for redundancy.
Storage Container Manager (SCM): Manages storage container metadata, also using the Raft protocol to maintain consistency.
Data Node: Stores actual data blocks, managed by SCM for replication and distribution.
This layered design ensures that Ozone can isolate failures and maintain system integrity.
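To make the division of responsibilities concrete, here is a minimal sketch of how a key read could flow through the three layers. The interfaces and method names are illustrative assumptions, not the real Ozone client API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

/** Simplified model of a key read flowing through the three Ozone layers. */
final class ReadPathSketch {

    /** Namespace layer: resolves a key to the blocks that store it. */
    interface OzoneManager {
        List<Long> lookupKeyBlocks(String volume, String bucket, String key);
    }

    /** Container layer: maps a block to the DataNodes holding its replicas. */
    interface StorageContainerManager {
        List<String> locateReplicas(long blockId);
    }

    /** Data layer: serves raw block bytes from one DataNode. */
    interface DataNode {
        byte[] readBlock(String address, long blockId) throws IOException;
    }

    static byte[] readKey(OzoneManager om, StorageContainerManager scm, DataNode dn,
                          String volume, String bucket, String key) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long blockId : om.lookupKeyBlocks(volume, bucket, key)) {
            boolean served = false;
            IOException lastError = new IOException("no replicas for block " + blockId);
            // Any replica can serve the read; failures fall through to the next one.
            for (String replica : scm.locateReplicas(blockId)) {
                try {
                    out.write(dn.readBlock(replica, blockId));
                    served = true;
                    break;
                } catch (IOException e) {
                    lastError = e;
                }
            }
            if (!served) {
                throw lastError;
            }
        }
        return out.toByteArray();
    }
}
```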
Fault Types and Handling Mechanisms
Network Failures
- Leader Partition: When the Leader loses connectivity with its Followers, the Followers time out and elect a new Leader; the isolated Leader steps down to a Follower.
- Automatic Recovery: Upon network restoration, the former Leader rejoins the cluster and synchronizes its state.
- Raft Consensus: Every operation must be acknowledged by a quorum (2 of 3 nodes) before it commits, which keeps the replicas consistent (as sketched below).
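A minimal sketch of the majority rule behind this: with three nodes, two acknowledgements form a quorum, so a partitioned leader that can only reach itself cannot commit anything.

```java
/** Raft-style majority check: an operation commits only once a quorum acknowledges it. */
final class Quorum {

    /** A quorum is a strict majority, e.g. 2 of 3 nodes for a three-node OM/SCM group. */
    static boolean hasQuorum(int acknowledgements, int clusterSize) {
        return acknowledgements >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(2, 3)); // true: 2 of 3 is a majority
        System.out.println(hasQuorum(1, 3)); // false: a partitioned leader alone cannot commit
    }
}
```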
Node Failures
- Node Crash: The remaining nodes continue serving requests; clients automatically retry against the new Leader (see the failover sketch after this list).
- Node Replacement: Faulty nodes require Bootstrapping to rejoin the cluster and synchronize data.
- Redundancy Recommendation: A RAID 1 configuration is advised to add disk-level redundancy.
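The client-side retry behaviour can be sketched as a simple failover loop; the class and method names are assumptions for illustration, not the actual Ozone client code.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Function;

/**
 * Sketch of client-side failover: if the node a request lands on has crashed or is
 * no longer the leader, the client retries the same request against the next
 * candidate node until one succeeds.
 */
final class FailoverClient {

    static <T> T callWithFailover(List<String> nodes, Function<String, T> call)
            throws IOException {
        IOException lastFailure = null;
        for (String node : nodes) {
            try {
                return call.apply(node); // succeeds only on a live leader
            } catch (RuntimeException e) {
                lastFailure = new IOException("Request to " + node + " failed", e);
            }
        }
        throw lastFailure != null ? lastFailure : new IOException("No nodes configured");
    }
}
```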
Disk Failures
Ozone Manager Disk Failure
- Database Failure: If the metadata disk becomes unreadable or unwritable, the node logs the error and shuts down, which triggers a Leader election among the remaining nodes.
- Data Recovery: Administrators replace the disk and rejoin the node to the cluster, after which it synchronizes the latest state.
Data Node Disk Failure
Fault Handling Details
- Data Replication Strategy: When a data node stops heartbeating, SCM first marks it as Stale and, once the failure is confirmed (the node is declared Dead), triggers re-replication of the affected containers (see the sketch after this list).
- Performance Consideration: Accurate failure detection prevents unnecessary, costly over-replication.
- Monitoring: The Recon service periodically collects node status and displays it in its Web UI.
- Error Handling: For intermittent I/O errors, multiple checks are performed before a disk is declared failed, avoiding false positives.
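A minimal sketch of heartbeat-based detection, assuming illustrative Stale/Dead thresholds rather than Ozone's shipped defaults:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Sketch of heartbeat-based failure detection; thresholds here are assumptions. */
final class NodeStateTracker {

    enum NodeState { HEALTHY, STALE, DEAD }

    private final Map<String, Instant> lastHeartbeat = new HashMap<>();
    private final Duration staleAfter;
    private final Duration deadAfter;

    NodeStateTracker(Duration staleAfter, Duration deadAfter) {
        this.staleAfter = staleAfter;
        this.deadAfter = deadAfter;
    }

    void recordHeartbeat(String node, Instant now) {
        lastHeartbeat.put(node, now);
    }

    /** A node drifts from HEALTHY to STALE to DEAD as heartbeats stay absent. */
    NodeState stateOf(String node, Instant now) {
        Instant last = lastHeartbeat.getOrDefault(node, Instant.EPOCH);
        Duration silence = Duration.between(last, now);
        if (silence.compareTo(deadAfter) >= 0) {
            return NodeState.DEAD;   // only now is re-replication scheduled
        }
        if (silence.compareTo(staleAfter) >= 0) {
            return NodeState.STALE;  // suspected, but no data movement yet
        }
        return NodeState.HEALTHY;
    }
}
```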
Fault Recovery Process
- Network Partition: Automatic Leader election and state synchronization occur.
- Node Failure: Remaining nodes continue operations; clients retry requests on the new Leader.
- Disk Failure: Volume Scanner and Container Scanner mark faults and trigger data replication.
- Cluster Recovery: Replace faulty hardware, rejoin the cluster, and synchronize the latest state.
Data Consistency and Validation
Disk Health Check Mechanism
Check Process:
- Write test data to the disk, read it back, and compare it with the in-memory copy.
- Delete the test data if verification succeeds.
- A single failed check does not immediately fail the disk; results are evaluated across multiple checks.
- Default sliding window (last 3 checks):
  - Two successful checks keep the disk marked healthy.
  - Two failures mark the disk as failed and trigger re-replication.
Abnormal Handling:
- Intermittent I/O errors should not trigger high-cost re-replication on their own.
- Persistently unstable disks are taken out of service so that clients stop retrying against them and performance does not degrade.
- Re-replication is triggered only when 2 of the last 3 checks fail (see the sketch below).
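A minimal sketch of such a volume check, assuming a window of 3 and a failure threshold of 2 as described above; the probe file name and class structure are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/**
 * Sketch of a disk health check: write a small probe file, read it back, compare,
 * and track the last few results in a sliding window so a single intermittent I/O
 * error does not fail the whole volume.
 */
final class VolumeHealthCheck {

    private static final int WINDOW = 3;           // keep the last 3 results
    private static final int FAILURES_TO_FAIL = 2; // 2 of 3 failures => volume failed

    private final Deque<Boolean> recentResults = new ArrayDeque<>();

    /** One probe: write, read back, compare with the in-memory bytes, then clean up. */
    boolean probe(Path volumeDir) {
        byte[] expected = "disk-check-probe".getBytes(StandardCharsets.UTF_8);
        Path probeFile = volumeDir.resolve("disk-check.tmp");
        try {
            Files.write(probeFile, expected);
            byte[] actual = Files.readAllBytes(probeFile);
            Files.deleteIfExists(probeFile);       // delete the test data on success
            return Arrays.equals(expected, actual);
        } catch (IOException e) {
            return false;                          // any I/O error counts as one failed check
        }
    }

    /** Records a result and reports whether the volume should now be marked failed. */
    boolean recordAndEvaluate(boolean checkPassed) {
        recentResults.addLast(checkPassed);
        if (recentResults.size() > WINDOW) {
            recentResults.removeFirst();
        }
        long failures = recentResults.stream().filter(ok -> !ok).count();
        return failures >= FAILURES_TO_FAIL;       // triggers marking + re-replication
    }
}
```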
Container Scanner
Two-Stage Check:
- Metadata Check: Validate container directory existence, readability, and metadata file integrity.
- Data Check: Traverse all blocks, comparing each block's on-disk length and checksum with the metadata recorded in RocksDB. Scan frequency is throttled to limit the impact on foreground I/O.
Abnormal Handling:
- Mark the entire container as unhealthy if block data is damaged.
- Replicate the entire container rather than individual blocks: block-level damage (e.g., bit rot, cosmic rays) is rare, so container-granularity replication keeps the recovery path simple (see the sketch below).
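The two stages can be sketched as follows; the metadata layout and record type are simplifications of what Ozone actually keeps in RocksDB.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.CRC32;

/**
 * Sketch of the two-stage container check: a cheap metadata stage first, then a
 * data stage that walks every block and compares its length and checksum with the
 * values recorded in the container's metadata.
 */
final class ContainerCheckSketch {

    record BlockMeta(long length, long crc32) { }

    /** Stage 1: the container directory and metadata file must exist and be readable. */
    static boolean metadataStage(Path containerDir, Path metadataFile) {
        return Files.isDirectory(containerDir)
                && Files.isReadable(containerDir)
                && Files.isReadable(metadataFile);
    }

    /** Stage 2: every block on disk must match the recorded length and checksum. */
    static boolean dataStage(Map<Path, BlockMeta> expectedBlocks) throws IOException {
        for (Map.Entry<Path, BlockMeta> entry : expectedBlocks.entrySet()) {
            byte[] data = Files.readAllBytes(entry.getKey());
            CRC32 crc = new CRC32();
            crc.update(data);
            BlockMeta expected = entry.getValue();
            if (data.length != expected.length() || crc.getValue() != expected.crc32()) {
                return false;   // one damaged block marks the whole container unhealthy
            }
        }
        return true;
    }
}
```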
On-Demand Scans
Trigger Conditions:
- Client read/write anomalies.
- Container problems that the periodic background scans have not yet detected.
Process:
- A client detects a container anomaly during a read.
- The data node marks the affected disk and container for immediate scanning.
- SCM coordinates re-replication (sketched below):
  - Copy the data from healthy replicas.
  - Delete the damaged copy on the faulty node.
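A minimal sketch of the on-demand trigger: a read error enqueues the container for an immediate scan instead of waiting for the next background pass. The queue-based design and names here are assumptions for illustration, not Ozone internals.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of an on-demand scan trigger fed by client read errors. */
final class OnDemandScanTrigger {

    private final BlockingQueue<Long> scanQueue = new LinkedBlockingQueue<>();
    private final Set<Long> alreadyQueued = ConcurrentHashMap.newKeySet();

    /** Called from the read path when a client hits a checksum or I/O error. */
    void onReadError(long containerId) {
        if (alreadyQueued.add(containerId)) {   // de-duplicate repeated errors
            scanQueue.add(containerId);
        }
    }

    /** The scanner thread drains the queue; confirmed damage is reported to SCM. */
    Long nextContainerToScan() throws InterruptedException {
        Long id = scanQueue.take();
        alreadyQueued.remove(id);
        return id;
    }
}
```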
Client Data Consistency Validation
Write Process:
- Generate checksum.
- Concurrently transmit to data nodes.
- Nodes persist data after verifying checksum matches.
Read Process:
- Nodes return data and checksum.
- The client recomputes the checksum and confirms success if it matches.
- On a mismatch, the client retries the read against another replica (see the sketch below).
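A minimal sketch of this end-to-end validation using CRC32; the method names are illustrative, and Ozone's actual checksum type and chunking are not modeled.

```java
import java.util.zip.CRC32;

/**
 * Sketch of end-to-end checksum validation: the writer computes a checksum that
 * travels with the data, the node verifies it before persisting, and the reader
 * verifies it again before accepting the bytes.
 */
final class ChecksumValidation {

    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Write side: the node persists the data only if the client's checksum matches. */
    static boolean acceptWrite(byte[] data, long clientChecksum) {
        return checksumOf(data) == clientChecksum;
    }

    /** Read side: accept the bytes only if the stored checksum matches the recomputed one. */
    static boolean acceptRead(byte[] data, long storedChecksum) {
        return checksumOf(data) == storedChecksum;   // on mismatch, retry another replica
    }
}
```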
Future Improvements
- Checksum Optimization: Reduce checksum validation overhead. Consider storing checksums in block headers for zero-copy transmission.
- Soft Fault Handling: Detect slow nodes and slow disks through SCM and Recon monitoring, and dynamically route data transfers away from them.
- Checksum Storage: Store checksums in block headers rather than separately in RocksDB, so data and checksums stay together and extra lookups and round trips are avoided.
Configuration and Monitoring
Tunable Parameters (an illustrative configuration sketch follows these lists):
- Scan frequency and bandwidth allocation.
- Error detection thresholds (e.g., sliding window length).
- Data replication strategies.
Monitoring Metrics:
- Bandwidth and frequency of scans.
- Number and types of errors detected.
- Disk/container failure rates.
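A sketch of how such tunables might be grouped, assuming hypothetical property names; these are placeholders, not Ozone's real configuration keys.

```java
import java.util.Properties;

/** Illustrative grouping of scanner tunables; all property names are placeholders. */
final class ScannerTuning {

    final long scanIntervalMinutes;
    final long scanBandwidthBytesPerSec;
    final int slidingWindowSize;
    final int failuresToMarkDisk;

    ScannerTuning(Properties props) {
        // Hypothetical keys with defaults matching the behavior described above.
        scanIntervalMinutes = Long.parseLong(props.getProperty("scanner.interval.minutes", "60"));
        scanBandwidthBytesPerSec = Long.parseLong(props.getProperty("scanner.bandwidth.bytes", "5242880"));
        slidingWindowSize = Integer.parseInt(props.getProperty("disk.check.window", "3"));
        failuresToMarkDisk = Integer.parseInt(props.getProperty("disk.check.failures", "2"));
    }
}
```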
Disk Failure Handling Details
Volume Failure Handling Mechanism
When a storage volume fails, the system re-replicates the affected data from the remaining replicas onto available space elsewhere in the cluster. Metadata is updated alongside the data migration, preserving data availability and cluster consistency. A simplified planning sketch follows.
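A simplified sketch of the re-replication planning step, assuming a flat list of candidate target nodes and ignoring real placement policies (racks, load, topology):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of re-replication planning after a volume failure: for every container
 * that lost a replica, pick a healthy source and a new target so the replica
 * count returns to the desired factor.
 */
final class ReplicationPlanner {

    record CopyTask(long containerId, String sourceNode, String targetNode) { }

    static List<CopyTask> plan(Map<Long, List<String>> healthyReplicas,
                               List<String> candidateTargets,
                               int replicationFactor) {
        List<CopyTask> tasks = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : healthyReplicas.entrySet()) {
            List<String> replicas = new ArrayList<>(entry.getValue());
            if (replicas.isEmpty()) {
                continue;   // no healthy source left; needs manual recovery
            }
            for (String target : candidateTargets) {
                if (replicas.size() >= replicationFactor) {
                    break;  // back to the desired replica count
                }
                if (!replicas.contains(target)) {
                    tasks.add(new CopyTask(entry.getKey(), replicas.get(0), target));
                    replicas.add(target);
                }
            }
        }
        return tasks;
    }
}
```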
Data Node Maintenance Modes
Data nodes offer two maintenance modes, Decommission and Maintenance; a simplified state sketch follows the two lists below.
Decommission Mode:
- Purpose: Permanently remove a node, triggering re-replication of its data.
- Behavior:
- The node is marked for removal; data is migrated from other replicas.
- The node cannot be re-enabled.
- Suitable for long-term node removal.
Maintenance Mode:
- Purpose: Temporarily disable a node to avoid triggering data re-replication.
- Behavior:
- The node is marked as maintenance; the system does not initiate data re-replication.
- The node can be brought back into service predictably once the work is done.
- Suitable for OS upgrades or short-term maintenance.
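A simplified sketch of the two paths as a state model; the state names are illustrative and deliberately smaller than Ozone's actual node state machine.

```java
/**
 * Sketch of the two removal paths: decommission drains a node and triggers
 * re-replication, while maintenance parks it temporarily without moving data.
 */
final class NodeAdminState {

    enum State { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

    /** Decommission: replicas are copied off first, then the node leaves the cluster. */
    static State decommission(State current) {
        return current == State.IN_SERVICE ? State.DECOMMISSIONING : current;
    }

    /** Maintenance: the node is parked; no re-replication is started for its data. */
    static State enterMaintenance(State current) {
        return current == State.IN_SERVICE ? State.IN_MAINTENANCE : current;
    }

    /** Only a node in maintenance comes straight back; decommission is treated as one-way. */
    static State returnToService(State current) {
        return current == State.IN_MAINTENANCE ? State.IN_SERVICE : current;
    }
}
```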
Metadata Node Decommission Mechanism
Metadata nodes (OM) support only Decommission mode because their cluster membership is static (a fixed set of three nodes). The process is as follows:
- Use Decommission CLI to stop the target node.
- Raft protocol updates the cluster state, marking the node as disabled.
- A new leader is elected; the cluster continues operation.
- After upgrades, use Recommission to rejoin the cluster.
- Repeat this process to rotate maintenance among the three nodes.
This design ensures metadata node stability and cluster high availability.