Efficient, Low Latency Ingestion to Large Tables via Apache Flink and Apache Iceberg

Introduction

In the era of real-time data processing, achieving low-latency ingestion into large-scale data tables is critical for modern data pipelines. Apache Flink and Apache Iceberg, both Apache Software Foundation projects, offer powerful capabilities for stream processing and structured data management. This article explores an optimized approach to ingesting data from Kafka into Iceberg tables using Flink, delivering sub-minute data availability for downstream consumers while addressing the performance bottlenecks caused by small-file proliferation and inefficient metadata management.

Core Challenges

Problem Scenario

Data streams from Kafka are ingested into Iceberg tables via Flink, but the default configuration produces an excessive number of small data files (thousands per hour) and metadata files, severely degrading query performance. Trino queries take over 20 seconds to execute because of inefficient file scanning and metadata parsing.

Root Causes

  1. Iceberg's Lack of Server-Side Management: Iceberg has no managed service maintaining the table; users must run maintenance themselves, so neglected tables end up with fragmented file layouts.
  2. Flink's Unoptimized Write Mechanism: Frequent checkpointing, with no coordination against Iceberg's compaction strategy, results in redundant writes and growing metadata overhead.

Solution Architecture

1. Flink Write Optimization

  • Two-Phase Commit Protocol:
    • Phase 1: Flink writers flush data files to storage (S3/HDFS) as they arrive, without registering them in table metadata yet.
    • Phase 2: After checkpoint confirmation, a committer registers the files into the Iceberg table, ensuring atomicity and avoiding data duplication.
  • Avoid Redundant Writes: Checkpoint-based coordination ensures each record is committed to the table exactly once, even across failures and restarts (see the sketch below).
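
A minimal sketch of this write path in Java, using the Iceberg Flink connector's FlinkSink builder, is shown below. The table location (s3a://warehouse/db/events) and the buildKafkaSource helper are illustrative placeholders; the essential point is that the checkpoint interval sets the commit cadence, because data files are only registered into the table when a checkpoint completes.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class KafkaToIcebergJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Commit cadence == checkpoint cadence: files written since the last
    // checkpoint are registered into the table only after it succeeds.
    env.enableCheckpointing(60_000L);

    // Hypothetical table location; in practice this comes from your catalog setup.
    TableLoader tableLoader = TableLoader.fromHadoopTable("s3a://warehouse/db/events");

    // Placeholder: assume the Kafka source is already deserialized into RowData.
    DataStream<RowData> events = buildKafkaSource(env);

    FlinkSink.forRowData(events)
        .tableLoader(tableLoader)
        .append();   // phase 1: writers flush data files; phase 2: committer commits per checkpoint

    env.execute("kafka-to-iceberg-ingest");
  }

  private static DataStream<RowData> buildKafkaSource(StreamExecutionEnvironment env) {
    throw new UnsupportedOperationException("wire up your Kafka source here");
  }
}
```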

2. Data Structure Optimization

  • Compaction Strategy:
    • Merges small files into larger ones to reduce I/O overhead during reads.
    • A monitor source periodically checks the table for changes and triggers compaction tasks when needed.
  • Partitioning Strategy Adjustment:
    • Hash Partitioning: Distributes data evenly across writers by key, preventing skew.
    • Shuffle Partitioning: Further rebalances records across writers, minimizing file proliferation (see the configuration sketch below).
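
As a rough configuration sketch, the distribution behavior can be set on the same sink builder via DistributionMode (or through the write.distribution-mode table property); the fragment below modifies the ingestion sketch above, and the writer parallelism of 8 is illustrative.

```java
import org.apache.iceberg.DistributionMode;
import org.apache.iceberg.flink.sink.FlinkSink;

// Same sink as in the ingestion sketch, with hash distribution enabled.
FlinkSink.forRowData(events)
    .tableLoader(tableLoader)
    .distributionMode(DistributionMode.HASH)  // route each partition's rows to a single writer
    .writeParallelism(8)                      // illustrative number of parallel writers
    .append();
```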

3. Flink V2 API Enhancements

  • New Components:
    • Trigger Manager: Controls compaction task timing based on workload patterns.
    • Compaction Module: Executes file merging and metadata updates in isolation.
  • Resource Isolation:
    • Separates compaction tasks from ingestion jobs to avoid resource contention (compaction can run as a standalone job, as sketched below).
    • Supports external schedulers (e.g., Airflow) for task orchestration.
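
The Trigger Manager and Compaction Module above are the newer maintenance components; as an illustration of the resource-isolation idea itself, the sketch below runs compaction as its own job using the Iceberg Flink Actions API, which an external scheduler such as Airflow could launch periodically. The table path and the 128 MB target file size are assumptions.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

public class CompactEventsTable {
  public static void main(String[] args) {
    // Load the same (hypothetical) table the ingestion job writes to.
    TableLoader loader = TableLoader.fromHadoopTable("s3a://warehouse/db/events");
    loader.open();
    Table table = loader.loadTable();

    // Rewrite small data files into ~128 MB files. Running this as a separate
    // job keeps compaction from competing with ingestion for task slots.
    Actions.forTable(table)
        .rewriteDataFiles()
        .targetSizeInBytes(128L * 1024 * 1024)
        .execute();
  }
}
```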

4. Iceberg Table Maintenance

  • Maintenance Tasks:
    • Expire Snapshots: Removes obsolete snapshots and associated files.
    • Rewrite Manifest Files: Merges small manifest files to reduce metadata complexity.
    • Delete Orphan Files: Removes data and metadata files that are no longer referenced by any snapshot (e.g., left over from failed commits).
  • Automated Workflow:
    • Monitor sources continuously watch for table changes and trigger maintenance via the Trigger Manager (see the maintenance sketch after this list).
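
The sketch below shows the table operations these maintenance tasks correspond to, expressed with Iceberg's core Table API; in the automated workflow the Trigger Manager schedules them, and the one-day retention window, the 10-snapshot minimum, and the 8 MB manifest threshold are illustrative values.

```java
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;

public class EventsTableMaintenance {
  public static void main(String[] args) {
    // Hypothetical table path, matching the earlier sketches.
    TableLoader loader = TableLoader.fromHadoopTable("s3a://warehouse/db/events");
    loader.open();
    Table table = loader.loadTable();

    // Expire snapshots older than one day while retaining at least the last 10;
    // files referenced only by expired snapshots become removable.
    table.expireSnapshots()
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1))
        .retainLast(10)
        .commit();

    // Merge small manifests (here: under 8 MB) to keep metadata scanning cheap.
    table.rewriteManifests()
        .rewriteIf(manifest -> manifest.length() < 8L * 1024 * 1024)
        .commit();
  }
}
```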

Implementation Details

Flink Write Pipeline

  1. Data flows through writers to generate data files.
  2. A committer registers files into Iceberg tables after checkpoint confirmation.
  3. Each checkpoint commit produces a new snapshot and metadata file for versioning (the resulting snapshot history can be inspected as shown below).
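
Because each successful checkpoint commit adds a snapshot (and a new metadata file) to the table, the commit cadence can be verified by listing the snapshot history. The fragment below assumes a table handle loaded as in the maintenance sketch above.

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

// One snapshot per successful checkpoint commit, so the timestamps should
// track the configured checkpoint interval.
for (Snapshot snapshot : table.snapshots()) {
  System.out.printf("snapshot=%d operation=%s committedAt=%d%n",
      snapshot.snapshotId(), snapshot.operation(), snapshot.timestampMillis());
}
```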

Compaction Workflow

  1. Monitor sources emit a record every 10 minutes to check for table changes.
  2. The Trigger Manager evaluates whether compaction is needed.
  3. The Compaction Module merges files, producing a new, compacted version of the data files (a simplified trigger loop is sketched below).
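
The monitor source and Trigger Manager package this scheduling inside the Flink job itself; as a rough, hand-rolled approximation of the same idea, the sketch below polls the table on a fixed cadence and triggers a rewrite once the data-file count passes a threshold. The 10-minute interval matches the cadence mentioned above, while the threshold and the helper class are purely illustrative.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;

public class SimpleCompactionTrigger {
  private static final int MAX_DATA_FILES = 120;  // illustrative threshold

  public static void schedule(Table table) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      table.refresh();
      // "total-data-files" is a standard summary property on committed snapshots.
      String totalFiles = table.currentSnapshot() == null
          ? "0"
          : table.currentSnapshot().summary().getOrDefault("total-data-files", "0");
      if (Integer.parseInt(totalFiles) > MAX_DATA_FILES) {
        Actions.forTable(table).rewriteDataFiles().execute();
      }
    }, 0, 10, TimeUnit.MINUTES);
  }
}
```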

Performance Gains

  • File Reduction: From 120 files to 1 file post-compaction.
  • Query Latency: Reduced by over 90% for Trino queries.
  • Storage Cost: Lowered through elimination of redundant files.

Technical Challenges and Solutions

Small File Problem

  • Cause: Frequent checkpoints generate excessive data and metadata files.
  • Solution: Compaction merges files, reducing storage overhead and improving read efficiency.

Partition Skew

  • Cause: When data skews into a single partition, the writer handling it is overloaded while the remaining writers sit idle.
  • Solution: Shuffle partitioning balances data distribution across writers.

Resource Contention

  • Cause: Compaction and ingestion tasks compete for cluster resources.
  • Solution: Isolate tasks into separate jobs and use external schedulers for coordination.

Test Results

Test Environment

  • Data Source: 300 randomly generated records.
  • Write Configuration: 8 writers, 1 committer.

Key Metrics

  • Original Files: 800 data files generated.
  • Post-Compaction: Reduced to 80 files.
  • Compaction Tasks: 4 executions, processing 10,000 files.
  • Final File Count: 80 files, with significant query performance improvement.

Conclusion

By leveraging Flink V2 API's compaction capabilities, Iceberg's metadata management, and optimized partitioning strategies, this solution effectively addresses low-latency ingestion challenges. The combination of resource isolation, automated maintenance, and workload-aware compaction ensures efficient data pipeline operations while minimizing storage costs. This approach provides a scalable framework for real-time data ingestion into large-scale Iceberg tables, suitable for high-throughput and latency-sensitive applications.