Introduction
Impala, an Apache Software Foundation project, has long been recognized for its ability to deliver fast SQL queries on Hadoop data. With the rise of Iceberg, an open table format designed for large-scale data lakes, the integration between Impala and Iceberg has become a critical area of focus. This article explores how Impala leverages Iceberg’s capabilities to optimize query performance, examines the challenges that arise in data processing, and highlights key insights from real-world testing scenarios.
System Architecture and Query Flow
Impala Architecture
Impala runs on a multi-node cluster comprising a Coordinator and Executors. The Coordinator handles query parsing, plan generation, and distributing work to the Executors, while the Executors perform data scanning, joins, and aggregations. Query plans are represented as trees of operators such as Scan, Join, and Exchange.
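For illustration only, such a plan tree can be modeled as a small recursive structure; the node names below are hypothetical and are not Impala’s actual C++ classes:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical plan node: each operator (Scan, Join, Exchange, ...) is a node in a
// tree, and rows flow from the leaves (scans) up to the root.
struct PlanNode {
  std::string op;  // e.g. "SCAN", "HASH JOIN", "EXCHANGE"
  std::vector<std::unique_ptr<PlanNode>> children;
};

std::unique_ptr<PlanNode> MakeNode(std::string op) {
  auto node = std::make_unique<PlanNode>();
  node->op = std::move(op);
  return node;
}

// A two-table join plan: an EXCHANGE on top of a HASH JOIN over two SCANs.
std::unique_ptr<PlanNode> ExamplePlan() {
  auto join = MakeNode("HASH JOIN");
  join->children.push_back(MakeNode("SCAN t1"));
  join->children.push_back(MakeNode("SCAN t2"));
  auto exchange = MakeNode("EXCHANGE");
  exchange->children.push_back(std::move(join));
  return exchange;
}
```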
Iceberg Table Format
Iceberg supports advanced deletion strategies, including:
- Copy-on-Write: Rewrites data files to remove the deleted records.
- Merge-on-Read: Writes deletion files and merges them with data files at read time.
Position deletes, which track file paths and row positions, require merging with data files during queries. This introduces complexity in data retrieval and processing.
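As a rough illustration (not Impala’s or Iceberg’s internal representation), a position delete record can be viewed as a (data file path, row position) pair:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative shape of a position delete record: it hides one row, identified by
// the data file it lives in and its 0-based position within that file.
struct PositionDelete {
  std::string data_file_path;  // a data file under the table location
  int64_t     position;        // row ordinal within that data file
};

// A position delete file is, conceptually, a collection of such records.
using PositionDeleteFile = std::vector<PositionDelete>;
```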
Data Reading Mechanism
Impala interacts with Iceberg through its API to fetch data and deletion files. The process varies based on the presence of deletion files:
- No Deletion Files: Directly scan data files.
- With Deletion Files: Perform a left anti join that merges data rows with deletion records, using virtual fields for file path and row position as the join keys, so that only rows without a matching delete survive (see the sketch below).
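A minimal sketch of that merge, assuming the deletion records have already been grouped per data file into a hash map; the types and function names are illustrative, not Impala’s actual code:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Virtual fields attached to every scanned row: the data file it came from and its
// position within that file (in addition to the user-visible columns).
struct ScannedRow {
  std::string file_path;
  int64_t     position;
  // ... user columns ...
};

// Left anti join: keep only rows whose (file_path, position) has no matching
// deletion record.
std::vector<ScannedRow> ApplyPositionDeletes(
    const std::vector<ScannedRow>& rows,
    const std::unordered_map<std::string, std::unordered_set<int64_t>>& deletes) {
  std::vector<ScannedRow> survivors;
  for (const ScannedRow& row : rows) {
    auto it = deletes.find(row.file_path);
    if (it == deletes.end() || it->second.count(row.position) == 0) {
      survivors.push_back(row);  // no deletion record matched, so the row is live
    }
  }
  return survivors;
}
```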
To reduce network overhead, Impala employs Direct Distribution Mode, using Iceberg’s scheduler to map each data file to a specific executor. This avoids the broadcast or partitioned joins that would otherwise be needed, significantly cutting data-transfer costs.
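The effect of that scheduling can be approximated by a deterministic file-to-executor assignment, so the deletion records for a data file are routed directly to the executor that scans it; the hash-based mapping below is a simplified assumption, not the actual scheduler:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Simplified stand-in for the file-to-executor mapping: because both the data file
// scan and the deletion records referencing that file resolve to the same executor,
// no broadcast or partitioned exchange of the deletion rows is needed.
int ExecutorForFile(const std::string& data_file_path, int num_executors) {
  const std::size_t h = std::hash<std::string>{}(data_file_path);
  return static_cast<int>(h % static_cast<std::size_t>(num_executors));
}
```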
Performance Testing and Results
Test Environment
- 40-node cluster with data stored on S3
- Test table: 8.5 billion records, 10% deleted (850 million)
- Query:
SELECT COUNT(*)
Baseline Performance
- Network transfer cost: 7 seconds (left data) + 4 seconds (right data) = 11 seconds
- Total query time: 21 seconds
Optimized Results
- Direct distribution mode reduced data transfer to 825 million records
- Total query time dropped to 12 seconds (a ~43% improvement)
Large-Scale Test Case
Test Challenges
- 1-trillion-row table (Station and Measure columns)
- Query:
SELECT ... GROUP BY
- Operations: Add timestamp and sensor type fields, use Iceberg partitioning (day truncation and bucket hashing), and apply Merge-on-Read for deletions (a simplified partition-transform sketch follows this list)
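As a simplified illustration of those partition transforms (Iceberg’s real bucket transform uses a specific Murmur3-based hash, so std::hash here is only a stand-in), day truncation collapses a timestamp to its day and bucketing spreads a key across a fixed number of buckets:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Day truncation: map a microsecond timestamp to whole days since the epoch, so all
// rows from the same day land in the same partition.
int64_t DayPartition(int64_t timestamp_micros) {
  constexpr int64_t kMicrosPerDay = 24LL * 60 * 60 * 1000 * 1000;
  return timestamp_micros / kMicrosPerDay;
}

// Bucket hashing: spread a high-cardinality key (e.g. the sensor type) over a fixed
// number of buckets to keep partition sizes balanced.
int32_t BucketPartition(const std::string& key, int32_t num_buckets) {
  return static_cast<int32_t>(
      std::hash<std::string>{}(key) % static_cast<std::size_t>(num_buckets));
}
```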
Results
- No deletions: 2.5 minutes
- 700 million deletions: 7 minutes 15 seconds (3x slower)
Query Execution Details and Optimization
Join Operation Mechanism
Impala builds a hash table that maps each data file to a vector of deleted row positions. The join cannot start probing until the Build operator has finished, while Union operators allow the data and deletion file scans to run in parallel.
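A minimal sketch of that build structure, assuming deletion records arrive as (file path, position) pairs; the names are illustrative rather than Impala’s internals:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Build side: map each data file path to the sorted positions deleted from it.
using DeletePositionMap = std::unordered_map<std::string, std::vector<int64_t>>;

DeletePositionMap BuildDeleteMap(
    const std::vector<std::pair<std::string, int64_t>>& deletion_records) {
  DeletePositionMap map;
  for (const auto& record : deletion_records) {
    // Repeated push_back on a plain std::vector is where reallocation costs appear.
    map[record.first].push_back(record.second);
  }
  for (auto& entry : map) {
    std::sort(entry.second.begin(), entry.second.end());
  }
  return map;
}

// Probe side: a scanned row is filtered out if its position appears in the vector
// belonging to its data file.
bool IsDeleted(const DeletePositionMap& map, const std::string& file_path,
               int64_t position) {
  auto it = map.find(file_path);
  return it != map.end() &&
         std::binary_search(it->second.begin(), it->second.end(), position);
}
```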
Performance Bottlenecks
- Hash table construction with C++ vectors caused frequent memory reallocations
- Flame Graph analysis revealed inefficiencies in data structure creation
Optimization Strategies
- Replace a single flat std::vector with a vector of vectors, so that growth allocates a new chunk instead of reallocating and copying the entire buffer (see the sketch after this list)
- Batch process position records to reduce per-record overhead
- Parallelize final sorting stages based on Join and Scan fragments
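A sketch of that chunked layout, under the assumption that positions are appended in batches; the chunk size and class name are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Chunked position store ("vector of vectors"): growing it allocates a new chunk
// instead of reallocating and copying everything accumulated so far.
class ChunkedPositions {
 public:
  explicit ChunkedPositions(std::size_t chunk_capacity = 1 << 16)
      : chunk_capacity_(chunk_capacity) {}

  // Batch append: copy a whole run of positions chunk by chunk, avoiding
  // per-record bookkeeping and repeated capacity growth.
  void AppendBatch(const int64_t* positions, std::size_t count) {
    std::size_t offset = 0;
    while (offset < count) {
      if (chunks_.empty() || chunks_.back().size() == chunk_capacity_) {
        chunks_.emplace_back();
        chunks_.back().reserve(chunk_capacity_);  // one allocation per chunk
      }
      std::vector<int64_t>& chunk = chunks_.back();
      std::size_t n = std::min(count - offset, chunk_capacity_ - chunk.size());
      chunk.insert(chunk.end(), positions + offset, positions + offset + n);
      offset += n;
    }
  }

 private:
  std::size_t chunk_capacity_;
  std::vector<std::vector<int64_t>> chunks_;
};
```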
Exchange Operation Optimization
Problem Analysis
Impala serialized position records as strings, leading to redundant memory allocations and copies. Duplicate string values further exacerbated performance issues.
Optimization Methods
- Deduplicate neighboring string values (the same file path repeated across adjacent records) to reduce memory usage and copies (see the sketch after this list)
- Local testing showed significant performance gains from this approach
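A sketch of the idea, assuming position records are serialized in file order so the same path tends to repeat across adjacent records; the run-based layout below is illustrative, not Impala’s wire format:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct PositionRecord {
  std::string file_path;
  int64_t     position;
};

// Run-based form: the file path is stored once per run of adjacent records that
// share it, so long runs of identical paths are allocated and copied only once.
struct SerializedRun {
  std::string          file_path;
  std::vector<int64_t> positions;
};

std::vector<SerializedRun> DeduplicateNeighbors(
    const std::vector<PositionRecord>& records) {
  std::vector<SerializedRun> runs;
  for (const PositionRecord& rec : records) {
    if (runs.empty() || runs.back().file_path != rec.file_path) {
      runs.push_back(SerializedRun{rec.file_path, {}});  // path changed: new run
    }
    runs.back().positions.push_back(rec.position);
  }
  return runs;
}
```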
Future Improvements
Roaring BitMap Integration
Plan to replace the sorted int64 position vectors with Roaring bitmaps to reduce memory consumption and improve Join/Build speeds.
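For illustration, such a bitmap could be built with the CRoaring library’s 64-bit C++ wrapper; this is an assumed usage sketch, not Impala’s planned implementation:

```cpp
#include <cstdint>
#include <vector>

#include "roaring/roaring64map.hh"  // CRoaring's 64-bit C++ wrapper

// Build a compressed bitmap of deleted positions instead of a sorted vector<int64_t>.
roaring::Roaring64Map BuildDeleteBitmap(const std::vector<uint64_t>& positions) {
  roaring::Roaring64Map bitmap;
  for (uint64_t pos : positions) {
    bitmap.add(pos);
  }
  bitmap.runOptimize();  // compress dense runs of consecutive positions
  return bitmap;
}

// Membership test replaces a binary search over the sorted position vector.
bool IsDeleted(const roaring::Roaring64Map& bitmap, uint64_t position) {
  return bitmap.contains(position);
}
```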
Additional Optimizations
- Further refine data structures and parallel processing strategies
Benchmark Comparison (TPC-DS)
Test Setup
- TPC-DS dataset at scale factor 1,000
- 1 Coordinator, 10 Executors, no caching
- Same instance type for Impala and Engine X
Results
- Total Execution Time: Impala ~15 minutes, Engine X >20 minutes (Impala ~1.3x faster overall)
- Individual Query Performance: Impala generally outperformed Engine X, with minimal gaps in a few cases
Engine X Characteristics
- Vectorized engine with different data storage
- Distributed MPP architecture, but Impala showed better performance in this test
Conclusion
Impala’s integration with Iceberg V2 tables demonstrates significant performance gains through optimized data structures and parallel processing. While large-scale deletions introduce challenges, targeted optimizations can mitigate network and computational overhead. Future work will focus on further refining these mechanisms to enhance Iceberg’s query efficiency for big data workloads.