Impala on Iceberg: Performance Optimization and Integration Insights

Introduction

Apache Impala, an Apache Software Foundation project, has long been recognized for delivering fast SQL queries on data in Hadoop-compatible storage. With the rise of Apache Iceberg, an open table format designed for large-scale data lakes, the integration between the two has become a critical area of focus. This article explores how Impala leverages Iceberg’s capabilities to optimize query performance, addresses challenges in data processing, and highlights key insights from real-world testing scenarios.

System Architecture and Query Flow

Impala Architecture

Impala operates on a multi-node cluster comprising a Coordinator and multiple Executors. The Coordinator handles query parsing, plan generation, and the distribution of plan fragments to Executors, while Executors perform data scanning, joins, and aggregations. Query plans are represented as trees of operators such as Scan, Join, and Exchange.
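
As a rough, purely illustrative sketch of that plan shape (not Impala’s actual classes), a plan can be modeled as a tree of typed operator nodes:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Illustrative plan-tree model; the node kinds mirror the operators named above.
enum class NodeKind { Scan, Join, Exchange, Aggregate };

struct PlanNode {
  NodeKind kind{};
  std::string label;                                // e.g. table name or join type
  std::vector<std::unique_ptr<PlanNode>> children;  // child operators feed rows upward
};

std::unique_ptr<PlanNode> MakeNode(NodeKind kind, std::string label) {
  auto node = std::make_unique<PlanNode>();
  node->kind = kind;
  node->label = std::move(label);
  return node;
}

// Example shape: Aggregate <- Exchange <- Join <- { Scan(t1), Scan(t2) }
std::unique_ptr<PlanNode> ExamplePlan() {
  auto join = MakeNode(NodeKind::Join, "HASH JOIN");
  join->children.push_back(MakeNode(NodeKind::Scan, "t1"));
  join->children.push_back(MakeNode(NodeKind::Scan, "t2"));
  auto exchange = MakeNode(NodeKind::Exchange, "to coordinator");
  exchange->children.push_back(std::move(join));
  auto agg = MakeNode(NodeKind::Aggregate, "COUNT(*)");
  agg->children.push_back(std::move(exchange));
  return agg;
}
```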

Iceberg Table Format

Iceberg supports two strategies for handling row-level deletes:

  • Copy-on-Write: Rewrites the affected data files to physically remove the deleted records.
  • Merge-on-Read: Writes separate delete files that are merged with the data files at read time.

Position deletes, which track file paths and row positions, require merging with data files during queries. This introduces complexity in data retrieval and processing.

Data Reading Mechanism

Impala interacts with Iceberg through its API to fetch data and deletion files. The process varies based on the presence of deletion files:

  • No Deletion Files: Directly scan data files.
  • With Deletion Files: Perform a left anti join to filter data rows against the deletion records, adding virtual fields for file path and row position as the join keys (a minimal sketch follows this list).
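
To picture the effect of that anti join, here is a minimal, illustrative sketch (not Impala’s internals): a scanned row survives only if its (file path, row position) pair does not appear in any position-delete record.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// A position-delete record identifies a deleted row by the data file it lives in
// and its row position within that file (the two virtual fields mentioned above).
struct PositionDelete {
  std::string file_path;
  int64_t position;
};

// Illustrative "left anti join": return the row positions of data_file
// that are NOT listed in the delete records.
std::vector<int64_t> LiveRows(const std::string& data_file, int64_t num_rows,
                              const std::vector<PositionDelete>& deletes) {
  std::unordered_set<int64_t> deleted;
  for (const auto& d : deletes) {
    if (d.file_path == data_file) deleted.insert(d.position);
  }
  std::vector<int64_t> live;
  for (int64_t pos = 0; pos < num_rows; ++pos) {
    if (deleted.count(pos) == 0) live.push_back(pos);
  }
  return live;
}
```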

To reduce network overhead, Impala employs a directed distribution mode: the scheduler’s mapping of data files to executors is reused to route each position-delete record to the executor that scans the matching data file. This avoids broadcast or partitioned exchanges and significantly cuts data transfer costs.
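
As a simplified illustration of the routing idea (not Impala’s actual scheduler API): each data file is pinned to one executor, and the position-delete records that reference that file are sent to the same executor, so the join is co-located.

```cpp
#include <functional>
#include <string>

// Simplified placement rule: hash the data file path to pick an executor.
// Data rows and the position deletes that reference them share the same file
// path, so both sides land on the same executor and can be joined locally,
// avoiding a broadcast or partitioned exchange of the delete records.
int ExecutorFor(const std::string& data_file_path, int num_executors) {
  return static_cast<int>(std::hash<std::string>{}(data_file_path) % num_executors);
}
```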

Performance Testing and Results

Test Environment

  • 40-node cluster with data stored on S3
  • Test table: 8.5 billion records, 10% deleted (850 million)
  • Query: SELECT COUNT(*)

Baseline Performance

  • Network transfer cost: 7 seconds for the left join input + 4 seconds for the right join input = 11 seconds
  • Total query time: 21 seconds

Optimized Results

  • Directed distribution mode reduced data transfer to 825 million records
  • Total query time dropped to 12 seconds (42% improvement)

Large-Scale Test Case

Test Challenges

  • 1-trillion-row table with two columns (Station, Measure)
  • Query: SELECT ... GROUP BY
  • Operations: Add timestamp and sensor-type fields, use Iceberg partitioning transforms (day truncation, bucket hashing), and apply Merge-on-Read for deletions (a sketch of these transforms follows this list)
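
For intuition, the two partition transforms behave roughly as sketched below. This is an approximation: Iceberg’s real bucket transform is defined over a 32-bit Murmur3 hash, and std::hash is used here only as a stand-in.

```cpp
#include <cstdint>
#include <functional>

// Day truncation: map a microsecond timestamp to whole days since the Unix
// epoch, so all rows from the same day share one partition value.
int64_t DayPartition(int64_t timestamp_micros) {
  constexpr int64_t kMicrosPerDay = 86400LL * 1000 * 1000;
  return timestamp_micros / kMicrosPerDay;  // ignores pre-1970 timestamps for simplicity
}

// Bucket hashing: spread values across num_buckets partitions.  Iceberg
// specifies Murmur3-32 for this transform; std::hash is a placeholder here.
int32_t BucketPartition(int64_t value, int32_t num_buckets) {
  const uint64_t h = std::hash<int64_t>{}(value);
  return static_cast<int32_t>(h % static_cast<uint64_t>(num_buckets));
}
```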

Results

  • No deletions: 2.5 minutes
  • 700 million deletions: 7 minutes 15 seconds (3x slower)

Query Execution Details and Optimization

Join Operation Mechanism

Impala builds a hash table that maps each data file path to a vector of deleted row positions. This build side is blocking, so the join cannot start probing until it is complete, while a Union operator allows the data file and deletion file scans to run in parallel.
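
Conceptually, the build side produces something like the structure below (illustrative C++, not Impala’s build-side classes); the probe side then consults it once per scanned row.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Build side: accumulate, per data file path, the deleted row positions.
using DeleteIndex = std::unordered_map<std::string, std::vector<int64_t>>;

void AddDelete(DeleteIndex& index, const std::string& file_path, int64_t position) {
  // Growing these per-file vectors is the hot spot discussed below.
  index[file_path].push_back(position);
}

// Probe side: a scanned row survives if its position is not recorded for its file.
// Assumes each position vector was sorted after the build phase.
bool IsDeleted(const DeleteIndex& index, const std::string& file_path, int64_t position) {
  auto it = index.find(file_path);
  if (it == index.end()) return false;
  return std::binary_search(it->second.begin(), it->second.end(), position);
}
```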

Performance Bottlenecks

  • Building the per-file position vectors with plain C++ std::vector caused frequent memory reallocations and copies
  • Flame graph analysis pinpointed the inefficiency to the creation and growth of these data structures

Optimization Strategies

  • Replace the single flat std::vector per data file with a vector of fixed-size vectors (chunked storage), so growth allocates a new block instead of reallocating and copying all existing positions (see the sketch after this list)
  • Batch-process position records to reduce per-record overhead
  • Parallelize the final sorting stage across the Join and Scan fragments
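
A minimal sketch of the chunked-growth idea mentioned in the first point (illustrative only): appends allocate one new fixed-size block at a time instead of reallocating and copying the whole array.

```cpp
#include <cstdint>
#include <vector>

// Chunked position store: a vector of fixed-size blocks.  Appending never
// relocates existing elements; at worst it allocates one new block, so the
// large reallocation-and-copy cycles of a single flat std::vector are avoided.
class ChunkedPositions {
 public:
  explicit ChunkedPositions(size_t block_size = 4096) : block_size_(block_size) {}

  void Append(int64_t pos) {
    if (blocks_.empty() || blocks_.back().size() == block_size_) {
      blocks_.emplace_back();
      blocks_.back().reserve(block_size_);  // one allocation, amortized over block_size_ appends
    }
    blocks_.back().push_back(pos);
  }

  size_t size() const {
    size_t n = 0;
    for (const auto& b : blocks_) n += b.size();
    return n;
  }

 private:
  size_t block_size_;
  std::vector<std::vector<int64_t>> blocks_;
};
```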

Exchange Operation Optimization

Problem Analysis

Impala serialized position records with their file paths as plain strings, leading to redundant memory allocations and copies. Because many adjacent records reference the same data file, the same string values were serialized repeatedly, further exacerbating the problem.

Optimization Methods

  • Deduplicate neighboring identical string values during serialization, so each file path is allocated and copied once per run rather than once per record (a sketch follows this list)
  • Local testing showed significant performance gains from this approach
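
A sketch of the deduplication idea (illustrative record layout, not Impala’s actual wire format): records arriving from the same data file carry identical file-path strings and tend to be adjacent, so the serializer can write each path once per run instead of once per record.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct PositionRecord {
  std::string file_path;  // identical for long runs of adjacent records
  int64_t position;
};

// Run-length encode adjacent identical file paths: each run is stored once
// together with its positions, so repeated strings are neither copied nor
// allocated again.
struct EncodedRun {
  std::string file_path;
  std::vector<int64_t> positions;
};

std::vector<EncodedRun> EncodeAdjacentDuplicates(const std::vector<PositionRecord>& records) {
  std::vector<EncodedRun> runs;
  for (const auto& r : records) {
    if (runs.empty() || runs.back().file_path != r.file_path) {
      runs.push_back({r.file_path, {}});
    }
    runs.back().positions.push_back(r.position);
  }
  return runs;
}
```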

Future Improvements

Roaring BitMap Integration

There are plans to replace the sorted int64 position vectors with Roaring bitmaps to reduce memory consumption and improve join build and probe speed.
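
A hedged sketch of what that substitution could look like, assuming the CRoaring C++ wrapper (roaring::Roaring64Map) as the bitmap implementation:

```cpp
#include <cstdint>
#include <vector>

#include "roaring/roaring64map.hh"  // CRoaring C++ wrapper (assumed dependency)

// Replace a sorted std::vector<int64_t> of deleted positions with a Roaring
// bitmap: positions are non-negative and often clustered, which compresses well.
roaring::Roaring64Map BuildDeleteBitmap(const std::vector<int64_t>& deleted_positions) {
  roaring::Roaring64Map bitmap;
  for (int64_t pos : deleted_positions) {
    bitmap.add(static_cast<uint64_t>(pos));
  }
  return bitmap;
}

// Probe: a cheap membership check instead of a binary search over sorted int64s.
bool IsDeleted(const roaring::Roaring64Map& bitmap, int64_t position) {
  return bitmap.contains(static_cast<uint64_t>(position));
}
```

Besides the smaller footprint, bitmap membership and intersection operations avoid the per-element comparisons of searching a sorted int64 array.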

Additional Optimizations

  • Further refine data structures and parallel processing strategies

Benchmark Comparison (TPC-DS)

Test Setup

  • 1000-scale TPC-DS dataset
  • 1 Coordinator, 10 Executors, no caching
  • Same instance type for Impala and Engine X

Results

  • Total Execution Time: Impala ~15 minutes, Engine X >20 minutes (Impala roughly 1.3x faster overall)
  • Individual Query Performance: Impala generally outperformed Engine X, with minimal gaps in a few cases

Engine X Characteristics

  • Vectorized engine with different data storage
  • Distributed MPP architecture, but Impala showed better performance in this test

Conclusion

Impala’s integration with Iceberg V2 tables demonstrates significant performance gains through optimized data structures and parallel processing. While large-scale deletions introduce challenges, targeted optimizations can mitigate the network and computational overhead. Future work will focus on further refining these mechanisms to improve query efficiency on Iceberg tables for big-data workloads.