Impala on Iceberg: Performance Optimization and Integration Insights

Introduction

Apache Impala, an Apache Software Foundation project, has long been recognized for delivering fast SQL queries on data in Hadoop-compatible storage. With the rise of Apache Iceberg, an open table format designed for large-scale data lakes, the integration between the two has become a critical area of focus. This article explores how Impala leverages Iceberg’s capabilities to optimize query performance, addresses challenges in data processing, and highlights key insights from real-world testing scenarios.

System Architecture and Query Flow

Impala Architecture

Impala operates on a multi-node cluster comprising a Coordinator and multiple Executors. The Coordinator handles query parsing, plan generation, and the distribution of plan fragments to Executors, while Executors perform data scanning, joins, and aggregations. Query plans are represented as trees of operators such as Scan, Join, and Exchange.
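
As a rough, purely illustrative sketch of that plan shape (not Impala’s actual classes), a plan can be modeled as a tree of typed operator nodes:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Illustrative plan-tree model; the node kinds mirror the operators named above.
enum class NodeKind { Scan, Join, Exchange, Aggregate };

struct PlanNode {
  NodeKind kind{};
  std::string label;                                // e.g. table name or join type
  std::vector<std::unique_ptr<PlanNode>> children;  // child operators feed rows upward
};

std::unique_ptr<PlanNode> MakeNode(NodeKind kind, std::string label) {
  auto node = std::make_unique<PlanNode>();
  node->kind = kind;
  node->label = std::move(label);
  return node;
}

// Example shape: Aggregate <- Exchange <- Join <- { Scan(t1), Scan(t2) }
std::unique_ptr<PlanNode> ExamplePlan() {
  auto join = MakeNode(NodeKind::Join, "HASH JOIN");
  join->children.push_back(MakeNode(NodeKind::Scan, "t1"));
  join->children.push_back(MakeNode(NodeKind::Scan, "t2"));
  auto exchange = MakeNode(NodeKind::Exchange, "to coordinator");
  exchange->children.push_back(std::move(join));
  auto agg = MakeNode(NodeKind::Aggregate, "COUNT(*)");
  agg->children.push_back(std::move(exchange));
  return agg;
}
```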

Iceberg Table Format

Iceberg supports two strategies for handling row-level deletes:

  • Copy-on-Write: Rewrites the affected data files to physically remove the deleted records.
  • Merge-on-Read: Writes separate delete files that are merged with the data files at read time.

Position deletes, which track file paths and row positions, require merging with data files during queries. This introduces complexity in data retrieval and processing.

Data Reading Mechanism

Impala interacts with Iceberg through its API to fetch data and deletion files. The process varies based on the presence of deletion files:

  • No Deletion Files: Directly scan data files.
  • With Deletion Files: Perform a left anti join to filter data rows against the deletion records, adding virtual fields for file path and row position as the join keys (a minimal sketch follows this list).
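
To picture the effect of that anti join, here is a minimal, illustrative sketch (not Impala’s internals): a scanned row survives only if its (file path, row position) pair does not appear in any position-delete record.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// A position-delete record identifies a deleted row by the data file it lives in
// and its row position within that file (the two virtual fields mentioned above).
struct PositionDelete {
  std::string file_path;
  int64_t position;
};

// Illustrative "left anti join": return the row positions of data_file
// that are NOT listed in the delete records.
std::vector<int64_t> LiveRows(const std::string& data_file, int64_t num_rows,
                              const std::vector<PositionDelete>& deletes) {
  std::unordered_set<int64_t> deleted;
  for (const auto& d : deletes) {
    if (d.file_path == data_file) deleted.insert(d.position);
  }
  std::vector<int64_t> live;
  for (int64_t pos = 0; pos < num_rows; ++pos) {
    if (deleted.count(pos) == 0) live.push_back(pos);
  }
  return live;
}
```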

To reduce network overhead, Impala employs a directed distribution mode: the scheduler’s mapping of data files to executors is reused to route each position-delete record to the executor that scans the matching data file. This avoids broadcast or partitioned exchanges and significantly cuts data transfer costs.
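
As a simplified illustration of the routing idea (not Impala’s actual scheduler API): each data file is pinned to one executor, and the position-delete records that reference that file are sent to the same executor, so the join is co-located.

```cpp
#include <functional>
#include <string>

// Simplified placement rule: hash the data file path to pick an executor.
// Data rows and the position deletes that reference them share the same file
// path, so both sides land on the same executor and can be joined locally,
// avoiding a broadcast or partitioned exchange of the delete records.
int ExecutorFor(const std::string& data_file_path, int num_executors) {
  return static_cast<int>(std::hash<std::string>{}(data_file_path) % num_executors);
}
```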

Performance Testing and Results

Test Environment

  • 40-node cluster with data stored on S3
  • Test table: 8.5 billion records, 10% deleted (850 million)
  • Query: SELECT COUNT(*)

Baseline Performance

  • Network transfer cost: 7 seconds for the left join input + 4 seconds for the right join input = 11 seconds
  • Total query time: 21 seconds

Optimized Results

  • Directed distribution mode reduced data transfer to 825 million records
  • Total query time dropped to 12 seconds (42% improvement)

Large-Scale Test Case

Test Challenges

  • 1-trillion-row table with two columns (Station, Measure)
  • Query: SELECT ... GROUP BY
  • Operations: Add timestamp and sensor-type fields, use Iceberg partitioning transforms (day truncation, bucket hashing), and apply Merge-on-Read for deletions (a sketch of these transforms follows this list)
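
For intuition, the two partition transforms behave roughly as sketched below. This is an approximation: Iceberg’s real bucket transform is defined over a 32-bit Murmur3 hash, and std::hash is used here only as a stand-in.

```cpp
#include <cstdint>
#include <functional>

// Day truncation: map a microsecond timestamp to whole days since the Unix
// epoch, so all rows from the same day share one partition value.
int64_t DayPartition(int64_t timestamp_micros) {
  constexpr int64_t kMicrosPerDay = 86400LL * 1000 * 1000;
  return timestamp_micros / kMicrosPerDay;  // ignores pre-1970 timestamps for simplicity
}

// Bucket hashing: spread values across num_buckets partitions.  Iceberg
// specifies Murmur3-32 for this transform; std::hash is a placeholder here.
int32_t BucketPartition(int64_t value, int32_t num_buckets) {
  const uint64_t h = std::hash<int64_t>{}(value);
  return static_cast<int32_t>(h % static_cast<uint64_t>(num_buckets));
}
```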

Results

  • No deletions: 2.5 minutes
  • 700 million deletions: 7 minutes 15 seconds (3x slower)

Query Execution Details and Optimization

Join Operation Mechanism

Impala builds a hash table that maps each data file path to a vector of deleted row positions. This build side is blocking, so the join cannot start probing until it is complete, while a Union operator allows the data file and deletion file scans to run in parallel.
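
Conceptually, the build side produces something like the structure below (illustrative C++, not Impala’s build-side classes); the probe side then consults it once per scanned row.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Build side: accumulate, per data file path, the deleted row positions.
using DeleteIndex = std::unordered_map<std::string, std::vector<int64_t>>;

void AddDelete(DeleteIndex& index, const std::string& file_path, int64_t position) {
  // Growing these per-file vectors is the hot spot discussed below.
  index[file_path].push_back(position);
}

// Probe side: a scanned row survives if its position is not recorded for its file.
// Assumes each position vector was sorted after the build phase.
bool IsDeleted(const DeleteIndex& index, const std::string& file_path, int64_t position) {
  auto it = index.find(file_path);
  if (it == index.end()) return false;
  return std::binary_search(it->second.begin(), it->second.end(), position);
}
```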

Performance Bottlenecks

  • Building the per-file position vectors with plain C++ std::vector caused frequent memory reallocations and copies
  • Flame graph analysis pinpointed the inefficiency to the creation and growth of these data structures

Optimization Strategies

  • Replace the single flat std::vector per data file with a vector of fixed-size vectors (chunked storage), so growth allocates a new block instead of reallocating and copying all existing positions (see the sketch after this list)
  • Batch-process position records to reduce per-record overhead
  • Parallelize the final sorting stage across the Join and Scan fragments
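
A minimal sketch of the chunked-growth idea mentioned in the first point (illustrative only): appends allocate one new fixed-size block at a time instead of reallocating and copying the whole array.

```cpp
#include <cstdint>
#include <vector>

// Chunked position store: a vector of fixed-size blocks.  Appending never
// relocates existing elements; at worst it allocates one new block, so the
// large reallocation-and-copy cycles of a single flat std::vector are avoided.
class ChunkedPositions {
 public:
  explicit ChunkedPositions(size_t block_size = 4096) : block_size_(block_size) {}

  void Append(int64_t pos) {
    if (blocks_.empty() || blocks_.back().size() == block_size_) {
      blocks_.emplace_back();
      blocks_.back().reserve(block_size_);  // one allocation, amortized over block_size_ appends
    }
    blocks_.back().push_back(pos);
  }

  size_t size() const {
    size_t n = 0;
    for (const auto& b : blocks_) n += b.size();
    return n;
  }

 private:
  size_t block_size_;
  std::vector<std::vector<int64_t>> blocks_;
};
```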

Exchange Operation Optimization

Problem Analysis

Impala serialized position records with their file paths as plain strings, leading to redundant memory allocations and copies. Because many adjacent records reference the same data file, the same string values were serialized repeatedly, further exacerbating the problem.

Optimization Methods

  • Deduplicate neighboring identical string values during serialization, so each file path is allocated and copied once per run rather than once per record (a sketch follows this list)
  • Local testing showed significant performance gains from this approach
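
A sketch of the deduplication idea (illustrative record layout, not Impala’s actual wire format): records arriving from the same data file carry identical file-path strings and tend to be adjacent, so the serializer can write each path once per run instead of once per record.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct PositionRecord {
  std::string file_path;  // identical for long runs of adjacent records
  int64_t position;
};

// Run-length encode adjacent identical file paths: each run is stored once
// together with its positions, so repeated strings are neither copied nor
// allocated again.
struct EncodedRun {
  std::string file_path;
  std::vector<int64_t> positions;
};

std::vector<EncodedRun> EncodeAdjacentDuplicates(const std::vector<PositionRecord>& records) {
  std::vector<EncodedRun> runs;
  for (const auto& r : records) {
    if (runs.empty() || runs.back().file_path != r.file_path) {
      runs.push_back({r.file_path, {}});
    }
    runs.back().positions.push_back(r.position);
  }
  return runs;
}
```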

Future Improvements

Roaring BitMap Integration

There are plans to replace the sorted int64 position vectors with Roaring bitmaps to reduce memory consumption and improve join build and probe speed.
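
A hedged sketch of what that substitution could look like, assuming the CRoaring C++ wrapper (roaring::Roaring64Map) as the bitmap implementation:

```cpp
#include <cstdint>
#include <vector>

#include "roaring/roaring64map.hh"  // CRoaring C++ wrapper (assumed dependency)

// Replace a sorted std::vector<int64_t> of deleted positions with a Roaring
// bitmap: positions are non-negative and often clustered, which compresses well.
roaring::Roaring64Map BuildDeleteBitmap(const std::vector<int64_t>& deleted_positions) {
  roaring::Roaring64Map bitmap;
  for (int64_t pos : deleted_positions) {
    bitmap.add(static_cast<uint64_t>(pos));
  }
  return bitmap;
}

// Probe: a cheap membership check instead of a binary search over sorted int64s.
bool IsDeleted(const roaring::Roaring64Map& bitmap, int64_t position) {
  return bitmap.contains(static_cast<uint64_t>(position));
}
```

Besides the smaller footprint, bitmap membership and intersection operations avoid the per-element comparisons of searching a sorted int64 array.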

Additional Optimizations

  • Further refine data structures and parallel processing strategies

Benchmark Comparison (TPC-DS)

Test Setup

  • 1000-scale TPC-DS dataset
  • 1 Coordinator, 10 Executors, no caching
  • Same instance type for Impala and Engine X

Results

  • Total Execution Time: Impala ~15 minutes, Engine X >20 minutes (Impala roughly 1.3x faster overall)
  • Individual Query Performance: Impala generally outperformed Engine X, with minimal gaps in a few cases

Engine X Characteristics

  • Vectorized engine with different data storage
  • Distributed MPP architecture, but Impala showed better performance in this test

Conclusion

Impala’s integration with Iceberg V2 tables demonstrates significant performance gains through optimized data structures and parallel processing. While large-scale deletions introduce challenges, targeted optimizations can mitigate the network and computational overhead. Future work will focus on further refining these mechanisms to improve query efficiency on Iceberg tables for big-data workloads.