This page explains the architecture and workflow for scanning Iceberg tables and reading data files in the C++ implementation. It covers the complete read path from configuring a table scan through planning file tasks to reading Arrow batches.
The query planning system is organized into several subsystems, each documented in detail in child pages:
- TableScanBuilder configuration API, TableScan and DataTableScan implementations
- FileScanTask structure, ArrowArrayStream integration for reading data

For related topics, see page 4 (Table Metadata System) for snapshot and manifest structure, and page 8 (File Format Support) for Parquet and Avro reader implementations.
The query planning pipeline transforms a scan configuration into executable file read tasks through several stages:
Pipeline Stages
| Stage | Components | Purpose |
|---|---|---|
| Configuration | TableScanBuilder | Accumulates scan parameters: filters, projections, snapshot selection, options |
| Planning | TableScan, ManifestGroup | Loads manifests, applies multi-level filtering, produces FileScanTask objects |
| Execution | FileScanTask, Reader, ArrowArrayStream | Opens data files and streams Arrow batches to the user |
The architecture emphasizes:
- TableScan is immutable once built, enabling safe concurrent use

Sources: src/iceberg/table_scan.h129-332 src/iceberg/table_scan.cc213-502
Query Planning Class Hierarchy
Sources: src/iceberg/table_scan.h104-332 src/iceberg/table_scan.cc213-502
Key Components
| Component | Class/Struct | Location | Purpose |
|---|---|---|---|
| Scan builder | TableScanBuilder | src/iceberg/table_scan.h129-268 | Fluent API for configuring scans |
| Scan base | TableScan | src/iceberg/table_scan.h271-314 | Abstract scan with common functionality |
| Data scan | DataTableScan | src/iceberg/table_scan.h317-331 | Concrete implementation using ManifestGroup |
| Scan task | FileScanTask | src/iceberg/table_scan.h60-101 | Executable unit: one data file + deletes + filter |
| Scan context | internal::TableScanContext | src/iceberg/table_scan.h106-125 | Holds all scan parameters |
| Manifest processor | ManifestGroup | manifest/manifest_group.h | Coordinates manifest reading and filtering |
Sources: src/iceberg/table_scan.h60-332 src/iceberg/table_scan.cc213-502
End-to-End Scan Flow
Sources: src/iceberg/table_scan.cc213-372 src/iceberg/table_scan.cc475-502
Pipeline Stages
The scan execution progresses through distinct stages:
1. Configuration Stage (TableScanBuilder)
2. Planning Stage (DataTableScan::PlanFiles()) — builds a ManifestGroup with the scan configuration and produces FileScanTask objects
3. Execution Stage (FileScanTask::ToArrow()) — opens the data file and returns an ArrowArrayStream

Sources: src/iceberg/table_scan.h129-332 src/iceberg/table_scan.cc213-502
The TableScanBuilder provides a fluent API for configuring table scans. It accumulates parameters in a TableScanContext and validates them before producing an immutable TableScan.
Builder Pattern Overview
Common Configuration Patterns
| Configuration | Builder Calls | Result |
|---|---|---|
| Full table scan | Make(metadata, io).Build() | All columns, current snapshot, no filter |
| Filtered scan | Filter(expr).Build() | Apply predicate during planning and reading |
| Column subset | Select({"col1", "col2"}).Build() | Only selected columns plus filter columns |
| Time travel | UseSnapshot(id).Build() | Scan specific snapshot by ID |
| Case-insensitive | CaseSensitive(false).Build() | Case-insensitive column name matching |
For detailed documentation of all builder methods, validation rules, and schema resolution logic, see page 7.1 (Table Scan).
Sources: src/iceberg/table_scan.h129-268 src/iceberg/table_scan.cc213-372 src/iceberg/test/table_scan_test.cc267-325
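The configuration patterns above can be sketched with a minimal toy builder. This is illustrative only: the class and member names (ToyScanBuilder, snapshot_id_, and so on) are stand-ins, not the actual iceberg-cpp API, but the shape — chained setters accumulating state, then Build() freezing it — mirrors the pattern described.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative stand-in for TableScanBuilder: each setter returns *this so
// calls can be chained, and Build() freezes the accumulated state.
class ToyScanBuilder {
 public:
  ToyScanBuilder& Select(std::vector<std::string> columns) {
    columns_ = std::move(columns);
    return *this;
  }
  ToyScanBuilder& Filter(std::string expr) {
    filter_ = std::move(expr);
    return *this;
  }
  ToyScanBuilder& UseSnapshot(int64_t id) {
    snapshot_id_ = id;
    return *this;
  }
  ToyScanBuilder& CaseSensitive(bool v) {
    case_sensitive_ = v;
    return *this;
  }

  struct ToyScan {  // immutable result of Build()
    std::vector<std::string> columns;
    std::string filter;
    std::optional<int64_t> snapshot_id;
    bool case_sensitive;
  };
  ToyScan Build() const { return {columns_, filter_, snapshot_id_, case_sensitive_}; }

 private:
  std::vector<std::string> columns_;    // empty = all columns
  std::string filter_;                  // empty = no filter
  std::optional<int64_t> snapshot_id_;  // unset = current snapshot
  bool case_sensitive_ = true;
};
```

A "time travel" configuration from the table would then read `ToyScanBuilder{}.UseSnapshot(42).Build()`.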
The DataTableScan class implements PlanFiles() to produce FileScanTask objects. It delegates to ManifestGroup for the actual work.
PlanFiles() Execution Flow
Sources: src/iceberg/table_scan.cc475-502
Schema Projection
The scan computes a projected schema that includes:
- Columns explicitly selected via Select()
- Columns referenced by the filter expression

The projection logic is in TableScan::ResolveProjectedSchema() at src/iceberg/table_scan.cc413-453.
Example
```
Schema: { id: int, data: string, value: long }
Select: ["data"]
Filter: Equal("id", 42)

→ Projected schema: { id: int, data: string }
  (includes both selected and filter-referenced columns)
```
For details on projection computation and the GetProjectedIdsVisitor, see page 7.4 (Schema Projection and Type Utilities).
Sources: src/iceberg/table_scan.cc413-502 src/iceberg/test/table_scan_test.cc669-770
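The union of selected and filter-referenced columns can be sketched as follows. The helper is illustrative, not the actual ResolveProjectedSchema() code, and column names stand in for field IDs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative projection: keep columns that are either selected or
// referenced by the filter, preserving the table schema's field order.
std::vector<std::string> ProjectColumns(
    const std::vector<std::string>& schema_order,
    const std::vector<std::string>& selected,
    const std::vector<std::string>& filter_refs) {
  auto wanted = [&](const std::string& name) {
    return std::find(selected.begin(), selected.end(), name) != selected.end() ||
           std::find(filter_refs.begin(), filter_refs.end(), name) != filter_refs.end();
  };
  std::vector<std::string> out;
  for (const auto& name : schema_order) {
    if (wanted(name)) out.push_back(name);
  }
  return out;
}
```

Applied to the example above, selecting `data` with a filter on `id` yields `{id, data}`.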
A FileScanTask represents an executable unit of work: reading one data file with associated delete files and a residual filter.
FileScanTask Structure
Sources: src/iceberg/table_scan.h60-101 src/iceberg/table_scan.cc178-211
Key Concepts
| Component | Description |
|---|---|
| data_file_ | The primary data file to read (Parquet or Avro) |
| delete_files_ | Position and equality delete files that apply to this data file |
| residual_filter_ | Filter expression that must be evaluated during file reading |
| ToArrow() | Opens the file and returns an ArrowArrayStream for reading batches |
The task encapsulates everything needed to read a file.
For detailed documentation of FileScanTask, ArrowArrayStream integration, and reading patterns, see page 7.2 (File Scan Tasks).
Sources: src/iceberg/table_scan.h60-101 src/iceberg/table_scan.cc178-211 src/iceberg/test/file_scan_task_test.cc43-186
The ManifestGroup class coordinates reading manifests, applying filters, and producing FileScanTask objects. It implements multi-level filtering to minimize I/O.
ManifestGroup Role in Planning
Sources: src/iceberg/table_scan.cc475-502
Multi-Level Filtering
The filtering happens at progressively finer granularity:
| Level | Component | Operates On | Optimization |
|---|---|---|---|
| 1 | ManifestEvaluator | Entire manifest files | Skip reading manifests that can't match partition filters |
| 2 | ManifestReader | Manifest entries | Filter entries based on partition values and file statistics |
| 3 | DeleteFileIndex | Data files | Match applicable delete files to each data file |
| 4 | ResidualEvaluator | Row data | Compute filter remaining after partition/stats pruning |
For detailed documentation of the expression framework, predicate binding, and multi-level filtering, see page 7.3 (Predicate Pushdown and Filtering).
Sources: src/iceberg/table_scan.cc475-502
Column Selection for Manifests
The scan requests different columns from manifest files based on configuration:
Sources: src/iceberg/table_scan.cc43-58 src/iceberg/table_scan.cc455-502
Once planning produces FileScanTask objects, reading data involves opening files and streaming Arrow batches.
Data Read Flow
Sources: src/iceberg/table_scan.cc195-211 src/iceberg/table_scan.cc60-156
ArrowArrayStream C ABI
The ArrowArrayStream struct provides a C ABI for streaming Arrow data:
| Callback | Purpose | Implementation |
|---|---|---|
| get_schema | Returns the Arrow schema | Calls reader->Schema() |
| get_next | Returns next batch or null if done | Calls reader->Next() |
| get_last_error | Returns error message string | Returns stored error from private data |
| release | Cleans up resources | Deletes ReaderStreamPrivateData |
The callbacks operate on ReaderStreamPrivateData at src/iceberg/table_scan.cc61-73, which stores the Reader and the last error message.
Reading Example
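A minimal, self-contained consumption sketch is shown below. The ArrowArrayStream, ArrowArray, and ArrowSchema layouts follow the Arrow C data/stream interface specification; the toy producer is only a stand-in for the stream a FileScanTask::ToArrow() call would return.

```cpp
#include <cassert>
#include <cstdint>

// Struct layouts from the Arrow C data/stream interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  ArrowSchema** children;
  ArrowSchema* dictionary;
  void (*release)(ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(ArrowArrayStream*, ArrowSchema* out);
  int (*get_next)(ArrowArrayStream*, ArrowArray* out);
  const char* (*get_last_error)(ArrowArrayStream*);
  void (*release)(ArrowArrayStream*);
  void* private_data;
};

// Toy producer: emits a fixed number of empty batches, then end-of-stream.
// A real stream fills the batch buffers with column data.
namespace toy {
int get_next(ArrowArrayStream* stream, ArrowArray* out) {
  int* remaining = static_cast<int*>(stream->private_data);
  if (*remaining == 0) {
    out->release = nullptr;  // a released array marks end of stream
    return 0;
  }
  --*remaining;
  *out = ArrowArray{};
  out->release = [](ArrowArray* a) { a->release = nullptr; };
  return 0;
}
}  // namespace toy

// Standard consumption loop: call get_next until out.release == nullptr.
int CountBatches(ArrowArrayStream* stream) {
  int batches = 0;
  ArrowArray batch;
  while (true) {
    if (stream->get_next(stream, &batch) != 0) return -1;  // error
    if (batch.release == nullptr) break;                   // end of stream
    ++batches;
    batch.release(&batch);  // the consumer must release each batch
  }
  return batches;
}
```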
For details on FileScanTask structure and the ArrowArrayStream integration, see page 7.2 (File Scan Tasks). For reader implementations, see page 8 (File Format Support).
Sources: src/iceberg/table_scan.cc60-211 src/iceberg/test/file_scan_task_test.cc129-146
A Snapshot represents a point-in-time view of table data. Each snapshot references a manifest list file containing metadata about all data and delete files.
Sources: src/iceberg/snapshot.h383-454
The SnapshotCache class provides lazy-loading of manifests from the manifest list:
The cache partitions manifests into data and delete manifests during initialization, storing them in a single vector with data manifests at the head and delete manifests at the tail.
Key Methods:
- Manifests(): Returns all manifest files (data + delete)
- DataManifests(): Returns only data manifests
- DeleteManifests(): Returns only delete manifests

Sources: src/iceberg/snapshot.h459-504 src/iceberg/snapshot.cc223-277
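The single-vector layout described above (data manifests at the head, delete manifests at the tail) can be sketched with a partition at construction time. This is an illustrative stand-in, not the SnapshotCache implementation.

```cpp
#include <algorithm>
#include <vector>

// Illustrative manifest record; the real ManifestFile carries paths,
// partition summaries, and so on. Here only the content kind matters.
enum class ManifestContent { kData, kDeletes };

struct ToyManifest {
  ManifestContent content;
};

// Mimics the single-vector layout: data manifests at the head, delete
// manifests at the tail, with the split point remembered once.
struct ToyManifestCache {
  std::vector<ToyManifest> all;
  size_t num_data = 0;

  explicit ToyManifestCache(std::vector<ToyManifest> manifests)
      : all(std::move(manifests)) {
    auto mid = std::stable_partition(
        all.begin(), all.end(),
        [](const ToyManifest& m) { return m.content == ManifestContent::kData; });
    num_data = static_cast<size_t>(mid - all.begin());
  }

  size_t DataManifestCount() const { return num_data; }
  size_t DeleteManifestCount() const { return all.size() - num_data; }
};
```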
ManifestGroup is the central coordinator for reading manifests and producing scan tasks. It applies filters at multiple levels and matches delete files to data files.
Sources: src/iceberg/manifest/manifest_group.h56-110 src/iceberg/manifest/manifest_group.cc40-85
The ManifestGroup applies filters at three progressively finer levels:
| Level | Component | Filter Type | Applies To | Pruning Effect |
|---|---|---|---|---|
| 1 | ManifestEvaluator | Partition filter | Entire manifest files | Skip reading manifests that don't match partition constraints |
| 2 | ManifestReader | Row filter + file filter | Individual manifest entries | Skip entries based on partition values and file-level statistics |
| 3 | ResidualEvaluator | Remaining row filters | Data within files | Compute filters that must be applied during file reads |
Sources: src/iceberg/manifest/manifest_group.cc160-233 src/iceberg/expression/manifest_evaluator.h
Before reading a manifest file, ManifestEvaluator checks if the manifest's partition summaries can satisfy the partition filter:
Example from tests:
```cpp
// A manifest with partition summaries for data_bucket in [0, 5]
// can be skipped if the filter requires data_bucket = 10.
ManifestEvaluator evaluator(spec, partition_filter);
if (!evaluator.Eval(manifest)) {
  // Skip reading this manifest entirely
}
```
Sources: src/iceberg/manifest/manifest_group.cc162-172 src/iceberg/test/manifest_group_test.cc432-475
ManifestReader applies row-level and file-level filters to manifest entries:
Row Filter: Evaluated against partition values and file-level statistics (bounds, null counts, NaN counts)
File Filter: Evaluated against file metadata fields (record_count, file_size_in_bytes, etc.)
Custom Predicate: User-defined function for arbitrary filtering logic
Sources: src/iceberg/manifest/manifest_reader.h src/iceberg/test/manifest_group_test.cc369-428
The DeleteFileIndex efficiently matches delete files to data files based on partition values and sequence numbers. It supports three types of deletes:
Key Design Points:
Global vs Partitioned: Equality deletes in unpartitioned tables are "global" and apply to all data files. Position deletes are never global (except in unpartitioned tables).
Sequence Number Ordering: Position deletes are sorted by sequence_number to enable binary search when filtering by data file sequence.
Partition Key Hashing: Uses PartitionKey (derived from PartitionValues) as map key for O(1) partition lookup.
Sources: src/iceberg/delete_file_index.h103-316 src/iceberg/delete_file_index.cc
Sources: src/iceberg/delete_file_index.cc src/iceberg/test/delete_file_index_test.cc190-200
The ForDataFile() method returns all delete files applicable to a given data file:
Sequence Number Filtering Logic:
A delete file applies to a data file if:
- delete.sequence_number > data.sequence_number (delete was committed after data)
- Position deletes that name a specific referenced_data_file are matched by file path and are always applicable

Sources: src/iceberg/delete_file_index.cc src/iceberg/test/delete_file_index_test.cc263-359
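The sequence-number rule, combined with the sorted ordering noted in Key Design Points above, can be sketched with a binary search over delete sequence numbers. This is an illustrative helper, not the DeleteFileIndex API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Deletes kept sorted by sequence number, as described above, so the
// applicable suffix can be found by binary search.
std::vector<int64_t> ApplicableDeleteSeqs(
    const std::vector<int64_t>& sorted_delete_seqs,  // ascending
    int64_t data_seq) {
  // First delete with sequence_number > data_seq; it and everything after
  // it were committed after the data file and therefore apply.
  auto first = std::upper_bound(sorted_delete_seqs.begin(),
                                sorted_delete_seqs.end(), data_seq);
  return {first, sorted_delete_seqs.end()};
}
```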
A FileScanTask encapsulates everything needed to read a data file:
Sources: src/iceberg/table_scan.h src/iceberg/manifest/manifest_group.cc187-232
The ResidualEvaluator determines what portion of a filter must be applied during file reading:
Example:
```
Original filter:  id = 5 AND data = 'test'
Partition:        data_bucket = hash(data) % 16

After partition pruning: id = 5 AND data = 'test'
  (data = 'test' cannot be fully evaluated from the partition, so it remains)

Residual filter: id = 5 AND data = 'test'
```
If a filter can be fully satisfied by partition metadata or file statistics, the residual may simplify to True, meaning no filtering is needed during file reading.
Sources: src/iceberg/expression/residual_evaluator.h src/iceberg/test/manifest_group_test.cc476-537
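The idea that a conjunct fully satisfied by partition metadata drops out of the residual can be sketched as a toy computation over string conjuncts. This is illustrative only, not the ResidualEvaluator implementation.

```cpp
#include <string>
#include <vector>

// Toy residual: drop every conjunct that partition metadata already proves
// true; whatever is left must run during the file read.
std::vector<std::string> Residual(
    const std::vector<std::string>& conjuncts,
    const std::vector<std::string>& proven_by_partition) {
  std::vector<std::string> residual;
  for (const auto& c : conjuncts) {
    bool proven = false;
    for (const auto& p : proven_by_partition) {
      if (c == p) { proven = true; break; }
    }
    if (!proven) residual.push_back(c);  // still needs row-level evaluation
  }
  return residual;  // empty result means "True": no filtering during read
}
```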
ManifestGroup supports flags to skip certain entry types:
| Method | Effect | Use Case |
|---|---|---|
| IgnoreDeleted() | Skip ManifestStatus::kDeleted entries | Standard reads (ignore removal markers) |
| IgnoreExisting() | Skip ManifestStatus::kExisting entries | Incremental processing (only new files) |
| IgnoreResiduals() | Don't compute residual filters | When downstream will re-apply full filter |
Sources: src/iceberg/manifest/manifest_group.h101-108 src/iceberg/test/manifest_group_test.cc293-367
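The status flags in the table above amount to a per-entry skip check, which can be sketched like this. The ManifestStatus values mirror the names used above; the flag struct and filter function are illustrative, not the ManifestGroup API.

```cpp
// Status values as named in the table above.
enum class ManifestStatus { kAdded, kExisting, kDeleted };

struct ScanFlags {
  bool ignore_deleted = false;   // set by IgnoreDeleted()
  bool ignore_existing = false;  // set by IgnoreExisting()
};

// Returns true if a manifest entry should be skipped under the flags.
bool SkipEntry(ManifestStatus status, const ScanFlags& flags) {
  if (flags.ignore_deleted && status == ManifestStatus::kDeleted) return true;
  if (flags.ignore_existing && status == ManifestStatus::kExisting) return true;
  return false;
}
```

An incremental scan would set both flags, so only kAdded entries survive.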
ManifestReader can drop statistics (bounds, value counts, etc.) when they're not needed:
Logic:
- If the requested columns include the Schema::kAllColumns wildcard ("*"), keep all stats

Sources: src/iceberg/manifest/manifest_reader.h src/iceberg/test/manifest_reader_test.cc356-419
Delete files must have compatible partition specs with data files:
Compatible if:
1. Same partition spec ID, OR
2. Delete file is unpartitioned (spec_id = 0) AND content is equality deletes
Global equality deletes (unpartitioned) apply across all partitions. Position deletes are never global.
Sources: src/iceberg/delete_file_index.cc src/iceberg/test/delete_file_index_test.cc570-599
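The compatibility rule can be written down directly as a predicate. This is a sketch of the rule as stated above, not the DeleteFileIndex implementation.

```cpp
#include <cstdint>

enum class DeleteContent { kPositionDeletes, kEqualityDeletes };

// Compatible if the spec IDs match, or the delete file is an unpartitioned
// (spec_id = 0) equality delete, which applies globally.
bool SpecsCompatible(int32_t data_spec_id, int32_t delete_spec_id,
                     DeleteContent content) {
  if (data_spec_id == delete_spec_id) return true;
  return delete_spec_id == 0 && content == DeleteContent::kEqualityDeletes;
}
```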
Sources: src/iceberg/test/table_scan_test.cc50-245
This pattern is useful for change data capture or incremental ETL where you only want to process files added since the last snapshot.
Sources: src/iceberg/test/manifest_group_test.cc331-367
The AfterSequenceNumber() method filters out delete files committed before a certain sequence, useful for incremental scans that start from a known sequence number.
Sources: src/iceberg/test/delete_file_index_test.cc232-261
| Component | Header File | Implementation | Purpose |
|---|---|---|---|
| Snapshot | src/iceberg/snapshot.h383-506 | src/iceberg/snapshot.cc156-277 | Snapshot metadata and lazy manifest loading |
| ManifestGroup | src/iceberg/manifest/manifest_group.h52-140 | src/iceberg/manifest/manifest_group.cc40-316 | Manifest processing coordination |
| DeleteFileIndex | src/iceberg/delete_file_index.h103-316 | src/iceberg/delete_file_index.cc | Delete file indexing and matching |
| ManifestReader | src/iceberg/manifest/manifest_reader.h | src/iceberg/manifest/manifest_reader.cc | Reading and filtering manifest entries |
| FileScanTask | src/iceberg/table_scan.h | src/iceberg/table_scan.cc | Scan task abstraction |
Sources: Listed in table