@avamingli commented Mar 12, 2025

Previously, write operations on partitioned tables would alter the materialized view (MV) status of both ancestor and descendant tables, leading to unnecessary invalidations of MVs that depended on unaffected partitions.

This commit introduces a more efficient approach to updating MV statuses for partitioned tables by focusing on leaf partitions that have actually undergone data changes.

Example:

```sql
create table par(a int, b int, c int) partition by range(b)
    subpartition by range(c) subpartition template (start (1) end (3) every (1))
    (start(1) end(3) every(1));
insert into par values(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2);
create materialized view mv_par as select * from par;
create materialized view mv_par1 as select * from par_1_prt_1;
create materialized view mv_par1_1 as select * from par_1_prt_1_2_prt_1;
create materialized view mv_par1_2 as select * from par_1_prt_1_2_prt_2;
create materialized view mv_par2 as select * from par_1_prt_2;
create materialized view mv_par2_2 as select * from par_1_prt_2_2_prt_2;
```

INSERT INTO Partitioned Table

```sql
INSERT INTO P1 VALUES (1, 1, 1);
```

before this commit:


Data was successfully inserted into P1, which is a descendant of P0. Because data was added, the status must be propagated upward to invalidate the materialized view MV_0, which is based on P0.

At the same time, P1 is itself a partitioned table, so the status must also be propagated downward to its child tables. Since it is unclear which child tables actually changed, the materialized views of all child tables must be invalidated.

Consequently, the materialized views MV_0, MV_1, MV_1_1, and MV_1_2 are all invalidated.

after this commit:


When data is inserted into a partitioned table, it is ultimately routed to a child table at a leaf of the partition tree. The system tracks which leaf tables actually changed and propagates the status upward from them, ignoring the partitioned table P0 that the SQL statement targeted directly.

The tuple (1, 1, 1) is routed to partition P1_1, invalidating the materialized views on P1_1 and on its ancestor partitioned tables. The sibling sub-partition P1_2 sees no data change, so the materialized view MV_1_2 remains valid.

As a result, the materialized views MV_0, MV_1, and MV_1_1 are invalidated, while MV_1_2 remains unaffected.

The new approach propagates status only from the leaf nodes upward, rather than in both directions through the intermediate nodes of the partition tree. This protects unrelated materialized views from unnecessary refreshes.
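This upward-only propagation can be sketched as a small model (illustrative Python, not the server's C implementation; the node names mirror the diagrams above):

```python
# Illustrative model of leaf-upward MV invalidation (not the actual
# server code): each partition knows its parent, and only the leaves
# that actually received rows trigger invalidation.
PARENT = {
    "P1": "P0", "P2": "P0",
    "P1_1": "P1", "P1_2": "P1",
    "P2_1": "P2", "P2_2": "P2",
}

def invalidation_set(modified_leaves):
    """Return every partition whose dependent MVs must be invalidated:
    the modified leaves plus all of their ancestors."""
    invalid = set()
    for leaf in modified_leaves:
        node = leaf
        while node is not None:
            invalid.add(node)
            node = PARENT.get(node)
    return invalid

# INSERT routed to leaf P1_1: only that chain is invalidated;
# P1_2 (and hence MV_1_2) is untouched.
print(sorted(invalidation_set({"P1_1"})))  # ['P0', 'P1', 'P1_1']
```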

(Split) Cross-Partition UPDATE on the root table

```sql
UPDATE P0 SET c = 2 WHERE b = 2 AND c = 1;
```

before this commit:


When data in the root node is updated, it is unclear which specific child tables changed, so the status is propagated downward to all descendant child tables.

As a result, all materialized views based on the partition tree are invalidated: MV_0, MV_1, MV_1_1, MV_1_2, and MV_2_2.

after this commit:


Since c is a partition key, updating it causes rows to move across partitions of the child tables. The UPDATE on the parent table is translated into an INSERT on one child table and a DELETE on another.

The system tracks the changed leaf child tables, P2_1 and P2_2, and propagates only their statuses upward, invalidating the materialized views based on their ancestor tables. The materialized views in the left subtree under P1 remain unaffected.

As a result, MV_0 and MV_2_2 are invalidated, while MV_1, MV_1_1, and MV_1_2 remain valid.
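The same model covers the split UPDATE (again an illustrative Python sketch, not the server code): both reported leaves contribute their ancestor chains, and the union of those chains is the invalidation set.

```python
# Illustrative model: a cross-partition UPDATE is decomposed into a
# DELETE on one leaf (P2_1) and an INSERT on another (P2_2); both
# leaves report upward, and their ancestor chains are unioned.
PARENT = {
    "P1": "P0", "P2": "P0",
    "P1_1": "P1", "P1_2": "P1",
    "P2_1": "P2", "P2_2": "P2",
}

def ancestors_and_self(leaf):
    """Collect a leaf plus every ancestor up to the root."""
    chain = set()
    node = leaf
    while node is not None:
        chain.add(node)
        node = PARENT.get(node)
    return chain

# The left subtree under P1 never appears, so MV_1, MV_1_1, and
# MV_1_2 stay valid.
invalid = ancestors_and_self("P2_1") | ancestors_and_self("P2_2")
print(sorted(invalid))  # ['P0', 'P2', 'P2_1', 'P2_2']
```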

Extending the libpq protocol

The executors (QEs) record the partition child tables modified during query execution. A mechanism is needed to send these results to the QD so that it can update the metadata; the libpq protocol has been extended to support this transfer.

Processing on the QE and QD sides:


Each QE records the IDs of the child tables it has modified, deduplicates them in a local bitmap, and sends this information to the QD.

The data changes on each segment node may not be the same. For example, in the diagram, seg0 has changes in two tables, while seg1 has no changes (as the data to be updated is not located on the seg1 node due to distribution).

The QD collects multiple bitmaps returned from the segment nodes and consolidates them into a single bitmap, removing duplicate relids. This deduplicated bitmap is then used to update the status of the ancestor nodes.
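The recording and consolidation steps can be sketched like this (an illustrative Python model using integers as bitmaps; the server uses its C Bitmapset type, not Python ints):

```python
# Illustrative model of QE-side dedup and QD-side consolidation.

def qe_record(relids):
    """Each QE ORs the relids it modified into a local bitmap,
    which deduplicates repeated modifications of the same table."""
    bitmap = 0
    for relid in relids:
        bitmap |= 1 << relid
    return bitmap

def qd_consolidate(bitmaps):
    """The QD ORs the bitmaps returned from all segments into one,
    removing relids reported by more than one segment."""
    merged = 0
    for bm in bitmaps:
        merged |= bm
    return merged

def relids_of(bitmap):
    """Expand a bitmap back into a sorted list of relids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# seg0 modified relids 17 and 23 (17 twice); seg1 modified nothing,
# because none of the affected rows were distributed to it.
seg0 = qe_record([17, 23, 17])
seg1 = qe_record([])
print(relids_of(qd_consolidate([seg0, seg1])))  # [17, 23]
```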

Extended Protocol Design

Based on libpq, the protocol has been extended with an 'e' message type ("extend protocol") to unify the handling of information returned from QEs.

Each transmission begins with the 'e' protocol, followed immediately by data blocks. A single transmission can carry multiple data blocks.

Each data block corresponds to a specific type of data and is distinguished by a subtag. Following the subtag is a 32-bit integer, num, which indicates the total number of relids in the subsequent array.

The relid array follows the num value. After all data blocks have been written, the transmission concludes with a subtag_max.

The subtag_max, which belongs to the same enum category as the subtag, is specifically used to indicate the termination of an extended protocol transmission.
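The framing can be sketched with a toy encoder/decoder (illustrative Python; the concrete tag values, the one-byte subtag width, and network byte order are assumptions here, not the server's actual wire format — only the 32-bit num is stated above):

```python
import struct

# Assumed tag values for illustration only; the real enum lives in
# the server sources.
SUBTAG_MODIFIED_RELIDS = 1
SUBTAG_MAX = 255  # terminates an extended-protocol transmission

def encode(blocks):
    """blocks: list of (subtag, [relid, ...]). A transmission starts
    with 'e', carries the data blocks, and ends with subtag_max."""
    out = b"e"
    for subtag, relids in blocks:
        out += struct.pack("!BI", subtag, len(relids))      # subtag, num
        out += struct.pack("!%dI" % len(relids), *relids)   # relid array
    out += struct.pack("!B", SUBTAG_MAX)
    return out

def decode(buf):
    """Walk the data blocks until subtag_max is seen."""
    assert buf[:1] == b"e"
    pos, blocks = 1, []
    while True:
        subtag = buf[pos]
        pos += 1
        if subtag == SUBTAG_MAX:
            return blocks
        (num,) = struct.unpack_from("!I", buf, pos)
        pos += 4
        relids = list(struct.unpack_from("!%dI" % num, buf, pos))
        pos += 4 * num
        blocks.append((subtag, relids))

msg = encode([(SUBTAG_MODIFIED_RELIDS, [17, 23])])
print(decode(msg))  # [(1, [17, 23])]
```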

Key changes:

  • Leaf Partition Tracking: Instead of updating the MV status based on the parent table, we now detect and track changes at the leaf partition level. This ensures that only partitions with actual data modifications trigger MV status updates, significantly reducing unnecessary refreshes.

  • Cross-Partition Updates: For operations like UPDATEs that span multiple partitions (e.g., decomposed into an INSERT on one leaf and a DELETE on another), the MV status is updated based on the real data state of the affected leaf partitions, rather than propagating changes through the parent table. This avoids invalidating MVs that depend on unrelated partitions.

  • Executor Enhancements: The executor now detects dynamic partition expansion during query execution and records data modification bitmaps for each partition on the QE. These bitmaps are aggregated on the QD to update MV statuses accurately.

  • Protocol Extension: The libpq protocol has been extended to handle QE feedback uniformly, ensuring that partition-level modification information is properly collected and consumed by the QD.

This optimization minimizes the impact on MVs by ensuring that only relevant partitions trigger status updates, improving performance and reducing unnecessary invalidations.

Authored-by: Zhang Mingli [email protected]


@avamingli force-pushed the mv_par_prun_cbdb branch 2 times, most recently from 2e00c25 to e5c13e7 on March 13, 2025.
@avamingli:

Added commit: e5c13e7

SINGLENODE and EntryDB materialized view status maintenance optimization.

In SINGLENODE mode, or when operating as an entry DB, we cannot use the extended libpq protocol to send messages because the process is already acting in the QD role. Processing modified relations at EndModifyTable() is too late, since materialized views are updated when the executor ends.

This change moves the processing of modified relations to
ExecModifyTable() to ensure timely handling of modifications. This
avoids issues with materialized view updates and ensures correct
behavior in SINGLENODE or entry DB scenarios.

avamingli commented Apr 8, 2025

TPC-B

before this commit:

| round | avg | round 1 | round 2 | round 3 |
| --- | --- | --- | --- | --- |
| transactions | 367850 | 368535 | 368258 | 366757 |
| latency (ms) | 16.30733333 | 16.277 | 16.289 | 16.356 |
| TPS | 1226.43066 | 1228.694187 | 1227.83674 | 1222.761054 |

after this commit:

| round | avg | round 1 | round 2 | round 3 |
| --- | --- | --- | --- | --- |
| transactions | 370138 | 368501 | 370823 | 371090 |
| latency (ms) | 16.20633333 | 16.279 | 16.176 | 16.164 |
| TPS | 1234.093081 | 1228.564701 | 1236.364521 | 1237.350022 |

Summary

| summary | with PR 990 |
| --- | --- |
| transactions | +0.62% |
| latency | -0.62% |
| TPS | +0.62% |

@yjhjstz (Member) left a comment:

LGTM

@avamingli merged commit 49d49b8 into apache:main on Apr 10, 2025 (22 checks passed).
@avamingli deleted the mv_par_prun_cbdb branch on April 10, 2025.