Skip to content

feat: option to disable record reordering#661

Merged
ismailsimsek merged 7 commits intomemiiso:masterfrom
kinolaev:feat-disable-reordering
Feb 10, 2026
Merged

feat: option to disable record reordering#661
ismailsimsek merged 7 commits intomemiiso:masterfrom
kinolaev:feat-disable-reordering

Conversation

@kinolaev
Copy link
Contributor

@kinolaev kinolaev commented Dec 3, 2025

I'd like to have an option to prioritize the latest event during deduplication, because it seems like debezium events are already ordered. For example, for a sequence of operations

insert into t1(id, text) values(1, 'hello');
delete from t1 where id = 1;
insert into t1(id, text) values(1, 'hallo');

I'd like to get a row 1,'hallo' in an iceberg table. But because deletes have higher priority than inserts, I get 1,'hello'.

I'm not sure that events are always ordered and that is why I propose to add an option instead of changing the default behavior.

@kinolaev kinolaev force-pushed the feat-disable-reordering branch from d38fe8e to f22a7e4 Compare December 4, 2025 05:13
@kinolaev
Copy link
Contributor Author

kinolaev commented Dec 4, 2025

Added a warning when events are received out of order

@ismailsimsek
Copy link
Member

@kinolaev thank you for looking into this. this is very important feature to improve.

I'm not sure that events are always ordered and that is why I propose to add an option instead of changing the default behavior.

this we can ask in debezium zulip to the dev team. AFAIK they are ordered.

in term of fix. Could we use other metadata fields to ensure 100% event ordering somehow?

for example an event has additional information to sort according to their binlog, even when they happen in same nanosecond. the issue is this additional fields are different database to database.

@ismailsimsek
Copy link
Member

ismailsimsek commented Dec 6, 2025

@kinolaev thank you for looking into this. this is very important feature to improve.

I'm not sure that events are always ordered and that is why I propose to add an option instead of changing the default behavior.

this we can ask in debezium zulip to the dev team. AFAIK they are ordered.

in term of fix. Could we use other metadata fields to ensure 100% event ordering somehow?

for example an event has additional information to sort according to their binlog, even when they happen in same nanosecond. the issue is this additional fields are different database to database.

i think 2 things could be done to improve it.

  1. remove the logic and de-duplicate according to the order of the events received (retain latest one)
  2. enable end user to provide multiple fields to do deuplication with. forexample for mysql it could be: (__source_ts_ns, __source_file,__source_pos, __source_row)

Optionally make the de-duplication to able to use strategy pattern, or something similar. so then we can implement multiple version if necessary, like above option1- option-2. Edit: this will ad more complexity, better to avoid it.

Edit: we don't need option 2, since Debezium sends events ordered, confirmed by Debezium dev team.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 6, 2026
@ismailsimsek
Copy link
Member

Confirmed that [debezium delivers the events in order.](#dev > Consumer Record deduplication (Iceberg Consumer))

simply removing this logic and keeping the last row should do it.

@github-actions github-actions bot removed the stale label Jan 7, 2026
@kinolaev kinolaev force-pushed the feat-disable-reordering branch 2 times, most recently from f22a7e4 to 029f680 Compare February 3, 2026 16:08
@kinolaev
Copy link
Contributor Author

kinolaev commented Feb 3, 2026

Thank you @ismailsimsek for the info! Finally I've got some time to update the PR. I've removed reordering during deduplication and flagged rows that were inserted within the batch, so we won't add their primary keys to the equality delete files.

@kinolaev
Copy link
Contributor Author

kinolaev commented Feb 6, 2026

Fixed delta writer behavior - we have to delete a record when upsert-keep-deletes is true even if the first event for the record in a batch is an insert because a deleted version of the record could exists in a table.

@kinolaev kinolaev requested a review from ismailsimsek February 6, 2026 15:24
@kinolaev kinolaev force-pushed the feat-disable-reordering branch from 37ad251 to 6470820 Compare February 8, 2026 00:42
Copy link
Member

@ismailsimsek ismailsimsek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kinolaev thank you for the changes, everything looks good. commented few small points.

@@ -96,19 +96,6 @@ public void testSimpleUpsert() throws Exception {
Assertions.assertEquals(ds.where("id = 5 AND __op= 'r' ").count(), 1);
Assertions.assertEquals(ds.where("id = 6 AND __op= 'u' AND first_name= 'Updatedname-6-V1'").count(), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any additional or new edge cases to add ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I've noticed is that the cases with nonempty dedup-column are not covered now. Is that important?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ok. once we format and the ci passes we can merge

spotless:apply should do the formatting.

Copy link
Contributor Author

@kinolaev kinolaev Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint, done in c1ee98a.
upd: seems like spotless:apply is not enough( I'll figure it out a bit later.
upd: fixed. The reason was <ratchetFrom>refs/remotes/origin/master</ratchetFrom>, because I've merged several commits to my origin/master branch before.

@kinolaev kinolaev force-pushed the feat-disable-reordering branch from c1ee98a to 541fd08 Compare February 10, 2026 11:12
@ismailsimsek ismailsimsek merged commit 7f96a94 into memiiso:master Feb 10, 2026
6 checks passed
@ismailsimsek
Copy link
Member

Thank you @kinolaev merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants