Conversation
Greptile Summary: This PR adds an `is_leaf` column to the `StagedResource` model. Confidence Score: 5/5
Important Files Changed:
- `...alembic/migrations/versions/xx_2026_02_08_2339_f85bd4c08401_add_is_leaf_to_stagedresource.py`
adamsachs left a comment:
nice work! this generally looks good to me, and thank you for establishing a general framework for these 'backfill' scripts :)
i don't see anything problematic with the particular backfill script here, and the framework generally seems functionally solid: i like the API hooks for visibility/manual intervention, and the batching + retry logic, etc.
my most substantive question is around how we imagine future backfill scripts slotting in -- what's the lifecycle of the scripts, do they get deleted from the codebase?
other than that, i called out a few minor points. let me know what you think!
Resolved review thread on `...alembic/migrations/versions/xx_2026_02_08_2339_f85bd4c08401_add_is_leaf_to_stagedresource.py`.
Review comment on:

```python
    raise


def backfill_is_leaf(
```
i like what you've done to abstract away the particular backfill task from a generic backfill framework. i think one thing that could help formalize that framework a bit would be to pull out the particular backfill task (and its sub-functions) into its own module/file, separate from the generic 'backfill' framework module you've got here. what do you think?
I like that idea! I’ll move it to another file.
I split the original file into 3 files (added a utils module), and moved shared logic into a decorator for easier reuse across backfills. I also added a README explaining how to use it -> a740f86
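The shared-logic decorator described above might look something like this minimal sketch. The names `batched_backfill` and `BackfillResult` follow the discussion in this thread, but the body is illustrative, not the actual Fides implementation (which also handles DB sessions and retries):

```python
# Sketch of a generic batched-backfill decorator -- illustrative only.
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BackfillResult:
    name: str
    total_updated: int
    batches: int


def batched_backfill(name: str, batch_size: int = 1000, batch_delay_seconds: float = 0.0):
    """Wrap a per-batch function (limit -> rows updated) into a full backfill loop."""
    def decorator(process_batch: Callable[[int], int]):
        def run() -> BackfillResult:
            total = batches = 0
            while True:
                updated = process_batch(batch_size)
                if updated == 0:
                    break  # nothing left to backfill
                total += updated
                batches += 1
                if batch_delay_seconds:
                    time.sleep(batch_delay_seconds)  # throttle DB load between batches
            return BackfillResult(name=name, total_updated=total, batches=batches)
        return run
    return decorator


# Usage against an in-memory stand-in for rows missing the value:
pending = list(range(25))


@batched_backfill(name="is_leaf", batch_size=10)
def backfill_is_leaf(limit: int) -> int:
    batch = pending[:limit]
    del pending[:limit]
    return len(batch)
```

The decorated function only knows how to process one batch; the loop, throttling, and result bookkeeping live in the framework, which is what makes new backfills cheap to add.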
Review comment on:

```python
# Backfill is_leaf column (added in migration d05acec55c64)
results.append(backfill_is_leaf(db, batch_size, batch_delay_seconds))

# Add future backfills here:
# results.append(backfill_some_other_column(db, batch_size, batch_delay_seconds))
```
OK nice, this seems pretty easy to slot in more backfill scripts as needed!
generally, do you imagine we'll be removing the backfill_is_leaf after a few releases, when we're sure that all of our clients have completed it? or do we just leave it in there for perpetuity, and rely on it basically being skipped over once it's been completed?
i do wonder whether it's worth establishing a very lightweight db table to keep track of the backfills, just so we can definitively say 'backfill x has been completed on this fides instance' -- basically to persist the BackfillResults. i get a bit worried about us relying on the logical check (get_pending_is_leaf_count) if e.g. we ever change some of the application logic and get_pending_is_leaf_count is no longer valid -- we have to remember to update this backfill script, which isn't really part of the 'normal' application code...
i don't know the right answer here, just posing the question to make sure we get this right and don't create too much of a maintenance burden going forward.
I was initially imagining that we’d remove these backfills as we’re able to, and in this specific case, create a follow-up ticket to remove the script once we’re confident all clients are covered. As you said, as long as the logic doesn’t change, having the script around doesn’t really hurt, but at some point it probably should be disabled or removed.
I do like the idea of tracking backfill completion in the database. It’s definitely safer and avoids relying on application logic that could change over time. That said, I also wonder about the long-term maintenance of an extra table like that (and the risk of it becoming something we forget about).
I’m going to try implementing it and see how much effort it actually is; I don’t expect it to take too long (Cursor helping 😄). I’ll report back with what I find and we can decide from there.
One additional consideration: if we ever roll back a column related to a backfill, we’d also need to remove the corresponding row from the new table.
Implemented the database table solution in this commit -> 1096ee8
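As a rough illustration of the tracking-table idea, here is a minimal sketch using stdlib `sqlite3`. Fides would define a SQLAlchemy model plus an Alembic migration instead, and the table and column names below are assumptions, not the real schema:

```python
# Minimal sketch of a backfill-history table -- names are illustrative.
import sqlite3


def init_history(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS backfill_history (
            name TEXT PRIMARY KEY,
            completed_at TEXT NOT NULL DEFAULT (datetime('now')),
            rows_updated INTEGER NOT NULL
        )
        """
    )


def mark_completed(conn: sqlite3.Connection, name: str, rows_updated: int) -> None:
    # Re-running a completed backfill just refreshes its record
    conn.execute(
        "INSERT OR REPLACE INTO backfill_history (name, rows_updated) VALUES (?, ?)",
        (name, rows_updated),
    )


def is_completed(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM backfill_history WHERE name = ?", (name,)
    ).fetchone()
    return row is not None


def unmark(conn: sqlite3.Connection, name: str) -> None:
    # For the rollback concern above: a column downgrade should also delete
    # the matching history row so a re-upgrade re-runs the backfill.
    conn.execute("DELETE FROM backfill_history WHERE name = ?", (name,))
```

With a persisted record, "has backfill x completed on this instance" becomes a direct lookup rather than a logical check that has to stay in sync with application code.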
Review comment on:

```python
BACKFILL_LOCK_TTL = 300  # 5 minutes TTL, refreshed every 10 batches


def acquire_backfill_lock() -> bool:
```
is there a reason we're not using our redis_lock constructs in lock.py? or at least the native redis.lock?
It initially felt more practical not to use redis_lock since I didn’t want to pass the lock around as a parameter or define it as a global. That said, using redis.lock does give us lock.owned(), which is an advantage.
I’ve updated it to use redis.lock in this commit -> 04772a9
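A sketch of what guarding the backfill with redis-py's native `Lock` could look like. `lock()`, `acquire()`, `owned()`, `extend()`, and `release()` are real redis-py APIs, but the key name, TTL constant, and wrapper function are illustrative; the client is injected so the pattern can be shown without a live server:

```python
# Illustrative wrapper around redis-py's Lock for a batched backfill.
BACKFILL_LOCK_NAME = "backfill:lock"  # illustrative key name
BACKFILL_LOCK_TTL = 300               # seconds; refreshed periodically


def run_with_backfill_lock(redis_client, batches, refresh_every=10):
    """Consume `batches` (an iterable of batch results) under a redis lock,
    extending the TTL every `refresh_every` batches."""
    lock = redis_client.lock(BACKFILL_LOCK_NAME, timeout=BACKFILL_LOCK_TTL)
    if not lock.acquire(blocking=False):
        return None  # another worker is already running the backfill
    try:
        results = []
        for i, batch_result in enumerate(batches, start=1):
            results.append(batch_result)
            if i % refresh_every == 0 and lock.owned():
                lock.extend(BACKFILL_LOCK_TTL)  # keep the lock alive mid-run
        return results
    finally:
        if lock.owned():
            lock.release()
```

`owned()` is the piece a hand-rolled SETNX lock lacks: it lets the worker confirm it still holds the lock before extending or releasing, so a worker whose lock expired cannot clobber another worker's lock.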
Resolved review thread on `...ides/api/alembic/migrations/versions/xx_2026_02_08_2338_aa8e1bd48402_add_backfill_history.py`.
adamsachs left a comment:
great work here! thanks for the care you took in addressing all of my feedback. the batched_backfill decorator is really clean, that was a great improvement to help formalize this as a framework we can build on! 👌
Review comment on:

```python
    raise


@batched_backfill(
```
very nice to have this as a decorator 👏
Ticket ENG-2293
Description Of Changes
Adds an `is_leaf` column to the `StagedResource` model to efficiently identify leaf fields (fields with no children and non-object data types) in detection/discovery results. This column enables faster queries by avoiding expensive runtime calculations of `resource_type = 'Field' AND children = []`.

The migration includes conditional logic to create the index directly for smaller tables (<1M rows) or defer it to post-upgrade index creation for larger tables to avoid blocking migrations.
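The conditional-index idea could be sketched like this. The threshold constant, helper names, and the `pg_class` row-count estimate query are assumptions; the actual migration in this PR may differ:

```python
# Illustrative sketch of conditional index creation in an Alembic upgrade.
LARGE_TABLE_THRESHOLD = 1_000_000  # per the description: <1M rows => inline index


def should_create_index_inline(row_count: int) -> bool:
    """Small tables get the index during the migration; big ones defer it."""
    return row_count < LARGE_TABLE_THRESHOLD


def upgrade_sketch(op, connection):
    """Alembic-style upgrade: add the column, then decide where the index goes."""
    import sqlalchemy as sa  # imported lazily so the pure logic above is testable alone

    op.add_column("stagedresource", sa.Column("is_leaf", sa.Boolean(), nullable=True))
    # Cheap Postgres row-count estimate (avoids a full COUNT(*) on a big table)
    row_count = connection.execute(
        sa.text("SELECT reltuples::bigint FROM pg_class WHERE relname = 'stagedresource'")
    ).scalar() or 0
    if should_create_index_inline(row_count):
        op.create_index("ix_stagedresource_is_leaf", "stagedresource", ["is_leaf"])
    # else: deferred to post-upgrade index creation so the migration doesn't block
```

Keeping the threshold check as a pure function makes the "inline vs deferred" decision trivially unit-testable, independent of any database.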
A new post-upgrade backfill system is introduced to populate the `is_leaf` column for existing data. The backfill is designed to be batched, retried on failure, and skipped once completed.

A new admin API endpoint (`POST /admin/backfill`) is also added for manual triggering of backfills with configurable batch size and delay parameters.
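A sketch of the behavior behind such an endpoint, written as a plain function rather than a full FastAPI route so the parameter validation and fan-out are visible; all names and defaults here are illustrative, not the actual Fides handler:

```python
# Illustrative handler logic for a manual backfill-trigger endpoint.
from typing import Callable, Dict


def trigger_backfills(
    runners: Dict[str, Callable[[int, float], int]],
    batch_size: int = 1000,
    batch_delay_seconds: float = 0.0,
) -> dict:
    """Run every registered backfill with the requested batching parameters."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if batch_delay_seconds < 0:
        raise ValueError("batch_delay_seconds must be non-negative")
    results = []
    for name, run in runners.items():
        rows = run(batch_size, batch_delay_seconds)  # each runner returns rows updated
        results.append({"backfill": name, "rows_updated": rows})
    return {"results": results}
```

In the real service this would sit behind an authenticated FastAPI `POST` route, with `batch_size` and `batch_delay_seconds` taken from the request body.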
Code Changes
Steps to Confirm
Pre-Merge Checklist
- `CHANGELOG.md` updated
- Up to date with `main`
- `downgrade()` migration is correct and works