ENG-2293 - Add is_leaf column to StagedResource and backfill by vcruces · Pull Request #7263 · ethyca/fides

vcruces · 2026-01-28T13:42:59Z

Description Of Changes

Adds an is_leaf column to the StagedResource model to efficiently identify leaf fields (fields with no children and non-object data types) in detection/discovery results. This column enables faster queries by avoiding expensive runtime calculations of resource_type = 'Field' AND children = [].
The migration includes conditional logic to create the index directly for smaller tables (<1M rows) or defer it to post-upgrade index creation for larger tables to avoid blocking migrations.
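A minimal sketch of the size-based decision described above. The function names are hypothetical, and how the migration actually obtains the row count (exact count or planner estimate) is not stated in this PR, so it is taken here as a plain argument.

```python
# Threshold taken from the PR description: tables under 1M rows are indexed inline.
ROW_THRESHOLD = 1_000_000


def create_index_inline(row_count: int, threshold: int = ROW_THRESHOLD) -> bool:
    """Return True when the table is small enough to index during the migration."""
    return row_count < threshold


def plan_index_creation(row_count: int) -> str:
    """Describe which path a table of the given size would take (illustrative only)."""
    if create_index_inline(row_count):
        return "create index in migration"
    return "defer to post-upgrade index creation"
```

A table with exactly 1,000,000 rows falls on the deferred side, matching the `<1M` / `>=1M` split in the description.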
A new post-upgrade backfill system is introduced to populate the is_leaf column for existing data. The backfill is designed to be:

Idempotent: Safe to run multiple times
Resumable: Can be stopped and restarted, will pick up where it left off
Non-blocking: Uses small batches with delays and SKIP LOCKED to minimize database impact
Error-resilient: Retries transient errors, tracks failures, fails gracefully
A new admin API endpoint (POST /admin/backfill) is also added for manual triggering of backfills with configurable batch size and delay parameters.
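To make the batching behaviour concrete, here is a rough sketch of a batch loop with a delay between batches. It is not the code in this PR: `run_backfill`, `process_batch`, and the `id` column in the SQL are assumptions for illustration, and the defaults mirror the 5000 rows / 1s values mentioned in the review summary.

```python
import time
from typing import Callable

# Illustrative Postgres query shape: claim a batch of unprocessed rows and
# skip any rows currently locked by another transaction. The `id` column
# is an assumption made for this sketch.
CLAIM_BATCH_SQL = """
SELECT id FROM stagedresource
WHERE is_leaf IS NULL
LIMIT :batch_size
FOR UPDATE SKIP LOCKED
"""


def run_backfill(
    process_batch: Callable[[int], int],
    batch_size: int = 5000,
    batch_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call process_batch until it reports zero updated rows.

    process_batch(batch_size) is a hypothetical callable that updates at most
    batch_size rows and returns how many it updated. Returns the total updated.
    """
    total = 0
    while True:
        updated = process_batch(batch_size)
        total += updated
        if updated == 0:
            return total
        # Pause between batches to limit load on the database.
        sleep(batch_delay_seconds)
```

Because each run only looks at rows that are still unprocessed, calling `run_backfill` again after completion (or after an interruption) is harmless: it either resumes with the remaining rows or returns immediately.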

Code Changes

Added is_leaf column to StagedResource model with composite index on (monitor_config_id, is_leaf)
Added Alembic migration d05acec55c64 to add the column with conditional index creation
Added post_upgrade_backfill.py module with batched backfill infrastructure including Redis locking, retry logic, and progress tracking
Added POST /admin/backfill endpoint with BACKFILL_EXEC scope for manual backfill triggering
Added entry to post_upgrade_index_creation.py for deferred index creation on large tables
Added unit tests for backfill logic and admin endpoint

Steps to Confirm

Run migrations and verify is_leaf column is added to stagedresource table
Verify existing staged resources with is_leaf = NULL get backfilled on application startup
Test manual backfill via POST /api/v1/admin/backfill endpoint (requires backfill:exec scope)
Verify backfill correctly sets is_leaf = true for fields with no children and non-object data types
Verify backfill correctly sets is_leaf = false for databases, schemas, tables, fields with children, or object-type fields

Pre-Merge Checklist


greptile-apps · 2026-01-28T22:13:56Z

Greptile Overview

Greptile Summary

This PR adds an is_leaf column to the StagedResource model to efficiently identify leaf fields in detection/discovery results, replacing expensive runtime calculations. The implementation includes a sophisticated post-upgrade backfill system designed for production safety.

Key Changes:

Database Schema: Adds nullable is_leaf boolean column to stagedresource table with a partial index ix_stagedresource_monitor_leaf_status_urn on (monitor_config_id, is_leaf, diff_status, urn) WHERE is_leaf IS NOT NULL
Migration Strategy: Conditional index creation based on table size (<1M rows creates index immediately, >=1M defers to post-upgrade), avoiding blocking migrations on large tables
Backfill Infrastructure: New reusable backfill system with Redis locking, retry with exponential backoff, batch processing with delays (default 5000 rows/batch, 1s delay), progress tracking, and backfill_history table for idempotency
API: New POST /admin/backfill endpoint with BACKFILL_EXEC scope for manual triggering, GET /admin/backfill for status monitoring
Semantic Design: null indicates "not applicable" for non-datastore monitors (not just temporary backfill state), allowing partial index optimization
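The retry behaviour listed above can be sketched roughly as follows. This is a generic illustration, not the helper in `utils.py`: `TransientError`, `with_retries`, and the default attempt count and delays are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable database error (hypothetical)."""


def with_retries(
    op: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on TransientError with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
    raise RuntimeError("unreachable")
```

With the defaults, a batch that fails twice and then succeeds waits 1s and then 2s before the third attempt.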

Architecture Highlights:

Batched updates with SKIP LOCKED to minimize lock contention
Automatic startup backfill via APScheduler with replace_existing=True preventing duplicate jobs
Proper lock release in finally blocks throughout
Comprehensive test coverage including edge cases for retry logic, consecutive failures, and lock management

Backfill Logic: Sets is_leaf=True for Field resources with no children AND non-object data types; is_leaf=False for all other datastore resources (Database, Schema, Table, Fields with children/object types); null for non-datastore monitor types.
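As a sketch of the rule summarized above: `compute_is_leaf` and its parameters are hypothetical names, and the way an object-type field is detected (here `data_type == "object"`) is an assumption, since the PR text does not spell out the exact check.

```python
from typing import Optional, Sequence


def compute_is_leaf(
    resource_type: str,
    children: Optional[Sequence[str]],
    data_type: Optional[str],
    is_datastore_monitor: bool = True,
) -> Optional[bool]:
    """Return the is_leaf value for one staged resource.

    None  -> not applicable (non-datastore monitor)
    True  -> Field with no children and a non-object data type
    False -> any other datastore resource
    """
    if not is_datastore_monitor:
        return None
    if resource_type != "Field":
        return False
    # Assumption: object-typed fields are marked with data_type == "object".
    return not children and data_type != "object"
```

For example, a `Field` with an empty `children` list and a string data type yields `True`, while a `Table` always yields `False`.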

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is well-architected with comprehensive error handling, proper locking mechanisms, extensive test coverage, and follows database migration best practices. The backfill system is idempotent, resumable, and designed to minimize production impact with batching and delays. One previous comment about the placeholder down_revision has been addressed - it now properly references the actual revision a1b2c3d4e5f7.
No files require special attention

Important Files Changed

Filename	Overview
src/fides/api/alembic/migrations/versions/xx_2026_02_03_2029_841e0b148993_add_backfill_history.py	Adds `backfill_history` table with `backfill_name` primary key and `completed_at` timestamp to track completed backfills
src/fides/api/alembic/migrations/versions/xx_2026_02_03_2033_81d2400b16ab_add_is_leaf_to_stagedresource.py	Adds nullable `is_leaf` boolean column to `stagedresource` table with conditional index creation based on table size (<1M rows), includes proper downgrade logic that clears backfill tracking
src/fides/api/migrations/post_upgrade_backfill.py	Core backfill orchestration module with Redis locking, scheduled task initialization with `replace_existing=True` (line 154), and proper lock management in try-finally blocks
src/fides/api/migrations/backfill_scripts/backfill_stagedresource_is_leaf.py	Implements is_leaf backfill logic using batched decorator, correctly setting is_leaf=True for Fields with no children and non-object data types using SKIP LOCKED
src/fides/api/migrations/backfill_scripts/utils.py	Robust backfill infrastructure with retry logic, exponential backoff, Redis locking, progress tracking, and `@batched_backfill` decorator for resumable operations
src/fides/api/api/v1/endpoints/admin.py	Adds POST and GET `/admin/backfill` endpoints with BACKFILL_EXEC scope, proper lock acquisition, background task execution, and status reporting
tests/ctl/api/test_admin.py	Comprehensive tests for backfill endpoints covering lock acquisition, conflict handling, parameter validation, and status reporting, but mock lock could be improved

greptile-apps

4 files reviewed, 2 comments

src/fides/api/models/detection_discovery/core.py

...alembic/migrations/versions/xx_2026_02_08_2339_f85bd4c08401_add_is_leaf_to_stagedresource.py

adamsachs

nice work! this generally looks good to me, and thank you for establishing a general framework for these 'backfill' scripts :)

i don't see anything problematic with the particular backfill script here, and the framework generally seems functionally solid: i like the API hooks for visibility/manual intervention, and the batching + retry logic, etc.

my most substantive question is around how we imagine future backfill scripts slotting in: what's the lifecycle of the scripts, and do they get deleted from the codebase?

other than that, i called out a few minor points. let me know what you think!

...alembic/migrations/versions/xx_2026_02_08_2339_f85bd4c08401_add_is_leaf_to_stagedresource.py

adamsachs · 2026-02-02T15:40:26Z

src/fides/api/migrations/post_upgrade_backfill.py

+        raise
+
+
+def backfill_is_leaf(


i like what you've done to abstract away the particular backfill task from a generic backfill framework. i think one thing that could help formalize that framework a bit would be to pull out the particular backfill task (and its sub-functions) into its own module/file, separate from the generic 'backfill' framework module you've got here. what do you think?

I like that idea! I’ll move it to another file

I split the original file into 3 files (added a utils module), and moved shared logic into a decorator for easier reuse across backfills. I also added a README explaining how to use it -> a740f86

adamsachs · 2026-02-02T15:47:30Z

src/fides/api/migrations/post_upgrade_backfill.py

+    # Backfill is_leaf column (added in migration d05acec55c64)
+    results.append(backfill_is_leaf(db, batch_size, batch_delay_seconds))
+
+    # Add future backfills here:
+    # results.append(backfill_some_other_column(db, batch_size, batch_delay_seconds))


OK nice, this seems pretty easy to slot in more backfill scripts as needed!

generally, do you imagine we'll be removing the backfill_is_leaf after a few releases, when we're sure that all of our clients have completed it? or do we just leave it in there for perpetuity, and rely on it basically being skipped over once it's been completed?

i do wonder whether it's worth establishing a very lightweight db table to keep track of the backfills, just so we can definitively say 'backfill x has been completed on this fides instance' -- basically to persist the BackfillResults. i get a bit worried about us relying on the logical check (get_pending_is_leaf_count) if e.g. we ever change some of the application logic and get_pending_is_leaf_count is no longer valid -- we have to remember to update this backfill script, which isn't really part of the 'normal' application code.

i don't know the right answer here, just posing the question to make sure we get this right and don't create too much of a maintenance burden going forward.

I was initially imagining that we’d remove these backfills as we’re able to, and in this specific case, create a follow-up ticket to remove the script once we’re confident all clients are covered. As you said, as long as the logic doesn’t change, having the script around doesn’t really hurt, but at some point it probably should be disabled or removed.

I do like the idea of tracking backfill completion in the database. It’s definitely safer and avoids relying on application logic that could change over time. That said, I also wonder about the long-term maintenance of an extra table like that (and the risk of it becoming something we forget about).

I’m going to try implementing it and see how much effort it actually is, I don’t expect it to take too long (Cursor helping 😄). I’ll report back with what I find and we can decide from there.

One additional consideration: if we ever roll back a column related to a backfill, we’d also need to remove the corresponding row from the new table

Implemented the database table solution in this commit-> 1096ee8
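For readers following the thread, a small sketch of what such a completion-tracking table can look like. It uses the standard-library `sqlite3` module as a stand-in for Postgres so it runs anywhere; the column names follow the review summary (`backfill_name` primary key, `completed_at`), while the helper function names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def init_history(conn: sqlite3.Connection) -> None:
    """Create the tracking table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_history ("
        "backfill_name TEXT PRIMARY KEY, completed_at TEXT NOT NULL)"
    )


def is_completed(conn: sqlite3.Connection, name: str) -> bool:
    """Check whether a backfill has been recorded as complete."""
    row = conn.execute(
        "SELECT 1 FROM backfill_history WHERE backfill_name = ?", (name,)
    ).fetchone()
    return row is not None


def mark_completed(conn: sqlite3.Connection, name: str) -> None:
    """Record completion; inserting twice is a no-op thanks to the primary key."""
    conn.execute(
        "INSERT OR IGNORE INTO backfill_history (backfill_name, completed_at) "
        "VALUES (?, ?)",
        (name, datetime.now(timezone.utc).isoformat()),
    )


def clear_completed(conn: sqlite3.Connection, name: str) -> None:
    """Remove the record, e.g. when the related column is rolled back."""
    conn.execute("DELETE FROM backfill_history WHERE backfill_name = ?", (name,))
```

`clear_completed` corresponds to the rollback consideration raised above: if the column is dropped in a downgrade, the tracking row has to go too, otherwise a later upgrade would skip the backfill.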

adamsachs · 2026-02-02T15:58:16Z

src/fides/api/migrations/post_upgrade_backfill.py

+BACKFILL_LOCK_TTL = 300  # 5 minutes TTL, refreshed every 10 batches
+
+
+def acquire_backfill_lock() -> bool:


is there a reason we're not using our redis_lock constructs in lock.py? or at least the native redis.lock?

It initially felt more practical not to use redis_lock since I didn’t want to pass the lock around as a parameter or define it as a global. That said, using redis.lock does give us lock.owned(), which is an advantage.

I’ve updated it to use redis.lock in this commit -> 04772a9
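A rough sketch of the pattern discussed here, using the lock object returned by a redis-py client (`client.lock(...)`, then `acquire`, `owned`, `release`). The `run_with_lock` helper and the lock name are hypothetical, and the client is passed in so the sketch does not depend on a running Redis server.

```python
from typing import Any, Callable


def run_with_lock(
    client: Any,
    name: str,
    ttl_seconds: int,
    work: Callable[[], None],
) -> bool:
    """Run work() while holding a Redis lock; return False if the lock is taken.

    client is expected to behave like redis.Redis: client.lock(name, timeout=...)
    returns an object with acquire(blocking=...), owned() and release().
    """
    lock = client.lock(name, timeout=ttl_seconds)
    # Non-blocking acquire: give up immediately if another worker holds the lock.
    if not lock.acquire(blocking=False):
        return False
    try:
        work()
    finally:
        # The TTL may have expired during long work; only release a lock we still own.
        if lock.owned():
            lock.release()
    return True
```

With a real client this might be called as `run_with_lock(redis.Redis(), "backfill", 300, do_backfill)`, where 300 matches the 5 minute TTL in the diff above and `do_backfill` is whatever callable performs the work.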

vcruces · 2026-02-06T13:50:13Z

@greptileai

greptile-apps

8 files reviewed, 3 comments

src/fides/api/migrations/post_upgrade_backfill.py

...ides/api/alembic/migrations/versions/xx_2026_02_08_2338_aa8e1bd48402_add_backfill_history.py

tests/ctl/api/test_admin.py

vcruces · 2026-02-06T22:29:06Z

@greptileai

greptile-apps

7 files reviewed, no comments

adamsachs

great work here! thanks for the care you took in addressing all of my feedback. the batched_backfill decorator is really clean, that was a great improvement to help formalize this as a framework we can build on! 👌

adamsachs · 2026-02-10T15:19:07Z

src/fides/api/migrations/backfill_scripts/backfill_stagedresource_is_leaf.py

+        raise
+
+
+@batched_backfill(


very nice to have this as a decorator 👏
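For context, a decorator of this kind could look roughly like the sketch below. This is not the `@batched_backfill` signature from the PR (its arguments are elided in the diff above); the name `batched_backfill_sketch`, the `completed` set standing in for the backfill_history table, and the parameters are all assumptions.

```python
import functools
import time
from typing import Callable, Set


def batched_backfill_sketch(
    name: str,
    completed: Set[str],
    sleep: Callable[[float], None] = time.sleep,
):
    """Turn a single-batch function into a resumable, skip-if-done backfill."""

    def decorator(process_batch: Callable[[int], int]):
        @functools.wraps(process_batch)
        def runner(batch_size: int = 5000, batch_delay_seconds: float = 1.0) -> int:
            # Skip entirely if this backfill was already recorded as complete.
            if name in completed:
                return 0
            total = 0
            # Keep processing until a batch reports no updated rows.
            while (updated := process_batch(batch_size)) > 0:
                total += updated
                sleep(batch_delay_seconds)
            completed.add(name)
            return total

        return runner

    return decorator
```

The appeal of the decorator form is that each new backfill only has to supply the single-batch function; looping, delays, and completion tracking live in one place.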

vcruces force-pushed the ENG-2293 branch 2 times, most recently from 964a8b0 to 46c5634 Compare January 28, 2026 20:36

vcruces marked this pull request as ready for review January 28, 2026 22:10

vcruces requested a review from a team as a code owner January 28, 2026 22:10

vcruces requested review from adamsachs and thabofletcher and removed request for a team and thabofletcher January 28, 2026 22:10

vcruces added the "do not merge" label (Please don't merge yet, bad things will happen if you do) Jan 28, 2026

greptile-apps bot reviewed Jan 28, 2026

src/fides/api/models/detection_discovery/core.py (outdated)

...alembic/migrations/versions/xx_2026_02_08_2339_f85bd4c08401_add_is_leaf_to_stagedresource.py

vcruces force-pushed the ENG-2293 branch 4 times, most recently from 35940be to b52ea74 Compare February 2, 2026 16:01

adamsachs reviewed Feb 2, 2026

vcruces force-pushed the ENG-2293 branch 3 times, most recently from 6b1fd22 to 3fd4eda Compare February 5, 2026 14:06

greptile-apps bot reviewed Feb 6, 2026

src/fides/api/migrations/post_upgrade_backfill.py

...ides/api/alembic/migrations/versions/xx_2026_02_08_2338_aa8e1bd48402_add_backfill_history.py

tests/ctl/api/test_admin.py

vcruces force-pushed the ENG-2293 branch 4 times, most recently from 2ac7943 to a980566 Compare February 6, 2026 22:25

greptile-apps bot reviewed Feb 6, 2026

vcruces force-pushed the ENG-2293 branch from a980566 to 864d4f6 Compare February 8, 2026 23:48

ENG-2293 - Add is_leaf column to StagedResource and backfill

a3e6ab3

vcruces added 20 commits February 10, 2026 09:25

Add tests

ebaabaa

Changelog

a66ca97

Add is_leaf to .fides/db_dataset.yml

a9ef097

Fix tests

35c7991

is_leaf null by default

46f7993

Add endpoint to check status

22b97f0

Add comment

4aa0870

Add diff status and urn to the new index

f1fb77b

Small changes to post_upgrade_backfill.py

64c4266

Split the backfill file into 3

1627c2f

Add table backfill_history

53e488b

Use redis.lock

fe5bfe2

__init__.py

b7ac7f8

db_dataset

3d24886

Update migration

746a118

Update migration

fadedb6

Tests small fixes and greptile suggestion

21b9147

Move and add tests

6e61270

Update migrations

2a64fde

Format with ruff

3e57b4b

vcruces force-pushed the ENG-2293 branch from f72e794 to 3e57b4b Compare February 10, 2026 12:25

Fix comment

652022a

adamsachs approved these changes Feb 10, 2026

vcruces added this pull request to the merge queue Feb 10, 2026

vcruces removed the "do not merge" label (Please don't merge yet, bad things will happen if you do) Feb 10, 2026

Merged via the queue into main with commit 3b22f14 Feb 10, 2026
54 of 55 checks passed

vcruces deleted the ENG-2293 branch February 10, 2026 15:59
