ENG-2756: Skip watchdog cancellation for pending tasks awaiting upstream dependencies #7525

Merged
JadeCara merged 16 commits into main from
ENG-2756-fix-watchdog-false-positive-pending-tasks-2
Mar 2, 2026

@JadeCara JadeCara commented Feb 27, 2026

Ticket ENG-2756

Description Of Changes

Fixes a false-positive in the requeue_interrupted_tasks watchdog that was incorrectly canceling privacy requests during normal DSR execution.

Root cause: The watchdog queries for request tasks in pending or in_processing status and checks for a cached Celery task ID in Redis. Tasks only get a cache key when they're dispatched via queue_request_task(), which only happens once all upstream tasks are complete. Tasks that are legitimately pending and waiting for upstream dependencies have never been dispatched and therefore have no cache key — the watchdog was treating these as "stuck" and canceling the entire privacy request.

This was most visible on the erasure phase of multi-connector DSRs: the access terminator completes and re-queues the privacy request for erasure, but the new erasure-phase tasks start as pending with no cache keys, and the watchdog fires during that transition window.

Fix: Before canceling, check if the task is pending with incomplete upstream dependencies. If so, skip it — it's legitimately waiting. Only cancel if upstream tasks are complete but the task was never dispatched (truly stuck).
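The decision described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `watchdog_decision`, the `Decision` enum, and the plain string statuses are hypothetical names, not the actual code in request_service.py, and the branch for tasks that do have a cached Celery task ID is left as a placeholder.

```python
from enum import Enum


class Decision(Enum):
    SKIP = "skip"        # leave the task alone; it is legitimately waiting
    CANCEL = "cancel"    # treat the task as stuck and cancel the privacy request
    INSPECT = "inspect"  # a Celery task ID is cached; the existing checks apply


def watchdog_decision(
    status: str, has_cached_task_id: bool, awaiting_upstream: bool
) -> Decision:
    """Hypothetical helper mirroring the guard described in this PR."""
    if has_cached_task_id:
        # The task was dispatched at some point, so the pre-existing
        # inspection/requeue path handles it (not modeled in this sketch).
        return Decision.INSPECT
    if status == "pending" and awaiting_upstream:
        # Never dispatched because upstream tasks are still running.
        return Decision.SKIP
    # No cache key and nothing left to wait for: the task is stuck.
    return Decision.CANCEL
```

For example, `watchdog_decision("pending", False, True)` yields `Decision.SKIP`, while the same task with `awaiting_upstream=False` yields `Decision.CANCEL`.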

Code Changes

  • src/fides/api/service/privacy_request/request_service.py - Added upstream dependency check in requeue_interrupted_tasks before the cancel path for tasks with no cached subtask ID
  • tests/task/test_requeue_interrupted_tasks.py - Added two new tests covering the pending-awaiting-upstream (skip) and pending-upstream-complete-no-cache-key (cancel) cases

Steps to Confirm

Run the existing watchdog test suite. A passing run of tests/task/test_requeue_interrupted_tasks.py with no regressions is the primary verification for this fix.

    pytest tests/task/test_requeue_interrupted_tasks.py

The two new tests added in this PR directly cover the fixed behavior:

  • test_pending_task_awaiting_upstream_is_not_canceled — verifies the watchdog skips legitimately waiting tasks
  • test_pending_task_with_complete_upstream_and_no_cache_key_is_canceled — verifies truly stuck tasks are still canceled

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration label to the entry if your change includes a DB migration
    • Add a high-risk label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • No UX review needed
  • Followup issues:
    • No followup issues
  • Database migrations:
    • No migrations
  • Documentation:
    • No documentation updates required

Made with Cursor

…encies

The requeue_interrupted_tasks monitor was false-positiving on pending
request tasks that haven't been dispatched to Celery yet because their
upstream dependencies aren't complete. These tasks legitimately have no
cache key (cache_task_tracking_key is only called at dispatch time), so
the monitor was incorrectly treating them as stuck and canceling the
entire privacy request.

Fix: before canceling, check if the task is pending and its upstream
tasks are incomplete — if so, skip it. Only cancel if upstream is done
but the task was never dispatched (truly stuck).

Fixes ENG-2756.

Made-with: Cursor

Jade Wibbels added 2 commits February 27, 2026 15:56
@JadeCara JadeCara marked this pull request as ready for review February 27, 2026 23:07
@JadeCara JadeCara requested a review from a team as a code owner February 27, 2026 23:07
@JadeCara JadeCara requested review from erosselli and removed request for a team February 27, 2026 23:07

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR fixes a false-positive in the requeue_interrupted_tasks watchdog that was incorrectly canceling privacy requests whose tasks were legitimately pending while waiting for upstream dependencies to finish. The fix refactors _get_request_task_ids_in_progress to yield (task_id, status, awaiting_upstream) tuples and adds an upstream-dependency guard before the cancel path, correctly mirroring the existing RequestTask.upstream_tasks_complete() model logic via an efficient pre-built status lookup dictionary.

  • Core logic (request_service.py): _get_request_task_ids_in_progress now loads all tasks for the privacy request in a single query, builds a (collection_address, action_type) → status lookup dict, and yields an awaiting_upstream flag derived from that dict — avoiding extra per-task DB round-trips while correctly handling missing upstream records (returns None, which fails the completion check, same safe default as the model method).
  • Guard condition (requeue_interrupted_tasks): A pending task without a cache key is now only canceled if awaiting_upstream is False; if it's still waiting on upstreams it is skipped via continue, preventing the false-positive.
  • Tests (test_requeue_interrupted_tasks.py): Two new integration tests directly cover the two key branches; existing unit-test mocks in test_request_service.py are updated to the new tuple return format.
  • Minor style issues: A single-character loop variable t is used in a dict comprehension in request_service.py, and both new tests manually delete database records in finally blocks despite the database being automatically cleared between runs.
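The single-pass lookup summarized above can be illustrated with a minimal sketch. Everything here is a simplified stand-in: `TaskRow`, `tasks_in_progress`, and the string statuses are hypothetical, and the members of `COMPLETED_STATUSES` are an assumption about what COMPLETED_EXECUTION_LOG_STATUSES contains. The real helper may also compute the flag only for pending tasks.

```python
from dataclasses import dataclass, field
from typing import Iterator

# Assumed stand-in for COMPLETED_EXECUTION_LOG_STATUSES; real members may differ.
COMPLETED_STATUSES = {"complete", "skipped"}
IN_PROGRESS_STATUSES = {"pending", "in_processing"}


@dataclass
class TaskRow:
    """Simplified stand-in for the columns read from RequestTask."""

    id: str
    status: str
    collection_address: str
    action_type: str
    upstream_tasks: list[str] = field(default_factory=list)


def tasks_in_progress(all_tasks: list[TaskRow]) -> Iterator[tuple[str, str, bool]]:
    # One pass to index every task's status by (collection_address, action_type).
    status_by_address = {
        (task.collection_address, task.action_type): task.status
        for task in all_tasks
    }
    for task in all_tasks:
        if task.status not in IN_PROGRESS_STATUSES:
            continue
        # A missing upstream record yields None, which is not a completed
        # status, so the task is treated as still waiting.
        awaiting_upstream = any(
            status_by_address.get((address, task.action_type))
            not in COMPLETED_STATUSES
            for address in task.upstream_tasks
        )
        yield task.id, task.status, awaiting_upstream
```

Because the function is a generator, the caller can iterate the tuples lazily without the helper building a full result list first.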

Confidence Score: 4/5

  • Safe to merge — the fix correctly addresses the false-positive cancel path with sound logic and good test coverage; only minor style concerns remain.
  • The upstream-completion logic correctly mirrors the existing model method, the single-query approach is efficient, existing tests are updated, and two new integration tests directly exercise the fixed code paths. Minor deductions for a single-character variable name and unnecessary manual DB teardown in the new tests.
  • No files require special attention beyond the minor style notes on request_service.py and test_requeue_interrupted_tasks.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/fides/api/service/privacy_request/request_service.py | Core fix: refactors _get_request_task_ids_in_progress from a list of IDs to a generator of (id, status, awaiting_upstream) tuples; correctly mirrors RequestTask.upstream_tasks_complete() logic using a pre-built status lookup dict; adds the upstream-awaiting guard before the cancel path. Minor style issue with single-character variable t. |
| tests/task/test_requeue_interrupted_tasks.py | Adds two well-structured integration tests covering the pending-awaiting-upstream (skip) and pending-upstream-complete-no-cache-key (cancel) scenarios; manual record deletion in finally blocks is unnecessary and inconsistent with automated DB teardown. |
| tests/ops/service/privacy_request/test_request_service.py | Updates two existing unit-test mocks to return the new tuple format (id, status, awaiting_upstream) instead of plain strings; straightforward and correct. |
| changelog/7525-fix-watchdog-false-positive-pending-tasks.yaml | New changelog entry correctly typed as Fixed with a clear description of the bug addressed. |

Last reviewed commit: 7f09aff

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@JadeCara
Contributor Author

@greptile please review

Contributor

can we name the file 7525-descriptive-slug?

Contributor Author

Done — renamed to 7525-fix-watchdog-false-positive-pending-tasks.yaml.

    should_requeue = False
    break

    # A pending task that hasn't been dispatched to Celery yet will

Contributor

nit: this method is already huge, should we maybe wrap the new logic in a method that we call here?

Contributor Author

Addressed both this and the query optimization comment below — reworked _get_request_task_ids_in_progress to load full RequestTask objects and pre-compute an awaiting_upstream flag. This moves the pending-task logic out of the main method and eliminates the per-iteration re-query of RequestTask by ID. The upstream_tasks_complete() call still happens per pending task within the helper, but the extra object lookup is gone.

Contributor Author

Follow-up in 51c48a6: went further and replaced the per-task upstream_tasks_complete() DB calls with a single column-projection query + in-memory lookup dict. Also switched to a generator to avoid building the full result list, and queries only the 5 columns needed (avoids loading large JSON blobs on RequestTask).

Comment on lines +664 to +667
    request_task_obj = (
        db.query(RequestTask)
        .filter(RequestTask.id == request_task_id)
        .first()

Contributor

to avoid the extra query (or more than one if upstream_tasks_complete also runs its own query) in each iteration, don't we want to rework _get_request_task_ids_in_progress -- or write a separate method -- that returns something like task_id, status, has_incomplete_upstream_tasks instead?

Contributor Author

Done — see reply above. _get_request_task_ids_in_progress now returns (task_id, status, awaiting_upstream) tuples with upstream completion pre-computed, eliminating the per-iteration RequestTask lookup.

Contributor Author

Follow-up in 51c48a6: fully addressed — _get_request_task_ids_in_progress now does a single db.query() with column projection (id, status, collection_address, action_type, upstream_tasks), builds a (collection_address, action_type) → status lookup, and computes upstream completion in Python. Zero per-task DB queries regardless of how many pending tasks.
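The "one projected query, then in-memory lookup" idea can be demonstrated with the standard library alone. This is a rough sketch, not the PR's SQLAlchemy code: it uses sqlite3, a simplified `requesttask` table, a hypothetical `large_payload` column standing in for the JSON blobs, and an assumed set of completed statuses. A trace callback counts the SELECT statements actually executed.

```python
import json
import sqlite3

# Assumed stand-in for COMPLETED_EXECUTION_LOG_STATUSES.
COMPLETED_STATUSES = {"complete", "skipped"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requesttask ("
    "id TEXT, privacy_request_id TEXT, status TEXT, collection_address TEXT, "
    "action_type TEXT, upstream_tasks TEXT, large_payload TEXT)"
)
conn.executemany(
    "INSERT INTO requesttask VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("t1", "pr1", "complete", "db:users", "access", json.dumps([]), "blob"),
        ("t2", "pr1", "pending", "db:orders", "access", json.dumps(["db:users"]), "blob"),
        ("t3", "pr1", "pending", "db:payments", "access", json.dumps(["db:orders"]), "blob"),
    ],
)

# Record every statement executed from here on.
executed = []
conn.set_trace_callback(executed.append)

# Single query, projecting only the five columns the check needs.
rows = conn.execute(
    "SELECT id, status, collection_address, action_type, upstream_tasks "
    "FROM requesttask WHERE privacy_request_id = ?",
    ("pr1",),
).fetchall()
conn.set_trace_callback(None)

select_count = sum(
    1 for statement in executed if statement.lstrip().upper().startswith("SELECT")
)

# Everything below is pure Python: no further database access.
status_by_address = {
    (address, action_type): status
    for _task_id, status, address, action_type, _upstream in rows
}
awaiting_by_task = {
    task_id: any(
        status_by_address.get((upstream, action_type)) not in COMPLETED_STATUSES
        for upstream in json.loads(upstream_json)
    )
    for task_id, status, _address, action_type, upstream_json in rows
    if status == "pending"
}
```

In this sketch `select_count` stays at 1 however many pending tasks the privacy request has, since the upstream check never goes back to the database.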

Jade Wibbels and others added 3 commits March 2, 2026 11:52
- Rename changelog to 7525-fix-watchdog-false-positive-pending-tasks.yaml
- Move pending-task upstream check into _get_request_task_ids_in_progress
- Return (task_id, status, awaiting_upstream) tuples to eliminate per-iteration DB query
- Update test mocks to match new 3-tuple signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jade Wibbels and others added 2 commits March 2, 2026 12:58
Replace per-task upstream_tasks_complete() DB calls with a single
column-projection query and in-memory lookup. Uses a generator to
avoid building the full result list and queries only the 5 columns
needed (avoiding large JSON blobs on RequestTask).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…f github.com:ethyca/fides into ENG-2756-fix-watchdog-false-positive-pending-tasks-2
Jade Wibbels and others added 3 commits March 2, 2026 13:11
- Move pending+awaiting check before async task DB query to skip
  unnecessary hit for tasks already known to be legitimately waiting
- Parametrize stuck-task cancellation test to cover both scenarios:
  complete upstream (existing) and root task with no upstream (new)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents that the in-memory status_by_address lookup mirrors
RequestTask.upstream_tasks_complete() — missing upstream records
are treated as incomplete (same safe default).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
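The edge cases named in these commits (a root task with no upstream, and a missing upstream record) can be checked with a short sketch. `is_awaiting_upstream` is a hypothetical helper written in the `not all(...)` form used at this point in the PR, and the contents of `COMPLETED_STATUSES` are an assumption.

```python
# Assumed stand-in for COMPLETED_EXECUTION_LOG_STATUSES.
COMPLETED_STATUSES = {"complete", "skipped"}


def is_awaiting_upstream(status_by_address, upstream_addresses, action_type):
    """Hypothetical helper: True if any upstream task is not yet complete."""
    return not all(
        status_by_address.get((address, action_type)) in COMPLETED_STATUSES
        for address in upstream_addresses
    )


statuses = {("db:users", "erasure"): "complete"}

# Upstream finished: the task is no longer waiting.
upstream_done = is_awaiting_upstream(statuses, ["db:users"], "erasure")
# No record for the upstream: .get() returns None, treated as incomplete.
upstream_missing = is_awaiting_upstream(statuses, ["db:missing"], "erasure")
# Root task with no upstream: all() over an empty iterable is True,
# so the task is not waiting and remains eligible for cancellation.
root_task = is_awaiting_upstream(statuses, [], "erasure")
```

Treating a missing record as incomplete is the conservative choice: the watchdog skips the task rather than canceling a request on incomplete information.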

JadeCara commented Mar 2, 2026

@greptile please review

greptile-apps bot commented Mar 2, 2026

Additional Comments (2)

src/fides/api/service/privacy_request/request_service.py, line 488
Single-character variable name in dict comprehension

The loop variable t is a single character, which violates the project convention of using full names for variables. Consider a more descriptive name to improve readability.

    status_by_address: dict[tuple[str, str], ExecutionLogStatus] = {
        (task.collection_address, task.action_type): task.status for task in all_tasks
    }

Context Used: Rule from dashboard - Use full names for variables, not 1 to 2 characters (source)

tests/task/test_requeue_interrupted_tasks.py, line 647
Manual record deletion in tests

Both new tests manually delete database records in finally blocks (downstream_task.delete(db), upstream_task.delete(db), privacy_request.delete(db)). The database is automatically cleared between test runs, so these explicit deletions are unnecessary. The same pattern also appears at tests/task/test_requeue_interrupted_tasks.py:736-741.

The standard approach in this test file is to create records in pytest fixtures (see the existing in_progress_privacy_request / in_progress_request_task fixtures). Consider extracting the setup into parametrized fixtures so the teardown is handled automatically.

Context Used: Rule from dashboard - Do not manually delete database records in test fixtures or at the end of tests, as the database is ... (source)

Jade Wibbels and others added 2 commits March 2, 2026 14:16
- Rename single-char `t` to `task` in status_by_address dict comprehension
- Remove manual try/finally cleanup in tests (db is cleared between runs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@erosselli erosselli left a comment

approved with one more suggestion

Comment on lines +504 to +508
    awaiting_upstream = not all(
        status_by_address.get((addr, task.action_type))
        in COMPLETED_EXECUTION_LOG_STATUSES
        for addr in upstream_addrs
    )

Contributor

can't we use any so we short-circuit when we find one, rather than iterating over all and then negating with not?

Suggested change
    - awaiting_upstream = not all(
    -     status_by_address.get((addr, task.action_type))
    -     in COMPLETED_EXECUTION_LOG_STATUSES
    -     for addr in upstream_addrs
    - )
    + awaiting_upstream = any(
    +     status_by_address.get((addr, task.action_type))
    +     not in COMPLETED_EXECUTION_LOG_STATUSES
    +     for addr in upstream_addrs
    + )

Clearer intent: "any upstream is incomplete" reads more directly
than "not all upstreams are complete". Same short-circuit behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
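The commit message's claim that both forms behave the same can be verified directly. The sketch below uses hypothetical helper names and an assumed set of completed statuses; it compares the two expressions exhaustively over short status lists and counts how many items each one consumes before stopping.

```python
from itertools import product

# Assumed stand-in for COMPLETED_EXECUTION_LOG_STATUSES.
COMPLETED = {"complete", "skipped"}
CANDIDATES = ["complete", "skipped", "pending", "error", None]


def not_all_complete(statuses):
    return not all(status in COMPLETED for status in statuses)


def any_incomplete(statuses):
    return any(status not in COMPLETED for status in statuses)


# Exhaustively compare both forms on every status list of length 0 to 3.
mismatches = [
    combo
    for length in range(4)
    for combo in product(CANDIDATES, repeat=length)
    if not_all_complete(combo) != any_incomplete(combo)
]


def count_consumed(check, statuses):
    """Return how many items `check` pulled from the input before stopping."""
    consumed = 0

    def tracked():
        nonlocal consumed
        for status in statuses:
            consumed += 1
            yield status

    check(tracked())
    return consumed


sample = ["pending", "complete", "complete"]
consumed_by_all = count_consumed(not_all_complete, sample)
consumed_by_any = count_consumed(any_incomplete, sample)
```

Both built-ins stop at the first deciding element, so the change affects readability rather than the amount of work done.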
@JadeCara JadeCara added this pull request to the merge queue Mar 2, 2026
Merged via the queue into main with commit 859b345 Mar 2, 2026
57 checks passed
@JadeCara JadeCara deleted the ENG-2756-fix-watchdog-false-positive-pending-tasks-2 branch March 2, 2026 22:08