Skip to content

ENG-2474: Fix duplicate detection race conditions#7785

Merged
JadeCara merged 9 commits intomainfrom
eng-2474/duplicate-detection-tla-spec
Mar 31, 2026
Merged

ENG-2474: Fix duplicate detection race conditions#7785
JadeCara merged 9 commits intomainfrom
eng-2474/duplicate-detection-tla-spec

Conversation

@JadeCara
Copy link
Copy Markdown
Contributor

@JadeCara JadeCara commented Mar 30, 2026

Ticket ENG-2474

Description Of Changes

TLA+ model checking of the duplicate detection flow revealed two race condition bugs that allow multiple requests for the same identity to be actively processed simultaneously.

Code Changes

  • request_service.py: Added check_for_duplicates call in _handle_privacy_request_requeue before requeuing — if the request is now a duplicate, skip the requeue entirely
  • privacy_request.py: Removed PrivacyRequestStatus.duplicate from allowed statuses in verify_identity — a request already marked as duplicate can no longer be re-verified back to pending
  • specs/DuplicateDetection.tla: TLA+ spec with two configs (_buggy.cfg violates AtMostOneActivePerIdentity, _fixed.cfg passes all invariants across 3,592 states)

Steps to Confirm

  1. Run test_request_service.py and test_duplication_detection.py — all 67 tests pass
  2. Run TLA+ model checker with _buggy.cfg (expect violation) and _fixed.cfg (expect pass)
  3. Manual: submit two DSRs with the same identity, verify the second — only one should be active in Request Manager

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label to the entry if your change includes a DB migration
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label to the entry if your change includes a high-risk change
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • No UX review needed
  • Followup issues:
    • No followup issues
  • Database migrations:
    • No migrations
  • Documentation:
    • No documentation updates required

🤖 Generated with Claude Code

TLA+ model checking revealed two bugs in duplicate detection:

1. Requeue watchdog (_handle_privacy_request_requeue) would blindly
   requeue requests without checking for duplicates first. A request
   marked duplicate by another process could be requeued and processed.
   Fix: call check_for_duplicates before requeuing; skip if duplicate.

2. verify_identity allowed transition from "duplicate" → "pending",
   meaning a request already marked as duplicate could be re-verified
   and escape back into the processing pipeline.
   Fix: only allow verification from "identity_unverified" status.

Also includes the TLA+ spec (specs/DuplicateDetection.tla) with two
configs: _buggy.cfg (both fixes OFF, AtMostOneActivePerIdentity
violated) and _fixed.cfg (both fixes ON, all invariants pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown
Contributor

vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
fides-plus-nightly Ready Ready Preview, Comment Mar 31, 2026 3:05pm
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
fides-privacy-center Ignored Ignored Mar 31, 2026 3:05pm

Request Review

@JadeCara JadeCara marked this pull request as ready for review March 30, 2026 16:51
@JadeCara JadeCara requested a review from a team as a code owner March 30, 2026 16:51
@JadeCara JadeCara requested review from erosselli and removed request for a team March 30, 2026 16:51
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — ENG-2474: Fix duplicate detection race conditions

Good bug identification and solid use of TLA+ to reason about the race conditions. The verify_identity fix is correct and closes the key vulnerability cleanly. However, there are two issues that need attention before merge.


Critical (Must Fix)

1. check_for_duplicates in _handle_privacy_request_requeue is dead code

requeue_interrupted_tasks only fetches requests with status in [in_processing, approved, requires_input, pending_external] (request_service.py:587–593). All four of those statuses are members of ACTIONED_REQUEST_STATUSES. Inside is_duplicate_request, line 383 of duplication_detection.py short-circuits immediately:

if request.status in ACTIONED_REQUEST_STATUSES:
    return False  # actioned requests are never duplicates

So check_for_duplicates will never mark the privacy request as duplicate, and the guard if privacy_request.status == PrivacyRequestStatus.duplicate on line 433 of request_service.py is unreachable. The requeue watchdog fix provides no actual protection.

The bug report scenario ("requires_input" and "unverified" visible simultaneously) is addressed entirely by the verify_identity change — the requeue code adds nothing. To genuinely prevent a stuck-but-now-superseded ACTIONED request from being requeued, the call site would need to bypass the early-return (e.g., query the duplicate group directly to check for a newer canonical request) or is_duplicate_request would need a separate entry point for this context.

2. No tests for the new code path

TestHandlePrivacyRequestRequeue has four test cases but none cover the new duplicate-check branch. Writing a meaningful test for this path will likely surface the no-op issue described above.


Suggestions

TLA+/Python model gap (see inline comment on DuplicateDetection.tla line 106): The DupCheckStatus operator in the spec allows the watchdog to mark an ACTIONED request as duplicate when two ACTIONED requests coexist. The Python implementation cannot do this (early return). The formal proof that AtMostOneActivePerIdentity holds therefore does not validate the Python requeue fix — it validates a slightly more powerful algorithm than what was implemented. This gap should either be documented or closed by adding the corresponding guard to the spec.

VerifiedRequestProgress liveness not checked: The temporal property is defined in the spec but missing from both configs' PROPERTIES section. Safety invariants alone don't rule out starvation. See inline comment on DuplicateDetection_fixed.cfg.


Nice to Have

Consider a NumRequests = 3 TLC config to exercise three-way interleavings, which is where the tiebreaker logic in DupCheckStatus gets the most thorough workout.

"Privacy request {} is a duplicate, skipping requeue.",
privacy_request.id,
)
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: check_for_duplicates here is always a no-op

requeue_interrupted_tasks only selects requests in [in_processing, approved, requires_input, pending_external] (lines 587–593 of this file). Every one of those statuses is in ACTIONED_REQUEST_STATUSES. Inside is_duplicate_request, there is an early return at line 383 of duplication_detection.py:

if request.status in ACTIONED_REQUEST_STATUSES:
    return False   # actioned requests are never duplicates

So check_for_duplicates will always return without marking the request as duplicate, and the guard on line 433 (if privacy_request.status == PrivacyRequestStatus.duplicate) can never be True. The early-return block is effectively dead code.

The real protection against the race condition in the bug report comes entirely from the verify_identity change in privacy_request.py (removing duplicate from allowed starting statuses). This requeue guard does not add any protection.

To actually prevent a stuck-but-now-duplicate request from being requeued, is_duplicate_request would need to treat ACTIONED requests differently at this call site (e.g., skip the early return and let the "others are actioned" rule mark this one as duplicate), or an alternative check is needed (e.g., look up the duplicate group directly and check if a newer canonical request exists).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right that this is dead code. All statuses fetched by requeue_interrupted_tasks are in ACTIONED_REQUEST_STATUSES, so is_duplicate_request early-returns False every time. Removing the dead code. The verify_identity change is the substantive fix here.

# ENG-2474: Run duplicate detection before requeuing. If the request is
# now a duplicate (e.g. another request for the same identity was verified
# while this one was stuck), mark it and skip the requeue entirely.
check_for_duplicates(db, privacy_request)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for this new branch

TestHandlePrivacyRequestRequeue has four test cases but none of them exercise the new duplicate-detection path. There should be at least:

  1. A test that mocks check_for_duplicates to flip privacy_request.status to PrivacyRequestStatus.duplicate and asserts that _requeue_privacy_request is not called and the info log is emitted.
  2. A test confirming that when check_for_duplicates leaves the status unchanged, the existing requeue/cancel logic proceeds normally (to guard against unintended interactions).

Note that writing test (1) will likely expose the no-op issue described on line 438, because the only way to make the guard fire in a real call is to manually override the status — which is not what the code does.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by removing the dead code (see previous comment). Since check_for_duplicates is a no-op for all requeued statuses, the branch was removed entirely — no new test needed.

PrivacyRequestStatus.identity_unverified,
PrivacyRequestStatus.duplicate,
]:
if self.status != PrivacyRequestStatus.identity_unverified:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks correct and is the substantive fix for the identity-verification race. Removing duplicate from the allowed starting statuses closes the hole where a request that was already marked as a duplicate could be re-verified back to pending, bypassing the duplicate-detection decision entirely.

Minor nit: the existing error message still says "Invalid identity verification request" — worth keeping an eye on whether the message remains clear to callers (the HTTP layer translates IdentityVerificationException to a 400, so the status value in the message is the main signal for debugging).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — agreed this is the key fix. The error message is pre-existing and generic enough to cover this case (the HTTP 400 + status value in logs provides the debugging signal). Not changing it in this PR to keep scope tight.

[can EXCEPT ![req] = dup] >>
\* Rule 2: other is verified, this is not => this is duplicate
ELSE IF \E r \in oth : ver[r] = TRUE /\ ver[req] = FALSE THEN
LET dup == CHOOSE r \in oth : ver[r] = TRUE IN
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLA+ model doesn't capture Python's early-return for ACTIONED requests

In DupCheckStatus, there is no guard that prevents marking an ACTIONED req as a duplicate when another ACTIONED request exists for the same identity. The Python is_duplicate_request has this early return at line 383 of duplication_detection.py:

if request.status in ACTIONED_REQUEST_STATUSES:
    return False  # actioned requests are never duplicates

Because of this, DupCheckStatus (TLA+) and is_duplicate_request (Python) have divergent semantics for the requeue case: the model allows the watchdog to mark an ACTIONED request as duplicate, but the implementation cannot. The formal proof that AtMostOneActivePerIdentity holds with FixedRequeue = TRUE therefore does not directly validate the Python code path.

Either:

  • Add a guard to DupCheckStatus matching the Python early-return (so the model accurately reflects the implementation), or
  • Explicitly document this abstraction gap and explain why safety is maintained despite it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the spec: added the ACTIONED early-return guard to DupCheckStatus (matching Python line 383), removed the FixedRequeue toggle since the requeue dup-check was dead code (all requeued requests are ACTIONED → early-return), and simplified the watchdog to a no-op. TLC confirms: fixed config passes all invariants (3,592 states), buggy config correctly finds AtMostOneActivePerIdentity violation.

DuplicateConsistency
DuplicatePointsToLive

SPECIFICATION Spec
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Liveness property VerifiedRequestProgress is not checked

VerifiedRequestProgress is defined in the spec as a temporal (liveness) property but is absent from both configs' PROPERTIES section. Safety invariants alone don't rule out starvation — a verified request that never makes progress would satisfy all four invariants listed here.

If liveness checking was intentionally omitted (e.g., performance or symmetry-reduction concerns), a comment in the spec or the config explaining why would help future readers. If it was an oversight, add:

PROPERTIES
    VerifiedRequestProgress

to at least the _fixed.cfg.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added VerifiedRequestProgress to the fixed config. TLC confirms it holds (checked 2 branches of temporal properties across 3,592 distinct states).

@@ -0,0 +1,13 @@
CONSTANTS
NumRequests = 2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have: consider a 3-request config

With NumRequests = 2 and MaxVerifyAttempts = 1, TLC reports 3,592 states. A NumRequests = 3 variant would explore additional interleavings (e.g., three concurrent DSRs for the same identity: one actioned, one unverified, one just-verified). Three-way races are where subtle ordering bugs tend to hide and where the tiebreaker rules in DupCheckStatus are most exercised. This could be a separate _fixed_3req.cfg to keep CI times predictable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added DuplicateDetection_fixed_3req.cfg with NumRequests = 3. It immediately found a real bug: DuplicateConsistency violated in a 3-way race where a request already marked as duplicate could still run DupCheckStatus and mark others as duplicates of itself — creating a broken canonical chain.

Root cause: the spec was missing two Python early-returns from is_duplicate_request:

  1. ACTIONED requests return early (line 383) — already added
  2. Already-duplicate requests return early (line 398) — was missing

Also added re-parenting logic so that when a request is marked duplicate, any existing duplicates pointing to it get updated to point to the new canonical (mirrors the group-based model in Python).

TLC results after fixes:

  • _fixed.cfg (2-req): 3,592 states, all invariants + liveness pass
  • _fixed_3req.cfg (3-req): 95,144 states, all invariants + liveness pass
  • _buggy.cfg: correctly finds violation

The 3-request config was absolutely worth adding — 26x more states explored and it caught two real model gaps.

…thon

- Remove dead check_for_duplicates call in _handle_privacy_request_requeue
  (all requeued statuses are ACTIONED, so is_duplicate_request early-returns)
- Add ACTIONED and already-duplicate early-return guards to DupCheckStatus
  (matching Python lines 383 and 398 of duplication_detection.py)
- Add canonical re-parenting in DupCheckStatus to mirror group-based model
- Remove FixedRequeue constant (requeue dup-check was dead code)
- Add VerifiedRequestProgress liveness property to fixed config
- Add 3-request config (95,144 states, catches 3-way race bugs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JadeCara JadeCara enabled auto-merge March 31, 2026 14:59
@JadeCara JadeCara added this pull request to the merge queue Mar 31, 2026
Merged via the queue into main with commit 1d0b67d Mar 31, 2026
57 of 58 checks passed
@JadeCara JadeCara deleted the eng-2474/duplicate-detection-tla-spec branch March 31, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants