ENG-2474: Fix duplicate detection race conditions#7785
Conversation
TLA+ model checking revealed two bugs in duplicate detection: 1. Requeue watchdog (_handle_privacy_request_requeue) would blindly requeue requests without checking for duplicates first. A request marked duplicate by another process could be requeued and processed. Fix: call check_for_duplicates before requeuing; skip if duplicate. 2. verify_identity allowed transition from "duplicate" → "pending", meaning a request already marked as duplicate could be re-verified and escape back into the processing pipeline. Fix: only allow verification from "identity_unverified" status. Also includes the TLA+ spec (specs/DuplicateDetection.tla) with two configs: _buggy.cfg (both fixes OFF, AtMostOneActivePerIdentity violated) and _fixed.cfg (both fixes ON, all invariants pass). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review — ENG-2474: Fix duplicate detection race conditions
Good bug identification and solid use of TLA+ to reason about the race conditions. The verify_identity fix is correct and closes the key vulnerability cleanly. However, there are two issues that need attention before merge.
Critical (Must Fix)
1. check_for_duplicates in _handle_privacy_request_requeue is dead code
requeue_interrupted_tasks only fetches requests with status in [in_processing, approved, requires_input, pending_external] (request_service.py:587–593). All four of those statuses are members of ACTIONED_REQUEST_STATUSES. Inside is_duplicate_request, line 383 of duplication_detection.py short-circuits immediately:
if request.status in ACTIONED_REQUEST_STATUSES:
return False # actioned requests are never duplicatesSo check_for_duplicates will never mark the privacy request as duplicate, and the guard if privacy_request.status == PrivacyRequestStatus.duplicate on line 433 of request_service.py is unreachable. The requeue watchdog fix provides no actual protection.
The bug report scenario ("requires_input" and "unverified" visible simultaneously) is addressed entirely by the verify_identity change — the requeue code adds nothing. To genuinely prevent a stuck-but-now-superseded ACTIONED request from being requeued, the call site would need to bypass the early-return (e.g., query the duplicate group directly to check for a newer canonical request) or is_duplicate_request would need a separate entry point for this context.
2. No tests for the new code path
TestHandlePrivacyRequestRequeue has four test cases but none cover the new duplicate-check branch. Writing a meaningful test for this path will likely surface the no-op issue described above.
Suggestions
TLA+/Python model gap (see inline comment on DuplicateDetection.tla line 106): The DupCheckStatus operator in the spec allows the watchdog to mark an ACTIONED request as duplicate when two ACTIONED requests coexist. The Python implementation cannot do this (early return). The formal proof that AtMostOneActivePerIdentity holds therefore does not validate the Python requeue fix — it validates a slightly more powerful algorithm than what was implemented. This gap should either be documented or closed by adding the corresponding guard to the spec.
VerifiedRequestProgress liveness not checked: The temporal property is defined in the spec but missing from both configs' PROPERTIES section. Safety invariants alone don't rule out starvation. See inline comment on DuplicateDetection_fixed.cfg.
Nice to Have
Consider a NumRequests = 3 TLC config to exercise three-way interleavings, which is where the tiebreaker logic in DupCheckStatus gets the most thorough workout.
| "Privacy request {} is a duplicate, skipping requeue.", | ||
| privacy_request.id, | ||
| ) | ||
| return |
There was a problem hiding this comment.
Critical: check_for_duplicates here is always a no-op
requeue_interrupted_tasks only selects requests in [in_processing, approved, requires_input, pending_external] (lines 587–593 of this file). Every one of those statuses is in ACTIONED_REQUEST_STATUSES. Inside is_duplicate_request, there is an early return at line 383 of duplication_detection.py:
if request.status in ACTIONED_REQUEST_STATUSES:
return False # actioned requests are never duplicatesSo check_for_duplicates will always return without marking the request as duplicate, and the guard on line 433 (if privacy_request.status == PrivacyRequestStatus.duplicate) can never be True. The early-return block is effectively dead code.
The real protection against the race condition in the bug report comes entirely from the verify_identity change in privacy_request.py (removing duplicate from allowed starting statuses). This requeue guard does not add any protection.
To actually prevent a stuck-but-now-duplicate request from being requeued, is_duplicate_request would need to treat ACTIONED requests differently at this call site (e.g., skip the early return and let the "others are actioned" rule mark this one as duplicate), or an alternative check is needed (e.g., look up the duplicate group directly and check if a newer canonical request exists).
There was a problem hiding this comment.
Good catch — you're right that this is dead code. All statuses fetched by requeue_interrupted_tasks are in ACTIONED_REQUEST_STATUSES, so is_duplicate_request early-returns False every time. Removing the dead code. The verify_identity change is the substantive fix here.
| # ENG-2474: Run duplicate detection before requeuing. If the request is | ||
| # now a duplicate (e.g. another request for the same identity was verified | ||
| # while this one was stuck), mark it and skip the requeue entirely. | ||
| check_for_duplicates(db, privacy_request) |
There was a problem hiding this comment.
Missing test coverage for this new branch
TestHandlePrivacyRequestRequeue has four test cases but none of them exercise the new duplicate-detection path. There should be at least:
- A test that mocks
check_for_duplicatesto flipprivacy_request.statustoPrivacyRequestStatus.duplicateand asserts that_requeue_privacy_requestis not called and the info log is emitted. - A test confirming that when
check_for_duplicatesleaves the status unchanged, the existing requeue/cancel logic proceeds normally (to guard against unintended interactions).
Note that writing test (1) will likely expose the no-op issue described on line 438, because the only way to make the guard fire in a real call is to manually override the status — which is not what the code does.
There was a problem hiding this comment.
Resolved by removing the dead code (see previous comment). Since check_for_duplicates is a no-op for all requeued statuses, the branch was removed entirely — no new test needed.
| PrivacyRequestStatus.identity_unverified, | ||
| PrivacyRequestStatus.duplicate, | ||
| ]: | ||
| if self.status != PrivacyRequestStatus.identity_unverified: |
There was a problem hiding this comment.
This change looks correct and is the substantive fix for the identity-verification race. Removing duplicate from the allowed starting statuses closes the hole where a request that was already marked as a duplicate could be re-verified back to pending, bypassing the duplicate-detection decision entirely.
Minor nit: the existing error message still says "Invalid identity verification request" — worth keeping an eye on whether the message remains clear to callers (the HTTP layer translates IdentityVerificationException to a 400, so the status value in the message is the main signal for debugging).
There was a problem hiding this comment.
Thanks — agreed this is the key fix. The error message is pre-existing and generic enough to cover this case (the HTTP 400 + status value in logs provides the debugging signal). Not changing it in this PR to keep scope tight.
| [can EXCEPT ![req] = dup] >> | ||
| \* Rule 2: other is verified, this is not => this is duplicate | ||
| ELSE IF \E r \in oth : ver[r] = TRUE /\ ver[req] = FALSE THEN | ||
| LET dup == CHOOSE r \in oth : ver[r] = TRUE IN |
There was a problem hiding this comment.
TLA+ model doesn't capture Python's early-return for ACTIONED requests
In DupCheckStatus, there is no guard that prevents marking an ACTIONED req as a duplicate when another ACTIONED request exists for the same identity. The Python is_duplicate_request has this early return at line 383 of duplication_detection.py:
if request.status in ACTIONED_REQUEST_STATUSES:
return False # actioned requests are never duplicatesBecause of this, DupCheckStatus (TLA+) and is_duplicate_request (Python) have divergent semantics for the requeue case: the model allows the watchdog to mark an ACTIONED request as duplicate, but the implementation cannot. The formal proof that AtMostOneActivePerIdentity holds with FixedRequeue = TRUE therefore does not directly validate the Python code path.
Either:
- Add a guard to
DupCheckStatusmatching the Python early-return (so the model accurately reflects the implementation), or - Explicitly document this abstraction gap and explain why safety is maintained despite it.
There was a problem hiding this comment.
Fixed the spec: added the ACTIONED early-return guard to DupCheckStatus (matching Python line 383), removed the FixedRequeue toggle since the requeue dup-check was dead code (all requeued requests are ACTIONED → early-return), and simplified the watchdog to a no-op. TLC confirms: fixed config passes all invariants (3,592 states), buggy config correctly finds AtMostOneActivePerIdentity violation.
| DuplicateConsistency | ||
| DuplicatePointsToLive | ||
|
|
||
| SPECIFICATION Spec |
There was a problem hiding this comment.
Liveness property VerifiedRequestProgress is not checked
VerifiedRequestProgress is defined in the spec as a temporal (liveness) property but is absent from both configs' PROPERTIES section. Safety invariants alone don't rule out starvation — a verified request that never makes progress would satisfy all four invariants listed here.
If liveness checking was intentionally omitted (e.g., performance or symmetry-reduction concerns), a comment in the spec or the config explaining why would help future readers. If it was an oversight, add:
PROPERTIES
VerifiedRequestProgress
to at least the _fixed.cfg.
There was a problem hiding this comment.
Added VerifiedRequestProgress to the fixed config. TLC confirms it holds (checked 2 branches of temporal properties across 3,592 distinct states).
| @@ -0,0 +1,13 @@ | |||
| CONSTANTS | |||
| NumRequests = 2 | |||
There was a problem hiding this comment.
Nice to have: consider a 3-request config
With NumRequests = 2 and MaxVerifyAttempts = 1, TLC reports 3,592 states. A NumRequests = 3 variant would explore additional interleavings (e.g., three concurrent DSRs for the same identity: one actioned, one unverified, one just-verified). Three-way races are where subtle ordering bugs tend to hide and where the tiebreaker rules in DupCheckStatus are most exercised. This could be a separate _fixed_3req.cfg to keep CI times predictable.
There was a problem hiding this comment.
Added DuplicateDetection_fixed_3req.cfg with NumRequests = 3. It immediately found a real bug: DuplicateConsistency violated in a 3-way race where a request already marked as duplicate could still run DupCheckStatus and mark others as duplicates of itself — creating a broken canonical chain.
Root cause: the spec was missing two Python early-returns from is_duplicate_request:
- ACTIONED requests return early (line 383) — already added
- Already-duplicate requests return early (line 398) — was missing
Also added re-parenting logic so that when a request is marked duplicate, any existing duplicates pointing to it get updated to point to the new canonical (mirrors the group-based model in Python).
TLC results after fixes:
_fixed.cfg(2-req): 3,592 states, all invariants + liveness pass_fixed_3req.cfg(3-req): 95,144 states, all invariants + liveness pass_buggy.cfg: correctly finds violation
The 3-request config was absolutely worth adding — 26x more states explored and it caught two real model gaps.
…thon - Remove dead check_for_duplicates call in _handle_privacy_request_requeue (all requeued statuses are ACTIONED, so is_duplicate_request early-returns) - Add ACTIONED and already-duplicate early-return guards to DupCheckStatus (matching Python lines 383 and 398 of duplication_detection.py) - Add canonical re-parenting in DupCheckStatus to mirror group-based model - Remove FixedRequeue constant (requeue dup-check was dead code) - Add VerifiedRequestProgress liveness property to fixed config - Add 3-request config (95,144 states, catches 3-way race bugs) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ticket ENG-2474
Description Of Changes
TLA+ model checking of the duplicate detection flow revealed two race condition bugs that allow multiple requests for the same identity to be actively processed simultaneously.
Code Changes
request_service.py: Addedcheck_for_duplicatescall in_handle_privacy_request_requeuebefore requeuing — if the request is now a duplicate, skip the requeue entirelyprivacy_request.py: RemovedPrivacyRequestStatus.duplicatefrom allowed statuses inverify_identity— a request already marked as duplicate can no longer be re-verified back topendingspecs/DuplicateDetection.tla: TLA+ spec with two configs (_buggy.cfgviolatesAtMostOneActivePerIdentity,_fixed.cfgpasses all invariants across 3,592 states)Steps to Confirm
test_request_service.pyandtest_duplication_detection.py— all 67 tests pass_buggy.cfg(expect violation) and_fixed.cfg(expect pass)Pre-Merge Checklist
CHANGELOG.mdupdated🤖 Generated with Claude Code