ENG-2737: RequestTask.status refactor #7680
adamsachs left a comment:
not a major concern from the BE, and this seems like it'll provide very useful functionality so i'm good moving forward with this.
but the update here has (i think) basically exposed some longstanding flaws in this endpoint that could start to degrade performance at scale, so i think we should at least get a follow-up queued up to refactor and remove those problems 👍
PrivacyRequest.execution_and_audit_logs_by_dataset = property(
    execution_and_audit_logs_by_dataset_name
)
PrivacyRequest.task_status_by_dataset = property(task_status_by_dataset_name)
so i think this effectively runs task_status_by_dataset_name for each privacy request record returned in the response, right? since task_status_by_dataset_name runs its own DB query, that's an n+1 query pattern, which is never great. it's true that the line above, with execution_and_audit_logs_by_dataset_name, already has a similar problem, so this has been here for a little while... we're just making it a bit worse: now it's a 2n + 1 query pattern :)
in general, overwriting the class-level property like we're doing here, e.g. PrivacyRequest.task_status_by_dataset, is strange and potentially not thread safe. it just does not seem like the right way to do this.
BUT again, none of that is new with your PR here, you're just following the existing pattern which has some (rather significant) flaws. it does exacerbate those flaws a bit. i don't think this is enough of a reason not to move forward with this, but i think we should probably make a note (and a followup ticket) to address the concerns that have been lurking here, because they will likely come back to bite us if we don't:
- not monkey-patch the PrivacyRequest class properties, to avoid thread-safety issues
- remove the n + 1 query pattern(s) from this endpoint
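The 2n + 1 shape can be seen with a toy counter (a minimal sketch; all names here are stand-ins, not the real fides model):

```python
# Toy model of the serialization path: every property access issues one
# "query" via the counter below.
query_count = 0

def fake_db_query():
    """Stand-in for the per-instance DB query a property getter runs."""
    global query_count
    query_count += 1
    return {}

class FakeRequest:
    # Class-level properties, mirroring the monkey-patched pattern.
    execution_and_audit_logs_by_dataset = property(lambda self: fake_db_query())
    task_status_by_dataset = property(lambda self: fake_db_query())

n = 50  # one page of privacy requests
page = [FakeRequest() for _ in range(n)]
query_count += 1  # the initial paginated fetch

# "Serialization" reads both properties on every instance:
for item in page:
    item.execution_and_audit_logs_by_dataset
    item.task_status_by_dataset

print(query_count)  # 1 + 2 * 50 = 101
```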
fwiw, here's the roundup from Claude as i dug into this a bit:
Is this N+1?
Yes, it is. Here's the execution flow:
- Line 488: paginate(query, params) executes the main query — loads N PrivacyRequest objects
- Lines 490-501: Loop over items, setting instance attributes (identity, resume_endpoint, etc.)
- Line 503: return paginated — FastAPI serializes using PrivacyRequestVerboseResponse
- During serialization: Pydantic reads instance.task_status_by_dataset, which triggers the property getter, which runs a DB query per instance
So for a page of N results: 1 + 2N queries (1 paginated fetch + N for execution_and_audit_logs_by_dataset + N for the new task_status_by_dataset). Classic N+1, and the PR adds a second N+1 on top of the existing one.
On the class-level property monkey-patching mechanism
This pattern is worth scrutinizing. What's happening:
PrivacyRequest.task_status_by_dataset = property(task_status_by_dataset_name)
This mutates the class itself, not individual instances. A few concerns:
Thread safety. If two concurrent requests hit this endpoint — one with verbose=True, one without — they race on setting the class property. One could overwrite the other's assignment between the property set and serialization. This is a pre-existing issue (same pattern for execution_and_audit_logs_by_dataset), but every new property added compounds the risk.
Invisible data flow. The DB query that populates this field fires lazily during Pydantic serialization, far from where the "data fetching" logically lives. You can't see it by reading the route handler — you have to know that the property descriptor triggers a query. This makes the session lifecycle hard to reason about (if the session closes before serialization, you get DetachedInstanceError).
The type annotation is misleading. On the model:
task_status_by_dataset: Optional[property] = None
The runtime type is actually Optional[Dict[str, str]] — property here is the Python descriptor type, not the return type. The annotation exists only to give Pydantic/SQLAlchemy a class-level default of None that the property descriptor later overwrites.
What a cleaner approach would look like. The loop at lines 490-501 already iterates over every item. You could eagerly compute and set instance attributes there:
if filters.verbose:
    for item in paginated.items:
        item.task_status_by_dataset = task_status_by_dataset_name(item)
This would be instance-level (no class mutation), explicit (visible data fetching in the route), and wouldn't change the query count (still N queries). It would also eliminate the thread-safety issue. The same could be done for execution_and_audit_logs_by_dataset.
Or, to actually fix the N+1, both fields could be batch-computed with a single query for all privacy request IDs in the page, then assigned per-instance. But that's a bigger refactor.
Bottom line: the PR correctly follows the existing pattern, so this isn't a criticism of the PR itself — but the pattern is fragile enough that it's worth flagging as tech debt, especially now that there are two
properties using it.
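The class-versus-instance distinction behind the thread-safety concern can be shown with a minimal, library-free sketch (the class and field names are stand-ins, not the real model):

```python
# Minimal stand-in class; no SQLAlchemy or fides code involved.
class FakePrivacyRequest:
    task_status_by_dataset = None  # class-level default, like the annotation

a = FakePrivacyRequest()
b = FakePrivacyRequest()

# Class-level assignment (the monkey-patch pattern): every instance,
# including ones belonging to other concurrent requests, sees the change.
FakePrivacyRequest.task_status_by_dataset = property(
    lambda self: {"dataset_a": "complete"}
)
print(a.task_status_by_dataset)  # {'dataset_a': 'complete'}
print(b.task_status_by_dataset)  # shared: also {'dataset_a': 'complete'}

# A second request resetting the class attribute changes what `a` sees too;
# this is the race between the property assignment and serialization.
FakePrivacyRequest.task_status_by_dataset = None
print(a.task_status_by_dataset)  # None

# Instance-level assignment is scoped to one object only:
a.task_status_by_dataset = {"dataset_a": "in_processing"}
print(a.task_status_by_dataset)  # {'dataset_a': 'in_processing'}
print(b.task_status_by_dataset)  # still None
```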
Yeah, I couldn't unsee the ugliness once I saw it, so I ended up fixing all this. Take a look and let me know what you think. I fixed both execution_and_audit_logs_by_dataset and task_status_by_dataset: now we batch into single queries and use eager instance-level assignment instead of class-level property descriptors.
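As a rough sketch of what batching plus eager per-instance assignment can look like (the row shape, IDs, and helper name here are hypothetical, not the actual fides schema), one query fetches rows for every privacy request ID in the page, and the result is grouped and then assigned per instance:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row shape for a single batched query over all page IDs:
# (privacy_request_id, dataset_name, task_status)
Row = Tuple[str, str, str]

def batch_task_status_by_dataset(rows: List[Row]) -> Dict[str, Dict[str, str]]:
    """Group one query's rows into {privacy_request_id: {dataset: status}}."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for request_id, dataset_name, status in rows:
        grouped[request_id][dataset_name] = status
    return dict(grouped)

# Simulated result set covering a whole page in one round trip:
rows = [
    ("pr_1", "dataset_a", "complete"),
    ("pr_1", "dataset_b", "error"),
    ("pr_2", "dataset_a", "in_processing"),
]
statuses = batch_task_status_by_dataset(rows)

# Eager, instance-level assignment in the route handler would then be roughly:
#     for item in paginated.items:
#         item.task_status_by_dataset = statuses.get(item.id, {})
print(statuses["pr_1"])  # {'dataset_a': 'complete', 'dataset_b': 'error'}
```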
adamsachs left a comment:
nice, your update makes sense/looks good to me from taking a quick look through it! thanks for making these updates.
i think it should be good to go. i'd recommend a bit of manual testing if you haven't already, just because now the changes are a bit broader 👍
I did test it manually, although my local environment doesn't have a lot of privacy requests with different task statuses etc, so... I don't know. Maybe we can also test this later in a test cloud env?
Ticket ENG-2737
Description Of Changes
The frontend was determining task status (error/polling/awaiting processing) by scanning execution logs returned in the verbose privacy request response. With long-lived polling tasks that produce 50+ execution logs, the 50-log limit could cause a completed task to still appear as "Awaiting Polling" in the UI.
This adds a task_status_by_dataset_name field to the verbose response, populated directly from RequestTask.status. The frontend now reads task status from this field instead of inferring it from execution log history.
Code Changes
- Added a task_status_by_dataset_name helper on the PrivacyRequest model — queries RequestTask rows and groups their status by dataset name
- Added task_status_by_dataset_name to the PrivacyRequestVerboseResponse schema
- Populated in _shared_privacy_request_search when verbose=True
- Updated the usePrivacyRequestEventLogs hook to use the new field for hasPolling, hasError, hasAwaitingProcessing instead of scanning execution logs
- Added task_status_by_dataset_name to the PrivacyRequestEntity TypeScript type
Steps to Confirm
- Verify the /api/v1/privacy-request?verbose=True response includes task_status_by_dataset_name with correct statuses per dataset
Pre-Merge Checklist
- CHANGELOG.md updated against main
- downgrade() migration is correct and works