[FEAT] Add POST /datasets/untag endpoint #263
Conversation
WalkthroughThis pull request introduces an untag feature for datasets. The changes include: a new database function 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Instead of fetching all tags and doing a case-insensitive membership check in Python before deleting, consider letting
untagreturn the number of affected rows (viarowcount) and derive the tag-not-found error from that to avoid an extra query and potential race conditions. - The case-insensitive tag comparison in
untag_datasetcurrently rebuilds a list on every request; you could simplify and make this more efficient by normalizingtagsonce to asetofcasefold()ed values or by normalizing input on insert and comparing directly.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Instead of fetching all tags and doing a case-insensitive membership check in Python before deleting, consider letting `untag` return the number of affected rows (via `rowcount`) and derive the tag-not-found error from that to avoid an extra query and potential race conditions.
- The case-insensitive tag comparison in `untag_dataset` currently rebuilds a list on every request; you could simplify and make this more efficient by normalizing `tags` once to a `set` of `casefold()`ed values or by normalizing input on insert and comparing directly.
## Individual Comments
### Comment 1
<location path="src/routers/openml/datasets.py" line_range="64-67" />
<code_context>
+ raise create_authentication_failed_error()
+
+ tags = database.datasets.get_tags_for(data_id, expdb_db)
+ if tag.casefold() not in [t.casefold() for t in tags]:
+ raise create_tag_not_found_error(data_id, tag)
+
+ database.datasets.untag(data_id, tag, connection=expdb_db)
+ return {
+ "data_untag": {"id": str(data_id)},
</code_context>
<issue_to_address>
**issue (bug_risk):** Case-insensitive existence check but case-sensitive delete can lead to silent no-op.
The membership check uses `casefold()` but `untag` receives the raw `tag`. If the stored tag differs only by case (e.g. "Foo" vs "foo"), the check can succeed while the `DELETE` affects 0 rows (depending on DB collation), so the endpoint reports success without untagging. Normalize the tag consistently for both check and delete, or enforce a canonical casing in storage so their behavior matches.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if tag.casefold() not in [t.casefold() for t in tags]: | ||
| raise create_tag_not_found_error(data_id, tag) | ||
|
|
||
| database.datasets.untag(data_id, tag, connection=expdb_db) |
There was a problem hiding this comment.
issue (bug_risk): Case-insensitive existence check but case-sensitive delete can lead to silent no-op.
The membership check uses casefold() but untag receives the raw tag. If the stored tag differs only by case (e.g. "Foo" vs "foo"), the check can succeed while the DELETE affects 0 rows (depending on DB collation), so the endpoint reports success without untagging. Normalize the tag consistently for both check and delete, or enforce a canonical casing in storage so their behavior matches.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 63-69: Validation currently compares tag case-insensitively using
tags = database.datasets.get_tags_for(data_id, expdb_db) but then calls
database.datasets.untag(data_id, tag, connection=expdb_db) with the raw input,
which can no-op on case-sensitive DBs; change the flow to find the canonical
stored tag from tags (e.g., pick the element t from tags where t.casefold() ==
tag.casefold()) and pass that canonical value to database.datasets.untag; keep
the same create_tag_not_found_error path when no match is found and return the
same payload using the data_id.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/database/datasets.pysrc/routers/openml/datasets.pytests/routers/openml/dataset_tag_test.py
| tags = database.datasets.get_tags_for(data_id, expdb_db) | ||
| if tag.casefold() not in [t.casefold() for t in tags]: | ||
| raise create_tag_not_found_error(data_id, tag) | ||
|
|
||
| database.datasets.untag(data_id, tag, connection=expdb_db) | ||
| return { | ||
| "data_untag": {"id": str(data_id)}, |
There was a problem hiding this comment.
Use the canonical stored tag for delete to avoid false-success on mixed-case input.
Validation is case-insensitive, but deletion uses the raw input. On a case-sensitive collation, this can return success without removing anything.
💡 Suggested fix
- tags = database.datasets.get_tags_for(data_id, expdb_db)
- if tag.casefold() not in [t.casefold() for t in tags]:
+ tags = database.datasets.get_tags_for(data_id, expdb_db)
+ matching_tag = next((existing for existing in tags if existing.casefold() == tag.casefold()), None)
+ if matching_tag is None:
raise create_tag_not_found_error(data_id, tag)
- database.datasets.untag(data_id, tag, connection=expdb_db)
+ database.datasets.untag(data_id, matching_tag, connection=expdb_db)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tags = database.datasets.get_tags_for(data_id, expdb_db) | |
| if tag.casefold() not in [t.casefold() for t in tags]: | |
| raise create_tag_not_found_error(data_id, tag) | |
| database.datasets.untag(data_id, tag, connection=expdb_db) | |
| return { | |
| "data_untag": {"id": str(data_id)}, | |
| tags = database.datasets.get_tags_for(data_id, expdb_db) | |
| matching_tag = next((existing for existing in tags if existing.casefold() == tag.casefold()), None) | |
| if matching_tag is None: | |
| raise create_tag_not_found_error(data_id, tag) | |
| database.datasets.untag(data_id, matching_tag, connection=expdb_db) | |
| return { | |
| "data_untag": {"id": str(data_id)}, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/routers/openml/datasets.py` around lines 63 - 69, Validation currently
compares tag case-insensitively using tags =
database.datasets.get_tags_for(data_id, expdb_db) but then calls
database.datasets.untag(data_id, tag, connection=expdb_db) with the raw input,
which can no-op on case-sensitive DBs; change the flow to find the canonical
stored tag from tags (e.g., pick the element t from tags where t.casefold() ==
tag.casefold()) and pass that canonical value to database.datasets.untag; keep
the same create_tag_not_found_error path when no match is found and return the
same payload using the data_id.
Summary
Implements the
POST /datasets/untagendpoint (issue #20), part of #6.Changes
src/database/datasets.py— addeduntag()function to delete a row fromdataset_tagsrc/routers/openml/datasets.py— addeduntag_datasetendpoint andcreate_tag_not_found_error()helper (error code474)tests/routers/openml/dataset_tag_test.py— added tests covering: unauthenticated requests, successful untag, tag-not-found error, and invalid tag validationBehaviour
103if missing)474if the tag is not present on the dataset{"data_untag": {"id": "<id>"}}on success