
[FEAT] Add POST /datasets/untag endpoint #263

Open
ritoban23 wants to merge 1 commit into openml:main from ritoban23:feature/issue-20-dataset-untag

Conversation

@ritoban23

Summary

Implements the POST /datasets/untag endpoint (issue #20), part of #6.

Changes

  • src/database/datasets.py — added untag() function to delete a row from dataset_tag
  • src/routers/openml/datasets.py — added untag_dataset endpoint and create_tag_not_found_error() helper (error code 474)
  • tests/routers/openml/dataset_tag_test.py — added tests covering: unauthenticated requests, successful untag, tag-not-found error, and invalid tag validation

Behaviour

  • Requires authentication (error 103 if missing)
  • Returns 474 if the tag is not present on the dataset
  • Returns {"data_untag": {"id": "<id>"}} on success

@coderabbitai

coderabbitai bot commented Feb 28, 2026

Walkthrough

This pull request introduces an untag feature for datasets. The changes include: a new database function untag() in src/database/datasets.py that removes tags from datasets using a DELETE query; a new POST /untag API endpoint in src/routers/openml/datasets.py with authentication requirements, tag existence validation, and error handling; and comprehensive test coverage in tests/routers/openml/dataset_tag_test.py for authorization checks, successful untagging across user roles, tag validation, and error scenarios.

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 7.69%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately summarizes the main change, adding a new POST /datasets/untag endpoint, which matches the changeset's database function, router endpoint, and corresponding tests.
  • Description check ✅ Passed: The description is well-structured and directly tied to the changeset, detailing the three modified files, behavioural expectations including error codes, and authentication requirements.



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • Instead of fetching all tags and doing a case-insensitive membership check in Python before deleting, consider letting untag return the number of affected rows (via rowcount) and derive the tag-not-found error from that to avoid an extra query and potential race conditions.
  • The case-insensitive tag comparison in untag_dataset currently rebuilds a list on every request; you could simplify and make this more efficient by normalizing tags once to a set of casefold()ed values or by normalizing input on insert and comparing directly.
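The rowcount-based approach from the first bullet can be sketched as follows. This uses the standard library's sqlite3 with a case-insensitive column collation as a stand-in for the real expdb connection and the project's `untag` function; table layout and names are simplified for illustration.

```python
import sqlite3

# Stand-in for the expdb database. COLLATE NOCASE makes the DELETE
# itself case-insensitive, so the check and the delete cannot diverge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_tag (id INTEGER, tag TEXT COLLATE NOCASE)")
conn.execute("INSERT INTO dataset_tag VALUES (1, 'Foo')")

def untag(connection: sqlite3.Connection, data_id: int, tag: str) -> bool:
    """Delete the tag row; return False when nothing matched.

    No pre-check SELECT: 'tag not found' (error 474) is derived from
    the number of deleted rows, avoiding the extra query and the race.
    """
    cur = connection.execute(
        "DELETE FROM dataset_tag WHERE id = ? AND tag = ?", (data_id, tag)
    )
    return cur.rowcount > 0
```

Here `untag(conn, 1, "foo")` deletes the stored `'Foo'` row and returns True; a second call returns False, which the endpoint would translate into the 474 error.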
Prompt for AI Agents
Please address the comments from this code review:


## Individual Comments

### Comment 1
<location path="src/routers/openml/datasets.py" line_range="64-67" />
<code_context>
+        raise create_authentication_failed_error()
+
+    tags = database.datasets.get_tags_for(data_id, expdb_db)
+    if tag.casefold() not in [t.casefold() for t in tags]:
+        raise create_tag_not_found_error(data_id, tag)
+
+    database.datasets.untag(data_id, tag, connection=expdb_db)
+    return {
+        "data_untag": {"id": str(data_id)},
</code_context>
<issue_to_address>
**issue (bug_risk):** Case-insensitive existence check but case-sensitive delete can lead to silent no-op.

The membership check uses `casefold()` but `untag` receives the raw `tag`. If the stored tag differs only by case (e.g. "Foo" vs "foo"), the check can succeed while the `DELETE` affects 0 rows (depending on DB collation), so the endpoint reports success without untagging. Normalize the tag consistently for both check and delete, or enforce a canonical casing in storage so their behavior matches.
</issue_to_address>
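The mismatch described in the comment can be reproduced in isolation. The snippet below is a sketch with made-up data, not the project's code: the case-insensitive check passes while an exact match fails, and resolving the input to the canonical stored spelling keeps check and delete consistent.

```python
stored_tags = ["Foo", "bar"]   # hypothetical tags already on the dataset
tag = "foo"                    # raw user input

# The case-insensitive membership check passes...
found = tag.casefold() in [t.casefold() for t in stored_tags]

# ...but the raw input matches no stored value exactly, so an
# exact-match DELETE would silently remove zero rows.
exact = tag in stored_tags

# Fix: resolve the input to the canonical stored spelling and pass
# that to the delete instead of the raw input.
canonical = next(
    (t for t in stored_tags if t.casefold() == tag.casefold()), None
)
```

With these inputs, `found` is True, `exact` is False, and `canonical` is `"Foo"`, the value that should reach the DELETE.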



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 63-69: Validation currently compares tag case-insensitively using
tags = database.datasets.get_tags_for(data_id, expdb_db) but then calls
database.datasets.untag(data_id, tag, connection=expdb_db) with the raw input,
which can no-op on case-sensitive DBs; change the flow to find the canonical
stored tag from tags (e.g., pick the element t from tags where t.casefold() ==
tag.casefold()) and pass that canonical value to database.datasets.untag; keep
the same create_tag_not_found_error path when no match is found and return the
same payload using the data_id.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72989df and 5ae0a22.

📒 Files selected for processing (3)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • tests/routers/openml/dataset_tag_test.py

Comment on lines +63 to +69
tags = database.datasets.get_tags_for(data_id, expdb_db)
if tag.casefold() not in [t.casefold() for t in tags]:
raise create_tag_not_found_error(data_id, tag)

database.datasets.untag(data_id, tag, connection=expdb_db)
return {
"data_untag": {"id": str(data_id)},
⚠️ Potential issue | 🟠 Major

Use the canonical stored tag for delete to avoid false-success on mixed-case input.

Validation is case-insensitive, but deletion uses the raw input. On a case-sensitive collation, this can return success without removing anything.

💡 Suggested fix
-    tags = database.datasets.get_tags_for(data_id, expdb_db)
-    if tag.casefold() not in [t.casefold() for t in tags]:
+    tags = database.datasets.get_tags_for(data_id, expdb_db)
+    matching_tag = next((existing for existing in tags if existing.casefold() == tag.casefold()), None)
+    if matching_tag is None:
         raise create_tag_not_found_error(data_id, tag)
 
-    database.datasets.untag(data_id, tag, connection=expdb_db)
+    database.datasets.untag(data_id, matching_tag, connection=expdb_db)
