[ENH] V1 -> V2 Migration - Flows (module) #1609
Omswastik-11 wants to merge 249 commits into openml:main from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1609 +/- ##
==========================================
+ Coverage 53.09% 54.34% +1.25%
==========================================
Files 37 61 +24
Lines 4362 5073 +711
==========================================
+ Hits 2316 2757 +441
- Misses 2046 2316 +270
Pull request overview
Copilot reviewed 53 out of 54 changed files in this pull request and generated 11 comments.
    from typing import Any, Iterator
    from pathlib import Path
    import platform
    from urllib.parse import urlparse
urlparse is imported but never used in this test module. Please remove the unused import to avoid lint failures.
    # Example script which will appear in the upcoming OpenML-Python paper
    # This test ensures that the example will keep running!
-   with overwrite_config_context(
+   with openml.config.overwrite_config_context(  # noqa: F823
overwrite_config_context is referenced via openml.config and should be resolvable here, so the # noqa: F823 suppression looks incorrect/unnecessary. Please remove it (or use the correct code if there is an actual linter error to suppress).
    @mock.patch.object(requests.Session, "request")
    def test_delete_flow_not_owned(mock_request, test_files_directory, test_api_key):
        openml.config.start_using_configuration_for_example()
        content_file = test_files_directory / "mock_responses" / "flows" / "flow_delete_not_owned.xml"
start_using_configuration_for_example() mutates global configuration state; these tests never call stop_using_configuration_for_example(), which can leak state into later tests and cause order-dependent failures. Please wrap this in a context/fixture that guarantees stop_... runs (e.g., try/finally or a dedicated pytest fixture).
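A minimal sketch of the guaranteed-teardown pattern this comment asks for, using a stand-in config object (the real one would be `openml.config`); the same context-manager shape can serve as the body of a pytest fixture:

```python
from contextlib import contextmanager

@contextmanager
def example_configuration(config):
    # `config` stands in for openml.config; any object with these two methods works.
    config.start_using_configuration_for_example()
    try:
        yield config
    finally:
        # Runs even if the test body raises, so no state leaks into later tests.
        config.stop_using_configuration_for_example()

class _FakeConfig:
    """Records calls so we can verify teardown always happens."""
    def __init__(self) -> None:
        self.calls: list[str] = []
    def start_using_configuration_for_example(self) -> None:
        self.calls.append("start")
    def stop_using_configuration_for_example(self) -> None:
        self.calls.append("stop")

cfg = _FakeConfig()
try:
    with example_configuration(cfg):
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
print(cfg.calls)  # ['start', 'stop']
```

Even with the simulated failure inside the `with` block, the stop call still runs, which is exactly the order-independence property the reviewer wants.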
openml/_config.py (Outdated)
| "apikey": "normaluser", | ||
| }, | ||
| APIVersion.V2: { | ||
| "server": "http://localhost:8002/api/v1/xml/", |
There was a problem hiding this comment.
SERVERS_REGISTRY['local'][APIVersion.V2]['server'] points to a v1 XML base path (/api/v1/xml/). This looks inconsistent with the v2 clients (which call endpoints like flows/.../ under an /api/v2/ base). Please correct the local v2 base URL (and/or add a test to validate v2 server base URLs).
| "server": "http://localhost:8002/api/v1/xml/", | |
| "server": "http://localhost:8002/api/v2/", |
    if isinstance(id_value, (str, int)):
        return int(id_value)

    # Fallback: check all values for numeric/string IDs
    for v in root_value.values():
        if isinstance(v, (str, int)):
            return int(v)
In _extract_id_from_upload, the fallback loop attempts int(v) for the first string value it finds. This can raise ValueError on non-numeric strings like the XML namespace (e.g. "http://openml.org/openml"). Please restrict the fallback to numeric strings (or explicitly look for known keys) before casting to int.
Suggested change:
-   if isinstance(id_value, (str, int)):
-       return int(id_value)
-   # Fallback: check all values for numeric/string IDs
-   for v in root_value.values():
-       if isinstance(v, (str, int)):
-           return int(v)
+   if isinstance(id_value, int):
+       return id_value
+   if isinstance(id_value, str):
+       id_str = id_value.strip()
+       if id_str.isdigit():
+           return int(id_str)
+   # Fallback: check all values for numeric/string IDs
+   for v in root_value.values():
+       if isinstance(v, int):
+           return v
+       if isinstance(v, str):
+           v_str = v.strip()
+           if v_str.isdigit():
+               return int(v_str)
    self._config = replace(
        self._config,
        servers=config["servers"],
        api_version=config["api_version"],
        fallback_api_version=config["fallback_api_version"],
        show_progress=config["show_progress"],
        avoid_duplicate_runs=config["avoid_duplicate_runs"],
        retry_policy=config["retry_policy"],
        connection_n_retries=int(config["connection_n_retries"]),
    )
_setup() assigns api_version and fallback_api_version directly from the parsed config dict. If these values come from a config file/CLI they will be strings (e.g. "v2"), which will break later lookups like servers[self.api_version] (servers keys are APIVersion). Please coerce string values to APIVersion (and validate) when loading config, and consider similarly validating/normalizing servers.
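One way the string-to-enum coercion described here could look; `APIVersion` below is a stand-in enum (assumed string values `"v1"`/`"v2"`), not the actual openml definition:

```python
from enum import Enum

class APIVersion(Enum):
    # Stand-in for openml's APIVersion enum (assumed string values).
    V1 = "v1"
    V2 = "v2"

def coerce_api_version(value) -> APIVersion:
    """Accept an APIVersion member or its string form (e.g. 'v2' from a config file)."""
    if isinstance(value, APIVersion):
        return value
    try:
        # Normalize case/whitespace so config-file values like "V2 " still work.
        return APIVersion(str(value).strip().lower())
    except ValueError:
        valid = ", ".join(v.value for v in APIVersion)
        raise ValueError(f"Unknown API version {value!r}; expected one of: {valid}") from None

print(coerce_api_version("V2"))           # APIVersion.V2
print(coerce_api_version(APIVersion.V1))  # APIVersion.V1
```

Applied during `_setup()`, this keeps `servers[self.api_version]` lookups working whether the value came from an `APIVersion` member or a string in a config file or CLI flag.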
    return OpenMLFlow._from_dict(xmltodict.parse(flow_xml))
FlowV1API.get() does not detect v1-style error payloads (<oml:error>...) that are returned with HTTP 200. HTTPClient only validates by status code, so this method can end up passing an error dict into OpenMLFlow._from_dict() and failing with a confusing parsing error. Please add an <oml:error> check similar to exists() / list() and raise OpenMLServerException with the server-provided code/message.
Suggested change:
-   return OpenMLFlow._from_dict(xmltodict.parse(flow_xml))
+   result_dict = xmltodict.parse(flow_xml)
+   # Detect v1-style error payloads and raise a clear exception
+   if "oml:error" in result_dict:
+       err = result_dict["oml:error"]
+       code = int(err.get("oml:code", 0)) if "oml:code" in err else None
+       message = err.get("oml:message", "Server returned an error")
+       raise OpenMLServerException(message=message, code=code)
+   return OpenMLFlow._from_dict(result_dict)
    path_parts = parsed_url.path.strip("/").split("/")

    filtered_params = {k: v for k, v in params.items() if k != "api_key"}
    params_part = [urlencode(filtered_params)] if filtered_params else []
HTTPCache.get_key() uses urlencode(filtered_params) over the raw dict, which makes the cache key depend on the insertion order of params. This can cause avoidable cache misses for semantically identical requests. Please sort parameters (e.g., by key) before encoding to make cache keys stable.
Suggested change:
-   params_part = [urlencode(filtered_params)] if filtered_params else []
+   sorted_params = sorted(filtered_params.items())
+   params_part = [urlencode(sorted_params)] if sorted_params else []
    def push_tag(self, tag: str) -> None:
        """Annotates this flow with a tag on the server.

        Parameters
        ----------
        tag : str
            Tag to attach to the flow.
        """
        if self.flow_id is None:
            raise ValueError("Flow does not have an ID. Please publish the flow before tagging.")
        openml._backend.flow.tag(self.flow_id, tag)

    def remove_tag(self, tag: str) -> None:
        """Removes a tag from this flow on the server.

        Parameters
        ----------
        tag : str
            Tag to remove from the flow.
        """
        if self.flow_id is None:
            raise ValueError("Flow does not have an ID. Please publish the flow before untagging.")
        openml._backend.flow.untag(self.flow_id, tag)
OpenMLFlow already inherits push_tag / remove_tag from OpenMLBase. Re-defining them here creates duplicated API paths and potentially inconsistent behavior across resource types (some entities tag via openml.utils._tag_openml_base, flows via openml._backend). Consider removing these overrides and updating the shared implementation in OpenMLBase to use the backend for all resources instead.
    def dummy_task_v2(http_client_v2, minio_client) -> DummyTaskV1API:
        return DummyTaskV2API(http=http_client_v2, minio=minio_client)


    @pytest.fixture
    def dummy_task_fallback(dummy_task_v1, dummy_task_v2) -> DummyTaskV1API:
        return FallbackProxy(dummy_task_v2, dummy_task_v1)
The fixture return type annotations in this file look incorrect: dummy_task_v2 is annotated as DummyTaskV1API but returns DummyTaskV2API, and dummy_task_fallback is annotated as DummyTaskV1API but returns FallbackProxy. Please fix the annotations to match the actual returned objects to avoid type-checking confusion.
Co-authored-by: Armaghan Shakir <raoarmaghanshakir040@gmail.com>
Pull request overview
Copilot reviewed 57 out of 58 changed files in this pull request and generated 8 comments.
    @pytest.fixture
    def dummy_task_v2(http_client_v2, minio_client) -> DummyTaskV1API:
        return DummyTaskV2API(http=http_client_v2, minio=minio_client)


    @pytest.fixture
    def dummy_task_fallback(dummy_task_v1, dummy_task_v2) -> DummyTaskV1API:
        return FallbackProxy(dummy_task_v2, dummy_task_v1)
Fixture return type annotations are inconsistent: dummy_task_v2 is annotated as returning DummyTaskV1API but returns DummyTaskV2API, and dummy_task_fallback is annotated as DummyTaskV1API but returns a FallbackProxy. Fix the annotations to match the actual return types to avoid confusing readers and type checkers.
    import requests
    from openml.testing import SimpleImputer, TestBase, create_request_response
create_request_response is imported but unused in this file. Either remove the import or use it to build the mocked Response objects for consistency with other tests.
    OpenMLHashException
        If checksum verification fails.
    """
    url = urljoin(self.server, path)
urljoin(self.server, path) will generate incorrect URLs if self.server does not end with a trailing / (e.g. base .../api/v1/xml + task/1 becomes .../api/v1/task/1). Existing user config files and tests still set servers without the trailing slash, so requests can silently hit the wrong endpoint. Normalize server to always end with / (e.g. in config setup/server setter) or avoid urljoin here and manually ensure exactly one / separator.
Suggested change:
-   url = urljoin(self.server, path)
+   if path.startswith(("http://", "https://")):
+       url = path
+   else:
+       base = self.server.rstrip("/")
+       url = f"{base}/{path.lstrip('/')}"
    flow_xml = openml._backend.http_client.get(f"flow/{flow_id}").text
    flow_dict = xmltodict.parse(flow_xml)
openml.config.get_backend() does not exist (config is now an OpenMLConfigManager instance). This will raise an AttributeError and break the test. Use the new backend entrypoint (openml._backend.http_client.get(...) or openml._backend.flow.get(...)) instead of calling a non-existent config method.
    openml.config.server = "temp-server1"
    openml.config.apikey = "temp-apikey1"
    openml.config.get_servers(mode)["server"] = 'temp-server2'
    openml.config.get_servers(mode)["apikey"] = 'temp-server2'
get_servers() returns a dict keyed by APIVersion, but this test indexes it with string keys (["server"]/["apikey"]), which will raise KeyError. To test deepcopy/immutability, mutate openml.config.get_servers(mode)[api_version]["server"] (and apikey) instead.
    @pytest.mark.production_server()
    def test_switch_to_example_configuration(self):
        """Verifies the test configuration is loaded properly."""
        # Below is the default test key which would be used anyway, but just for clarity:
        openml.config.apikey = "any-api-key"
        openml.config.server = self.production_server
        openml.config.set_servers("production")

        openml.config.start_using_configuration_for_example()

        assert openml.config.apikey == TestBase.user_key
        assert openml.config.server == self.test_server
        openml.config.servers = openml.config.get_servers("test")

    @pytest.mark.production_server()
    def test_switch_from_example_configuration(self):
        """Verifies the previous configuration is loaded after stopping."""
        # Below is the default test key which would be used anyway, but just for clarity:
        openml.config.apikey = TestBase.user_key
        openml.config.server = self.production_server
        openml.config.set_servers("production")

        openml.config.start_using_configuration_for_example()
        openml.config.stop_using_configuration_for_example()

        assert openml.config.apikey == TestBase.user_key
        assert openml.config.server == self.production_server
        openml.config.servers = openml.config.get_servers("production")
The example-configuration tests no longer assert that start_using_configuration_for_example() actually switched to the test config (or that stopping restores the previous config). They currently just overwrite openml.config.servers manually, which defeats the purpose of the test and could hide regressions. Add assertions comparing openml.config.servers (and/or server/apikey) before/after start/stop instead of mutating the config inside the test.
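The before/after assertion pattern being requested can be sketched with a stand-in config object (hypothetical names and URLs; the real tests would exercise `openml.config` against the actual servers):

```python
class FakeConfig:
    """Stand-in for openml.config with just enough state to show the pattern."""
    def __init__(self) -> None:
        self.server = "https://production.example/api"
        self._saved = None

    def start_using_configuration_for_example(self) -> None:
        # Save the current config, then switch to the example/test config.
        self._saved = self.server
        self.server = "https://test.example/api"

    def stop_using_configuration_for_example(self) -> None:
        # Restore whatever was active before start_...() was called.
        self.server = self._saved

cfg = FakeConfig()
before = cfg.server

cfg.start_using_configuration_for_example()
assert cfg.server != before   # starting actually switched the config

cfg.stop_using_configuration_for_example()
assert cfg.server == before   # stopping restored the previous config
```

The key point is that the test captures the state *before* switching and asserts against it afterwards, instead of overwriting `openml.config.servers` by hand, which would mask a regression in the start/stop logic.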
    @@ -33,6 +35,7 @@
        utils,
    )
    from .__version__ import __version__
    from ._api import _backend
    from .datasets import OpenMLDataFeature, OpenMLDataset
    from .evaluations import OpenMLEvaluation
    from .flows import OpenMLFlow
    @@ -49,6 +52,11 @@
        OpenMLTask,
    )

    if TYPE_CHECKING:
        from ._config import OpenMLConfigManager

    config: OpenMLConfigManager = _config_module.__config
The package-level change makes openml.config an instance attribute and removes the openml/config.py module. This is a breaking change for users who do import openml.config or from openml.config import .... If backward compatibility is desired, consider reintroducing a thin openml/config.py shim that re-exports the config manager (and legacy symbols) so existing imports keep working.
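One shape such a compatibility shim could take is the `sys.modules` substitution trick: register the manager instance under the old module name so `import openml.config` keeps resolving. The demo below uses throwaway names (`mypkg_config`, a `SimpleNamespace` manager) rather than openml's real internals:

```python
import sys
import types

# Stand-in for the OpenMLConfigManager instance (hypothetical attributes).
manager = types.SimpleNamespace(
    server="https://example.invalid/api/v1/xml",
    apikey="dummy-key",
)

# sys.modules entries may be any object, not just module instances.
# Registering the manager under the legacy name means `import mypkg_config`
# binds the manager itself, so attribute access keeps working unchanged.
sys.modules["mypkg_config"] = manager

import mypkg_config  # resolves via sys.modules to the manager above

print(mypkg_config.apikey)  # dummy-key
```

In a real `openml/config.py` shim the module would do this with itself (assigning the manager into `sys.modules[__name__]`), so both `import openml.config` and `openml.config.<attr>` continue to work for existing user code.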
    from urllib.parse import urlparse
from urllib.parse import urlparse is imported but never used in this test module, which will fail linting in configurations that enforce unused-import checks. Remove the import or use it in the tests.
Pull request overview
Copilot reviewed 57 out of 58 changed files in this pull request and generated 9 comments.
    def publish(self, path: str | None = None, files: Mapping[str, Any] | None = None) -> int:  # type: ignore[override]  # noqa: ARG002
        self._not_supported(method="publish")
The publish method signature in FlowV2API has a # type: ignore[override] comment, but this is because the return type doesn't match the parent class ResourceV2API.publish. However, the parent ResourceV2API.publish raises OpenMLNotSupportedError, so it never actually returns. The return type annotation -> int is misleading since the method always raises an exception. Consider using -> NoReturn as the return type or removing the return type annotation entirely since the method only raises.
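In isolation, the `NoReturn` suggestion looks like this (class and exception names are simplified stand-ins for the ones in the PR):

```python
from typing import NoReturn

class OpenMLNotSupportedError(RuntimeError):
    """Stand-in for the PR's not-supported exception."""

class ResourceV2API:
    def _not_supported(self, method: str) -> NoReturn:
        raise OpenMLNotSupportedError(f"{method} is not supported by the v2 API")

    # Annotating with NoReturn (instead of -> int) tells type checkers the
    # method only raises, avoiding a misleading return type and the need for
    # a `type: ignore[override]` on the signature.
    def publish(self, path=None, files=None) -> NoReturn:
        self._not_supported(method="publish")

api = ResourceV2API()
try:
    api.publish()
except OpenMLNotSupportedError as exc:
    print(exc)  # publish is not supported by the v2 API
```

Mypy treats a `NoReturn`-annotated override as compatible with any parent return type, since a call site can never observe a returned value.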
    def list(
        self,
        limit: int | None = None,  # noqa: ARG002
        offset: int | None = None,  # noqa: ARG002
        tag: str | None = None,  # noqa: ARG002
        uploader: str | None = None,  # noqa: ARG002
    ) -> pd.DataFrame:
        self._not_supported(method="list")
Similar to the publish method, the list method in FlowV2API should have -> NoReturn as its return type instead of -> pd.DataFrame since it only calls self._not_supported() which never returns.
    def publish(self, path: str | None = None, files: Mapping[str, Any] | None = None) -> int:
        """Publish a flow on the OpenML server.

        Parameters
        ----------
        files : Mapping[str, Any] | None
            Files to upload (including description).

        Returns
        -------
        int
            The server-assigned flow id.
        """
        path = "flow"
        return super().publish(path, files)
The path parameter in the publish method is marked as optional (str | None), but on line 159 it's immediately overwritten with the hardcoded value "flow". This makes the parameter pointless. Either remove the path parameter from the method signature entirely, or remove the line that overwrites it if the parameter should actually be used.
    Whether to ignore the cache. If ``true`` this will download and overwrite the flow xml
    even if the requested flow is already cached.
The docstring states "If true this will download..." but the parameter is ignore_cache, not true. It should say "If True, this will download..." or "If set to True, this will download..." for consistency with Python boolean conventions.
    def publish(self, path: str, files: Mapping[str, Any] | None) -> int:  # noqa: ARG002
        self._not_supported(method="publish")

    def delete(self, resource_id: int) -> bool:  # noqa: ARG002
        self._not_supported(method="delete")

    def tag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
        self._not_supported(method="tag")

    def untag(self, resource_id: int, tag: str) -> list[str]:  # noqa: ARG002
        self._not_supported(method="untag")
The return type annotation for these methods in ResourceV2API is incorrect. Since these methods call self._not_supported() which has a return type of NoReturn, these methods should also have -> NoReturn as their return type instead of -> int, -> bool, or -> list[str]. The # noqa: ARG002 comment suggests awareness that the arguments are unused, but the return type should also reflect that these methods never return.
| "openml_logger", | ||
| "_examples", | ||
| "OPENML_CACHE_DIR_ENV_VAR", | ||
| "OPENML_SKIP_PARQUET_ENV_VAR", |
There was a problem hiding this comment.
The attribute OPENML_TEST_SERVER_ADMIN_KEY_ENV_VAR is missing from the allowed set in __setattr__ on line 166-177. This could cause issues if code tries to set this attribute. It should be added to the set that includes OPENML_CACHE_DIR_ENV_VAR and OPENML_SKIP_PARQUET_ENV_VAR.
| "OPENML_SKIP_PARQUET_ENV_VAR", | |
| "OPENML_SKIP_PARQUET_ENV_VAR", | |
| "OPENML_TEST_SERVER_ADMIN_KEY_ENV_VAR", |
    self._HEADERS: dict[str, str] = {"user-agent": f"openml-python/{__version__}"}

    self._config: OpenMLConfig = OpenMLConfig()
    # for legacy test `test_non_writable_home`
In line 147, there's a comment "# for legacy test test_non_writable_home" but the _defaults attribute is also used in other methods like set_field_in_config_file (line 461). The comment is misleading as it suggests this is only for one specific test when it may have broader usage. Consider updating the comment or verifying if _defaults is truly only needed for that one test.
Suggested change:
-   # for legacy test `test_non_writable_home`
+   # snapshot of default config (used for resets, e.g. in legacy tests like `test_non_writable_home`)
    def _mocked_perform_api_call(call, request_method):
-       url = openml.config.server + "/" + call
+       url = openml.config.server + call
The URL construction has a spacing issue. There's an extra space before the + operator which results in inconsistent concatenation. The line should be url = openml.config.server + call without the extra space before the + operator.
    TestBase._mark_entity_for_removal("flow", flow.flow_id, flow.name)
    TestBase.logger.info(f"collected from {__file__.split('/')[-1]}: {flow.flow_id}")
There's a trailing whitespace at the end of line 301. Please remove the whitespace after the comment to maintain code cleanliness.
Fixes #1601

Added a `Create` method in `FlowAPI` for publishing a flow, but it is not yet refactored together with the old `publish` (needs discussion on this).

Added tests using `fake_methods` so that we can test without a local V2 server. I have tested the `FlowsV2` methods (`get` and `exists`); `delete` and `list` were not implemented in the V2 server, so I skipped them.