Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop     #352      +/-  ##
==========================================
- Coverage    89.69%   89.59%    -0.10%
==========================================
  Files           32       32
  Lines         2522     2788     +266
==========================================
+ Hits          2262     2498     +236
- Misses         260      290      +30
Continue to review full report at Codecov.
amueller
left a comment
looks good but it's not clear to me why we have to do any casting at all. Also, shouldn't the arff contain the info on what the data type is? (actually the numpy recarray still contained it)
openml/datasets/dataset.py
Outdated
    else:
        if isinstance(target, six.string_types):
            target = [target]
        legal_target_types = (int, float)
Is float float64? Why not float32? And why do we require casting at all?
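For reference on the question above: in CPython the built-in `float` is a C `double`, i.e. the same 64-bit IEEE 754 format NumPy calls `float64`. A quick stdlib check, independent of any openml code:

```python
import sys

# CPython's float is a 64-bit IEEE 754 double: 53-bit mantissa, 11-bit
# exponent -- exactly what NumPy exposes as float64. A float32 would
# only have a 24-bit mantissa.
print(sys.float_info.mant_dig)  # 53
print(sys.float_info.max_exp)   # 1024
```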
doc/usage.rst
Outdated
    >>> print(datasets[0].name)
    mfeat-factors
    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
I would say an ML experiment could also be a single run.
doc/usage.rst
Outdated
    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
doc/usage.rst
Outdated
    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
    this user guide we will go through listing and exploring existing tasks to
    actually running machine learning algorithms on them. In a further user guide
Maybe say "a run is flow + setup + task and produces metric and predictions"? Right now you don't explain "run", right? Maybe make the key concepts bold.
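A toy illustration of the wording suggested above ("a run is flow + setup + task and produces metric and predictions"). Every name below is hypothetical; it only mirrors the shape of the real openml-python objects, not their API:

```python
# Hypothetical sketch: a 'flow' is an algorithm, a 'setup' configures it,
# a 'task' fixes the data and split, and running them together yields
# predictions plus a metric.

class MajorityFlow:
    """Stand-in 'flow': predicts the most common training label."""
    def __init__(self, fallback):
        self.label = fallback              # the 'setup' (a hyperparameter)

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)

    def predict(self, X):
        return [self.label] * len(X)

def run_flow_on_task(flow_cls, setup, task):
    """flow + setup + task -> predictions and a metric (here: accuracy)."""
    model = flow_cls(**setup)
    train_X, train_y, test_X, test_y = task    # the task supplies the split
    model.fit(train_X, train_y)
    predictions = model.predict(test_X)
    accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
    return predictions, accuracy

task = ([[0], [1], [2]], ['a', 'a', 'b'], [[3], [4]], ['a', 'b'])
preds, acc = run_flow_on_task(MajorityFlow, {'fallback': 'a'}, task)
# preds == ['a', 'a'], acc == 0.5
```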
doc/usage.rst
Outdated
    Tasks are containers, defining how to split the dataset into a train and test
    set, whether to use several disjoint train and test splits (cross-validation)
    and whether this should be repeated several times. Also, the task defines a
    target metric for which a flow should be optimized. You can think of a task as
I would make the "You can" sentence the first sentence. I think it is more essential that the task defines which dataset to use, which column (if any) is the target, and whether it's a classification, regression, clustering etc. task.
    Just like datasets, tasks are identified by IDs and can be accessed in three
    different ways:
    Tasks are identified by IDs and can be accessed in two different ways:
Can we not filter by tags? Maybe I would say "you can explore tasks on the website or via list_tasks. You can get a single task with get_task". Because these two methods do semantically very different things.
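A hypothetical sketch of the two access patterns this comment contrasts: a listing call returns lightweight metadata for many tasks (optionally filtered, e.g. by tag), while a get call returns one fully populated task. The registry and data below are made up; only the call shapes mimic `openml.tasks.list_tasks` / `get_task`:

```python
# Made-up in-memory registry standing in for the OpenML server.
TASKS = {
    1: {'tid': 1, 'name': 'anneal-task', 'tag': 'study_1'},
    2: {'tid': 2, 'name': 'mfeat-task',  'tag': 'study_1'},
    3: {'tid': 3, 'name': 'other-task',  'tag': 'misc'},
}

def list_tasks(tag=None):
    """Bulk listing: metadata for every task, optionally filtered by tag."""
    return {tid: meta for tid, meta in TASKS.items()
            if tag is None or meta['tag'] == tag}

def get_task(task_id):
    """Single-object access: one task, fetched by its ID."""
    return TASKS[task_id]

listing = list_tasks(tag='study_1')   # many tasks, filtered
task = get_task(2)                    # exactly one task
```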
doc/usage.rst
Outdated
    @@ -293,71 +134,55 @@ Let's find out more about the datasets:

    Now we can restrict the tasks to all tasks with the desired resampling strategy:
filtering by CV strategy seems a bit unnatural to me. Can we do it by dataset?
doc/usage.rst
Outdated
    .. code:: python

        >>> tasks = openml.tasks.list_tasks(tag='study_1')
        >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
Or just move this up; this seems more natural than the CV type to me? Or motivate the CV type?
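The filtering pattern from the diff works on any per-task metadata. A self-contained sketch with made-up listing data (assuming pandas, which the `.query` call in the docs already implies), showing both the instance-count filter and a filter by dataset as suggested:

```python
import pandas as pd

# Made-up task listing; a real list_tasks result has columns like these.
tasks = pd.DataFrame([
    {'tid': 1, 'did': 61, 'name': 'iris',          'NumberOfInstances': 150},
    {'tid': 2, 'did': 12, 'name': 'mfeat-factors', 'NumberOfInstances': 2000},
    {'tid': 3, 'did': 37, 'name': 'diabetes',      'NumberOfInstances': 768},
])

# The range filter from the diff:
mid_sized = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
# Filtering by dataset instead, as the comment suggests:
by_dataset = tasks.query('name == "mfeat-factors"')
```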
doc/usage.rst
Outdated
    the concepts of flows and runs.
    In order to upload and share results of running a machine learning algorithm
    on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
    be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
This PR fixes: