Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop     #352      +/-  ##
==========================================
- Coverage    89.69%   89.59%    -0.10%
==========================================
  Files           32       32
  Lines         2522     2788     +266
==========================================
+ Hits          2262     2498     +236
- Misses         260      290      +30
Continue to review full report at Codecov.
amueller
left a comment
looks good but it's not clear to me why we have to do any casting at all. Also, shouldn't the arff contain the info on what the data type is? (actually the numpy recarray still contained it)
openml/datasets/dataset.py
Outdated
    else:
        if isinstance(target, six.string_types):
            target = [target]
        legal_target_types = (int, float)
Is float float64? Why not float32? And why do we require casting at all?
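For reference on the question above: in CPython the built-in `float` is a C `double`, i.e. the same 64-bit IEEE 754 format NumPy calls `float64`. A quick stdlib check, independent of any openml code:

```python
import sys

# CPython's float is a 64-bit IEEE 754 double: 53-bit mantissa, 11-bit
# exponent -- exactly what NumPy exposes as float64. A float32 would
# only have a 24-bit mantissa.
print(sys.float_info.mant_dig)  # 53
print(sys.float_info.max_exp)   # 1024
```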
doc/usage.rst
Outdated
    >>> print(datasets[0].name)
    mfeat-factors
    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
I would say an ML experiment could also be a single run.
doc/usage.rst
Outdated
    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
doc/usage.rst
Outdated
    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
    this user guide we will go through listing and exploring existing tasks to
    actually running machine learning algorithms on them. In a further user guide
Maybe say "a run is flow + setup + task and produces metric and predictions"? Right now you don't explain "run", right? Maybe make the key concepts bold.
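A toy illustration of the wording suggested above ("a run is flow + setup + task and produces metric and predictions"). Every name below is hypothetical; it only mirrors the shape of the real openml-python objects, not their API:

```python
# Hypothetical sketch: a 'flow' is an algorithm, a 'setup' configures it,
# a 'task' fixes the data and split, and running them together yields
# predictions plus a metric.

class MajorityFlow:
    """Stand-in 'flow': predicts the most common training label."""
    def __init__(self, fallback):
        self.label = fallback              # the 'setup' (a hyperparameter)

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)

    def predict(self, X):
        return [self.label] * len(X)

def run_flow_on_task(flow_cls, setup, task):
    """flow + setup + task -> predictions and a metric (here: accuracy)."""
    model = flow_cls(**setup)
    train_X, train_y, test_X, test_y = task    # the task supplies the split
    model.fit(train_X, train_y)
    predictions = model.predict(test_X)
    accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
    return predictions, accuracy

task = ([[0], [1], [2]], ['a', 'a', 'b'], [[3], [4]], ['a', 'b'])
preds, acc = run_flow_on_task(MajorityFlow, {'fallback': 'a'}, task)
# preds == ['a', 'a'], acc == 0.5
```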
doc/usage.rst
Outdated
    Tasks are containers, defining how to split the dataset into a train and test
    set, whether to use several disjoint train and test splits (cross-validation)
    and whether this should be repeated several times. Also, the task defines a
    target metric for which a flow should be optimized. You can think of a task as
I would make the "You can" sentence the first sentence. I think it is more essential that the task defines which dataset to use, which column (if any) is the target, and whether it's a classification, regression, clustering etc. task.
    Just like datasets, tasks are identified by IDs and can be accessed in three
    different ways:
    Tasks are identified by IDs and can be accessed in two different ways:
Can we not filter by tags? Maybe I would say "you can explore tasks on the website or via list_tasks. You can get a single task with get_task". Because these two methods do semantically very different things.
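A hypothetical sketch of the two access patterns this comment contrasts: a listing call returns lightweight metadata for many tasks (optionally filtered, e.g. by tag), while a get call returns one fully populated task. The registry and data below are made up; only the call shapes mimic `openml.tasks.list_tasks` / `get_task`:

```python
# Made-up in-memory registry standing in for the OpenML server.
TASKS = {
    1: {'tid': 1, 'name': 'anneal-task', 'tag': 'study_1'},
    2: {'tid': 2, 'name': 'mfeat-task',  'tag': 'study_1'},
    3: {'tid': 3, 'name': 'other-task',  'tag': 'misc'},
}

def list_tasks(tag=None):
    """Bulk listing: metadata for every task, optionally filtered by tag."""
    return {tid: meta for tid, meta in TASKS.items()
            if tag is None or meta['tag'] == tag}

def get_task(task_id):
    """Single-object access: one task, fetched by its ID."""
    return TASKS[task_id]

listing = list_tasks(tag='study_1')   # many tasks, filtered
task = get_task(2)                    # exactly one task
```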
doc/usage.rst
Outdated
    @@ -293,71 +134,55 @@ Let's find out more about the datasets:

    Now we can restrict the tasks to all tasks with the desired resampling strategy:
filtering by CV strategy seems a bit unnatural to me. Can we do it by dataset?
doc/usage.rst
Outdated
    .. code:: python

        >>> tasks = openml.tasks.list_tasks(tag='study_1')
        >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
Or just move this up; this seems more natural than the CV type to me? Or motivate the CV type?
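The filtering pattern from the diff works on any per-task metadata. A self-contained sketch with made-up listing data (assuming pandas, which the `.query` call in the docs already implies), showing both the instance-count filter and a filter by dataset as suggested:

```python
import pandas as pd

# Made-up task listing; a real list_tasks result has columns like these.
tasks = pd.DataFrame([
    {'tid': 1, 'did': 61, 'name': 'iris',          'NumberOfInstances': 150},
    {'tid': 2, 'did': 12, 'name': 'mfeat-factors', 'NumberOfInstances': 2000},
    {'tid': 3, 'did': 37, 'name': 'diabetes',      'NumberOfInstances': 768},
])

# The range filter from the diff:
mid_sized = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
# Filtering by dataset instead, as the comment suggests:
by_dataset = tasks.query('name == "mfeat-factors"')
```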
doc/usage.rst
Outdated
    the concepts of flows and runs.
    In order to upload and share results of running a machine learning algorithm
    on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
    be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
This PR fixes: