[WIP] Fix and improve dataset upload#440
Conversation
…he dataset functions module
mfeurer
left a comment
There was a problem hiding this comment.
Thanks a lot!
Could you please remove the function the publish function, too? I think that this function currently does not provide enough information to the user on how to use it and is therefore not really helpful. Also, besides for re-uploading of a fixed dataset, I don't see a reason to keep this. Maybe we should add a helper function to re-upload a fixed dataset?
openml/datasets/functions.py
Outdated
|
|
||
|
|
||
| def upload_dataset(dataset_description, file): | ||
| """Upload a dataset to OpenMl. |
There was a problem hiding this comment.
Please use an uppercase L in OpenML.
There was a problem hiding this comment.
Type, fixed on the next commit.
| dataset = OpenMLDataset( | ||
| name="anneal", version=1, description="test", | ||
| format="ARFF", licence="public", default_target_attribute="class", data_file=file_path) | ||
| "anneal", "test", "ARFF", |
There was a problem hiding this comment.
Is there a reason to do these changes?
There was a problem hiding this comment.
yes, I changed name, description and format to positional arguments as they were mandatory when uploading a dataset to the server. To reflect the changes on the dataset class, I made the changes above to the unit tests.
| dataset = OpenMLDataset( | ||
| name="UploadTestWithURL", version=1, description="test", | ||
| format="ARFF", | ||
| "UploadTestWithURL", "test", "ARFF", |
There was a problem hiding this comment.
Is there a reason to do these changes?
openml/datasets/dataset.py
Outdated
| xml_dataset += "<oml:{0}>{1}</oml:{0}>\n".format(prop, content) | ||
| xml_dataset += "</oml:data_set_description>" | ||
| return xml_dataset | ||
| #if isinstance(content, (list,set)): |
There was a problem hiding this comment.
Could you please remove those comments?
openml/datasets/functions.py
Outdated
| ---------- | ||
| dataset_description : OpenMLDataset | ||
| OpenMLDataset which contains the description of the dataset. | ||
| file : str |
There was a problem hiding this comment.
It would be great if this was an arff dictionary as expected by liac-arff.
There was a problem hiding this comment.
I changed the function as we discussed regarding the arff file.
openml/datasets/functions.py
Outdated
| Parameters | ||
| ---------- | ||
| dataset_description : OpenMLDataset | ||
| OpenMLDataset which contains the description of the dataset. |
There was a problem hiding this comment.
I'm not sure if this is a great idea, and I think besides for re-uploading a dataset this will be handy very often. Could you please change it so that the user needs to pass all relevant information separately?
There was a problem hiding this comment.
I do not agree on this one, I think we should pass an OpenMLDataset as it encapsulates all the necessary information and the user can decide what he wants to fill, aside from the mandatory fields. I also did not increase the number of function arguments, as some can be retrieved from the OpenMLDataset object.
There was a problem hiding this comment.
I see. Probably we should then move the logic of this function to publish and actually add a function called create_dataset. The user would then pass all arguments necessary to create the dataset object locally and could then upload it.
mfeurer
left a comment
There was a problem hiding this comment.
This looks great! Could you please make the requested changes? As I cannot approve this as I created the PR in the first place, please also approve and squash and merge it.
openml/datasets/functions.py
Outdated
| License of the data. | ||
| attributes: list | ||
| A list of tuples. Each tuple consists of the attribute name and type. | ||
| data : numpy.matrix |
There was a problem hiding this comment.
This is not correct, it should be array-like or sparse matrix, shape=(n_samples, n_features)
openml/datasets/functions.py
Outdated
| attributes: list | ||
| A list of tuples. Each tuple consists of the attribute name and type. | ||
| data : numpy.matrix | ||
| A matrix that contains both the attributes and targets. |
There was a problem hiding this comment.
Please use the term array instead of matrix because arrays are more common in python.
openml/datasets/functions.py
Outdated
|
|
||
| Returns | ||
| ------- | ||
| class:`openml.OpenMLDataset |
There was a problem hiding this comment.
This looks like a formatting error.
examples/Dataset_import.ipynb
Outdated
| ], | ||
| "source": [ | ||
| "%load_ext autoreload\n", | ||
| "%autoreload 2\n", |
There was a problem hiding this comment.
Could you please remove the autoreload extension and run the notebook again to remove the text?
Fixes/Implements #337, #331, #292, #117 and #113.