Merged

45 commits
fda4b90
Add unit test for list of lists dataset upload
ArlindKadra Sep 21, 2018
2c7fd30
Fixing xml pattern typo
ArlindKadra Sep 21, 2018
ed33768
Fix pep8 no newline at the end of file
ArlindKadra Sep 21, 2018
a4ebfa7
Remove format from definitions
ArlindKadra Sep 21, 2018
ebd7113
Restoring format in dataset
ArlindKadra Sep 21, 2018
5d6053e
Fixing a couple of unused imports and fixings bugs with create_datase…
ArlindKadra Sep 21, 2018
fbc1f6b
Adapting unit tests to changes
ArlindKadra Sep 21, 2018
6a3ffb8
Fixing failing unit tests
ArlindKadra Sep 21, 2018
7dc9355
fixing typo
ArlindKadra Sep 21, 2018
7b0fdde
Enforce pep8 style guide, fix doc tutorial trying to invoke create_da…
ArlindKadra Sep 21, 2018
2d7b75c
Workaround for pep8 style guide
ArlindKadra Sep 21, 2018
2919dd6
fix long time typo
ArlindKadra Sep 21, 2018
1c4faff
update pep8 failing statement and bug fix for dataset upload tutorial
ArlindKadra Sep 22, 2018
3602739
fixed problem with arff file
ArlindKadra Sep 24, 2018
46cf1fa
Fix pep8 line too long
ArlindKadra Sep 24, 2018
693c368
Extending the unit test for dataset upload, changing upload tutorial
ArlindKadra Sep 25, 2018
e29cf4d
Workaround for the dataset upload unit test
ArlindKadra Sep 27, 2018
f0d8200
Adding example with weather dataset into the dataset upload tutorial
ArlindKadra Sep 28, 2018
be7791f
Fixing builds failure
ArlindKadra Oct 1, 2018
5011216
Adding support for sparse datasets, implementing corresponding unit t…
ArlindKadra Oct 1, 2018
005649a
fix bug
ArlindKadra Oct 1, 2018
b4103df
More unit tests and bug fix
ArlindKadra Oct 1, 2018
2e898ee
Fixing bugs
ArlindKadra Oct 1, 2018
43c6530
Fix bug and pep8 errors
ArlindKadra Oct 1, 2018
f45adbf
Enforcing pep8 and fixing changing the name of attribute format as it…
ArlindKadra Oct 1, 2018
cfd5767
Implementing change in a better way
ArlindKadra Oct 1, 2018
82c7173
Fixing bugs introduced by changing the format in the constructor
ArlindKadra Oct 1, 2018
61cd547
Another try to tackle the bugs
ArlindKadra Oct 1, 2018
4ec6b23
Small refactor
ArlindKadra Oct 3, 2018
45321d2
Fixing pep8 error
ArlindKadra Oct 3, 2018
654cbd0
Fix python2.7 bug
ArlindKadra Oct 4, 2018
714619f
making changes in accordance with Guillaume's suggestions
ArlindKadra Oct 8, 2018
e858689
Adding unit tests, small refactoring
ArlindKadra Oct 8, 2018
e711267
Enforcing pep8 style
ArlindKadra Oct 8, 2018
a3dbb9a
Following Matthias's suggestions
ArlindKadra Oct 15, 2018
4ae71be
Fixing bug introduced by variable name change
ArlindKadra Oct 15, 2018
f922654
Changing the breast_cancer dataset to diabetes, fixing typo with weat…
ArlindKadra Oct 15, 2018
e84c42d
Further changes
ArlindKadra Oct 15, 2018
0f653a3
Merge branch 'develop' into fix540
ArlindKadra Oct 15, 2018
1d7f8eb
Adding more changes
ArlindKadra Oct 16, 2018
82b4758
Fixing bug
ArlindKadra Oct 16, 2018
fc53ef6
Merge branch 'fix540' of https://github.com/openml/openml-python into…
ArlindKadra Oct 16, 2018
0edea31
Pep8 enforce
ArlindKadra Oct 16, 2018
751f8c9
few changes
ArlindKadra Oct 16, 2018
0c66cfc
Fixing typo in dataset name attributes
ArlindKadra Oct 17, 2018
5 changes: 5 additions & 0 deletions .travis.yml
@@ -25,6 +25,11 @@ env:
- DISTRIB="conda" PYTHON_VERSION="3.6" SKLEARN_VERSION="0.19.2"
- DISTRIB="conda" PYTHON_VERSION="3.6" SKLEARN_VERSION="0.18.2"

# Travis issue
# https://github.com/travis-ci/travis-ci/issues/8920
before_install:
Collaborator: Could you please add a comment here on why this is necessary as in #560?

Member Author: yes, I missed doing that.

- python -c "import fcntl; fcntl.fcntl(1, fcntl.F_SETFL, 0)"
Contributor: What is it useful for?

Member Author: A workaround, the travis-ci builds started failing not long ago.
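For context on the workaround being discussed: `fcntl.fcntl(1, fcntl.F_SETFL, 0)` clears the status flags on stdout, in particular `O_NONBLOCK`, so that long log writes block instead of failing mid-build. A minimal sketch of the effect (a slightly safer variant that clears only the non-blocking bit rather than zeroing all flags; Unix-only, since `fcntl` does not exist on Windows):

```python
import fcntl
import os

# Read the current status flags of file descriptor 1 (stdout).
flags = fcntl.fcntl(1, fcntl.F_GETFL)

# The Travis one-liner sets the flags to 0; the essential part is
# clearing os.O_NONBLOCK so writes to stdout block instead of
# raising EAGAIN when the pipe buffer is full.
fcntl.fcntl(1, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)

print(fcntl.fcntl(1, fcntl.F_GETFL) & os.O_NONBLOCK)  # 0
```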


install: source ci_scripts/install.sh
script: bash ci_scripts/test.sh
after_success: source ci_scripts/success.sh && source ci_scripts/create_doc.sh $TRAVIS_BRANCH "doc_result"
2 changes: 1 addition & 1 deletion ci_scripts/flake8_diff.sh
@@ -125,7 +125,7 @@ check_files() {
if [ -n "$files" ]; then
# Conservative approach: diff without context (--unified=0) so that code
# that was not changed does not create failures
-git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --diff --show-source $options
+git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --ignore E402 --diff --show-source $options
fi
}

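For reference, flake8's E402 ("module level import not at top of file") fires whenever any module-level statement precedes an import; the gallery-style tutorials in this PR interleave commentary sections and code with imports, which is presumably why the check is suppressed in the diff above. A minimal hypothetical file that would trigger it:

```python
"""A file that triggers flake8 E402."""

RELEASE = True  # any module-level statement before an import ...

import os  # ... makes flake8 flag this import line with E402

# The code still runs fine; E402 is purely a style check.
print(os.path.join("a", "b"))
```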
205 changes: 171 additions & 34 deletions examples/create_upload_tutorial.py
@@ -5,53 +5,87 @@
A tutorial on how to create and upload a dataset to OpenML.
"""
 import numpy as np
-import openml
 import sklearn.datasets
+from scipy.sparse import coo_matrix
+
+import openml
+from openml.datasets.functions import create_dataset

############################################################################
-# For this example we will upload to the test server to not pollute the live server with countless copies of the same dataset.
+# For this tutorial we will upload to the test server to not pollute the live
+# server with countless copies of the same dataset.
openml.config.server = 'https://test.openml.org/api/v1/xml'

############################################################################
-# Prepare the data
-# ^^^^^^^^^^^^^^^^
-# Load an example dataset from scikit-learn which we will upload to OpenML.org via the API.
-breast_cancer = sklearn.datasets.load_breast_cancer()
-name = 'BreastCancer(scikit-learn)'
-X = breast_cancer.data
-y = breast_cancer.target
-attribute_names = breast_cancer.feature_names
-targets = breast_cancer.target_names
-description = breast_cancer.DESCR
+# Below we will cover the following cases of the
+# dataset object:
+#
+# * A numpy array
+# * A list
+# * A sparse matrix

############################################################################
-# OpenML does not distinguish between the attributes and targets on the data level and stores all data in a
-# single matrix. The target feature is indicated as meta-data of the dataset (and tasks on that data).
+# Dataset is a numpy array
+# ========================
+# A numpy array can contain lists in the case of dense data
+# or it can contain OrderedDicts in the case of sparse data.
+#
+# Prepare dataset
+# ^^^^^^^^^^^^^^^
+# Load an example dataset from scikit-learn which we
+# will upload to OpenML.org via the API.

diabetes = sklearn.datasets.load_diabetes()
name = 'Diabetes(scikit-learn)'
X = diabetes.data
y = diabetes.target
attribute_names = diabetes.feature_names
description = diabetes.DESCR

############################################################################
# OpenML does not distinguish between the attributes and
# targets on the data level and stores all data in a single matrix.
#
# The target feature is indicated as meta-data of the
# dataset (and tasks on that data).

data = np.concatenate((X, y.reshape((-1, 1))), axis=1)
attribute_names = list(attribute_names)
attributes = [
(attribute_name, 'REAL') for attribute_name in attribute_names
-] + [('class', 'REAL')]
+] + [('class', 'INTEGER')]
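The concatenation above appends the target as the last column of a single matrix, which is the layout OpenML stores. This can be sanity-checked on a toy array (made-up shapes, not the diabetes data itself):

```python
import numpy as np

# Toy stand-ins for X with shape (n_samples, n_features) and y with
# shape (n_samples,), as returned by sklearn.datasets loaders.
X = np.arange(6.0).reshape(3, 2)
y = np.array([10.0, 20.0, 30.0])

# Reshape y into a column vector and append it as the last column,
# mirroring the tutorial's single-matrix layout.
data = np.concatenate((X, y.reshape((-1, 1))), axis=1)

print(data.shape)   # (3, 3)
print(data[:, -1])  # [10. 20. 30.]
```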
citation = (
"Bradley Efron, Trevor Hastie, Iain Johnstone and "
"Robert Tibshirani (2004) (Least Angle Regression) "
"Annals of Statistics (with discussion), 407-499"
)
paper_url = (
'http://web.stanford.edu/~hastie/Papers/'
'LARS/LeastAngle_2002.pdf'
)

############################################################################
# Create the dataset object
# ^^^^^^^^^^^^^^^^^^^^^^^^^
-# The definition of all fields can be found in the XSD files describing the expected format:
+# The definition of all fields can be found in the
+# XSD files describing the expected format:
#
# https://github.com/openml/OpenML/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd
-dataset = openml.datasets.functions.create_dataset(
+
+diabetes_dataset = create_dataset(
# The name of the dataset (needs to be unique).
# Must not be longer than 128 characters and only contain
# a-z, A-Z, 0-9 and the following special characters: _\-\.(),
name=name,
# Textual description of the dataset.
description=description,
# The person who created the dataset.
-creator='Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian',
+creator="Bradley Efron, Trevor Hastie, "
+"Iain Johnstone and Robert Tibshirani",
# People who contributed to the current version of the dataset.
contributor=None,
# The date the data was originally collected, given by the uploader.
-collection_date='01-11-1995',
+collection_date='09-01-2012',
# Language in which the data is represented.
# Starts with 1 upper case letter, rest lower case, e.g. 'English'.
language='English',
@@ -64,26 +98,129 @@
# Attributes that should be excluded in modelling, such as identifiers and indexes.
ignore_attribute=None,
# How to cite the paper.
-citation=(
-    "W.N. Street, W.H. Wolberg and O.L. Mangasarian. "
-    "Nuclear feature extraction for breast tumor diagnosis. "
-    "IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, "
-    "volume 1905, pages 861-870, San Jose, CA, 1993."
-),
+citation=citation,
# Attributes of the data
attributes=attributes,
data=data,
# Format of the dataset. Only 'arff' for now.
format='arff',
# A version label which is provided by the user.
version_label='test',
-original_data_url='https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)',
-paper_url='https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1905/0000/Nuclear-feature-extraction-for-breast-tumor-diagnosis/10.1117/12.148698.short?SSO=1'
+original_data_url=(
+    'http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html'
+),
+paper_url=paper_url,
)

############################################################################
-try:
-    upload_id = dataset.publish()
-    print('URL for dataset: %s/data/%d' % (openml.config.server, upload_id))
-except openml.exceptions.PyOpenMLError as err:
-    print("OpenML: {0}".format(err))
+upload_did = diabetes_dataset.publish()
+print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))

############################################################################
# Dataset is a list
# =================
# A list can contain lists in the case of dense data
# or it can contain OrderedDicts in the case of sparse data.
#
# Weather dataset:
# http://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html

data = [
['sunny', 85, 85, 'FALSE', 'no'],
['sunny', 80, 90, 'TRUE', 'no'],
['overcast', 83, 86, 'FALSE', 'yes'],
['rainy', 70, 96, 'FALSE', 'yes'],
['rainy', 68, 80, 'FALSE', 'yes'],
['rainy', 65, 70, 'TRUE', 'no'],
['overcast', 64, 65, 'TRUE', 'yes'],
['sunny', 72, 95, 'FALSE', 'no'],
['sunny', 69, 70, 'FALSE', 'yes'],
['rainy', 75, 80, 'FALSE', 'yes'],
['sunny', 75, 70, 'TRUE', 'yes'],
['overcast', 72, 90, 'TRUE', 'yes'],
['overcast', 81, 75, 'FALSE', 'yes'],
['rainy', 71, 91, 'TRUE', 'no'],
]

attribute_names = [
('outlook', ['sunny', 'overcast', 'rainy']),
('temperature', 'REAL'),
('humidity', 'REAL'),
('windy', ['TRUE', 'FALSE']),
('play', ['yes', 'no']),
]

description = (
'The weather problem is a tiny dataset that we will use repeatedly'
' to illustrate machine learning methods. Entirely fictitious, it '
'supposedly concerns the conditions that are suitable for playing '
'some unspecified game. In general, instances in a dataset are '
'characterized by the values of features, or attributes, that measure '
'different aspects of the instance. In this case there are four '
'attributes: outlook, temperature, humidity, and windy. '
'The outcome is whether to play or not.'
)

citation = (
'I. H. Witten, E. Frank, M. A. Hall, and ITPro, '
'Data mining practical machine learning tools and techniques, '
'third edition. Burlington, Mass.: Morgan Kaufmann Publishers, 2011'
)

weather_dataset = create_dataset(
name="Weather",
description=description,
creator='I. H. Witten, E. Frank, M. A. Hall, and ITPro',
contributor=None,
collection_date='01-01-2011',
language='English',
licence=None,
default_target_attribute='play',
row_id_attribute=None,
ignore_attribute=None,
citation=citation,
attributes=attribute_names,
data=data,
version_label='example',
)

############################################################################

upload_did = weather_dataset.publish()
print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))

############################################################################
# Dataset is a sparse matrix
# ==========================

sparse_data = coo_matrix((
[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
([0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1]),
))
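The COO triplets above are `(values, (row_indices, column_indices))`; they encode the XOR truth table with the target `y` as the last column. Densifying the matrix makes this visible (a quick check, assuming scipy is available):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Same (values, (rows, cols)) triplets as in the tutorial above.
sparse_data = coo_matrix((
    [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    ([0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1]),
))

# Densifying shows the columns input1, input2 and y = input1 XOR input2.
print(sparse_data.toarray())
# [[0. 0. 0.]
#  [0. 1. 1.]
#  [1. 0. 1.]
#  [1. 1. 0.]]
```

Note that the explicit `0.0` entry at position (0, 0) is what fixes the first row of the truth table; without it the all-zero row would carry no stored values at all.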

column_names = [
('input1', 'REAL'),
('input2', 'REAL'),
('y', 'REAL'),
]

xor_dataset = create_dataset(
name="XOR",
description='Dataset representing the XOR operation',
creator=None,
contributor=None,
collection_date=None,
language='English',
licence=None,
default_target_attribute='y',
row_id_attribute=None,
ignore_attribute=None,
citation=None,
attributes=column_names,
data=sparse_data,
version_label='example',
)

############################################################################

upload_did = xor_dataset.publish()
print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))
23 changes: 18 additions & 5 deletions openml/datasets/__init__.py
@@ -1,8 +1,21 @@
-from .functions import (list_datasets, check_datasets_active,
-                        get_datasets, get_dataset, status_update)
+from .functions import (
+    check_datasets_active,
+    create_dataset,
+    get_dataset,
+    get_datasets,
+    list_datasets,
+    status_update,
+)
from .dataset import OpenMLDataset
from .data_feature import OpenMLDataFeature

-__all__ = ['check_datasets_active', 'get_dataset', 'get_datasets',
-           'OpenMLDataset', 'OpenMLDataFeature', 'list_datasets',
-           'status_update']
+__all__ = [
+    'check_datasets_active',
+    'create_dataset',
+    'get_dataset',
+    'get_datasets',
+    'list_datasets',
+    'OpenMLDataset',
+    'OpenMLDataFeature',
+    'status_update',
+]