Data Repositories
This is a list of public dataset repositories we aim to connect to in order to bring more varied datasets into OpenML. Their data formats vary widely, so we need both manual selection and automatic conversion or metadata extraction to make the datasets easily usable.
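As a rough illustration of what automatic metadata extraction could look like for a tabular source, here is a minimal sketch using only the Python standard library. It is not OpenML's actual pipeline; the function names `infer_type` and `extract_metadata`, the sample data, and the two-way numeric/nominal split are all assumptions made for the example.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values (hypothetical helper)."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)  # raises ValueError for non-numeric text
        return "numeric"
    except ValueError:
        return "nominal"


def extract_metadata(csv_text):
    """Return simple dataset-level and per-column metadata for CSV text.

    Assumes the first row is a header; empty cells count as missing values.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(header):
        values = [row[i] for row in data if i < len(row)]
        columns[name] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v.strip() == ""),
        }
    return {"n_rows": len(data), "n_columns": len(header), "columns": columns}


# Made-up sample data, purely for illustration.
sample = "age,city,income\n34,Leiden,51000\n,Eindhoven,47000.5\n29,Leiden,\n"
meta = extract_metadata(sample)
print(meta)
```

Real repositories will need more than this (date and identifier detection, encodings, non-CSV formats such as Excel files), which is why manual selection remains part of the process.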
- A collection of sources made by different users
- Machine learning dataset repositories (mostly already in OpenML)
  - UCI: https://archive.ics.uci.edu/ml/index.html
  - KEEL: http://sci2s.ugr.es/keel/datasets.php
  - LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
  - AutoWEKA datasets: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/
- Time series data:
- Causality related datasets:
- Deep learning datasets (mostly image data)
  - http://deeplearning.net/datasets/
  - https://deeplearning4j.org/opendata
  - http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
  - https://www.kaggle.com/crawford/emnist
- Extreme classification:
- MLData (will merge with OpenML in 2018)
  - http://mldata.org/
- AutoWEKA datasets:
- Kaggle public datasets
  - https://www.kaggle.com/datasets
- RAMP Challenge datasets
  - http://www.ramp.studio/data_domains
- Wolfram data repository
  - http://datarepository.wolframcloud.com/
- Data.world
  - https://data.world/
- Figshare (needs digging, lots of Excel files)
  - https://figshare.com/search?q=dataset&quick=1
- KDNuggets list of data sets (meta-list, lots of stuff here):
  - http://www.kdnuggets.com/datasets/index.html
- Benchmark Data Sets for Highly Imbalanced Binary Classification
- Feature Selection Challenge Datasets
  - http://www.nipsfsc.ecs.soton.ac.uk/datasets/
  - http://featureselection.asu.edu/datasets.php
- BigML's list of 1000+ data sources
- Massive list from Data Science Central.
- R packages (also see https://github.com/openml/openml-r/issues/185)
- UTwente Activity recognition datasets:
  - http://ps.ewi.utwente.nl/Datasets.php
- Vanderbilt:
  - http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets
- Quandl
  - https://www.quandl.com
- Microarray data:
  - http://genomics-pubs.princeton.edu/oncology/
  - http://svitsrv25.epfl.ch/R-doc/library/multtest/html/golub.html
- Medical data:
  - http://www.healthdata.gov/
  - http://homepages.inf.ed.ac.uk/rbf/IAPR/researchers/PPRPAGES/pprdat.htm
  - http://hcup-us.ahrq.gov/
  - https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html
  - https://nsduhweb.rti.org/respweb/homepage.cfm
  - http://orwh.od.nih.gov/resources/policyreports/womenofcolor.asp
- Nature.com Scientific data repositories list
  - https://www.nature.com/sdata/policies/repositories