Scikit Learn User Guide 0.12
Release 0.12-git
scikit-learn developers
June 04, 2012
CONTENTS
1 User Guide
  1.1 Installing scikit-learn
  1.2 Tutorials: From the bottom up with scikit-learn
  1.3 Supervised learning
  1.4 Unsupervised learning
  1.5 Model Selection
  1.6 Dataset transformations
  1.7 Dataset loading utilities
  1.8 Reference
2 Example Gallery
  2.1 Examples
3 Development
  3.1 Contributing
  3.2 How to optimize for speed
  3.3 Utilities for Developers
  3.4 Developers' Tips for Debugging
  3.5 About us
  3.6 Support
  3.7 0.12
  3.8 0.11
  3.9 0.10
  3.10 0.9
  3.11 0.8
  3.12 0.7
  3.13 0.6
  3.14 0.5
  3.15 0.4
  3.16 Presentations and Tutorials on Scikit-Learn
Bibliography
Python Module Index
Index
scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit scientific Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems, accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.
License: Open source, commercially usable: BSD license (3 clause)
Documentation for scikit-learn version 0.12-git. For other versions and printable format, see Documentation resources.
CHAPTER
ONE
USER GUIDE
1.1 Installing scikit-learn
There are different ways to get scikit-learn installed:
Install the version of scikit-learn provided by your operating system distribution. This is the quickest option for those who have operating systems that distribute scikit-learn.
Install an official release. This is the best approach for users who want a stable version number and aren't concerned about running a slightly older version of scikit-learn.
Install the latest development version. This is best for users who want the latest-and-greatest features and aren't afraid of running brand-new code.
Note: If you wish to contribute to the project, it's recommended you install the latest development version.
1.1.1 Installing an official release
Installing from source
Installing from source requires you to have installed python (>= 2.6), numpy (>= 1.3), scipy (>= 0.7), setuptools,
python development headers and a working C++ compiler. Under Debian-based systems you can get all this by
executing with root privileges:
sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++
Note: In order to build the documentation and run the example code contained in this documentation you will need
matplotlib:
sudo apt-get install python-matplotlib
Note: On Ubuntu LTS (10.04) the package libatlas-dev is called libatlas-headers
Easy install
This is usually the fastest way to install the latest stable release. If you have pip or easy_install, you can install or
update with the command:
pip install -U scikit-learn
or:
easy_install -U scikit-learn
for easy_install. Note that you might need root privileges to run these commands.
From source package
Download the package from http://pypi.python.org/pypi/scikit-learn/, unpack the sources and cd into the archive directory.
This package uses distutils, which is the default way of installing python modules. The install command is:
python setup.py install
Windows installer
You can download a windows installer from the downloads section of the project's web page. Note that you must also have installed
the packages numpy and setuptools.
This package is also expected to work with python(x,y) as of 2.6.5.5.
Installing on Windows 64bit
To install a 64bit version of the scikit, you can download the binaries from
http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn Note that this will require a compatible version
of numpy, scipy and matplotlib. The easiest option is to also download them from the same URL.
Building on windows
To build scikit-learn on windows you will need a C/C++ compiler in addition to numpy, scipy and setuptools. At least
MinGW (a port of GCC to Windows OS) and the Microsoft Visual C++ 2008 should work out of the box. To force the
use of a particular compiler, write a file named setup.cfg in the source directory with the content:
[build_ext]
compiler=my_compiler
[build]
compiler=my_compiler
where my_compiler should be one of mingw32 or msvc.
When the appropriate compiler has been set, and assuming Python is in your PATH (see Python FAQ for windows for
more details), installation is done by executing the command:
python setup.py install
To build a precompiled package like the ones distributed at the downloads section, the command to execute is:
python setup.py bdist_wininst -b doc/logos/scikit-learn-logo.bmp
This will create an installable binary under directory dist/.
1.1.2 Third party distributions of scikit-learn
Some third-party distributions are now providing versions of scikit-learn integrated with their package-management
systems.
These can make installation and upgrading much easier for users since the integration includes the ability to automat-
ically install dependencies (numpy, scipy) that scikit-learn requires.
The following is a list of Linux distributions that provide their own version of scikit-learn:
Debian and derivatives (Ubuntu)
The Debian package is named python-sklearn (formerly python-scikits-learn) and can be installed using the following
commands with root privileges:
apt-get install python-sklearn
Additionally, backport builds of the most recent release of scikit-learn for existing releases of Debian and Ubuntu are
available from NeuroDebian repository .
Python(x, y)
Python(x, y) distributes scikit-learn as an additional plugin, which can be found in the Additional plugins page.
Enthought Python distribution
The Enthought Python Distribution already ships a recent version.
Macports
The macports package is named py26-sklearn or py27-sklearn depending on the version of Python. It can be installed
by typing the following command:
sudo port install py26-scikits-learn
or:
sudo port install py27-scikits-learn
depending on the version of Python you want to use.
NetBSD
scikit-learn is available via pkgsrc-wip:
http://pkgsrc.se/wip/py-scikit_learn
1.1.3 Bleeding Edge
See section Retrieving the latest code on how to get the development version.
1.1.4 Testing
Testing requires having the nose library. After installation, the package can be tested by executing from outside the
source directory:
nosetests sklearn --exe
This should give you a lot of output (and some warnings) but should eventually finish with text similar to:
Ran 601 tests in 27.920s
OK (SKIP=2)
Otherwise, please consider posting an issue to the bug tracker or to the Mailing List.
Note: Alternative testing method
If for some reason the recommended method is failing for you, please try the alternate method:
python -c "import sklearn; sklearn.test()"
This method might display doctest failures because of nosetests issues.
scikit-learn can also be tested without having the package installed. For this you must compile the sources in place
from the source directory:
python setup.py build_ext --inplace
Tests can now be run using nosetests:
nosetests sklearn/
This is automated in the commands:
make in
and:
make test
1.2 Tutorials: From the bottom up with scikit-learn
Quick start
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a
simple learning example.
1.2.1 An Introduction to machine learning with scikit-learn
Section contents
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a
simple learning example.
Machine learning: the problem setting
In general, a learning problem considers a set of n samples of data and tries to predict properties of unknown data. If
each sample is more than a single number, for instance a multi-dimensional entry (aka multivariate data), it is said
to have several attributes, or features.
We can separate learning problems in a few large categories:
supervised learning, in which the data comes with additional attributes that we want to predict (Click here to go
to the Scikit-Learn supervised learning page). This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled data how
to predict the class of unlabeled data. An example of a classification problem would be digit recognition,
in which the aim is to assign each input vector to one of a finite number of discrete categories.
regression: if the desired output consists of one or more continuous variables, then the task is called
regression. An example of a regression problem would be the prediction of the length of a salmon as a
function of its age and weight.
unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding
target values. The goal in such problems may be to discover groups of similar examples within the data, where
it is called clustering, or to determine the distribution of data within the input space, known as density estimation,
or to project the data from a high-dimensional space down to two or three dimensions for the purpose of
visualization (Click here to go to the Scikit-Learn unsupervised learning page).
Training set and testing set
Machine learning is about learning some properties of a data set and applying them to new data. This is why a
common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets: one that we
call the training set, on which we learn data properties, and one that we call the testing set, on which we test these
properties.
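A minimal sketch of such a split, using the train_test_split helper from sklearn.cross_validation (the keyword controlling the held-out fraction is test_fraction in this release; later versions rename it to test_size):
>>> import numpy as np
>>> from sklearn.cross_validation import train_test_split
>>> X, y = np.arange(20).reshape((10, 2)), np.arange(10)
>>> # hold out roughly 25% of the samples as a testing set
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_fraction=0.25)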
Loading an example dataset
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the boston
house prices dataset for regression:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in
the .data member, which is an (n_samples, n_features) array. In the case of a supervised problem, one or more
response variables are stored in the .target member. More details on the different datasets can be found in the dedicated
section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify
the digits samples:
>>> print digits.data
[[ 0. 0. 5. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 10. 0. 0.]
[ 0. 0. 0. ..., 16. 9. 0.]
...,
[ 0. 0. 1. ..., 6. 0. 0.]
[ 0. 0. 2. ..., 12. 0. 0.]
[ 0. 0. 10. ..., 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to each digit image that
we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape.
In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
The simple example on this dataset illustrates how, starting from the original problem, one can shape the data for
consumption in scikit-learn.
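As a minimal sketch of that reshaping step, flattening each 8x8 image gives the expected 2D layout:
>>> digits.images.shape
(1797, 8, 8)
>>> data = digits.images.reshape((len(digits.images), -1))
>>> data.shape
(1797, 64)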
Learning and Predicting
In the case of the digits dataset, the task is to predict the value of a hand-written digit from an image. We are given
samples of each of the 10 possible classes on which we fit an estimator to be able to predict the labels corresponding
to new data.
In scikit-learn, an estimator is just a plain Python class that implements the methods fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC that implements Support Vector Classification. The con-
structor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider the
estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
Choosing the parameters of the model
In this example we set the value of gamma manually. It is possible to automatically find good values for the
parameters by using tools such as grid search and cross validation.
We call our estimator instance clf, as it is a classifier. It now must be fitted to the data, that is, it must learn from the
data. This is done by passing our training set to the fit method. As a training set, let us use all the images of our
dataset apart from the last one:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.001, kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False)
Now you can predict new values. In particular, we can ask the classifier what the digit of our last image in the
digits dataset is, which we have not used to train the classifier:
>>> clf.predict(digits.data[-1])
array([ 8.])
The corresponding image is the following:
As you can see, it is a challenging task: the images are of poor resolution. Do you agree with the classifier?
A complete example of this classification problem is available as an example that you can run and study: Recognizing
hand-written digits.
Model persistence
It is possible to save a model in the scikit by using Python's built-in persistence model, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.25,
kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0])
array([ 0.])
>>> y[0]
0
In the specific case of the scikit, it may be more interesting to use joblib's replacement of pickle (joblib.dump &
joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')
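A model persisted this way can later be restored with joblib.load (a minimal sketch; the file name above is just an example):
>>> clf = joblib.load('filename.pkl')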
Statistical-learning Tutorial
This tutorial covers some of the models and tools available to do data-processing with Scikit Learn and how to
learn from your data.
1.2.2 A tutorial on statistical-learning for scientific data processing
Statistical learning
Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences
are facing is rapidly growing. Problems it tackles range from building a prediction function linking different
observations, to classifying observations, or learning the structure in an unlabeled dataset.
This tutorial will explore statistical learning, that is the use of machine learning techniques with the goal of
statistical inference: drawing conclusions on the data at hand.
sklearn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific
Python packages (numpy, scipy, matplotlib).
Warning: In scikit-learn release 0.9, the import path has changed from scikits.learn to sklearn. To import with
cross-version compatibility, use:
try:
from sklearn import something
except ImportError:
from scikits.learn import something
Statistical learning: the setting and the estimator object in the scikit-learn
Datasets
The scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They
can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples
axis, while the second is the features axis.
A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as
detailed in iris.DESCR.
When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed to be used by the scikit.
An example of reshaping data: the digits dataset
The digits dataset is made of 1797 8x8 images of hand-written digits
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import pylab as pl
>>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)
<matplotlib.image.AxesImage object at ...>
To use this dataset with the scikit, we transform each 8x8 image in a feature vector of length 64
>>> data = digits.images.reshape((digits.images.shape[0], -1))
Estimator objects
Fitting data: The core object of scikit-learn is the estimator object. All estimator objects expose a fit method that
takes a dataset (2D array):
>>> estimator.fit(data)
Estimator parameters: All the parameters of an estimator can be set when it is instantiated, or by modifying the
corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the
estimated parameters are attributes of the estimator object ending with an underscore:
>>> estimator.estimated_param_
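A minimal concrete sketch of these conventions, using the iris data and the KMeans estimator (any other estimator behaves the same way):
>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> k_means = cluster.KMeans(n_clusters=3)    # parameter set at instantiation
>>> k_means.n_clusters
3
>>> k_means = k_means.fit(iris.data)          # fit takes a (n_samples, n_features) array
>>> k_means.cluster_centers_.shape            # estimated parameters end with an underscore
(3, 4)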
Supervised learning: predicting an output variable from high-dimensional observations
The problem solved in supervised learning
Supervised learning consists in learning the link between two datasets: the observed data X, and an external
variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length
n_samples.
All supervised estimators in the scikit-learn implement a fit(X, y) method to fit the model, and a predict(X)
method that, given unlabeled observations X, returns the predicted labels y.
Vocabulary: classication and regression
If the prediction task is to classify the observations into a set of finite labels, in other words to name the objects
observed, the task is said to be a classification task. Conversely, if the goal is to predict a continuous target
variable, it is said to be a regression task.
In scikit-learn, for classification tasks, y is a vector of integers.
Note: See the Introduction to machine learning with Scikit-learn Tutorial for a quick run-through on the basic
machine learning vocabulary used within Scikit-learn.
Nearest neighbor and the curse of dimensionality
Classifying irises:
The iris dataset is a classification task consisting in identifying 3
different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> np.unique(iris_y)
array([0, 1, 2])
k-Nearest neighbors classifier The simplest possible classifier is the nearest neighbor: given a new observation
X_test, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature
vector. (Please see the Nearest Neighbors section of the online Scikit-learn documentation for more information about
this type of classifier.)
Training set and testing set
When experimenting with learning algorithms, it is important not to test the prediction of an estimator on the data
used to fit the estimator, as this would not be evaluating the performance of the estimator on new data. This is
why datasets are often split into train and test data.
KNN (k-nearest neighbors) classification example:
>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
warn_on_equidistant=True, weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
The curse of dimensionality For an estimator to be effective, you need the distance between neighboring points to
be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points.
In the context of the above KNN example, if the data is only described by one feature, with values ranging from 0 to
1 and with n training observations, new data will thus be no further away than 1/n. Therefore, the nearest neighbor
decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Let's say that we require 10 points in one dimension:
Now 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training
points required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective KNN estimator in a paltry p~20 di-
mensions would require more training data than the current estimated size of the entire internet! (1000 Exabytes or
so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
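A hedged back-of-the-envelope sketch of this growth (assuming, as above, a spacing of d = 0.1 and 8 bytes per stored value):
# Points needed to pave [0, 1]^p at a spacing of d = 0.1, and the raw
# storage that many p-dimensional samples would take at 8 bytes per value.
d = 0.1
for p in (1, 2, 10, 20):
    n = (1. / d) ** p
    print "p=%2d: %g samples, %g bytes" % (p, n, n * p * 8)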
Linear model: from regression to sparsity
Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442
patients, and an indication of disease progression after one year:
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
The task at hand is to predict disease progression from physiological variables.
Linear regression LinearRegression, in its simplest form, fits a linear model to the data set by adjusting
a set of parameters, in order to make the sum of the squared residuals of the model as small as possible.
Linear models: y = Xβ + ε
X: data
y: target variable
β: coefficients
ε: observation noise
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> print regr.coef_
[ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937
492.81458798 102.84845219 184.60648906 743.51961675 76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
Shrinkage If there are few data points per dimension, noise in the observations induces high variance:
>>> X = np.c_[ .5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[ 0, 2].T
>>> regr = linear_model.LinearRegression()
>>> import pylab as pl
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
... this_X = .1 * np.random.normal(size=(2, 1)) + X
... regr.fit(this_X, y)
... pl.plot(test, regr.predict(test))
... pl.scatter(this_X, y, s=3)
A solution, in high-dimensional statistical learning, is to shrink the regression coefficients to zero: any
two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:
>>> regr = linear_model.Ridge(alpha=.1)
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
... this_X = .1 * np.random.normal(size=(2, 1)) + X
... regr.fit(this_X, y)
... pl.plot(test, regr.predict(test))
... pl.scatter(this_X, y, s=3)
This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower
the variance.
We can choose alpha to minimize left-out error, this time using the diabetes dataset rather than our synthetic data:
>>> alphas = np.logspace(-4, -1, 6)
>>> print [regr.set_params(alpha=alpha
... ).fit(diabetes_X_train, diabetes_y_train,
... ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas]
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
Note: Capturing noise in the fitted parameters that prevents the model from generalizing to new data is called overfitting.
The bias introduced by the ridge regression is called a regularization.
Sparsity Fitting only features 1 and 2
Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions, and one of
the target variable). It is hard to develop an intuition on such representation, but it may be useful to keep in mind that
it would be a fairly empty space.
We can see that although feature 2 has a strong coefficient on the full model, it conveys little information on y when
considered with feature 1.
To improve the conditioning of the problem (i.e. mitigating the curse of dimensionality), it would be interesting to
select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease
their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and
selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be
seen as an application of Occam's razor: prefer simpler models.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
... ).fit(diabetes_X_train, diabetes_y_train
... ).score(diabetes_X_test, diabetes_y_test)
... for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute='auto',
tol=0.0001, warm_start=False)
>>> print regr.coef_
[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982 -0.
-187.19554705 69.38229038 508.66011217 71.84239008]
Different algorithms for a same problem
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso object in
scikit-learn solves the lasso regression problem using a coordinate descent method, which is efficient on large datasets.
However, scikit-learn also provides the LassoLars object, using the LARS algorithm, which is very efficient for
problems in which the estimated weight vector is very sparse, that is, problems with very few observations.
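A minimal sketch of swapping in the LARS-based estimator on the same diabetes data (alpha=.1 is an arbitrary illustrative value, not a tuned one):
>>> lars = linear_model.LassoLars(alpha=.1)
>>> lars = lars.fit(diabetes_X_train, diabetes_y_train)  # same fit/score API as Lasso
>>> lars.coef_.shape                                     # one coefficient per feature
(10,)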
Classification For classification, as in the labeling iris task, linear
regression is not the right approach, as it will give too much weight to data far from the decision frontier. A linear
approach is to fit a sigmoid function, or logistic function:
y = sigmoid(Xβ - offset) + ε = 1 / (1 + exp(-Xβ + offset)) + ε
>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, penalty='l2',
tol=0.0001)
This is known as LogisticRegression.
Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting
heuristic for the final decision.
Shrinkage and sparsity with logistic regression
The C parameter controls the amount of regularization in the LogisticRegression object: a large value for
C results in less regularization. penalty='l2' gives Shrinkage (i.e. non-sparse coefficients), while penalty='l1'
gives Sparsity.
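A minimal sketch of switching the penalty on the iris training data used above (the exact coefficient values are unimportant; the point is that the l1 penalty tends to drive some of them to exactly zero):
>>> logistic_l1 = linear_model.LogisticRegression(C=1., penalty='l1')
>>> logistic_l1 = logistic_l1.fit(iris_X_train, iris_y_train)
>>> logistic_l1.coef_.shape    # one coefficient vector per class (one-versus-all)
(3, 4)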
Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test
prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
Solution: ../../auto_examples/exercises/plot_digits_classification_exercise.py
Support vector machines (SVMs)
Linear SVMs Support Vector Machines belong to the discriminant model family: they try to find a combination of
samples to build a plane maximizing the margin between the two classes. Regularization is set by the C parameter:
a small value for C means the margin is calculated using many or all of the observations around the separating line
(more regularization); a large value for C means the margin is calculated on observations close to the separating line
(less regularization).
(Figures: unregularized SVM; regularized SVM, the default.)
SVMs can be used in regression SVR (Support Vector
Regression), or in classication SVC (Support Vector Classication).
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='linear', probability=False, shrinking=True, tol=0.001,
verbose=False)
Warning: Normalizing data
For many estimators, including the SVMs, having datasets with unit standard deviation for each feature is important
to get good prediction.
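A minimal sketch of such a normalization before fitting, using the preprocessing.scale helper, which centers each feature and scales it to unit variance (a transformer class with the same effect also exists; its exact name varies between releases):
>>> from sklearn import preprocessing
>>> iris_X_train_scaled = preprocessing.scale(iris_X_train)  # zero mean, unit variance per feature
>>> svc = svc.fit(iris_X_train_scaled, iris_y_train)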
Using kernels Classes are not always linearly separable in feature space. The solution is to build a decision function
that is not linear but that may be for instance polynomial. This is done using the kernel trick, which can be seen as
creating a decision energy by positioning kernels on observations:
Linear kernel:
>>> svc = svm.SVC(kernel='linear')
Polynomial kernel:
>>> svc = svm.SVC(kernel='poly', degree=3)
>>> # degree: polynomial degree
RBF kernel (Radial Basis Function):
>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of radial kernel
Interactive example
See the SVM GUI to download svm_gui.py; add data points of both classes with the right and left buttons, fit the
model and change parameters and data.
Exercise
Try classifying classes 1 and 2 from the iris dataset with SVMs, using the first 2 features. Leave out 10% of each
class and test prediction performance on these observations.
Warning: the classes are ordered, so do not leave out the last 10%; you would be testing on only one class.
Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
Solution: ../../auto_examples/exercises/plot_iris_exercise.py
Model selection: choosing estimators and their parameters
Score, and cross-validated scores
As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on
new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
To get a better measure of prediction accuracy (which we can use as a proxy for goodness of t of the model), we can
successively split the data in folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
... # We use list to copy, in order to pop later on
... X_train = list(X_folds)
... X_test = X_train.pop(k)
... X_train = np.concatenate(X_train)
... y_train = list(y_folds)
... y_test = y_train.pop(k)
... y_train = np.concatenate(y_train)
... scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print scores
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
This is called a KFold cross-validation.
Cross-validation generators
The code above to split data into train and test sets is tedious to write. scikit-learn exposes cross-validation generators
to generate lists of indices for this purpose:
>>> from sklearn import cross_validation
>>> k_fold = cross_validation.KFold(n=6, k=3, indices=True)
>>> for train_indices, test_indices in k_fold:
... print 'Train: %s | test: %s' % (train_indices, test_indices)
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
The cross-validation can then be implemented easily:
>>> kfold = cross_validation.KFold(len(X_digits), k=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
... for train, test in kfold]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
To compute the score method of an estimator, scikit-learn exposes a helper function:
>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
Cross-validation generators
KFold(n, k)                  Splits the data into K folds, trains on K-1 of them, tests on the left-out fold
StratifiedKFold(y, k)        Same as KFold, but makes sure that all classes are evenly represented across the folds
LeaveOneOut(n)               Leaves one observation out
LeaveOneLabelOut(labels)     Takes a label array to group observations
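A minimal sketch showing that the other generators are used exactly like KFold above, yielding train/test splits (argument names follow the table; indices=True asks for index arrays rather than boolean masks):
>>> y_small = np.array([0, 0, 0, 1, 1, 1])
>>> skf = cross_validation.StratifiedKFold(y_small, k=3, indices=True)
>>> loo = cross_validation.LeaveOneOut(n=6, indices=True)
>>> len(list(skf)), len(list(loo))   # number of train/test splits generated
(3, 6)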
Exercise
On the digits dataset, plot the cross-validation score of a SVC estimator with an RBF kernel as a function of
parameter C (use a logarithmic grid of points, from 1 to 10).
from sklearn import cross_validation, datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC()
C_s = np.logspace(1, 10, 10)
scores = list()
scores_std = list()
Solution: ../../auto_examples/exercises/plot_cv_digits.py
Grid-search and cross-validated estimators
Grid-search scikit-learn provides an object that, given data, computes the score during the fit of an estimator on
a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator
during construction and exposes an estimator API:
>>> from sklearn.grid_search import GridSearchCV
>>> gammas = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.988991985997974
>>> clf.best_estimator_.gamma
9.9999999999999995e-07
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.94228356336260977
By default the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather
than a regressor, it uses a stratified 3-fold.
Nested cross-validation
>>> cross_validation.cross_val_score(clf, X_digits, y_digits)
array([ 0.97996661, 0.98163606, 0.98330551])
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma, the
other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are
unbiased estimates of the prediction score on new data.
Warning: You cannot nest objects with parallel computing (n_jobs different from 1).
Cross-validated estimators Cross-validation to set a parameter can be done more efficiently on an algorithm-by-
algorithm basis. This is why, for certain estimators, scikit-learn exposes cross-validation estimators (see Cross-Validation:
evaluating estimator performance) that set their parameter automatically by cross-validation:
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> diabetes = datasets.load_diabetes()
>>> X_diabetes = diabetes.data
>>> y_diabetes = diabetes.target
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=array([ 2.14804, 2.00327, ..., 0.0023 , 0.00215]),
copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000,
n_alphas=100, normalize=False, precompute='auto', tol=0.0001,
verbose=False)
>>> # The estimator chose its lambda automatically:
>>> lasso.alpha
0.01318...
These estimators are called similarly to their counterparts, with CV appended to their name.
Exercise
On the diabetes dataset, find the optimal regularization parameter alpha.
Bonus: How much can you trust the selection of alpha?
import numpy as np
import pylab as pl
from sklearn import cross_validation, datasets, linear_model
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
lasso = linear_model.Lasso()
alphas = np.logspace(-4, -1, 20)
Solution: ../../auto_examples/exercises/plot_cv_diabetes.py
Unsupervised learning: seeking representations of the data
Clustering: grouping observations together
The problem solved in clustering
Given the iris dataset, if we knew that there were 3 types of iris, but did not have access to a taxonomist to label
them, we could try a clustering task: split the observations into well-separated groups called clusters.
K-means clustering Note that there exist many different clustering criteria and associated algorithms. The
simplest clustering algorithm is K-means.
>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print k_means.labels_[::10]
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print y_iris[::10]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
Warning: There is absolutely no guarantee of recovering the ground truth. First, choosing the right number of
clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although in the
sklearn package we employ several tricks to mitigate this issue.
(Figures: bad initialization; 8 clusters; ground truth.)
Don't over-interpret clustering results.
Application example: vector quantization
Clustering in general, and KMeans in particular, can be seen as a way of choosing a small number of exemplars
to compress the information, a problem sometimes known as vector quantization. For instance, this can be used
to posterize an image:
>>> import scipy as sp
>>> try:
... lena = sp.lena()
... except AttributeError:
... from scipy import misc
... lena = misc.lena()
>>> X = lena.reshape((-1, 1)) # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X)
KMeans(copy_x=True, init='k-means++', ...
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> lena_compressed = np.choose(labels, values)
>>> lena_compressed.shape = lena.shape
(Figures: raw image; K-means quantization; equal bins; image histogram.)
Hierarchical agglomerative clustering: Ward A Hierarchical clustering method is a type of cluster analysis that
aims to build a hierarchy of clusters. In general, the various approaches of this technique are either:
Agglomerative - bottom-up approaches, or
Divisive - top-down approaches.
For estimating a large number of clusters, top-down approaches are both statistically ill-posed and slow, since they
start with all observations as one cluster, which they split recursively. Agglomerative hierarchical clustering is a
bottom-up approach that successively merges observations together and is particularly useful when the clusters of
interest are made of only a few observations. Ward clustering minimizes a criterion similar to k-means in a bottom-up
approach. When the number of clusters is large, it is much more computationally efficient than k-means.
Connectivity-constrained clustering With Ward clustering, it is possible to specify which samples can be clustered
together by giving a connectivity graph. Graphs in the scikit are represented by their adjacency matrix. Often
a sparse matrix is used. This can be useful, for instance, to retrieve connected regions when clustering an image:
import time
import numpy as np
import scipy as sp
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import Ward
###############################################################################
# Generate data
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))
###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)
###############################################################################
# Compute clustering
print "Compute structured hierarchical clustering..."
st = time.time()
n_clusters = 15  # number of regions
ward = Ward(n_clusters=n_clusters, connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print "Elapsed time: ", time.time() - st
print "Number of pixels: ", label.size
print "Number of clusters: ", np.unique(label).size
Feature agglomeration We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. the
insufficient number of observations compared to the number of features. Another approach is to merge together similar
features: feature agglomeration. This approach can be implemented by clustering in the feature direction, in other
words clustering the transposed data.
>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)
>>> agglo = cluster.WardAgglomeration(connectivity=connectivity,
... n_clusters=32)
>>> agglo.fit(X)
WardAgglomeration(connectivity=...
>>> X_reduced = agglo.transform(X)
>>> X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)
transform and inverse_transform methods
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.
Decompositions: from a signal to components and loadings
Components and loadings
If X is our multivariate data, the problem that we are trying to solve is to rewrite it on a different observation
basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose
the components
Principal component analysis: PCA Principal component analysis (PCA) selects the successive components that
explain the maximum variance in the signal.
The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can
almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]
>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, n_components=None, whiten=False)
>>> print pca.explained_variance_
[ 2.18565811e+00 1.19346747e+00 8.43026679e-32]
>>> # As we can see, only the 2 first components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)
Independent Component Analysis: ICA Independent component analysis (ICA) selects components so that the
distribution of their loadings carries a maximum amount of independent information. It is able to recover non-
Gaussian independent signals:
>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time)            # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time))   # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
>>> S /= S.std(axis=0) # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]]) # Mixing matrix
>>> X = np.dot(S, A.T) # Generate observations
>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit(X).transform(X) # Get the estimated sources
>>> A_ = ica.get_mixing_matrix() # Get estimated mixing matrix
>>> np.allclose(X, np.dot(S_, A_.T))
True
Putting it all together
Pipelining
We have seen that some estimators can transform data, and some estimators can predict variables. We can create
combined estimators:
import numpy as np
import pylab as pl
from sklearn import linear_model, decomposition, datasets
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)
pl.figure(1, figsize=(4, 3))
pl.clf()
pl.axes([.2, .2, .7, .7])
pl.plot(pca.explained_variance_, linewidth=2)
pl.axis('tight')
pl.xlabel('n_components')
pl.ylabel('explained_variance_')
###############################################################################
# Prediction
from sklearn.grid_search import GridSearchCV
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
#Parameters of pipelines can be set using __ separated parameter names:
estimator = GridSearchCV(pipe,
dict(pca__n_components=n_components,
logistic__C=Cs))
estimator.fit(X_digits, y_digits)
pl.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
           linestyle=':', label='n_components chosen')
pl.legend(prop=dict(size=12))
Face recognition with eigenfaces
The dataset used in this example is a preprocessed excerpt of the Labeled Faces in the Wild, aka LFW:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
Expected results for the top 5 most represented people in the dataset::
precision recall f1-score support
Gerhard_Schroeder 0.91 0.75 0.82 28
Donald_Rumsfeld 0.84 0.82 0.83 33
Tony_Blair 0.65 0.82 0.73 34
Colin_Powell 0.78 0.88 0.83 58
George_W_Bush 0.93 0.86 0.90 129
avg / total 0.86 0.84 0.85 282
"""
print __doc__
from time import time
import logging
import pylab as pl
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
###############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
# for machine learning we use the 2D data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes
###############################################################################
# Split into a training set and a test set using a stratified k fold
# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_fraction=0.25)
###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (
n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)
print "done in %0.3fs" % (time() - t0)
eigenfaces = pca.components_.reshape((n_components, h, w))
print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)
###############################################################################
# Train a SVM classification model
print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='auto'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_
###############################################################################
# Quantitative evaluation of the model quality on the test set
print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)
print "done in %0.3fs" % (time() - t0)
print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))
###############################################################################
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())
# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, target_names, i)
for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# plot the gallery of the most significant eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
pl.show()
(Figures: prediction gallery; eigenfaces gallery.)
Expected results for the top 5 most represented people in the dataset:
precision recall f1-score support
Gerhard_Schroeder 0.91 0.75 0.82 28
Donald_Rumsfeld 0.84 0.82 0.83 33
Tony_Blair 0.65 0.82 0.73 34
Colin_Powell 0.78 0.88 0.83 58
George_W_Bush 0.93 0.86 0.90 129
avg / total 0.86 0.84 0.85 282
Open problem: Stock Market Structure
Can we predict the variation in stock prices for Google?
Visualizing the stock market structure
Finding help
The project mailing list
If you encounter a bug with scikit-learn or something that needs clarification in the docstring or the online
documentation, please feel free to ask on the Mailing List
Q&A communities with Machine Learning practitioners
Metaoptimize/QA A forum for Machine Learning, Natural Language Processing and
other Data Analytics discussions (similar to what Stackoverflow is for developers):
http://metaoptimize.com/qa
A good starting point is the discussion on good freely available textbooks on machine
learning
Quora.com Quora has a topic for Machine Learning related questions that also features some
interesting discussions: http://quora.com/Machine-Learning
Have a look at the best questions section, eg: What are some good resources for learning
about machine learning.
Note: Videos
Videos with tutorials can also be found in the Videos section.
Note: Doctest Mode
The code examples in the above tutorials are written in a python-console format. If you wish to easily execute these
examples in IPython, use:
%doctest_mode
in the iPython-console. You can then simply copy and paste the examples directly into iPython without having to
worry about removing the >>> manually.
1.3 Supervised learning
1.3.1 Generalized Linear Models
The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if \hat{y} is the predicted value,

\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

Across the module, we designate the vector w = (w_1, ..., w_p) as coef_ and w_0 as intercept_.
To perform classification with generalized linear models, see Logistic regression.
Ordinary Least Squares
LinearRegression fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed responses in the dataset, and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:

\min_{w} ||X w - y||_2^2
LinearRegression will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:
>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> clf.coef_
array([ 0.5, 0.5])
However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and, as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
Examples:
Linear Regression Example
Ordinary Least Squares Complexity
This method computes the least squares solution using a singular value decomposition of X. If X is a matrix of size (n, p) this method has a cost of O(n p^2), assuming that n \geq p.
Ridge Regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares,

\min_{w} ||X w - y||_2^2 + \alpha ||w||_2^2

Here, \alpha \geq 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of \alpha, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
As with other linear models, Ridge will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:
>>> from sklearn import linear_model
>>> clf = linear_model.Ridge (alpha = .5)
>>> clf.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, normalize=False, tol=0.001)
>>> clf.coef_
array([ 0.34545455, 0.34545455])
>>> clf.intercept_
0.13636...
Examples:
Plot Ridge coefficients as a function of the regularization
Classification of text documents using sparse features
Ridge Complexity
This method has the same order of complexity as Ordinary Least Squares.
Setting the regularization parameter: generalized Cross-Validation
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in
the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of
leave-one-out cross-validation:
>>> from sklearn import linear_model
>>> clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
>>> clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, loss_func=None,
normalize=False, score_func=None)
>>> clf.best_alpha
0.1
References
Notes on Regularized Least Squares, Rifkin & Lippert (technical report, course slides).
Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).
Mathematically, it consists of a linear model trained with an \ell_1 prior as regularizer. The objective function to minimize is:

\min_{w} \frac{1}{2 n_{samples}} ||X w - y||_2^2 + \alpha ||w||_1
The lasso estimate thus solves the minimization of the least-squares penalty with \alpha ||w||_1 added, where \alpha is a constant and ||w||_1 is the \ell_1-norm of the parameter vector.
The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least Angle Regression for another implementation:
>>> clf = linear_model.Lasso(alpha = 0.1)
>>> clf.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute='auto', tol=0.0001,
warm_start=False)
>>> clf.predict([[1, 1]])
array([ 0.8])
Also useful for lower-level tasks is the function lasso_path that computes the coefficients along the full path of possible values.
Examples:
Lasso and Elastic Net for Sparse Signals
Compressive sensing: tomography reconstruction with L1 prior (Lasso)
Note: Feature selection with Lasso
As the Lasso regression yields sparse models, it can thus be used to perform feature selection, as detailed in L1-based
feature selection.
Setting regularization parameter
The alpha parameter controls the degree of sparsity of the coefficients estimated.
Using cross-validation scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV. LassoLarsCV is based on the Least Angle Regression algorithm explained below.
For high-dimensional datasets with many collinear regressors, LassoCV is most often preferable. However, LassoLarsCV has the advantage of exploring more relevant values of the alpha parameter, and if the number of samples is very small compared to the number of features, it is often faster than LassoCV.
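A minimal sketch of these cross-validated objects on synthetic data (the data and the alpha grid below are made up for illustration; the attribute storing the selected alpha has changed names across releases, so it is only referred to in a comment):

import numpy as np
from sklearn import linear_model

# Hypothetical toy data: the target depends on the first feature only.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X[:, 0] + 0.01 * rng.randn(50)

# LassoCV selects alpha by cross-validation over a grid of candidate values.
clf = linear_model.LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
clf.fit(X, y)
# The selected alpha is stored on the fitted estimator (alpha_ in recent releases).

# LassoLarsCV follows the same pattern but computes the LARS path instead.
clf_lars = linear_model.LassoLarsCV(cv=5)
clf_lars.fit(X, y)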
Information-criteria based model selection Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes Information criterion (BIC). It is a computationally cheaper alternative to find the optimal value of alpha, as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).
Examples:
Lasso model selection: Cross-Validation / AIC / BIC
Elastic Net
ElasticNet is a linear model trained with L1 and L2 prior as regularizer.
The objective function to minimize is in this case:

\min_{w} \frac{1}{2 n_{samples}} ||X w - y||_2^2 + \alpha \rho ||w||_1 + \frac{\alpha (1 - \rho)}{2} ||w||_2^2
The class ElasticNetCV can be used to set the parameters alpha and rho by cross-validation.
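A hedged sketch of the estimator described above (the toy data and parameter values are made up for illustration; note that in this release the mixing parameter is called rho, while later versions renamed it l1_ratio):

from sklearn import linear_model

X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 2]

# alpha scales the overall penalty; rho mixes the L1 and L2 terms.
clf = linear_model.ElasticNet(alpha=0.1, rho=0.7)
clf.fit(X, y)
clf.predict([[1.5, 1.5]])
# ElasticNetCV follows the same interface and chooses these parameters by cross-validation.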
Examples:
Lasso and Elastic Net for Sparse Signals
Lasso and Elastic Net
Least Angle Regression
Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron,
Trevor Hastie, Iain Johnstone and Robert Tibshirani.
The advantages of LARS are:
It is numerically efficient in contexts where p >> n (i.e., when the number of dimensions is significantly greater than the number of points).
It is computationally just as fast as forward selection and has the same order of complexity as an ordinary least squares.
It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
If two variables are almost equally correlated with the response, then their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
It is easily modified to produce solutions for other estimators, like the Lasso.
The disadvantages of the LARS method include:
Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.
The LARS model can be used via the estimator Lars, or its low-level implementation lars_path.
LARS Lasso
LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.
>>> from sklearn import linear_model
>>> clf = linear_model.LassoLars(alpha=.1)
>>> clf.fit([[0, 0], [1, 1]], [0, 1])
LassoLars(alpha=0.1, copy_X=True, eps=..., fit_intercept=True,
max_iter=500, normalize=True, precompute='auto', verbose=False)
>>> clf.coef_
array([ 0.717157..., 0. ])
Examples:
Lasso path using LARS
The Lars algorithm provides the full path of the coefficients along the regularization parameter almost for free, thus a common operation consists of retrieving the path with the function lars_path.
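For instance, a minimal sketch on made-up toy data:

import numpy as np
from sklearn.linear_model import lars_path

X = np.array([[-1., 1.], [0., 0.], [1., 1.]])
y = np.array([-1.1, 0., -1.1])

# lars_path returns the alphas at which the active set changes, the indices
# of the active variables, and the coefficients along the whole path.
alphas, active, coefs = lars_path(X, y, method='lasso')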
Mathematical formulation
The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The full coefficients path is stored in the array coef_path_, which has size (n_features, max_features + 1). The first column is always zero.
References:
Original Algorithm is detailed in the paper Least Angle Regression by Hastie et al.
Orthogonal Matching Pursuit (OMP)
OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the \ell_0 pseudo-norm).
Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

\arg\min_{\gamma} ||y - X \gamma||_2^2 \quad \text{subject to} \quad ||\gamma||_0 \leq n_{nonzero\_coefs}

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

\arg\min_{\gamma} ||\gamma||_0 \quad \text{subject to} \quad ||y - X \gamma||_2^2 \leq tol
OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current
residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is
recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
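A hedged sketch of fitting the estimator with a fixed number of non-zero coefficients; the synthetic data and parameter values are chosen only for illustration:

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(20, 10)
# The target depends on two features only, so a two-atom fit should recover it.
y = X[:, 2] - 0.5 * X[:, 7]

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2)
omp.fit(X, y)
omp.coef_  # most entries are exactly zero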
Examples:
Orthogonal Matching Pursuit
References:
http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
Matching pursuits with time-frequency dictionaries, S. G. Mallat, Z. Zhang,
Bayesian Regression
Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the
regularization parameter is not set in a hard sense but tuned to the data at hand.
This can be done by introducing uninformative priors over the hyper parameters of the model. The \ell_2 regularization used in Ridge Regression is equivalent to finding a maximum a posteriori solution under a Gaussian prior over the parameters w with precision \lambda^{-1}; the prior over \lambda is in turn chosen to be a gamma distribution given by the hyperparameters \lambda_1 and \lambda_2.
Examples:
Automatic Relevance Determination Regression (ARD)
References:
David Wipf and Srikantan Nagarajan: A new view of automatic relevance determination.
Logistic regression
If the task at hand is to choose which class a sample belongs to given a finite (hopefully small) set of choices, the learning problem is a classification, rather than a regression. Linear models can be used for such a decision, but it is best to use what is called a logistic regression, which doesn't try to minimize the sum of squared residuals, as in regression, but rather a "hit or miss" cost.
The LogisticRegression class can be used to do L1 or L2 penalized logistic regression. L1 penalization yields sparse predicting weights. For L1 penalization, sklearn.svm.l1_min_c allows to calculate the lower bound for C in order to get a non-null model (a model where not all feature weights are zero).
Examples:
L1 Penalty and Sparsity in Logistic Regression
Path with L1- Logistic Regression
Note: Feature selection with sparse logistic regression
A logistic regression with L1 penalty yields sparse models, and can thus be used to perform feature selection, as
detailed in L1-based feature selection.
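A hedged sketch of an L1-penalized fit on synthetic data; l1_min_c is used to pick a C just above the threshold below which all weights would be zero (the data and the factor of 10 are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import l1_min_c

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Smallest C giving a non-null model; use a somewhat larger value for the fit.
c_min = l1_min_c(X, y, loss='log')
clf = LogisticRegression(penalty='l1', C=10 * c_min)
clf.fit(X, y)
clf.coef_  # several entries are driven to exactly zero by the L1 penalty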
Stochastic Gradient Descent - SGD
Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large.
The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties.
References
Stochastic Gradient Descent
Perceptron
The Perceptron is another simple algorithm suitable for large scale learning. By default:
It does not require a learning rate.
It is not regularized (penalized).
It updates its model only on mistakes.
The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the
resulting models are sparser.
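A minimal usage sketch on made-up data, mirroring the interface of the other linear classifiers:

from sklearn.linear_model import Perceptron

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]

# The Perceptron exposes the same fit/predict interface as SGDClassifier.
clf = Perceptron()
clf.fit(X, y)
clf.predict([[2.5, 2.5]])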
1.3.2 Support Vector Machines
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
The advantages of support vector machines are:
Effective in high dimensional spaces.
Still effective in cases where the number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
If the number of features is much greater than the number of samples, the method is likely to give poor performance.
SVMs do not directly provide probability estimates; these are calculated using five-fold cross-validation, and thus performance can suffer.
The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.
In previous versions of scikit-learn, sparse input support existed only in the sklearn.svm.sparse module which duplicated the sklearn.svm interface. This module still exists for backward compatibility, but is deprecated and will be removed in scikit-learn 0.12.
Classification
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, LinearSVC is another implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept the keyword kernel, as this is assumed to be linear. It also lacks some of the members of SVC and NuSVC, like support_.
As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of integer values, size [n_samples], holding the class labels for the training samples:
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([ 1.])
SVMs' decision function depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the members support_vectors_, support_ and n_support_:
>>> # get support vectors
>>> clf.support_vectors_
array([[ 0., 0.],
[ 1., 1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
Multi-class classification
SVC and NuSVC implement the "one-against-one" approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed and each one trains data from two classes:
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC()
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=1.0, kernel='rbf', probability=False, shrinking=True,
    tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1]  # 4 classes: 4*3/2 = 6
6
On the other hand, LinearSVC implements one-vs-the-rest multi-class strategy, thus training n_class models. If
there are only two classes, only one model is trained:
>>> lin_clf = svm.LinearSVC()
>>> lin_clf.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
    intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
tol=0.0001, verbose=0)
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4
See Mathematical formulation for a complete description of the decision function.
Note that the LinearSVC also implements an alternative multi-class strategy, the so-called multi-class SVM formulated by Crammer and Singer, by using the option multi_class='crammer_singer'. This method is consistent, which is not true for one-vs-rest classification. In practice, one-vs-rest classification is usually preferred, since the results are mostly similar, but the runtime is significantly less.
For one-vs-rest LinearSVC the attributes coef_ and intercept_ have the shape [n_class, n_features] and [n_class] respectively. Each row of the coefficients corresponds to one of the n_class many one-vs-rest classifiers and similarly for the intercepts, in the order of the "one" class.
In the case of "one-vs-one" SVC, the layout of the attributes is a little more involved. In the case of having a linear kernel, the layout of coef_ and intercept_ is similar to the one described for LinearSVC above, except that the shape of coef_ is [n_class * (n_class - 1) / 2, n_features], corresponding to as many binary classifiers. The order for classes 0 to n is "0 vs 1", "0 vs 2", ..., "0 vs n", "1 vs 2", "1 vs 3", ..., "1 vs n", ..., "n-1 vs n".
The shape of dual_coef_ is [n_class - 1, n_SV] with a somewhat hard to grasp layout. The columns correspond to the support vectors involved in any of the n_class * (n_class - 1) / 2 "one-vs-one" classifiers. Each of the support vectors is used in n_class - 1 classifiers. The n_class - 1 entries in each row correspond to the dual coefficients for these classifiers.
This might be made more clear by an example:
Consider a three class problem with class 0 having three support vectors v^0_0, v^1_0, v^2_0 and classes 1 and 2 having two support vectors v^0_1, v^1_1 and v^0_2, v^1_2 respectively. For each support vector v^j_i, there are two dual coefficients. Let us call the coefficient of support vector v^j_i in the classifier between classes i and k \alpha^j_{i,k}. Then dual_coef_ has n_class - 1 = 2 rows and one column per support vector: the column for v^j_i holds the two coefficients \alpha^j_{i,k} of that vector in the two classifiers that oppose class i to the other classes.
Unbalanced problems
In problems where it is desired to give more importance to certain classes or certain individual samples, the keywords class_weight and sample_weight can be used.
SVC (but not NuSVC) implements the keyword class_weight. It's a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value.
SVC, NuSVC, SVR, NuSVR and OneClassSVM also implement weights for individual samples in the fit method through the keyword sample_weight.
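A hedged sketch on a synthetic unbalanced problem (the data, the weight of 10 and the linear kernel are arbitrary choices for illustration; class_weight is passed to the constructor here, as shown in the SVC parameter list above):

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
# 90 samples of class 0 and only 10 samples of class 1.
X = np.r_[rng.randn(90, 2), rng.randn(10, 2) + [2, 2]]
y = np.array([0] * 90 + [1] * 10)

# Give the rare class a 10x larger C so its errors are penalized more.
clf = svm.SVC(kernel='linear', class_weight={1: 10})
clf.fit(X, y)

# Individual samples can be re-weighted as well through fit's sample_weight.
weighted = svm.SVC(kernel='linear')
weighted.fit(X, y, sample_weight=np.ones(len(y)))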
Examples:
Plot different SVM classifiers in the iris dataset,
SVM: Maximum margin separating hyperplane,
SVM: Separating hyperplane for unbalanced classes
SVM-Anova: SVM with univariate feature selection,
Non-linear SVM
SVM: Weighted samples,
Regression
The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression.
The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin.
Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
There are two flavors of Support Vector Regression: SVR and NuSVR.
As with classification classes, the fit method will take as argument vectors X, y, only that in this case y is expected to have floating point values instead of integer values:
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = svm.SVR()
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
epsilon=0.1, gamma=0.5, kernel='rbf', probability=False, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])
Examples:
Support Vector Regression (SVR) using linear and non-linear kernels
Density estimation, novelty detection
One-class SVM is used for novelty detection, that is, given a set of samples, it will detect the soft boundary of that set so as to classify new points as belonging to that set or not. The class that implements this is called OneClassSVM.
In this case, as it is a type of unsupervised learning, the fit method will only take as input an array X, as there are no class labels.
See section Novelty and Outlier Detection for more details on this usage.
Examples:
One-class SVM with non-linear kernel (RBF)
Species distribution modeling
Complexity
Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data. The QP solver used by this libsvm-based implementation scales between O(n_{features} \times n_{samples}^2) and O(n_{features} \times n_{samples}^3) depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, n_{features} should be replaced by the average number of non-zero features in a sample vector.
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
Tips on Practical Use
Avoiding data copy: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered contiguous and double precision, it will be copied before calling the underlying C implementation. You can check whether a given numpy array is C-contiguous by inspecting its flags attribute.
For LinearSVC (and LogisticRegression) any input passed as a numpy array will be copied and converted to the liblinear internal sparse data representation (double precision floats and int32 indices of non-zero components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous double precision array as input, we suggest to use the SGDClassifier class instead. The objective function can be configured to be almost the same as the LinearSVC model.
Kernel cache size: For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).
Setting C: C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. See section Preprocessing data for more details on scaling and normalization.
Parameter nu in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors.
In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set class_weight='auto' and/or try different penalty parameters C.
The underlying LinearSVC implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
Using L1 penalization as provided by LinearSVC(loss='l2', penalty='l1', dual=False) yields a sparse solution, i.e. only a subset of feature weights is different from zero and contributes to the decision function. Increasing C yields a more complex model (more features are selected). The C value that yields a "null" model (all weights equal to zero) can be calculated using l1_min_c.
Kernel functions
The kernel function can be any of the following:
linear: \langle x_i, x_j \rangle.
polynomial: (\gamma \langle x, x' \rangle + r)^d, where d is specified by the keyword degree and r by coef0.
rbf: \exp(-\gamma |x - x'|^2), with \gamma > 0; \gamma is specified by the keyword gamma.
sigmoid: \tanh(\langle x_i, x_j \rangle + r), where r is specified by coef0.
Different kernels are specified by the keyword kernel at initialization:
>>> linear_svc = svm.SVC(kernel='linear')
>>> linear_svc.kernel
'linear'
>>> rbf_svc = svm.SVC(kernel='rbf')
>>> rbf_svc.kernel
'rbf'
Custom Kernels
You can define your own kernels by either giving the kernel as a python function or by precomputing the Gram matrix.
Classifiers with custom kernels behave the same way as any other classifiers, except that:
Field support_vectors_ is now empty, only indices of support vectors are stored in support_
A reference (and not a copy) of the first argument in the fit() method is stored for future reference. If that array changes between the use of fit() and predict() you will have unexpected results.
Using python functions as kernels You can also use your own defined kernels by passing a function to the keyword kernel in the constructor.
Your kernel must take as arguments two matrices and return a third matrix.
The following code defines a linear kernel and creates a classifier instance that will use that kernel:
>>> import numpy as np
>>> from sklearn import svm
>>> def my_kernel(x, y):
... return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
Examples:
SVM with custom kernel.
Using the Gram matrix Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel values between all training vectors and the test vectors must be provided.
>>> import numpy as np
>>> from sklearn import svm
>>> X = np.array([[0, 0], [1, 1]])
>>> y = [0, 1]
>>> clf = svm.SVC(kernel='precomputed')
>>> # linear kernel computation
>>> gram = np.dot(X, X.T)
>>> clf.fit(gram, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='precomputed', probability=False, shrinking=True,
tol=0.001, verbose=False)
>>> # predict on training examples
>>> clf.predict(gram)
array([ 0., 1.])
Parameters of the RBF Kernel When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
Proper choice of C and gamma is critical to the SVM's performance. One is advised to use GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
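A hedged sketch of such a search on synthetic data; in this release GridSearchCV lives in sklearn.grid_search (later versions moved it to sklearn.model_selection), and the grids below are only meant as an illustration:

import numpy as np
from sklearn import svm
from sklearn.grid_search import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# C and gamma are explored on exponential grids, as recommended above.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid)
search.fit(X, y)
search.best_estimator_  # the SVC refit with the best (C, gamma) pair found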
Examples:
RBF SVM parameters
Mathematical formulation
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
SVC
Given training vectors x_i \in R^p, i = 1, ..., n, in two classes, and a vector y \in R^n such that y_i \in \{1, -1\}, SVC solves the following primal problem:

\min_{w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i

subject to y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, ..., n

Its dual is

\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha

subject to y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, ..., l

where e is the vector of all ones, C > 0 is the upper bound, Q is an n by n positive semidefinite matrix with Q_{ij} \equiv K(x_i, x_j), and K(x_i, x_j) = \phi(x_i)^T \phi(x_j) is the kernel. Here training vectors are mapped into a higher (maybe infinite) dimensional space by the function \phi.
The decision function is:

\mathrm{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + \rho \right)

Note: While SVM models derived from libsvm and liblinear use C as the regularization parameter, most other estimators use alpha. The relation between the two is C = \frac{n_{samples}}{alpha}.
These parameters can be accessed through the members dual_coef_ which holds the product y_i \alpha_i, support_vectors_ which holds the support vectors, and intercept_ which holds the independent term \rho.
References:
"Automatic Capacity Tuning of Very Large VC-dimension Classifiers", I. Guyon, B. Boser, V. Vapnik - Advances in Neural Information Processing, 1993.
"Support-vector networks", C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).
NuSVC
We introduce a new parameter \nu which controls the number of support vectors and training errors. The parameter \nu \in (0, 1] is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
It can be shown that the \nu-SVC formulation is a reparametrization of the C-SVC and therefore mathematically equivalent.
Implementation details
Internally, we use libsvm and liblinear to handle all computations. These libraries are wrapped using C and Cython.
References:
For a description of the implementation and details of the algorithms used, please refer to
LIBSVM: a library for Support Vector Machines
LIBLINEAR - A Library for Large Linear Classification
1.3.3 Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.
The advantages of Stochastic Gradient Descent are:
Efficiency.
Ease of implementation (lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include:
SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
SGD is sensitive to feature scaling.
Classification
Warning: Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True to shuffle after each iteration.
The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification.
As other classifiers, SGD has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the target values (class labels) for the training samples:
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
shuffle=False, verbose=0, warm_start=False)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
SGD fits a linear model to the training data. The member coef_ holds the model parameters:
>>> clf.coef_
array([[ 9.90090187, 9.90090187]])
Member intercept_ holds the intercept (aka offset or bias):
>>> clf.intercept_
array([-9.990...])
Whether or not the model should use an intercept, i.e. a biased hyperplane, is controlled by the parameter fit_intercept.
To get the signed distance to the hyperplane use decision_function:
>>> clf.decision_function([[2., 2.]])
array([ 29.61357756])
The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss functions:
loss="hinge": (soft-margin) linear Support Vector Machine,
loss="modified_huber": smoothed hinge loss,
loss="log": Logistic Regression,
and all regression losses below.
The first two loss functions are lazy, they only update the model parameters if an example violates the margin constraint, which makes training very efficient and may result in sparser models, even when L2 penalty is used.
In the case of binary classification and loss="log" or loss="modified_huber" you get a probability estimate P(y=C|x) using predict_proba, where C is the largest class label:
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([ 0.99999949])
The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:
penalty="l2": L2 norm penalty on coef_.
penalty="l1": L1 norm penalty on coef_.
penalty="elasticnet": Convex combination of L2 and L1; rho * L2 + (1 - rho) * L1.
The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero. The Elastic Net solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The parameter rho has to be specified by the user.
SGDClassifier supports multi-class classification by combining multiple binary classifiers in a "one versus all" (OVA) scheme. For each of the K classes, a binary classifier is learned that discriminates between that and all other K-1 classes. At testing time, we compute the confidence score (i.e. the signed distance to the hyperplane) for each classifier and choose the class with the highest confidence. The Figure below illustrates the OVA approach on the iris dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced by the three classifiers.
In the case of multi-class classification, coef_ is a two-dimensional array of shape [n_classes, n_features] and intercept_ is a one-dimensional array of shape [n_classes]. The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes). Note that, in principle, since they allow to create a probability model, loss="log" and loss="modified_huber" are more suitable for one-vs-all classification.
SGDClassifier supports both weighted classes and weighted instances via the fit parameters class_weight and sample_weight. See the examples below and the doc string of SGDClassifier.fit for further information.
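A small sketch on made-up data illustrating the multi-class attribute shapes described above:

import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 1, 2, 2])  # three classes

clf = SGDClassifier(loss="hinge")
clf.fit(X, y)

# One OVA weight vector per class and one intercept per class.
clf.coef_.shape       # (3, 2), i.e. (n_classes, n_features)
clf.intercept_.shape  # (3,)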
Examples:
SGD: Maximum margin separating hyperplane,
Plot multi-class SGD on the iris dataset
SGD: Separating hyperplane with weighted classes
SGD: Weighted samples
Regression
The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.
The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:
loss="squared_loss": Ordinary least squares,
loss="huber": Huber loss for robust regression,
loss="epsilon_insensitive": linear Support Vector Regression.
The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
Examples:
Ordinary Least Squares with SGD,
Stochastic Gradient Descent for sparse data
Note: The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.
There is built-in support for sparse data given in any matrix in a format supported by scipy.sparse. For maximum efficiency, however, use the CSR matrix format as defined in scipy.sparse.csr_matrix.
Examples:
Classification of text documents using sparse features
Complexity
The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of O(k n \bar{p}), where k is the number of iterations (epochs) and \bar{p} is the average number of non-zero attributes per sample.
Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase as the training set size increases.
Tips on Practical Use
Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be easily done using Scaler:
from sklearn.preprocessing import Scaler
scaler = Scaler()
scaler.fit(X_train)  # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # apply same transformation to test data
If your attributes have an intrinsic scale (e.g. word frequencies or indicator features) scaling is not needed.
Finding a reasonable regularization term is best done using GridSearchCV, usually in the range 10.0**-np.arange(1,7).
Empirically, we found that SGD converges after observing approx. 10^6 training samples. Thus, a reasonable first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the size of the training set.
If you apply SGD to features extracted using PCA, we found that it is often wise to scale the feature values by some constant c such that the average L2 norm of the training data equals one.
References:
"Efficient BackProp", Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade, 1998.
Mathematical formulation
Given a set of training examples (x_1, y_1), ..., (x_n, y_n) where x_i \in R^m and y_i \in \{-1, 1\}, our goal is to learn a linear scoring function f(x) = w^T x + b with model parameters w \in R^m and intercept b \in R. In order to make predictions, we simply look at the sign of f(x). A common choice to find the model parameters is by minimizing the regularized training error given by

E(w, b) = \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)

where L is a loss function that measures model (mis)fit and R is a regularization term (aka penalty) that penalizes model complexity; \alpha > 0 is a non-negative hyperparameter.
Different choices for L entail different classifiers such as
Hinge: (soft-margin) Support Vector Machines.
Log: Logistic Regression.
Least-Squares: Ridge Regression.
Epsilon-Insensitive: (soft-margin) Support Vector Regression.
All of the above loss functions can be regarded as an upper bound on the misclassification error (zero-one loss) as shown in the Figure below.
Popular choices for the regularization term R include:
L2 norm: R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2,
L1 norm: R(w) := \sum_{i=1}^{n} |w_i|, which leads to sparse solutions.
Elastic Net: R(w) := \frac{\rho}{2} \sum_{i=1}^{n} w_i^2 + (1 - \rho) \sum_{i=1}^{n} |w_i|, a convex combination of L2 and L1.
The Figure below shows the contours of the different regularization terms in the parameter space when R(w) = 1.
SGD
Stochastic gradient descent is an optimization method for unconstrained optimization problems. In contrast to (batch) gradient descent, SGD approximates the true gradient of E(w, b) by considering a single training example at a time.
The class SGDClassifier implements a first-order SGD learning routine. The algorithm iterates over the training examples and for each example updates the model parameters according to the update rule given by

w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^T x_i + b, y_i)}{\partial w} \right)

where \eta is the learning rate which controls the step-size in the parameter space. The intercept b is updated similarly but without regularization.
The learning rate \eta can be either constant or gradually decaying. For classification, the default learning rate schedule (learning_rate='optimal') is given by

\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}

where t is the time step (there are a total of n_samples * epochs time steps), and t_0 is determined based on a heuristic proposed by Léon Bottou such that the expected initial updates are comparable with the expected size of the weights (this assuming that the norm of the training samples is approx. 1). The exact definition can be found in _init_t in BaseSGD.
For regression, the default learning rate schedule, inverse scaling (learning_rate='invscaling'), is given by

\eta^{(t)} = \frac{eta_0}{t^{power\_t}}

where eta_0 and power_t are hyperparameters chosen by the user via eta0 and power_t, respectively.
For a constant learning rate use learning_rate='constant' and use eta0 to specify the learning rate.
The model parameters can be accessed through the members coef_ and intercept_:
Member coef_ holds the weights w
Member intercept_ holds b
References:
Solving large scale linear prediction problems using stochastic gradient descent algorithms T. Zhang -
In Proceedings of ICML 04.
Regularization and variable selection via the elastic net H. Zou, T. Hastie - Journal of the Royal Statis-
tical Society Series B, 67 (2), 301-320.
Implementation details
The implementation of SGD is influenced by the Stochastic Gradient SVM of Léon Bottou. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in the case of L2 regularization. In the case of sparse feature vectors, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently. Training examples are picked up sequentially and the learning rate is lowered after each observed example. We adopted the learning rate schedule from Shalev-Shwartz et al. 2007. For multi-class classification, a "one versus all" approach is used. We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009 for L1 regularization (and the Elastic Net). The code is written in Cython.
References:
Stochastic Gradient Descent L. Bottou - Website, 2010.
The Tradeoffs of Large Scale Machine Learning L. Bottou - Website, 2011.
Pegasos: Primal estimated sub-gradient solver for svm S. Shalev-Shwartz, Y. Singer, N. Srebro - In
Proceedings of ICML 07.
Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty Y.
Tsuruoka, J. Tsujii, S. Ananiadou - In Proceedings of the AFNLP/ACL 09.
1.3.4 Nearest Neighbors
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits or satellite image scenes. It is often successful in classification situations where the decision boundary is very irregular.
The classes in sklearn.neighbors can handle either Numpy arrays or scipy.sparse matrices as input. Arbitrary Minkowski metrics are supported for searches.
Unsupervised Nearest Neighbors
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, scipy.spatial.cKDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword algorithm, which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. For a discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.
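A minimal sketch of the unsupervised interface on toy data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Ask for the 2 nearest neighbors of each sample, using a ball tree index.
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree')
nbrs.fit(X)
distances, indices = nbrs.kneighbors(X)
# indices[i] contains the sample i itself and its closest neighbor;
# distances[i] holds the corresponding distances.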
Nearest Neighbors Classification
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
The k-neighbors classification in KNeighborsClassifier is the more commonly used of the two techniques. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called "curse of dimensionality".
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights='uniform', assigns uniform weights to each neighbor. weights='distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied which is used to compute the weights.
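For instance, a hedged sketch of a distance-weighted k-neighbors classifier on made-up data:

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# weights='distance' makes closer neighbors count more in the vote.
clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X, y)
clf.predict([[1.1]])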
Examples:
Nearest Neighbors Classification: an example of classification using nearest neighbors.
Nearest Neighbors Regression
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius r of the query point, where r is a floating-point value specified by the user.
The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes uniformly to the classification of a query point. Under some circumstances, it can be advantageous to weight points such that nearby points contribute more to the regression than faraway points. This can be accomplished through the weights keyword. The default value, weights='uniform', assigns equal weights to all points. weights='distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied, which will be used to compute the weights.
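A minimal sketch of the regression interface on made-up data:

from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [2], [3]]
y = [0.0, 0.0, 1.0, 1.0]

# The prediction is the (possibly distance-weighted) mean of the neighbors' targets.
reg = KNeighborsRegressor(n_neighbors=2, weights='uniform')
reg.fit(X, y)
reg.predict([[1.5]])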
Examples:
Nearest Neighbors regression: an example of regression using nearest neighbors.
Nearest Neighbor Algorithms
Brute Force
Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for N samples in D dimensions, this approach scales as O[D N^2]. Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.pairwise.
K-D Tree
To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. The basic idea is that if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant, without having to explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to O[D N log(N)] or better. This is a significant improvement over brute-force for large N.
An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-dimensional tree), which generalizes two-dimensional Quad-trees and 3-dimensional Oct-trees to an arbitrary number of dimensions. The KD tree is a tree structure which recursively partitions the parameter space along the data axes, dividing it into nested orthotopic regions into which data points are filed. The construction of a KD tree is very fast: because partitioning is performed only along the data axes, no D-dimensional distances need to be computed. Once constructed, the nearest neighbor of a query point can be determined with only O[log(N)] distance computations. Though the KD tree approach is very fast for low-dimensional (D < 20) neighbors searches, it becomes inefficient as D grows very large: this is one manifestation of the so-called "curse of dimensionality". In scikit-learn, KD tree neighbors searches are specified using the keyword algorithm = 'kd_tree', and are computed using the class scipy.spatial.cKDTree.
References:
Multidimensional binary search trees used for associative searching, Bentley, J.L., Communications of
the ACM (1975)
Ball Tree
To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which allows for efficient neighbors searches even in very high dimensions.
A ball tree recursively divides the data into nodes defined by a centroid C and radius r, such that each point in the node lies within the hyper-sphere defined by r and C. The number of candidate points for a neighbor search is reduced through use of the triangle inequality:

|x + y| \leq |x| + |y|

With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes, its performance does not degrade at high dimensions. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword algorithm = 'ball_tree', and are computed using the class sklearn.neighbors.BallTree. Alternatively, the user can work with the BallTree class directly.
References:
Five balltree construction algorithms, Omohundro, S.M., International Computer Science Institute Tech-
nical Report (1989)
Choice of Nearest Neighbors Algorithm
The optimal algorithm for a given dataset is a complicated choice, and depends on a number of factors:
- number of samples N (i.e. n_samples) and dimensionality D (i.e. n_features).
  - Brute force query time grows as O[DN]
  - Ball tree query time grows as approximately O[D log(N)]
  - KD tree query time changes with D in a way that is difficult to precisely characterise. For small D (less
    than 20 or so) the cost is approximately O[D log(N)], and the KD tree query can be very efficient. For
    larger D, the cost increases to nearly O[DN], and the overhead due to the tree structure can lead to queries
    which are slower than brute force.
  For small data sets (N less than 30 or so), log(N) is comparable to N, and brute force algorithms can be more
  efficient than a tree-based approach. Both cKDTree and BallTree address this through providing a leaf
  size parameter: this controls the number of samples at which a query switches to brute-force. This allows both
  algorithms to approach the efficiency of a brute-force computation for small N.
- data structure: intrinsic dimensionality of the data and/or sparsity of the data. Intrinsic dimensionality refers
  to the dimension d <= D of a manifold on which the data lies, which can be linearly or nonlinearly embedded
  in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is to be
  distinguished from the concept as used in sparse matrices: the data matrix may have no zero entries, but the
  structure can still be "sparse" in this sense).
  - Brute force query time is unchanged by data structure.
  - Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a
    smaller intrinsic dimensionality leads to faster query times. Because the KD tree internal representation is
    aligned with the parameter axes, it will not generally show as much improvement as ball tree for arbitrarily
    structured data.
  Datasets used in machine learning tend to be very structured, and are very well-suited for tree-based queries.
- number of neighbors k requested for a query point.
  - Brute force query time is largely unaffected by the value of k
  - Ball tree and KD tree query time will become slower as k increases. This is due to two effects: first, a
    larger k leads to the necessity to search a larger portion of the parameter space. Second, using k > 1
    requires internal queueing of results as the tree is traversed.
  As k becomes large compared to N, the ability to prune branches in a tree-based query is reduced. In this
  situation, brute force queries can be more efficient.
- number of query points. Both the ball tree and the KD tree require a construction phase. The cost of this
  construction becomes negligible when amortized over many queries. If only a small number of queries will
  be performed, however, the construction can make up a significant fraction of the total cost. If very few query
  points will be required, brute force is better than a tree-based method.
Currently, algorithm = 'auto' selects 'ball_tree' if k < N/2, and 'brute' otherwise. This choice is
based on the assumption that the number of query points is at least the same order as the number of training points,
and that leaf_size is close to its default value of 30.
Effect of leaf_size
As noted above, for small sample sizes a brute force search can be more efficient than a tree-based query. This fact is
accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes. The level
of this switch can be specified with the parameter leaf_size. This parameter choice has many effects:
construction time A larger leaf_size leads to a faster tree construction time, because fewer nodes need to be
created.
query time Both a large or small leaf_size can lead to suboptimal query cost. For leaf_size approaching
1, the overhead involved in traversing nodes can significantly slow query times. For leaf_size approaching
the size of the training set, queries become essentially brute force. A good compromise between these is
leaf_size = 30, the default value of the parameter.
memory As leaf_size increases, the memory required to store a tree structure decreases. This is especially
important in the case of ball tree, which stores a D-dimensional centroid for each node. The required storage
space for BallTree is approximately 1 / leaf_size times the size of the training set.
leaf_size is not referenced for brute force queries.
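As an illustration, a hedged sketch of passing both the algorithm and leaf_size choices through a neighbors
estimator (parameter values are arbitrary):
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> clf = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree', leaf_size=30)
>>> clf = clf.fit(X, y)
>>> print clf.predict([[0.9]])
[0]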
Nearest Centroid Classifier
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In
effect, this makes it similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It also has no
parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as
when classes have drastically different variances, as equal variance in all dimensions is assumed. See Linear
Discriminant Analysis (sklearn.lda.LDA) and Quadratic Discriminant Analysis (sklearn.qda.QDA) for more
complex methods that do not make this assumption. Usage of the default NearestCentroid is simple:
>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print clf.predict([[-0.8, -1]])
[1]
Nearest Shrunken Centroid
The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken cen-
troid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that
feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value crosses
zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for example,
for removing noisy features.
In the example below, using a small shrink threshold increases the accuracy of the model from 0.81 to 0.82.
Examples:
Nearest Centroid Classification: an example of classification using nearest centroid with different shrink
thresholds.
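Separately from that gallery example, a minimal sketch of enabling shrinkage on the toy data shown earlier (the
threshold value 0.2 is arbitrary):
>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid(shrink_threshold=0.2)
>>> clf = clf.fit(X, y)
>>> print clf.predict([[-0.8, -1]])
[1]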
1.3.5 Gaussian Processes
Gaussian Processes for Machine Learning (GPML) is a generic supervised learning method primarily designed to
solve regression problems. It has also been extended to probabilistic classification, but in the present implementation,
this is only a post-processing of the regression exercise.
The advantages of Gaussian Processes for Machine Learning are:
- The prediction interpolates the observations (at least for regular correlation models).
- The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and exceedance
  probabilities that might be used to refit (online fitting, adaptive fitting) the prediction in some region of interest.
- Versatile: different linear regression models and correlation models can be specified. Common models are
  provided, but it is also possible to specify custom models provided they are stationary.
The disadvantages of Gaussian Processes for Machine Learning include:
- It is not sparse. It uses the whole samples/features information to perform the prediction.
- It loses efficiency in high dimensional spaces, namely when the number of features exceeds a few dozens. It
  might indeed give poor performance and it loses computational efficiency.
- Classification is only a post-processing, meaning that one first needs to solve a regression problem by providing
  the complete scalar float precision output y of the experiment one attempts to model.
Thanks to the Gaussian property of the prediction, it has been given varied applications: e.g. for global optimization,
probabilistic classification.
Examples
An introductory regression example
Say we want to surrogate the function g(x) = x sin(x). To do so, the function is evaluated onto a design of experi-
ments. Then, we define a GaussianProcess model whose regression and correlation models might be specified using
additional kwargs, and ask for the model to be fitted to the data. Depending on the number of parameters provided at
instantiation, the fitting procedure may resort to maximum likelihood estimation for the parameters, or alternatively
it uses the given parameters.
>>> import numpy as np
>>> from sklearn import gaussian_process
>>> def f(x):
...     return x * np.sin(x)
>>> X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
>>> y = f(X).ravel()
>>> x = np.atleast_2d(np.linspace(0, 10, 1000)).T
>>> gp = gaussian_process.GaussianProcess(theta0=1e-2, thetaL=1e-4, thetaU=1e-1)
>>> gp.fit(X, y)
GaussianProcess(beta0=None, corr=<function squared_exponential at 0x...>,
        normalize=True, nugget=array(2.22...e-15),
        optimizer='fmin_cobyla', random_start=1, random_state=...,
        regr=<function constant at 0x...>, storage_mode='full',
        theta0=array([[ 0.01]]), thetaL=array([[ 0.0001]]),
        thetaU=array([[ 0.1]]), verbose=False)
>>> y_pred, sigma2_pred = gp.predict(x, eval_MSE=True)
Fitting Noisy Data
When the data to be fit includes noise, the Gaussian process model can be used by specifying the variance of the noise
for each point. GaussianProcess takes a parameter nugget which is added to the diagonal of the correlation
matrix between training points: in general this is a type of Tikhonov regularization. In the special case of a squared-
exponential correlation function, this normalization is equivalent to specifying a fractional variance in the input. That
is

\mathrm{nugget}_i = \left( \frac{\sigma_i}{y_i} \right)^2

With nugget and corr properly set, Gaussian Processes can be used to robustly recover an underlying function
from noisy data:
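A minimal sketch, assuming the nugget and corr keyword arguments behave as described above; the design, noise
levels and hyper-parameter bounds are arbitrary:
>>> import numpy as np
>>> from sklearn import gaussian_process
>>> X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
>>> y = (X * np.sin(X)).ravel()
>>> dy = 0.5 + 1.0 * np.random.random(y.shape)   # noise standard deviation per point
>>> y += np.random.normal(0, dy)                 # add heteroscedastic noise
>>> gp = gaussian_process.GaussianProcess(corr='squared_exponential',
...                                       theta0=1e-1, thetaL=1e-3, thetaU=1.,
...                                       nugget=(dy / y) ** 2)
>>> gp = gp.fit(X, y)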
Other examples
Gaussian Processes classification example: exploiting the probabilistic output
Mathematical formulation
The initial assumption
Suppose one wants to model the output of a computer experiment, say a mathematical function:

g: \mathbb{R}^{n_{\mathrm{features}}} \rightarrow \mathbb{R}, \quad X \mapsto y = g(X)

GPML starts with the assumption that this function is a conditional sample path of a Gaussian process G which is
additionally assumed to read as follows:

G(X) = f(X)^T \beta + Z(X)

where f(X)^T \beta is a linear regression model and Z(X) is a zero-mean Gaussian process with a fully stationary covari-
ance function:

C(X, X') = \sigma^2 R(|X - X'|)

\sigma^2 being its variance and R being the correlation function which solely depends on the absolute relative distance
between each sample, possibly featurewise (this is the stationarity assumption).
From this basic formulation, note that GPML is nothing but an extension of a basic least squares linear regression
problem:

g(X) \approx f(X)^T \beta

Except we additionally assume some spatial coherence (correlation) between the samples dictated by the correlation
function. Indeed, ordinary least squares assumes the correlation model R(|X - X'|) is one when X = X' and zero
otherwise: a dirac correlation model, sometimes referred to as a nugget correlation model in the kriging literature.
The best linear unbiased prediction (BLUP)
We now derive the best linear unbiased prediction of the sample path g conditioned on the observations:

\hat{G}(X) = G(X \mid y_1 = g(X_1), ..., y_{n_{\mathrm{samples}}} = g(X_{n_{\mathrm{samples}}}))

It is derived from its given properties:
- It is linear (a linear combination of the observations)

\hat{G}(X) \equiv a(X)^T y

- It is unbiased

\mathbb{E}[G(X) - \hat{G}(X)] = 0

- It is the best (in the Mean Squared Error sense)

\hat{G}(X)^* = \arg\min_{\hat{G}(X)} \mathbb{E}\left[ (G(X) - \hat{G}(X))^2 \right]

So that the optimal weight vector a(X) is solution of the following equality constrained optimization problem:

a(X)^* = \arg\min_{a(X)} \mathbb{E}\left[ (G(X) - a(X)^T y)^2 \right]
\quad \text{s.t.} \quad \mathbb{E}\left[ G(X) - a(X)^T y \right] = 0
Rewriting this constrained optimization problem in the form of a Lagrangian and looking further for the first order
optimality conditions to be satisfied, one ends up with a closed form expression for the sought predictor; see references
for the complete proof.
In the end, the BLUP is shown to be a Gaussian random variate with mean:

\mu_{\hat{Y}}(X) = f(X)^T \hat{\beta} + r(X)^T \hat{\gamma}

and variance:

\sigma_{\hat{Y}}^2(X) = \sigma_Y^2 \left( 1 - r(X)^T R^{-1} r(X) + u(X)^T (F^T R^{-1} F)^{-1} u(X) \right)

where we have introduced:
- the correlation matrix whose terms are defined wrt the autocorrelation function and its built-in parameters \theta:

R_{ij} = R(|X_i - X_j|, \theta), \quad i, j = 1, ..., m

- the vector of cross-correlations between the point where the prediction is made and the points in the DOE:

r_i = R(|X - X_i|, \theta), \quad i = 1, ..., m

- the regression matrix (e.g. the Vandermonde matrix if f is a polynomial basis):

F_{ij} = f_i(X_j), \quad i = 1, ..., p, \ j = 1, ..., m

- the generalized least squares regression weights:

\hat{\beta} = (F^T R^{-1} F)^{-1} F^T R^{-1} Y

- and the vectors:

\hat{\gamma} = R^{-1} (Y - F \hat{\beta})

u(X) = F^T R^{-1} r(X) - f(X)

It is important to notice that the probabilistic response of a Gaussian Process predictor is fully analytic and mostly relies
on basic linear algebra operations. More precisely, the mean prediction is the sum of two simple linear combinations
(dot products), and the variance requires two matrix inversions, but the correlation matrix can be decomposed only
once using a Cholesky decomposition algorithm.
The empirical best linear unbiased predictor (EBLUP)
Until now, both the autocorrelation and regression models were assumed given. In practice however they are never
known in advance, so that one has to make (motivated) empirical choices for these models (see Correlation Models).
Provided these choices are made, one should estimate the remaining unknown parameters involved in the BLUP. To
do so, one uses the set of provided observations in conjunction with some inference technique. The present implemen-
tation, which is based on the DACE Matlab toolbox, uses the maximum likelihood estimation technique (see the DACE
manual in references for the complete equations). This maximum likelihood estimation problem is turned into a global
optimization problem onto the autocorrelation parameters. In the present implementation, this global optimization is
solved by means of the fmin_cobyla optimization function from scipy.optimize. In the case of anisotropy however, we
provide an implementation of Welch's componentwise optimization algorithm (see references).
For a more comprehensive description of the theoretical aspects of Gaussian Processes for Machine Learning, please
refer to the references below:
References:
DACE, A Matlab Kriging Toolbox. S. Lophaven, H.B. Nielsen, J. Sondergaard, 2002.
Screening, predicting, and computer experiments. W.J. Welch, R.J. Buck, J. Sacks, H.P. Wynn, T.J. Mitchell,
and M.D. Morris. Technometrics, 34(1), 15-25, 1992.
Gaussian Processes for Machine Learning. C.E. Rasmussen, C.K.I. Williams. MIT Press, 2006 (Ed. T. Dietterich).
The design and analysis of computer experiments. T.J. Santner, B.J. Williams, W. Notz. Springer, 2003.
Correlation Models
Common correlation models match some famous SVM kernels because they are mostly built on equivalent as-
sumptions. They must fulfill Mercer's conditions and should additionally remain stationary. Note however, that the
choice of the correlation model should be made in agreement with the known properties of the original experiment
from which the observations come. For instance:
- If the original experiment is known to be infinitely differentiable (smooth), then one should use the squared-
  exponential correlation model.
- If it's not, then one should rather use the exponential correlation model.
- Note also that there exists a correlation model that takes the degree of derivability as input: this is the Matern
  correlation model, but it's not implemented here (TODO).
For a more detailed discussion on the selection of appropriate correlation models, see the book by Rasmussen &
Williams in references.
Regression Models
Common linear regression models involve zero- (constant), first- and second-order polynomials. But one may specify
one's own in the form of a Python function that takes the features X as input and that returns a vector containing the
values of the functional set. The only constraint is that the number of functions must not exceed the number of
available observations, so that the underlying regression problem is not underdetermined.
Implementation details
The present implementation is based on a translation of the DACE Matlab toolbox.
References:
DACE, A Matlab Kriging Toolbox. S. Lophaven, H.B. Nielsen, J. Sondergaard, 2002.
W.J. Welch, R.J. Buck, J. Sacks, H.P. Wynn, T.J. Mitchell, and M.D. Morris (1992). Screening, predicting,
and computer experiments. Technometrics, 34(1), 15-25.
1.3.6 Partial Least Squares
Partial least squares (PLS) models are useful to find linear relations between two multivariate datasets: in PLS the X
and Y arguments of the fit method are 2D arrays.
PLS finds the fundamental relations between two matrices (X and Y): it is a latent variable approach to modeling
the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X
space that explains the maximum multidimensional variance direction in the Y space. PLS-regression is particularly
suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among
X values. By contrast, standard regression will fail in these cases.
Classes included in this module are PLSRegression, PLSCanonical, CCA and PLSSVD.
Reference:
J.A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Examples:
PLS Partial Least Squares
1.3.7 Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive"
assumption of independence between every pair of features. Given a class variable y and a dependent feature vector
x_1 through x_n, Bayes' theorem states the following relationship:

P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

Using the naive independence assumption that

P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y),
for all i, this relationship is simplified to

P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}

Since P(x_1, \dots, x_n) is constant given the input, we can use the following classification rule:

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),

and we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(x_i \mid y); the former is then the
relative frequency of class y in the training set.
The different Naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of
P(x_i \mid y).
In spite of their apparently over-simplified assumptions, Naive Bayes classifiers have worked quite well in many real-
world situations, famously document classification and spam filtering. They require a small amount of training data
to estimate the necessary parameters. (For theoretical reasons why Naive Bayes works well, and on which types of
data it does, see the references below.)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling
of the class conditional feature distributions means that each distribution can be independently estimated as a one
dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
On the flip side, although Naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
References:
H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.
Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is
assumed to be Gaussian:

P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right)

The parameters \sigma_y and \mu_y are estimated using maximum likelihood.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
>>> print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()
Number of mislabeled points : 6
Multinomial Naive Bayes
MultinomialNB implements the Naive Bayes algorithm for multinomially distributed data, and is one of the two
classic Naive Bayes variants used in text classification (where the data are typically represented as word vector counts,
although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors
\theta_y = (\theta_{y1}, \dots, \theta_{yn}) for each class y, where n is the number of features (in text classification, the size of the vocabulary)
and \theta_{yi} is the probability P(x_i \mid y) of feature i appearing in a sample belonging to class y.
The parameters \theta_y are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}

where N_{yi} = \sum_{x \in T} x_i is the number of times feature i appears in a sample of class y in the training set T, and
N_y = \sum_{i=1}^{|T|} N_{yi} is the total count of all features for class y.
The smoothing prior \alpha \geq 0 accounts for features not present in the learning samples and prevents zero probabilities
in further computations. Setting \alpha = 1 is called Laplace smoothing, while \alpha < 1 is called Lidstone smoothing.
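A minimal sketch on random count data; alpha=1.0 corresponds to the Laplace smoothing described above:
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> X = np.random.randint(5, size=(6, 100))   # 6 samples, 100 count-valued features
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> clf = MultinomialNB(alpha=1.0)
>>> clf = clf.fit(X, y)
>>> print clf.predict(X[2])
[3]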
Bernoulli Naive Bayes
BernoulliNB implements the Naive Bayes training and classification algorithms for data that is distributed ac-
cording to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a
binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued
feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the
binarize parameter).
The decision rule for Bernoulli Naive Bayes is based on

P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y))(1 - x_i)

which differs from multinomial NB's rule in that it explicitly penalizes the non-occurrence of a feature i that is an
indicator for class y, where the multinomial variant would simply ignore a non-occurring feature.
In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and
use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents.
It is advisable to evaluate both models, if time permits.
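A minimal sketch on random binary data (labels arbitrary; the binarize parameter is left at its default here):
>>> import numpy as np
>>> from sklearn.naive_bayes import BernoulliNB
>>> X = np.random.randint(2, size=(6, 100))   # 6 samples, 100 boolean features
>>> Y = np.array([1, 2, 3, 4, 4, 5])
>>> clf = BernoulliNB()
>>> clf = clf.fit(X, Y)
>>> print clf.predict(X[2])
[3]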
References:
C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge
University Press, pp. 234-265.
A. McCallum and K. Nigam (1998). A comparison of event models for Naive Bayes text classification.
Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with Naive Bayes -- Which Naive
Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
1.3.8 Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The
goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the
data features.
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else
decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.
Some advantages of decision trees are:
- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be
  created and blank values to be removed. Note however that this module does not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing
  datasets that have only one type of variable. See algorithms for more information.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily
  explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may
  be more difficult to interpret.
- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the
  model.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were
  generated.
The disadvantages of decision trees include:
- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfit-
  ting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required
  at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree
  being generated. This problem is mitigated by using decision trees within an ensemble.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality
  and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic
  algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms
  cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in
  an ensemble learner, where the features and samples are randomly sampled with replacement.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity
  or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the
  dataset prior to fitting with the decision tree.
Classification
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.
As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X of size [n_samples,
n_features] holding the training samples, and an array Y of integer values, size [n_samples], holding the class la-
bels for the training samples:
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass
(where the labels are [0, ..., K-1]) classification.
Using the Iris dataset, we can construct a tree as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
Once trained, we can export the tree in Graphviz format using the export_graphviz exporter. Below is an example
export of a tree trained on the entire iris dataset:
>>> from StringIO import StringIO
>>> out = StringIO()
>>> out = tree.export_graphviz(clf, out_file=out)
After being fitted, the model can then be used to predict new values:
>>> clf.predict(iris.data[0, :])
array([0])
Examples:
Plot the decision surface of a decision tree on the iris dataset
Regression
Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.
[Figure: the decision tree learned on the full iris dataset, as exported by export_graphviz. Internal nodes test
petal length (cm), petal width (cm) and sepal length (cm) against thresholds; each node reports its error, its number
of samples and the class value counts.]
As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected
to have floating point values instead of integer values:
>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([ 0.5])
Examples:
Decision Tree Regression
Complexity
In general, the run time cost to construct a balanced binary tree is O(n_samples n_features log(n_samples)) and query
time O(log(n_samples)). Although the tree construction algorithm attempts to generate balanced trees, they will not
always be balanced. Assuming that the subtrees remain approximately balanced, the cost at each node consists of
searching through O(n_features) to find the feature that offers the largest reduction in entropy. This has a cost of
O(n_features n_samples log(n_samples)) at each node, leading to a total cost over the entire trees (by summing the cost at
each node) of O(n_features n_samples^2 log(n_samples)).
Scikit-learn offers a more efficient implementation for the construction of decision trees. A naive implementation
(as above) would recompute the class label histograms (for classification) or the means (for regression) at each
new split point along a given feature. By presorting the feature over all relevant samples, and retaining a running
label count, we reduce the complexity at each node to O(n_features log(n_samples)), which results in a total cost of
O(n_features n_samples log(n_samples)).
This implementation also offers a parameter min_density to control an optimization heuristic. A sample mask is used
to mask data points that are inactive at a given node, which avoids the copying of data (important for large datasets or
when training trees within an ensemble). Density is defined as the ratio of active data samples to total samples at a given
node. The minimum density parameter specifies the level below which fancy indexing (and therefore data copying) is
used and the sample mask is reset. If min_density is 1, then fancy indexing is always used for data partitioning during
the tree building phase. In this case, the size of memory (as a proportion of the input data a) required at a node of
depth n can be approximated using a geometric series:

size = a \frac{1 - r^n}{1 - r}

where r is the ratio of samples used at each node. A best case analysis shows that the lowest memory requirement
(for an infinitely deep tree) is 2 * a, where each partition divides the data in half. A worst case analysis shows that the
memory requirement can increase to n * a. In practice it usually requires 3 to 4 times a. Setting min_density to 0 will
always use the sample mask to select the subset of samples at each node. This results in little to no additional memory
being allocated, making it appropriate for massive datasets or within ensemble learners. The default value for
min_density is 0.1 which empirically leads to fast training for many problems. Typically high values of min_density
will lead to excessive reallocation, slowing down the algorithm significantly.
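A hedged sketch of lowering min_density for a memory-constrained fit (the parameter value and data are arbitrary,
and min_density is specific to this implementation of the tree module):
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1], [2, 0], [3, 1]]
>>> y = [0, 1, 0, 1]
>>> clf = tree.DecisionTreeClassifier(min_density=0.0)   # always use the sample mask
>>> clf = clf.fit(X, y)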
Tips on practical use
- Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to
  number of features is important, since a tree with few samples in high dimensional space is very likely to
  overfit.
- Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a
  better chance of finding features that are discriminative.
- Visualise your tree as you are training by using the export function. Use max_depth=3 as an initial tree
  depth to get a feel for how the tree is fitting to your data, and then increase the depth.
- Remember that the number of samples required to populate the tree doubles for each additional level the tree
  grows to. Use max_depth to control the size of the tree to prevent overfitting.
- Use min_samples_split or min_samples_leaf to control the number of samples at a leaf node. A
  very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from
  learning the data. Try min_samples_leaf=5 as an initial value (see the sketch after this list). The main
  difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf,
  while min_samples_split can create arbitrarily small leaves, though min_samples_split is more
  common in the literature.
- Balance your dataset before training to prevent the tree from becoming biased toward the classes that are
  dominant.
- All decision trees use Fortran ordered np.float32 arrays internally. If training data is not in this format, a
  copy of the dataset will be made.
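A minimal sketch of the initial settings suggested in the tips above (the data is arbitrary):
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1], [2, 0], [3, 1], [4, 0], [5, 1]]
>>> y = [0, 1, 0, 1, 0, 1]
>>> clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
>>> clf = clf.fit(X, y)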
Tree algorithms: ID3, C4.5, C5.0 and CART
What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented
in scikit-learn?
ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding
for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical
targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the
tree to generalise to unseen data.
C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining
a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of
intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy
of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a
rule's precondition if the accuracy of the rule improves without it.
C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds smaller rulesets
than C4.5 while being more accurate.
CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target
variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold
that yield the largest information gain at each node.
scikit-learn uses an optimised version of the CART algorithm.
Mathematical formulation
Given training vectors x_i \in \mathbb{R}^n, i = 1, ..., l and a label vector y \in \mathbb{R}^l, a decision tree recursively partitions the space
such that the samples with the same labels are grouped together.
Let the data at node m be represented by Q. For each candidate split \theta = (j, t_m) consisting of a feature j and threshold
t_m, partition the data into Q_{left}(\theta) and Q_{right}(\theta) subsets

Q_{left}(\theta) = \{(x, y) \mid x_j \leq t_m\}

Q_{right}(\theta) = Q \setminus Q_{left}(\theta)

The impurity at m is computed using an impurity function H(), the choice of which depends on the task being solved
(classification or regression)

G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta))

Select the parameters that minimise the impurity

\theta^* = \arg\min_\theta G(Q, \theta)

Recurse for subsets Q_{left}(\theta^*) and Q_{right}(\theta^*). For a classification target, let

p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)

be the proportion of class k observations in node m. Common measures of impurity are Gini

H(X_m) = \sum_k p_{mk} (1 - p_{mk})

Cross-Entropy

H(X_m) = - \sum_k p_{mk} \log(p_{mk})

and Misclassification

H(X_m) = 1 - \max(p_{mk})
Regression criteria
If the target is a continuous value, then for node m, representing a region R_m with N_m observations, a common
criterion to minimise is the Mean Squared Error

c_m = \frac{1}{N_m} \sum_{i \in N_m} y_i

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - c_m)^2
References:
http://en.wikipedia.org/wiki/Decision_tree_learning
http://en.wikipedia.org/wiki/Predictive_analytics
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth,
Belmont, CA, 1984.
J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning, Springer, 2009.
1.3.9 Ensemble methods
The goal of ensemble methods is to combine the predictions of several models built with a given learning algorithm
in order to improve generalizability / robustness over a single model.
Two families of ensemble methods are usually distinguished:
- In averaging methods, the driving principle is to build several models independently and then to average their
  predictions. On average, the combined model is usually better than any of the single models because its variance
  is reduced.
  Examples: Bagging methods, Forests of randomized trees...
- By contrast, in boosting methods, models are built sequentially and one tries to reduce the bias of the combined
  model. The motivation is to combine several weak models to produce a powerful ensemble.
  Examples: AdaBoost, Least Squares Boosting, Gradient Tree Boosting, ...
Forests of randomized trees
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the Ran-
domForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998]
specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the
classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
As with other classifiers, forest classifiers have to be fitted with two arrays: an array X of size [n_samples,
n_features] holding the training samples, and an array Y of size [n_samples] holding the target values (class
labels) for the training samples:
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)
Random Forests
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the
ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition,
when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all
features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this
randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but,
due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding
an overall better model.
In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their
probabilistic prediction, instead of letting each classifier vote for a single class.
Extremely Randomized Trees
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), random-
ness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features
is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candi-
date feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows
the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias:
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
... random_state=0)
>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
... random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.978...
>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
... min_samples_split=1, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.999...
>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
... min_samples_split=1, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() > 0.999
True
Parameters
The main parameters to adjust when using these methods are n_estimators and max_features. The former
is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition,
note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the
random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also
the greater the increase in bias. Empirical good default values are max_features=n_features for regression
problems, and max_features=sqrt(n_features) for classification tasks (where n_features is the number
of features in the data). The best results are also usually reached when setting max_depth=None in combination
with min_samples_split=1 (i.e., when fully developing the trees). Bear in mind though that these values are
usually not optimal. The best parameter values should always be cross-validated. In addition, note that bootstrap
samples are used by default in random forests (bootstrap=True) while the default strategy is to use the original
dataset for building extra-trees (bootstrap=False).
When training on large datasets, where runtime and memory requirements are important, it might also be beneficial
to adjust the min_density parameter, that controls a heuristic for speeding up computations in each tree. See
Complexity of trees for details.
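A hedged sketch of the settings discussed above on the iris data (4 features, so max_features=2 approximates
sqrt(n_features); the other values follow the defaults described in this paragraph):
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> iris = load_iris()
>>> clf = RandomForestClassifier(n_estimators=10, max_features=2,
...                              max_depth=None, min_samples_split=1,
...                              bootstrap=True)
>>> clf = clf.fit(iris.data, iris.target)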
Parallelization
Finally, this module also features the parallel construction of the trees and the parallel computation of the predictions
through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of
the machine. If n_jobs=-1 then all cores available on the machine are used. Note that because of inter-process
communication overhead, the speedup might not be linear (i.e., using k jobs will unfortunately not be k times as fast).
Significant speedup can still be achieved though when building a large number of trees, or when building a single tree
requires a fair amount of time (e.g., on large datasets).
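For instance, a sketch of requesting all available cores while fitting a forest (dataset and sizes are arbitrary):
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> X, y = make_blobs(n_samples=1000, n_features=10, centers=3, random_state=0)
>>> clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)   # use all cores
>>> clf = clf.fit(X, y)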
Examples:
Plot the decision surfaces of ensembles of trees on the iris dataset
Pixel importances with a parallel forest of trees
References
Gradient Tree Boosting
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary
differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both
regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web
search ranking and ecology.
The advantages of GBRT are:
- Natural handling of data of mixed type (= heterogeneous features)
- Predictive power
- Robustness to outliers in input space (via robust loss functions)
The disadvantages of GBRT are:
- Scalability: due to the sequential nature of boosting it can hardly be parallelized.
The module sklearn.ensemble provides methods for both classification and regression via gradient boosted
regression trees.
Classication
GradientBoostingClassifier supports both binary and multi-class classification via the deviance loss func-
tion (loss='deviance'). The following example shows how to fit a gradient boosting classifier with 100 decision
stumps as weak learners:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learn_rate=1.0,
... max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913...
The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; the maximum
depth of each tree is controlled via max_depth. The learn_rate is a hyper-parameter in the range (0.0, 1.0] that
controls overfitting via shrinkage.
Note: Classification with more than 2 classes requires the induction of n_classes regression trees at each
iteration, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large
number of classes we strongly recommend to use RandomForestClassifier as an alternative to GBRT.
Regression
GradientBoostingRegressor supports a number of different loss functions for regression which can be spec-
ified via the argument loss. Currently, least squares (loss='ls') and least absolute deviation (loss='lad')
are supported; the latter is more robust w.r.t. outliers. See [F2001] for detailed information.
>>> import numpy as np
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test = X[:200], X[200:]
>>> y_train, y_test = y[:200], y[200:]
>>> clf = GradientBoostingRegressor(n_estimators=100, learn_rate=1.0,
... max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
>>> mean_squared_error(y_test, clf.predict(X_test))
6.90...
The figure below shows the results of applying GradientBoostingRegressor with least squares loss and 500
base learners to the Boston house-price dataset (see sklearn.datasets.load_boston). The plot on the left
shows the train and test error at each iteration. Plots like these are often used for early stopping. The plot on the right
shows the feature importances which can be obtained via the feature_importance property.
Mathematical formulation
GBRT considers additive models of the following form:

F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)

where h_m(x) are the basis functions which are usually called weak learners in the context of boosting. Gradient Tree
Boosting uses decision trees of fixed size as weak learners. Decision trees have a number of abilities that make them
valuable for boosting, namely the ability to handle data of mixed type and the ability to model complex functions.
Similar to other boosting algorithms, GBRT builds the additive model in a forward stagewise fashion:

F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)

At each stage the decision tree h_m(x) is chosen that minimizes the loss function L given the current model F_{m-1}
and its fit F_{m-1}(x_i)
F_m(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x))

The initial model F_0 is problem specific, for least-squares regression one usually chooses the mean of the target values.
Note: The initial model can also be specified via the init argument. The passed object has to implement fit and
predict.
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: the steepest descent
direction is the negative gradient of the loss function evaluated at the current model F_{m-1}, which can be calculated for
any differentiable loss function:

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))

where the step length \gamma_m is chosen using line search:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left( y_i, F_{m-1}(x_i)
- \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \right)

The algorithms for regression and classification only differ in the concrete loss function used.
Loss Functions The following loss functions are supported and can be specified using the parameter loss:
- Regression
  - Least squares (loss='ls'): The natural choice for regression due to its superior computational properties. The
    initial model is given by the mean of the target values.
  - Least absolute deviation (loss='lad'): A robust loss function for regression. The initial model is given by the
    median of the target values.
- Classification
  - Binomial deviance (loss='deviance'): The negative binomial log-likelihood loss function for binary classifi-
    cation (provides probability estimates). The initial model is given by the log odds-ratio.
  - Multinomial deviance (loss='deviance'): The negative multinomial log-likelihood loss function for multi-
    class classification with n_classes mutually exclusive classes. It provides probability estimates. The
    initial model is given by the prior probability of each class. At each iteration n_classes regression trees
    have to be constructed which makes GBRT rather inefficient for data sets with a large number of classes.
Regularization
Shrinkage [F2001] proposed a simple regularization strategy that scales the contribution of each weak learner by a
factor \nu:

F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)
The parameter \nu is also called the learning rate because it scales the step length of the gradient descent procedure; it
can be set via the learn_rate parameter.
The parameter learn_rate strongly interacts with the parameter n_estimators, the number of weak learners
to fit. Smaller values of learn_rate require larger numbers of weak learners to maintain a constant training error.
Empirical evidence suggests that small values of learn_rate favor better test error. [HTF2009] recommend to set
the learning rate to a small constant (e.g. learn_rate <= 0.1) and choose n_estimators by early stopping.
For a more detailed discussion of the interaction between learn_rate and n_estimators see [R2007].
Subsampling [F1999] proposed stochastic gradient boosting, which combines gradient boosting with bootstrap av-
eraging (bagging). At each iteration the base classifier is trained on a fraction subsample of the available training
data. The subsample is drawn without replacement. A typical value of subsample is 0.5.
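A hedged sketch combining shrinkage and subsampling (parameter values arbitrary; learn_rate is the parameter
name used in this release):
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=200, learn_rate=0.1,
...                                  subsample=0.5, random_state=0)
>>> clf = clf.fit(X, y)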
The figure below illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can
clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of
the model. Subsampling without shrinkage, on the other hand, does poorly.
Examples:
Gradient Boosting regression
Gradient Boosting regularization
References
1.3.10 Multiclass and multilabel algorithms
This module implements multiclass and multilabel learning algorithms:
one-vs-the-rest / one-vs-all
one-vs-one
error correcting output codes
Multiclass classification means classification with more than two classes. Multilabel classification is a different task,
where a classifier is used to predict a set of target labels for each instance; i.e., the set of target classes is not assumed
to be disjoint as in ordinary (binary or multiclass) classification. This is also called any-of classification.
The estimators provided in this module are meta-estimators: they require a base estimator to be provided in their
constructor. For example, it is possible to use these estimators to turn a binary classifier or a regressor into a multiclass
classifier. It is also possible to use these estimators with multiclass estimators in the hope that their accuracy or runtime
performance improves.
Note: You don't need to use these estimators unless you want to experiment with different multiclass strategies:
all classifiers in scikit-learn support multiclass classification out-of-the-box. Below is a summary of the classifiers
supported in scikit-learn grouped by the strategy used.
- Inherently multiclass: Naive Bayes, sklearn.lda.LDA, Decision Trees, Random Forests
- One-Vs-One: sklearn.svm.SVC.
- One-Vs-All: sklearn.svm.LinearSVC, sklearn.linear_model.LogisticRegression,
  sklearn.linear_model.SGDClassifier, sklearn.linear_model.RidgeClassifier.
Note: At the moment there are no evaluation metrics implemented for multilabel learning.
One-Vs-The-Rest
This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in
fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its
computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability.
Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by
inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice. Below is
an example:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> OneVsRestClassifier(LinearSVC()).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Multilabel learning with OvR
OneVsRestClassifier also supports multilabel classification. To use this feature, feed the classifier a list of
tuples containing target labels, like in the example below.
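A minimal sketch of the label-tuple format (data and labels are arbitrary; the exact set of accepted label formats may
vary between releases):
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> X = [[0, 0], [0, 1], [1, 1]]
>>> y = [(0,), (0, 1), (1,)]          # each sample may carry several labels
>>> clf = OneVsRestClassifier(LinearSVC())
>>> clf = clf.fit(X, y)
>>> predictions = clf.predict(X)      # returns one label set per sample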
Examples:
Multilabel classication
One-Vs-One
OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received
the most votes is selected. Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually
slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for
algorithms such as kernel algorithms which don't scale well with n_samples. This is because each individual learning
problem only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes
times. Below is an example:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsOneClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> OneVsOneClassifier(LinearSVC()).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Error-Correcting Output-Codes
Output-code based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class
is represented in a Euclidean space, where each dimension can only be 0 or 1. Another way to put it is that each class
is represented by a binary code (an array of 0s and 1s). The matrix which keeps track of the location/code of each class
is called the code book. The code size is the dimensionality of the aforementioned space. Intuitively, each class should
be represented by a code as unique as possible and a good code book should be designed to optimize classification
accuracy. In this implementation, we simply use a randomly-generated code book as advocated in [2], although more
elaborate methods may be added in the future.
At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to
project new points in the class space and the class closest to the points is chosen.
In OutputCodeClassifier, the code_size attribute allows the user to control the number of classifiers which will
be used. It is a percentage of the total number of classes.
A number between 0 and 1 will require fewer classifiers than one-vs-the-rest. In theory, log2(n_classes) /
n_classes is sufficient to represent each class unambiguously. However, in practice, it may not lead to good
accuracy since log2(n_classes) is much smaller than n_classes.
A number greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in
theory correct for the mistakes made by other classifiers, hence the name "error-correcting". In practice, however, this
may not happen as classifier mistakes will typically be correlated. The error-correcting output codes have a similar
effect to bagging.
Example:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OutputCodeClassifier
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> OutputCodeClassifier(LinearSVC(), code_size=2, random_state=0).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
References:
1.3.11 Feature selection
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality re-
duction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-
dimensional datasets.
Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as
a preprocessing step to an estimator. Scikit-Learn exposes feature selection routines as objects that implement the
[2] The error coding method and PICTs, James G., Hastie T., Journal of Computational and Graphical Statistics 7, 1998.
transform method:
selecting the k-best features: SelectKBest
setting a percentile of features to keep: SelectPercentile
using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate
SelectFdr, or family wise error SelectFwe.
These objects take as input a scoring function that returns univariate p-values:
For regression: f_regression
For classification: chi2 or f_classif
Feature selection with sparse data
If you use sparse data (i.e. data represented as sparse matrices), only chi2 will deal with the data without
making it dense.
Warning: Beware not to use a regression scoring function with a classification problem, you will get useless
results.
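For instance, a chi-squared test can be used to keep only the most informative features of the iris dataset. The
following is a minimal sketch; keeping k=2 features is an arbitrary choice for illustration:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> # keep the 2 features with the highest chi-squared scores
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)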
Examples:
Univariate Feature Selection
Recursive feature elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature
elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the
estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose
absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on
the pruned set until the desired number of features to select is eventually reached.
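As a minimal sketch of the interface, RFE can be wrapped around any estimator that exposes feature weights, for
instance a linear SVM; the number of features kept below is an arbitrary choice for illustration:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVC
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> # recursively drop the feature with the smallest absolute weight, keeping 2 of them
>>> rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2, step=1)
>>> X_new = rfe.fit(X, y).transform(X)
>>> X_new.shape
(150, 2)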
Examples:
Recursive feature elimination: A recursive feature elimination example showing the relevance of pixels in
a digit classification task.
Recursive feature elimination with cross-validation: A recursive feature elimination example with auto-
matic tuning of the number of features selected with cross-validation.
L1-based feature selection
Selecting non-zero coefficients
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When
the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to se-
lect the non-zero coefficients. In particular, sparse estimators useful for this purpose are the linear_model.Lasso
for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification:
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
>>> X_new.shape
(150, 3)
With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected.
With Lasso, the higher the alpha parameter, the fewer features selected.
Examples:
Classification of text documents using sparse features: Comparison of different algorithms for document
classification including L1-based feature selection.
L1-recovery and compressive sensing
For a good choice of alpha, the Lasso can fully recover the exact set of non-zero variables using only few
observations, provided certain specific conditions are met. In particular, the number of samples should be
sufficiently large, or L1 models will perform at random, where sufficiently large depends on the number of
non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value
of non-zero coefficients, and the structure of the design matrix X. In addition, the design matrix must display
certain specific properties, such as not being too correlated.
There is no general rule to select an alpha parameter for recovery of non-zero coefficients. It can be set by
cross-validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a
small number of non-relevant variables is not detrimental to prediction score. BIC (LassoLarsIC) tends, on
the contrary, to set high values of alpha.
Reference: Richard G. Baraniuk, Compressive Sensing, IEEE Signal Processing Magazine [120] July 2007
http://dsp.rice.edu/files/cs/baraniukCSlecture07.pdf
Randomized sparse models
The limitation of L1-based sparse models is that, when faced with a group of very correlated features, they will select only
one. To mitigate this problem, it is possible to use randomization techniques, reestimating the sparse model many
times while perturbing the design matrix or sub-sampling the data, and counting how many times a given regressor is selected.
RandomizedLasso implements this strategy for regression settings, using the Lasso, while
RandomizedLogisticRegression uses the logistic regression and is suitable for classification tasks. To
get a full path of stability scores you can use lasso_stability_path.
Note that for randomized sparse models to be more powerful than standard F statistics at detecting non-zero features,
the ground truth model should be sparse, in other words, there should be only a small fraction of features non zero.
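A rough sketch of the intended usage on a regression problem is given below; the alpha value is arbitrary, and the
scores_ attribute (assumed here to hold the per-feature selection frequencies) is the quantity of interest:
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import RandomizedLasso
>>> boston = load_boston()
>>> X, y = boston.data, boston.target
>>> clf = RandomizedLasso(alpha=0.025, random_state=0)
>>> clf = clf.fit(X, y)
>>> len(clf.scores_)   # one stability score per feature
13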
Examples:
Sparse recovery: feature selection for sparse linear models: An example comparing different feature
selection approaches and discussing in which situation each approach is to be favored.
References:
N. Meinshausen, P. Buhlmann, "Stability selection", Journal of the Royal Statistical Society, 72 (2010).
http://arxiv.org/pdf/0809.2932
F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap". http://hal.inria.fr/hal-00354771/
Tree-based feature selection
Tree-based estimators (see the sklearn.tree module and forests of trees in the sklearn.ensemble module)
can be used to compute feature importances, which in turn can be used to discard irrelevant features:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier(compute_importances=True, random_state=0)
>>> X_new = clf.fit(X, y).transform(X)
>>> X_new.shape
(150, 2)
Examples:
Feature importances with forests of trees: example on synthetic data showing the recovery of the actually
meaningful features.
Pixel importances with a parallel forest of trees: example on face recognition data.
1.3.12 Semi-Supervised
Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The
semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data
to better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms
can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.
Unlabeled entries in y
It is important to assign an identifier to unlabeled points along with the labeled data when training the model
with the fit method. The identifier that this implementation uses is the integer value -1.
Label Propagation
Label propagation denotes a few variations of semi-supervised graph inference algorithms.
A few features available in this model:
Can be used for classification and regression tasks
Kernel methods to project data into alternate dimensional spaces
scikit-learn provides two label propagation models: LabelPropagation and LabelSpreading. Both work by
constructing a similarity graph over all items in the input dataset.
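As a small sketch of the interface, unlabeled samples are marked with -1 in the target vector before fitting; here a
random 30 percent of the iris labels are hidden purely for illustration:
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelPropagation
>>> iris = datasets.load_iris()
>>> rng = np.random.RandomState(42)
>>> labels = np.copy(iris.target)
>>> labels[rng.rand(len(labels)) < 0.3] = -1   # hide about 30 percent of the labels
>>> label_prop_model = LabelPropagation()
>>> label_prop_model = label_prop_model.fit(iris.data, labels)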
Figure 1.1: An illustration of label-propagation: the structure of unlabeled observations is consistent with the class
structure, and thus the class label can be propagated to the unlabeled observations of the training set.
LabelPropagation and LabelSpreading differ in the modifications made to the similarity matrix of the graph and in the
clamping effect on the label distributions. Clamping allows the algorithm to change the weight of the true ground
labeled data to some degree. The LabelPropagation algorithm performs hard clamping of input labels, which
means alpha = 1. This clamping factor can be relaxed, to say alpha = 0.8, which means that we will always retain 80 percent
of our original label distribution, but the algorithm gets to change its confidence in the distribution within 20 percent.
LabelPropagation uses the raw similarity matrix constructed from the data with no modifications. In contrast,
LabelSpreading minimizes a loss function that has regularization properties, as such it is often more robust to
noise. The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing
the normalized graph Laplacian matrix. This procedure is also used in Spectral clustering.
Label propagation models have two built-in kernel methods. Choice of kernel affects both scalability and performance
of the algorithms. The following are available:
rbf (exp(-gamma |x - y|^2), gamma > 0). gamma is specified by the keyword gamma.
knn (1[x' in kNN(x)]). k is specified by the keyword n_neighbors.
Linear and Quadratic Discriminant Analysis
The predicted class is the one maximizing the posterior probability P(y|X), which is proportional to P(X|y) P(y).
In linear and quadratic discriminant analysis, P(X|y) is modeled as a Gaussian distribution. In the case of LDA, the
Gaussians for each class are assumed to share the same covariance matrix. This leads to a linear decision surface, as
can be seen by comparing the log-probability ratios log[P(y = k|X)/P(y = l|X)].
In the case of QDA, there are no assumptions on the covariance matrices of the Gaussians, leading to a quadratic
decision surface.
1.4 Unsupervised learning
1.4.1 Gaussian mixture models
sklearn.mixture is a package which enables one to learn Gaussian Mixture Models (diagonal, spherical, tied and full
covariance matrices supported), sample them, and estimate them from data. Facilities to help determine the appropriate
number of components are also provided.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a
finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing
k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the
latent Gaussians.
Scikit-learn implements different classes to estimate Gaussian mixture models, corresponding to different estimation
strategies, detailed below.
Figure 1.2: Two-component Gaussian mixture model: data points, and equi-probability surfaces of the model.
GMM classifier
The GMM object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It
can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess
the number of clusters in the data. A GMM.fit method is provided that learns a Gaussian Mixture Model from train
data. Given test data, it can assign to each sample the class of the Gaussian it most probably belongs to using the
GMM.predict method.
The GMM comes with different options to constrain the covariance of the different classes estimated: spherical,
diagonal, tied or full covariance.
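A minimal sketch of this interface is shown below; the number of components is an arbitrary choice matching the
two blobs of the toy data:
>>> import numpy as np
>>> from sklearn.mixture import GMM
>>> np.random.seed(1)
>>> # two well-separated blobs of 1-D points
>>> obs = np.concatenate((np.random.randn(100, 1), 10 + np.random.randn(300, 1)))
>>> g = GMM(n_components=2)
>>> g = g.fit(obs)
>>> labels = g.predict(obs)            # hard assignment of each sample to a component
>>> posteriors = g.predict_proba(obs)  # soft (probabilistic) assignments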
Examples:
See GMM classification for an example of using a GMM as a classifier on the iris dataset.
See Density Estimation for a mixture of Gaussians for an example on plotting the density estimation.
Pros and cons of class GMM: expectation-maximization inference
Pros
Speed: it is the fastest algorithm for learning mixture models
Agnostic: as this algorithm maximizes only the likelihood, it will not bias the means towards zero, or bias
the cluster sizes to have specific structures that might or might not apply.
Cons
Singularities: when one has insufficiently many points per mixture, estimating the covariance matrices
becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood
unless one regularizes the covariances artificially.
Number of components: this algorithm will always use all the components it has access to, needing held-
out data or information theoretical criteria to decide how many components to use in the absence of
external cues.
Selecting the number of components in a classical GMM
The BIC criterion can be used to select the number of components in a GMM in an efficient way. In theory, it recovers
the true number of components only in the asymptotic regime (i.e. if much data is available). Note that using a
DPGMM avoids the specification of the number of components for a Gaussian mixture model.
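A rough sketch of such a selection loop, assuming the criterion is exposed as a bic method as in the model selection
example (the toy data and the candidate range are arbitrary):
>>> import numpy as np
>>> from sklearn.mixture import GMM
>>> np.random.seed(1)
>>> X = np.concatenate((np.random.randn(100, 2), 5 + np.random.randn(100, 2)))
>>> candidates = [GMM(n_components=n).fit(X) for n in range(1, 6)]
>>> bics = [g.bic(X) for g in candidates]
>>> best = candidates[np.argmin(bics)]   # the fitted model with the lowest BIC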
Examples:
See Gaussian Mixture Model Selection for an example of model selection performed with classical GMM.
Estimation algorithm: expectation-maximization
The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn't know
which points came from which latent component (if one has access to this information it gets very easy to fit a separate
Gaussian distribution to each set of points). Expectation-maximization is a well-founded statistical algorithm to
get around this problem by an iterative process. First one assumes random components (randomly centered on data
points, learned from k-means, or even just normally distributed around the origin) and computes for each point a
probability of being generated by each component of the model. Then, one tweaks the parameters to maximize the
likelihood of the data given those assignments. Repeating this process is guaranteed to always converge to a local
optimum.
VBGMM classifier: variational Gaussian mixtures
The VBGMM object implements a variant of the Gaussian mixture model with variational inference algorithms. The
API is identical to GMM. It is essentially a middle-ground between GMM and DPGMM, as it has some of the properties of
the Dirichlet process.
Pros and cons of class VBGMM: variational inference
Pros
Regularization: due to the incorporation of prior information, variational solutions have less pathological
special cases than expectation-maximization solutions. One can then use full covariance matrices
in high dimensions or in cases where some components might be centered around a single point
without risking divergence.
Cons
Bias: to regularize a model one has to add biases. The variational algorithm will bias all the means
towards the origin (part of the prior information adds a ghost point in the origin to every mixture
component) and it will bias the covariances to be more spherical. It will also, depending on the
concentration parameter, bias the cluster structure either towards uniformity or towards a rich-get-
richer scenario.
Hyperparameters: this algorithm needs an extra hyperparameter that might need experimental tuning via
cross-validation.
Estimation algorithm: variational inference
Variational inference is an extension of expectation-maximization that maximizes a lower bound on model evidence
(including priors) instead of data likelihood. The principle behind variational methods is the same as expectation-
maximization (that is, both are iterative algorithms that alternate between finding the probabilities for each point to
be generated by each mixture and fitting the mixtures to these assigned points), but variational methods add regular-
ization by integrating information from prior distributions. This avoids the singularities often found in expectation-
maximization solutions but introduces some subtle biases to the model. Inference is often notably slower, but not
usually so much so as to render usage impractical.
Due to its Bayesian nature, the variational algorithm needs more hyper-parameters than expectation-maximization,
the most important of these being the concentration parameter alpha. Specifying a high value of alpha leads more
often to uniformly-sized mixture components, while specifying small (between 0 and 1) values will lead to some
mixture components getting almost all the points while most mixture components will be centered on just a few of the
remaining points.
DPGMM classifier: Infinite Gaussian mixtures
The DPGMM object implements a variant of the Gaussian mixture model with a variable (but bounded) number of
components using the Dirichlet Process. The API is identical to GMM. This class doesn't require the user to choose the
number of components: at the expense of extra computational time, the user only needs to specify a loose upper
bound on this number and a concentration parameter.
The examples above compare Gaussian mixture models with a fixed number of components to DPGMM models. On
the left the GMM is fitted with 5 components on a dataset composed of 2 clusters. We can see that the DPGMM
is able to limit itself to only 2 components whereas the GMM fits the data with too many components. Note that with
very few observations, the DPGMM can take a conservative stand, and fit only one component. On the right we are
fitting a dataset not well-depicted by a mixture of Gaussians. Adjusting the alpha parameter of the DPGMM controls
the number of components used to fit this data.
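A minimal sketch of the corresponding call, with an arbitrary upper bound of 5 components and an arbitrary
concentration parameter:
>>> import numpy as np
>>> from sklearn.mixture import DPGMM
>>> np.random.seed(1)
>>> X = np.concatenate((np.random.randn(100, 2), 5 + np.random.randn(100, 2)))
>>> dpgmm = DPGMM(n_components=5, alpha=1.0)   # loose upper bound of 5 components
>>> dpgmm = dpgmm.fit(X)
>>> n_used = len(np.unique(dpgmm.predict(X)))  # number of components actually used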
Examples:
See Gaussian Mixture Model Ellipsoids for an example on plotting the confidence ellipsoids for both GMM
and DPGMM.
Gaussian Mixture Model Sine Curve shows using GMM and DPGMM to fit a sine wave
Pros and cons of class DPGMM: Dirichlet process mixture model
Pros
Less sensitivity to the number of parameters: unlike finite models, which will almost always use all
components as much as they can, and hence will produce wildly different solutions for different
numbers of components, the Dirichlet process solution won't change much with changes to the
parameters, leading to more stability and less tuning.
No need to specify the number of components: only an upper bound of this number needs to be pro-
vided. Note however that the DPMM is not a formal model selection procedure, and thus provides
no guarantee on the result.
Cons
Speed: the extra parametrization necessary for variational inference and for the structure of the Dirichlet
process can and will make inference slower, although not by much.
Bias: as in variational techniques, but only more so, there are many implicit biases in the Dirichlet process
and the inference algorithms, and whenever there is a mismatch between these biases and the data it
might be possible to fit better models using a finite mixture.
The Dirichlet Process
Here we describe variational inference algorithms on Dirichlet process mixtures. The Dirichlet process is a prior
probability distribution on clusterings with an infinite, unbounded, number of partitions. Variational techniques let us
incorporate this prior structure on Gaussian mixture models at almost no penalty in inference time, compared with a
finite Gaussian mixture model.
An important question is how the Dirichlet process can use an infinite, unbounded number of clusters and still be
consistent. While a full explanation doesn't fit this manual, one can think of its Chinese restaurant process analogy to
help understand it. The Chinese restaurant process is a generative story for the Dirichlet process. Imagine a Chinese
restaurant with an infinite number of tables, at first all empty. When the first customer of the day arrives, he sits at
the first table. Every following customer will then either sit at an occupied table with probability proportional to the
number of customers at that table, or sit at an entirely new table with probability proportional to the concentration
parameter alpha. After a finite number of customers has sat, it is easy to see that only finitely many of the infinite
tables will ever be used, and the higher the value of alpha the more total tables will be used. So the Dirichlet process
does clustering with an unbounded number of mixture components by assuming a very asymmetrical prior structure
over the assignments of points to components that is very concentrated (this property is known as rich-get-richer, as
the full tables in the Chinese restaurant process only tend to get fuller as the simulation progresses).
Variational inference techniques for the Dirichlet process still work with a finite approximation to this infinite mixture
model, but instead of having to specify a priori how many components one wants to use, one just specifies the concen-
tration parameter and an upper bound on the number of mixture components (this upper bound, assuming it is higher
than the true number of components, affects only algorithmic complexity, not the actual number of components
used).
Derivation:
See here the full derivation of this algorithm.
Variational Gaussian Mixture Models
The API is identical to that of the GMM class, the main difference being that it offers access to precision matrices as
well as covariance matrices.
The inference algorithm is the one from the following paper:
Variational Inference for Dirichlet Process Mixtures, David Blei, Michael Jordan. Bayesian Analysis, 2006
While this paper presents the parts of the inference algorithm that are concerned with the structure of the Dirichlet pro-
cess, it does not go into detail in the mixture modeling part, which can be just as complex, or even more. For this reason
we present here a full derivation of the inference algorithm and all the update and lower-bound equations. If you're
not interested in learning how to derive similar algorithms yourself and you're not interested in changing/debugging
the implementation in the scikit, this document is not for you.
The complexity of this implementation is linear in the number of mixture components and data points. With regards
to the dimensionality, it is linear when using spherical or diag and quadratic/cubic when using tied or full. For
spherical or diag it is O(n_states * n_points * dimension) and for tied or full it is O(n_states * n_points * dimension^2
+ n_states * dimension^3) (it is necessary to invert the covariance/precision matrices and compute its determinant,
hence the cubic term).
This implementation is expected to scale at least as well as EM for the mixture of Gaussians.
Update rules for VB inference
Here the full mathematical derivation of the Variational Bayes update rules for Gaussian Mixture Models is given. The
main parameters of the model, defined for any class $k \in [1..K]$, are the class proportion $\phi_k$, the mean
parameters $\mu_k$ and the covariance parameters $\Sigma_k$, the latter being characterized by a variational Wishart
density, $\mathrm{Wishart}(a_k, \mathbf{B}_k)$, where $a$ is the degrees of freedom and $\mathbf{B}$ the scale matrix. Depending on the
covariance parameterization, $\mathbf{B}_k$ can be a positive scalar, a positive vector or a Symmetric Positive Definite matrix.
The spherical model
The model then is

    \phi_k \sim \mathrm{Beta}(1, \alpha_1)
    \mu_k \sim \mathrm{Normal}(0, \mathbf{I})
    \sigma_k \sim \mathrm{Gamma}(1, 1)
    z_i \sim \mathrm{SBP}(\phi)
    X_t \sim \mathrm{Normal}(\mu_{z_i}, \tfrac{1}{\sigma_{z_i}} \mathbf{I})

The variational distribution we'll use is

    \phi_k \sim \mathrm{Beta}(\gamma_{k,1}, \gamma_{k,2})
    \mu_k \sim \mathrm{Normal}(\nu_{\mu_k}, \mathbf{I})
    \sigma_k \sim \mathrm{Gamma}(a_k, b_k)
    z_i \sim \mathrm{Discrete}(\nu_{z_i})
The bound
The variational bound is

    \log P(X) \geq \sum_k \big( E_q[\log P(\phi_k)] - E_q[\log Q(\phi_k)] \big)
        + \sum_k \big( E_q[\log P(\mu_k)] - E_q[\log Q(\mu_k)] \big)
        + \sum_k \big( E_q[\log P(\sigma_k)] - E_q[\log Q(\sigma_k)] \big)
        + \sum_i \big( E_q[\log P(z_i)] - E_q[\log Q(z_i)] \big)
        + \sum_i E_q[\log P(X_t)]
The bound for \phi_k

    E_q[\log \mathrm{Beta}(1, \alpha)] - E_q[\log \mathrm{Beta}(\gamma_{k,1}, \gamma_{k,2})]
      = \log \Gamma(1 + \alpha) - \log \Gamma(\alpha)
        + (\alpha - 1)\big(\psi(\gamma_{k,2}) - \psi(\gamma_{k,1} + \gamma_{k,2})\big)
        - \log \Gamma(\gamma_{k,1} + \gamma_{k,2}) + \log \Gamma(\gamma_{k,1}) + \log \Gamma(\gamma_{k,2})
        - (\gamma_{k,1} - 1)\big(\psi(\gamma_{k,1}) - \psi(\gamma_{k,1} + \gamma_{k,2})\big)
        - (\gamma_{k,2} - 1)\big(\psi(\gamma_{k,2}) - \psi(\gamma_{k,1} + \gamma_{k,2})\big)
The bound for \mu_k

    E_q[\log P(\mu_k)] - E_q[\log Q(\mu_k)]
      = \int d\mu_f\, q(\mu_f) \log P(\mu_f) - \int d\mu_f\, q(\mu_f) \log Q(\mu_f)
      = -\frac{D}{2}\log 2\pi - \frac{1}{2}\|\nu_{\mu_k}\|^2 - \frac{D}{2} + \frac{D}{2}\log 2\pi e
The bound for \sigma_k
Here I'll use the inverse scale parametrization of the gamma distribution.

    E_q[\log P(\sigma_k)] - E_q[\log Q(\sigma_k)]
      = \log \Gamma(a_k) - (a_k - 1)\psi(a_k) - \log b_k + a_k - \frac{a_k}{b_k}
The bound for z

    E_q[\log P(z)] - E_q[\log Q(z)]
      = \sum_i \sum_k \Big( \big( \textstyle\sum_{j=k+1}^{K} \nu_{z_{i,j}} \big)
            \big(\psi(\gamma_{k,2}) - \psi(\gamma_{k,1} + \gamma_{k,2})\big)
          + \nu_{z_{i,k}} \big(\psi(\gamma_{k,1}) - \psi(\gamma_{k,1} + \gamma_{k,2})\big)
          - \nu_{z_{i,k}} \log \nu_{z_{i,k}} \Big)
The bound for X
Recall that there is no need for a Q(X) so this bound is just

    E_q[\log P(X_i)] = \sum_k \nu_{z_{i,k}} \Big( -\frac{D}{2}\log 2\pi
        + \frac{D}{2}\big(\psi(a_k) - \log(b_k)\big)
        - \frac{a_k}{2 b_k}\big(\|X_i - \nu_{\mu_k}\|^2 + D\big) - \log 2\pi e \Big)

For simplicity I'll later call the term inside the parenthesis E_q[\log P(X_i | z_i = k)].
The updates
Updating \gamma

    \gamma_{k,1} = 1 + \sum_i \nu_{z_{i,k}}
    \gamma_{k,2} = \alpha + \sum_i \sum_{j > k} \nu_{z_{i,j}}
Updating \mu
The updates for mu essentially are just weighted expectations of X regularized by the prior. We can see this by taking
the gradient of the bound w.r.t. \nu_{\mu_k}:

    \nabla_{\nu_{\mu_k}} = -\nu_{\mu_k} + \sum_i \nu_{z_{i,k}} \frac{a_k}{b_k} (X_i - \nu_{\mu_k})

so the update is

    \nu_{\mu_k} = \frac{\sum_i \nu_{z_{i,k}} \frac{a_k}{b_k} X_i}{1 + \sum_i \nu_{z_{i,k}} \frac{a_k}{b_k}}
Updating a and b
For some odd reason it doesn't really work when you derive the updates for a and b using the gradients of the lower
bound (terms involving the \psi function show up and a is hard to isolate). However, we can use the other formula,

    \log Q(\sigma_k) = E_{v \neq \sigma_k}[\log P] + \mathrm{const}

All the terms not involving \sigma_k get folded over into the constant and we get two terms: the prior and the probability of
X. This gives us

    \log Q(\sigma_k) = -\sigma_k + \frac{D}{2} \sum_i \nu_{z_{i,k}} \log \sigma_k
        - \frac{\sigma_k}{2} \sum_i \nu_{z_{i,k}} \big(\|X_i - \nu_{\mu_k}\|^2 + D\big)

This is the log of a gamma distribution, with

    a_k = 1 + \frac{D}{2} \sum_i \nu_{z_{i,k}}

and

    b_k = 1 + \frac{1}{2} \sum_i \nu_{z_{i,k}} \big(\|X_i - \nu_{\mu_k}\|^2 + D\big).

You can verify this by normalizing the previous term.
Updating z

    \log \nu_{z_{i,k}} \propto \psi(\gamma_{k,1}) - \psi(\gamma_{k,1} + \gamma_{k,2})
        + E_Q[\log P(X_i | z_i = k)]
        + \sum_{j < k} \big(\psi(\gamma_{j,2}) - \psi(\gamma_{j,1} + \gamma_{j,2})\big).
The diagonal model
The model then is

    \phi_k \sim \mathrm{Beta}(1, \alpha_1)
    \mu_k \sim \mathrm{Normal}(0, \mathbf{I})
    \sigma_{k,d} \sim \mathrm{Gamma}(1, 1)
    z_i \sim \mathrm{SBP}(\phi)
    X_t \sim \mathrm{Normal}(\mu_{z_i}, \tfrac{1}{\sigma_{z_i}})

The variational distribution we'll use is

    \phi_k \sim \mathrm{Beta}(\gamma_{k,1}, \gamma_{k,2})
    \mu_k \sim \mathrm{Normal}(\nu_{\mu_k}, \mathbf{I})
    \sigma_{k,d} \sim \mathrm{Gamma}(a_{k,d}, b_{k,d})
    z_i \sim \mathrm{Discrete}(\nu_{z_i})
The lower bound
The changes in this lower bound from the previous model are in the distributions of \sigma (as there are a lot more
\sigma's now) and X.
The bound for \sigma_{k,d} is the same as the bound for \sigma_k and can be safely omitted.
The bound for X:
The main difference here is that the precision matrix \sigma_k scales the norm, so we have an extra term after computing the
expectation of \mu_k^T \sigma_k \mu_k, which is \nu_{\mu_k}^T \sigma_k \nu_{\mu_k} + \sum_d \sigma_{k,d}. We then have

    E_q[\log P(X_i)] = \sum_k \nu_{z_{i,k}} \Big( -\frac{D}{2}\log 2\pi
        + \frac{1}{2} \sum_d \big(\psi(a_{k,d}) - \log(b_{k,d})\big)
        - \frac{1}{2} \Big( (X_i - \nu_{\mu_k})^T \frac{a_k}{b_k} (X_i - \nu_{\mu_k}) + \sum_d \frac{a_{k,d}}{b_{k,d}} \Big)
        - \log 2\pi e \Big)
The updates
The updates only change for \nu_\mu (to weight them with the new \sigma), z (but the change is all folded into the
E_q[\log P(X_i | z_i = k)] term), and the a and b variables themselves.
The update for \nu_\mu

    \nu_{\mu_k} = \Big( \mathbf{I} + \sum_i \nu_{z_{i,k}} \frac{a_k}{b_k} \Big)^{-1}
                  \Big( \sum_i \nu_{z_{i,k}} \frac{a_k}{b_k} X_i \Big)
The updates for a and b
Here we'll do something very similar to the spherical model. The main difference is that now each \sigma_{k,d} controls only
one dimension of the bound:

    \log Q(\sigma_{k,d}) = -\sigma_{k,d} + \frac{1}{2} \sum_i \nu_{z_{i,k}} \log \sigma_{k,d}
        - \frac{\sigma_{k,d}}{2} \sum_i \nu_{z_{i,k}} \big( (X_{i,d} - \nu_{\mu_{k,d}})^2 + 1 \big)

Hence

    a_{k,d} = 1 + \frac{1}{2} \sum_i \nu_{z_{i,k}}

    b_{k,d} = 1 + \frac{1}{2} \sum_i \nu_{z_{i,k}} \big( (X_{i,d} - \nu_{\mu_{k,d}})^2 + 1 \big)
The tied model
The model then is

    \phi_k \sim \mathrm{Beta}(1, \alpha_1)
    \mu_k \sim \mathrm{Normal}(0, \mathbf{I})
    \Sigma \sim \mathrm{Wishart}(D, \mathbf{I})
    z_i \sim \mathrm{SBP}(\phi)
    X_t \sim \mathrm{Normal}(\mu_{z_i}, \Sigma^{-1})

The variational distribution we'll use is

    \phi_k \sim \mathrm{Beta}(\gamma_{k,1}, \gamma_{k,2})
    \mu_k \sim \mathrm{Normal}(\nu_{\mu_k}, \mathbf{I})
    \Sigma \sim \mathrm{Wishart}(a, \mathbf{B})
    z_i \sim \mathrm{Discrete}(\nu_{z_i})
The lower bound
There are two changes in the lower-bound: for \Sigma and for X.
The bound for \Sigma

    -\Big( \frac{D^2}{2}\log 2 + \sum_d \log \Gamma\big(\tfrac{D+1-d}{2}\big) \Big)
    + \frac{aD}{2}\log 2 + \frac{a}{2}\log |\mathbf{B}| + \sum_d \log \Gamma\big(\tfrac{a+1-d}{2}\big)
    + \frac{D-a}{2}\Big( \sum_d \psi\big(\tfrac{a+1-d}{2}\big) + D \log 2 + \log |\mathbf{B}| \Big)
    - \frac{1}{2}\, a\, \mathrm{tr}[\mathbf{B} - \mathbf{I}]
The bound for X

    E_q[\log P(X_i)] = \sum_k \nu_{z_{i,k}} \Big( -\frac{D}{2}\log 2\pi
        + \frac{1}{2} \Big( \sum_d \psi\big(\tfrac{a+1-d}{2}\big) + D \log 2 + \log |\mathbf{B}| \Big)
        - \frac{1}{2} \Big( (X_i - \nu_{\mu_k})^T a \mathbf{B} (X_i - \nu_{\mu_k}) + a\, \mathrm{tr}(\mathbf{B}) \Big)
        - \log 2\pi e \Big)
The updates
As in the last setting, what changes are the trivial update for z, the update for \nu_\mu and the update for a and \mathbf{B}.
The update for \nu_\mu

    \nu_{\mu_k} = \Big( \mathbf{I} + a \mathbf{B} \sum_i \nu_{z_{i,k}} \Big)^{-1}
                  \Big( a \mathbf{B} \sum_i \nu_{z_{i,k}} X_i \Big)
The update for a and B
As this distribution is far too complicated I'm not even going to try going at it the gradient way.

    \log Q(\Sigma) = \frac{1}{2}\log |\Sigma| - \frac{1}{2}\mathrm{tr}[\Sigma]
        + \sum_{i,k} \nu_{z_{i,k}} \Big( \frac{1}{2}\log |\Sigma|
            - \frac{1}{2}\big( (X_i - \nu_{\mu_k})^T \Sigma (X_i - \nu_{\mu_k}) + \mathrm{tr}[\Sigma] \big) \Big)

which non-trivially (seeing that the quadratic form with \Sigma in the middle can be expressed as the trace of something)
reduces to

    \log Q(\Sigma) = \frac{1}{2}\log |\Sigma| - \frac{1}{2}\mathrm{tr}[\Sigma]
        + \sum_{i,k} \nu_{z_{i,k}} \Big( \frac{1}{2}\log |\Sigma|
            - \frac{1}{2}\big( \mathrm{tr}[(X_i - \nu_{\mu_k})(X_i - \nu_{\mu_k})^T \Sigma] + \mathrm{tr}[\mathbf{I}\,\Sigma] \big) \Big)
hence this (with a bit of squinting) looks like a Wishart with parameters

    a = 2 + D + T

and

    \mathbf{B} = \Big( \mathbf{I} + \sum_{i,k} \nu_{z_{i,k}} (X_i - \nu_{\mu_k})(X_i - \nu_{\mu_k})^T \Big)^{-1}
The full model
The model then is

    \phi_k \sim \mathrm{Beta}(1, \alpha_1)
    \mu_k \sim \mathrm{Normal}(0, \mathbf{I})
    \Sigma_k \sim \mathrm{Wishart}(D, \mathbf{I})
    z_i \sim \mathrm{SBP}(\phi)
    X_t \sim \mathrm{Normal}(\mu_{z_i}, \Sigma_{z_i}^{-1})

The variational distribution we'll use is

    \phi_k \sim \mathrm{Beta}(\gamma_{k,1}, \gamma_{k,2})
    \mu_k \sim \mathrm{Normal}(\nu_{\mu_k}, \mathbf{I})
    \Sigma_k \sim \mathrm{Wishart}(a_k, \mathbf{B}_k)
    z_i \sim \mathrm{Discrete}(\nu_{z_i})
The lower bound
All that changes in this lower bound in comparison to the previous one is that there are K priors on different precision
matrices and there are the correct indices on the bound for X.
The updates
All that changes in the updates is that the update for mu uses only the proper sigma and the updates for a and B don't
have a sum over K, so

    \nu_{\mu_k} = \Big( \mathbf{I} + a_k \mathbf{B}_k \sum_i \nu_{z_{i,k}} \Big)^{-1}
                  \Big( a_k \mathbf{B}_k \sum_i \nu_{z_{i,k}} X_i \Big)

    a_k = 2 + D + \sum_i \nu_{z_{i,k}}

and

    \mathbf{B}_k = \Big( \big( \sum_i \nu_{z_{i,k}} + 1 \big) \mathbf{I}
                   + \sum_i \nu_{z_{i,k}} (X_i - \nu_{\mu_k})(X_i - \nu_{\mu_k})^T \Big)^{-1}
1.4.2 Manifold learning
Look for the bare necessities
The simple bare necessities
Forget about your worries and your strife
I mean the bare necessities
Old Mother Nature's recipes
That bring the bare necessities of life
Baloo's song [The Jungle Book]
Manifold learning is an approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea
that the dimensionality of many data sets is only artificially high.
Introduction
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to
show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization
of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though
this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired.
In a random projection, it is likely that the more interesting structure within the data will be lost.
To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have
been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant
Analysis, and others. These algorithms define specific rubrics to choose an interesting linear projection of the data.
These methods can be powerful, but often miss important nonlinear structure in the data.
Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-
linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it
learns the high-dimensional structure of the data from the data itself, without the use of predetermined classications.
Examples:
See Manifold learning on handwritten digits: Locally Linear Embedding, Isomap... for an example of
dimensionality reduction on handwritten digits.
See Comparison of Manifold Learning methods for an example of dimensionality reduction on a toy S-
curve dataset.
The manifold learning implementations available in sklearn are summarized below
Isomap
One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can
be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional
embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.
Complexity
The Isomap algorithm comprises three stages:
1. Nearest neighbor search. Isomap uses sklearn.neighbors.BallTree for efficient neighbor search.
The cost is approximately O[D log(k) N log(N)], for k nearest neighbors of N points in D dimensions.
2. Shortest-path graph search. The most efficient known algorithms for this are Dijkstra's Algorithm, which is
approximately O[N^2 (k + log(N))], or the Floyd-Warshall algorithm, which is O[N^3]. The algorithm can be
selected by the user with the path_method keyword of Isomap. If unspecified, the code attempts to choose
the best algorithm for the input data.
3. Partial eigenvalue decomposition. The embedding is encoded in the eigenvectors corresponding to the d
largest eigenvalues of the N x N isomap kernel. For a dense solver, the cost is approximately O[d N^2]. This
cost can often be improved using the ARPACK solver. The eigensolver can be specified by the user with the
eigen_solver keyword of Isomap. If unspecified, the code attempts to choose the best algorithm for the
input data.
The overall complexity of Isomap is O[D log(k) N log(N)] + O[N^2 (k + log(N))] + O[d N^2].
N : number of training data points
D : input dimension
k : number of nearest neighbors
d : output dimension
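A minimal sketch of the corresponding call on the digits dataset (the number of neighbors and the output dimension
are arbitrary choices):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import Isomap
>>> X = load_digits().data
>>> X.shape
(1797, 64)
>>> X_iso = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
>>> X_iso.shape
(1797, 2)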
References:
A global geometric framework for nonlinear dimensionality reduction Tenenbaum, J.B.; De Silva, V.; &
Langford, J.C. Science 290 (5500)
Locally Linear Embedding
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within
local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally
compared to find the best nonlinear embedding.
Locally linear embedding can be performed with function locally_linear_embedding or its object-oriented
counterpart LocallyLinearEmbedding.
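A quick sketch of the object-oriented interface (parameter values are arbitrary):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import LocallyLinearEmbedding
>>> X = load_digits().data
>>> lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
>>> X_lle = lle.fit_transform(X)
>>> X_lle.shape
(1797, 2)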
Complexity
The standard LLE algorithm comprises three stages:
1. Nearest Neighbors Search. See discussion under Isomap above.
2. Weight Matrix Construction. O[D N k^3]. The construction of the LLE weight matrix involves the solution of
a k x k linear equation for each of the N local neighborhoods.
3. Partial Eigenvalue Decomposition. See discussion under Isomap above.
The overall complexity of standard LLE is O[D log(k) N log(N)] + O[D N k^3] + O[d N^2].
N : number of training data points
D : input dimension
k : number of nearest neighbors
d : output dimension
References:
Nonlinear dimensionality reduction by locally linear embedding Roweis, S. & Saul, L. Science
290:2323 (2000)
Modified Locally Linear Embedding
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the
number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard
LLE applies an arbitrary regularization parameter r, which is chosen relative to the trace of the local weight matrix.
Though it can be shown formally that as r -> 0, the solution converges to the desired embedding, there is no guarantee
that the optimal solution will be found for r > 0. This problem manifests itself in embeddings which distort the
underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood.
This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the key-
word method = 'modified'. It requires n_neighbors > n_components.
Complexity
The MLLE algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately O[D N k^3] + O[N (k - D) k^2]. The first term is exactly equivalent
to that of standard LLE. The second term has to do with constructing the weight matrix from multiple weights.
In practice, the added cost of constructing the MLLE weight matrix is relatively small compared to the cost of
steps 1 and 3.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of MLLE is O[D log(k) N log(N)] + O[D N k^3] + O[N (k - D) k^2] + O[d N^2].
N : number of training data points
D : input dimension
k : number of nearest neighbors
d : output dimension
References:
MLLE: Modied Locally Linear Embedding Using Multiple Weights Zhang, Z. & Wang, J.
Hessian Eigenmapping
Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization
problem of LLE. It revolves around a Hessian-based quadratic form at each neighborhood which is used to recover
the locally linear structure. Though other implementations note its poor scaling with data size, sklearn imple-
ments some algorithmic improvements which make its cost comparable to that of other LLE variants for small output
dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counter-
part LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires
n_neighbors > n_components * (n_components + 3) / 2.
Complexity
The HLLE algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately O[D N k^3] + O[N d^6]. The first term reflects a similar cost to
that of standard LLE. The second term comes from a QR decomposition of the local Hessian estimator.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of standard HLLE is O[D log(k) N log(N)] + O[D N k^3] + O[N d^6] + O[d N^2].
N : number of training data points
D : input dimension
k : number of nearest neighbors
d : output dimension
References:
Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data Donoho, D. &
Grimes, C. Proc Natl Acad Sci USA. 100:5591 (2003)
Local Tangent Space Alignment
Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough
to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE,
LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global
optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the key-
word method = 'ltsa'.
Complexity
The LTSA algorithm comprises three stages:
1. Nearest Neighbors Search. Same as standard LLE
2. Weight Matrix Construction. Approximately O[D N k^3] + O[k^2 d]. The first term reflects a similar cost to that
of standard LLE.
3. Partial Eigenvalue Decomposition. Same as standard LLE
The overall complexity of standard LTSA is O[D log(k) N log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2].
N : number of training data points
D : input dimension
k : number of nearest neighbors
d : output dimension
References:
Principal manifolds and nonlinear dimensionality reduction via tangent space alignment Zhang, Z. &
Zha, H. Journal of Shanghai Univ. 8:406 (2004)
Multi-dimensional Scaling (MDS)
Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well
the distances in the original high-dimensional space.
In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or
dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction
frequencies of molecules, or trade indices between countries.
There exist two types of MDS algorithm: metric and non-metric. In scikit-learn, the class MDS implements
both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangular inequality), and the
distances between the two output points are then set to be as close as possible to the similarity or dissimilarity data. In
the non-metric version, the algorithms will try to preserve the order of the distances, and hence seek a monotonic
relationship between the distances in the embedded space and the similarities/dissimilarities.
Let S be the similarity matrix, and X the coordinates of the n input points. Disparities \hat{d}_{ij} are transformations of the
similarities chosen in some optimal way. The objective, called the stress, is then defined by

    \mathrm{stress} = \sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)
Metric MDS
In the simplest metric MDS model, called absolute MDS, disparities are defined by \hat{d}_{ij} = S_{ij}. With absolute MDS, the
value S_{ij} should then correspond exactly to the distance between point i and point j in the embedding space.
Most commonly, disparities are set to \hat{d}_{ij} = b S_{ij}.
Nonmetric MDS
Non-metric MDS focuses on the ordination of the data. If S_{ij} < S_{kl}, then the embedding should enforce
d_{ij} < d_{kl}. A simple algorithm to enforce that is to use a monotonic regression of d_{ij} on S_{ij}, yielding
disparities \hat{d}_{ij} in the same order as S_{ij}.
A trivial solution to this problem is to set all the points on the origin. In order to avoid that, the disparities \hat{d}_{ij} are
normalized.
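A rough sketch of how the MDS class can be driven with a precomputed dissimilarity matrix is given below; it
assumes, as in the MDS examples, that fit accepts the dissimilarity matrix directly and stores the result in an
embedding_ attribute:
>>> from sklearn.datasets import load_iris
>>> from sklearn.manifold import MDS
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> dissimilarities = euclidean_distances(load_iris().data)
>>> mds = MDS(n_components=2, random_state=0)
>>> pos = mds.fit(dissimilarities).embedding_
>>> pos.shape
(150, 2)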
References:
Modern Multidimensional Scaling - Theory and Applications Borg, I.; Groenen P. Springer Series in
Statistics (1997)
Nonmetric multidimensional scaling: a numerical method Kruskal, J. Psychometrika, 29 (1964)
Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis Kruskal, J. Psychome-
trika, 29, (1964)
Tips on practical use
Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-
neighbor search, the algorithm may perform poorly otherwise. See Scaler for convenient ways of scaling het-
erogeneous data.
The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a
d-dimensional manifold embedded in a D-dimensional parameter space, the reconstruction error will decrease
as n_components is increased until n_components == d.
Note that noisy data can short-circuit the manifold, in essence acting as a bridge between parts of the manifold
that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of
research.
Certain input configurations can lead to singular weight matrices, for example when more than two points in the
dataset are identical, or when the data is split into disjoint groups. In this case, eigen_solver='arpack' will
fail to find the null space. The easiest way to address this is to use eigen_solver='dense' which will work on a
singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can
attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may
help. If it is due to identical points in the dataset, removing these points may help.
1.4.3 Clustering
Clustering of unlabeled data can be performed with the module sklearn.cluster.
Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train
data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For
the class, the labels over the training data can be found in the labels_ attribute.
Input data
One important thing to note is that the algorithms implemented in this module take different kinds of ma-
trix as input. On one hand, MeanShift and KMeans take data matrices of shape [n_samples, n_features].
These can be obtained from the classes in the sklearn.feature_extraction module. On the
other hand, AffinityPropagation and SpectralClustering take similarity matrices of shape
[n_samples, n_samples]. These can be obtained from the functions in the sklearn.metrics.pairwise
module. In other words, MeanShift and KMeans work with points in a vector space, whereas
AffinityPropagation and SpectralClustering can work with arbitrary objects, as long as a simi-
larity measure exists for such objects.
Overview of clustering methods
Figure 1.3: A comparison of the clustering algorithms in scikit-learn
Method name: K-Means
Parameters: number of clusters
Scalability: Very large n_samples, medium n_clusters with MiniBatch code
Usecase: General-purpose, even cluster size, flat geometry, not too many clusters
Geometry (metric used): Distances between points

Method name: Affinity propagation
Parameters: damping, sample preference
Scalability: Not scalable with n_samples
Usecase: Many clusters, uneven cluster size, non-flat geometry
Geometry (metric used): Graph distance (e.g. nearest-neighbor graph)

Method name: Mean-shift
Parameters: bandwidth
Scalability: Not scalable with n_samples
Usecase: Many clusters, uneven cluster size, non-flat geometry
Geometry (metric used): Distances between points

Method name: Spectral clustering
Parameters: number of clusters
Scalability: Medium n_samples, small n_clusters
Usecase: Few clusters, even cluster size, non-flat geometry
Geometry (metric used): Graph distance (e.g. nearest-neighbor graph)

Method name: Hierarchical clustering
Parameters: number of clusters
Scalability: Large n_samples and n_clusters
Usecase: Many clusters, possibly connectivity constraints
Geometry (metric used): Distances between points

Method name: DBSCAN
Parameters: neighborhood size
Scalability: Very large n_samples, medium n_clusters
Usecase: Non-flat geometry, uneven cluster sizes
Geometry (metric used): Distances between nearest points

Method name: Gaussian mixtures
Parameters: many
Scalability: Not scalable
Usecase: Flat geometry, good for density estimation
Geometry (metric used): Mahalanobis distances to centers
Non-flat geometry clustering is useful when the clusters have a specific shape, i.e. a non-flat manifold, and the standard
euclidean distance is not the right metric. This case arises in the two top rows of the figure above.
Gaussian mixture models, useful for clustering, are described in another chapter of the documentation dedicated
to mixture models. KMeans can be seen as a special case of Gaussian mixture model with equal covariance per
component.
K-means
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion
known as the inertia of the groups. This algorithm requires the number of clusters to be specified. It scales well to
large numbers of samples, however its results may be dependent on the initialisation. As a result, the computation is
often done several times, with different initialisations of the centroids.
K-means is often referred to as Lloyd's algorithm. After initialization, k-means consists of looping between two major
steps. First the Voronoi diagram of the points is calculated using the current centroids. Each segment in the Voronoi
diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of each segment. The algorithm
then repeats this until a stopping criterion is fulfilled. Usually, as in this implementation, the algorithm stops when the
relative change in the results between iterations is less than the given tolerance value.
A parameter can be given to allow K-means to be run in parallel, called n_jobs. Giving this parameter a positive
value uses that many processors (default=1). A value of -1 uses all processors, with -2 using one less, and so on.
Parallelization generally speeds up computation at the cost of memory (in this case, multiple copies of centroids need
to be stored, one for each job).
K-means can be used for vector quantization. This is achieved using the transform method of a trained model of
KMeans.
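A minimal sketch of this usage pattern, assuming the number of clusters is passed as n_clusters (the toy data and
values are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> labels = kmeans.labels_             # cluster index of each training sample
>>> centers = kmeans.cluster_centers_   # one centroid per cluster
>>> codes = kmeans.transform(X)         # vector quantization: distance of each sample to each centroid
>>> codes.shape
(6, 2)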
Examples:
A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits
Mini Batch K-Means
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches, random subsets of the dataset, to
compute the centroids.
Although MiniBatchKMeans converges faster than KMeans, the quality of the results, as measured by the inertia
(the sum of distances of each point to its nearest centroid), is not as good as for the KMeans algorithm.
Examples:
A demo of the K Means clustering algorithm: Comparison of KMeans and MiniBatchKMeans
Clustering text documents using k-means: Document clustering using sparse MiniBatchKMeans
References:
Web Scale K-Means clustering D. Sculley, Proceedings of the 19th international conference on World
wide web (2010)
Affinity propagation
AffinityPropagation clusters data by diffusion in the similarity matrix. This algorithm automatically sets its
number of clusters. It will have difficulties scaling to thousands of samples.
Examples:
Demo of affinity propagation clustering algorithm: Affinity Propagation on a synthetic 2D dataset with 3
classes.
Visualizing the stock market structure: Affinity Propagation on financial time series to find groups of
companies
Mean Shift
MeanShift clusters data by estimating blobs in a smooth density of points. This algorithm automatically
sets its number of clusters. It will have difficulties scaling to thousands of samples. The utility function
estimate_bandwidth can be used to guess the optimal bandwidth for MeanShift from the data.
Examples:
A demo of the mean-shift clustering algorithm: Mean Shift clustering on a synthetic 2D dataset with 3
classes.
Spectral clustering
SpectralClustering does a low-dimension embedding of the affinity matrix between samples, followed by a
KMeans in the low dimensional space. It is especially efficient if the affinity matrix is sparse and the pyamg module
is installed. SpectralClustering requires the number of clusters to be specified. It works well for a small number of
clusters but is not advised when using many clusters.
For two clusters, it solves a convex relaxation of the normalised cuts problem on the similarity graph: cutting the graph
in two so that the weight of the edges cut is small compared to the weights of the edges inside each cluster. This criterion
is especially interesting when working on images: graph vertices are pixels, and edges of the similarity graph are a
function of the gradient of the image.
Warning: Shapeless isotropic data
When the data is really shapeless (i.e. generated from a random distribution with no clusters), the spectral-
clustering problem is ill-conditioned: the different choices are almost equivalent, and the spectral clustering solver
chooses an arbitrary one, putting the first sample alone in one bin.
Warning: Transforming distance to well-behaved similarities
Note that if the values of your similarity matrix are not well distributed, e.g. with negative values or with a distance
matrix rather than a similarity, the spectral problem will be singular and the problem not solvable. In which case
it is advised to apply a transformation to the entries of the matrix. For instance, in the case of a signed distance
matrix, it is common to apply a heat kernel:
similarity = np.exp(-beta * distance / distance.std())
See the examples for such an application.
Examples:
Spectral clustering for image segmentation: Segmenting objects from a noisy background using spectral
clustering.
Segmenting the picture of Lena in regions: Spectral clustering to split the image of lena in regions.
References:
A Tutorial on Spectral Clustering Ulrike von Luxburg, 2007
Normalized cuts and image segmentation Jianbo Shi, Jitendra Malik, 2000
A Random Walks View of Spectral Segmentation Marina Meila, Jianbo Shi, 2001
On Spectral Clustering: Analysis and an algorithm Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001
Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging them succes-
sively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that
gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details.
The Ward object performs a hierarchical clustering based on the Ward algorithm, that is, a variance-minimizing ap-
proach. At each step, it minimizes the sum of squared differences within all clusters (inertia criterion).
This algorithm can scale to large numbers of samples when it is used jointly with a connectivity matrix, but can be
computationally expensive when no connectivity constraints are added between samples: it considers at each step all
the possible merges.
Adding connectivity constraints
An interesting aspect of the Ward object is that connectivity constraints can be added to this algorithm (only adjacent
clusters can be merged together), through a connectivity matrix that defines for each sample the neighboring samples
following a given structure of the data. For instance, in the swiss-roll example below, the connectivity constraints
forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming clusters that extend across
overlapping folds of the roll.
The connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements
only at the intersection of a row and a column with indices of the dataset that should be connected. This ma-
trix can be constructed from a-priori information, for instance if you wish to cluster web pages, but only merg-
ing pages with a link pointing from one to another. It can also be learned from the data, for instance using
sklearn.neighbors.kneighbors_graph to restrict merging to nearest neighbors as in the swiss roll exam-
ple, or using sklearn.feature_extraction.image.grid_to_graph to enable only merging of neigh-
boring pixels on an image, as in the Lena example.
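A minimal sketch of this pattern, using a k-nearest-neighbors connectivity matrix on arbitrary toy data:
>>> import numpy as np
>>> from sklearn.cluster import Ward
>>> from sklearn.neighbors import kneighbors_graph
>>> np.random.seed(0)
>>> X = np.random.rand(100, 2)
>>> connectivity = kneighbors_graph(X, 5)   # sparse graph linking each sample to its 5 nearest neighbors
>>> ward = Ward(n_clusters=3, connectivity=connectivity).fit(X)
>>> labels = ward.labels_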
Examples:
A demo of structured Ward hierarchical clustering on Lena image: Ward clustering to split the image of
lena in regions.
Hierarchical clustering: structured vs unstructured ward: Example of Ward algorithm on a swiss-roll,
comparison of structured approaches versus unstructured approaches.
Feature agglomeration vs. univariate selection: Example of dimensionality reduction with feature ag-
glomeration based on Ward hierarchical clustering.
DBSCAN
The DBSCAN algorithm clusters data by finding core points which have many neighbours within a given radius. After
a core point is found, the cluster is expanded by adding its neighbours to the current cluster and recursively checking
if any are core points. Formally, a point is considered a core point if it has more than min_points points which are of
a similarity greater than the given threshold eps. This is shown in the figure below, where the color indicates cluster
membership and large circles indicate core points found by the algorithm. Moreover, the algorithm can detect outliers,
indicated by black points below. The outliers are defined as points which do not belong to any current cluster and do
not have enough close neighbours to start a new cluster.
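A rough sketch of the call, assuming the radius and core-point thresholds are exposed as the eps and min_samples
parameters (the toy data and values are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import DBSCAN
>>> np.random.seed(0)
>>> X = np.concatenate((np.random.randn(50, 2), 10 + np.random.randn(50, 2)))
>>> db = DBSCAN(eps=1.0, min_samples=5).fit(X)
>>> labels = db.labels_   # outliers, if any, are labeled -1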
Examples:
Demo of DBSCAN clustering algorithm: Clustering synthetic data with DBSCAN
References:
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Ester, M.,
H. P. Kriegel, J. Sander, and X. Xu, In Proceedings of the 2nd International Conference on Knowledge
Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231, 1996
Clustering performance evaluation
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision
and recall of a supervised classification algorithm. In particular, any evaluation metric should not take the absolute
values of the cluster labels into account, but rather whether this clustering defines separations of the data similar to some
ground truth set of classes, or satisfies some assumption such that members belonging to the same class are more similar
than members of different classes according to some similarity metric.
Inertia
Presentation and usage TODO: factorize inertia computation out of kmeans and then write me!
Advantages
No need for the ground truth knowledge of the real classes.
Drawbacks
Inertia makes the assumption that clusters are convex and isotropic, which is not always the case, especially if the clusters are manifolds with weird shapes: for instance, inertia is a useless metric to evaluate clustering algorithms that try to identify nested circles on a 2D plane.
Inertia is not a normalized metric: we just know that lower values are better and that zero is a lower bound. One potential solution would be to adjust inertia for random clustering (assuming the number of ground truth classes is known).
Adjusted Rand index
Presentation and usage Given the knowledge of the ground truth class assignments labels_true and our clus-
tering algorithm assignments of the same samples labels_pred, the adjusted Rand index is a function that mea-
sures the similarity of the two assignments, ignoring permutations and with chance normalization:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
One can permute 0 and 1 in the predicted labels and rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
Furthermore, adjusted_rand_score is symmetric: swapping the arguments does not change the score. It can thus be used as a consensus measure:
>>> metrics.adjusted_rand_score(labels_pred, labels_true)
0.24...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
1.0
Bad labelings (e.g. independent labelings) have negative or close to 0.0 scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
-0.12...
Advantages
Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure for instance).
Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI,
1.0 is the perfect match score.
No assumption is made on the cluster structure: ARI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with folded shapes.
Drawbacks
Contrary to inertia, ARI requires knowledge of the ground truth classes while is almost never available in
practice or requires manual assignment by human annotators (as in the supervised learning setting).
However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that
can be used for clustering model selection (TODO).
Examples:
Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on
the value of clustering measures for random assignments.
Mathematical formulation If C is a ground truth class assignment and K the clustering, let us define a and b as:
a, the number of pairs of elements that are in the same set in C and in the same set in K
b, the number of pairs of elements that are in different sets in C and in different sets in K
The raw (unadjusted) Rand index is then given by:

    RI = \frac{a + b}{C_2^{n_{samples}}}

Where C_2^{n_{samples}} is the total number of possible pairs in the dataset (without ordering).
However the RI score does not guarantee that random label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the number of samples).
To counter this effect we can discount the expected RI, E[RI], of random labelings by defining the adjusted Rand index as follows:

    ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}
References
Comparing Partitions, L. Hubert and P. Arabie, Journal of Classification, 1985
Wikipedia entry for the adjusted Rand index
Mutual Information based scores
Presentation and usage Given the knowledge of the ground truth class assignments labels_true and our clus-
tering algorithm assignments of the same samples labels_pred, the Mutual Information is a function that mea-
sures the agreement of the two assignments, ignoring permutations. Two different normalized versions of this measure
are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used
in the literature while AMI was proposed more recently and is normalized against chance:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
One can permute 0 and 1 in the predicted labels and rename 2 to 3 and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
mutual_info_score, adjusted_mutual_info_score and normalized_mutual_info_score are all symmetric: swapping the arguments does not change the score. Thus they can be used as a consensus measure:
>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)
0.22504...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
1.0
>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)
1.0
This is not true for mutual_info_score, which is therefore harder to judge:
>>> metrics.mutual_info_score(labels_true, labels_pred)
0.69...
Bad labelings (e.g. independent labelings) have non-positive scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
-0.10526...
Advantages
Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
Bounded range [0, 1]: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 0 indicates purely independent label assignments and an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
No assumption is made on the cluster structure: AMI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with folded shapes.
Drawbacks
Contrary to inertia, MI-based measures require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
NMI and MI are not adjusted against chance.
Examples:
Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on
the value of clustering measures for random assignments. This example also includes the Adjusted Rand
Index.
Mathematical formulation Assume two label assignments (of the same data), U with R classes and V with C classes. The entropy of either is the amount of uncertainty for an array, and can be calculated as:

    H(U) = -\sum_{i=1}^{|R|} P(i) \log(P(i))

Where P(i) is the number of instances in U that are in class R_i. Likewise, for V:

    H(V) = -\sum_{j=1}^{|C|} P'(j) \log(P'(j))

Where P'(j) is the number of instances in V that are in class C_j.
The mutual information between U and V is calculated by:

    MI(U, V) = \sum_{i=1}^{|R|} \sum_{j=1}^{|C|} P(i, j) \log\left(\frac{P(i, j)}{P(i) P'(j)}\right)

Where P(i, j) is the number of instances with label R_i and also with label C_j.

The normalized mutual information is defined as:

    NMI(U, V) = \frac{MI(U, V)}{\sqrt{H(U) H(V)}}
Neither the mutual information nor its normalized variant is adjusted for chance, and both will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of mutual information between the label assignments.
The expected value for the mutual information can be calculated using the following equation, from Vinh, Epps, and Bailey (2009). In this equation, a_i is the number of instances with label U_i and b_j is the number of instances with label V_j:

    E[MI(U, V)] = \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{n_{ij} = (a_i + b_j - N)^+}^{\min(a_i, b_j)} \frac{n_{ij}}{N} \log\left(\frac{N \cdot n_{ij}}{a_i b_j}\right) \frac{a_i! \, b_j! \, (N - a_i)! \, (N - b_j)!}{N! \, n_{ij}! \, (a_i - n_{ij})! \, (b_j - n_{ij})! \, (N - a_i - b_j + n_{ij})!}

Using the expected value, the adjusted mutual information can then be calculated using a similar form to that of the adjusted Rand index:

    AMI = \frac{MI - E[MI]}{\max(H(U), H(V)) - E[MI]}
References
Strehl, Alexander, and Joydeep Ghosh (2002). "Cluster ensembles - a knowledge reuse framework for combining multiple partitions". Journal of Machine Learning Research 3: 583-617. doi:10.1162/153244303321897735
Vinh, Epps, and Bailey (2009). "Information theoretic measures for clusterings comparison". Proceedings of the 26th Annual International Conference on Machine Learning - ICML 09. doi:10.1145/1553374.1553511. ISBN 9781605585161.
Vinh, Epps, and Bailey (2010). "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", JMLR http://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
Wikipedia entry for the (normalized) Mutual Information
Wikipedia entry for the Adjusted Mutual Information
Homogeneity, completeness and V-measure
Presentation and usage Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metrics using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:
homogeneity: each cluster contains only members of a single class.
completeness: all members of a given class are assigned to the same cluster.
We can turn those concepts into scores, homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better):
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...
>>> metrics.completeness_score(labels_true, labels_pred)
0.42...
Their harmonic mean called V-measure is computed by v_measure_score:
>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...
The V-measure is actually equivalent to the mutual information (NMI) discussed above normalized by the sum of the
label entropies [B2011].
Homogeneity, completeness and V-measure can be computed at once using homogeneity_completeness_v_measure as follows:
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(0.66..., 0.42..., 0.51...)
The following clustering assignment is slightly better, since it is homogeneous but not complete:
>>> labels_pred = [0, 0, 0, 1, 2, 2]
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
...
(1.0, 0.68..., 0.81...)
Note: v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments
on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:
homogeneity_score(a, b) == completeness_score(b, a)
Advantages
Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score
Intuitive interpretation: clustering with a bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to get a better feel for what kind of mistakes are made by the assignment.
No assumption is made on the cluster structure: the V-measure can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with folded shapes.
Drawbacks
The previously introduced metrics are not normalized w.r.t. random labeling: this means that depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values for homogeneity, completeness and hence v-measure. In particular random labeling won't yield zero scores, especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of
clusters is less than 10. For smaller sample sizes or larger number of clusters it is safer to use an adjusted
index such as the Adjusted Rand Index (ARI).
These metrics require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
Examples:
Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on
the value of clustering measures for random assignments.
Mathematical formulation Homogeneity and completeness scores are formally given by:

    h = 1 - \frac{H(C|K)}{H(C)}

    c = 1 - \frac{H(K|C)}{H(K)}
where H(C|K) is the conditional entropy of the classes given the cluster assignments and is given by:

    H(C|K) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \log\left(\frac{n_{c,k}}{n_k}\right)

and H(C) is the entropy of the classes and is given by:

    H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{n} \log\left(\frac{n_c}{n}\right)
with n the total number of samples, n_c and n_k the number of samples respectively belonging to class c and cluster k, and finally n_{c,k} the number of samples from class c assigned to cluster k.
The conditional entropy of clusters given class H(K|C) and the entropy of clusters H(K) are defined in a symmetric manner.
Rosenberg and Hirschberg further define V-measure as the harmonic mean of homogeneity and completeness:

    v = 2 \cdot \frac{h \cdot c}{h + c}
References
Silhouette Coefficient
Presentation and usage If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:

    s = \frac{b - a}{\max(a, b)}

The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
>>> X = dataset.data
>>> y = dataset.target
In normal usage, the Silhouette Coefficient is applied to the results of a cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
...
0.55...
References
Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7.
Advantages
The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero
indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
The Silhouette Coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.
1.4.4 Decomposing signals in components (matrix factorization problems)
Principal component analysis (PCA)
Exact PCA and probabilistic interpretation
PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.
The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the downstream models make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm. However, in that case the inverse transform is no longer exact since some information is lost while forward transforming.
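For instance, the following sketch (the number of components and the choice of dataset are illustrative only) fits a whitened PCA on the iris data and projects it on its two leading components:

>>> from sklearn import datasets
>>> from sklearn.decomposition import PCA
>>> X = datasets.load_iris().data
>>> pca = PCA(n_components=2, whiten=True).fit(X)    # learn 2 components, scaled to unit variance
>>> X_projected = pca.transform(X)                   # shape (n_samples, 2)
>>> ratios = pca.explained_variance_ratio_           # fraction of variance explained by each component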
In addition, the ProbabilisticPCA object provides a probabilistic interpretation of the PCA that can give a like-
lihood of data based on the amount of variance it explains. As such it implements a score method that can be used in
cross-validation.
Below is an example of the iris dataset, which is comprised of 4 features, projected on the 2 dimensions that explain
most variance:
Examples:
Comparison of LDA and PCA 2D projection of Iris dataset
Approximate PCA
Often we are interested in projecting the data onto a lower dimensional space that preserves most of the variance, by dropping the singular vectors of components associated with lower singular values.
For instance, for face recognition, if we work with 64x64 gray level pixel pictures, the dimensionality of the data is 4096 and it is slow to train an RBF Support Vector Machine on such wide data. Furthermore, we know that the intrinsic dimensionality of the data is much lower than 4096 since all face pictures look alike. The samples lie on a manifold of much lower dimension (say around 200 for instance). The PCA algorithm can be used to linearly transform the data while both reducing the dimensionality and preserving most of the explained variance at the same time.
The class RandomizedPCA is very useful in that case: since we are going to drop most of the singular vectors, it is much more efficient to limit the computation to an approximate estimate of the singular vectors we will keep to actually perform the transform.
For instance, the following shows 16 sample portraits (centered around 0.0) from the Olivetti dataset. On the right hand side are the first 16 singular vectors reshaped as portraits. Since we only require the top 16 singular vectors of a dataset with size n_samples = 400 and n_features = 64 x 64 = 4096, the computation time is less than 1s:
RandomizedPCA can hence be used as a drop-in replacement for PCA, with the minor exception that we need to give it the size of the lower-dimensional space, n_components, as a mandatory input parameter.
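A minimal sketch of such a drop-in usage (the data shape and number of components are arbitrary):

>>> import numpy as np
>>> from sklearn.decomposition import RandomizedPCA
>>> X = np.random.rand(400, 4096)                        # e.g. 400 flattened 64x64 portraits (synthetic here)
>>> rpca = RandomizedPCA(n_components=16, whiten=True)   # n_components is mandatory for RandomizedPCA
>>> X_reduced = rpca.fit(X).transform(X)                 # shape (400, 16)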
If we note n_max = max(n_samples, n_features) and n_min = min(n_samples, n_features), the time complexity of RandomizedPCA is O(n_max^2 \cdot n_components) instead of O(n_max^2 \cdot n_min) for the exact method implemented in PCA.
The memory footprint of RandomizedPCA is also proportional to 2 \cdot n_max \cdot n_components instead of n_max \cdot n_min for the exact method.
Furthermore RandomizedPCA is able to work with scipy.sparse matrices as input, which makes it suitable for reducing the dimensionality of features extracted from text documents, for instance.
Note: the implementation of inverse_transform in RandomizedPCA is not the exact inverse transform of transform
even when whiten=False (default).
Examples:
Faces recognition example using eigenfaces and SVMs
Faces dataset decompositions
References:
Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions Halko, et al., 2009
Kernel PCA
KernelPCA is an extension of PCA which achieves non-linear dimensionality reduction through the use of kernels.
It has many applications including denoising, compression and structured prediction (kernel dependency estimation).
KernelPCA supports both transform and inverse_transform.
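A short sketch of both directions (the kernel, gamma value and data are illustrative assumptions, not taken from the text):

>>> import numpy as np
>>> from sklearn.decomposition import KernelPCA
>>> X = np.random.rand(100, 10)
>>> kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10, fit_inverse_transform=True)
>>> X_kpca = kpca.fit_transform(X)             # non-linear projection on 2 components
>>> X_back = kpca.inverse_transform(X_kpca)    # approximate reconstruction in the input space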
Examples:
Kernel PCA
Sparse Principal Components Analysis (SparsePCA and MiniBatchSparsePCA)
SparsePCA is a variant of PCA, with the goal of extracting the set of sparse components that best reconstruct the
data.
Mini Batch Sparse PCA (MiniBatchSparsePCA) is a variant of SparsePCA that is faster but less accurate. The
increased speed is reached by iterating over small chunks of the set of features, for a given number of iterations.
Principal component analysis (PCA) has the disadvantage that the components extracted by this method have exclusively dense expressions, i.e. they have non-zero coefficients when expressed as linear combinations of the original variables. This can make interpretation difficult. In many cases, the real underlying components can be more naturally imagined as sparse vectors; for example in face recognition, components might naturally map to parts of faces.
Sparse principal components yield a more parsimonious, interpretable representation, clearly emphasizing which of the original features contribute to the differences between samples.
The following example illustrates 16 components extracted using sparse PCA from the Olivetti faces dataset. It can be seen how the regularization term induces many zeros. Furthermore, the natural structure of the data causes the non-zero coefficients to be vertically adjacent. The model does not enforce this mathematically: each component is a vector h in R^4096, and there is no notion of vertical adjacency except during the human-friendly visualization as 64x64 pixel images. The fact that the components shown below appear local is the effect of the inherent structure of the data, which makes such local patterns minimize reconstruction error. There exist sparsity-inducing norms that take into account adjacency and different kinds of structure; see [Jen09] for a review of such methods. For more details on how to use Sparse PCA, see the Examples section below.
Note that there are many different formulations for the Sparse PCA problem. The one implemented here is based on [Mrl09]. The optimization problem solved is a PCA problem (dictionary learning) with an \ell_1 penalty on the components:

    (U^*, V^*) = \arg\min_{U, V} \frac{1}{2} ||X - UV||_2^2 + \alpha ||V||_1

    subject to ||U_k||_2 = 1 for all 0 \leq k < n_components
The sparsity-inducing \ell_1 norm also prevents learning components from noise when few training samples are available.
The degree of penalization (and thus sparsity) can be adjusted through the hyperparameter alpha. Small values lead to a gently regularized factorization, while larger values shrink many coefficients to zero.
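A small sketch of fitting sparse components (the number of components and the alpha value are arbitrary, and random data stands in for real images):

>>> import numpy as np
>>> from sklearn.decomposition import SparsePCA
>>> X = np.random.rand(50, 30)
>>> spca = SparsePCA(n_components=5, alpha=1)   # larger alpha gives sparser components
>>> X_codes = spca.fit(X).transform(X)          # loadings of each sample on the components
>>> components = spca.components_               # many entries are exactly zero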
Note: While in the spirit of an online algorithm, the class MiniBatchSparsePCA does not implement partial_fit because the algorithm is online along the features direction, not the samples direction.
Examples:
Faces dataset decompositions
References:
Dictionary Learning
Sparse coding with a precomputed dictionary
The SparseCoder object is an estimator that can be used to transform signals into sparse linear combinations of atoms from a fixed, precomputed dictionary such as a discrete wavelet basis. This object therefore does not implement a fit method. The transformation amounts to a sparse coding problem: finding a representation of the data as a linear combination of as few dictionary atoms as possible. All variations of dictionary learning implement the following transform methods, controllable via the transform_method initialization parameter:
Orthogonal matching pursuit (Orthogonal Matching Pursuit (OMP))
Least-angle regression (Least Angle Regression)
Lasso computed by least-angle regression
Lasso using coordinate descent (Lasso)
Thresholding
Thresholding is very fast but it does not yield accurate reconstructions. It has nevertheless been shown useful in the literature for classification tasks. For image reconstruction tasks, orthogonal matching pursuit yields the most accurate, unbiased reconstruction.
The dictionary learning objects offer, via the split_code parameter, the possibility to separate the positive and negative values in the results of sparse coding. This is useful when dictionary learning is used for extracting features that will be used for supervised learning, because it allows the learning algorithm to assign different weights to negative loadings of a particular atom than to the corresponding positive loading.
The split code for a single sample has length 2 * n_atoms and is constructed using the following rule: First, the regular code of length n_atoms is computed. Then, the first n_atoms entries of the split_code are filled with the positive part of the regular code vector. The second half of the split code is filled with the negative part of the code vector, only with a positive sign. Therefore, the split_code is non-negative.
Examples:
Sparse coding with a precomputed dictionary
Generic dictionary learning
Dictionary learning (DictionaryLearning) is a matrix factorization problem that amounts to finding a (usually overcomplete) dictionary that will perform well at sparsely encoding the fitted data.
Representing data as sparse combinations of atoms from an overcomplete dictionary is suggested to be the way the mammalian primary visual cortex works. Consequently, dictionary learning applied on image patches has been shown to give good results in image processing tasks such as image completion, inpainting and denoising, as well as for supervised recognition tasks.
Dictionary learning is an optimization problem solved by alternately updating the sparse code, as a solution to multiple Lasso problems, considering the dictionary fixed, and then updating the dictionary to best fit the sparse code.
    (U^*, V^*) = \arg\min_{U, V} \frac{1}{2} ||X - UV||_2^2 + \alpha ||U||_1

    subject to ||V_k||_2 = 1 for all 0 \leq k < n_atoms
After using such a procedure to fit the dictionary, the transform is simply a sparse coding step that shares the same implementation with all dictionary learning objects (see Sparse coding with a precomputed dictionary).
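A rough sketch of this fit-then-code pattern (the keyword names n_atoms and alpha are assumptions taken from the formulation above and may differ from the actual constructor, and the data is random filler):

>>> import numpy as np
>>> from sklearn.decomposition import DictionaryLearning
>>> X = np.random.rand(100, 64)                          # e.g. 100 flattened 8x8 patches (hypothetical)
>>> dico = DictionaryLearning(n_atoms=16, alpha=1).fit(X)
>>> codes = dico.transform(X)                            # sparse codes, shape (100, 16)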
The following image shows what a dictionary learned from 4x4 pixel image patches, extracted from part of the image of Lena, looks like.
Examples:
Image denoising using dictionary learning
References:
Online dictionary learning for sparse coding J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009
Mini-batch dictionary learning
MiniBatchDictionaryLearning implements a faster, but less accurate version of the dictionary learning algo-
rithm that is better suited for large datasets.
By default, MiniBatchDictionaryLearning divides the data into mini-batches and optimizes in an online
manner by cycling over the mini-batches for the specied number of iterations. However, at the moment it does not
implement a stopping condition.
The estimator also implements partial_fit, which updates the dictionary by iterating only once over a mini-batch. This can be used for online learning when the data is not readily available from the start, or when the data does not fit into memory.
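A rough sketch of this online usage (the n_atoms keyword is an assumption mirroring the dictionary learning formulation above, and the batching scheme is arbitrary):

>>> import numpy as np
>>> from sklearn.decomposition import MiniBatchDictionaryLearning
>>> dico = MiniBatchDictionaryLearning(n_atoms=16)
>>> for batch in np.array_split(np.random.rand(1000, 64), 10):
...     dico = dico.partial_fit(batch)            # one pass over each mini-batch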
Independent component analysis (ICA)
Independent component analysis separates a multivariate signal into additive subcomponents that are maximally inde-
pendent. It is implemented in scikit-learn using the Fast ICA algorithm.
It is classically used to separate mixed signals (a problem known as blind source separation), as in the example below:
ICA can also be used as yet another non-linear decomposition that finds components with some sparsity:
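Before the examples, here is a rough sketch of the typical blind source separation usage (the mixing setup is synthetic and the fit/transform shape convention is assumed for this release):

>>> import numpy as np
>>> from sklearn.decomposition import FastICA
>>> S = np.random.standard_t(1.5, size=(200, 2))   # two non-Gaussian sources
>>> A = np.array([[1.0, 1.0], [0.5, 2.0]])         # mixing matrix
>>> X = np.dot(S, A.T)                             # observed mixed signals
>>> ica = FastICA()
>>> S_estimated = ica.fit(X).transform(X)          # recovered sources (up to scale and order)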
Examples:
Blind source separation using FastICA
FastICA on 2D point clouds
Faces dataset decompositions
Non-negative matrix factorization (NMF or NNMF)
NMF is an alternative approach to decomposition that assumes that the data and the components are non-negative. NMF can be plugged in instead of PCA or its variants, in the cases where the data matrix does not contain negative values. Unlike PCA, the representation of a vector is obtained in an additive fashion, by superimposing the components, without subtracting. Such additive models are efficient for representing images and text.
It has been observed in [Hoyer, 04] that, when carefully constrained, NMF can produce a parts-based representation of
the dataset, resulting in interpretable models. The following example displays 16 sparse components found by NMF
from the images in the Olivetti faces dataset, in comparison with the PCA eigenfaces.
The init attribute determines the initialization method applied, which has a great impact on the performance of the
method. NMF implements the method Nonnegative Double Singular Value Decomposition. NNDSVD is based on two
SVD processes, one approximating the data matrix, the other approximating positive sections of the resulting partial
SVD factors utilizing an algebraic property of unit rank matrices. The basic NNDSVD algorithm is a better fit for sparse factorization. Its variants NNDSVDa (in which all zeros are set equal to the mean of all elements of the data) and NNDSVDar (in which the zeros are set to random perturbations less than the mean of the data divided by 100) are recommended in the dense case.
NMF can also be initialized with random non-negative matrices, by passing an integer seed or a RandomState to init.
In NMF, sparseness can be enforced by setting the attribute sparseness to 'data' or 'components'. Sparse components lead to localized features, and sparse data leads to a more efficient representation of the data.
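A brief sketch on random non-negative data (the init and sparseness values shown are illustrative choices, not recommendations from the text):

>>> import numpy as np
>>> from sklearn.decomposition import NMF
>>> X = np.abs(np.random.rand(40, 20))                 # NMF requires non-negative input
>>> nmf = NMF(n_components=5, init='nndsvd', sparseness='components')
>>> W = nmf.fit_transform(X)                           # sample activations, shape (40, 5)
>>> H = nmf.components_                                # non-negative components, shape (5, 20)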
Examples:
Faces dataset decompositions
Topics extraction with Non-Negative Matrix Factorization
References:
Learning the parts of objects by non-negative matrix factorization D. Lee, S. Seung, 1999
Non-negative Matrix Factorization with Sparseness Constraints P. Hoyer, 2004
Projected gradient methods for non-negative matrix factorization C.-J. Lin, 2007
SVD-based initialization: A head start for nonnegative matrix factorization C. Boutsidis, E. Gallopoulos,
2008
1.4.5 Covariance estimation
Many statistical problems require at some point the estimation of a population's covariance matrix, which can be seen as an estimation of the shape of a data set's scatter plot. Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation's quality. The sklearn.covariance package aims at providing tools affording an accurate estimation of a population's covariance matrix under various settings.
We assume that the observations are independent and identically distributed (i.i.d.).
Empirical covariance
The covariance matrix of a data set is known to be well approximated with the classical Maximum Likelihood Estimator
(or empirical covariance), provided the number of observations is large enough compared to the number of features
(the variables describing the observations). More precisely, the Maximum Likelihood Estimator of a sample is an
unbiased estimator of the corresponding population covariance matrix.
The empirical covariance matrix of a sample can be computed using the empirical_covariance function of the package, or by fitting an EmpiricalCovariance object to the data sample with the EmpiricalCovariance.fit method. Be careful that depending on whether the data are centered or not, the result will be different, so one may want to use the assume_centered parameter accurately.
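Both routes are sketched below on random data (shapes and options are arbitrary):

>>> import numpy as np
>>> from sklearn.covariance import EmpiricalCovariance, empirical_covariance
>>> X = np.random.randn(200, 5)                       # 200 i.i.d. observations, 5 features
>>> cov1 = empirical_covariance(X)                    # plain function, returns a 5x5 array
>>> est = EmpiricalCovariance(assume_centered=False).fit(X)
>>> cov2 = est.covariance_                            # the same estimate, as an estimator attribute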
Examples:
See Ledoit-Wolf vs Covariance simple estimation for an example on how to fit an EmpiricalCovariance object to data.
Shrunk Covariance
Basic shrinkage
Despite being an unbiased estimator of the covariance matrix, the Maximum Likelihood Estimator is not a good estimator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate. Sometimes, it even occurs that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage. It consists in reducing the ratio between the smallest and the largest eigenvalue of the empirical covariance matrix. This can be done by simply shifting every eigenvalue according to a given offset, which is equivalent to finding the l2-penalized Maximum Likelihood Estimator of the covariance matrix, or by reducing the highest eigenvalue while increasing the smallest with the help of a convex transformation:

    \Sigma_{shrunk} = (1 - \alpha) \hat{\Sigma} + \alpha \frac{{\rm Tr}(\hat{\Sigma})}{p} {\rm Id}

The latter approach has been implemented in scikit-learn.
A convex transformation (with a user-defined shrinkage coefficient) can be directly applied to a pre-computed covariance with the shrunk_covariance method. Also, a shrunk estimator of the covariance can be fitted to data with a ShrunkCovariance object and its ShrunkCovariance.fit method. Again, depending on whether the data are centered or not, the result will be different, so one may want to use the assume_centered parameter accurately.
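For instance (the shrinkage value is arbitrary):

>>> import numpy as np
>>> from sklearn.covariance import ShrunkCovariance, shrunk_covariance, empirical_covariance
>>> X = np.random.randn(60, 10)
>>> cov = ShrunkCovariance(shrinkage=0.1).fit(X).covariance_             # shrunk estimate fitted to data
>>> cov_bis = shrunk_covariance(empirical_covariance(X), shrinkage=0.1)  # from a precomputed covariance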
Examples:
See Ledoit-Wolf vs Covariance simple estimation for an example on how to fit a ShrunkCovariance object to data.
Ledoit-Wolf shrinkage
In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient that minimizes the Mean Squared Error between the estimated and the real covariance matrix, in terms of the Frobenius norm.
The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the ledoit_wolf function of the sklearn.covariance package, or it can be otherwise obtained by fitting a LedoitWolf object to the same sample.
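For example (a rough sketch; the attribute names are assumed to match this release):

>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf, ledoit_wolf
>>> X = np.random.randn(40, 20)                      # relatively few samples for 20 features
>>> lw = LedoitWolf().fit(X)
>>> cov = lw.covariance_                             # shrunk covariance estimate
>>> shrinkage = lw.shrinkage_                        # coefficient chosen by the Ledoit-Wolf formula
>>> cov_bis, shrinkage_bis = ledoit_wolf(X)          # same result via the function interface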
[1] O. Ledoit and M. Wolf, A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices, Jour-
nal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
Examples:
See Ledoit-Wolf vs Covariance simple estimation for an example on how to fit a LedoitWolf object to data and for visualizing the performances of the Ledoit-Wolf estimator in terms of likelihood.
Oracle Approximating Shrinkage
Under the assumption that the data are Gaussian distributed, Chen et al. [2] derived a formula aimed at choosing a shrinkage coefficient that yields a smaller Mean Squared Error than the one given by Ledoit and Wolf's formula. The resulting estimator is known as the Oracle Shrinkage Approximating estimator of the covariance.
The OAS estimator of the covariance matrix can be computed on a sample with the oas function of the sklearn.covariance package, or it can be otherwise obtained by fitting an OAS object to the same sample. The formula we used to implement the OAS does not correspond to the one given in the article. It has been taken from the MATLAB program available from the authors' webpage (https://tbayes.eecs.umich.edu/yilun/covestimation).
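Usage mirrors the Ledoit-Wolf case (a sketch; data and attribute names as assumed above):

>>> import numpy as np
>>> from sklearn.covariance import OAS, oas
>>> X = np.random.randn(40, 20)
>>> estimator = OAS().fit(X)
>>> cov = estimator.covariance_                  # OAS-shrunk covariance estimate
>>> cov_bis, shrinkage = oas(X)                  # function interface, also returns the shrinkage used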
[2] Chen et al., Shrinkage Algorithms for MMSE Covariance Estimation, IEEE Trans. on Sign. Proc., Volume
58, Issue 10, October 2010.
Examples:
See Ledoit-Wolf vs Covariance simple estimation for an example on how to t an OAS object to data.
See Ledoit-Wolf vs OAS estimation to visualize the Mean Squared Error difference between a
LedoitWolf and an OAS estimator of the covariance.
Sparse inverse covariance
The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation
matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on
the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a
sparse precision matrix: by learning independence relations from the data, the estimation of the covariance matrix is
better conditioned. This is known as covariance selection.
In the small-samples situation, in which n_samples is on the order of magnitude of n_features or smaller, sparse inverse
covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for
very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are
able to recover off-diagonal structure.
The GraphLasso estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its alpha
parameter, the more sparse the precision matrix. The corresponding GraphLassoCV object uses cross-validation to
automatically set the alpha parameter.
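A short sketch of both estimators (the alpha value is arbitrary and random, uncorrelated data stands in for a real problem):

>>> import numpy as np
>>> from sklearn.covariance import GraphLasso, GraphLassoCV
>>> X = np.random.randn(60, 10)
>>> model = GraphLasso(alpha=0.05).fit(X)       # fixed amount of l1 penalization
>>> precision = model.precision_                # sparse estimate of the precision matrix
>>> model_cv = GraphLassoCV().fit(X)            # alpha chosen by cross-validation
>>> best_alpha = model_cv.alpha_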
Note: Structure recovery
Recovering a graphical structure from correlations in the data is a challenging thing. If you are interested in such
recovery keep in mind that:
Recovery is easier from a correlation matrix than a covariance matrix: standardize your observations before
running GraphLasso
Figure 1.4: A comparison of maximum likelihood, shrinkage and sparse estimates of the covariance and precision
matrix in the very small samples settings.
If the underlying graph has nodes with many more connections than the average node, the algorithm will miss some of these connections.
If your number of observations is not large compared to the number of edges in your underlying graph, you will
not recover it.
Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the
GraphLassoCV object) will lead to selecting too many edges. However, the relevant edges will have heavier
weights than the irrelevant ones.
The mathematical formulation is the following:

    \hat{K} = \arg\min_K \left( {\rm tr}(S K) - \log\det(K) + \alpha ||K||_1 \right)

Where K is the precision matrix to be estimated, and S is the sample covariance matrix. ||K||_1 is the sum of the absolute values of the off-diagonal coefficients of K. The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R glasso package.
Examples:
Sparse inverse covariance estimation: example on synthetic data showing some recovery of a structure,
and comparing to other covariance estimators.
Visualizing the stock market structure: example on real stock market data, finding which symbols are most
linked.
References:
Friedman et al, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9, pp 432,
2008
Robust Covariance Estimation
Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also appear for a variety of reasons. Every observation which is very uncommon is called an outlier. The empirical covariance estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outlying observations in the data. Therefore, one should use robust covariance estimators to estimate the covariance of one's real data sets. Alternatively, robust covariance estimators can be used to perform outlier detection and discard/downweight some observations according to further processing of the data.
The sklearn.covariance package implements a robust estimator of covariance, the Minimum Covariance Determinant
[3].
Minimum Covariance Determinant
The Minimum Covariance Determinant estimator is a robust estimator of a data set's covariance introduced by P. J. Rousseeuw in [3]. The idea is to find a given proportion (h) of "good" observations which are not outliers and compute their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate for the performed selection of observations ("consistency step"). Having computed the Minimum Covariance Determinant estimator, one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the covariance matrix of the data set ("reweighting step").
Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order to compute the Minimum Covariance Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also computes a robust estimate of the data set location at the same time.
Raw estimates can be accessed as raw_location_ and raw_covariance_ attributes of a MinCovDet robust covariance
estimator object.
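A minimal sketch (the contamination pattern is synthetic; attribute names as described above):

>>> import numpy as np
>>> from sklearn.covariance import MinCovDet
>>> X = np.random.randn(100, 5)
>>> X[:10] += 10                                  # contaminate a few observations
>>> mcd = MinCovDet().fit(X)
>>> robust_cov = mcd.covariance_                  # reweighted MCD estimate
>>> raw_cov = mcd.raw_covariance_                 # estimate before the reweighting step
>>> dist = mcd.mahalanobis(X)                     # robust Mahalanobis distances of the samples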
[3] P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
[4] A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Associa-
tion and the American Society for Quality, TECHNOMETRICS.
Examples:
See Robust vs Empirical covariance estimate for an example on how to fit a MinCovDet object to data
and see how the estimate remains accurate despite the presence of outliers.
See Robust covariance estimation and Mahalanobis distances relevance to visualize the difference be-
tween EmpiricalCovariance and MinCovDet covariance estimators in terms of Mahalanobis dis-
tance (so we get a better estimate of the precision matrix too).
Influence of outliers on location and covariance estimates
Separating inliers from outliers using a Mahalanobis distance
1.4.6 Novelty and Outlier Detection
Many applications require being able to decide whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier). Often, this ability is used to clean real data sets. Two important distinctions must be made:
novelty detection The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
outlier detection The training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations.
The scikit-learn project provides a set of machine learning tools that can be used both for novelty and outlier detection. This strategy is implemented with objects learning in an unsupervised way from the data:
estimator.fit(X_train)
New observations can then be sorted as inliers or outliers with a predict method:
estimator.predict(X_test)
Inliers are labeled 0, while outliers are labeled 1.
Novelty Detection
Consider a data set of n observations from the same distribution described by p features. Consider now that we add one more observation to that data set. Is the new observation so different from the others that we can doubt it is regular? (i.e. does it come from the same distribution?) Or on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by the novelty detection tools and methods.
In general, the goal is to learn a rough, close frontier delimiting the contour of the initial observations' distribution, plotted in the embedding p-dimensional space. Then, if further observations lie within the frontier-delimited subspace, they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside the frontier, we can say that they are abnormal with a given confidence in our assessment.
The One-Class SVM has been introduced in [1] for that purpose and implemented in the Support Vector Machines module in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to define a frontier. The RBF kernel is usually chosen although there exists no exact formula or algorithm to set its bandwidth parameter. This is the default in the scikit-learn implementation. The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier.
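A brief sketch of the fit/predict usage described above (the nu and gamma values and the toy data are arbitrary):

>>> import numpy as np
>>> from sklearn import svm
>>> X_train = 0.3 * np.random.randn(100, 2)                        # regular observations only
>>> X_test = np.r_[0.3 * np.random.randn(20, 2),
...                np.random.uniform(low=-4, high=4, size=(20, 2))]
>>> clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
>>> flags = clf.predict(X_test)                                    # flags each sample as inlier or outlier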
Examples:
See One-class SVM with non-linear kernel (RBF) for visualizing the frontier learned around some data by a svm.OneClassSVM object.
Outlier Detection
Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations from some polluting ones, called "outliers". Yet, in the case of outlier detection, we don't have a clean data set representing the population of regular observations that can be used to train any tool.
Fitting an elliptic envelope
One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the "shape" of the data, and can define outlying observations as observations which stand far enough from the fit shape.
scikit-learn provides an object covariance.EllipticEnvelope that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness. This strategy is illustrated below.
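A minimal sketch (the contamination value and the synthetic data are arbitrary, and the contamination keyword is assumed to be supported):

>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> X = np.r_[np.random.randn(100, 2),
...           np.random.uniform(low=-6, high=6, size=(10, 2))]     # inliers plus a few outliers
>>> envelope = EllipticEnvelope(contamination=0.1).fit(X)           # robust fit of the central mode
>>> flags = envelope.predict(X)                                     # flags each sample as inlier or outlier
>>> scores = envelope.decision_function(X)                          # lower values are more outlying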
Examples:
See Robust covariance estimation and Mahalanobis distances relevance for an illustration of the dif-
ference between using a standard (covariance.EmpiricalCovariance) or a robust estimate
(covariance.MinCovDet) of location and covariance to assess the degree of outlyingness of an ob-
servation.
References:
One-class SVM versus elliptic envelope
Strictly speaking, the One-class SVM is not an outlier-detection method, but a novelty-detection method: its training set should not be contaminated by outliers as it may fit them. That said, outlier detection in high dimension, or without
any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM gives useful results
in these situations.
The examples below illustrate how the performance of the covariance.EllipticEnvelope degrades as the
data is less and less unimodal. svm.OneClassSVM works better on data with multiple modes.
Table 1.1: Comparing the One-class SVM approach and the elliptic envelope
For an inlier mode that is well-centered and elliptic, the svm.OneClassSVM is not able to benefit from the rotational symmetry of the inlier population. In addition, it fits the outliers present in the training set a bit. By contrast, the decision rule based on fitting a covariance.EllipticEnvelope learns an ellipse, which fits the inlier distribution well.
As the inlier distribution becomes bimodal, the covariance.EllipticEnvelope does not fit the inliers well. However, we can see that the svm.OneClassSVM tends to overfit: because it has no model of the inliers, it interprets a region where, by chance, some outliers are clustered, as inliers.
If the inlier distribution is strongly non-Gaussian, the svm.OneClassSVM is able to recover a reasonable approximation, whereas the covariance.EllipticEnvelope completely fails.
Examples:
See Outlier detection with several methods. for a comparison of the svm.OneClassSVM
(tuned to perform like an outlier detection method) and a covariance-based outlier detection with
covariance.MinCovDet.
1.4.7 Hidden Markov Models
sklearn.hmm implements the algorithms of Hidden Markov Models (HMM). An HMM is a generative probabilistic model, in which a sequence of observable variables X is generated by a sequence of internal hidden states Z. The hidden states cannot be observed directly. The transitions between hidden states are assumed to follow a first-order Markov chain. They can be specified by the start probability vector π and the transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ_i conditioned on the current hidden state index (e.g. Multinomial, Gaussian). The HMM is thus completely determined by π, A and θ_i.
There are three fundamental problems of HMM:
Given the model parameters and observed data, estimate the optimal sequence of hidden states.
Given the model parameters and observed data, calculate the likelihood of the data.
Given just the observed data, estimate the model parameters.
The first and second problems can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm.
See the references listed below for further detailed information.
References:
[Rabiner89] A tutorial on hidden Markov models and selected applications in speech recognition Lawrence, R.
Rabiner, 1989
Using HMM
Classes in this module include MultinomialHMM, GaussianHMM, and GMMHMM. They implement HMMs with emission probabilities given by a multinomial distribution, a Gaussian distribution and a mixture of Gaussian distributions, respectively.
Building HMM and generating samples
You can build an HMM instance by passing the parameters described above to the constructor. Then, you can generate samples from the HMM by calling sample:
>>> import numpy as np
>>> from sklearn import hmm
>>> startprob = np.array([0.6, 0.3, 0.1])
>>> transmat = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.3, 0.3, 0.4]])
>>> means = np.array([[0.0, 0.0], [3.0, -3.0], [5.0, 10.0]])
>>> covars = np.tile(np.identity(2), (3, 1, 1))
>>> model = hmm.GaussianHMM(3, "full", startprob, transmat)
>>> model.means_ = means
>>> model.covars_ = covars
>>> X, Z = model.sample(100)
Examples:
Demonstration of sampling from HMM
Training HMM parameters and inferring the hidden states
You can train an HMM by calling the fit method. The input is a list of sequences of observed values. Note that since the EM algorithm is a gradient-based optimization method, it will generally get stuck in local optima. You should try to run fit with various initializations and select the model with the highest score. The score of the model can be calculated by the score method. The inferred optimal hidden states can be obtained by calling the predict method. The predict method accepts a decoder algorithm. Currently the Viterbi algorithm (viterbi) and maximum a posteriori estimation (map) are supported. This time, the input is a single sequence of observed values:
>>> model2 = hmm.GaussianHMM(3, "full")
>>> model2.fit([X])
GaussianHMM(algorithm='viterbi',...
>>> Z2 = model.predict(X)
Examples:
Gaussian HMM of stock data
Implementing HMMs with other emission probabilities
If you want to implement other emission probability (e.g. Poisson), you have to make you own HMM class by
inheriting the _BaseHMM and override necessary methods. They should be __init__, _compute_log_likelihood, _set
and _get for addiitional parameters, _initialize_sufcient_statistics, _accumulate_sufcient_statistics and _do_mstep.
1.5 Model Selection
1.5.1 Cross-Validation: evaluating estimator performance
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model
that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict
anything useful on yet-unseen data.
To avoid over-fitting, we have to define two different sets: a training set X_train, y_train which is used for learning the parameters of a predictive model, and a testing set X_test, y_test which is used for evaluating the fitted predictive model.
In scikit-learn such a random split can be quickly computed with the train_test_split helper function. Let's load the iris data set to fit a linear Support Vector Machine model on it:
>>> import numpy as np
>>> from sklearn import cross_validation
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classier:
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
... iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
However, by defining these two sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, test) sets.
A solution is to split the whole data several consecutive times into different train and test sets, and to return the averaged value of the prediction scores obtained with the different sets. Such a procedure is called cross-validation. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
Computing cross-validated metrics
The simplest way to perform cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
The following example demonstrates how to estimate the accuracy of a linear kernel Support Vector Machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
... clf, iris.data, iris.target, cv=5)
...
>>> scores
array([ 1. ..., 0.96..., 0.9 ..., 0.96..., 1. ])
The mean score and the standard deviation of the score estimate are hence given by:
>>> print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() / 2)
Accuracy: 0.97 (+/- 0.02)
By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change
this by passing a custom scoring function, e.g. from the metrics module:
>>> from sklearn import metrics
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5,
... score_func=metrics.f1_score)
...
array([ 1. ..., 0.96..., 0.89..., 0.96..., 1. ])
In the case of the Iris dataset, the samples are balanced across target classes hence the accuracy and the F1-score are
almost equal.
When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by
default (depending on the absence or presence of the target array).
It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:
>>> n_samples = iris.data.shape[0]
>>> cv = cross_validation.ShuffleSplit(n_samples, n_iterations=3,
... test_size=0.3, random_state=0)
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
...
array([ 0.97..., 0.97..., 1. ])
The available cross validation iterators are introduced in the following.
Examples
Receiver operating characteristic (ROC) with cross validation,
Recursive feature elimination with cross-validation,
Parameter estimation using grid search with a nested cross-validation,
Sample pipeline for text feature extraction and evaluation,
Cross validation iterators
The following sections list utilities to generate boolean masks or indices that can be used to generate dataset splits
according to different cross validation strategies.
Boolean mask vs integer indices
Most cross validators support generating both boolean masks or integer indices to select the samples from a
given fold.
When the data matrix is sparse, only the integer indices will work as expected. Integer indexing is hence the
default behavior (since version 0.10).
You can explicitly pass indices=False to the constructor of the CV object (when supported) to use the
boolean mask method instead.
K-fold
KFold divides all the samples into K groups of samples, called folds (if K = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using K - 1 folds, and the fold left out is used for testing.
Example of 2-fold:
>>> import numpy as np
>>> from sklearn.cross_validation import KFold
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> Y = np.array([0, 1, 0, 1])
>>> kf = KFold(len(Y), 2, indices=False)
>>> print kf
sklearn.cross_validation.KFold(n=4, k=2)
>>> for train, test in kf:
... print train, test
[False False True True] [ True True False False]
[ True True False False] [False False True True]
Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set.
Thus, one can create the training/test sets using:
>>> X_train, X_test, y_train, y_test = X[train], X[test], Y[train], Y[test]
If X or Y are scipy.sparse matrices, train and test need to be integer indices. They can be obtained by setting the parameter
indices to True when creating the cross-validation procedure:
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> Y = np.array([0, 1, 0, 1])
>>> kf = KFold(len(Y), 2, indices=True)
>>> for train, test in kf:
... print train, test
[2 3] [0 1]
[0 1] [2 3]
Stratified K-Fold
StratifiedKFold is a variation of K-fold, which returns stratified folds, i.e. which creates folds by preserving the
same percentage for each target class as in the complete set.
Example of stratified 2-fold:
>>> from sklearn.cross_validation import StratifiedKFold
>>> X = [[0., 0.],
... [1., 1.],
... [-1., -1.],
... [2., 2.],
... [3., 3.],
... [4., 4.],
... [0., 1.]]
>>> Y = [0, 0, 0, 1, 1, 1, 0]
>>> skf = StratifiedKFold(Y, 2)
>>> print skf
sklearn.cross_validation.StratifiedKFold(labels=[0 0 0 1 1 1 0], k=2)
>>> for train, test in skf:
... print train, test
[1 4 6] [0 2 3 5]
[0 2 3 5] [1 4 6]
Leave-One-Out - LOO
LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except
one, the test set being the sample left out. Thus, for n samples, we have n different learning sets and n different test
sets. This cross-validation procedure does not waste much data as only one sample is removed from the learning set:
>>> from sklearn.cross_validation import LeaveOneOut
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> Y = np.array([0, 1, 0, 1])
>>> loo = LeaveOneOut(len(Y))
>>> print loo
sklearn.cross_validation.LeaveOneOut(n=4)
>>> for train, test in loo:
... print train, test
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Leave-P-Out - LPO
LeavePOut is very similar to Leave-One-Out, as it creates all the possible training/test sets by removing P samples
from the complete set.
Example of Leave-2-Out:
>>> from sklearn.cross_validation import LeavePOut
>>> X = [[0., 0.], [1., 1.], [-1., -1.], [2., 2.]]
>>> Y = [0, 1, 0, 1]
>>> lpo = LeavePOut(len(Y), 2)
>>> print lpo
sklearn.cross_validation.LeavePOut(n=4, p=2)
>>> for train, test in lpo:
... print train, test
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
Leave-One-Label-Out - LOLO
LeaveOneLabelOut (LOLO) is a cross-validation scheme which holds out the samples according to a third-party
provided label. This label information can be used to encode arbitrary domain specific stratifications of the samples as
integers.
Each training set is thus constituted by all the samples except the ones related to a specic label.
For example, in the cases of multiple experiments, LOLO can be used to create a cross-validation based on the different
experiments: we create a training set using the samples of all the experiments except one:
>>> from sklearn.cross_validation import LeaveOneLabelOut
>>> X = [[0., 0.], [1., 1.], [-1., -1.], [2., 2.]]
>>> Y = [0, 1, 0, 1]
>>> labels = [1, 1, 2, 2]
>>> lolo = LeaveOneLabelOut(labels)
>>> print lolo
sklearn.cross_validation.LeaveOneLabelOut(labels=[1, 1, 2, 2])
>>> for train, test in lolo:
... print train, test
[2 3] [0 1]
[0 1] [2 3]
Another common application is to use time information: for instance the labels could be the year of collection of the
samples and thus allow for cross-validation against time-based splits.
Leave-P-Label-Out
LeavePLabelOut is similar to Leave-One-Label-Out, but removes samples related to P labels for each training/test
set.
Example of Leave-2-Label Out:
>>> from sklearn.cross_validation import LeavePLabelOut
>>> X = [[0., 0.], [1., 1.], [-1., -1.], [2., 2.], [3., 3.], [4., 4.]]
>>> Y = [0, 1, 0, 1, 0, 1]
>>> labels = [1, 1, 2, 2, 3, 3]
>>> lplo = LeavePLabelOut(labels, 2)
>>> print lplo
sklearn.cross_validation.LeavePLabelOut(labels=[1, 1, 2, 2, 3, 3], p=2)
>>> for train, test in lplo:
... print train, test
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
Random permutations cross-validation a.k.a. Shuffle & Split
ShuffleSplit
The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples
are first shuffled and then split into a pair of train and test sets.
It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state
pseudo random number generator.
Here is a usage example:
>>> ss = cross_validation.ShuffleSplit(5, n_iterations=3, test_size=0.25,
... random_state=0)
>>> len(ss)
3
>>> print ss
ShuffleSplit(5, n_iterations=3, test_size=0.25, indices=True, ...)
>>> for train_index, test_index in ss:
... print train_index, test_index
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
ShuffleSplit is thus a good alternative to KFold cross validation that allows a finer control on the number of
iterations and the proportion of samples on each side of the train / test split.
See also
StratifiedShuffleSplit is a variation of ShuffleSplit, which returns stratified splits, i.e. which creates splits
by preserving the same percentage for each target class as in the complete set.
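The snippet below is a minimal sketch of how such stratified random splits could be generated; it assumes that the
constructor mirrors ShuffleSplit, taking the target labels first followed by the n_iterations, test_size and
random_state keywords, and the printed indices (omitted here) depend on the shuffling:
>>> from sklearn.cross_validation import StratifiedShuffleSplit
>>> Y = [0, 0, 0, 1, 1, 1, 0]
>>> sss = StratifiedShuffleSplit(Y, n_iterations=3, test_size=0.5, random_state=0)
>>> for train, test in sss:
... print train, test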
Bootstrapping cross-validation
Bootstrap
Bootstrapping is a general statistics technique that iterates the computation of an estimator on a resampled dataset.
The Bootstrap iterator will generate a user defined number of independent train / test dataset splits. Samples are
then drawn (with replacement) on each side of the split. It is furthermore possible to control the size of the train and test
subsets to make their union smaller than the total dataset if the latter is very large.
Note: Contrary to other cross-validation strategies, bootstrapping will allow some samples to occur several times in
each split.
>>> bs = cross_validation.Bootstrap(9, random_state=0)
>>> len(bs)
3
>>> print bs
Bootstrap(9, n_bootstraps=3, train_size=5, test_size=4, random_state=0)
>>> for train_index, test_index in bs:
... print train_index, test_index
...
[1 8 7 7 8] [0 3 0 5]
[5 4 2 4 2] [6 7 1 0]
[4 7 0 1 1] [5 3 6 5]
Cross validation and model selection
Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal
hyperparameters of the model. This is the topic of the next section: Grid Search: setting estimator parameters.
1.5.2 Grid Search: setting estimator parameters
Grid Search is used to optimize the parameters of a model (e.g. C, kernel and gamma for a Support Vector Classifier,
alpha for Lasso, etc.) using an internal Cross-Validation: evaluating estimator performance scheme.
GridSearchCV
The main class for implementing hyperparameter grid search in scikit-learn is grid_search.GridSearchCV.
This class is passed a base model instance (for example sklearn.svm.SVC()) along with a grid of potential
hyper-parameter values such as:
[{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 {'C': [1, 10, 100, 1000], 'kernel': ['linear']}]
The grid_search.GridSearchCV instance implements the usual estimator API: when fitting it on a dataset
all the possible combinations of hyperparameter values are evaluated and the best combination is retained.
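As an illustration, here is a minimal sketch of such a grid search on the iris dataset; it assumes the 0.12 constructor
signature GridSearchCV(estimator, param_grid) and the best_estimator_ attribute, and the outputs are
omitted:
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = [{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
...               {'C': [1, 10, 100, 1000], 'kernel': ['linear']}]
>>> clf = grid_search.GridSearchCV(svm.SVC(), parameters)  # every combination is evaluated with CV on fit
>>> clf = clf.fit(iris.data, iris.target)
>>> best_svc = clf.best_estimator_  # estimator refitted with the retained combination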
Model selection: development and evaluation
Model selection with GridSearchCV can be seen as a way to use the labeled data to train the hyper-
parameters of the grid.
When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid
search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV
instance) and an evaluation set to compute performance metrics.
This can be done by using the cross_validation.train_test_split utility function.
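For instance, the following sketch splits the iris data into a development set and a held-out evaluation set; the
test_size and random_state keyword names are assumed to follow the same conventions as the ShuffleSplit
examples above:
>>> from sklearn import cross_validation, datasets
>>> iris = datasets.load_iris()
>>> X_dev, X_eval, y_dev, y_eval = cross_validation.train_test_split(
...     iris.data, iris.target, test_size=0.25, random_state=0)
>>> # X_dev / y_dev are fed to GridSearchCV, X_eval / y_eval are kept for the final evaluation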
Examples
See Parameter estimation using grid search with a nested cross-validation for an example of Grid Search com-
putation on the digits dataset.
See Sample pipeline for text feature extraction and evaluation for an example of Grid Search coupling parame-
ters from a text documents feature extractor (n-gram count vectorizer and TF-IDF transformer) with a classifier
(here a linear SVM trained with SGD with either elastic net or L2 penalty) using a pipeline.Pipeline
instance.
Note: Computations can be run in parallel if your OS supports it, by using the keyword n_jobs=-1, see function
signature for more details.
Alternatives to brute force grid search
Model specific cross-validation
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a
single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for
model selection of this parameter.
The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In
this case we say that we compute the regularization path of the estimator.
Here is the list of such models (a short usage sketch follows the list):
linear_model.RidgeCV([alphas, ...]) Ridge regression with built-in cross-validation.
linear_model.RidgeClassifierCV([alphas, ...]) Ridge classifier with built-in cross-validation.
linear_model.LarsCV([fit_intercept, ...]) Cross-validated Least Angle Regression model
linear_model.LassoLarsCV([fit_intercept, ...]) Cross-validated Lasso, using the LARS algorithm
linear_model.LassoCV([eps, n_alphas, ...]) Lasso linear model with iterative fitting along a regularization path
linear_model.ElasticNetCV([rho, eps, ...]) Elastic Net model with iterative fitting along a regularization path
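As a minimal sketch of this family of estimators (using the diabetes dataset; the alpha_ attribute is documented in the
LassoCV reference below and the outputs are omitted), the whole regularization path is fitted on each fold and the
best alpha is retained:
>>> from sklearn import linear_model, datasets
>>> diabetes = datasets.load_diabetes()
>>> clf = linear_model.LassoCV(cv=5)
>>> clf = clf.fit(diabetes.data, diabetes.target)  # one path per fold instead of one fit per alpha value
>>> clf.alpha_  # the amount of penalization chosen by cross-validation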
sklearn.linear_model.RidgeCV
class sklearn.linear_model.RidgeCV(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True,
normalize=False, score_func=None, loss_func=None, cv=None,
gcv_mode=None)
Ridge regression with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-
validation.
Parameters alphas: numpy array of shape [n_alpha] :
Array of alpha values to try. Small positive values of alpha improve the conditioning of the problem
and reduce the variance of the estimates. Alpha corresponds to (2*C)^-1 in other linear models such as
LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
score_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (big is good) if None is passed, the score of the estimator is maximized
loss_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (small is good) if None is passed, the score of the estimator is maximized
cv : cross-validation generator, optional
If None, Generalized Cross-Validation (efficient Leave-One-Out) will be used.
See Also:
Ridge : Ridge regression
RidgeClassifier : Ridge classifier
RidgeCV : Ridge regression with built-in cross validation
Attributes
coef_ : array, shape = [n_features] or [n_classes, n_features]
Weight vector(s).
gcv_mode : {None, 'auto', 'svd', 'eigen'}, optional
Flag indicating which strategy to use when performing Generalized Cross-Validation. Options are:
'auto' : use svd if n_samples > n_features, otherwise use eigen
'svd' : force computation via singular value decomposition of X
'eigen' : force computation via eigendecomposition of X^T X
The 'auto' mode is the default and is intended to pick the cheaper option of the two depending upon the
shape of the training data.
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, sample_weight]) Fit Ridge regression model
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None,
loss_func=None, cv=None, gcv_mode=None)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, sample_weight=1.0)
Fit Ridge regression model
Parameters X : array-like, shape = [n_samples, n_features]
Training data
y : array-like, shape = [n_samples] or [n_samples, n_responses]
Target values
sample_weight : float or array-like of shape [n_samples]
Sample weight
Returns self : Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.RidgeClassifierCV
class sklearn.linear_model.RidgeClassifierCV(alphas=array([ 0.1, 1., 10. ]),
fit_intercept=True, normalize=False,
score_func=None, loss_func=None, cv=None,
class_weight=None)
Ridge classifier with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-
validation. Currently, only the n_features > n_samples case is handled efficiently.
Parameters alphas: numpy array of shape [n_alpha] :
Array of alpha values to try. Small positive values of alpha improve the conditioning of
the problem and reduce the variance of the estimates. Alpha corresponds to (2*C)^-1 in
other linear models such as LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
score_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (big is good) if None is passed, the score of the estimator is maximized
loss_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (small is good) if None is passed, the score of the estimator is maximized
cv : cross-validation generator, optional
If None, Generalized Cross-Validation (efcient Leave-One-Out) will be used.
class_weight : dict, optional
Weights associated with classes in the form {class_label : weight}. If not given, all
classes are supposed to have weight one.
See Also:
Ridge : Ridge regression
RidgeClassifier : Ridge classifier
RidgeCV : Ridge regression with built-in cross validation
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
Methods
decision_function(X)
fit(X, y[, sample_weight, class_weight]) Fit the ridge classifier.
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values according to the fitted model.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None,
loss_func=None, cv=None, class_weight=None)
fit(X, y, sample_weight=1.0, class_weight=None)
Fit the ridge classifier.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values.
sample_weight : float or numpy array of shape [n_samples]
Sample weight
class_weight : dict, optional
Weights associated with classes in the form {class_label : weight}. If not given, all
classes are supposed to have weight one.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values according to the fitted model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns y : array, shape = [n_samples]
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LarsCV
class sklearn.linear_model.LarsCV(fit_intercept=True, verbose=False, max_iter=500,
normalize=True, precompute='auto', cv=None, max_n_alphas=1000,
n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
Cross-validated Least Angle Regression model
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: integer, optional :
Maximum number of iterations to perform.
cv : crossvalidation generator, optional
see sklearn.cross_validation module. If None is passed, default to a 5-fold strategy
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-
validation
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems.
See Also:
lars_path, LassoLARS, LassoLarsCV
Attributes
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
coef_path : array, shape = [n_features, n_alpha]
the varying values of the coefficients along the path
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto',
cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
Training data.
y : array-like, shape = [n_samples]
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LassoLarsCV
class sklearn.linear_model.LassoLarsCV(fit_intercept=True, verbose=False, max_iter=500,
normalize=True, precompute='auto', cv=None, max_n_alphas=1000,
n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
Cross-validated Lasso, using the LARS algorithm
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: integer, optional :
Maximum number of iterations to perform.
cv : crossvalidation generator, optional
see sklearn.cross_validation module. If None is passed, default to a 5-fold strategy
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-
validation
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
See Also:
lars_path, LassoLars, LarsCV, LassoCV
Notes
The object solves the same problem as the LassoCV object. However, unlike the LassoCV, it finds the relevant
alpha values by itself. In general, because of this property, it will be more stable. However, it is more fragile to
heavily multicollinear datasets.
It is more efficient than the LassoCV if only a small number of features are selected compared to the total
number, for instance if there are very few samples compared to the number of features.
Attributes
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
coef_path : array, shape = [n_features, n_alpha]
the varying values of the coefficients along the path
alphas_ : array, shape = [n_alpha]
the different values of alpha along the path
cv_alphas : array, shape = [n_cv_alphas]
all the values of alpha along the path for the different folds
cv_mse_path_ : array, shape = [n_folds, n_cv_alphas]
the mean square error on left-out for each fold along the path (alpha values given by cv_alphas)
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto',
cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
Training data.
y : array-like, shape = [n_samples]
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LassoCV
class sklearn.linear_model.LassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True,
normalize=False, precompute='auto', max_iter=1000,
tol=0.0001, copy_X=True, cv=None, verbose=False)
Lasso linear model with iterative fitting along a regularization path
The best model is selected by cross-validation.
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: int, optional :
The maximum number of iterations
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
cv : integer or crossvalidation generator, optional
If an integer is passed, it is the number of fold (default 3). Specic crossvalidation ob-
jects can be passed, see sklearn.cross_validation module for the list of possible objects
verbose : bool or integer
amount of verbosity
See Also:
lars_path, lasso_path, LassoLars, Lasso, LassoLarsCV
Notes
See examples/linear_model/lasso_path_with_crossvalidation.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a fortran
contiguous numpy array.
Attributes
alpha_ : float
The amount of penalization chosen by cross validation
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
mse_path_ : array, shape = [n_alphas, n_folds]
mean square error for the test set on each fold, varying alpha
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit linear model with coordinate descent along decreasing alphas
get_params([deep]) Get parameters for the estimator
path(X, y[, eps, n_alphas, alphas, ...]) Compute Lasso path with coordinate descent
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False,
precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit linear model with coordinate descent along decreasing alphas using cross-validation
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
static path(X, y, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None,
fit_intercept=True, normalize=False, copy_X=True, verbose=False, **params)
Compute Lasso path with coordinate descent
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
fit_intercept : bool
Fit or not an intercept
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity
params : kwargs
keyword arguments passed to the Lasso objects
Returns models : a list of models along the regularization path
See Also:
lars_path, Lasso, LassoLars, LassoCV, LassoLarsCV,
sklearn.decomposition.sparse_encode
Notes
See examples/linear_model/plot_lasso_coordinate_descent_path.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
fortran contiguous numpy array.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.ElasticNetCV
class sklearn.linear_model.ElasticNetCV(rho=0.5, eps=0.001, n_alphas=100, alphas=None,
fit_intercept=True, normalize=False, precompute='auto',
max_iter=1000, tol=0.0001, cv=None, copy_X=True,
verbose=0, n_jobs=1)
Elastic Net model with iterative fitting along a regularization path
The best model is selected by cross-validation.
Parameters rho : float, optional
float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties). For
rho = 1 the penalty is an L1 penalty (as in the Lasso); for rho = 0 it is an L2 penalty (as in Ridge); for
0 < rho < 1, the penalty is a combination of L1 and L2. This parameter can be a list, in which case
the different values are tested by cross-validation and the one giving the best prediction
score is used. Note that a good choice of list of values for rho is often to put more values
close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: int, optional :
The maximum number of iterations
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
cv : integer or crossvalidation generator, optional
If an integer is passed, it is the number of fold (default 3). Specic crossvalidation ob-
jects can be passed, see sklearn.cross_validation module for the list of possible objects
verbose : bool or integer
amount of verbosity
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs. Note that
this is used only if multiple values for rho are given.
See Also:
enet_path, ElasticNet
Notes
See examples/linear_model/lasso_path_with_crossvalidation.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a fortran
contiguous numpy array.
The parameter rho corresponds to alpha in the glmnet R package while alpha corresponds to the lambda param-
eter in glmnet. More specifically, the optimization objective is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * rho * ||w||_1 + 0.5 * alpha * (1 - rho) * ||w||^2_2
If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2
for:
alpha = a + b and rho = a / (a + b)
Attributes
alpha_ : float
The amount of penalization chosen by cross validation
rho_ : float
The compromise between l1 and l2 penalization chosen by cross validation
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
mse_path_ : array, shape = [n_rho, n_alpha, n_folds]
mean square error for the test set on each fold, varying rho and alpha
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit linear model with coordinate descent along decreasing alphas
get_params([deep]) Get parameters for the estimator
path(X, y[, rho, eps, n_alphas, alphas, ...]) Compute Elastic-Net path with coordinate descent
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(rho=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False,
precompute='auto', max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0,
n_jobs=1)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit linear model with coordinate descent along decreasing alphas using cross-validation
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
static path(X, y, rho=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None,
fit_intercept=True, normalize=False, copy_X=True, verbose=False, **params)
Compute Elastic-Net path with coordinate descent
The Elastic Net optimization function is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * rho * ||w||_1 + 0.5 * alpha * (1 - rho) * ||w||^2_2
Parameters X : numpy array of shape [n_samples, n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
rho : float, optional
float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties). rho=1
corresponds to the Lasso
eps : float
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
fit_intercept : bool
Fit or not an intercept
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity
params : kwargs
keyword arguments passed to the Lasso objects
Returns models : a list of models along the regularization path
See Also:
ElasticNet, ElasticNetCV
Notes
See examples/plot_lasso_coordinate_descent_path.py for an example.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
Information Criterion
Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization
parameter by computing a single regularization path (instead of several when using cross-validation).
Here is the list of models benefiting from the Akaike Information Criterion (AIC) or the Bayesian Information
Criterion (BIC) for automated model selection:
linear_model.LassoLarsIC([criterion, ...]) Lasso model fit with Lars using BIC or AIC for model selection
sklearn.linear_model.LassoLarsIC
class sklearn.linear_model.LassoLarsIC(criterion='aic', fit_intercept=True, verbose=False,
normalize=True, precompute='auto', max_iter=500,
eps=2.2204460492503131e-16, copy_X=True)
Lasso model fit with Lars using BIC or AIC for model selection
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
AIC is the Akaike information criterion and BIC is the Bayes Information criterion. Such criteria are useful
to select the value of the regularization parameter by making a trade-off between the goodness of t and the
complexity of the model. A good model should explain well the data while being simple.
Parameters criterion : 'bic' | 'aic'
The type of criterion to use.
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: integer, optional :
Maximum number of iterations to perform. Can be used for early stopping.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
See Also:
lars_path, LassoLars, LassoLarsCV
Notes
The estimation of the number of degrees of freedom is given by:
On the degrees of freedom of the lasso Hui Zou, Trevor Hastie, and Robert Tibshirani Ann. Statist. Volume
35, Number 5 (2007), 2173-2192.
http://en.wikipedia.org/wiki/Akaike_information_criterion http://en.wikipedia.org/wiki/Bayesian_information_criterion
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.LassoLarsIC(criterion='bic')
>>> clf.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111])
...
LassoLarsIC(copy_X=True, criterion='bic', eps=..., fit_intercept=True,
max_iter=500, normalize=True, precompute='auto',
verbose=False)
>>> print(clf.coef_)
[ 0. -1.11...]
Attributes
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
alpha_ : float
the alpha parameter chosen by the information criterion
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, copy_X]) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(criterion='aic', fit_intercept=True, verbose=False, normalize=True, precompute='auto',
max_iter=500, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, copy_X=True)
Fit the model using X, y as training data.
Parameters x : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum()
and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values
are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
Out of Bag Estimates
When using ensemble methods based upon bagging, i.e. generating new training sets using sampling with replacement,
part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left
out.
This left out portion can be used to estimate the generalization error without having to rely on a separate validation
set. This estimate comes for free as no additional data is needed and can be used for model selection.
This is currently implemented in the following classes (a short usage sketch follows the list):
ensemble.RandomForestClassifier([...]) A random forest classifier.
ensemble.RandomForestRegressor([...]) A random forest regressor.
ensemble.ExtraTreesClassifier([...]) An extra-trees classifier.
ensemble.ExtraTreesRegressor([n_estimators, ...]) An extra-trees regressor.
ensemble.GradientBoostingClassifier([loss, ...]) Gradient Boosting for classification.
ensemble.GradientBoostingRegressor([loss, ...]) Gradient Boosting for regression.
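As a minimal sketch of how such an estimate could be obtained with a random forest (relying on the oob_score
constructor flag and the oob_score_ attribute documented below; outputs omitted):
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> clf = RandomForestClassifier(n_estimators=10, oob_score=True, random_state=0)
>>> clf = clf.fit(iris.data, iris.target)
>>> clf.oob_score_  # accuracy estimated on the samples left out of each bootstrap sample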
sklearn.ensemble.RandomForestClassifier
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini',
max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1,
max_features='auto', bootstrap=True,
compute_importances=False, oob_score=False,
n_jobs=1, random_state=None, verbose=0)
A random forest classifier.
A random forest is a meta estimator that fits a number of classical decision trees on various sub-samples of the
dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='gini')
The function to measure the quality of a split. Supported criteria are 'gini' for the Gini
impurity and 'entropy' for the information gain. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parame-
ter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after
the split, one of the leaves would contain less than min_samples_leaf samples.
Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split:
If 'auto', then max_features=sqrt(n_features) on classification tasks and
max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=True)
Whether feature importances are computed and stored into the
feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
DecisionTreeClassifier, ExtraTreesClassifier
References
[R59]
Attributes
feature_importances_ : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ : array, shape = [n_samples, n_classes]
Decision function computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True,
compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classication, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
The predicted class of an input sample is computed as the majority prediction of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the mean predicted class log-
probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class log-probabilities of the input samples. Classes are ordered by arithmetical
order.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities
of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If 'median' (resp. 'mean'), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., '1.25*mean') may also be used. If None and if available, the object attribute
threshold is used. Otherwise, 'mean' is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.ensemble.RandomForestRegressor
class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse',
max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1,
max_features='auto', bootstrap=True,
compute_importances=False, oob_score=False,
n_jobs=1, random_state=None, verbose=0)
A random forest regressor.
A random forest is a meta estimator that fits a number of classical decision trees on various sub-samples of the
dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='mse')
The function to measure the quality of a split. The only supported criterion is 'mse' for
the mean squared error. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specic.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parame-
ter is tree-specic.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after
the split, one of the leaves would contain less then min_samples_leaf samples.
Note: this parameter is tree-specic.
min_density : oat, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specic.
max_features : int, string or None, optional (default=auto)
The number of features to consider when looking for the best split:
If auto, then max_features=sqrt(n_features) on classication tasks and
max_features=n_features on regression problems.
If sqrt, then max_features=sqrt(n_features).
If log2, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specic.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the
feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
DecisionTreeRegressor, ExtraTreesRegressor
References
[R60]
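Examples
For orientation, a minimal usage sketch (added here; the toy data and the printed shape are illustrative assumptions, not drawn from the original documentation):
>>> from sklearn.ensemble import RandomForestRegressor
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> y = [0., 1., 2., 3.]
>>> reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
>>> reg.predict([[1.5, 1.5]]).shape
(1,)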
Attributes
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ : array, shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True,
compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean of the predicted regression targets of
the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.ensemble.ExtraTreesClassifier
class sklearn.ensemble.ExtraTreesClassifier(n_estimators=10, criterion='gini',
max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1,
max_features='auto', bootstrap=False,
compute_importances=False, oob_score=False,
n_jobs=1, random_state=None, verbose=0)
An extra-trees classifier.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on
various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='gini')
The function to measure the quality of a split. Supported criteria are 'gini' for the Gini
impurity and 'entropy' for the information gain. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after
the split, one of the leaves would contain less than min_samples_leaf samples.
Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed, which results
in data copying. If min_density equals one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split.
If 'auto', then max_features=sqrt(n_features) on classification tasks and
max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the
feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
sklearn.tree.ExtraTreeClassifier : Base classifier for this ensemble.
RandomForestClassifier : Ensemble classifier based on trees with optimal splits.
References
[R57]
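Examples
A minimal usage sketch (added for illustration; the toy data and the printed shape are assumptions, not part of the original reference):
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> X = [[0., 0.], [1., 1.], [0., 1.], [1., 0.]]
>>> y = [0, 1, 0, 1]          # the label equals the first feature
>>> clf = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict_proba([[1., 0.]]).shape
(1, 2)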
Attributes
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ : array, shape = [n_samples, n_classes]
Decision function computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False,
compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
The predicted class of an input sample is computed as the majority prediction of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the mean predicted class log-
probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class log-probabilities of the input samples. Classes are ordered by arithmetical
order.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.ensemble.ExtraTreesRegressor
class sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, criterion='mse',
max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1,
max_features='auto', bootstrap=False,
compute_importances=False, oob_score=False,
n_jobs=1, random_state=None, verbose=0)
An extra-trees regressor.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on
various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='mse')
The function to measure the quality of a split. The only supported criterion is 'mse' for
the mean squared error. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after
the split, one of the leaves would contain less than min_samples_leaf samples.
Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed, which results
in data copying. If min_density equals one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split:
If 'auto', then max_features=sqrt(n_features) on classification tasks and
max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees. Note: this parameter is tree-specific.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the
feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
sklearn.tree.ExtraTreeRegressor : Base estimator for this ensemble.
RandomForestRegressor : Ensemble regressor using trees with optimal splits.
References
[R58]
Attributes
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ : array, shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1,
min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False,
compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean of the predicted regression targets of
the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.ensemble.GradientBoostingClassifier
class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learn_rate=0.1,
n_estimators=100, subsample=1.0, min_samples_split=1,
min_samples_leaf=1, max_depth=3,
init=None, random_state=None)
Gradient Boosting for classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differen-
tiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial
or multinomial deviance loss function. Binary classification is a special case where only a single regression tree
is induced.
Parameters loss : {'deviance', 'ls'}, optional (default='deviance')
loss function to be optimized. 'deviance' refers to deviance (= logistic regression) for
classification with probabilistic outputs. 'ls' refers to least squares regression.
learn_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learn_rate. There is a trade-off
between learn_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to over-
fitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
subsample : float, optional (default=1.0)
The fraction of samples to be used for tting the individual base learners. If smaller than
1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter
n_estimators.
See Also:
sklearn.tree.DecisionTreeClassifier, RandomForestClassifier
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Examples
>>> samples = [[0, 0, 2], [1, 0, 0]]
>>> labels = [0, 1]
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> gb = GradientBoostingClassifier().fit(samples, labels)
>>> print gb.predict([[0.5, 0, 0]])
[0]
Methods
fit(X, y) Fit the gradient boosting model.
fit_stage(i, X, X_argsorted, y, y_pred, ...) Fit another stage of n_classes_ trees to the boosting model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
staged_decision_function(X) Compute decision function for X.
__init__(loss='deviance', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1,
min_samples_leaf=1, max_depth=3, init=None, random_state=None)
fit(X, y)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features. Use fortran-style to avoid memory copies.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression). For classification,
labels must correspond to classes 0, 1, ..., n_classes_-1.
Returns self : object
Returns self.
fit_stage(i, X, X_argsorted, y, y_pred, sample_mask)
Fit another stage of n_classes_ trees to the boosting model.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_proba(X)
Predict class probabilities for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
staged_decision_function(X)
Compute decision function for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns f : array of shape = [n_samples, n_classes]
The decision function of the input samples. Classes are ordered by arithmetical order.
Regression and binary classification are special cases with n_classes == 1.
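A hedged sketch of this monitoring pattern (the synthetic data, the train/test split and the assumption that one decision-function array is yielded per boosting stage are illustrative, not taken from the original reference):
>>> import numpy as np
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> rng = np.random.RandomState(0)
>>> X, y = rng.rand(120, 3), rng.randint(0, 2, 120)
>>> X_train, y_train = X[:100], y[:100]
>>> X_test, y_test = X[100:], y[100:]
>>> clf = GradientBoostingClassifier(n_estimators=30).fit(X_train, y_train)
>>> staged = list(clf.staged_decision_function(X_test))   # one array per stage
>>> len(staged) == clf.n_estimators
True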
sklearn.ensemble.GradientBoostingRegressor
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learn_rate=0.1,
n_estimators=100, subsample=1.0, min_samples_split=1,
min_samples_leaf=1, max_depth=3,
init=None, random_state=None)
Gradient Boosting for regression.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differ-
entiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
Parameters loss : {'ls', 'lad'}, optional (default='ls')
loss function to be optimized. 'ls' refers to least squares regression. 'lad' (least absolute
deviation) is a highly robust loss function solely based on order information of the input
variables.
learn_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learn_rate. There is a trade-off
between learn_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to over-
fitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
subsample : float, optional (default=1.0)
The fraction of samples to be used for tting the individual base learners. If smaller than
1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter
n_estimators.
See Also:
sklearn.tree.DecisionTreeRegressor, RandomForestRegressor
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Examples
>>> samples = [[0, 0, 2], [1, 0, 0]]
>>> labels = [0, 1]
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> gb = GradientBoostingRegressor().fit(samples, labels)
>>> print gb.predict([[0, 0, 0]])
[ 1.32806997e-05]
Attributes
feature_importances_ : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : array, shape = [n_estimators]
Score of the training dataset obtained using an out-of-bag estimate. The i-th score
oob_score_[i] is the deviance (= loss) of the model at iteration i on the out-of-bag sample.
train_score_ : array, shape = [n_estimators]
The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the
in-bag sample. If subsample == 1 this is the deviance on the training data.
Methods
fit(X, y) Fit the gradient boosting model.
fit_stage(i, X, X_argsorted, y, y_pred, ...) Fit another stage of n_classes_ trees to the boosting model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
staged_decision_function(X) Compute decision function for X.
staged_predict(X) Predict regression target at each stage for X.
__init__(loss='ls', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1,
min_samples_leaf=1, max_depth=3, init=None, random_state=None)
fit(X, y)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features. Use fortran-style to avoid memory copies.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression). For classification,
labels must correspond to classes 0, 1, ..., n_classes_-1.
Returns self : object
Returns self.
fit_stage(i, X, X_argsorted, y, y_pred, sample_mask)
Fit another stage of n_classes_ trees to the boosting model.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
staged_decision_function(X)
Compute decision function for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns f : array of shape = [n_samples, n_classes]
The decision function of the input samples. Classes are ordered by arithmetical order.
Regression and binary classification are special cases with n_classes == 1.
staged_predict(X)
Predict regression target at each stage for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted value of the input samples.
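A hedged sketch of stage-wise monitoring with staged_predict (the synthetic data, the train/test split and the mean-squared-error metric are illustrative placeholders):
>>> import numpy as np
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> rng = np.random.RandomState(0)
>>> X, y = rng.rand(120, 3), rng.rand(120)
>>> X_train, y_train = X[:100], y[:100]
>>> X_test, y_test = X[100:], y[100:]
>>> est = GradientBoostingRegressor(n_estimators=30).fit(X_train, y_train)
>>> test_mse = [np.mean((y_pred - y_test) ** 2)       # test error after each stage
...             for y_pred in est.staged_predict(X_test)]
>>> len(test_mse) == est.n_estimators
True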
1.5.3 Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of
steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two
purposes here:
Convenience: You only have to call fit and predict once on your data to fit a whole sequence of
estimators.
Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.
For estimators to be usable within a pipeline, all except the last one need to have a transform function. Otherwise,
the dataset cannot be passed through this estimator.
Usage
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want
to give this step and value is an estimator object:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, degree=3, gamma=0.0, kernel='rbf', probability=False,
shrinking=True, tol=0.001, verbose=False))])
The estimators of the pipeline are stored as a list in the steps attribute:
>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
and as a dict in named_steps:
>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
>>> clf.set_params(svm__C=10) # doctest: +NORMALIZE_WHITESPACE
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
coef0=0.0, degree=3, gamma=0.0, kernel='rbf', probability=False,
shrinking=True, tol=0.001, verbose=False))])
This is particularly important for doing grid searches:
>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
... svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
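The grid search object is then fit like any other estimator. A hedged sketch on placeholder data (the random array below is purely illustrative and not part of the original guide):
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X_toy, y_toy = rng.rand(40, 20), rng.randint(0, 2, 40)
>>> grid_search = grid_search.fit(X_toy, y_toy)
After fitting, the best parameter combination found during cross-validation is refit on the full data and exposed on the grid_search object (e.g. through its best_estimator_ attribute).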
Examples:
Pipeline Anova SVM
Sample pipeline for text feature extraction and evaluation
Pipelining: chaining a PCA and a logistic regression
Explicit feature map approximation for RBF kernels
SVM-Anova: SVM with univariate feature selection
Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it
on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator
is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
1.6 Dataset transformations
1.6.1 Preprocessing data
The sklearn.preprocessing package provides several common utility functions and transformer classes to
change raw feature vectors into a representation that is more suitable for the downstream estimators.
Standardization or Mean Removal and Variance Scaling
Standardization of datasets is a common requirement for many machine learning estimators implemented in the
scikit: they might behave badly if the individual features do not more or less look like standard normally distributed
data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean
value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support
Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and
have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might
dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
>>> from sklearn import preprocessing
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
Scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The preprocessing module further provides a utility class Scaler that implements the Transformer API to
compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on
the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:
>>> scaler = preprocessing.Scaler().fit(X)
>>> scaler
Scaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1. ..., 0. ..., 0.33...])
>>> scaler.std_
array([ 0.81..., 0.81..., 1.24...])
>>> scaler.transform(X)
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
The scaler instance can then be used on new data to transform it the same way it did on the training set:
>>> scaler.transform([[-1., 1., 0.]])
array([[-2.44..., 1.22..., -0.26...]])
It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to
the constructor of Scaler.
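As noted above, Scaler is suitable for use inside a Pipeline; a hedged sketch of that usage (the toy labels and the SVC step are illustrative assumptions added here):
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> scaling_clf = Pipeline([('scale', preprocessing.Scaler()),
...                         ('svm', SVC())])
>>> scaling_clf = scaling_clf.fit(X, [0, 1, 0])   # X as defined above
The mean and standard deviation learned from the training data are then reapplied automatically to any data passed to predict.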
References:
Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normal-
ize/standardize/rescale the data?
Scaling vs Whitening
It is sometimes not enough to center and scale the features independently, since a downstream model can further
make some assumption on the linear independence of the features.
To address this issue you can use sklearn.decomposition.PCA or
sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear
correlation across features.
Sparse input
scale and Scaler accept scipy.sparse matrices as input only when with_mean=False is explicitly
passed to the constructor. Otherwise a ValueError will be raised as silently centering would break the
sparsity and would often crash the execution by allocating excessive amounts of memory unintentionally.
If the centered data is expected to be small enough, explicitly convert the input to an array using the toarray
method of sparse matrices instead.
For sparse input the data is converted to the Compressed Sparse Rows representation (see
scipy.sparse.csr_matrix). To avoid unnecessary memory copies, it is recommended to choose the
CSR representation upstream.
Scaling target variables in regression
scale and Scaler work out-of-the-box with 1d arrays. This is very useful for scaling the target / response
variables used for regression.
Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan
to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset,
either using the l1 or l2 norms:
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40..., 0.81...],
[ 1. ..., 0. ..., 0. ...],
[ 0. ..., 0.70..., -0.70...]])
The preprocessing module further provides a utility class Normalizer that implements the same operation
using the Transformer API (even though the fit method is useless in this case: the class is stateless as this
operation treats samples independently).
This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:
>>> normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')
The normalizer instance can then be used on sample vectors as any transformer:
>>> normalizer.transform(X)
array([[ 0.40..., -0.40..., 0.81...],
[ 1. ..., 0. ..., 0. ...],
[ 0. ..., 0.70..., -0.70...]])
>>> normalizer.transform([[-1., 1., 0.]])
array([[-0.70..., 0.70..., 0. ...]])
Sparse input
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as
input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see
scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory
copies, it is recommended to choose the CSR representation upstream.
Binarization
Feature binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for
downstream probabilistic estimators that make the assumption that the input data is distributed according to a multi-variate
Bernoulli distribution. For instance, this is the case for the most common class of (Restricted) Boltzmann Machines
(not yet implemented in the scikit).
It is also common among the text processing community to use binary feature values (probably to simplify the
probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform
slightly better in practice.
As for the Normalizer, the utility class Binarizer is meant to be used in the early stages of
sklearn.pipeline.Pipeline. The fit method does nothing as each sample is treated independently of
others:
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
It is possible to adjust the threshold of the binarizer:
>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 0., 0.]])
As for the Scaler and Normalizer classes, the preprocessing module provides a companion function binarize
to be used when the transformer API is not necessary.
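A minimal sketch of that companion function on the same X and threshold as above (added for illustration; the keyword name and the exact output formatting are assumptions):
>>> preprocessing.binarize(X, threshold=1.1)
array([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 0., 0.]])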
Sparse input
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see
scipy.sparse.csr_matrix). To avoid unnecessary memory copies, it is recommended to choose the
CSR representation upstream.
Label preprocessing
Label binarization
LabelBinarizer is a utility class to help create a label indicator matrix from a list of multi-class labels:
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
LabelBinarizer also supports multiple labels per instance:
>>> lb.fit_transform([(1, 2), (3,)])
array([[ 1., 1., 0.],
[ 0., 0., 1.]])
>>> lb.classes_
array([1, 2, 3])
Label encoding
LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-
1. This is sometimes useful for writing efficient Cython routines. LabelEncoder can be used as follows:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical
labels:
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
1.6.2 Feature extraction
The sklearn.feature_extraction module can be used to extract features in a format supported by machine
learning algorithms from datasets consisting of formats such as text and image.
Note: Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data,
such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique
applied on these features.
Loading features from dicts
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict
objects to the NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python's dict has the advantages of being convenient to use, being sparse
(absent features need not be stored) and storing feature names in addition to values.
DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (aka nominal, discrete)
features. Categorical features are attribute-value pairs where the value is restricted to a list of discrete possibilities
without ordering (e.g. topic identifiers, types of objects, tags, names...).
In the following, "city" is a categorical attribute while "temperature" is a traditional numerical feature:
>>> measurements = [
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Lan-
guage Processing models that typically work by extracting feature windows around a particular word of interest.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as
complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of
features extracted around the word "sat" in the sentence "The cat sat on the mat.":
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe
after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in COOrdinate format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting
matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to
make the resulting data structure able to fit in memory, the DictVectorizer class uses a scipy.sparse matrix
by default instead of a numpy.ndarray.
Text feature extraction
The Bag of Words representation
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of
symbols, cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a
fixed size rather than the raw text documents with variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from
text content, namely:
tokenizing strings and giving an integer id for each possible token, for instance by using whitespaces and
punctuation as token separators.
counting the occurrences of tokens in each document.
normalizing and weighting with diminishing importance tokens that occur in the majority of samples / docu-
ments.
In this scheme, features and samples are defined as follows:
each individual token occurrence frequency (normalized or not) is treated as a feature.
the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token
(e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This
specific strategy (tokenization, counting and normalization) is called the Bag of Words or Bag of n-grams represen-
tation. Documents are described by word occurrences while completely ignoring the relative position information
of the words in the document.
When combined with TF-IDF normalization, the bag of words encoding is also known as the Vector Space Model.
Sparsity
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have
many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order
of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic matrix / vector operations, imple-
mentations will typically use a sparse representation such as the implementations available in the scipy.sparse
package.
Common Vectorizer usage
CountVectorizer implements both tokenization and occurrence counting in a single class:
>>> from sklearn.feature_extraction.text import CountVectorizer
This model has many parameters, however the default values are quite reasonable (please see the reference documen-
tation for the details):
>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer='word', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
preprocessor=None, stop_words=None, strip_accents=None,
token_pattern=u'\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<type 'numpy.int64'>'
with 19 stored elements in COOrdinate format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does
this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.")
[u'this', u'is', u'text', u'document', u'to', u'analyze']
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the
resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names()
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform
method:
>>> vectorizer.transform(['Something completely new.']).toarray()
...
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded as
equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some
of the local ordering information we can extract 2-grams of words in addition to the 1-grams (the words themselves):
>>> bigram_vectorizer = CountVectorizer(min_n=1, max_n=2,
... token_pattern=ur'\b\w+\b')
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!')
[u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']
The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local
positioning patterns:
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)
In particular the interrogative form "Is this" is only present in the last document:
>>> feature_index = bigram_vectorizer.vocabulary_.get(u'is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)
TF-IDF normalization
In a large text corpus, some words will be very present (e.g. "the", "a", "is" in English), hence carrying very little
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common
to use the tf-idf transform.
Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is originally
a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that
has also found good use in document classification and clustering.
This normalization is implemented by the text.TfidfTransformer class:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> transformer
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
Again please see the reference documentation for the details on all the parameters.
Let's take an example with the following counts. The first term is present 100% of the time and hence not very interesting.
The two other features are present in less than 50% of the documents and are hence probably more representative of the
content of the documents:
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<type 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
>>> tfidf.toarray()
array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])
Each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call
are stored in a model attribute:
>>> transformer.idf_
array([ 1. ..., 2.25..., 1.84...])
As tf-idf is very often used for text features, there is also another class called TfidfVectorizer that combines
all the options of CountVectorizer and TfidfTransformer in a single model:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
While the tf-idf normalization is often very useful, there might be cases where binary occurrence markers
offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular,
some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short
texts are likely to have noisy tf-idf values while the binary occurrence info is more stable.
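For instance, a minimal sketch of such a binary encoding, reusing the corpus defined above (the variable names are illustrative):
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> binary_vectorizer = CountVectorizer(binary=True)
>>> X_binary = binary_vectorizer.fit_transform(corpus).toarray()
>>> X_binary.max()   # occurrence counts are capped at 1
1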
As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for
instance by pipelining the feature extractor with a classifier, as sketched below:
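A minimal sketch of such a pipeline (the estimator choices and the parameter grid are illustrative assumptions, not recommendations):
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.grid_search import GridSearchCV
>>> pipeline = Pipeline([
...     ('vect', TfidfVectorizer()),
...     ('clf', SGDClassifier()),
... ])
>>> parameters = {
...     'vect__use_idf': (True, False),  # toggle idf re-weighting
...     'clf__alpha': (1e-4, 1e-5),      # regularization strength of the linear model
... }
>>> grid_search = GridSearchCV(pipeline, parameters)
>>> # grid_search.fit(documents, labels) would then select the best parameter combination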
Sample pipeline for text feature extraction and evaluation
Applications and examples
The bag of words representation is quite simplistic but surprisingly useful in practice.
In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train
document classifiers, for instance:
Classification of text documents using sparse features
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such
as K-means:
Clustering text documents using k-means
Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering,
for instance by using Non-negative matrix factorization (NMF or NNMF):
Topics extraction with Non-Negative Matrix Factorization
Limitations of the Bag of Words representation
While some local positioning information can be preserved by extracting n-grams instead of individual words, Bag of
Words and Bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried
by that internal structure.
In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs
should thus be taken into account. Many such models will then be cast as Structured output problems which are
currently outside of the scope of scikit-learn.
Customizing the vectorizer classes
It is possible to customize the behavior by passing a callable as a parameter of the vectorizer:
>>> def my_tokenizer(s):
... return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!")
[u'some...', u'punctuation!']
In particular we name:
preprocessor: a callable that takes a string as input and returns another string (removing HTML tags or
converting to lower case for instance)
tokenizer: a callable that takes a string as input and outputs a sequence of feature occurrences (a.k.a. the
tokens).
analyzer: a callable that wraps calls to the preprocessor and tokenizer and further performs some filtering or
n-gram extraction on the tokens.
To make the preprocessor, tokenizer and analyzer aware of the model parameters it is possible to derive from the
class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods
instead, as sketched below.
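For instance, a hypothetical vectorizer that strips digits during preprocessing could be sketched as follows (the subclass name and the digit-stripping behavior are purely illustrative; the output assumes the default token pattern):
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> class DigitStrippingVectorizer(CountVectorizer):
...     def build_preprocessor(self):
...         preprocess = super(DigitStrippingVectorizer, self).build_preprocessor()
...         # run the default preprocessing first, then drop all digits
...         return lambda doc: ''.join(c for c in preprocess(doc) if not c.isdigit())
...
>>> analyze = DigitStrippingVectorizer().build_analyzer()
>>> analyze(u"Version 2 of the document")
[u'version', u'of', u'the', u'document']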
Customizing the vectorizer can also be very useful to handle Asian languages that do not use an explicit word separator
such as whitespace.
Image feature extraction
Patch extraction
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or
three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use
reconstruct_from_patches_2d. For example let us generate a 4x4 pixel picture with 3 color channels (e.g.
in RGB format):
>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0, 3, 6, 9],
[12, 15, 18, 21],
[24, 27, 30, 33],
[36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0, 3],
[12, 15]],
[[15, 18],
[27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
The PatchExtractor class works in the same way as extract_patches_2d, only it supports multiple images
as input. It is implemented as an estimator, so it can be used in pipelines. See:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
Connectivity graph of an image
Several estimators in scikit-learn can use connectivity information between features or samples. For instance Ward
clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous
patches:
For this purpose, the estimators use a connectivity matrix, giving which samples are connected.
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a
connectivity matrix for images given the shape of these images.
These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward
clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
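For instance, a minimal sketch of structured Ward clustering of the pixels of a small synthetic image (the image and the number of clusters are arbitrary):
>>> import numpy as np
>>> from sklearn.feature_extraction.image import grid_to_graph
>>> from sklearn.cluster import Ward
>>> face = np.random.RandomState(0).rand(8, 8)       # a small fake grayscale image
>>> connectivity = grid_to_graph(*face.shape)        # which pixels are neighbors
>>> ward = Ward(n_clusters=4, connectivity=connectivity)
>>> labels = ward.fit(face.reshape(-1, 1)).labels_   # one sample per pixel, one feature each
>>> labels.shape
(64,)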
Note: Examples
A demo of structured Ward hierarchical clustering on Lena image
Spectral clustering for image segmentation
Feature agglomeration vs. univariate selection
1.6.3 Kernel Approximation
This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they
are used for example in support vector machines (see Support Vector Machines). The following feature functions
perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms.
The advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature
maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the
cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an
approximate kernel map it is possible to use much more efficient linear SVMs. In particular, the combination of
kernel map approximations with SGDClassifier can make nonlinear learning on large datasets possible.
Since there has not been much empirical work using approximate embeddings, it is advisable to compare results
against exact kernel methods when possible.
Radial Basis Function Kernel
The RBFSampler constructs an approximate mapping for the radial basis function kernel.
The mapping relies on a Monte Carlo approximation to the kernel values. The fit function performs the Monte Carlo
sampling, whereas the transform method performs the mapping of the data. Because of the inherent randomness
of the process, results may vary between different calls to the fit function.
RBFSampler takes two parameters: n_components, which is the target dimensionality of the feature transform,
and gamma, the parameter of the RBF kernel. A higher n_components will result in a better approximation of the
kernel and will yield results more similar to those produced by a kernel SVM. Note that fitting the feature function
does not actually depend on the data given to the fit function. Only the dimensionality of the data is used. Details
on the method can be found in [RR2007].
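A minimal usage sketch on toy data (the gamma and n_components values are arbitrary):
>>> import numpy as np
>>> from sklearn.kernel_approximation import RBFSampler
>>> from sklearn.linear_model import SGDClassifier
>>> X = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
>>> y = [0, 0, 1, 1]
>>> rbf_feature = RBFSampler(gamma=1.0, n_components=100, random_state=1)
>>> X_features = rbf_feature.fit_transform(X)   # Monte Carlo approximation of the RBF map
>>> X_features.shape
(4, 100)
>>> clf = SGDClassifier().fit(X_features, y)    # a linear model on the approximate feature map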
Figure 1.5: Comparing an exact RBF kernel (left) with the approximation (right)
Examples:
Explicit feature map approximation for RBF kernels
Additive Chi Squared Kernel
The chi squared kernel is a kernel on histograms, often used in computer vision.
The chi squared kernel is given by
k(x, y) = \sum_i \frac{2 x_i y_i}{x_i + y_i}
Since the kernel is additive, it is possible to treat all components x_i separately for embedding. This makes it possible
to sample the Fourier transform in regular intervals, instead of approximating using Monte Carlo sampling.
The class AdditiveChi2Sampler implements this component-wise deterministic sampling. Each component is
sampled n times, yielding 2n+1 dimensions per input dimension (the multiple of two stems from the real and complex
part of the Fourier transform). In the literature, n is usually chosen to be 1 or 2, transforming the dataset to size
n_samples x 5 * n_features (in the case of n=2).
The approximate feature map provided by AdditiveChi2Sampler can be combined with the approximate feature
map provided by RBFSampler to yield an approximate feature map for the exponentiated chi squared kernel. See
[VZ2010] for details and [VVZ2010] for the combination with the RBFSampler.
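A minimal sketch on a small non-negative input (chi squared kernels are meant for histogram-like data; the shapes and the sample_steps value are arbitrary):
>>> import numpy as np
>>> from sklearn.kernel_approximation import AdditiveChi2Sampler
>>> X = np.random.RandomState(0).rand(5, 3)        # non-negative, histogram-like toy data
>>> chi2_sampler = AdditiveChi2Sampler(sample_steps=2)
>>> X_transformed = chi2_sampler.fit_transform(X)  # deterministic, component-wise sampling
>>> X_transformed.shape[0]                         # one row per input sample
5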
Skewed Chi Squared Kernel
The skewed chi squared kernel is given by:
k(x, y) = \prod_i \frac{2 \sqrt{x_i + c} \sqrt{y_i + c}}{x_i + y_i + 2c}
It has properties that are similar to the exponentiated chi squared kernel often used in computer vision, but allows for
a simple Monte Carlo approximation of the feature map.
The usage of the SkewedChi2Sampler is the same as the usage described above for the RBFSampler. The only
difference is in the free parameter, which is called c. For a motivation for this mapping and the mathematical details see
[LS2010].
Mathematical Details
Kernel methods like support vector machines or kernelized PCA rely on a property of reproducing kernel Hilbert
spaces. For any positive definite kernel function k (a so-called Mercer kernel), it is guaranteed that there exists a
mapping \phi into a Hilbert space \mathcal{H}, such that

k(x, y) = \langle \phi(x), \phi(y) \rangle

where \langle \cdot, \cdot \rangle denotes the inner product in the Hilbert space.
If an algorithm, such as a linear support vector machine or PCA, relies only on the scalar product of data points x_i,
one may use the value of k(x_i, x_j), which corresponds to applying the algorithm to the mapped data points \phi(x_i). The
advantage of using k is that the mapping \phi never has to be calculated explicitly, allowing for arbitrarily large features
(even infinite).
One drawback of kernel methods is that it might be necessary to store many kernel values k(x_i, x_j) during
optimization. If a kernelized classifier is applied to new data y_j, k(x_i, y_j) needs to be computed to make predictions,
possibly for many different x_i in the training set.
The classes in this submodule allow to approximate the embedding \phi, thereby working explicitly with the
representations \phi(x_i), which obviates the need to apply the kernel or store training examples.
References:
1.7 Dataset loading utilities
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical
properties of the data (typically the correlation and informativeness of the features), it is also possible to generate
synthetic data.
This package also features helpers to fetch larger datasets commonly used by the machine learning community to
benchmark algorithms on data that comes from the real world.
1.7.1 General dataset API
There are three distinct kinds of dataset interfaces for different types of datasets. The simplest one is the interface for
sample images, which is described below in the Sample images section.
The dataset generation functions and the svmlight loader share a simplistic interface, returning a tuple (X, y) con-
sisting of an n_samples x n_features numpy array X and an array of length n_samples containing the targets y.
The toy datasets as well as the real world datasets and the datasets fetched from mldata.org have a more
sophisticated structure. These functions return a bunch (which is a dictionary that is accessible with the dict.key syntax).
All datasets have at least two keys, data, containing an array of shape n_samples x n_features (except for
20newsgroups) and target, a numpy array of length n_samples, containing the targets.
The datasets also contain a description in DESCR and some contain feature_names and target_names. See
the dataset descriptions below for details.
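To illustrate the two kinds of interfaces, a small sketch (the generator parameters are arbitrary):
>>> from sklearn.datasets import load_iris, make_regression
>>> iris = load_iris()                  # a "bunch" with dict-like attribute access
>>> iris.data.shape
(150, 4)
>>> iris.target.shape
(150,)
>>> X, y = make_regression(n_samples=10, n_features=3)   # generators return a plain (X, y) tuple
>>> X.shape
(10, 3)
>>> y.shape
(10,)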
1.7.2 Toy datasets
scikit-learn comes with a few small standard datasets that do not require downloading any file from an external
website.
load_boston() Load and return the boston house-prices dataset (regression).
load_iris() Load and return the iris dataset (classification).
load_diabetes() Load and return the diabetes dataset (regression).
load_digits([n_class]) Load and return the digits dataset (classification).
load_linnerud() Load and return the linnerud dataset (multivariate regression).
These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They
are however often too small to be representative of real world machine learning tasks.
1.7.3 Sample images
scikit-learn also embeds a couple of sample JPEG images published under Creative Commons license by their
authors. Those images can be useful to test algorithms and pipelines on 2D data.
load_sample_images() Load sample images for image manipulation.
load_sample_image(image_name) Load the numpy array of a single sample image
Warning: The default coding of images is based on the uint8 dtype to spare memory. Often machine learning
algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use
pylab.imshow don't forget to scale to the range 0 - 1 as done in the following example.
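For instance, a minimal sketch of such a conversion (using one of the bundled sample images):
>>> import numpy as np
>>> from sklearn.datasets import load_sample_image
>>> china = load_sample_image("china.jpg")
>>> china.dtype
dtype('uint8')
>>> china_float = np.asarray(china, dtype=np.float64) / 255   # scale to the 0 - 1 range
>>> china_float.max() <= 1.0
True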
Examples:
Color Quantization using K-Means
1.7.4 Sample generators
In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of
controlled size and complexity.
make_classification([n_samples, n_features, ...]) Generate a random n-class classification problem.
make_multilabel_classification([n_samples, ...]) Generate a random multilabel classification problem.
make_regression([n_samples, n_features, ...]) Generate a random regression problem.
make_blobs([n_samples, n_features, centers, ...]) Generate isotropic Gaussian blobs for clustering.
make_friedman1([n_samples, n_features, ...]) Generate the Friedman #1 regression problem
make_friedman2([n_samples, noise, random_state]) Generate the Friedman #2 regression problem
make_friedman3([n_samples, noise, random_state]) Generate the Friedman #3 regression problem
make_hastie_10_2([n_samples, random_state]) Generates data for binary classification used in
make_low_rank_matrix([n_samples, ...]) Generate a mostly low rank matrix with bell-shaped singular values
make_sparse_coded_signal(n_samples, ...[, ...]) Generate a signal as a sparse combination of dictionary elements.
make_sparse_uncorrelated([n_samples, ...]) Generate a random regression problem with sparse uncorrelated design
make_spd_matrix(n_dim[, random_state]) Generate a random symmetric, positive-definite matrix.
make_swiss_roll([n_samples, noise, random_state]) Generate a swiss roll dataset.
make_s_curve([n_samples, noise, random_state]) Generate an S curve dataset.
make_sparse_spd_matrix([dim, alpha, ...]) Generate a sparse symmetric definite positive matrix.
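For example, a minimal sketch with make_classification (the parameter values are arbitrary):
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=20,
...                            n_informative=2, n_redundant=2,
...                            random_state=0)
>>> X.shape
(100, 20)
>>> np.unique(y)
array([0, 1])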
1.7.5 Datasets in svmlight / libsvm format
scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each
line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value>
.... This format is especially suitable for sparse datasets. In this module, scipy sparse CSR matrices are used for X
and numpy arrays are used for y.
You may load a dataset as follows:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...
You may also load two (or more) datasets at once:
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
... ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
...
In this case, X_train and X_test are guaranteed to have the same number of features. Another way to achieve the
same result is to fix the number of features:
>>> X_test, y_test = load_svmlight_file(
... "/path/to/test_dataset.txt", n_features=X_train.shape[1])
...
Related links:
Public datasets in svmlight / libsvm format: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Faster API-compatible implementation: https://github.com/mblondel/svmlight-loader
1.7.6 The Olivetti faces dataset
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.
The website describing the original dataset is now defunct, but archived copies can be accessed through the Internet
Archive's Wayback Machine.
As described on the original website:
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken
at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and
facial details (glasses / no glasses). All the images were taken against a dark homogeneous background
with the subjects in an upright, frontal position (with tolerance for some side movement).
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating
point values on the interval [0, 1], which are easier to work with for many algorithms.
The target for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with
only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised
perspective.
The original dataset consisted of 92 x 112 images, while the version available here consists of 64 x 64 images.
When using these images, please give credit to AT&T Laboratories Cambridge.
1.7.7 The 20 newsgroups text dataset
The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split into two subsets: one for
training (or development) and the other one for testing (or for performance evaluation). The split between the train
and test set is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups,
returns a list of the raw text files that can be fed to text feature extractors such as
sklearn.feature_extraction.text.Vectorizer with custom parameters so as to extract feature
vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use
features, i.e., it is not necessary to use a feature extractor.
Usage
The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching function that
downloads the data archive from the original 20 newsgroups website, extracts the archive contents in the
~/scikit_learn_data/20news_home folder and calls the sklearn.datasets.load_files function on either
the training or testing set folder, or both of them:
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups_train = fetch_20newsgroups(subset='train')
>>> from pprint import pprint
>>> pprint(list(newsgroups_train.target_names))
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
The real data lies in the filenames and target attributes. The target attribute is the integer index of the category:
>>> newsgroups_train.filenames.shape
(11314,)
>>> newsgroups_train.target.shape
(11314,)
>>> newsgroups_train.target[:10]
array([12, 6, 9, 8, 6, 7, 9, 2, 13, 19])
It is possible to load only a sub-selection of the categories by passing the list of the categories to load to the
fetch_20newsgroups function:
>>> cats = ['alt.atheism', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
>>> list(newsgroups_train.target_names)
['alt.atheism', 'sci.space']
>>> newsgroups_train.filenames.shape
(1073,)
>>> newsgroups_train.target.shape
(1073,)
>>> newsgroups_train.target[:10]
array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
In order to feed predictive or clustering models with the text data, one first needs to turn the text into vectors
of numerical values suitable for statistical analysis. This can be achieved with the utilities of the
sklearn.feature_extraction.text module as demonstrated in the following example that extracts TF-IDF vectors
of unigram tokens:
>>> from sklearn.feature_extraction.text import Vectorizer
>>> documents = [open(f).read() for f in newsgroups_train.filenames]
>>> vectorizer = Vectorizer()
>>> vectors = vectorizer.fit_transform(documents)
>>> vectors.shape
(1073, 21108)
The extracted TF-IDF vectors are very sparse with an average of 118 non-zero components per sample in a more than
20000-dimensional space (less than 1% non-zero features):
>>> vectors.nnz / vectors.shape[0]
118
sklearn.datasets.fetch_20newsgroups_vectorized is a function which returns ready-to-use tf-idf
features instead of file names.
Examples
Sample pipeline for text feature extraction and evaluation
Classification of text documents using sparse features
1.7.8 Downloading datasets from the mldata.org repository
mldata.org is a public repository for machine learning data, supported by the PASCAL network.
The sklearn.datasets package is able to directly download data sets from the repository using the function
fetch_mldata(dataname).
For example, to download the MNIST digit recognition database:
>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original', data_home=custom_data_home)
The MNIST database contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to
9:
>>> mnist.data.shape
(70000, 784)
>>> mnist.target.shape
(70000,)
>>> np.unique(mnist.target)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
After the first download, the dataset is cached locally in the path specified by the data_home keyword argument,
which defaults to ~/scikit_learn_data/:
>>> os.listdir(os.path.join(custom_data_home, 'mldata'))
['mnist-original.mat']
Data sets in mldata.org do not adhere to a strict naming or formatting convention. fetch_mldata is able to make
sense of the most common cases, but allows tailoring the defaults to individual datasets:
The data arrays in mldata.org are most often shaped as (n_features, n_samples). This is the opposite
of the scikit-learn convention, so fetch_mldata transposes the matrix by default. The
transpose_data keyword controls this behavior:
>>> iris = fetch_mldata('iris', data_home=custom_data_home)
>>> iris.data.shape
(150, 4)
>>> iris = fetch_mldata('iris', transpose_data=False,
... data_home=custom_data_home)
>>> iris.data.shape
(4, 150)
For datasets with multiple columns, fetch_mldata tries to identify the target and data columns and rename
them to target and data. This is done by looking for arrays named label and data in the dataset, and
failing that by choosing the first array to be target and the second to be data. This behavior can be changed
with the target_name and data_name keywords, setting them to a specic name or index number (the
name and order of the columns in the datasets can be found on its mldata.org page under the tab "Data"):
>>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1, data_name=0,
...                      data_home=custom_data_home)
>>> iris3 = fetch_mldata('datasets-UCI iris', target_name='class',
...                      data_name='double0', data_home=custom_data_home)
1.7.9 The Labeled Faces in the Wild face recognition dataset
This dataset is a collection of JPEG pictures of famous people collected over the internet; all details are available on
the official website:
http://vis-www.cs.umass.edu/lfw/
Each picture is centered on a single face. The typical task is called Face Verification: given a pair of two pictures, a
binary classier must predict whether the two images are from the same person.
An alternative task, Face Recognition or Face Identification is: given the picture of the face of an unknown person,
identify the name of the person by referring to a gallery of previously seen pictures of identified persons.
Both Face Verification and Face Recognition are tasks that are typically performed on the output of a model trained to
perform Face Detection. The most popular model for Face Detection is called Viola-Jones and is implemented in the
OpenCV library. The LFW faces were extracted by this face detector from various online websites.
Usage
scikit-learn provides two loaders that will automatically download, cache, parse the metadata files, decode
the jpeg and convert the interesting slices into memmapped numpy arrays. This dataset is more than 200 MB in size.
The first load typically takes more than a couple of minutes to fully decode the relevant part of the JPEG files into
numpy arrays. Once the dataset has been loaded, subsequent loads take less than 200ms thanks to a
memmapped version memoized on the disk in the ~/scikit_learn_data/lfw_home/ folder using joblib.
The first loader is used for the Face Identification task: a multi-class classification task (hence supervised learning):
>>> from sklearn.datasets import fetch_lfw_people
>>> lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
>>> for name in lfw_people.target_names:
... print name
...
Ariel Sharon
Colin Powell
Donald Rumsfeld
George W Bush
Gerhard Schroeder
Hugo Chavez
Tony Blair
The default slice is a rectangular shape around the face, removing most of the background:
>>> lfw_people.data.dtype
dtype('float32')
>>> lfw_people.data.shape
(1288, 1850)
>>> lfw_people.images.shape
(1288, 50, 37)
Each of the 1288 faces is assigned to a single person id in the target array:
>>> lfw_people.target.shape
(1288,)
>>> list(lfw_people.target[:10])
[5, 6, 3, 1, 0, 1, 3, 4, 3, 0]
The second loader is typically used for the face verification task: each sample is a pair of two pictures belonging (or not)
to the same person:
>>> from sklearn.datasets import fetch_lfw_pairs
>>> lfw_pairs_train = fetch_lfw_pairs(subset='train')
>>> list(lfw_pairs_train.target_names)
['Different persons', 'Same person']
>>> lfw_pairs_train.pairs.shape
(2200, 2, 62, 47)
>>> lfw_pairs_train.data.shape
(2200, 5828)
>>> lfw_pairs_train.target.shape
(2200,)
Both for the fetch_lfw_people and fetch_lfw_pairs functions it is possible to get an additional dimension
with the RGB color channels by passing color=True; in that case the shape will be (2200, 2, 62, 47, 3).
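For instance, a sketch for the people loader used above (the trailing dimension holds the RGB channels; the other shapes follow the call shown earlier):
>>> lfw_people_color = fetch_lfw_people(min_faces_per_person=70, resize=0.4,
...                                     color=True)
>>> lfw_people_color.images.shape
(1288, 50, 37, 3)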
The fetch_lfw_pairs dataset is subdivided into 3 subsets: the development train set, the development test set
and an evaluation 10_folds set meant to compute performance metrics using a 10-folds cross validation scheme.
References:
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. University of Massachusetts,
Amherst, Technical Report 07-49, October, 2007.
Examples
Faces recognition example using eigenfaces and SVMs
1.8 Reference
This is the class and function reference of scikit-learn. Please refer to the full user guide for further details, as the class
and function raw specifications may not be enough to give full guidelines on their uses.
List of modules
sklearn.cluster: Clustering
Classes
Functions
sklearn.covariance: Covariance Estimators
sklearn.cross_validation: Cross Validation
sklearn.datasets: Datasets
Loaders
Samples generator
sklearn.decomposition: Matrix Decomposition
sklearn.ensemble: Ensemble Methods
sklearn.feature_extraction: Feature Extraction
From images
From text
sklearn.feature_selection: Feature Selection
sklearn.gaussian_process: Gaussian Processes
sklearn.grid_search: Grid Search
sklearn.hmm: Hidden Markov Models
sklearn.kernel_approximation Kernel Approximation
sklearn.semi_supervised Semi-Supervised Learning
sklearn.lda: Linear Discriminant Analysis
sklearn.linear_model: Generalized Linear Models
For dense data
For sparse data
sklearn.manifold: Manifold Learning
sklearn.metrics: Metrics
Classification metrics
Regression metrics
Clustering metrics
Pairwise metrics
sklearn.mixture: Gaussian Mixture Models
sklearn.multiclass: Multiclass and multilabel classification
Multiclass and multilabel classification strategies
sklearn.naive_bayes: Naive Bayes
sklearn.neighbors: Nearest Neighbors
sklearn.pls: Partial Least Squares
sklearn.pipeline: Pipeline
sklearn.preprocessing: Preprocessing and Normalization
sklearn.qda: Quadratic Discriminant Analysis
sklearn.svm: Support Vector Machines
Estimators
Low-level methods
sklearn.tree: Decision Trees
sklearn.utils: Utilities
1.8.1 sklearn.cluster: Clustering
The sklearn.cluster module gathers popular unsupervised clustering algorithms.
User guide: See the Clustering section for further details.
Classes
cluster.AffinityPropagation([damping, ...]) Perform Affinity Propagation Clustering of data
cluster.DBSCAN([eps, min_samples, metric, ...]) Perform DBSCAN clustering from vector array or distance matrix.
cluster.KMeans([n_clusters, init, n_init, ...]) K-Means clustering
cluster.MiniBatchKMeans([n_clusters, init, ...]) Mini-Batch K-Means clustering
cluster.MeanShift([bandwidth, seeds, ...]) MeanShift clustering
cluster.SpectralClustering([n_clusters, ...]) Apply k-means to a projection to the normalized laplacian
cluster.Ward([n_clusters, memory, ...]) Ward hierarchical clustering: constructs a tree and cuts it.
sklearn.cluster.AffinityPropagation
class sklearn.cluster.AffinityPropagation(damping=0.5, max_iter=200, convit=30,
copy=True)
Perform Affinity Propagation Clustering of data
Parameters damping : float, optional
Damping factor
max_iter : int, optional
Maximum number of iterations
convit : int, optional
Number of iterations with no change in the number of estimated clusters that stops the
convergence.
copy: boolean, optional :
Make a copy of input data. True by default.
Notes
See examples/plot_affinity_propagation.py for an example.
The algorithmic complexity of affinity propagation is quadratic in the number of points.
References
Brendan J. Frey and Delbert Dueck, Clustering by Passing Messages Between Data Points, Science Feb. 2007
Attributes
cluster_centers_indices_ array, [n_clusters] Indices of cluster centers
labels_ array, [n_samples] Labels of each point
Methods
fit(S[, p]) Compute affinity propagation clustering.
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(damping=0.5, max_iter=200, convit=30, copy=True)
fit(S, p=None)
Compute affinity propagation clustering.
Parameters S: array [n_points, n_points] :
Matrix of similarities between points
p: array [n_points,] or float, optional :
Preferences for each point - points with larger values of preferences are more likely to
be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the
input preferences value. If the preferences are not passed as arguments, they will be set
to the median of the input similarities.
damping : float, optional
Damping factor
copy: boolean, optional :
If copy is False, the affinity matrix is modified inplace by the algorithm, for memory
efficiency
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.cluster.DBSCAN
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', random_state=None)
Perform DBSCAN clustering from vector array or distance matrix.
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density
and expands clusters from them. Good for data which contains clusters of similar density.
Parameters eps : float, optional
The maximum distance between two samples for them to be considered as in the same
neighborhood.
min_samples : int, optional
The number of samples in a neighborhood for a point to be considered as a core point.
metric : string, or callable
The metric to use when calculating distance between instances in a feature array.
If metric is a string or callable, it must be one of the options allowed by
metrics.pairwise.calculate_distance for its metric parameter. If metric is 'precomputed',
X is assumed to be a distance matrix and must be square.
random_state : numpy.RandomState, optional
The generator used to initialize the centers. Defaults to numpy.random.
Notes
See examples/plot_dbscan.py for an example.
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large
Spatial Databases with Noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
Attributes
core_sample_indices_ array, shape =
[n_core_samples]
Indices of core samples.
components_ array, shape =
[n_core_samples,
n_features]
Copy of each core sample found by training.
labels_ array, shape = [n_samples] Cluster labels for each point in the dataset given to fit().
Noisy samples are given the label -1.
Methods
fit(X, **params) Perform DBSCAN clustering from vector array or distance matrix.
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(eps=0.5, min_samples=5, metric='euclidean', random_state=None)
fit(X, **params)
Perform DBSCAN clustering from vector array or distance matrix.
Parameters X: array [n_samples, n_samples] or [n_samples, n_features] :
Array of distances between samples, or a feature array. The array is treated as a feature
array unless the metric is given as 'precomputed'.
params: dict :
Overwrite keywords from __init__.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.cluster.KMeans
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300,
tol=0.0001, precompute_distances=True, verbose=0,
random_state=None, copy_x=True, n_jobs=1, k=None)
K-Means clustering
Parameters n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
max_iter : int
Maximum number of iterations of the k-means algorithm for a single run.
n_init: int, optional, default: 10 :
Number of times the k-means algorithm will be run with different centroid seeds. The
final results will be the best output of n_init consecutive runs in terms of inertia.
init : {'k-means++', 'random' or an ndarray}
Method for initialization, defaults to 'k-means++':
'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to
speed up convergence. See section Notes in k_init for more details.
'random': choose k observations (rows) at random from data for the initial centroids.
If init is a 2d array, it is used as a seed for the centroids.
precompute_distances : boolean
Precompute distances (faster but takes more memory).
tol: float, optional default: 1e-4 :
Relative tolerance w.r.t. inertia to declare convergence
n_jobs: int :
The number of jobs to use for the computation. This works by breaking down the
pairwise matrix into n_jobs even slices and computing them in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which
is useful for debugging. For n_jobs below -1, (n_cpus + 1 - n_jobs) are used. Thus for
n_jobs = -2, all CPUs but one are used.
random_state: integer or numpy.RandomState, optional :
The generator used to initialize the centers. If an integer is given, it fixes the seed.
Defaults to the global numpy random number generator.
See Also:
MiniBatchKMeans: Alternative online implementation that does incremental updates of the centers positions
using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much
faster than the default batch implementation.
Notes
The k-means problem is solved using Lloyd's algorithm.
The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.
The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S.
Vassilvitskii, "How slow is the k-means method?" SoCG2006)
In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in
local minima. That's why it can be useful to restart it several times.
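A minimal usage sketch on toy data (the data and the number of clusters are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_.shape
(6,)
>>> kmeans.cluster_centers_.shape
(2, 2)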
Attributes
cluster_centers_: array, [n_clusters,
n_features]
Coordinates of cluster centers
labels_: Labels of each point
inertia_: float The value of the inertia criterion associated with the chosen
partition.
Methods
fit(X[, y]) Compute k-means
fit_predict(X) Compute cluster centers and predict cluster index for each sample.
get_params([deep]) Get parameters for the estimator
predict(X) Predict the closest cluster each sample in X belongs to.
score(X) Opposite of the value of X on the K-means objective.
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Transform the data to a cluster-distance space
__init__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
precompute_distances=True, verbose=0, random_state=None, copy_x=True, n_jobs=1, k=None)
fit(X, y=None)
Compute k-means
fit_predict(X)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by
predict is the index of the closest code in the code book.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data to predict.
Returns Y : array, shape [n_samples,]
Index of the closest center each sample belongs to.
score(X)
Opposite of the value of X on the K-means objective.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data.
Returns score: float :
Opposite of the value of X on the K-means objective.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Transform the data to a cluster-distance space
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the
array returned by transform will typically be dense.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data to transform.
Returns X_new : array, shape [n_samples, k]
X transformed in the new space.
sklearn.cluster.MiniBatchKMeans
class sklearn.cluster.MiniBatchKMeans(n_clusters=8, init='k-means++', max_iter=100,
batch_size=100, verbose=0, compute_labels=True,
random_state=None, tol=0.0, max_no_improvement=10,
init_size=None, n_init=3, chunk_size=None, k=None)
Mini-Batch K-Means clustering
Parameters n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
max_iter : int, optional
Maximum number of iterations over the complete dataset before stopping independently
of any early stopping criterion heuristics.
max_no_improvement : int, optional
Control early stopping based on the consecutive number of mini batches that does not
yield an improvement on the smoothed inertia.
To disable convergence detection based on inertia, set max_no_improvement to None.
tol : float, optional
Control early stopping based on the relative center changes as measured by a smoothed,
variance-normalized estimate of the mean center squared position changes. This early stopping
heuristic is closer to the one used for the batch variant of the algorithm but induces a
slight computational and memory overhead over the inertia heuristic.
To disable convergence detection based on normalized center change, set tol to 0.0
(default).
batch_size: int, optional, default: 100 :
Size of the mini batches.
init_size: int, optional, default: 3 * batch_size :
Number of samples to randomly sample for speeding up the initialization (sometimes at
the expense of accuracy): the algorithm is initialized by running a batch KMeans
on a random subset of the data. This needs to be larger than k.
init : {'k-means++', 'random' or an ndarray}
Method for initialization, defaults to 'k-means++':
'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to
speed up convergence. See section Notes in k_init for more details.
'random': choose k observations (rows) at random from data for the initial centroids.
If init is a 2d array, it is used as a seed for the centroids.
compute_labels: boolean :
Compute label assignments and inertia for the complete dataset once the minibatch
optimization has converged in fit.
random_state: integer or numpy.RandomState, optional :
The generator used to initialize the centers. If an integer is given, it fixes the seed.
Defaults to the global numpy random number generator.
Notes
See http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Attributes
cluster_centers_:
array, [n_clusters,
n_features]
Coordinates of cluster centers
labels_: Labels of each point (if compute_labels is set to True).
inertia_: float The value of the inertia criterion associated with the chosen partition (if
compute_labels is set to True). The inertia is defined as the sum of square
distances of samples to their nearest neighbor.
Methods
fit(X[, y]) Compute the centroids on X by chunking it into mini-batches.
fit_predict(X) Compute cluster centers and predict cluster index for each sample.
get_params([deep]) Get parameters for the estimator
partial_fit(X[, y]) Update k means estimate on a single mini-batch X.
predict(X) Predict the closest cluster each sample in X belongs to.
score(X) Opposite of the value of X on the K-means objective.
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Transform the data to a cluster-distance space
__init__(n_clusters=8, init='k-means++', max_iter=100, batch_size=100, verbose=0,
compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10,
init_size=None, n_init=3, chunk_size=None, k=None)
fit(X, y=None)
Compute the centroids on X by chunking it into mini-batches.
Parameters X: array-like, shape = [n_samples, n_features] :
Coordinates of the data points to cluster
fit_predict(X)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y=None)
Update k means estimate on a single mini-batch X.
Parameters X: array-like, shape = [n_samples, n_features] :
Coordinates of the data points to cluster.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by
predict is the index of the closest code in the code book.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data to predict.
Returns Y : array, shape [n_samples,]
Index of the closest center each sample belongs to.
score(X)
Opposite of the value of X on the K-means objective.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data.
Returns score: float :
Opposite of the value of X on the K-means objective.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Transform the data to a cluster-distance space
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the
array returned by transform will typically be dense.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
New data to transform.
Returns X_new : array, shape [n_samples, k]
X transformed in the new space.
sklearn.cluster.MeanShift
class sklearn.cluster.MeanShift(bandwidth=None, seeds=None, bin_seeding=False, cluster_all=True)
MeanShift clustering
Parameters bandwidth: float, optional :
Bandwidth used in the RBF kernel. If not set, the bandwidth is estimated. See
clustering.estimate_bandwidth
seeds: array [n_samples, n_features], optional :
Seeds used to initialize kernels. If not set, the seeds are calculated by
clustering.get_bin_seeds with bandwidth as the grid size and default values for other
parameters.
cluster_all: boolean, default True :
If true, then all points are clustered, even those orphans that are not within any kernel.
Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label
-1.
Notes
Scalability:
Because this implementation uses a flat kernel and a Ball Tree to look up members of each kernel, the complexity
will tend towards O(T*n*log(n)) in lower dimensions, with n the number of samples and T the number of points. In
higher dimensions the complexity will tend towards O(T*n^2).
Scalability can be boosted by using fewer seeds, for example by using a higher value of min_bin_freq in the
get_bin_seeds function.
Note that the estimate_bandwidth function is much less scalable than the mean shift algorithm and will be the
bottleneck if it is used.
References
Dorin Comaniciu and Peter Meer, 'Mean Shift: A robust approach toward feature space analysis'. IEEE
Transactions on Pattern Analysis and Machine Intelligence. 2002. pp. 603-619.
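A minimal usage sketch on toy data (the quantile value is arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import MeanShift, estimate_bandwidth
>>> X = np.random.RandomState(0).randn(100, 2)
>>> bandwidth = estimate_bandwidth(X, quantile=0.3)
>>> ms = MeanShift(bandwidth=bandwidth).fit(X)
>>> ms.cluster_centers_.shape[1]    # centers live in the same 2-dimensional space as X
2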
Attributes
cluster_centers_ array, [n_clusters, n_features] Coordinates of cluster centers
labels_ : Labels of each point
Methods
fit(X) Compute MeanShift
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(bandwidth=None, seeds=None, bin_seeding=False, cluster_all=True)
fit(X)
Compute MeanShift
Parameters X : array [n_samples, n_features]
Input points
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.cluster.SpectralClustering
class sklearn.cluster.SpectralClustering(n_clusters=8, mode=None, random_state=None,
n_init=10, k=None)
Apply k-means to a projection to the normalized laplacian
In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex
or more generally when a measure of the center and spread of the cluster is not a suitable description of the
complete cluster. For instance when clusters are nested circles in the 2D plane.
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
Parameters n_clusters : integer, optional
The dimension of the projection subspace.
mode : {None, 'arpack' or 'amg'}
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It
can be faster on very large, sparse problems, but may also lead to instabilities
random_state : int seed, RandomState instance, or None (default)
A pseudo random number generator used for the initialization of the lobpcg eigen vectors
decomposition when mode == 'amg' and by the K-Means initialization.
n_init : int, optional, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The
final results will be the best output of n_init consecutive runs in terms of inertia.
References
Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
Attributes
labels_ : Labels of each point
Methods
fit(X) Compute the spectral clustering from the affinity matrix
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(n_clusters=8, mode=None, random_state=None, n_init=10, k=None)
fit(X)
Compute the spectral clustering from the affinity matrix
Parameters X: array-like or sparse matrix, shape: (n_samples, n_samples) :
An affinity matrix describing the pairwise similarity of the data. It can also be an adjacency
matrix of the graph to embed. X must be symmetric and its entries must be
positive or zero. Zero means that elements have nothing in common, whereas high
values mean that elements are strongly similar.
Notes
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high
values mean very dissimilar elements, it can be transformed into a similarity matrix that is well suited for
the algorithm by applying the gaussian (heat) kernel:
np.exp(- X ** 2 / (2. * delta ** 2))
Another alternative is to take a symmetric version of the k nearest neighbors connectivity matrix of the
points.
If the pyamg package is installed, it is used: this greatly speeds up computation.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.cluster.Ward
class sklearn.cluster.Ward(n_clusters=2, memory=Memory(cachedir=None), connectivity=None,
copy=True, n_components=None)
Ward hierarchical clustering: constructs a tree and cuts it.
Parameters n_clusters : int or ndarray
The number of clusters to nd.
connectivity : sparse matrix.
Connectivity matrix. Defines for each sample the neighboring samples following a
given structure of the data. Default is None, i.e., the hierarchical clustering algorithm is
unstructured.
memory : Instance of joblib.Memory or string
Used to cache the output of the computation of the tree. By default, no caching is done.
If a string is given, it is the path to the caching directory.
copy : bool
Copy the connectivity matrix or work inplace.
n_components : int (optional)
The number of connected components in the graph defined by the connectivity matrix.
If not set, it is estimated.
Attributes
children_ array-like, shape = [n_nodes, 2] List of the children of each node. Leaves of the tree do not appear.
labels_ array [n_points] cluster labels for each point
n_leaves_ int Number of leaves in the hierarchical tree.
Methods
fit(X) Fit the hierarchical clustering on the data
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(n_clusters=2, memory=Memory(cachedir=None), connectivity=None, copy=True,
n_components=None)
fit(X)
Fit the hierarchical clustering on the data
Parameters X : array-like, shape = [n_samples, n_features]
The samples a.k.a. observations.
Returns self :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
Functions
cluster.estimate_bandwidth(X[, quantile, ...]) Estimate the bandwidth to use with MeanShift algorithm
cluster.k_means(X, n_clusters[, init, ...]) K-means clustering algorithm.
cluster.ward_tree(X[, connectivity, ...]) Ward clustering based on a Feature matrix.
cluster.affinity_propagation(S[, p, convit, ...]) Perform Affinity Propagation Clustering of data
cluster.dbscan(X[, eps, min_samples, ...]) Perform DBSCAN clustering from vector array or distance matrix.
cluster.mean_shift(X[, bandwidth, seeds, ...]) Perform MeanShift Clustering of data using a at kernel
cluster.spectral_clustering(afnity[, ...]) Apply k-means to a projection to the normalized laplacian
sklearn.cluster.estimate_bandwidth
sklearn.cluster.estimate_bandwidth(X, quantile=0.3, n_samples=None, random_state=0)
Estimate the bandwidth to use with the MeanShift algorithm
Parameters X: array [n_samples, n_features] :
Input points
quantile: float, default 0.3 :
should be in [0, 1]; 0.5 means that the median of all pairwise distances is used
n_samples: int :
The number of samples to use. If None, all samples are used.
random_state: int or RandomState :
Pseudo-random number generator state used for random sampling.
Returns bandwidth: float :
The bandwidth parameter
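A minimal sketch (random placeholder data; the quantile value is arbitrary) showing the estimated bandwidth being passed on to MeanShift, assuming its standard fit/labels_ interface:
>>> import numpy as np
>>> from sklearn.cluster import estimate_bandwidth, MeanShift
>>> X = np.random.RandomState(0).randn(200, 2)       # placeholder points
>>> bandwidth = estimate_bandwidth(X, quantile=0.3)  # pairwise-distance quantile heuristic
>>> labels = MeanShift(bandwidth=bandwidth).fit(X).labels_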
sklearn.cluster.k_means
sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances=True, n_init=10,
max_iter=300, verbose=False, tol=0.0001, random_state=None,
copy_x=True, n_jobs=1, k=None)
K-means clustering algorithm.
Parameters X: array-like of floats, shape (n_samples, n_features) :
The observations to cluster.
n_clusters: int :
The number of clusters to form as well as the number of centroids to generate.
max_iter: int, optional, default 300 :
Maximum number of iterations of the k-means algorithm to run.
n_init: int, optional, default: 10 :
Number of times the k-means algorithm will be run with different centroid seeds. The
final results will be the best output of n_init consecutive runs in terms of inertia.
init: {'k-means++', 'random', an ndarray, or a callable}, optional :
Method for initialization, defaults to 'k-means++':
'k-means++' : selects initial cluster centers for k-means clustering in a smart way to
speed up convergence. See section Notes in k_init for more details.
'random': generate k centroids from a Gaussian with mean and variance estimated from
the data.
If an ndarray is passed, it should be of shape (k, p) and gives the initial centers.
If a callable is passed, it should take arguments X, k and a random state and return
an initialization.
tol: float, optional :
The relative increment in the results before declaring convergence.
verbose: boolean, optional :
Verbosity mode
random_state: integer or numpy.RandomState, optional :
The generator used to initialize the centers. If an integer is given, it fixes the seed.
Defaults to the global numpy random number generator.
copy_x: boolean, optional :
When pre-computing distances it is more numerically accurate to center the data first.
If copy_x is True, then the original data is not modified. If False, the original data is
modified, and put back before the function returns, but small numerical differences may
be introduced by subtracting and then adding the data mean.
n_jobs: int :
The number of jobs to use for the computation. This works by breaking down the
pairwise matrix into n_jobs even slices and computing them in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which
is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for
n_jobs = -2, all CPUs but one are used.
Returns centroid: float ndarray with shape (k, n_features) :
Centroids found at the last iteration of k-means.
label: integer ndarray with shape (n_samples,) :
label[i] is the code or index of the centroid the ith observation is closest to.
inertia: float :
The final value of the inertia criterion (sum of squared distances to the closest centroid
for all observations in the training set).
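A short sketch of this functional interface on random placeholder data, unpacking the three documented return values:
>>> import numpy as np
>>> from sklearn.cluster import k_means
>>> X = np.random.RandomState(0).randn(100, 2)  # placeholder observations
>>> centroids, labels, inertia = k_means(X, n_clusters=3, n_init=10)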
sklearn.cluster.ward_tree
sklearn.cluster.ward_tree(X, connectivity=None, n_components=None, copy=True)
Ward clustering based on a Feature matrix.
The inertia matrix uses a Heapq-based representation.
This is the structured version, that takes into account some topological structure between samples.
Parameters X : array of shape (n_samples, n_features)
feature matrix representing n_samples samples to be clustered
connectivity : sparse matrix.
connectivity matrix. Defines for each sample the neighboring samples following a given
structure of the data. The matrix is assumed to be symmetric and only the upper triangular
half is used. Default is None, i.e., the Ward algorithm is unstructured.
n_components : int (optional)
Number of connected components. If None the number of connected components is
estimated from the connectivity matrix.
copy : bool (optional)
Make a copy of connectivity or work inplace. If connectivity is not of LIL type there
will be a copy in any case.
Returns children : list of pairs. Length of n_nodes
List of the children of each node. Leaves of the tree have an empty list of children.
n_components : int
The number of connected components in the graph.
n_leaves : int
The number of leaves in the tree
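A minimal sketch on random placeholder data, unpacking the returned tree description:
>>> import numpy as np
>>> from sklearn.cluster import ward_tree
>>> X = np.random.RandomState(0).randn(20, 3)        # placeholder feature matrix
>>> children, n_components, n_leaves = ward_tree(X)  # unstructured Ward tree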
sklearn.cluster.affinity_propagation
sklearn.cluster.affinity_propagation(S, p=None, convit=30, max_iter=200, damping=0.5,
copy=True, verbose=False)
Perform Affinity Propagation Clustering of data
Parameters S: array [n_points, n_points] :
Matrix of similarities between points
p: array [n_points,] or float, optional :
Preferences for each point - points with larger values of preferences are more likely to
be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the
input preferences value. If the preferences are not passed as arguments, they will be set
to the median of the input similarities (resulting in a moderate number of clusters). For
a smaller number of clusters, this can be set to the minimum value of the similarities.
damping : float, optional
Damping factor
copy: boolean, optional :
If copy is False, the affinity matrix is modified in place by the algorithm, for memory
efficiency
verbose: boolean, optional :
The verbosity level
Returns cluster_centers_indices: array [n_clusters] :
Indices of cluster centers
labels : array [n_points]
cluster labels for each point
Notes
See examples/plot_affinity_propagation.py for an example.
References
Brendan J. Frey and Delbert Dueck, Clustering by Passing Messages Between Data Points, Science Feb. 2007
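A hedged sketch with placeholder data; using negative squared Euclidean distances as similarities is one common choice, not a requirement:
>>> import numpy as np
>>> from sklearn.cluster import affinity_propagation
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = np.random.RandomState(0).randn(30, 2)  # placeholder points
>>> S = -euclidean_distances(X, squared=True)  # similarities: larger means more alike
>>> cluster_centers_indices, labels = affinity_propagation(S)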
sklearn.cluster.dbscan
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='euclidean', random_state=None)
Perform DBSCAN clustering from vector array or distance matrix.
Parameters X: array [n_samples, n_samples] or [n_samples, n_features] :
Array of distances between samples, or a feature array. The array is treated as a feature
array unless the metric is given as 'precomputed'.
eps: float, optional :
The maximum distance between two samples for them to be considered as in the same
neighborhood.
min_samples: int, optional :
The number of samples in a neighborhood for a point to be considered as a core point.
metric: string, or callable :
The metric to use when calculating distance between instances in a feature array.
If metric is a string or callable, it must be one of the options allowed by
metrics.pairwise.calculate_distance for its metric parameter. If metric is 'precomputed',
X is assumed to be a distance matrix and must be square.
random_state: numpy.RandomState, optional :
The generator used to initialize the centers. Defaults to numpy.random.
Returns core_samples: array [n_core_samples] :
Indices of core samples.
labels : array [n_samples]
Cluster labels for each point. Noisy samples are given the label -1.
Notes
See examples/plot_dbscan.py for an example.
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large
Spatial Databases with Noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
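For illustration (placeholder feature array; eps and min_samples values are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import dbscan
>>> X = np.random.RandomState(0).randn(100, 2)  # placeholder feature array
>>> core_samples, labels = dbscan(X, eps=0.5, min_samples=5)
>>> n_noise = np.sum(labels == -1)              # noisy samples carry the label -1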
sklearn.cluster.mean_shift
sklearn.cluster.mean_shift(X, bandwidth=None, seeds=None, bin_seeding=False, cluster_all=True,
max_iterations=300)
Perform MeanShift Clustering of data using a flat kernel
Seed using a binning technique for scalability.
Parameters X : array [n_samples, n_features]
Input points
bandwidth : float, optional
Kernel bandwidth. If bandwidth is not defined, it is set using a heuristic given by the
median of all pairwise distances
seeds: array [n_seeds, n_features] :
Points used as initial kernel locations
bin_seeding: boolean :
If true, initial kernel locations are not locations of all points, but rather the location of
the discretized version of points, where points are binned onto a grid whose coarseness
corresponds to the bandwidth. Setting this option to True will speed up the algorithm
because fewer seeds will be initialized. Default value: False. Ignored if the seeds argument
is not None.
min_bin_freq: int, optional :
To speed up the algorithm, accept only those bins with at least min_bin_freq points as
seeds. If not defined, set to 1.
Returns cluster_centers : array [n_clusters, n_features]
Coordinates of cluster centers
labels : array [n_samples]
cluster labels for each point
Notes
See examples/plot_meanshift.py for an example.
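A minimal sketch on placeholder data, combining the bandwidth heuristic with the flat-kernel clustering:
>>> import numpy as np
>>> from sklearn.cluster import mean_shift, estimate_bandwidth
>>> X = np.random.RandomState(0).randn(200, 2)  # placeholder points
>>> bw = estimate_bandwidth(X, quantile=0.3)    # heuristic kernel bandwidth
>>> cluster_centers, labels = mean_shift(X, bandwidth=bw)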
sklearn.cluster.spectral_clustering
sklearn.cluster.spectral_clustering(affinity, n_clusters=8, n_components=None,
mode=None, random_state=None, n_init=10, k=None)
Apply k-means to a projection of the normalized Laplacian
In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex
or more generally when a measure of the center and spread of the cluster is not a suitable description of the
complete cluster. For instance when clusters are nested circles in the 2D plane.
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
Parameters affinity: array-like or sparse matrix, shape: (n_samples, n_samples) :
The affinity matrix describing the relationship of the samples to embed. Must be symmetric.
Possible examples:
adjacency matrix of a graph,
heat kernel of the pairwise distance matrix of the samples,
symmetric k-nearest neighbours connectivity matrix of the samples.
n_clusters: integer, optional :
Number of clusters to extract.
n_components: integer, optional, default is k :
Number of eigenvectors to use for the spectral embedding
mode: {None, 'arpack' or 'amg'} :
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It
can be faster on very large, sparse problems, but may also lead to instabilities
random_state: int seed, RandomState instance, or None (default) :
A pseudo random number generator used for the initialization of the lobpcg eigenvector
decomposition when mode == 'amg' and by the K-Means initialization.
n_init: int, optional, default: 10 :
Number of times the k-means algorithm will be run with different centroid seeds. The
final results will be the best output of n_init consecutive runs in terms of inertia.
Returns labels: array of integers, shape: n_samples :
The labels of the clusters.
centers: array of integers, shape: k :
The indices of the cluster centers
Notes
The graph should contain only one connected component; otherwise the results make little sense.
This algorithm solves the normalized cut for k=2: it is a normalized spectral clustering.
References
Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
1.8.2 sklearn.covariance: Covariance Estimators
The sklearn.covariance module includes methods and algorithms to robustly estimate the covariance of fea-
tures given a set of points. The precision matrix defined as the inverse of the covariance is also estimated. Covariance
estimation is closely related to the theory of Gaussian Graphical Models.
User guide: See the Covariance estimation section for further details.
covariance.EmpiricalCovariance([...]) Maximum likelihood covariance estimator
covariance.EllipticEnvelope([...]) An object for detecting outliers in a Gaussian distributed dataset.
covariance.GraphLasso([alpha, mode, tol, ...]) Sparse inverse covariance estimation with an l1-penalized estimator.
covariance.GraphLassoCV([alphas, ...]) Sparse inverse covariance w/ cross-validated choice of the l1 penalty
covariance.LedoitWolf([store_precision, ...]) LedoitWolf Estimator
covariance.MinCovDet([store_precision, ...]) Minimum Covariance Determinant (MCD): robust estimator of covariance
covariance.OAS([store_precision, ...]) Oracle Approximating Shrinkage Estimator
covariance.ShrunkCovariance([...]) Covariance estimator with shrinkage
sklearn.covariance.EmpiricalCovariance
class sklearn.covariance.EmpiricalCovariance(store_precision=True, assume_centered=False)
Maximum likelihood covariance estimator
Parameters store_precision : bool
Specifies if the estimated precision is stored
Attributes
covariance_ 2D ndarray, shape (n_features, n_features) Estimated covariance matrix
precision_ 2D ndarray, shape (n_features, n_features) Estimated pseudo-inverse matrix. (stored only if store_precision is True)
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X) Fits the Maximum Likelihood Estimator covariance model
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, assume_centered=False)
Parameters store_precision: bool :
Specify if the estimated precision is stored
assume_centered: Boolean :
If True, data are not centered before computation. Useful when working with data
whose mean is almost, but not exactly zero. If False, data are centered before computa-
tion.
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X)
Fits the Maximum Likelihood Estimator covariance model according to the given training data and param-
eters.
Parameters X : array-like, shape = [n_samples, n_features]
Training data, where n_samples is the number of samples and n_features is the number
of features.
Returns self : object
Returns self.
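A short usage sketch with random placeholder data, reading the fitted attributes listed above:
>>> import numpy as np
>>> from sklearn.covariance import EmpiricalCovariance
>>> X = np.random.RandomState(0).randn(100, 3)  # placeholder observations
>>> emp = EmpiricalCovariance().fit(X)
>>> emp.covariance_.shape                       # (n_features, n_features)
(3, 3)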
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.EllipticEnvelope
class sklearn.covariance.EllipticEnvelope(store_precision=True, assume_centered=False,
support_fraction=None, contamination=0.1)
An object for detecting outliers in a Gaussian distributed dataset.
Parameters store_precision: bool :
Specify if the estimated precision is stored
assume_centered: Boolean :
If True, the support of robust location and covariance estimates is computed, and a
covariance estimate is recomputed from it, without centering the data. Useful to work
with data whose mean is significantly equal to zero but is not exactly zero. If False,
the robust location and covariance are directly computed with the FastMCD algorithm
without additional treatment.
support_fraction: float, 0 < support_fraction < 1 :
The proportion of points to be included in the support of the raw MCD estimate. Default
is None, which implies that the minimum value of support_fraction will be used within
the algorithm: [n_sample + n_features + 1] / 2
contamination: float, 0. < contamination < 0.5 :
The amount of contamination of the data set, i.e. the proportion of outliers in the data
set.
See Also:
EmpiricalCovariance, MinCovDet
Notes
Outlier detection from covariance estimation may break or not perform well in high-dimensional settings. In
particular, one will always take care to work with n_samples > n_features ** 2.
References
Attributes
contamination: float, 0. < contamination < 0.5 The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
location_: array-like, shape (n_features,) Estimated robust location
covariance_: array-like, shape (n_features, n_features) Estimated robust covariance matrix
precision_: array-like, shape (n_features, n_features) Estimated pseudo inverse matrix. (stored only if store_precision is True)
support_: array-like, shape (n_samples,) A mask of the observations that have been used to compute the robust estimates of location and shape.
Methods
correct_covariance(data) Apply a correction to raw Minimum Covariance Determinant estimates.
decision_function(X[, raw_mahalanobis]) Compute the decision function of the given observations.
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X)
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
predict(X) Outlyingness of observations in X according to the fitted model.
reweight_covariance(data) Reweight raw Minimum Covariance Determinant estimates.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, assume_centered=False, support_fraction=None, contamination=0.1)
correct_covariance(data)
Apply a correction to raw Minimum Covariance Determinant estimates.
Correction using the empirical correction factor suggested by Rousseeuw and Van Driessen in
[Rouseeuw1984].
Parameters data: array-like, shape (n_samples, n_features) :
The data matrix, with p features and n samples. The data set must be the one which was
used to compute the raw estimates.
Returns covariance_corrected: array-like, shape (n_features, n_features) :
Corrected robust covariance estimate.
decision_function(X, raw_mahalanobis=False)
Compute the decision function of the given observations.
Parameters X: array-like, shape (n_samples, n_features) :
raw_mahalanobis: bool :
Whether or not to consider raw Mahalanobis distances as the decision function. Must
be False (default) for compatibility with the other outlier detection tools.
Returns decision: array-like, shape (n_samples, ) :
The values of the decision function for each observation. It is equal to the Mahalanobis
distances if raw_mahalanobis is True. By default (raw_mahalanobis=False), it is
equal to the cubic root of the shifted Mahalanobis distances. In that case, the threshold
for being an outlier is 0, which ensures compatibility with other outlier detection tools
such as the One-Class SVM.
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
predict(X)
Outlyingness of observations in X according to the fitted model.
Parameters X: array-like, shape = (n_samples, n_features) :
Returns is_outliers: array, shape = (n_samples, ), dtype = bool :
For each observation, tells whether or not it should be considered as an outlier according
to the fitted model.
threshold: float :
The value of the less outlying point's decision function.
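A hedged sketch with placeholder data (the contamination value is arbitrary); with the default settings described above, negative decision-function values flag candidate outliers:
>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> X = np.random.RandomState(0).randn(100, 2)  # placeholder, mostly Gaussian data
>>> env = EllipticEnvelope(contamination=0.1)
>>> _ = env.fit(X)                              # fit the robust covariance model
>>> scores = env.decision_function(X)           # values below 0 suggest outliers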
reweight_covariance(data)
Reweight raw Minimum Covariance Determinant estimates.
Reweight observations using Rousseeuw's method (equivalent to deleting outlying observations from the
data set before computing location and covariance estimates). [Rouseeuw1984]
Parameters data: array-like, shape (n_samples, n_features) :
The data matrix, with p features and n samples. The data set must be the one which was
used to compute the raw estimates.
Returns location_reweighted: array-like, shape (n_features, ) :
Reweighted robust location estimate.
covariance_reweighted: array-like, shape (n_features, n_features) :
Reweighted robust covariance estimate.
support_reweighted: array-like, type boolean, shape (n_samples,) :
A mask of the observations that have been used to compute the reweighted robust loca-
tion and covariance estimates.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.GraphLasso
class sklearn.covariance.GraphLasso(alpha=0.01, mode='cd', tol=0.0001, max_iter=100, verbose=False)
Sparse inverse covariance estimation with an l1-penalized estimator.
Parameters alpha: positive float, optional :
The regularization parameter: the higher alpha, the more regularization, the sparser the
inverse covariance
cov_init: 2D array (n_features, n_features), optional :
The initial guess for the covariance
mode: {'cd', 'lars'} :
The Lasso solver to use: coordinate descent or LARS. Use LARS for very sparse underlying
graphs, where p > n. Elsewhere prefer 'cd' which is more numerically stable.
tol: positive float, optional :
The tolerance to declare convergence: if the dual gap goes below this value, iterations
are stopped
max_iter: integer, optional :
The maximum number of iterations
verbose: boolean, optional :
If verbose is True, the objective function and dual gap are plotted at each iteration
See Also:
graph_lasso, GraphLassoCV
Attributes
covariance_ array-like, shape (n_features, n_features) Estimated covariance matrix
precision_ array-like, shape (n_features, n_features) Estimated pseudo inverse matrix.
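A minimal sketch with placeholder data (the alpha value is arbitrary), reading the sparse precision estimate:
>>> import numpy as np
>>> from sklearn.covariance import GraphLasso
>>> X = np.random.RandomState(0).randn(60, 5)  # placeholder observations
>>> model = GraphLasso(alpha=0.05)
>>> _ = model.fit(X)                           # fit(X[, y]) as listed below
>>> precision = model.precision_               # estimated sparse inverse covariance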
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X[, y])
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=0.01, mode='cd', tol=0.0001, max_iter=100, verbose=False)
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.GraphLassoCV
class sklearn.covariance.GraphLassoCV(alphas=4, n_refinements=4, cv=None, tol=0.0001,
max_iter=100, mode='cd', n_jobs=1, verbose=False)
Sparse inverse covariance w/ cross-validated choice of the l1 penalty
Parameters alphas: integer, or list of positive floats, optional :
If an integer is given, it fixes the number of points on the grids of alpha to be used. If
a list is given, it gives the grid to be used. See the notes in the class docstring for more
details.
n_refinements: strictly positive integer :
The number of times the grid is refined. Not used if explicit values of alphas are passed.
cv : cross-validation generator, optional
see sklearn.cross_validation module. If None is passed, default to a 3-fold strategy
tol: positive float, optional :
The tolerance to declare convergence: if the dual gap goes below this value, iterations
are stopped
max_iter: integer, optional :
The maximum number of iterations
mode: {'cd', 'lars'} :
The Lasso solver to use: coordinate descent or LARS. Use LARS for very sparse underlying
graphs, where p > n. Elsewhere prefer 'cd' which is more numerically stable.
n_jobs: int, optional :
number of jobs to run in parallel (default 1)
verbose: boolean, optional :
If verbose is True, the objective function and dual gap are printed at each iteration
See Also:
graph_lasso, GraphLasso
Notes
The search for the optimal alpha is done on an iteratively refined grid: first the cross-validated scores on a grid
are computed, then a new refined grid is centered around the maximum...
One of the challenges that we have to face is that the solvers can fail to converge to a well-conditioned estimate.
The corresponding values of alpha then come out as missing values, but the optimum may be close to these
missing values.
Attributes
covariance_ array-like, shape (n_features, n_features) Estimated covariance matrix
precision_ array-like, shape (n_features, n_features) Estimated precision matrix (inverse covariance).
alpha_: float Penalization parameter selected
cv_alphas_: list of floats All the penalization parameters explored
cv_scores: 2D array (n_alphas, n_folds) The log-likelihood score on left-out data across the folds.
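A brief sketch with placeholder data; the cross-validated grid search selects the penalization parameter automatically:
>>> import numpy as np
>>> from sklearn.covariance import GraphLassoCV
>>> X = np.random.RandomState(0).randn(60, 5)  # placeholder observations
>>> model = GraphLassoCV()
>>> _ = model.fit(X)
>>> alpha = model.alpha_                       # selected penalization parameter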
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X[, y])
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(alphas=4, n_refinements=4, cv=None, tol=0.0001, max_iter=100, mode='cd', n_jobs=1,
verbose=False)
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.LedoitWolf
class sklearn.covariance.LedoitWolf(store_precision=True, assume_centered=False)
LedoitWolf Estimator
Ledoit-Wolf is a particular form of shrinkage, where the shrinkage coefficient is computed using O. Ledoit and
M. Wolf's formula as described in A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices,
Ledoit and Wolf, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
Parameters store_precision : bool
Specify if the estimated precision is stored
Notes
The regularised covariance is:
(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)
where mu = trace(cov) / n_features and shrinkage is given by the Ledoit and Wolf formula (see References)
References
A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices, Ledoit and Wolf, Journal of Mul-
tivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
Attributes
covariance_ array-like, shape (n_features, n_features) Estimated covariance matrix
precision_ array-like, shape (n_features, n_features) Estimated pseudo inverse matrix. (stored only if store_precision is True)
shrinkage_: float, 0 <= shrinkage <= 1 coefficient in the convex combination used for the computation of the shrunk estimate.
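A minimal usage sketch with placeholder data, reading the shrinkage coefficient documented above:
>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf
>>> X = np.random.RandomState(0).randn(40, 5)  # placeholder observations
>>> lw = LedoitWolf().fit(X)
>>> shrinkage = lw.shrinkage_                  # convex-combination coefficient in [0, 1]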
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X[, assume_centered]) Fits the Ledoit-Wolf shrunk covariance model
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, assume_centered=False)
Parameters store_precision: bool :
Specify if the estimated precision is stored
assume_centered: Boolean :
If True, data are not centered before computation. Useful when working with data
whose mean is almost, but not exactly zero. If False, data are centered before computa-
tion.
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X, assume_centered=False)
Fits the Ledoit-Wolf shrunk covariance model according to the given training data and parameters.
Parameters X : array-like, shape = [n_samples, n_features]
Training data, where n_samples is the number of samples and n_features is the number
of features.
assume_centered: Boolean :
If True, data are not centered before computation. Useful to work with data whose
mean is significantly equal to zero but is not exactly zero. If False, data are centered
before computation.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.MinCovDet
class sklearn.covariance.MinCovDet(store_precision=True, assume_centered=False, support_fraction=None, random_state=None)
Minimum Covariance Determinant (MCD): robust estimator of covariance
Parameters store_precision: bool :
Specify if the estimated precision is stored
assume_centered: Boolean :
If True, the support of robust location and covariance estimates is computed, and a
covariance estimate is recomputed from it, without centering the data. Useful to work
with data whose mean is significantly equal to zero but is not exactly zero. If False,
the robust location and covariance are directly computed with the FastMCD algorithm
without additional treatment.
support_fraction: float, 0 < support_fraction < 1 :
The proportion of points to be included in the support of the raw MCD estimate. Default
is None, which implies that the minimum value of support_fraction will be used within
the algorithm: [n_sample + n_features + 1] / 2
random_state: integer or numpy.RandomState, optional :
The random generator used. If an integer is given, it fixes the seed. Defaults to the
global numpy random number generator.
References
[Rouseeuw1984], [Rouseeuw1999], [Butler1993]
Attributes
raw_location_: array-like, shape (n_features,) The raw robust estimated location before correction and reweighting
raw_covariance_: array-like, shape (n_features, n_features) The raw robust estimated covariance before correction and reweighting
raw_support_: array-like, shape (n_samples,) A mask of the observations that have been used to compute the raw robust estimates of location and shape, before correction and reweighting.
location_: array-like, shape (n_features,) Estimated robust location
covariance_: array-like, shape (n_features, n_features) Estimated robust covariance matrix
precision_: array-like, shape (n_features, n_features) Estimated pseudo inverse matrix. (stored only if store_precision is True)
support_: array-like, shape (n_samples,) A mask of the observations that have been used to compute the robust estimates of location and shape.
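A short sketch with placeholder data, reading the robust estimates listed above:
>>> import numpy as np
>>> from sklearn.covariance import MinCovDet
>>> X = np.random.RandomState(0).randn(100, 3)  # placeholder, mostly inlying data
>>> mcd = MinCovDet(random_state=0).fit(X)
>>> robust_cov = mcd.covariance_                # reweighted robust covariance
>>> support = mcd.support_                      # mask of observations used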
Methods
correct_covariance(data) Apply a correction to raw Minimum Covariance Determinant estimates.
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X) Fits a Minimum Covariance Determinant with the FastMCD algorithm.
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
reweight_covariance(data) Reweight raw Minimum Covariance Determinant estimates.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, assume_centered=False, support_fraction=None, random_state=None)
correct_covariance(data)
Apply a correction to raw Minimum Covariance Determinant estimates.
Correction using the empirical correction factor suggested by Rousseeuw and Van Driessen in
[Rouseeuw1984].
Parameters data: array-like, shape (n_samples, n_features) :
The data matrix, with p features and n samples. The data set must be the one which was
used to compute the raw estimates.
Returns covariance_corrected: array-like, shape (n_features, n_features) :
Corrected robust covariance estimate.
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X)
Fits a Minimum Covariance Determinant with the FastMCD algorithm.
Parameters X: array-like, shape = [n_samples, n_features] :
Training data, where n_samples is the number of samples and n_features is the number
of features.
Returns self: object :
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
reweight_covariance(data)
Reweight raw Minimum Covariance Determinant estimates.
Reweight observations using Rousseeuw's method (equivalent to deleting outlying observations from the
data set before computing location and covariance estimates). [Rouseeuw1984]
Parameters data: array-like, shape (n_samples, n_features) :
The data matrix, with p features and n samples. The data set must be the one which was
used to compute the raw estimates.
Returns location_reweighted: array-like, shape (n_features, ) :
Reweighted robust location estimate.
covariance_reweighted: array-like, shape (n_features, n_features) :
Reweighted robust covariance estimate.
support_reweighted: array-like, type boolean, shape (n_samples,) :
A mask of the observations that have been used to compute the reweighted robust loca-
tion and covariance estimates.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.OAS
class sklearn.covariance.OAS(store_precision=True, assume_centered=False)
Oracle Approximating Shrinkage Estimator
OAS is a particular form of shrinkage described in Shrinkage Algorithms for MMSE Covariance Estimation
Chen et al., IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
The formula used here does not correspond to the one given in the article. It has been taken from the MATLAB
program available from the authors' webpage (https://tbayes.eecs.umich.edu/yilun/covestimation).
Parameters store_precision : bool
Specify if the estimated precision is stored
Notes
The regularised covariance is:
(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)
where mu = trace(cov) / n_features and shrinkage is given by the OAS formula (see References)
References
Shrinkage Algorithms for MMSE Covariance Estimation Chen et al., IEEE Trans. on Sign. Proc., Volume 58,
Issue 10, October 2010.
Attributes
covariance_ array-like, shape (n_features, n_features) Estimated covariance matrix
precision_ array-like, shape (n_features, n_features) Estimated pseudo inverse matrix. (stored only if store_precision is True)
shrinkage_: float, 0 <= shrinkage <= 1 coefficient in the convex combination used for the computation of the shrunk estimate.
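A minimal sketch with placeholder data:
>>> import numpy as np
>>> from sklearn.covariance import OAS
>>> X = np.random.RandomState(0).randn(40, 5)  # placeholder observations
>>> oa = OAS().fit(X)
>>> shrunk = oa.covariance_                    # OAS-shrunk covariance estimate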
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X[, assume_centered]) Fits the Oracle Approximating Shrinkage covariance model
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, assume_centered=False)
Parameters store_precision: bool :
Specify if the estimated precision is stored
assume_centered: Boolean :
If True, data are not centered before computation. Useful when working with data
whose mean is almost, but not exactly zero. If False, data are centered before computa-
tion.
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X, assume_centered=False)
Fits the Oracle Approximating Shrinkage covariance model according to the given training data and pa-
rameters.
Parameters X : array-like, shape = [n_samples, n_features]
Training data, where n_samples is the number of samples and n_features is the number
of features.
assume_centered: boolean :
If True, data are not centered before computation. Useful to work with data whose
mean is significantly equal to zero but is not exactly zero. If False, data are centered
before computation.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.covariance.ShrunkCovariance
class sklearn.covariance.ShrunkCovariance(store_precision=True, shrinkage=0.1)
Covariance estimator with shrinkage
Parameters store_precision : bool
Specify if the estimated precision is stored
shrinkage: float, 0 <= shrinkage <= 1 :
coefficient in the convex combination used for the computation of the shrunk estimate.
Notes
The regularized covariance is given by
(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)
where mu = trace(cov) / n_features
Attributes
covariance_ array-like, shape (n_features, n_features) Estimated covariance matrix
precision_ array-like, shape (n_features, n_features) Estimated pseudo inverse matrix. (stored only if store_precision is True)
shrinkage: float, 0 <= shrinkage <= 1 coefficient in the convex combination used for the computation of the shrunk estimate.
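A brief sketch with placeholder data and an arbitrary shrinkage value:
>>> import numpy as np
>>> from sklearn.covariance import ShrunkCovariance
>>> X = np.random.RandomState(0).randn(40, 5)  # placeholder observations
>>> sc = ShrunkCovariance(shrinkage=0.1).fit(X)
>>> shrunk = sc.covariance_                    # (1 - 0.1) * cov + 0.1 * mu * identity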
Methods
error_norm(comp_cov[, norm, scaling, squared]) Computes the Mean Squared Error between two covariance estimators.
fit(X[, assume_centered]) Fits the shrunk covariance model according to the given training data and parameters.
get_params([deep]) Get parameters for the estimator
mahalanobis(observations) Computes the mahalanobis distances of given observations.
score(X_test[, assume_centered]) Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(**params) Set the parameters of the estimator.
__init__(store_precision=True, shrinkage=0.1)
error_norm(comp_cov, norm=frobenius, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm)
Parameters comp_cov: array-like, shape = [n_features, n_features] :
The covariance to compare with.
norm: str :
The type of norm used to compute the error. Available error types: - 'frobenius' (default):
sqrt(tr(A^t.A)) - 'spectral': sqrt(max(eigenvalues(A^t.A))) where A is the error
(comp_cov - self.covariance_).
scaling: bool :
If True (default), the squared error norm is divided by n_features. If False, the squared
error norm is not rescaled.
squared: bool :
Whether to compute the squared error norm or the error norm. If True (default), the
squared error norm is returned. If False, the error norm is returned.
Returns The Mean Squared Error (in the sense of the Frobenius norm) between :
self and comp_cov covariance estimators. :
fit(X, assume_centered=False)
Fits the shrunk covariance model according to the given training data and parameters.
Parameters X : array-like, shape = [n_samples, n_features]
Training data, where n_samples is the number of samples and n_features is the number
of features.
assume_centered: Boolean :
If True, data are not centered before computation. Useful to work with data whose
mean is significantly equal to zero but is not exactly zero. If False, data are centered
before computation.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
mahalanobis(observations)
Computes the mahalanobis distances of given observations.
The provided observations are assumed to be centered. One may want to center them using a location
estimate rst.
Parameters observations: array-like, shape = [n_observations, n_features] :
The observations, the Mahalanobis distances of which we compute.
Returns mahalanobis_distance: array, shape = [n_observations,] :
Mahalanobis distances of the observations.
score(X_test, assume_centered=False)
Computes the log-likelihood of a gaussian data set with self.covariance_ as an estimator of its covariance
matrix.
Parameters X_test : array-like, shape = [n_samples, n_features]
Test data of which we compute the likelihood, where n_samples is the number of sam-
ples and n_features is the number of features.
Returns res : float
The likelihood of the data set with self.covariance_ as an estimator of its covariance
matrix.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
covariance.empirical_covariance(X[, ...]) Computes the Maximum likelihood covariance estimator
covariance.ledoit_wolf(X[, assume_centered]) Estimates the shrunk Ledoit-Wolf covariance matrix.
covariance.shrunk_covariance(emp_cov[, ...]) Calculates a covariance matrix shrunk on the diagonal
covariance.oas(X[, assume_centered]) Estimate covariance with the Oracle Approximating Shrinkage algorithm.
covariance.graph_lasso(emp_cov, alpha[, ...]) l1-penalized covariance estimator
sklearn.covariance.empirical_covariance
sklearn.covariance.empirical_covariance(X, assume_centered=False)
Computes the Maximum likelihood covariance estimator
Parameters X: 2D ndarray, shape (n_samples, n_features) :
Data from which to compute the covariance estimate
assume_centered: Boolean :
If True, data are not centered before computation. Useful when working with data
whose mean is almost, but not exactly zero. If False, data are centered before computa-
tion.
Returns covariance: 2D ndarray, shape (n_features, n_features) :
Empirical covariance (Maximum Likelihood Estimator)
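For illustration (placeholder data):
>>> import numpy as np
>>> from sklearn.covariance import empirical_covariance
>>> X = np.random.RandomState(0).randn(50, 3)  # placeholder observations
>>> cov = empirical_covariance(X)              # maximum likelihood estimate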
sklearn.covariance.ledoit_wolf
sklearn.covariance.ledoit_wolf(X, assume_centered=False)
Estimates the shrunk Ledoit-Wolf covariance matrix.
Parameters X: array-like, shape (n_samples, n_features) :
Data from which to compute the covariance estimate
assume_centered: Boolean :
If True, data are not centered before computation. Useful when working with data whose mean is almost, but not exactly, zero. If False, data are centered before computation.
Returns shrunk_cov: array-like, shape (n_features, n_features) :
Shrunk covariance
shrinkage: float :
Coefficient in the convex combination used for the computation of the shrunk estimate.
Notes
The regularised (shrunk) covariance is:

(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

where mu = trace(cov) / n_features
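An illustrative sketch of the two return values, on arbitrary random data:
>>> import numpy as np
>>> from sklearn.covariance import ledoit_wolf
>>> rng = np.random.RandomState(0)
>>> shrunk_cov, shrinkage = ledoit_wolf(rng.randn(50, 5))
>>> shrunk_cov.shape
(5, 5)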
sklearn.covariance.shrunk_covariance
sklearn.covariance.shrunk_covariance(emp_cov, shrinkage=0.1)
Calculates a covariance matrix shrunk on the diagonal
Parameters emp_cov: array-like, shape (n_features, n_features) :
Covariance matrix to be shrunk
shrinkage: float, 0 <= shrinkage <= 1 :
Coefficient in the convex combination used for the computation of the shrunk estimate.
Returns shrunk_cov: array-like :
shrunk covariance
Notes
The regularized (shrunk) covariance is given by

(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

where mu = trace(cov) / n_features
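A short sketch for illustration: the empirical covariance of arbitrary random data is shrunk toward a scaled identity:
>>> import numpy as np
>>> from sklearn.covariance import empirical_covariance, shrunk_covariance
>>> rng = np.random.RandomState(0)
>>> emp_cov = empirical_covariance(rng.randn(30, 4))
>>> shrunk_covariance(emp_cov, shrinkage=0.2).shape
(4, 4)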
sklearn.covariance.oas
sklearn.covariance.oas(X, assume_centered=False)
Estimate covariance with the Oracle Approximating Shrinkage algorithm.
Parameters X: array-like, shape (n_samples, n_features) :
Data from which to compute the covariance estimate
assume_centered: boolean :
If True, data are not centered before computation. Useful when working with data whose mean is almost, but not exactly, zero. If False, data are centered before computation.
Returns shrunk_cov: array-like, shape (n_features, n_features) :
Shrunk covariance
shrinkage: float :
Coefficient in the convex combination used for the computation of the shrunk estimate.
Notes
The regularised (shrunk) covariance is:

(1 - shrinkage) * cov + shrinkage * mu * np.identity(n_features)

where mu = trace(cov) / n_features
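A minimal sketch for illustration, on arbitrary random data:
>>> import numpy as np
>>> from sklearn.covariance import oas
>>> rng = np.random.RandomState(0)
>>> shrunk_cov, shrinkage = oas(rng.randn(80, 4))
>>> shrunk_cov.shape
(4, 4)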
sklearn.covariance.graph_lasso
sklearn.covariance.graph_lasso(emp_cov, alpha, cov_init=None, mode='cd', tol=0.0001, max_iter=100, verbose=False, return_costs=False, eps=2.2204460492503131e-16)
l1-penalized covariance estimator
Parameters emp_cov: 2D ndarray, shape (n_features, n_features) :
Empirical covariance from which to compute the covariance estimate
alpha: positive float :
The regularization parameter: the higher alpha, the more regularization, the sparser the
inverse covariance
cov_init: 2D array (n_features, n_features), optional :
The initial guess for the covariance
mode: {'cd', 'lars'} :
The Lasso solver to use: coordinate descent or LARS. Use LARS for very sparse underlying graphs, where p > n. Elsewhere prefer 'cd', which is more numerically stable.
tol: positive float, optional :
The tolerance to declare convergence: if the dual gap goes below this value, iterations
are stopped
max_iter: integer, optional :
The maximum number of iterations
verbose: boolean, optional :
If verbose is True, the objective function and dual gap are printed at each iteration
return_costs: boolean, optional :
If return_costs is True, the objective function and dual gap at each iteration are returned
eps: float, optional :
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems.
Returns covariance : 2D ndarray, shape (n_features, n_features)
The estimated covariance matrix
precision : 2D ndarray, shape (n_features, n_features)
The estimated (sparse) precision matrix
costs : list of (objective, dual_gap) pairs
The list of values of the objective function and the dual gap at each iteration. Returned
only if return_costs is True
See Also:
GraphLasso, GraphLassoCV
Notes
The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics
paper. It is the same algorithm as in the R glasso package.
One possible difference with the glasso R package is that the diagonal coefficients are not penalized.
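An illustrative sketch added here: an empirical covariance of arbitrary random data is passed to graph_lasso; the chosen alpha is arbitrary and well-conditioned data are assumed so that the solver converges:
>>> import numpy as np
>>> from sklearn.covariance import empirical_covariance, graph_lasso
>>> rng = np.random.RandomState(0)
>>> emp_cov = empirical_covariance(rng.randn(60, 4))
>>> covariance_, precision_ = graph_lasso(emp_cov, alpha=0.2)
>>> covariance_.shape
(4, 4)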
1.8.3 sklearn.cross_validation: Cross Validation
The sklearn.cross_validation module includes utilities for cross-validation and performance evaluation.
User guide: See the Cross-Validation: evaluating estimator performance section for further details.
cross_validation.Bootstrap(n[, ...]) Random sampling with replacement cross-validation iterator
cross_validation.KFold(n, k[, indices, ...]) K-Folds cross validation iterator
cross_validation.LeaveOneLabelOut(labels[, ...]) Leave-One-Label_Out cross-validation iterator
cross_validation.LeaveOneOut(n[, indices]) Leave-One-Out cross validation iterator.
cross_validation.LeavePLabelOut(labels, p[, ...]) Leave-P-Label_Out cross-validation iterator
cross_validation.LeavePOut(n, p[, indices]) Leave-P-Out cross validation iterator
cross_validation.StratifiedKFold(y, k[, indices]) Stratified K-Folds cross validation iterator
cross_validation.ShuffleSplit(n[, ...]) Random permutation cross-validation iterator.
cross_validation.StratifiedShuffleSplit(y[, ...]) Stratified ShuffleSplit cross validation iterator
sklearn.cross_validation.Bootstrap
class sklearn.cross_validation.Bootstrap(n, n_bootstraps=3, train_size=0.5, test_size=None,
n_train=None, n_test=None, random_state=None)
Random sampling with replacement cross-validation iterator
Provides train/test indices to split data in train test sets while resampling the input n_bootstraps times: each time
a new random split of the data is performed and then samples are drawn (with replacement) on each side of the
split to build the training and test sets.
Note: contrary to other cross-validation strategies, bootstrapping will allow some samples to occur several times in each split. However a sample that occurs in the train split will never occur in the test split and vice-versa.
If you want each sample to occur at most once you should probably use ShuffleSplit cross validation instead.
Parameters n : int
Total number of elements in the dataset.
n_bootstraps : int (default is 3)
Number of bootstrapping iterations
train_size : int or float (default is 0.5)
If int, number of samples to include in the training split (should be smaller than the total number of samples passed in the dataset).
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split.
test_size : int or float or None (default is None)
If int, number of samples to include in the test split (should be smaller than the total number of samples passed in the dataset).
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
If None, n_test is set as the complement of n_train.
random_state : int or RandomState
Pseudo number generator state used for random sampling.
See Also:
ShuffleSplit : cross validation using random permutations.
Examples
>>> from sklearn import cross_validation
>>> bs = cross_validation.Bootstrap(9, random_state=0)
>>> len(bs)
3
>>> print(bs)
Bootstrap(9, n_bootstraps=3, train_size=5, test_size=4, random_state=0)
>>> for train_index, test_index in bs:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
...
TRAIN: [1 8 7 7 8] TEST: [0 3 0 5]
TRAIN: [5 4 2 4 2] TEST: [6 7 1 0]
TRAIN: [4 7 0 1 1] TEST: [5 3 6 5]
__init__(n, n_bootstraps=3, train_size=0.5, test_size=None, n_train=None, n_test=None, ran-
dom_state=None)
sklearn.cross_validation.KFold
class sklearn.cross_validation.KFold(n, k, indices=True, shuffle=False, random_state=None)
K-Folds cross validation iterator
Provides train/test indices to split data in train test sets. Split dataset into k consecutive folds (without shuffling). Each fold is then used as a validation set once while the k - 1 remaining folds form the training set.
Parameters n: int :
Total number of elements
k: int :
Number of folds
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
shuffle: boolean, optional :
whether to shuffle the data before splitting into batches
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
See Also:
StratifiedKFold : take label information into account to avoid building folds with imbalanced class distributions (for classification tasks).
Notes
All the folds have size trunc(n_samples / n_folds); the last one contains the remaining samples.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = cross_validation.KFold(4, k=2)
>>> len(kf)
2
>>> print(kf)
sklearn.cross_validation.KFold(n=4, k=2)
>>> for train_index, test_index in kf:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
__init__(n, k, indices=True, shuffle=False, random_state=None)
sklearn.cross_validation.LeaveOneLabelOut
class sklearn.cross_validation.LeaveOneLabelOut(labels, indices=True)
Leave-One-Label_Out cross-validation iterator
Provides train/test indices to split data according to a third-party provided label. This label information can be used to encode arbitrary domain-specific stratifications of the samples as integers.
For instance the labels could be the year of collection of the samples and thus allow for cross-validation against
time-based splits.
Parameters labels : array-like of int with shape (n_samples,)
Arbitrary domain-specific stratification of the data to be used to draw the splits.
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([1, 2, 1, 2])
>>> labels = np.array([1, 1, 2, 2])
>>> lol = cross_validation.LeaveOneLabelOut(labels)
>>> len(lol)
2
>>> print(lol)
sklearn.cross_validation.LeaveOneLabelOut(labels=[1 1 2 2])
>>> for train_index, test_index in lol:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... print("%s %s %s %s" % (X_train, X_test, y_train, y_test))
TRAIN: [2 3] TEST: [0 1]
[[5 6]
[7 8]] [[1 2]
[3 4]] [1 2] [1 2]
TRAIN: [0 1] TEST: [2 3]
[[1 2]
[3 4]] [[5 6]
[7 8]] [1 2] [1 2]
__init__(labels, indices=True)
sklearn.cross_validation.LeaveOneOut
class sklearn.cross_validation.LeaveOneOut(n, indices=True)
Leave-One-Out cross validation iterator.
Provides train/test indices to split data in train test sets. Each sample is used once as a test set (singleton) while
the remaining samples form the training set.
Due to the high number of test sets (which is the same as the number of samples) this cross validation method can be very costly. For large datasets one should favor KFold, StratifiedKFold or ShuffleSplit.
Parameters n: int :
Total number of elements
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
See Also:
LeaveOneLabelOut : domain-specific stratification of the samples.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4]])
>>> y = np.array([1, 2])
>>> loo = cross_validation.LeaveOneOut(2)
>>> len(loo)
2
>>> print(loo)
sklearn.cross_validation.LeaveOneOut(n=2)
>>> for train_index, test_index in loo:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... print("%s %s %s %s" % (X_train, X_test, y_train, y_test))
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
__init__(n, indices=True)
sklearn.cross_validation.LeavePLabelOut
class sklearn.cross_validation.LeavePLabelOut(labels, p, indices=True)
Leave-P-Label_Out cross-validation iterator
Provides train/test indices to split data according to a third-party provided label. This label information can be used to encode arbitrary domain-specific stratifications of the samples as integers.
For instance the labels could be the year of collection of the samples and thus allow for cross-validation against
time-based splits.
The difference between LeavePLabelOut and LeaveOneLabelOut is that the former builds the test sets with all the samples assigned to p different values of the labels, while the latter uses samples all assigned the same label.
Parameters labels : array-like of int with shape (n_samples,)
Arbitrary domain-specific stratification of the data to be used to draw the splits.
p : int
Number of samples to leave out in the test split.
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 1])
>>> labels = np.array([1, 2, 3])
>>> lpl = cross_validation.LeavePLabelOut(labels, p=2)
>>> len(lpl)
3
>>> print(lpl)
sklearn.cross_validation.LeavePLabelOut(labels=[1 2 3], p=2)
>>> for train_index, test_index in lpl:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... print("%s %s %s %s" % (X_train, X_test, y_train, y_test))
TRAIN: [2] TEST: [0 1]
[[5 6]] [[1 2]
[3 4]] [1] [1 2]
TRAIN: [1] TEST: [0 2]
[[3 4]] [[1 2]
[5 6]] [2] [1 1]
TRAIN: [0] TEST: [1 2]
[[1 2]] [[3 4]
[5 6]] [1] [2 1]
__init__(labels, p, indices=True)
sklearn.cross_validation.LeavePOut
class sklearn.cross_validation.LeavePOut(n, p, indices=True)
Leave-P-Out cross validation iterator
Provides train/test indices to split data in train test sets. The test set is built using p samples while the remaining
samples form the training set.
Due to the high number of iterations which grows with the number of samples this cross validation method can be very costly. For large datasets one should favor KFold, StratifiedKFold or ShuffleSplit.
Parameters n: int :
Total number of elements
p: int :
Size of the test sets
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([1, 2, 3, 4])
>>> lpo = cross_validation.LeavePOut(4, 2)
>>> len(lpo)
6
>>> print(lpo)
sklearn.cross_validation.LeavePOut(n=4, p=2)
>>> for train_index, test_index in lpo:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [1 2] TEST: [0 3]
TRAIN: [0 3] TEST: [1 2]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [0 1] TEST: [2 3]
__init__(n, p, indices=True)
sklearn.cross_validation.StratifiedKFold
class sklearn.cross_validation.StratifiedKFold(y, k, indices=True)
Stratified K-Folds cross validation iterator
Provides train/test indices to split data in train test sets.
This cross-validation object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.
Parameters y: array, [n_samples] :
Samples to split in K folds
k: int :
Number of folds
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
Notes
All the folds have size trunc(n_samples / n_folds); the last one contains the remaining samples.
Examples
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = cross_validation.StratifiedKFold(y, k=2)
>>> len(skf)
2
>>> print(skf)
sklearn.cross_validation.StratifiedKFold(labels=[0 0 1 1], k=2)
>>> for train_index, test_index in skf:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
__init__(y, k, indices=True)
sklearn.cross_validation.ShuffleSplit
class sklearn.cross_validation.ShuffleSplit(n, n_iterations=10, test_size=0.1,
train_size=None, indices=True, ran-
dom_state=None, test_fraction=None,
train_fraction=None)
Random permutation cross-validation iterator.
Yields indices to split data into training and test sets.
Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different,
although this is still very likely for sizeable datasets.
Parameters n : int
Total number of elements in the dataset.
n_iterations : int (default 10)
Number of re-shuffling & splitting iterations.
test_size : float (default 0.1) or int
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test fraction.
indices : boolean, optional (default True)
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
See Also:
Bootstrap : cross-validation using re-sampling with replacement.
Examples
>>> from sklearn import cross_validation
>>> rs = cross_validation.ShuffleSplit(4, n_iterations=3,
... test_size=.25, random_state=0)
>>> len(rs)
3
>>> print(rs)
...
ShuffleSplit(4, n_iterations=3, test_size=0.25, indices=True, ...)
>>> for train_index, test_index in rs:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
...
TRAIN: [3 1 0] TEST: [2]
TRAIN: [2 1 3] TEST: [0]
TRAIN: [0 2 1] TEST: [3]
>>> rs = cross_validation.ShuffleSplit(4, n_iterations=3,
... train_size=0.5, test_size=.25, random_state=0)
>>> for train_index, test_index in rs:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
...
TRAIN: [3 1] TEST: [2]
TRAIN: [2 1] TEST: [0]
TRAIN: [0 2] TEST: [3]
__init__(n, n_iterations=10, test_size=0.1, train_size=None, indices=True, random_state=None,
test_fraction=None, train_fraction=None)
sklearn.cross_validation.StratifiedShuffleSplit
class sklearn.cross_validation.StratifiedShuffleSplit(y, n_iterations=10, test_size=0.1, train_size=None, indices=True, random_state=None)
Stratified ShuffleSplit cross validation iterator
Provides train/test indices to split data in train test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
Parameters y: array, [n_samples] :
Labels of samples.
n_iterations : int (default 10)
Number of re-shuffling & splitting iterations.
test_size : float (default 0.1) or int
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test fraction.
indices: boolean, optional (default True) :
Return train/test split as arrays of indices, rather than a boolean mask array. Integer
indices are required when dealing with sparse matrices, since those cannot be indexed
by boolean masks.
Examples
>>> from sklearn.cross_validation import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> sss = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0)
>>> len(sss)
3
>>> print(sss)
StratifiedShuffleSplit(labels=[0 0 1 1], n_iterations=3, ...)
>>> for train_index, test_index in sss:
... print("TRAIN: %s TEST: %s" % (train_index, test_index))
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [0 3] TEST: [1 2]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [1 2] TEST: [0 3]
__init__(y, n_iterations=10, test_size=0.1, train_size=None, indices=True, random_state=None)
cross_validation.train_test_split(*arrays, ...) Split arrays or matrices into random train and test subsets
cross_validation.cross_val_score(estimator, X) Evaluate a score by cross-validation
cross_validation.permutation_test_score(...) Evaluate the significance of a cross-validated score with permutations
cross_validation.check_cv(cv[, X, y, classifier]) Input checker utility for building a CV in a user friendly way.
sklearn.cross_validation.train_test_split
sklearn.cross_validation.train_test_split(*arrays, **options)
Split arrays or matrices into random train and test subsets
Quick utility that wraps calls to check_arrays and iter(ShuffleSplit(n_samples)).next()
and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.
Parameters *arrays : sequence of arrays or scipy.sparse matrices with same shape[0]
Python lists or tuples occurring in arrays are converted to 1D numpy arrays.
test_size : float (default 0.25) or int
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test fraction.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
dtype : a numpy dtype instance, None by default
Enforce a specific dtype.
Examples
>>> import numpy as np
>>> from sklearn.cross_validation import train_test_split
>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> a
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(b)
[0, 1, 2, 3, 4]
>>> a_train, a_test, b_train, b_test = train_test_split(
... a, b, test_size=0.33, random_state=42)
...
>>> a_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> b_train
array([2, 0, 3])
>>> a_test
array([[2, 3],
[8, 9]])
>>> b_test
array([1, 4])
sklearn.cross_validation.cross_val_score
sklearn.cross_validation.cross_val_score(estimator, X, y=None, score_func=None,
cv=None, n_jobs=1, verbose=0)
Evaluate a score by cross-validation
Parameters estimator: estimator object implementing 'fit' :
The object to use to fit the data
X: array-like of shape at least 2D :
The data to fit.
y: array-like, optional :
The target variable to try to predict in the case of supervised learning.
score_func: callable, optional :
callable, has priority over the score function in the estimator. In a non-supervised set-
ting, where y is None, it takes the test data (X_test) as its only argument. In a supervised
setting it takes the test target (y_true) and the test prediction (y_pred) as arguments.
cv: cross-validation generator, optional :
A cross-validation generator. If None, a 3-fold cross validation is used or 3-fold stratified cross-validation when y is supplied and estimator is a classifier.
n_jobs: integer, optional :
The number of CPUs to use to do the computation. -1 means all CPUs.
verbose: integer, optional :
The verbosity level
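A short sketch added for illustration, evaluating a linear SVC on the iris dataset with 5-fold cross-validation; the estimator choice is arbitrary:
>>> from sklearn import cross_validation, datasets, svm
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear')
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores.shape
(5,)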
sklearn.cross_validation.permutation_test_score
sklearn.cross_validation.permutation_test_score(estimator, X, y, score_func, cv=None,
n_permutations=100, n_jobs=1,
labels=None, random_state=0, ver-
bose=0)
Evaluate the significance of a cross-validated score with permutations
Parameters estimator: estimator object implementing 'fit' :
The object to use to fit the data
X: array-like of shape at least 2D :
The data to fit.
y: array-like :
The target variable to try to predict in the case of supervised learning.
score_func: callable :
Callable taking as arguments the test targets (y_test) and the predicted targets (y_pred) and returning a float. The score functions are expected to return a bigger value for a better result, otherwise the returned value does not correspond to a p-value (see Returns below for further details).
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3). Specific cross-validation objects can be passed; see the sklearn.cross_validation module for the list of possible objects
n_jobs: integer, optional :
The number of CPUs to use to do the computation. -1 means all CPUs.
labels: array-like of shape [n_samples] (optional) :
Labels constrain the permutation among groups of samples with a same label.
random_state: RandomState or an int seed (0 by default) :
A random number generator instance to define the state of the random permutations generator.
verbose: integer, optional :
The verbosity level
Returns score: float :
The true score without permuting targets.
permutation_scores : array, shape = [n_permutations]
The scores obtained for each permutation.
pvalue: float :
The returned value equals the p-value if score_func returns bigger numbers for better scores (e.g., zero_one). If score_func is rather a loss function (i.e. when lower is better, such as with mean_squared_error) then this is actually the complement of the p-value: 1 - p-value.
Notes
This function implements Test 1 in:
Ojala and Garriga. Permutation Tests for Studying Classifier Performance. The Journal of Machine Learning Research (2010) vol. 11
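An illustrative sketch added here; the accuracy helper defined below is a hypothetical scorer written for the example (not a library function), and the estimator choice is arbitrary:
>>> import numpy as np
>>> from sklearn import cross_validation, datasets, svm
>>> def accuracy(y_true, y_pred):  # hypothetical helper: bigger is better
...     return np.mean(y_true == y_pred)
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear')
>>> score, permutation_scores, pvalue = cross_validation.permutation_test_score(
...     clf, iris.data, iris.target, score_func=accuracy,
...     cv=5, n_permutations=30, random_state=0)
>>> permutation_scores.shape
(30,)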
sklearn.cross_validation.check_cv
sklearn.cross_validation.check_cv(cv, X=None, y=None, classifier=False)
Input checker utility for building a CV in a user friendly way.
Parameters cv: an integer, a cv generator instance, or None :
The input specifying which cv generator to use. It can be an integer, in which case it is the number of folds in a KFold, None, in which case 3-fold is used, or another object, which will then be used as a cv generator.
X: 2D ndarray :
the data the cross-val object will be applied on
y: 1D ndarray :
the target variable for a supervised learning problem
classifier: boolean, optional :
whether the task is a classification task, in which case StratifiedKFold will be used.
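A minimal sketch added for illustration: passing an integer together with a target vector and classifier=True yields a stratified K-fold generator:
>>> import numpy as np
>>> from sklearn.cross_validation import check_cv
>>> X = np.zeros((6, 2))
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> cv = check_cv(3, X=X, y=y, classifier=True)
>>> len(cv)
3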
1.8.4 sklearn.datasets: Datasets
The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. It also features some artificial data generators.
User guide: See the Dataset loading utilities section for further details.
Loaders
datasets.load_20newsgroups(*args, **kwargs) DEPRECATED: Use fetch_20newsgroups instead with download_if_missing=False
datasets.fetch_20newsgroups([data_home, ...]) Load the filenames of the 20 newsgroups dataset.
datasets.fetch_20newsgroups_vectorized([...]) Load the 20 newsgroups dataset and transform it into tf-idf vectors.
datasets.load_boston() Load and return the boston house-prices dataset (regression).
datasets.load_diabetes() Load and return the diabetes dataset (regression).
datasets.load_digits([n_class]) Load and return the digits dataset (classification).
datasets.load_files(container_path[, ...]) Load text files with categories as subfolder names.
datasets.load_iris() Load and return the iris dataset (classification).
datasets.load_lfw_pairs([download_if_missing]) Alias for fetch_lfw_pairs(download_if_missing=False)
datasets.fetch_lfw_pairs([subset, ...]) Loader for the Labeled Faces in the Wild (LFW) pairs dataset
datasets.load_lfw_people([download_if_missing]) Alias for fetch_lfw_people(download_if_missing=False)
datasets.fetch_lfw_people([data_home, ...]) Loader for the Labeled Faces in the Wild (LFW) people dataset
datasets.load_linnerud() Load and return the linnerud dataset (multivariate regression).
datasets.fetch_olivetti_faces([data_home, ...]) Loader for the Olivetti faces data-set from AT&T.
datasets.load_sample_image(image_name) Load the numpy array of a single sample image
datasets.load_sample_images() Load sample images for image manipulation.
datasets.load_svmlight_file(f[, n_features, ...]) Load datasets in the svmlight / libsvm format into sparse CSR matrix
sklearn.datasets.load_20newsgroups
sklearn.datasets.load_20newsgroups(*args, **kwargs)
DEPRECATED: Use fetch_20newsgroups instead with download_if_missing=False
Alias for fetch_20newsgroups(download_if_missing=False).
See fetch_20newsgroups.__doc__ for documentation and parameter list.
sklearn.datasets.fetch_20newsgroups
sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, download_if_missing=True)
Load the filenames of the 20 newsgroups dataset.
Parameters subset: 'train' or 'test', 'all', optional :
Select the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, with shuffled ordering.
data_home: optional, default: None :
Specify a download and cache folder for the datasets. If None, all scikit-learn data is
stored in ~/scikit_learn_data subfolders.
categories: None or collection of string or unicode :
If None (default), load all the categories. If not None, list of category names to load
(other categories ignored).
shuffle: bool, optional :
Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
random_state: numpy random number generator or seed integer :
Used to shufe the dataset.
download_if_missing: optional, True by default :
If False, raise an IOError if the data is not locally available instead of trying to download
the data from the source site.
sklearn.datasets.fetch_20newsgroups_vectorized
sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', data_home=None)
Load the 20 newsgroups dataset and transform it into tf-idf vectors.
This is a convenience function; the tf-idf transformation is done using the default settings for sklearn.feature_extraction.text.Vectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom Vectorizer or CountVectorizer.
Parameters subset: 'train' or 'test', 'all', optional :
Select the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, with shuffled ordering.
data_home: optional, default: None :
Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ~/scikit_learn_data subfolders.
Returns bunch : Bunch object
bunch.data: sparse matrix, shape [n_samples, n_features] bunch.target: array, shape
[n_samples] bunch.target_names: list, length [n_classes]
sklearn.datasets.load_boston
sklearn.datasets.load_boston()
Load and return the boston house-prices dataset (regression).
Samples total 506
Dimensionality 13
Features real, positive
Targets real 5. - 50.
Returns data : Bunch
Dictionary-like object, the interesting attributes are: data, the data to learn, target,
the regression targets, target_names, the meaning of the labels, and DESCR, the full
description of the dataset.
Examples
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.shape
(506, 13)
sklearn.datasets.load_diabetes
sklearn.datasets.load_diabetes()
Load and return the diabetes dataset (regression).
Samples total 442
Dimensionality 10
Features real, -.2 < x < .2
Targets integer 25 - 346
Returns data : Bunch
Dictionary-like object, the interesting attributes are: data, the data to learn and target,
the regression target for each sample.
sklearn.datasets.load_digits
sklearn.datasets.load_digits(n_class=10)
Load and return the digits dataset (classification).
Each datapoint is a 8x8 image of a digit.
Classes 10
Samples per class ~180
Samples total 1797
Dimensionality 64
Features integers 0-16
Parameters n_class : integer, between 0 and 10, optional (default=10)
The number of classes to return.
Returns data : Bunch
Dictionary-like object, the interesting attributes are: data, the data to learn, images,
the images corresponding to each sample, target, the classification labels for each
sample, target_names, the meaning of the labels, and DESCR, the full description of
the dataset.
Examples
To load the data and visualize the images:
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> digits.data.shape
(1797, 64)
>>> import pylab as pl
>>> pl.gray()
>>> pl.matshow(digits.images[0])
>>> pl.show()
sklearn.datasets.load_files
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, charset=None, charset_error='strict', random_state=0)
Load text files with categories as subfolder names.
Individual samples are assumed to be files stored in a two-level folder structure such as the following:
container_folder/
category_1_folder/file_1.txt file_2.txt ... file_42.txt
category_2_folder/file_43.txt file_44.txt ...
The folder names are used as supervised signal label names. The individual file names are not important.
This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is false it does not try to load the files in memory.
To use utf-8 text files in a scikit-learn classification or clustering algorithm you will first need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.
Similar feature extractors should be built for other kinds of unstructured data input such as images, audio, video, ...
Parameters container_path : string or unicode
Path to the main folder holding one subfolder per category
description: string or unicode, optional (default=None) :
A paragraph describing the characteristic of the dataset: its source, reference, etc.
categories : A collection of strings or None, optional (default=None)
If None (default), load all the categories. If not None, list of category names to load
(other categories ignored).
load_content : boolean, optional (default=True)
Whether to load or not the content of the different files. If true a 'data' attribute containing the text information is present in the data structure returned. If not, a 'filenames' attribute gives the path to the files.
charset : string or None (default is None)
If None, do not try to decode the content of the files (e.g. for images or other non-text content). If not None, charset to use to decode text files if load_content is True.
charset_error: {'strict', 'ignore', 'replace'} :
Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given charset. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
shuffle : bool, optional (default=True)
Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
random_state : int, RandomState instance or None, optional (default=0)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns data : Bunch
Dictionary-like object, the interesting attributes are: either data, the raw text data to learn, or filenames, the files holding it, target, the classification labels (integer index), target_names, the meaning of the labels, and DESCR, the full description of the dataset.
sklearn.datasets.load_iris
sklearn.datasets.load_iris()
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification dataset.
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
Returns data : Bunch
Dictionary-like object, the interesting attributes are: data, the data to learn, target,
the classification labels, target_names, the meaning of the labels, feature_names, the
meaning of the features, and DESCR, the full description of the dataset.
Examples
Let's say you are interested in the samples 10, 25, and 50, and want to know their class name.
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']
sklearn.datasets.load_lfw_pairs
sklearn.datasets.load_lfw_pairs(download_if_missing=False, **kwargs)
Alias for fetch_lfw_pairs(download_if_missing=False)
Check fetch_lfw_pairs.__doc__ for the documentation and parameter list.
sklearn.datasets.fetch_lfw_pairs
sklearn.datasets.fetch_lfw_pairs(subset='train', data_home=None, funneled=True, resize=0.5, color=False, slice_=(slice(70, 195, None), slice(78, 172, None)), download_if_missing=True)
Loader for the Labeled Faces in the Wild (LFW) pairs dataset
This dataset is a collection of JPEG pictures of famous people collected on the internet, all details are available on the official website:
http://vis-www.cs.umass.edu/lfw/
Each picture is centered on a single face. Each pixel of each channel (color in RGB) is encoded by a float in range 0.0 - 1.0.
The task is called Face Verification: given a pair of two pictures, a binary classifier must predict whether the two images are from the same person.
In the official README.txt this task is described as the 'Restricted' task. Because it is unclear how to implement the 'Unrestricted' variant correctly, it is left unsupported for now.
Parameters subset: optional, default: 'train' :
Select the dataset to load: 'train' for the development training set, 'test' for the development test set, and '10_folds' for the official evaluation set that is meant to be used with a 10-fold cross validation.
data_home: optional, default: None :
Specify another download and cache folder for the datasets. By default all scikit learn
data is stored in ~/scikit_learn_data subfolders.
funneled: boolean, optional, default: True :
Download and use the funneled variant of the dataset.
resize: float, optional, default 0.5 :
Ratio used to resize each face picture.
color: boolean, optional, default False :
Keep the 3 RGB channels instead of averaging them to a single gray level channel. If color is True the shape of the data has one more dimension than the shape with color = False.
slice_: optional :
Provide a custom 2D slice (height, width) to extract the interesting part of the jpeg files and avoid using statistical correlation from the background
download_if_missing: optional, True by default :
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
sklearn.datasets.load_lfw_people
sklearn.datasets.load_lfw_people(download_if_missing=False, **kwargs)
Alias for fetch_lfw_people(download_if_missing=False)
Check fetch_lfw_people.__doc__ for the documentation and parameter list.
sklearn.datasets.fetch_lfw_people
sklearn.datasets.fetch_lfw_people(data_home=None, funneled=True, resize=0.5,
min_faces_per_person=None, color=False,
slice_=(slice(70, 195, None), slice(78, 172, None)),
download_if_missing=True)
Loader for the Labeled Faces in the Wild (LFW) people dataset
This dataset is a collection of JPEG pictures of famous people collected on the internet, all details are available on the official website:
http://vis-www.cs.umass.edu/lfw/
Each picture is centered on a single face. Each pixel of each channel (color in RGB) is encoded by a float in range 0.0 - 1.0.
The task is called Face Recognition (or Identification): given the picture of a face, find the name of the person given a training set (gallery).
Parameters data_home: optional, default: None :
Specify another download and cache folder for the datasets. By default all scikit learn
data is stored in ~/scikit_learn_data subfolders.
funneled: boolean, optional, default: True :
Download and use the funneled variant of the dataset.
resize: float, optional, default 0.5 :
Ratio used to resize each face picture.
min_faces_per_person: int, optional, default None :
The extracted dataset will only retain pictures of people that have at least
min_faces_per_person different pictures.
color: boolean, optional, default False :
Keep the 3 RGB channels instead of averaging them to a single gray level channel. If color is True the shape of the data has one more dimension than the shape with color = False.
slice_: optional :
Provide a custom 2D slice (height, width) to extract the interesting part of the jpeg files and avoid using statistical correlation from the background
download_if_missing: optional, True by default :
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
sklearn.datasets.load_linnerud
sklearn.datasets.load_linnerud()
Load and return the linnerud dataset (multivariate regression).
Samples total: 20 Dimensionality: 3 for both data and targets Features: integer Targets: integer
Returns data : Bunch
Dictionary-like object, the interesting attributes are: data and targets, the two mul-
tivariate datasets, with data corresponding to the exercise and targets corresponding
to the physiological measurements, as well as feature_names and target_names.
sklearn.datasets.fetch_olivetti_faces
sklearn.datasets.fetch_olivetti_faces(data_home=None, shuffle=False, random_state=0, download_if_missing=True)
Loader for the Olivetti faces data-set from AT&T.
Parameters data_home : optional, default: None
Specify another download and cache folder for the datasets. By default all scikit learn
data is stored in ~/scikit_learn_data subfolders.
shuffle : boolean, optional
If True the order of the dataset is shuffled to avoid having images of the same person grouped.
download_if_missing: optional, True by default :
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
random_state : optional, integer or RandomState object
The seed or the random number generator used to shuffle the data.
Notes
This dataset consists of 10 pictures each of 40 individuals. The original database was available from (now
defunct)
http://www.uk.research.att.com/facedatabase.html
The version retrieved here comes in MATLAB format from the personal web page of Sam Roweis:
http://www.cs.nyu.edu/~roweis/
sklearn.datasets.load_sample_image
sklearn.datasets.load_sample_image(image_name)
Load the numpy array of a single sample image
Parameters image_name: {'china.jpg', 'flower.jpg'} :
The name of the sample image loaded
Returns img: 3D array :
The image as a numpy array: height x width x color
Examples
>>> from sklearn.datasets import load_sample_image
>>> china = load_sample_image('china.jpg')
>>> china.dtype
dtype('uint8')
>>> china.shape
(427, 640, 3)
>>> flower = load_sample_image('flower.jpg')
>>> flower.dtype
dtype('uint8')
>>> flower.shape
(427, 640, 3)
sklearn.datasets.load_sample_images
sklearn.datasets.load_sample_images()
Load sample images for image manipulation. Loads both, china and flower.
Returns data : Bunch
Dictionary-like object with the following attributes: images, the two sample images, filenames, the file names for the images, and DESCR, the full description of the dataset.
Examples
To load the data and visualize the images:
>>> from sklearn.datasets import load_sample_images
>>> dataset = load_sample_images()
>>> len(dataset.images)
2
>>> first_img_data = dataset.images[0]
>>> first_img_data.shape
(427, 640, 3)
>>> first_img_data.dtype
dtype('uint8')
sklearn.datasets.load_svmlight_file
sklearn.datasets.load_svmlight_file(f, n_features=None, dtype=<type 'numpy.float64'>, multilabel=False, zero_based='auto')
Load datasets in the svmlight / libsvm format into sparse CSR matrix
This format is a text-based format, with one sample per line. It does not store zero valued features, hence it is suitable for sparse datasets.
The first element of each line can be used to store a target variable to predict.
This format is used as the default format for both svmlight and the libsvm command line programs.
Parsing a text based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near instantaneous loading of memmapped structures for the subsequent calls.
This implementation is naive: it allocates too much memory and is slow, since it is written in Python. On large datasets it is recommended to use an optimized loader such as:
https://github.com/mblondel/svmlight-loader
Parameters f: str or file-like open in binary mode. :
(Path to) a file to load.
n_features: int or None :
The number of features to use. If None, it will be inferred. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another.
multilabel: boolean, optional :
Samples may have several labels each (see http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html)
zero_based: boolean or 'auto', optional :
Whether column indices in f are zero-based (True) or one-based (False). If set to 'auto', a heuristic check is applied to determine this from the file contents. Both kinds of files occur in the wild, but they are unfortunately not self-identifying. Using 'auto' or True should always be safe.
Returns (X, y) :
where X is a scipy.sparse matrix of shape (n_samples, n_features), :
y is a ndarray of shape (n_samples,), or, in the multilabel case, a list of tuples of length
n_samples.
See Also:
load_svmlight_files : similar function for loading multiple files in this format, enforcing the same number of features/columns on all of them.
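An illustrative sketch added here: a tiny two-sample file is written to a temporary path and loaded back; the file content and the use of tempfile are arbitrary choices made for the example:
>>> import tempfile
>>> from sklearn.datasets import load_svmlight_file
>>> tmp = tempfile.NamedTemporaryFile(delete=False)
>>> _ = tmp.write(b"1 1:0.5 3:1.2\n-1 2:0.3\n")
>>> tmp.close()
>>> X, y = load_svmlight_file(tmp.name, zero_based=False)
>>> X.shape
(2, 3)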
Samples generator
datasets.make_blobs([n_samples, n_features, ...]) Generate isotropic Gaussian blobs for clustering.
datasets.make_classification([n_samples, ...]) Generate a random n-class classification problem.
datasets.make_circles([n_samples, shuffle, ...]) Make a large circle containing a smaller circle in 2d
datasets.make_friedman1([n_samples, ...]) Generate the Friedman #1 regression problem
datasets.make_friedman2([n_samples, noise, ...]) Generate the Friedman #2 regression problem
datasets.make_friedman3([n_samples, noise, ...]) Generate the Friedman #3 regression problem
datasets.make_hastie_10_2([n_samples, ...]) Generates data for binary classification used in Hastie et al. 2009, Example 10.2.
datasets.make_low_rank_matrix([n_samples, ...]) Generate a mostly low rank matrix with bell-shaped singular values
datasets.make_moons([n_samples, shuffle, ...]) Make two interleaving half circles
datasets.make_multilabel_classification([...]) Generate a random multilabel classification problem.
datasets.make_regression([n_samples, ...]) Generate a random regression problem.
datasets.make_s_curve([n_samples, noise, ...]) Generate an S curve dataset.
datasets.make_sparse_coded_signal(n_samples, ...) Generate a signal as a sparse combination of dictionary elements.
datasets.make_sparse_spd_matrix([dim, ...]) Generate a sparse symmetric definite positive matrix.
datasets.make_sparse_uncorrelated([...]) Generate a random regression problem with sparse uncorrelated design
datasets.make_spd_matrix(n_dim[, random_state]) Generate a random symmetric, positive-definite matrix.
datasets.make_swiss_roll([n_samples, noise, ...]) Generate a swiss roll dataset.
sklearn.datasets.make_blobs
sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)
Generate isotropic Gaussian blobs for clustering.
Parameters n_samples : int, optional (default=100)
The total number of points equally divided among clusters.
n_features : int, optional (default=2)
The number of features for each sample.
centers : int or array of shape [n_centers, n_features], optional
(default=3) The number of centers to generate, or the xed center locations.
cluster_std: float or sequence of floats, optional (default=1.0) :
The standard deviation of the clusters.
center_box: pair of floats (min, max), optional (default=(-10.0, 10.0)) :
The bounding box for each cluster center when centers are generated at random.
shuffle : boolean, optional (default=True)
Shuffle the samples.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The generated samples.
y : array of shape [n_samples]
The integer labels for cluster membership of each sample.
Examples
>>> from sklearn.datasets.samples_generator import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
... random_state=0)
>>> X.shape
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
sklearn.datasets.make_classification
sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
Generate a random n-class classification problem.
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features - n_informative - n_redundant - n_repeated useless features drawn at random.
n_informative : int, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian
clusters each located around the vertices of a hypercube in a subspace of dimension
n_informative. For each cluster, informative features are drawn independently from
N(0, 1) and then randomly linearly combined in order to add covariance. The clusters
are then placed on the vertices of the hypercube.
n_redundant : int, optional (default=2)
The number of redundant features. These features are generated as random linear com-
binations of the informative features.
n_repeated : int, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.
n_classes : int, optional (default=2)
The number of classes (or labels) of the classification problem.
n_clusters_per_class : int, optional (default=2)
The number of clusters per class.
weights : list of floats or None (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred.
flip_y : float, optional (default=0.01)
The fraction of samples whose class are randomly exchanged.
class_sep : float, optional (default=1.0)
The factor multiplying the hypercube dimension.
hypercube : boolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.
shift : float or None, optional (default=0.0)
Shift all features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].
scale : float or None, optional (default=1.0)
Multiply all features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
shuffle : boolean, optional (default=True)
Shuffle the samples and the features.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The generated samples.
y : array of shape [n_samples]
The integer labels for class membership of each sample.
Notes
The algorithm is adapted from Guyon [1] and was designed to generate the Madelon dataset.
References
[R48]
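A minimal sketch added for illustration; the parameter values are arbitrary:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
...                            n_redundant=2, n_classes=3, random_state=0)
>>> X.shape
(100, 20)
>>> y.shape
(100,)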
sklearn.datasets.make_circles
sklearn.datasets.make_circles(n_samples=100, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d
A simple toy dataset to visualize clustering and classification algorithms.
Parameters n_samples : int, optional (default=100)
The total number of points generated.
shuffle: bool, optional (default=True) :
Whether to shuffle the samples.
noise : double or None (default=None)
Standard deviation of Gaussian noise added to the data.
factor : double < 1 (default=.8)
Scale factor between inner and outer circle.
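A short illustrative sketch; the noise level and factor are arbitrary:
>>> from sklearn.datasets import make_circles
>>> X, y = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)
>>> X.shape
(100, 2)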
sklearn.datasets.make_friedman1
sklearn.datasets.make_friedman1(n_samples=100, n_features=10, noise=0.0, ran-
dom_state=None)
Generate the Friedman #1 regression problem
This dataset is described in Friedman [1] and Breiman [2].
Inputs X are independent features uniformly distributed on the interval [0, 1]. The output y is created according
to the formula:
y(X) = 10 * sin(pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4] + noise * N(0, 1).
Out of the n_features features, only 5 are actually used to compute y. The remaining features are independent
of y.
The number of features has to be >= 5.
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=10)
The number of features. Should be at least 5.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The input samples.
y : array of shape [n_samples]
The output values.
References
[R49], [R50]
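Examples
A sketch checking that, with noise=0.0, the output matches the formula above:
>>> import numpy as np
>>> from sklearn.datasets import make_friedman1
>>> X, y = make_friedman1(n_samples=50, n_features=10, noise=0.0, random_state=0)
>>> expected = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
...             + 10 * X[:, 3] + 5 * X[:, 4])
>>> np.allclose(y, expected)
True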
sklearn.datasets.make_friedman2
sklearn.datasets.make_friedman2(n_samples=100, noise=0.0, random_state=None)
Generate the Friedman #2 regression problem
This dataset is described in Friedman [1] and Breiman [2].
Inputs X are 4 independent features uniformly distributed on the intervals:
0 <= X[:, 0] <= 100,
40 * pi <= X[:, 1] <= 560 * pi,
0 <= X[:, 2] <= 1,
1 <= X[:, 3] <= 11.
The output y is created according to the formula:
y(X) = (X[:, 0] ** 2 + (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) ** 2) ** 0.5 + noise * N(0, 1).
Parameters n_samples : int, optional (default=100)
The number of samples.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, 4]
The input samples.
y : array of shape [n_samples]
The output values.
References
[R51], [R52]
sklearn.datasets.make_friedman3
sklearn.datasets.make_friedman3(n_samples=100, noise=0.0, random_state=None)
Generate the Friedman #3 regression problem
This dataset is described in Friedman [1] and Breiman [2].
Inputs X are 4 independent features uniformly distributed on the intervals:
0 <= X[:, 0] <= 100,
40 * pi <= X[:, 1] <= 560 * pi,
0 <= X[:, 2] <= 1,
1 <= X[:, 3] <= 11.
The output y is created according to the formula:
y(X) = arctan((X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) / X[:, 0]) + noise * N(0, 1).
Parameters n_samples : int, optional (default=100)
The number of samples.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, 4]
The input samples.
y : array of shape [n_samples]
The output values.
References
[R53], [R54]
sklearn.datasets.make_hastie_10_2
sklearn.datasets.make_hastie_10_2(n_samples=12000, random_state=None)
Generates data for binary classification used in Hastie et al. 2009, Example 10.2.
The ten features are standard independent Gaussian and the target y is defined by:
y[i] = 1 if np.sum(X[i] ** 2) > 9.34 else -1
Parameters n_samples : int, optional (default=12000)
The number of samples.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, 10]
The input samples.
y : array of shape [n_samples]
The output values.
References
T. Hastie, R. Tibshirani and J. Friedman, Elements of Statistical Learning Ed. 2, Springer, 2009.
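Examples
A minimal usage sketch; the labels take only the two values -1 and +1:
>>> import numpy as np
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=1000, random_state=0)
>>> X.shape
(1000, 10)
>>> bool(np.all((y == 1) | (y == -1)))
True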
sklearn.datasets.make_low_rank_matrix
sklearn.datasets.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10,
tail_strength=0.5, random_state=None)
Generate a mostly low rank matrix with bell-shaped singular values
Most of the variance can be explained by a bell-shaped curve of width effective_rank: the low rank part of the
singular values profile is:
(1 - tail_strength) * exp(-1.0 * (i / effective_rank) ** 2)
The remaining singular values tail is fat, decreasing as:
tail_strength * exp(-0.1 * i / effective_rank).
The low rank part of the profile can be considered the structured signal part of the data while the tail can be
considered the noisy part of the data that cannot be summarized by a low number of linear components (singular
vectors).
This kind of singular profile is often seen in practice, for instance:
gray level pictures of faces
TF-IDF vectors of text documents crawled from the web
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=100)
The number of features.
effective_rank : int, optional (default=10)
The approximate number of singular vectors required to explain most of the data by
linear combinations.
tail_strength : float between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The matrix.
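Examples
A sketch inspecting the singular value profile; with a weak tail the spectrum decays quickly past the effective rank (values illustrative):
>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=50, n_features=25, effective_rank=5,
...                          tail_strength=0.01, random_state=0)
>>> s = np.linalg.svd(X, compute_uv=False)  # singular values, in decreasing order
>>> bool(s[0] > s[10])
True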
sklearn.datasets.make_moons
sklearn.datasets.make_moons(n_samples=100, shuffle=True, noise=None, random_state=None)
Make two interleaving half circles
A simple toy dataset to visualize clustering and classification algorithms.
Parameters n_samples : int, optional (default=100)
The total number of points generated.
shuffle : bool, optional (default=True)
Whether to shuffle the samples.
noise : double or None (default=None)
Standard deviation of Gaussian noise added to the data.
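Examples
A minimal usage sketch (values illustrative):
>>> from sklearn.datasets import make_moons
>>> X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
>>> X.shape
(200, 2)
>>> y.shape
(200,)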
sklearn.datasets.make_multilabel_classification
sklearn.datasets.make_multilabel_classification(n_samples=100, n_features=20,
n_classes=5, n_labels=2, length=50,
allow_unlabeled=True, random_state=None)
Generate a random multilabel classification problem.
For each sample, the generative process is:
pick the number of labels: n ~ Poisson(n_labels)
n times, choose a class c: c ~ Multinomial(theta)
pick the document length: k ~ Poisson(length)
k times, choose a word: w ~ Multinomial(theta_c)
In the above process, rejection sampling is used to make sure that n is never zero or more than n_classes, and
that the document length is never zero. Likewise, we reject classes which have already been chosen.
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=20)
The total number of features.
n_classes : int, optional (default=5)
The number of classes of the classification problem.
n_labels : int, optional (default=2)
The average number of labels per instance. Number of labels follows a Poisson distri-
bution that never takes the value 0.
length : int, optional (default=50)
Sum of the features (number of words if documents).
allow_unlabeled : bool, optional (default=True)
If True, some instances might not belong to any class.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The generated samples.
Y : list of tuples
The label sets.
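Examples
A minimal sketch consistent with the documented return types (X is a feature matrix, Y a list of label tuples); values are illustrative:
>>> from sklearn.datasets import make_multilabel_classification
>>> X, Y = make_multilabel_classification(n_samples=5, n_features=10, n_classes=3,
...                                       random_state=0)
>>> X.shape
(5, 10)
>>> len(Y)
5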
sklearn.datasets.make_regression
sklearn.datasets.make_regression(n_samples=100, n_features=100, n_informative=10,
bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0,
shuffle=True, coef=False, random_state=None)
Generate a random regression problem.
The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile. See the
make_low_rank_matrix for more details.
The output is generated by applying a (potentially biased) random linear regression model with n_informative
nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable
scale.
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=100)
The number of features.
n_informative : int, optional (default=10)
The number of informative features, i.e., the number of features used to build the linear
model used to generate the output.
bias : float, optional (default=0.0)
The bias term in the underlying linear model.
effective_rank : int or None, optional (default=None)
if not None: The approximate number of singular vectors required to explain most of
the input data by linear combinations. Using this kind of singular spectrum in the
input allows the generator to reproduce the correlations often observed in practice.
if None: The input set is well conditioned, centered and gaussian with unit variance.
tail_strength : float between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if
effective_rank is not None.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
shuffle : boolean, optional (default=True)
Shuffle the samples and the features.
coef : boolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The input samples.
y : array of shape [n_samples]
The output values.
coef : array of shape [n_features], optional
The coefficient of the underlying linear model. It is returned only if coef is True.
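Examples
A sketch using coef=True to also retrieve the underlying linear model; only n_informative coefficients are non-zero (values illustrative):
>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> X, y, w = make_regression(n_samples=100, n_features=10, n_informative=3,
...                           noise=0.0, coef=True, random_state=0)
>>> X.shape, y.shape, w.shape
((100, 10), (100,), (10,))
>>> int(np.sum(w != 0))
3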
sklearn.datasets.make_s_curve
sklearn.datasets.make_s_curve(n_samples=100, noise=0.0, random_state=None)
Generate an S curve dataset.
Parameters n_samples : int, optional (default=100)
The number of sample points on the S curve.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, 3]
The points.
t : array of shape [n_samples]
The univariate position of the sample according to the main dimension of the points in
the manifold.
sklearn.datasets.make_sparse_coded_signal
sklearn.datasets.make_sparse_coded_signal(n_samples, n_components, n_features,
n_nonzero_coefs, random_state=None)
Generate a signal as a sparse combination of dictionary elements.
Returns a matrix Y = DX, such that D is (n_features, n_components), X is (n_components, n_samples) and each
column of X has exactly n_nonzero_coefs non-zero elements.
Parameters n_samples : int
number of samples to generate
n_components: int, :
number of components in the dictionary
n_features : int
number of features of the dataset to generate
n_nonzero_coefs : int
number of active (non-zero) coefficients in each sample
random_state: int or RandomState instance, optional (default=None) :
seed used by the pseudo random number generator
Returns data: array of shape [n_features, n_samples] :
The encoded signal (Y).
dictionary: array of shape [n_features, n_components] :
The dictionary with normalized components (D).
code: array of shape [n_components, n_samples] :
The sparse code such that each column of this matrix has exactly n_nonzero_coefs non-zero items (X).
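Examples
A sketch checking the documented shapes and the relation Y = DX:
>>> import numpy as np
>>> from sklearn.datasets import make_sparse_coded_signal
>>> Y, D, X = make_sparse_coded_signal(n_samples=5, n_components=8, n_features=10,
...                                    n_nonzero_coefs=3, random_state=0)
>>> Y.shape, D.shape, X.shape
((10, 5), (10, 8), (8, 5))
>>> np.allclose(Y, np.dot(D, X))
True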
sklearn.datasets.make_sparse_spd_matrix
sklearn.datasets.make_sparse_spd_matrix(dim=1, alpha=0.95, norm_diag=False,
smallest_coef=0.1, largest_coef=0.9, random_state=None)
Generate a sparse symmetric positive definite matrix.
Parameters dim: integer, optional (default=1) :
The size of the random matrix to generate.
alpha: float between 0 and 1, optional (default=0.95) :
The probability that a coefficient is non zero (see notes).
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns prec: array of shape = [dim, dim] :
Notes
The sparsity is actually imposed on the Cholesky factor of the matrix. Thus alpha does not translate directly into
the filling fraction of the matrix itself.
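Examples
A sketch checking that the generated matrix is symmetric with strictly positive eigenvalues (values illustrative):
>>> import numpy as np
>>> from sklearn.datasets import make_sparse_spd_matrix
>>> prec = make_sparse_spd_matrix(dim=5, alpha=0.9, random_state=0)
>>> bool(np.allclose(prec, prec.T))
True
>>> bool(np.all(np.linalg.eigvalsh(prec) > 0))
True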
sklearn.datasets.make_sparse_uncorrelated
sklearn.datasets.make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=None)
Generate a random regression problem with sparse uncorrelated design
This dataset is described in Celeux et al. [1] as:
X ~ N(0, 1)
y(X) = X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3]
Only the first 4 features are informative. The remaining features are useless.
Parameters n_samples : int, optional (default=100)
The number of samples.
n_features : int, optional (default=10)
The number of features.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, n_features]
The input samples.
y : array of shape [n_samples]
The output values.
References
[R55]
sklearn.datasets.make_spd_matrix
sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
Generate a random symmetric, positive-definite matrix.
Parameters n_dim : int
The matrix dimension.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_dim, n_dim]
The random symmetric, positive-definite matrix.
sklearn.datasets.make_swiss_roll
sklearn.datasets.make_swiss_roll(n_samples=100, noise=0.0, random_state=None)
Generate a swiss roll dataset.
Parameters n_samples : int, optional (default=100)
The number of sample points on the Swiss Roll.
noise : float, optional (default=0.0)
The standard deviation of the gaussian noise.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
Returns X : array of shape [n_samples, 3]
The points.
t : array of shape [n_samples]
The univariate position of the sample according to the main dimension of the points in
the manifold.
Notes
The algorithm is from Marsland [1].
References
[R56]
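Examples
A minimal usage sketch; X holds the 3d points and t the position along the roll (values illustrative):
>>> from sklearn.datasets import make_swiss_roll
>>> X, t = make_swiss_roll(n_samples=300, noise=0.05, random_state=0)
>>> X.shape
(300, 3)
>>> t.shape
(300,)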
1.8.5 sklearn.decomposition: Matrix Decomposition
The sklearn.decomposition module includes matrix decomposition algorithms, including among others PCA,
NMF or ICA. Most of the algorithms of this module can be regarded as dimensionality reduction techniques.
User guide: See the Decomposing signals in components (matrix factorization problems) section for further details.
decomposition.PCA([n_components, copy, whiten]) Principal component analysis (PCA)
decomposition.ProbabilisticPCA([...]) Additional layer on top of PCA that adds a probabilistic evaluation
decomposition.ProjectedGradientNMF([...]) Non-Negative matrix factorization by Projected Gradient (NMF)
decomposition.RandomizedPCA(n_components[, ...]) Principal component analysis (PCA) using randomized SVD
decomposition.KernelPCA([n_components, ...]) Kernel Principal component analysis (KPCA)
decomposition.FastICA([n_components, ...]) FastICA; a fast algorithm for Independent Component Analysis
decomposition.NMF([n_components, init, ...]) Non-Negative matrix factorization by Projected Gradient (NMF)
decomposition.SparsePCA(n_components[, ...]) Sparse Principal Components Analysis (SparsePCA)
decomposition.MiniBatchSparsePCA(n_components) Mini-batch Sparse Principal Components Analysis
decomposition.SparseCoder(dictionary[, ...]) Sparse coding
decomposition.DictionaryLearning(n_atoms[, ...]) Dictionary learning
decomposition.MiniBatchDictionaryLearning(n_atoms) Mini-batch dictionary learning
sklearn.decomposition.PCA
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)
Principal component analysis (PCA)
Linear dimensionality reduction using Singular Value Decomposition of the data and keeping only the most
significant singular vectors to project the data to a lower dimensional space.
This implementation uses the scipy.linalg implementation of the singular value decomposition. It only works
for dense arrays and is not scalable to large dimensional data.
The time complexity of this implementation is O(n ** 3) assuming n ~ n_samples ~ n_features.
Parameters n_components : int, None or string
Number of components to keep. if n_components is not set all components are kept:
n_components == min(n_samples, n_features)
if n_components == 'mle', Minka's MLE is used to guess the dimension if 0 <
n_components < 1, select the number of components such that the amount of variance
that needs to be explained is greater than the percentage specified by n_components
copy : bool
If False, data passed to fit are overwritten
whiten : bool, optional
When True (False by default) the components_ vectors are divided by n_samples times
singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative
variance scales of the components) but can sometimes improve the predictive accuracy
of the downstream estimators by making their data respect some hard-wired assumptions.
See Also:
ProbabilisticPCA, RandomizedPCA, KernelPCA, SparsePCA
Notes
For n_components='mle', this class uses the method of Thomas P. Minka: Automatic Choice of Dimensionality
for PCA. NIPS 2000: 598-604
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this
implementation, running fit twice on the same matrix can lead to principal components with signs flipped
(change in direction). For this reason, it is important to always use the same estimator object to transform data
in a consistent fashion.
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
Attributes
components_ : array, [n_components, n_features]
Components with maximum variance.
explained_variance_ratio_ : array, [n_components]
Percentage of variance explained by each of the selected components. If n_components is not set
then all components are stored and the sum of explained variances is equal to 1.0
Methods
fit(X[, y]) Fit the model with X.
fit_transform(X[, y]) Fit the model with X and apply the dimensionality reduction on X.
get_params([deep]) Get parameters for the estimator
inverse_transform(X) Transform data back to its original space, i.e.,
set_params(**params) Set the parameters of the estimator.
transform(X) Apply the dimensionality reduction on X.
__init__(n_components=None, copy=True, whiten=False)
fit(X, y=None, **params)
Fit the model with X.
Parameters X: array-like, shape (n_samples, n_features) :
Training data, where n_samples is the number of samples and n_features is the number
of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **params)
Fit the model with X and apply the dimensionality reduction on X.
Parameters X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number
of features.
Returns X_new : array-like, shape (n_samples, n_components)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X)
Transform data back to its original space, i.e., return an input X_original whose transform would be X
Parameters X : array-like, shape (n_samples, n_components)
New data, where n_samples is the number of samples and n_components is the number
of components.
Returns X_original array-like, shape (n_samples, n_features) :
Notes
If whitening is enabled, inverse_transform does not compute the exact inverse operation as transform.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Apply the dimensionality reduction on X.
Parameters X : array-like, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of
features.
Returns X_new : array-like, shape (n_samples, n_components)
sklearn.decomposition.ProbabilisticPCA
class sklearn.decomposition.ProbabilisticPCA(n_components=None, copy=True,
whiten=False)
Additional layer on top of PCA that adds a probabilistic evaluation
Principal component analysis (PCA)
Linear dimensionality reduction using Singular Value Decomposition of the data and keeping only the most
significant singular vectors to project the data to a lower dimensional space.
This implementation uses the scipy.linalg implementation of the singular value decomposition. It only works
for dense arrays and is not scalable to large dimensional data.
The time complexity of this implementation is O(n ** 3) assuming n ~ n_samples ~ n_features.
Parameters n_components : int, None or string
Number of components to keep. if n_components is not set all components are kept:
n_components == min(n_samples, n_features)
if n_components == 'mle', Minka's MLE is used to guess the dimension if 0 <
n_components < 1, select the number of components such that the amount of variance
that needs to be explained is greater than the percentage specified by n_components
copy : bool
If False, data passed to fit are overwritten
whiten : bool, optional
When True (False by default) the components_ vectors are divided by n_samples times
singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative
variance scales of the components) but can sometimes improve the predictive accuracy
of the downstream estimators by making their data respect some hard-wired assumptions.
See Also:
ProbabilisticPCA, RandomizedPCA, KernelPCA, SparsePCA
Notes
For n_components='mle', this class uses the method of Thomas P. Minka: Automatic Choice of Dimensionality
for PCA. NIPS 2000: 598-604
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this
implementation, running fit twice on the same matrix can lead to principal components with signs flipped
(change in direction). For this reason, it is important to always use the same estimator object to transform data
in a consistent fashion.
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
Attributes
components_ : array, [n_components, n_features]
Components with maximum variance.
explained_variance_ratio_ : array, [n_components]
Percentage of variance explained by each of the selected components. If n_components is not set
then all components are stored and the sum of explained variances is equal to 1.0
Methods
fit(X[, y, homoscedastic]) Additionally to PCA.fit, learns a covariance model
fit_transform(X[, y]) Fit the model with X and apply the dimensionality reduction on X.
get_params([deep]) Get parameters for the estimator
inverse_transform(X) Transform data back to its original space, i.e.,
score(X[, y]) Return a score associated to new data
set_params(**params) Set the parameters of the estimator.
transform(X) Apply the dimensionality reduction on X.
__init__(n_components=None, copy=True, whiten=False)
fit(X, y=None, homoscedastic=True)
Additionally to PCA.fit, learns a covariance model
Parameters X : array of shape(n_samples, n_dim)
The data to fit
homoscedastic : bool, optional,
If True, average variance across remaining dimensions
fit_transform(X, y=None, **params)
Fit the model with X and apply the dimensionality reduction on X.
Parameters X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number
of features.
Returns X_new : array-like, shape (n_samples, n_components)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X)
Transform data back to its original space, i.e., return an input X_original whose transform would be X
Parameters X : array-like, shape (n_samples, n_components)
New data, where n_samples is the number of samples and n_components is the number
of components.
Returns X_original array-like, shape (n_samples, n_features) :
Notes
If whitening is enabled, inverse_transform does not compute the exact inverse operation as transform.
score(X, y=None)
Return a score associated to new data
Parameters X: array of shape(n_samples, n_dim) :
The data to test
Returns ll: array of shape (n_samples), :
log-likelihood of each row of X under the current model
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Apply the dimensionality reduction on X.
Parameters X : array-like, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of
features.
Returns X_new : array-like, shape (n_samples, n_components)
sklearn.decomposition.ProjectedGradientNMF
class sklearn.decomposition.ProjectedGradientNMF(n_components=None, init='nndsvdar',
sparseness=None, beta=1, eta=0.1,
tol=0.0001, max_iter=200,
nls_max_iter=2000)
Non-Negative matrix factorization by Projected Gradient (NMF)
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data the model will be fit to.
n_components: int or None :
Number of components, if n_components is not set all components are kept
init: 'nndsvd' | 'nndsvda' | 'nndsvdar' | int | RandomState :
Method used to initialize the procedure. Default: 'nndsvdar' Valid options:
nndsvd: Nonnegative Double Singular Value Decomposition (NNDSVD)
initialization (better for sparseness)
nndsvda: NNDSVD with zeros filled with the average of X
(better when sparsity is not desired)
nndsvdar: NNDSVD with zeros filled with small random values
(generally faster, less accurate alternative to NNDSVDa
for when sparsity is not desired)
int seed or RandomState: non-negative random matrices
sparseness: data | components | None, default: None :
Where to enforce sparsity in the model.
beta: double, default: 1 :
Degree of sparseness, if sparseness is not None. Larger values mean more sparseness.
eta: double, default: 0.1 :
Degree of correctness to maintain, if sparsity is not None. Smaller values mean larger
error.
tol: double, default: 1e-4 :
Tolerance value used in stopping conditions.
max_iter: int, default: 200 :
Number of iterations to compute.
nls_max_iter: int, default: 2000 :
Number of iterations in NLS subproblem.
Notes
This implements
C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19(2007),
2756-2779. http://www.csie.ntu.edu.tw/~cjlin/nmf/
P. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Re-
search 2004.
NNDSVD is introduced in
C. Boutsidis, E. Gallopoulos: SVD based initialization: A head start for nonnegative matrix factorization -
Pattern Recognition, 2008 http://www.cs.rpi.edu/~boutsc/files/nndsvd.pdf
Examples
>>> import numpy as np
>>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import ProjectedGradientNMF
>>> model = ProjectedGradientNMF(n_components=2, init=0)
>>> model.fit(X)
ProjectedGradientNMF(beta=1, eta=0.1, init=0, max_iter=200, n_components=2,
nls_max_iter=2000, sparseness=None, tol=0.0001)
>>> model.components_
array([[ 0.77032744, 0.11118662],
[ 0.38526873, 0.38228063]])
>>> model.reconstruction_err_
0.00746...
>>> model = ProjectedGradientNMF(n_components=2, init=0,
...                              sparseness='components')
>>> model.fit(X)
ProjectedGradientNMF(beta=1, eta=0.1, init=0, max_iter=200, n_components=2,
nls_max_iter=2000, sparseness='components', tol=0.0001)
>>> model.components_
array([[ 1.67481991, 0.29614922],
[-0. , 0.4681982 ]])
>>> model.reconstruction_err_
0.513...
Attributes
components_ : array, [n_components, n_features]
Non-negative components of the data
reconstruction_err_ : number
Frobenius norm of the matrix difference between the training data and the reconstructed data
from the fit produced by the model. || X - WH ||_2
Not computed for sparse input matrices because it is too expensive in terms of memory.
Methods
fit(X[, y]) Learn a NMF model for the data X.
fit_transform(X[, y]) Learn a NMF model for the data X and returns the transformed data.
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Transform the data X according to the fitted NMF model
__init__(n_components=None, init='nndsvdar', sparseness=None, beta=1, eta=0.1, tol=0.0001,
max_iter=200, nls_max_iter=2000)
fit(X, y=None, **params)
Learn a NMF model for the data X.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be decomposed
Returns self :
fit_transform(X, y=None)
Learn a NMF model for the data X and returns the transformed data.
This is more efficient than calling fit followed by transform.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be decomposed
Returns data: array, [n_samples, n_components] :
Transformed data
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Transform the data X according to the fitted NMF model
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be transformed by the model
Returns data: array, [n_samples, n_components] :
Transformed data
sklearn.decomposition.RandomizedPCA
class sklearn.decomposition.RandomizedPCA(n_components, copy=True, iterated_power=3,
whiten=False, random_state=None)
Principal component analysis (PCA) using randomized SVD
Linear dimensionality reduction using approximated Singular Value Decomposition of the data and keeping
only the most significant singular vectors to project the data to a lower dimensional space.
This implementation uses a randomized SVD implementation and can handle both scipy.sparse and numpy
dense arrays as input.
Parameters n_components : int
Maximum number of components to keep: default is 50.
copy : bool
If False, data passed to fit are overwritten
iterated_power : int, optional
Number of iteration for the power method. 3 by default.
whiten : bool, optional
When True (False by default) the components_ vectors are divided by the singular values
to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative
variance scales of the components) but can sometimes improve the predictive accuracy
of the downstream estimators by making their data respect some hard-wired assumptions.
random_state : int or RandomState instance or None (default)
Pseudo Random Number generator seed control. If None, use the numpy.random sin-
gleton.
See Also:
PCA, ProbabilisticPCA
References
[Halko2009], [MRT]
Examples
>>> import numpy as np
>>> from sklearn.decomposition import RandomizedPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = RandomizedPCA(n_components=2)
>>> pca.fit(X)
RandomizedPCA(copy=True, iterated_power=3, n_components=2,
random_state=<mtrand.RandomState object at 0x...>, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
Attributes
components_ : array, [n_components, n_features]
Components with maximum variance.
explained_variance_ratio_ : array, [n_components]
Percentage of variance explained by each of the selected components. If n_components is not set
then all components are stored and the sum of explained variances is equal to 1.0
Methods
fit(X[, y]) Fit the model to the data X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
inverse_transform(X) Transform data back to its original space, i.e.,
set_params(**params) Set the parameters of the estimator.
transform(X) Apply the dimensionality reduction on X.
__init__(n_components, copy=True, iterated_power=3, whiten=False, random_state=None)
fit(X, y=None)
Fit the model to the data X.
Parameters X: array-like or scipy.sparse matrix, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X)
Transform data back to its original space, i.e., return an input X_original whose transform would be X
Parameters X : array-like or scipy.sparse matrix, shape (n_samples, n_components)
New data, where n_samples is the number of samples and n_components is the number
of components.
Returns X_original array-like, shape (n_samples, n_features) :
Notes
If whitening is enabled, inverse_transform does not compute the exact inverse operation as transform.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Apply the dimensionality reduction on X.
Parameters X : array-like or scipy.sparse matrix, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of
features.
Returns X_new : array-like, shape (n_samples, n_components)
sklearn.decomposition.KernelPCA
class sklearn.decomposition.KernelPCA(n_components=None, kernel='linear',
gamma=0, degree=3, coef0=1, alpha=1.0,
fit_inverse_transform=False, eigen_solver='auto',
tol=0, max_iter=None)
Kernel Principal component analysis (KPCA)
Non-linear dimensionality reduction through the use of kernels.
Parameters n_components: int or None :
Number of components. If None, all non-zero components are kept.
kernel: 'linear' | 'poly' | 'rbf' | 'sigmoid' | 'precomputed' :
Kernel. Default: 'linear'
degree : int, optional
Degree for poly, rbf and sigmoid kernels. Default: 3.
gamma : float, optional
Kernel coefficient for rbf and poly kernels. Default: 1/n_features.
coef0 : float, optional
Independent term in poly and sigmoid kernels.
alpha: int :
Hyperparameter of the ridge regression that learns the inverse transform (when
fit_inverse_transform=True). Default: 1.0
fit_inverse_transform: bool :
Learn the inverse transform for non-precomputed kernels. (i.e. learn to find the pre-
image of a point) Default: False
eigen_solver: string ['auto'|'dense'|'arpack'] :
Select eigensolver to use. If n_components is much less than the number of training
samples, arpack may be more efficient than the dense eigensolver.
tol: float :
convergence tolerance for arpack. Default: 0 (optimal value will be chosen by arpack)
max_iter : int
maximum number of iterations for arpack Default: None (optimal value will be chosen
by arpack)
References
Kernel PCA was introduced in: Bernhard Schoelkopf, Alexander J. Smola, and Klaus-Robert Mueller. 1999.
Kernel principal component analysis. In Advances in kernel methods, MIT Press, Cambridge, MA, USA
327-352.
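Examples
A minimal illustrative sketch; the data, kernel and gamma value below are arbitrary choices:
>>> import numpy as np
>>> from sklearn.decomposition import KernelPCA
>>> X = np.array([[-1., -1.], [-2., -1.], [-3., -2.], [1., 1.], [2., 1.], [3., 2.]])
>>> kpca = KernelPCA(n_components=2, kernel='rbf', gamma=1.0)
>>> X_kpca = kpca.fit_transform(X)
>>> X_kpca.shape
(6, 2)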
Attributes
lambdas_, alphas_: Eigenvalues and eigenvectors of the centered kernel matrix
dual_coef_: Inverse transform matrix
X_transformed_fit_: Projection of the fitted data on the kernel principal components
Methods
fit(X[, y]) Fit the model from data in X.
fit_transform(X[, y]) Fit the model from data in X and transform X.
get_params([deep]) Get parameters for the estimator
inverse_transform(X) Transform X back to original space.
set_params(**params) Set the parameters of the estimator.
transform(X) Transform X.
__init__(n_components=None, kernel='linear', gamma=0, degree=3, coef0=1, alpha=1.0,
fit_inverse_transform=False, eigen_solver='auto', tol=0, max_iter=None)
fit(X, y=None)
Fit the model from data in X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **params)
Fit the model from data in X and transform X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns X_new: array-like, shape (n_samples, n_components) :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X)
Transform X back to original space.
Parameters X: array-like, shape (n_samples, n_components) :
Returns X_new: array-like, shape (n_samples, n_features) :
References
Learning to Find Pre-Images, G BakIr et al, 2004.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Transform X.
Parameters X: array-like, shape (n_samples, n_features) :
Returns X_new: array-like, shape (n_samples, n_components) :
sklearn.decomposition.FastICA
class sklearn.decomposition.FastICA(n_components=None, algorithm='parallel', whiten=True,
fun='logcosh', fun_prime='', fun_args=None,
max_iter=200, tol=0.0001, w_init=None,
random_state=None)
FastICA; a fast algorithm for Independent Component Analysis
Parameters n_components : int, optional
Number of components to use. If none is passed, all are used.
algorithm : {'parallel', 'deflation'}
Apply parallel or deflational algorithm for FastICA
whiten : boolean, optional
If whiten is false, the data is already considered to be whitened, and no whitening is
performed.
fun : {'logcosh', 'exp', or 'cube'}, or a callable
The non-linear function used in the FastICA loop to approximate negentropy. If a
function is passed, its derivative should be passed as the fun_prime argument.
fun_prime : None or a callable
The derivative of the non-linearity used.
max_iter : int, optional
Maximum number of iterations during fit
tol : float, optional
Tolerance on update at each iteration
w_init : None or an (n_components, n_components) ndarray
The mixing matrix to be used to initialize the algorithm.
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
Notes
Implementation based on A. Hyvarinen and E. Oja, Independent Component Analysis: Algorithms and Appli-
cations, Neural Networks, 13(4-5), 2000, pp. 411-430
Attributes
components_ : 2D array, [n_components, n_features]
The unmixing matrix
sources_ : 2D array, [n_samples, n_components]
The estimated latent sources of the data.
Methods
fit(X)
get_mixing_matrix() Compute the mixing matrix
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Apply un-mixing matrix W to X to recover the sources
__init__(n_components=None, algorithm='parallel', whiten=True, fun='logcosh', fun_prime='',
fun_args=None, max_iter=200, tol=0.0001, w_init=None, random_state=None)
get_mixing_matrix()
Compute the mixing matrix
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Apply un-mixing matrix W to X to recover the sources
S = X * W.T
unmixing_matrix_
DEPRECATED: Renamed to components_
sklearn.decomposition.NMF
class sklearn.decomposition.NMF(n_components=None, init='nndsvdar', sparseness=None, beta=1,
eta=0.1, tol=0.0001, max_iter=200, nls_max_iter=2000)
Non-Negative matrix factorization by Projected Gradient (NMF)
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data the model will be fit to.
n_components: int or None :
Number of components, if n_components is not set all components are kept
init: 'nndsvd' | 'nndsvda' | 'nndsvdar' | int | RandomState :
Method used to initialize the procedure. Default: 'nndsvdar' Valid options:
nndsvd: Nonnegative Double Singular Value Decomposition (NNDSVD)
initialization (better for sparseness)
nndsvda: NNDSVD with zeros filled with the average of X
(better when sparsity is not desired)
nndsvdar: NNDSVD with zeros filled with small random values
(generally faster, less accurate alternative to NNDSVDa
for when sparsity is not desired)
int seed or RandomState: non-negative random matrices
sparseness: data | components | None, default: None :
Where to enforce sparsity in the model.
beta: double, default: 1 :
Degree of sparseness, if sparseness is not None. Larger values mean more sparseness.
eta: double, default: 0.1 :
Degree of correctness to maintain, if sparsity is not None. Smaller values mean larger
error.
tol: double, default: 1e-4 :
Tolerance value used in stopping conditions.
max_iter: int, default: 200 :
Number of iterations to compute.
nls_max_iter: int, default: 2000 :
Number of iterations in NLS subproblem.
Notes
This implements
C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19(2007),
2756-2779. http://www.csie.ntu.edu.tw/~cjlin/nmf/
P. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Re-
search 2004.
NNDSVD is introduced in
C. Boutsidis, E. Gallopoulos: SVD based initialization: A head start for nonnegative matrix factorization -
Pattern Recognition, 2008 http://www.cs.rpi.edu/~boutsc/files/nndsvd.pdf
Examples
>>> import numpy as np
>>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import ProjectedGradientNMF
>>> model = ProjectedGradientNMF(n_components=2, init=0)
>>> model.fit(X)
ProjectedGradientNMF(beta=1, eta=0.1, init=0, max_iter=200, n_components=2,
nls_max_iter=2000, sparseness=None, tol=0.0001)
>>> model.components_
array([[ 0.77032744, 0.11118662],
[ 0.38526873, 0.38228063]])
>>> model.reconstruction_err_
0.00746...
>>> model = ProjectedGradientNMF(n_components=2, init=0,
...                              sparseness='components')
>>> model.fit(X)
ProjectedGradientNMF(beta=1, eta=0.1, init=0, max_iter=200, n_components=2,
nls_max_iter=2000, sparseness='components', tol=0.0001)
>>> model.components_
array([[ 1.67481991, 0.29614922],
[-0. , 0.4681982 ]])
>>> model.reconstruction_err_
0.513...
Attributes
components_ : array, [n_components, n_features]
Non-negative components of the data
reconstruction_err_ : number
Frobenius norm of the matrix difference between the training data and the reconstructed data
from the fit produced by the model. || X - WH ||_2
Not computed for sparse input matrices because it is too expensive in terms of memory.
Methods
fit(X[, y]) Learn a NMF model for the data X.
fit_transform(X[, y]) Learn a NMF model for the data X and returns the transformed data.
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Transform the data X according to the fitted NMF model
__init__(n_components=None, init='nndsvdar', sparseness=None, beta=1, eta=0.1, tol=0.0001,
max_iter=200, nls_max_iter=2000)
fit(X, y=None, **params)
Learn a NMF model for the data X.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be decomposed
Returns self :
fit_transform(X, y=None)
Learn a NMF model for the data X and returns the transformed data.
This is more efficient than calling fit followed by transform.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be decomposed
Returns data: array, [n_samples, n_components] :
Transformed data
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Transform the data X according to the fitted NMF model
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data matrix to be transformed by the model
Returns data: array, [n_samples, n_components] :
Transformed data
sklearn.decomposition.SparsePCA
class sklearn.decomposition.SparsePCA(n_components, alpha=1, ridge_alpha=0.01,
max_iter=1000, tol=1e-08, method='lars', n_jobs=1,
U_init=None, V_init=None, verbose=False,
random_state=None)
Sparse Principal Components Analysis (SparsePCA)
Finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is
controllable by the coefficient of the L1 penalty, given by the parameter alpha.
Parameters n_components : int,
Number of sparse atoms to extract.
alpha : float,
Sparsity controlling parameter. Higher values lead to sparser components.
ridge_alpha : float,
Amount of ridge shrinkage to apply in order to improve conditioning when calling the
transform method.
max_iter : int,
Maximum number of iterations to perform.
tol : float,
Tolerance for the stopping condition.
method : {'lars', 'cd'}
lars: uses the least angle regression method to solve the lasso problem
(linear_model.lars_path) cd: uses the coordinate descent method to compute the Lasso
solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
n_jobs : int,
Number of parallel jobs to run.
U_init : array of shape (n_samples, n_atoms),
Initial values for the loadings for warm restart scenarios.
V_init : array of shape (n_atoms, n_features),
Initial values for the components for warm restart scenarios.
verbose : :
Degree of verbosity of the printed output.
random_state : int or RandomState
Pseudo number generator state used for random sampling.
See Also:
PCA, MiniBatchSparsePCA, DictionaryLearning
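Examples
A minimal illustrative sketch; the data and parameter values below are arbitrary:
>>> import numpy as np
>>> from sklearn.decomposition import SparsePCA
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(30, 8)
>>> spca = SparsePCA(n_components=3, alpha=1, random_state=0)
>>> X_new = spca.fit_transform(X)
>>> X_new.shape
(30, 3)
>>> spca.components_.shape
(3, 8)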
Attributes
components_ array, [n_components, n_features] Sparse components extracted from the data.
error_ array Vector of errors at each iteration.
Methods
fit(X[, y]) Fit the model from data in X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, ridge_alpha]) Least Squares projection of the data onto the sparse components.
__init__(n_components, alpha=1, ridge_alpha=0.01, max_iter=1000, tol=1e-08, method='lars',
n_jobs=1, U_init=None, V_init=None, verbose=False, random_state=None)
fit(X, y=None)
Fit the model from data in X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, ridge_alpha=None)
Least Squares projection of the data onto the sparse components.
To avoid instability issues in case the system is under-determined, regularization can be applied (Ridge
regression) via the ridge_alpha parameter.
Note that Sparse PCA components orthogonality is not enforced as in PCA hence one cannot use a simple
linear projection.
Parameters X: array of shape (n_samples, n_features) :
Test data to be transformed, must have the same number of features as the data used to
train the model.
ridge_alpha: float, default: 0.01 :
Amount of ridge shrinkage to apply in order to improve conditioning.
Returns X_new array, shape (n_samples, n_components) :
Transformed data.
sklearn.decomposition.MiniBatchSparsePCA
class sklearn.decomposition.MiniBatchSparsePCA(n_components, alpha=1,
ridge_alpha=0.01, n_iter=100, callback=None,
chunk_size=3, verbose=False, shuffle=True,
n_jobs=1, method='lars', random_state=None)
Mini-batch Sparse Principal Components Analysis
Finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is
controllable by the coefficient of the L1 penalty, given by the parameter alpha.
Parameters n_components : int,
number of sparse atoms to extract
alpha : int,
Sparsity controlling parameter. Higher values lead to sparser components.
ridge_alpha : float,
Amount of ridge shrinkage to apply in order to improve conditioning when calling the
transform method.
n_iter : int,
number of iterations to perform for each mini batch
callback : callable,
callable that gets invoked every five iterations
chunk_size : int,
the number of features to take in each mini batch
verbose : :
degree of output the procedure will print
shuffle : boolean,
whether to shuffle the data before splitting it in batches
n_jobs : int,
number of parallel jobs to run, or -1 to autodetect.
method : {'lars', 'cd'}
lars: uses the least angle regression method to solve the lasso problem
(linear_model.lars_path) cd: uses the coordinate descent method to compute the Lasso
solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
random_state : int or RandomState
Pseudo number generator state used for random sampling.
See Also:
PCA, SparsePCA, DictionaryLearning
Attributes
components_ array, [n_components, n_features] Sparse components extracted from the data.
error_ array Vector of errors at each iteration.
Methods
fit(X[, y]) Fit the model from data in X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, ridge_alpha]) Least Squares projection of the data onto the sparse components.
__init__(n_components, alpha=1, ridge_alpha=0.01, n_iter=100, callback=None, chunk_size=3,
verbose=False, shuffle=True, n_jobs=1, method='lars', random_state=None)
fit(X, y=None)
Fit the model from data in X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that its possible to update each component
of a nested object.
Returns self :
transform(X, ridge_alpha=None)
Least Squares projection of the data onto the sparse components.
To avoid instability issues in case the system is under-determined, regularization can be applied (Ridge
regression) via the ridge_alpha parameter.
Note that the Sparse PCA components' orthogonality is not enforced as in PCA, and hence one cannot use a simple linear projection.
Parameters X: array of shape (n_samples, n_features) :
Test data to be transformed, must have the same number of features as the data used to
train the model.
ridge_alpha: float, default: 0.01 :
Amount of ridge shrinkage to apply in order to improve conditioning.
Returns X_new array, shape (n_samples, n_components) :
Transformed data.
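The following minimal sketch (not part of the original reference) illustrates the typical fit/transform cycle; the data and settings are purely illustrative, and the parameter names follow the signature documented above for this release.
import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                       # 100 samples, 20 features

# alpha controls the sparsity of the components; chunk_size is the
# mini-batch size, as documented above.
estimator = MiniBatchSparsePCA(n_components=5, alpha=1, chunk_size=3,
                               random_state=0)
estimator.fit(X)
X_new = estimator.transform(X)               # projection onto the 5 sparse components
print(estimator.components_.shape)           # (5, 20)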
sklearn.decomposition.SparseCoder
class sklearn.decomposition.SparseCoder(dictionary, transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, split_sign=False, n_jobs=1)
Sparse coding
Finds a sparse representation of data against a fixed, precomputed dictionary.
Each row of the result is the solution to a sparse coding problem. The goal is to find a sparse array code such that:
X ~= code * dictionary
Parameters dictionary : array, [n_atoms, n_features]
The dictionary atoms used for sparse coding. Lines are assumed to be normalized to
unit norm.
transform_algorithm : {'lasso_lars', 'lasso_cd', 'lars', 'omp', 'threshold'}
Algorithm used to transform the data:
lars: uses the least angle regression method (linear_model.lars_path)
lasso_lars: uses Lars to compute the Lasso solution
lasso_cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). lasso_lars will be faster if the estimated components are sparse.
omp: uses orthogonal matching pursuit to estimate the sparse solution
threshold: squashes to zero all coefficients less than alpha from the projection dictionary * X
transform_n_nonzero_coefs : int, 0.1 * n_features by default
Number of nonzero coefficients to target in each column of the solution. This is only used by algorithm='lars' and algorithm='omp' and is overridden by alpha in the omp case.
transform_alpha : float, 1. by default
If algorithm='lasso_lars' or algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If algorithm='threshold', alpha is the absolute value of the threshold below which coefficients will be squashed to zero. If algorithm='omp', alpha is the tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides n_nonzero_coefs.
split_sign : bool, False by default
Whether to split the sparse feature vector into the concatenation of its negative part and its positive part. This can improve the performance of downstream classifiers.
n_jobs : int,
number of parallel jobs to run
See Also:
DictionaryLearning, MiniBatchDictionaryLearning, SparsePCA,
MiniBatchSparsePCA, sparse_encode
Attributes
components_ array, [n_atoms, n_features] The unchanged dictionary atoms
Methods
fit(X[, y]) Do nothing and return the estimator unchanged
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Encode the data as a sparse combination of the dictionary atoms.
__init__(dictionary, transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, split_sign=False, n_jobs=1)
fit(X, y=None)
Do nothing and return the estimator unchanged
This method is just there to implement the usual API and hence work in pipelines.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Encode the data as a sparse combination of the dictionary atoms.
Coding method is determined by the object parameter transform_algorithm.
Parameters X : array of shape (n_samples, n_features)
Test data to be transformed, must have the same number of features as the data used to
train the model.
Returns X_new : array, shape (n_samples, n_components)
Transformed data
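A brief usage sketch (not from the original guide): encoding data against a fixed, row-normalized dictionary with orthogonal matching pursuit. The dictionary and data are random and purely illustrative.
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.RandomState(0)
dictionary = rng.randn(15, 20)                          # 15 atoms, 20 features
dictionary /= np.sqrt((dictionary ** 2).sum(axis=1))[:, np.newaxis]

X = rng.randn(10, 20)                                   # 10 samples to encode
coder = SparseCoder(dictionary, transform_algorithm='omp',
                    transform_n_nonzero_coefs=3)
code = coder.transform(X)                               # shape (10, 15)
print(np.mean(code != 0))                               # fraction of nonzero coefficients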
sklearn.decomposition.DictionaryLearning
class sklearn.decomposition.DictionaryLearning(n_atoms, alpha=1, max_iter=1000, tol=1e-08, fit_algorithm='lars', transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, n_jobs=1, code_init=None, dict_init=None, verbose=False, split_sign=False, random_state=None)
Dictionary learning
Finds a dictionary (a set of atoms) that can best be used to represent data using a sparse code.
Solves the optimization problem:
(U^*, V^*) = argmin 0.5 || Y - U V ||_2^2 + alpha * || U ||_1
             (U,V)
with || V_k ||_2 = 1 for all 0 <= k < n_atoms
Parameters n_atoms : int,
number of dictionary elements to extract
alpha : int,
sparsity controlling parameter
max_iter : int,
maximum number of iterations to perform
tol : float,
tolerance for numerical error
fit_algorithm : {'lars', 'cd'}
lars: uses the least angle regression method to solve the lasso problem (linear_model.lars_path); cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
transform_algorithm : {'lasso_lars', 'lasso_cd', 'lars', 'omp', 'threshold'}
Algorithm used to transform the data:
lars: uses the least angle regression method (linear_model.lars_path)
lasso_lars: uses Lars to compute the Lasso solution
lasso_cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). lasso_lars will be faster if the estimated components are sparse.
omp: uses orthogonal matching pursuit to estimate the sparse solution
threshold: squashes to zero all coefficients less than alpha from the projection dictionary * X
transform_n_nonzero_coefs : int, 0.1 * n_features by default
Number of nonzero coefficients to target in each column of the solution. This is only used by algorithm='lars' and algorithm='omp' and is overridden by alpha in the omp case.
transform_alpha : float, 1. by default
If algorithm='lasso_lars' or algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If algorithm='threshold', alpha is the absolute value of the threshold below which coefficients will be squashed to zero. If algorithm='omp', alpha is the tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides n_nonzero_coefs.
split_sign : bool, False by default
Whether to split the sparse feature vector into the concatenation of its negative part and its positive part. This can improve the performance of downstream classifiers.
n_jobs : int,
number of parallel jobs to run
code_init : array of shape (n_samples, n_atoms),
initial value for the code, for warm restart
dict_init : array of shape (n_atoms, n_features),
initial values for the dictionary, for warm restart
verbose : :
degree of verbosity of the printed output
random_state : int or RandomState
Pseudo number generator state used for random sampling.
See Also:
SparseCoder, MiniBatchDictionaryLearning, SparsePCA, MiniBatchSparsePCA
Notes
References:
J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009: Online dictionary learning for sparse coding
(http://www.di.ens.fr/sierra/pdfs/icml09.pdf)
Attributes
components_ array, [n_atoms, n_features] dictionary atoms extracted from the data
error_ array vector of errors at each iteration
Methods
fit(X[, y]) Fit the model from data in X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Encode the data as a sparse combination of the dictionary atoms.
__init__(n_atoms, alpha=1, max_iter=1000, tol=1e-08, fit_algorithm='lars', transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, n_jobs=1, code_init=None, dict_init=None, verbose=False, split_sign=False, random_state=None)
fit(X, y=None)
Fit the model from data in X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns self: object :
Returns the object itself
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Encode the data as a sparse combination of the dictionary atoms.
Coding method is determined by the object parameter transform_algorithm.
Parameters X : array of shape (n_samples, n_features)
Test data to be transformed, must have the same number of features as the data used to
train the model.
Returns X_new : array, shape (n_samples, n_components)
Transformed data
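As an illustration (not part of the original reference), the sketch below learns a small dictionary from random data and encodes that data with OMP; the parameter names (n_atoms, fit_algorithm) follow the signature documented above for this release, and the sizes are illustrative.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(50, 20)

dico = DictionaryLearning(n_atoms=8, alpha=1, max_iter=100,
                          fit_algorithm='lars', transform_algorithm='omp',
                          random_state=0)
code = dico.fit(X).transform(X)              # code has shape (50, 8)
print(dico.components_.shape)                # (8, 20): the learned dictionary V
X_approx = np.dot(code, dico.components_)    # sparse reconstruction of X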
sklearn.decomposition.MiniBatchDictionaryLearning
class sklearn.decomposition.MiniBatchDictionaryLearning(n_atoms, alpha=1, n_iter=1000, fit_algorithm='lars', n_jobs=1, chunk_size=3, shuffle=True, dict_init=None, transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, verbose=False, split_sign=False, random_state=None)
Mini-batch dictionary learning
Finds a dictionary (a set of atoms) that can best be used to represent data using a sparse code.
Solves the optimization problem:
(U^*, V^*) = argmin 0.5 || Y - U V ||_2^2 + alpha * || U ||_1
             (U,V)
with || V_k ||_2 = 1 for all 0 <= k < n_atoms
Parameters n_atoms : int,
number of dictionary elements to extract
alpha : int,
sparsity controlling parameter
n_iter : int,
total number of iterations to perform
fit_algorithm : {'lars', 'cd'}
lars: uses the least angle regression method to solve the lasso problem (linear_model.lars_path); cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
transform_algorithm : {'lasso_lars', 'lasso_cd', 'lars', 'omp', 'threshold'}
Algorithm used to transform the data:
lars: uses the least angle regression method (linear_model.lars_path)
lasso_lars: uses Lars to compute the Lasso solution
lasso_cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). lasso_lars will be faster if the estimated components are sparse.
omp: uses orthogonal matching pursuit to estimate the sparse solution
threshold: squashes to zero all coefficients less than alpha from the projection dictionary * X
transform_n_nonzero_coefs : int, 0.1 * n_features by default
Number of nonzero coefficients to target in each column of the solution. This is only used by algorithm='lars' and algorithm='omp' and is overridden by alpha in the omp case.
transform_alpha : float, 1. by default
If algorithm='lasso_lars' or algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If algorithm='threshold', alpha is the absolute value of the threshold below which coefficients will be squashed to zero. If algorithm='omp', alpha is the tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides n_nonzero_coefs.
split_sign : bool, False by default
Whether to split the sparse feature vector into the concatenation of its negative part and its positive part. This can improve the performance of downstream classifiers.
n_jobs : int,
number of parallel jobs to run
dict_init : array of shape (n_atoms, n_features),
initial value of the dictionary for warm restart scenarios
verbose : :
degree of verbosity of the printed output
chunk_size : int,
number of samples in each mini-batch
shuffle : bool,
whether to shuffle the samples before forming batches
random_state : int or RandomState
Pseudo number generator state used for random sampling.
See Also:
SparseCoder, DictionaryLearning, SparsePCA, MiniBatchSparsePCA
Notes
References:
J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009: Online dictionary learning for sparse coding
(http://www.di.ens.fr/sierra/pdfs/icml09.pdf)
Attributes
components_ array, [n_atoms, n_features] components extracted from the data
Methods
fit(X[, y]) Fit the model from data in X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X[, y, iter_offset]) Updates the model using the data in X as a mini-batch.
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Encode the data as a sparse combination of the dictionary atoms.
__init__(n_atoms, alpha=1, n_iter=1000, fit_algorithm='lars', n_jobs=1, chunk_size=3, shuffle=True, dict_init=None, transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None, verbose=False, split_sign=False, random_state=None)
fit(X, y=None)
Fit the model from data in X.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns self : object
Returns the instance itself.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y=None, iter_offset=0)
Updates the model using the data in X as a mini-batch.
Parameters X: array-like, shape (n_samples, n_features) :
Training vector, where n_samples is the number of samples and n_features is the number of features.
Returns self : object
Returns the instance itself.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Encode the data as a sparse combination of the dictionary atoms.
Coding method is determined by the object parameter transform_algorithm.
Parameters X : array of shape (n_samples, n_features)
Test data to be transformed, must have the same number of features as the data used to
train the model.
Returns X_new : array, shape (n_samples, n_components)
Transformed data
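A hedged usage sketch (not from the original guide): batch fitting followed by an incremental update with partial_fit, using the parameter names documented above for this release; data and settings are illustrative.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(200, 20)

dico = MiniBatchDictionaryLearning(n_atoms=10, alpha=1, n_iter=100,
                                   chunk_size=5, random_state=0)
code = dico.fit(X).transform(X)              # code has shape (200, 10)

# The fitted dictionary can be refined with an additional mini-batch.
dico.partial_fit(X[:20])
print(dico.components_.shape)                # (10, 20)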
decomposition.fastica(X[, n_components, ...]) Perform Fast Independent Component Analysis.
decomposition.dict_learning(X, n_atoms, alpha) Solves a dictionary learning matrix factorization problem.
decomposition.dict_learning_online(X, ...[, ...]) Solves a dictionary learning matrix factorization problem online.
decomposition.sparse_encode(X, dictionary[, ...]) Sparse coding
sklearn.decomposition.fastica
sklearn.decomposition.fastica(X, n_components=None, algorithm='parallel', whiten=True, fun='logcosh', fun_prime='', fun_args={}, max_iter=200, tol=0.0001, w_init=None, random_state=None)
Perform Fast Independent Component Analysis.
Parameters X : array-like, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number
of features.
n_components : int, optional
Number of components to extract. If None no dimension reduction is performed.
algorithm : {'parallel', 'deflation'}, optional
Apply a parallel or deflational FASTICA algorithm.
whiten: boolean, optional :
If True perform an initial whitening of the data. If False, the data is assumed to have
already been preprocessed: it should be centered, normed and white. Otherwise you
will get incorrect results. In this case the parameter n_components will be ignored.
fun : string or function, optional
The functional form of the G function used in the approximation to neg-entropy. Could be either 'logcosh', 'exp', or 'cube'. You can also provide your own function but in this case, its derivative should be provided via argument fun_prime
fun_prime : empty string ('') or function, optional
See fun.
fun_args: dictionary, optional :
If empty and if fun='logcosh', fun_args will take value {'alpha' : 1.0}
max_iter: int, optional :
Maximum number of iterations to perform
tol: float, optional :
A positive scalar giving the tolerance at which the un-mixing matrix is considered to
have converged
w_init: (n_components, n_components) array, optional :
Initial un-mixing array of dimension (n.comp,n.comp). If None (default) then an array
of normal r.v.s is used
source_only: boolean, optional :
if True, only the sources matrix is returned
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
Returns K: (n_components, p) array or None. :
If whiten is True, K is the pre-whitening matrix that projects data onto the first n.comp principal components. If whiten is False, K is None.
W: (n_components, n_components) array :
estimated un-mixing matrix. The mixing matrix can be obtained by:
w = np.dot(W, K.T)
A = w.T * (w * w.T).I
S: (n_components, n) array :
estimated source matrix
Notes
The data matrix X is considered to be a linear combination of non-Gaussian (independent) components i.e. X
= AS where columns of S contain the independent components and A is a linear mixing matrix. In short ICA
attempts to un-mix the data by estimating an un-mixing matrix W where S = W K X.
This implementation was originally made for data of shape [n_features, n_samples]. Now the input is transposed
before the algorithm is applied. This makes it slightly faster for Fortran-ordered input.
Implemented using FastICA: A. Hyvarinen and E. Oja, Independent Component Analysis: Algorithms and
Applications, Neural Networks, 13(4-5), 2000, pp. 411-430
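An illustrative sketch (not part of the original reference) of un-mixing a simple two-source linear mixture with this function-level interface; see the Returns section above for the meaning and shapes of K, W and S. The sources and mixing matrix are made up for the example.
import numpy as np
from sklearn.decomposition import fastica

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]    # two independent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])              # mixing matrix
X = np.dot(S, A.T)                                   # observed mixtures, shape (2000, 2)

K, W, S_est = fastica(X, n_components=2, random_state=0)
# K is the pre-whitening matrix (None if whiten=False), W the estimated
# un-mixing matrix and S_est the estimated sources.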
sklearn.decomposition.dict_learning
sklearn.decomposition.dict_learning(X, n_atoms, alpha, max_iter=100, tol=1e-08, method='lars', n_jobs=1, dict_init=None, code_init=None, callback=None, verbose=False, random_state=None)
Solves a dictionary learning matrix factorization problem.
Finds the best dictionary and the corresponding sparse code for approximating the data matrix X by solving:
(U^*, V^*) = argmin 0.5 || X - U V ||_2^2 + alpha * || U ||_1
             (U,V)
with || V_k ||_2 = 1 for all 0 <= k < n_atoms
where V is the dictionary and U is the sparse code.
Parameters X: array of shape (n_samples, n_features) :
Data matrix.
n_atoms: int, :
Number of dictionary atoms to extract.
alpha: int, :
Sparsity controlling parameter.
max_iter: int, :
Maximum number of iterations to perform.
tol: float, :
Tolerance for the stopping condition.
method: {'lars', 'cd'} :
lars: uses the least angle regression method to solve the lasso problem (linear_model.lars_path); cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
n_jobs: int, :
Number of parallel jobs to run, or -1 to autodetect.
dict_init: array of shape (n_atoms, n_features), :
Initial value for the dictionary for warm restart scenarios.
code_init: array of shape (n_samples, n_atoms), :
Initial value for the sparse code for warm restart scenarios.
callback: :
Callable that gets invoked every five iterations.
verbose: :
Degree of output the procedure will print.
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
Returns code: array of shape (n_samples, n_atoms) :
The sparse code factor in the matrix factorization.
dictionary: array of shape (n_atoms, n_features), :
The dictionary factor in the matrix factorization.
errors: array :
Vector of errors at each iteration.
See Also:
dict_learning_online, DictionaryLearning, MiniBatchDictionaryLearning,
SparsePCA, MiniBatchSparsePCA
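A minimal sketch (not from the original guide) of this function-level interface; the array sizes and settings are illustrative.
import numpy as np
from sklearn.decomposition import dict_learning

rng = np.random.RandomState(0)
X = rng.randn(50, 20)

code, dictionary, errors = dict_learning(X, n_atoms=8, alpha=1,
                                         max_iter=50, random_state=0)
print(code.shape, dictionary.shape)          # (50, 8) (8, 20)
X_approx = np.dot(code, dictionary)          # sparse approximation of X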
sklearn.decomposition.dict_learning_online
sklearn.decomposition.dict_learning_online(X, n_atoms, alpha, n_iter=100, return_code=True, dict_init=None, callback=None, chunk_size=3, verbose=False, shuffle=True, n_jobs=1, method='lars', iter_offset=0, random_state=None)
Solves a dictionary learning matrix factorization problem online.
Finds the best dictionary and the corresponding sparse code for approximating the data matrix X by solving:
(U^*, V^*) = argmin 0.5 || X - U V ||_2^2 + alpha * || U ||_1
             (U,V)
with || V_k ||_2 = 1 for all 0 <= k < n_atoms
where V is the dictionary and U is the sparse code. This is accomplished by repeatedly iterating over mini-
batches by slicing the input data.
Parameters X: array of shape (n_samples, n_features) :
data matrix
n_atoms: int, :
number of dictionary atoms to extract
alpha: int, :
sparsity controlling parameter
n_iter: int, :
number of iterations to perform
return_code: boolean, :
whether to also return the code U or just the dictionary V
dict_init: array of shape (n_atoms, n_features), :
initial value for the dictionary for warm restart scenarios
callback: :
callable that gets invoked every five iterations
chunk_size: int, :
the number of samples to take in each batch
verbose: :
degree of output the procedure will print
shuffle: boolean, :
whether to shuffle the data before splitting it in batches
n_jobs: int, :
number of parallel jobs to run, or -1 to autodetect.
method: {'lars', 'cd'} :
lars: uses the least angle regression method to solve the lasso problem (linear_model.lars_path); cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
iter_offset: int, default 0 :
number of previous iterations completed on the dictionary used for initialization
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
Returns code: array of shape (n_samples, n_atoms), :
the sparse code (only returned if return_code=True)
dictionary: array of shape (n_atoms, n_features), :
the solutions to the dictionary learning problem
See Also:
dict_learning, DictionaryLearning, MiniBatchDictionaryLearning, SparsePCA,
MiniBatchSparsePCA
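A minimal sketch (not from the original guide): the same factorization solved online over mini-batches; sizes and settings are illustrative and follow the parameter names documented above for this release.
import numpy as np
from sklearn.decomposition import dict_learning_online

rng = np.random.RandomState(0)
X = rng.randn(200, 20)

code, dictionary = dict_learning_online(X, n_atoms=10, alpha=1, n_iter=100,
                                        chunk_size=5, random_state=0)
print(code.shape, dictionary.shape)          # (200, 10) (10, 20)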
sklearn.decomposition.sparse_encode
sklearn.decomposition.sparse_encode(X, dictionary, gram=None, cov=None, algorithm='lasso_lars', n_nonzero_coefs=None, alpha=None, copy_gram=None, copy_cov=True, init=None, max_iter=1000, n_jobs=1)
Sparse coding
Each row of the result is the solution to a sparse coding problem. The goal is to find a sparse array code such that:
X ~= code * dictionary
Parameters X: array of shape (n_samples, n_features) :
Data matrix
dictionary: array of shape (n_atoms, n_features) :
The dictionary matrix against which to solve the sparse coding of the data. Some of the
algorithms assume normalized rows for meaningful output.
gram: array, shape=(n_atoms, n_atoms) :
Precomputed Gram matrix, dictionary * dictionary
cov: array, shape=(n_atoms, n_samples) :
Precomputed covariance, dictionary * X
algorithm: {'lasso_lars', 'lasso_cd', 'lars', 'omp', 'threshold'} :
lars: uses the least angle regression method (linear_model.lars_path)
lasso_lars: uses Lars to compute the Lasso solution
lasso_cd: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso). lasso_lars will be faster if the estimated components are sparse.
omp: uses orthogonal matching pursuit to estimate the sparse solution
threshold: squashes to zero all coefficients less than alpha from the projection dictionary * X
n_nonzero_coefs: int, 0.1 * n_features by default :
Number of nonzero coefficients to target in each column of the solution. This is only used by algorithm='lars' and algorithm='omp' and is overridden by alpha in the omp case.
alpha: float, 1. by default :
If algorithm='lasso_lars' or algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If algorithm='threshold', alpha is the absolute value of the threshold below which coefficients will be squashed to zero. If algorithm='omp', alpha is the tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides n_nonzero_coefs.
init: array of shape (n_samples, n_atoms) :
Initialization value of the sparse codes. Only used if algorithm='lasso_cd'.
max_iter: int, 1000 by default :
Maximum number of iterations to perform if algorithm='lasso_cd'.
copy_cov: boolean, optional :
Whether to copy the precomputed covariance matrix; if False, it may be overwritten.
n_jobs: int, optional :
Number of parallel jobs to run.
Returns code: array of shape (n_samples, n_atoms) :
The sparse codes
See Also:
sklearn.linear_model.lars_path, sklearn.linear_model.orthogonal_mp,
sklearn.linear_model.Lasso, SparseCoder
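A brief sketch (not part of the original reference): encoding data against a fixed, row-normalized dictionary with the lasso_lars solver; the dictionary, data and alpha are illustrative.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
dictionary = rng.randn(15, 20)                          # 15 atoms, 20 features
dictionary /= np.sqrt((dictionary ** 2).sum(axis=1))[:, np.newaxis]
X = rng.randn(10, 20)

code = sparse_encode(X, dictionary, algorithm='lasso_lars', alpha=0.5)
print(code.shape)                                       # (10, 15)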
1.8.6 sklearn.ensemble: Ensemble Methods
The sklearn.ensemble module includes ensemble-based methods for classification and regression.
User guide: See the Ensemble methods section for further details.
ensemble.RandomForestClassifier([...]) A random forest classifier.
ensemble.RandomForestRegressor([...]) A random forest regressor.
ensemble.ExtraTreesClassifier([...]) An extra-trees classifier.
ensemble.ExtraTreesRegressor([n_estimators, ...]) An extra-trees regressor.
ensemble.GradientBoostingClassifier([loss, ...]) Gradient Boosting for classification.
ensemble.GradientBoostingRegressor([loss, ...]) Gradient Boosting for regression.
sklearn.ensemble.RandomForestClassifier
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
A random forest classifier.
A random forest is a meta estimator that fits a number of classical decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='gini')
The function to measure the quality of a split. Supported criteria are 'gini' for the Gini impurity and 'entropy' for the information gain. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after the split, one of the leaves would contain less than min_samples_leaf samples. Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split:
If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
DecisionTreeClassifier, ExtraTreesClassifier
References
[R59]
Attributes
feature_importances_ array, shape = [n_features] The feature importances (the higher, the more important the feature).
oob_score_ float Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ array, shape = [n_samples, n_classes] Decision function computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
The predicted class of an input sample is computed as the majority prediction of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the mean predicted class log-probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class log-probabilities of the input samples. Classes are ordered by arithmetical
order.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
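A hedged usage sketch (not from the original guide): fitting a forest on the iris dataset and reporting its training accuracy; the settings are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0)
clf.fit(iris.data, iris.target)

print(clf.score(iris.data, iris.target))     # mean accuracy on the training data
print(clf.predict(iris.data[:5]))            # predicted classes for the first samples
print(clf.predict_proba(iris.data[:1]))      # class probabilities, in arithmetical class order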
sklearn.ensemble.RandomForestRegressor
class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
A random forest regressor.
A random forest is a meta estimator that fits a number of classical decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='mse')
The function to measure the quality of a split. The only supported criterion is 'mse' for the mean squared error. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after the split, one of the leaves would contain less than min_samples_leaf samples. Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split:
If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
oob_score : bool
whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
DecisionTreeRegressor, ExtraTreesRegressor
References
[R60]
Attributes
feature_importances_ array of shape = [n_features] The feature importances (the higher, the more important the feature).
oob_score_ float Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ array, shape = [n_samples] Prediction computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
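A minimal sketch (not part of the original reference): forest regression on synthetic data, with the R^2 training score computed by score; data and settings are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(200)    # noisy linear target

reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X, y)
print(reg.score(X, y))                               # R^2 on the training data
print(reg.predict(X[:3]))                            # predicted values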
sklearn.ensemble.ExtraTreesClassifier
class sklearn.ensemble.ExtraTreesClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
An extra-trees classifier.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='gini')
The function to measure the quality of a split. Supported criteria are 'gini' for the Gini impurity and 'entropy' for the information gain. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after the split, one of the leaves would contain less than min_samples_leaf samples. Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split.
If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
sklearn.tree.ExtraTreeClassifier : Base classifier for this ensemble.
RandomForestClassifier : Ensemble classifier based on trees with optimal splits.
References
[R57]
Attributes
feature_importances_ array of shape = [n_features] The feature importances (the higher, the more important the feature).
oob_score_ float Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ array, shape = [n_samples, n_classes] Decision function computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
The predicted class of an input sample is computed as the majority prediction of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_log_proba(X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the mean predicted class log-probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class log-probabilities of the input samples. Classes are ordered by arithmetical
order.
predict_proba(X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
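A hedged sketch (not from the original guide): fitting extra-trees on the iris data, then keeping only the features whose importance exceeds the mean via transform. compute_importances=True is passed so that feature_importances_ is available, following the parameters documented above for this release; the settings are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()
clf = ExtraTreesClassifier(n_estimators=10, compute_importances=True,
                           random_state=0)
clf.fit(iris.data, iris.target)

print(clf.feature_importances_)                      # one value per feature
X_reduced = clf.transform(iris.data, threshold="mean")
print(X_reduced.shape)                               # only the most important features kept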
sklearn.ensemble.ExtraTreesRegressor
class sklearn.ensemble.ExtraTreesRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
An extra-trees regressor.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Parameters n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default='mse')
The function to measure the quality of a split. The only supported criterion is 'mse' for the mean squared error. Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves
are pure or until all leaves contain less than min_samples_split samples. Note: this
parameter is tree-specific.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is discarded if after the split, one of the leaves would contain less than min_samples_leaf samples. Note: this parameter is tree-specific.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum
density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls
below this threshold the mask is recomputed and the input data is packed which results
in data copying. If min_density equals to one, the partitions are always represented
as copies of the original data. Otherwise, partitions are represented as bit masks (aka
sample masks). Note: this parameter is tree-specific.
max_features : int, string or None, optional (default='auto')
The number of features to consider when looking for the best split:
If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems.
If 'sqrt', then max_features=sqrt(n_features).
If 'log2', then max_features=log2(n_features).
If None, then max_features=n_features.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees. Note: this parameter is tree-specific.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
oob_score : bool
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number
of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
See Also:
sklearn.tree.ExtraTreeRegressor : Base estimator for this ensemble.
RandomForestRegressor : Ensemble regressor using trees with optimal splits.
References
[R58]
Attributes
feature_importances_ array of shape = [n_features] The feature importances (the higher, the more important the feature).
oob_score_ float Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ array, shape = [n_samples] Prediction computed with out-of-bag estimate on the training set.
Methods
fit(X, y) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features='auto', bootstrap=False, compute_importances=False, oob_score=False, n_jobs=1, random_state=None, verbose=0)
fit(X, y)
Build a forest of trees from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0; lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
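For instance, with a nested estimator such as a pipeline, a step's parameter is addressed as <step_name>__<parameter>. The snippet below is only an illustrative sketch; the step names 'select' and 'svc' and the chosen values are arbitrary:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> from sklearn.svm import SVC
>>> pipe = Pipeline([('select', SelectKBest(chi2)), ('svc', SVC())])
>>> # update the k parameter of the 'select' step and C of the 'svc' step in one call
>>> pipe = pipe.set_params(select__k=2, svc__C=10.0)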
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
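As an illustrative sketch only: a related forest regressor from this module is used here purely as an example of an estimator exposing this transform, compute_importances=True is passed so that feature_importances_ is available, and the threshold string follows the description above:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import RandomForestRegressor
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> forest = RandomForestRegressor(compute_importances=True, random_state=0)
>>> forest = forest.fit(X, y)
>>> # keep only the features whose importance exceeds 1.25 times the mean importance
>>> X_reduced = forest.transform(X, threshold="1.25*mean")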
sklearn.ensemble.GradientBoostingClassifier
class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1, min_samples_leaf=1, max_depth=3, init=None, random_state=None)
Gradient Boosting for classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
Parameters loss : {'deviance', 'ls'}, optional (default='deviance')
loss function to be optimized. 'deviance' refers to deviance (= logistic regression) for classification with probabilistic outputs. 'ls' refers to least squares regression.
learn_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learn_rate. There is a trade-off between learn_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators.
See Also:
sklearn.tree.DecisionTreeClassifier, RandomForestClassifier
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Examples
>>> samples = [[0, 0, 2], [1, 0, 0]]
>>> labels = [0, 1]
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> gb = GradientBoostingClassifier().fit(samples, labels)
>>> print gb.predict([[0.5, 0, 0]])
[0]
Methods
fit(X, y) Fit the gradient boosting model.
fit_stage(i, X, X_argsorted, y, y_pred, ...) Fit another stage of n_classes_ trees to the boosting model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict class for X.
predict_proba(X) Predict class probabilities for X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
staged_decision_function(X) Compute decision function for X.
__init__(loss='deviance', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1, min_samples_leaf=1, max_depth=3, init=None, random_state=None)
fit(X, y)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features. Use fortran-style to avoid memory copies.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression). For classification, labels must correspond to classes 0, 1, ..., n_classes_-1.
Returns self : object
Returns self.
fit_stage(i, X, X_argsorted, y, y_pred, sample_mask)
Fit another stage of n_classes_ trees to the boosting model.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes.
predict_proba(X)
Predict class probabilities for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples]
The class probabilities of the input samples. Classes are ordered in arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
staged_decision_function(X)
Compute decision function for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns f : array of shape = [n_samples, n_classes]
The decision function of the input samples. Classes are ordered in arithmetical order.
Regression and binary classification are special cases with n_classes == 1.
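A sketch of such monitoring follows. It assumes staged_decision_function yields one decision-function array per boosting stage, as the description above suggests, and uses a made-up synthetic binary problem, so all names and sizes are illustrative only:
>>> import numpy as np
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(size=(100, 4))
>>> y = (X[:, 0] > 0.5).astype(int)
>>> X_train, y_train = X[:80], y[:80]
>>> X_test, y_test = X[80:], y[80:]
>>> clf = GradientBoostingClassifier(n_estimators=20).fit(X_train, y_train)
>>> # misclassification rate of the thresholded decision function after each stage
>>> test_error = [np.mean((f.ravel() > 0) != y_test)
...               for f in clf.staged_decision_function(X_test)]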
sklearn.ensemble.GradientBoostingRegressor
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1, min_samples_leaf=1, max_depth=3, init=None, random_state=None)
Gradient Boosting for regression.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
Parameters loss : {'ls', 'lad'}, optional (default='ls')
loss function to be optimized. 'ls' refers to least squares regression. 'lad' (least absolute deviation) is a highly robust loss function solely based on order information of the input variables.
learn_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learn_rate. There is a trade-off between learn_rate and n_estimators.
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
max_depth : integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the
number of nodes in the tree. Tune this parameter for best performance; the best value
depends on the interaction of the input variables.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
subsample : float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators.
See Also:
sklearn.tree.DecisionTreeRegressor, RandomForestRegressor
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Examples
>>> samples = [[0, 0, 2], [1, 0, 0]]
>>> labels = [0, 1]
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> gb = GradientBoostingRegressor().fit(samples, labels)
>>> print gb.predict([[0, 0, 0]])
[ 1.32806997e-05]
Attributes
feature_importances_ : array, shape = [n_features]
The feature importances (the higher, the more important the feature).
oob_score_ : array, shape = [n_estimators]
Score of the training dataset obtained using an out-of-bag estimate. The i-th score oob_score_[i] is the deviance (= loss) of the model at iteration i on the out-of-bag sample.
train_score_ : array, shape = [n_estimators]
The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample. If subsample == 1 this is the deviance on the training data.
Methods
fit(X, y) Fit the gradient boosting model.
fit_stage(i, X, X_argsorted, y, y_pred, ...) Fit another stage of n_classes_ trees to the boosting model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
staged_decision_function(X) Compute decision function for X.
staged_predict(X) Predict regression target at each stage for X.
__init__(loss='ls', learn_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=1, min_samples_leaf=1, max_depth=3, init=None, random_state=None)
fit(X, y)
Fit the gradient boosting model.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features. Use fortran-style to avoid memory copies.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression). For classification, labels must correspond to classes 0, 1, ..., n_classes_-1.
Returns self : object
Returns self.
fit_stage(i, X, X_argsorted, y, y_pred, sample_mask)
Fit another stage of n_classes_ trees to the boosting model.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict regression target for X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y: array of shape = [n_samples] :
The predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0; lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
staged_decision_function(X)
Compute decision function for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns f : array of shape = [n_samples, n_classes]
The decision function of the input samples. Classes are ordered in arithmetical order. Regression and binary classification are special cases with n_classes == 1.
staged_predict(X)
Predict regression target at each stage for X.
This method allows monitoring (i.e. determining the error on a test set) after each stage.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted value of the input samples.
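A sketch of stage-wise monitoring for the regressor follows. It assumes staged_predict yields the predictions after each boosting stage, as described above; the synthetic data and the train/test split are made up for illustration:
>>> import numpy as np
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(size=(100, 4))
>>> y = X[:, 0] + 0.1 * rng.normal(size=100)
>>> X_train, y_train = X[:80], y[:80]
>>> X_test, y_test = X[80:], y[80:]
>>> est = GradientBoostingRegressor(n_estimators=50).fit(X_train, y_train)
>>> # held-out mean squared error after each boosting stage
>>> test_mse = [np.mean((y_pred - y_test) ** 2)
...             for y_pred in est.staged_predict(X_test)]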
1.8.7 sklearn.feature_extraction: Feature Extraction
The sklearn.feature_extraction module deals with feature extraction from raw data. It currently includes
methods to extract features from text and images.
User guide: See the Feature extraction section for further details.
feature_extraction.DictVectorizer([dtype, ...]) Transforms lists of feature-value mappings to vectors.
sklearn.feature_extraction.DictVectorizer
class sklearn.feature_extraction.DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sparse=True)
Transforms lists of feature-value mappings to vectors.
This transformer turns lists of mappings (dict-like objects) of feature names to feature values into Numpy arrays
or scipy.sparse matrices for use with scikit-learn estimators.
When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-
valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature "f" that can take on the values "ham" and "spam" will become two features in the output, one signifying "f=ham", the other "f=spam".
Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.
Parameters dtype : callable, optional
The type of feature values. Passed to Numpy array/scipy.sparse matrix constructors as
the dtype argument.
separator: string, optional :
Separator string used when constructing new features for one-hot coding.
sparse: boolean, optional. :
Whether transform should produce scipy.sparse matrices. True by default.
Examples
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[ 2., 0., 1.],
[ 0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[ 0., 0., 4.]])
Methods
fit(X[, y]) Learn a list of feature name -> indices mappings.
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
get_feature_names() Returns a list of feature names, ordered by their indices.
get_params([deep]) Get parameters for the estimator
inverse_transform(X[, dict_type]) Transform array or sparse matrix X back to feature mappings.
restrict(support[, indices]) Restrict the features to those in support.
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Transform feature->value dicts to array or sparse matrix.
__init__(dtype=<type 'numpy.float64'>, separator='=', sparse=True)
fit(X, y=None)
Learn a list of feature name -> indices mappings.
Parameters X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values
(strings or convertible to dtype).
y : (ignored)
Returns self :
fit_transform(X, y=None)
Learn a list of feature name -> indices mappings and transform X.
Like fit(X) followed by transform(X).
Parameters X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values
(strings or convertible to dtype).
y : (ignored)
Returns Xa : {array, sparse matrix}
Feature vectors; always 2-d.
get_feature_names()
Returns a list of feature names, ordered by their indices.
If one-of-K coding is applied to categorical features, this will include the constructed feature names but
not the original ones.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X, dict_type=<type 'dict'>)
Transform array or sparse matrix X back to feature mappings.
X must have been produced by this DictVectorizer's transform or fit_transform method; it may only have passed through transformers that preserve the number of features and their order.
In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than
the original ones.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Sample matrix.
dict_type : callable, optional
Constructor for feature mappings. Must conform to the collections.Mapping API.
Returns D : list of dict_type objects, length = n_samples
Feature mappings for the samples in X.
restrict(support, indices=False)
Restrict the features to those in support.
Parameters support : array-like
Boolean mask or list of indices (as returned by the get_support member of feature se-
lectors).
indices : boolean, optional
Whether support is a list of indices.
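As an illustrative sketch (the selector, its score function and the toy labels are arbitrary choices; restrict only needs the support mask returned by the selector):
>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> selector = SelectKBest(chi2, k=2).fit(X, [0, 1])
>>> support = selector.get_support()      # boolean mask of the selected columns
>>> restricted = v.restrict(support)      # drop the unselected feature names from the vectorizer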
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X, y=None)
Transform feature->value dicts to array or sparse matrix.
Named features not encountered during fit or fit_transform will be silently ignored.
Parameters X : Mapping or iterable over Mappings, length = n_samples
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values
(strings or convertible to dtype).
y : (ignored)
Returns Xa : {array, sparse matrix}
Feature vectors; always 2-d.
From images
The sklearn.feature_extraction.image submodule gathers utilities to extract features from images.
feature_extraction.image.img_to_graph(img[, ...]) Graph of the pixel-to-pixel gradient connections
feature_extraction.image.grid_to_graph(n_x, n_y) Graph of the pixel-to-pixel connections
feature_extraction.image.extract_patches_2d(...) Reshape a 2D image into a collection of patches
feature_extraction.image.reconstruct_from_patches_2d(...) Reconstruct the image from all of its patches.
feature_extraction.image.PatchExtractor(...) Extracts patches from a collection of images
sklearn.feature_extraction.image.img_to_graph
sklearn.feature_extraction.image.img_to_graph(img, mask=None, return_as=<class 'scipy.sparse.coo.coo_matrix'>, dtype=None)
Graph of the pixel-to-pixel gradient connections
Edges are weighted with the gradient values.
Parameters img: ndarray, 2D or 3D :
2D or 3D image
mask : ndarray of booleans, optional
An optional mask of the image, to consider only part of the pixels.
return_as: np.ndarray or a sparse matrix class, optional :
The class to use to build the returned adjacency matrix.
dtype: None or dtype, optional :
The data of the returned sparse matrix. By default it is the dtype of img
sklearn.feature_extraction.image.grid_to_graph
sklearn.feature_extraction.image.grid_to_graph(n_x, n_y, n_z=1, mask=None, return_as=<class 'scipy.sparse.coo.coo_matrix'>, dtype=<type 'int'>)
Graph of the pixel-to-pixel connections
Edges exist if 2 voxels are connected.
Parameters n_x: int :
Dimension in x axis
n_y: int :
Dimension in y axis
n_z: int, optional, default 1 :
Dimension in z axis
mask : ndarray of booleans, optional
An optional mask of the image, to consider only part of the pixels.
return_as: np.ndarray or a sparse matrix class, optional :
The class to use to build the returned adjacency matrix.
dtype: dtype, optional, default int :
The data of the returned sparse matrix. By default it is int
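A minimal sketch of both graph builders on a tiny 4x4 image; with one graph node per pixel the adjacency matrix is 16x16 (such graphs are typically used as connectivity constraints for structured clustering or feature agglomeration):
>>> import numpy as np
>>> from sklearn.feature_extraction.image import img_to_graph, grid_to_graph
>>> img = np.arange(16).reshape((4, 4))
>>> img_to_graph(img).shape            # gradient-weighted edges between neighbouring pixels
(16, 16)
>>> grid_to_graph(n_x=4, n_y=4).shape  # same connectivity, without the gradient weights
(16, 16)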
sklearn.feature_extraction.image.extract_patches_2d
sklearn.feature_extraction.image.extract_patches_2d(image, patch_size, max_patches=None, random_state=None)
Reshape a 2D image into a collection of patches
The resulting patches are allocated in a dedicated array.
Parameters image: array, shape = (image_height, image_width) or :
(image_height, image_width, n_channels) The original image data. For color images, the last dimension specifies the channel: an RGB image would have n_channels=3.
patch_size: tuple of ints (patch_height, patch_width) :
the dimensions of one patch
max_patches: integer or float, optional default is None :
The maximum number of patches to extract. If max_patches is a float between 0 and 1, it is taken to be a proportion of the total number of patches.
random_state: int or RandomState :
Pseudo number generator state used for random sampling to use if max_patches is not
None.
Returns patches: array, shape = (n_patches, patch_height, patch_width) or :
(n_patches, patch_height, patch_width, n_channels) The collection of patches extracted
from the image, where n_patches is either max_patches or the total number of patches
that can be extracted.
Examples
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(16).reshape((4, 4))
>>> one_image
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2)
>>> patches[0]
array([[0, 1],
[4, 5]])
>>> patches[1]
array([[1, 2],
[5, 6]])
>>> patches[8]
array([[10, 11],
[14, 15]])
sklearn.feature_extraction.image.reconstruct_from_patches_2d
sklearn.feature_extraction.image.reconstruct_from_patches_2d(patches, image_size)
Reconstruct the image from all of its patches.
Patches are assumed to overlap and the image is constructed by filling in the patches from left to right, top to bottom, averaging the overlapping regions.
Parameters patches: array, shape = (n_patches, patch_height, patch_width) or :
(n_patches, patch_height, patch_width, n_channels) The complete set of patches. If
the patches contain colour information, channels are indexed along the last dimension:
RGB patches would have n_channels=3.
image_size: tuple of ints (image_height, image_width) or :
(image_height, image_width, n_channels) the size of the image that will be recon-
structed
Returns image: array, shape = image_size :
the reconstructed image
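A round-trip sketch reusing the 4x4 example image from extract_patches_2d above; exact recovery is expected here because every overlapping patch carries identical pixel values:
>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(16).reshape((4, 4))
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4))
>>> np.allclose(reconstructed, one_image)
True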
sklearn.feature_extraction.image.PatchExtractor
class sklearn.feature_extraction.image.PatchExtractor(patch_size, max_patches=None,
random_state=None)
Extracts patches from a collection of images
Parameters patch_size: tuple of ints (patch_height, patch_width) :
the dimensions of one patch
max_patches: integer or float, optional default is None :
The maximum number of patches per image to extract. If max_patches is a float in (0, 1), it is taken to mean a proportion of the total number of patches.
random_state: int or RandomState :
Pseudo number generator state used for random sampling.
Methods
fit(X[, y]) Do nothing and return the estimator unchanged
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Transforms the image samples in X into a matrix of patch data.
__init__(patch_size, max_patches=None, random_state=None)
fit(X, y=None)
Do nothing and return the estimator unchanged
This method is just there to implement the usual API and hence work in pipelines.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transforms the image samples in X into a matrix of patch data.
Parameters X : array, shape = (n_samples, image_height, image_width) or
(n_samples, image_height, image_width, n_channels) Array of images from which to extract patches. For color images, the last dimension specifies the channel: an RGB image would have n_channels=3.
Returns patches: array, shape = (n_patches, patch_height, patch_width) or :
(n_patches, patch_height, patch_width, n_channels) The collection of patches extracted
from the images, where n_patches is either n_samples * max_patches or the total num-
ber of patches that can be extracted.
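A small sketch on a made-up stack of two 4x4 images; with the default max_patches=None every possible 2x2 patch of each image should be returned:
>>> import numpy as np
>>> from sklearn.feature_extraction.image import PatchExtractor
>>> images = np.arange(2 * 4 * 4).reshape((2, 4, 4))
>>> extractor = PatchExtractor(patch_size=(2, 2))
>>> patches = extractor.transform(images)   # one row per extracted 2x2 patch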
From text
The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text doc-
uments.
feature_extraction.text.CountVectorizer([...]) Convert a collection of raw documents to a matrix of token counts
feature_extraction.text.TfidfTransformer([...]) Transform a count matrix to a normalized tf or tfidf representation
feature_extraction.text.TfidfVectorizer([...]) Convert a collection of raw documents to a matrix of TF-IDF features.
sklearn.feature_extraction.text.CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, analyzer='word', max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>)
Convert a collection of raw documents to a matrix of token counts
This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analysing the data. The default analyzer does simple stop word filtering for English.
Parameters input: string {'filename', 'file', 'content'} :
If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If 'file', the sequence items must have a 'read' method (file-like object); it is called to fetch the bytes in memory.
Otherwise the input is expected to be a sequence of string or bytes items, which are analyzed directly.
charset: string, 'utf-8' by default. :
If bytes or files are given to analyze, this charset is used to decode.
charset_error: {'strict', 'ignore', 'replace'} :
Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given charset. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
strip_accents: {'ascii', 'unicode', None} :
Remove accents during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.
analyzer: string, {'word', 'char'} or callable :
Whether the feature should be made of word or character n-grams.
If a callable is passed it is used to extract the sequence of features out of the raw, unpro-
cessed input.
preprocessor: callable or None (default) :
Override the preprocessing (string transformation) stage while preserving the tokenizing
and n-grams generation steps.
tokenizer: callable or None (default) :
Override the string tokenization step while preserving the preprocessing and n-grams
generation steps.
min_n: integer :
The lower boundary of the range of n-values for different n-grams to be extracted.
max_n: integer :
The upper boundary of the range of n-values for different n-grams to be extracted. All
values of n such that min_n <= n <= max_n will be used.
stop_words: string {'english'}, list, or None (default) :
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.
If a list, that list is assumed to contain stop words, all of which will be removed from
the resulting tokens.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
token_pattern: string :
Regular expression denoting what constitutes a "token", only used if tokenize == 'word'. The default regexp selects tokens of 2 or more word characters (punctuation is completely ignored and always treated as a token separator).
max_df : float in range [0.0, 1.0], optional, 1.0 by default
When building the vocabulary ignore terms that have a term frequency strictly higher than the given threshold (corpus-specific stop words).
This parameter is ignored if vocabulary is not None.
max_features : optional, None by default
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
binary: boolean, False by default. :
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models
that model binary events rather than integer counts.
dtype: type, optional :
Type of the matrix returned by fit_transform() or transform().
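A short sketch with two toy documents; the shape shown assumes the default settings keep the five distinct words of this tiny corpus as the vocabulary:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ["the cat sat", "the cat sat on the mat"]
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(docs)
>>> counts.shape
(2, 5)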
Methods
build_analyzer() Return a callable that handles preprocessing and tokenization
build_preprocessor() Return a function to preprocess the text before tokenization
build_tokenizer() Return a function that splits a string into a sequence of tokens
decode(doc) Decode the input into a string of unicode symbols
fit(raw_documents[, y]) Learn a vocabulary dictionary of all tokens in the raw documents
fit_transform(raw_documents[, y]) Learn the vocabulary dictionary and return the count vectors
get_feature_names() Array mapping from feature integer index to feature name
get_params([deep]) Get parameters for the estimator
get_stop_words() Build or fetch the effective stop words list
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of the estimator.
transform(raw_documents) Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided in the constructor.
__init__(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, analyzer='word', max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>)
build_analyzer()
Return a callable that handles preprocessing and tokenization
build_preprocessor()
Return a function to preprocess the text before tokenization
build_tokenizer()
Return a function that splits a string into a sequence of tokens
decode(doc)
Decode the input into a string of unicode symbols
The decoding strategy depends on the vectorizer parameters.
fit(raw_documents, y=None)
Learn a vocabulary dictionary of all tokens in the raw documents
Parameters raw_documents: iterable :
an iterable which yields either str, unicode or file objects
Returns self :
fit_transform(raw_documents, y=None)
Learn the vocabulary dictionary and return the count vectors
This is more efficient than calling fit followed by transform.
Parameters raw_documents: iterable :
an iterable which yields either str, unicode or file objects
Returns vectors: array, [n_samples, n_features] :
get_feature_names()
Array mapping from feature integer index to feature name
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_stop_words()
Build or fetch the effective stop words list
inverse_transform(X)
Return terms per document with nonzero entries in X.
Parameters X : {array, sparse matrix}, shape = [n_samples, n_features]
Returns X_inv : list of arrays, len = n_samples
List of arrays of terms.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(raw_documents)
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided in the constructor.
Parameters raw_documents: iterable :
an iterable which yields either str, unicode or file objects
Returns vectors: sparse matrix, [n_samples, n_features] :
sklearn.feature_extraction.text.TfidfTransformer
class sklearn.feature_extraction.text.TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Transform a count matrix to a normalized tf or tfidf representation
Tf means term-frequency while tfidf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
The goal of using tfidf instead of the raw frequencies of occurrence of a token in a given document is to scale
down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less
informative than features that occur in a small fraction of the training corpus.
In the SMART notation used in IR, this class implements several tfidf variants. Tf is always "n" (natural), idf is "t" iff use_idf is given, "n" otherwise, and normalization is "c" iff norm='l2', "n" iff norm=None.
Parameters norm : 'l1', 'l2' or None, optional
Norm used to normalize term vectors. None for no normalization.
use_idf : boolean, optional
Enable inverse-document-frequency reweighting.
smooth_idf : boolean, optional
Smooth idf weights by adding one to document frequencies, as if an extra document
was seen containing every term in the collection exactly once. Prevents zero divisions.
sublinear_tf : boolean, optional
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
References
[Yates2011], [MSR2008]
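A minimal sketch turning a small, made-up count matrix into tf-idf vectors (the all-zero column is handled thanks to smooth_idf, as described above):
>>> import numpy as np
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> counts = np.array([[3, 0, 1],
...                    [2, 0, 0],
...                    [3, 0, 0],
...                    [4, 0, 0]])
>>> tfidf = TfidfTransformer().fit_transform(counts)
>>> tfidf.shape
(4, 3)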
Methods
fit(X[, y]) Learn the idf vector (global term weights)
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, copy]) Transform a count matrix to a tf or tfidf representation
__init__(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
fit(X, y=None)
Learn the idf vector (global term weights)
Parameters X: sparse matrix, [n_samples, n_features] :
a matrix of term/token counts
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X, copy=True)
Transform a count matrix to a tf or tfidf representation
Parameters X: sparse matrix, [n_samples, n_features] :
a matrix of term/token counts
Returns vectors: sparse matrix, [n_samples, n_features] :
sklearn.feature_extraction.text.TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
See Also:
CountVectorizer : Tokenize the documents and count the occurrences of tokens and return them as a sparse matrix.
TfidfTransformer : Apply Term Frequency Inverse Document Frequency normalization to a sparse matrix of occurrence counts.
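A minimal sketch; the shape shown assumes the seven distinct words of the two toy documents end up in the vocabulary:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> docs = ["the cat sat on the mat", "the dog sat on the log"]
>>> X = TfidfVectorizer().fit_transform(docs)
>>> X.shape
(2, 7)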
Methods
build_analyzer() Return a callable that handles preprocessing and tokenization
build_preprocessor() Return a function to preprocess the text before tokenization
build_tokenizer() Return a function that splits a string into a sequence of tokens
decode(doc) Decode the input into a string of unicode symbols
fit(raw_documents) Learn a conversion law from documents to array data
fit_transform(raw_documents[, y]) Learn the representation and return the vectors.
get_feature_names() Array mapping from feature integer index to feature name
get_params([deep]) Get parameters for the estimator
get_stop_words() Build or fetch the effective stop words list
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of the estimator.
transform(raw_documents[, copy]) Transform raw text documents to tfidf vectors
__init__(input='content', charset='utf-8', charset_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern=u'\b\w\w+\b', min_n=1, max_n=1, max_df=1.0, max_features=None, vocabulary=None, binary=False, dtype=<type 'long'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
build_analyzer()
Return a callable that handles preprocessing and tokenization
build_preprocessor()
Return a function to preprocess the text before tokenization
build_tokenizer()
Return a function that splits a string into a sequence of tokens
decode(doc)
Decode the input into a string of unicode symbols
The decoding strategy depends on the vectorizer parameters.
fit(raw_documents)
Learn a conversion law from documents to array data
fit_transform(raw_documents, y=None)
Learn the representation and return the vectors.
Parameters raw_documents: iterable :
an iterable which yields either str, unicode or file objects
Returns vectors: array, [n_samples, n_features] :
get_feature_names()
Array mapping from feature integer index to feature name
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_stop_words()
Build or fetch the effective stop words list
inverse_transform(X)
Return terms per document with nonzero entries in X.
Parameters X : {array, sparse matrix}, shape = [n_samples, n_features]
Returns X_inv : list of arrays, len = n_samples
List of arrays of terms.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(raw_documents, copy=True)
Transform raw text documents to tfidf vectors
Parameters raw_documents: iterable :
an iterable which yields either str, unicode or file objects
Returns vectors: sparse matrix, [n_samples, n_features] :
1.8.8 sklearn.feature_selection: Feature Selection
The sklearn.feature_selection module implements feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm.
User guide: See the Feature selection section for further details.
feature_selection.SelectPercentile(score_func) Filter: Select the best percentile of the p_values
feature_selection.SelectKBest(score_func[, k]) Filter: Select the k lowest p-values.
feature_selection.SelectFpr(score_func[, alpha]) Filter: Select the p-values below alpha based on an FPR test.
feature_selection.SelectFdr(score_func[, alpha]) Filter: Select the p-values for an estimated false discovery rate
feature_selection.SelectFwe(score_func[, alpha]) Filter: Select the p-values corresponding to Family-wise error rate
feature_selection.RFE(estimator, ...[, step]) Feature ranking with recursive feature elimination.
feature_selection.RFECV(estimator[, step, ...]) Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
sklearn.feature_selection.SelectPercentile
class sklearn.feature_selection.SelectPercentile(score_func, percentile=10)
Filter: Select the best percentile of the p_values
Parameters score_func: callable :
function taking two arrays X and y, and returning 2 arrays: both scores and pvalues
percentile: int, optional :
percent of features to keep
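An illustrative sketch on the iris data, using the f_classif score function from this module (an arbitrary but common choice of score function):
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectPercentile, f_classif
>>> iris = load_iris()
>>> selector = SelectPercentile(f_classif, percentile=50)
>>> X_new = selector.fit_transform(iris.data, iris.target)   # keeps roughly half of the four features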
Methods
fit(X, y) Evaluate the function
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(score_func, percentile=10)
fit(X, y)
Evaluate the function
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.feature_selection.SelectKBest
class sklearn.feature_selection.SelectKBest(score_func, k=10)
Filter: Select the k lowest p-values.
Parameters score_func: callable :
Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues).
k: int, optional :
Number of top features to select.
Notes
Ties between features with equal p-values will be broken in an unspecified way.
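An illustrative sketch on the iris data with the chi2 score function (an arbitrary choice among the univariate score functions in this module):
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X_new = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
>>> X_new.shape
(150, 2)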
Methods
fit(X, y) Evaluate the function
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(score_func, k=10)
fit(X, y)
Evaluate the function
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.feature_selection.SelectFpr
class sklearn.feature_selection.SelectFpr(score_func, alpha=0.05)
Filter: Select the p-values below alpha based on an FPR test.
FPR test stands for False Positive Rate test. It controls the total amount of false detections.
Parameters score_func: callable :
function taking two arrays X and y, and returning 2 arrays: both scores and pvalues
alpha: float, optional :
the highest p-value for features to be kept
Methods
fit(X, y) Evaluate the function
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(score_func, alpha=0.05)
fit(X, y)
Evaluate the function
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.feature_selection.SelectFdr
class sklearn.feature_selection.SelectFdr(score_func, alpha=0.05)
Filter: Select the p-values for an estimated false discovery rate
This uses the Benjamini-Hochberg procedure. alpha is the target false discovery rate.
Parameters score_func: callable :
function taking two arrays X and y, and returning 2 arrays: both scores and pvalues
alpha: float, optional :
the highest uncorrected p-value for features to keep
Methods
fit(X, y) Evaluate the function
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(score_func, alpha=0.05)
fit(X, y)
Evaluate the function
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.feature_selection.SelectFwe
class sklearn.feature_selection.SelectFwe(score_func, alpha=0.05)
Filter: Select the p-values corresponding to Family-wise error rate
Parameters score_func: callable :
function taking two arrays X and y, and returning 2 arrays: both scores and pvalues
alpha: float, optional :
the highest uncorrected p-value for features to keep
Methods
fit(X, y) Evaluate the function
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(score_func, alpha=0.05)
fit(X, y)
Evaluate the function
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.feature_selection.RFE
class sklearn.feature_selection.RFE(estimator, n_features_to_select, step=1)
Feature ranking with recursive feature elimination.
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
Parameters estimator : object
A supervised learning estimator with a fit method that updates a coef_ attribute that holds the fitted parameters. Important features must correspond to high absolute values in the coef_ array.
For instance, this is the case for most supervised learning algorithms such as Support Vector Classifiers and Generalized Linear Models from the svm and linear_model modules.
n_features_to_select : int
The number of features to select.
step : int or float, optional (default=1)
If greater than or equal to 1, then step corresponds to the (integer) number of features
to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage
(rounded down) of features to remove at each iteration.
References
[R61]
Examples
The following example shows how to retrieve the 5 truly informative features in the Friedman #1 dataset.
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFE(estimator, 5, step=1)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True, True, True, True, True,
False, False, False, False, False], dtype=bool)
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
Attributes
n_features_ : int
The number of selected features.
support_ : array of shape [n_features]
The mask of selected features.
ranking_ : array of shape [n_features]
The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1.
Methods
fit(X, y) Fit the RFE model and then the underlying estimator on the selected features.
get_params([deep]) Get parameters for the estimator
predict(X) Reduce X to the selected features and then predict using the underlying estimator.
score(X, y) Reduce X to the selected features and then return the score of the underlying estimator.
set_params(**params) Set the parameters of the estimator.
transform(X) Reduce X to the selected features during the elimination.
__init__(estimator, n_features_to_select, step=1)
fit(X, y)
Fit the RFE model and then the underlying estimator on the selected features.
Parameters X : array of shape [n_samples, n_features]
The training input samples.
y : array of shape [n_samples]
The target values.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Reduce X to the selected features and then predict using the underlying estimator.
Parameters X : array of shape [n_samples, n_features]
The input samples.
Returns y : array of shape [n_samples]
The predicted target values.
score(X, y)
Reduce X to the selected features and then return the score of the underlying estimator.
Parameters X : array of shape [n_samples, n_features]
The input samples.
y : array of shape [n_samples]
The target values.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Reduce X to the selected features during the elimination.
Parameters X : array of shape [n_samples, n_features]
The input samples.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the features selected during the elimination.
sklearn.feature_selection.RFECV
class sklearn.feature_selection.RFECV(estimator, step=1, cv=None, loss_func=None)
Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
Parameters estimator : object
A supervised learning estimator with a fit method that updates a coef_ attribute that
holds the fitted parameters. Important features must correspond to high absolute values
in the coef_ array.
For instance, this is the case for most supervised learning algorithms such as Support
Vector Classifiers and Generalized Linear Models from the svm and linear_model modules.
step : int or float, optional (default=1)
If greater than or equal to 1, then step corresponds to the (integer) number of features
to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage
(rounded down) of features to remove at each iteration.
cv : int or cross-validation generator, optional (default=None)
If int, it is the number of folds. If None, 3-fold cross-validation is performed by de-
fault. Specic cross-validation objects can also be passed, see sklearn.cross_validation
module for details.
loss_func : function, optional (default=None)
The loss function to minimize by cross-validation. If None, then the score function of
the estimator is maximized.
References
[R62]
Examples
The following example shows how to retrieve the 5 informative features (not known a priori) in the Friedman #1
dataset.
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True, True, True, True, True,
False, False, False, False, False], dtype=bool)
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
Attributes
n_features_ : int
The number of selected features with cross-validation.
support_ : array of shape [n_features]
The mask of selected features.
ranking_ : array of shape [n_features]
The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature.
Selected (i.e., estimated best) features are assigned rank 1.
cv_scores_ : array of shape [n_subsets_of_features]
The cross-validation scores such that cv_scores_[i] corresponds to the CV score of the i-th subset of features.
Methods
fit(X, y) Fit the RFE model and automatically tune the number of selected features
get_params([deep]) Get parameters for the estimator
predict(X) Reduce X to the selected features and then predict using the underlying estimator
score(X, y) Reduce X to the selected features and then return the score of the underlying estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Reduce X to the selected features during the elimination.
__init__(estimator, step=1, cv=None, loss_func=None)
fit(X, y)
Fit the RFE model and automatically tune the number of selected features.
Parameters X : array of shape [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the total
number of features.
y : array of shape [n_samples]
Target values (integers for classification, real numbers for regression).
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Reduce X to the selected features and then predict using the underlying estimator.
Parameters X : array of shape [n_samples, n_features]
The input samples.
Returns y : array of shape [n_samples]
The predicted target values.
score(X, y)
Reduce X to the selected features and then return the score of the underlying estimator.
Parameters X : array of shape [n_samples, n_features]
The input samples.
y : array of shape [n_samples]
The target values.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Reduce X to the selected features during the elimination.
Parameters X : array of shape [n_samples, n_features]
The input samples.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the features selected during the elimination.
feature_selection.chi2(X, y) Compute chi-squared statistic for each class/feature combination.
feature_selection.f_classif(X, y) Compute the ANOVA F-value for the provided sample
feature_selection.f_regression(X, y[, center]) Univariate linear regression tests
sklearn.feature_selection.chi2
sklearn.feature_selection.chi2(X, y)
Compute chi-squared statistic for each class/feature combination.
This score can be used to select the n_features features with the highest values for the chi-squared
statistic from either boolean or multinomially distributed data (e.g., term counts in document classification)
relative to the classes.
Recall that the chi-squared statistic measures dependence between stochastic variables, so a selector based on this
function weeds out the features that are the most likely to be independent of class and therefore irrelevant for
classification.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features_in]
Sample vectors.
y : array-like, shape = n_samples
Target vector (class labels).
Notes
Complexity of this algorithm is O(n_classes * n_features).
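A minimal usage sketch, assuming a small matrix of non-negative term counts and using SelectKBest (documented earlier in this reference) as the selector:
>>> import numpy as np
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X = np.array([[1, 0, 3],
...               [0, 2, 1],
...               [4, 0, 5],
...               [0, 3, 0]])            # e.g. term counts for 4 documents
>>> y = np.array([0, 1, 0, 1])
>>> scores, p_values = chi2(X, y)        # one chi-squared score / p-value per feature
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(4, 2)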
sklearn.feature_selection.f_classif
sklearn.feature_selection.f_classif(X, y)
Compute the ANOVA F-value for the provided sample
Parameters X : array of shape (n_samples, n_features)
The sample matrix; each feature is tested sequentially
y : array of shape (n_samples,)
The target vector (class labels)
Returns F : array of shape (n_features,)
The set of F values
pval : array of shape (n_features,)
The set of p-values
sklearn.feature_selection.f_regression
sklearn.feature_selection.f_regression(X, y, center=True)
Univariate linear regression tests
Quick linear model for testing the effect of a single regressor, sequentially for many regressors.
This is done in 3 steps:
1. The regressor of interest and the data are orthogonalized wrt constant regressors.
2. The cross correlation between data and regressors is computed.
3. It is converted to an F score and then to a p-value.
Parameters X : array of shape (n_samples, n_features)
The set of regressors that will be tested sequentially
y : array of shape (n_samples,)
The target vector
center : bool, optional (default=True)
If True, X and y are centered
Returns F : array of shape (n_features,)
The set of F values
pval : array of shape (n_features,)
The set of p-values
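A minimal usage sketch, assuming a toy regression problem generated with sklearn.datasets.make_regression:
>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import f_regression
>>> X, y = make_regression(n_samples=100, n_features=5, n_informative=2,
...                        random_state=0)
>>> F, pval = f_regression(X, y)         # one F score / p-value per feature
>>> F.shape, pval.shape
((5,), (5,))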
1.8.9 sklearn.gaussian_process: Gaussian Processes
The sklearn.gaussian_process module implements scalar Gaussian Process based predictions.
User guide: See the Gaussian Processes section for further details.
gaussian_process.GaussianProcess([regr, ...]) The Gaussian Process model class.
sklearn.gaussian_process.GaussianProcess
class sklearn.gaussian_process.GaussianProcess(regr='constant', corr='squared_exponential',
beta0=None, storage_mode='full', verbose=False, theta0=0.1, thetaL=None, thetaU=None,
optimizer='fmin_cobyla', random_start=1, normalize=True,
nugget=2.2204460492503131e-15, random_state=None)
The Gaussian Process model class.
Parameters regr : string or callable, optional
A regression function returning an array of outputs of the linear regression functional
basis. The number of observations n_samples should be greater than the size p of this
basis. Default assumes a simple constant regression trend. Available built-in regression
models are:
'constant', 'linear', 'quadratic'
corr : string or callable, optional
A stationary autocorrelation function returning the autocorrelation between two points
x and x'. Default assumes a squared-exponential autocorrelation model. Built-in correlation models are:
'absolute_exponential', 'squared_exponential',
'generalized_exponential', 'cubic', 'linear'
beta0 : double array_like, optional
The regression weight vector to perform Ordinary Kriging (OK). Default assumes Uni-
versal Kriging (UK) so that the vector beta of regression weights is estimated using the
maximum likelihood principle.
storage_mode : string, optional
A string specifying whether the Cholesky decomposition of the correlation matrix
should be stored in the class (storage_mode = 'full') or not (storage_mode = 'light').
Default assumes storage_mode = 'full', so that the Cholesky decomposition of the correlation matrix is stored.
This might be a useful parameter when one is not interested in the MSE and only plans
to estimate the BLUP, for which the correlation matrix is not required.
verbose : boolean, optional
A boolean specifying the verbose level. Default is verbose = False.
theta0 : double array_like, optional
An array with shape (n_features, ) or (1, ). The parameters in the autocorrelation model.
If thetaL and thetaU are also specified, theta0 is considered as the starting point for the
maximum likelihood estimation of the best set of parameters. Default assumes isotropic
autocorrelation model with theta0 = 1e-1.
thetaL : double array_like, optional
An array with shape matching theta0's. Lower bound on the autocorrelation parameters for
maximum likelihood estimation. Default is None, so that it skips maximum likelihood
estimation and it uses theta0.
thetaU : double array_like, optional
An array with shape matching theta0's. Upper bound on the autocorrelation parameters for
maximum likelihood estimation. Default is None, so that it skips maximum likelihood
estimation and it uses theta0.
normalize : boolean, optional
Input X and observations y are centered and reduced wrt means and standard deviations
estimated from the n_samples observations provided. Default is normalize = True so
that data is normalized to ease maximum likelihood estimation.
nugget : double or ndarray, optional
Introduce a nugget effect to allow smooth predictions from noisy data. If nugget is
an ndarray, it must be the same length as the number of data points used for the fit.
The nugget is added to the diagonal of the assumed training covariance; in this way
it acts as a Tikhonov regularization in the problem. In the special case of the squared
exponential correlation function, the nugget mathematically represents the variance of
the input values. Default assumes a nugget close to machine precision for the sake of
robustness (nugget = 10. * MACHINE_EPSILON).
optimizer : string, optional
A string specifying the optimization algorithm to be used. Default uses the 'fmin_cobyla'
algorithm from scipy.optimize. Available optimizers are:
'fmin_cobyla', 'Welch'
The 'Welch' optimizer is due to Welch et al., see reference [WBSWM1992]. It consists
in iterating over several one-dimensional optimizations instead of running one single
multi-dimensional optimization.
random_start : int, optional
The number of times the Maximum Likelihood Estimation should be performed from a
random starting point. The first MLE always uses the specified starting point (theta0),
the next starting points are picked at random according to an exponential distribution
(log-uniform on [thetaL, thetaU]). Default does not use random starting points (random_start = 1).
random_state: integer or numpy.RandomState, optional :
The generator used to shuffle the sequence of coordinates of theta in the Welch optimizer.
If an integer is given, it fixes the seed. Defaults to the global numpy random
number generator.
Notes
The present implementation is based on a translation of the DACE Matlab toolbox, see reference
[NLNS2002].
References
[NLNS2002], [WBSWM1992]
Examples
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcess
>>> X = np.array([[1., 3., 5., 6., 7., 8.]]).T
>>> y = (X * np.sin(X)).ravel()
>>> gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1.)
>>> gp.fit(X, y)
GaussianProcess(beta0=None...
...
Attributes
theta_ : array
Specified theta OR the best set of autocorrelation parameters (the sought maximizer of the reduced likelihood function).
reduced_likelihood_function_value_ : array
The optimal reduced likelihood function value.
Methods
arg_max_reduced_likelihood_function(*args, ...) DEPRECATED: to be removed; access self.theta_ etc. directly after fit
fit(X, y) The Gaussian Process model fitting method.
get_params([deep]) Get parameters for the estimator
predict(X[, eval_MSE, batch_size]) This function evaluates the Gaussian Process model at x.
reduced_likelihood_function([theta]) This function determines the BLUP parameters and evaluates the reduced likelihood function
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(regr='constant', corr='squared_exponential', beta0=None, storage_mode='full',
verbose=False, theta0=0.1, thetaL=None, thetaU=None, optimizer='fmin_cobyla',
random_start=1, normalize=True, nugget=2.2204460492503131e-15, random_state=None)
arg_max_reduced_likelihood_function(*args, **kwargs)
DEPRECATED: to be removed; access self.theta_ etc. directly after fit
fit(X, y)
The Gaussian Process model fitting method.
Parameters X : double array_like
An array with shape (n_samples, n_features) with the input at which observations were
made.
y : double array_like
An array with shape (n_samples, ) with the observations of the scalar output to be predicted.
Returns gp : self
A fitted Gaussian Process model object awaiting data to perform predictions.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X, eval_MSE=False, batch_size=None)
This function evaluates the Gaussian Process model at x.
Parameters X : array_like
An array with shape (n_eval, n_features) giving the point(s) at which the prediction(s)
should be made.
eval_MSE : boolean, optional
A boolean specifying whether the Mean Squared Error should be evaluated or not.
Default assumes eval_MSE = False and evaluates only the BLUP (mean prediction).
batch_size : integer, optional
An integer giving the maximum number of points that can be evaluated simultaneously
(depending on the available memory). Default is None so that all given points are eval-
uated at the same time.
Returns y : array_like
An array with shape (n_eval, ) with the Best Linear Unbiased Prediction at x.
MSE : array_like, optional (if eval_MSE == True)
An array with shape (n_eval, ) with the Mean Squared Error at x.
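A minimal sketch of predict with eval_MSE=True, reusing the toy data from the example above; the evaluation points in x_eval are an arbitrary choice for illustration:
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcess
>>> X = np.array([[1., 3., 5., 6., 7., 8.]]).T
>>> y = (X * np.sin(X)).ravel()
>>> gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1.)
>>> gp.fit(X, y)
GaussianProcess(beta0=None...
...
>>> x_eval = np.atleast_2d(np.linspace(1., 8., 5)).T
>>> y_pred, mse = gp.predict(x_eval, eval_MSE=True)   # BLUP and its mean squared error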
reduced_likelihood_function(theta=None)
This function determines the BLUP parameters and evaluates the reduced likelihood function for the given
autocorrelation parameters theta.
Maximizing this function wrt the autocorrelation parameters theta is equivalent to maximizing the likeli-
hood of the assumed joint Gaussian distribution of the observations y evaluated onto the design of experi-
ments X.
Parameters theta : array_like, optional
An array containing the autocorrelation parameters at which the Gaussian Process
model parameters should be determined. Default uses the built-in autocorrelation parameters (i.e., theta = self.theta_).
Returns reduced_likelihood_function_value : double
The value of the reduced likelihood function associated to the given autocorrelation
parameters theta.
par : dict
A dictionary containing the requested Gaussian Process model parameters:
sigma2 : Gaussian Process variance.
beta : Generalized least-squares regression weights for Universal Kriging or given beta0 for Ordinary Kriging.
gamma : Gaussian Process weights.
C : Cholesky decomposition of the correlation matrix [R].
Ft : Solution of the linear equation system [R] x Ft = F.
G : QR decomposition of the matrix Ft.
reduced_likelihood_function_value
DEPRECATED: reduced_likelihood_function_value is deprecated and will be removed;
please use reduced_likelihood_function_value_ instead.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
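The same quantity can be checked by hand; the toy arrays below are arbitrary illustrative values:
>>> import numpy as np
>>> y_true = np.array([3.0, -0.5, 2.0, 7.0])
>>> y_pred = np.array([2.5,  0.0, 2.0, 8.0])
>>> u = ((y_true - y_pred) ** 2).sum()           # residual sum of squares
>>> v = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
>>> r2 = 1.0 - u / v                             # approximately 0.949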
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
theta
DEPRECATED: theta is deprecated and will be removed; please use theta_ instead.
gaussian_process.correlation_models.absolute_exponential(...) Absolute exponential autocorrelation model.
gaussian_process.correlation_models.squared_exponential(...) Squared exponential correlation model (Radial Basis Function).
gaussian_process.correlation_models.generalized_exponential(...) Generalized exponential correlation model.
gaussian_process.correlation_models.pure_nugget(...) Spatial independence correlation model (pure nugget).
gaussian_process.correlation_models.cubic(...) Cubic correlation model:
gaussian_process.correlation_models.linear(...) Linear correlation model:
gaussian_process.regression_models.constant(x) Zero order polynomial (constant, p = 1) regression model.
gaussian_process.regression_models.linear(x) First order polynomial (linear, p = n+1) regression model.
gaussian_process.regression_models.quadratic(x) Second order polynomial (quadratic, p = n*(n-1)/2+n+1) regression model.
sklearn.gaussian_process.correlation_models.absolute_exponential
sklearn.gaussian_process.correlation_models.absolute_exponential(theta, d)
Absolute exponential autocorrelation model. (Ornstein-Uhlenbeck stochastic process):
theta, dx --> r(theta, dx) = exp( - sum_{i=1..n} theta_i * |dx_i| )
Parameters theta : array_like
An array with shape 1 (isotropic) or n (anisotropic) giving the autocorrelation parame-
ter(s).
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) containing the values of the autocorrelation model.
sklearn.gaussian_process.correlation_models.squared_exponential
sklearn.gaussian_process.correlation_models.squared_exponential(theta, d)
Squared exponential correlation model (Radial Basis Function). (Infinitely differentiable stochastic process,
very smooth):
theta, dx --> r(theta, dx) = exp( - sum_{i=1..n} theta_i * (dx_i)^2 )
Parameters theta : array_like
An array with shape 1 (isotropic) or n (anisotropic) giving the autocorrelation parame-
ter(s).
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) containing the values of the autocorrelation model.
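A minimal sketch evaluating this correlation model directly; the theta and dx values are arbitrary illustrative choices:
>>> import numpy as np
>>> from sklearn.gaussian_process.correlation_models import squared_exponential
>>> theta = np.array([0.5])                  # isotropic autocorrelation parameter
>>> dx = np.array([[0.0], [1.0], [2.0]])     # componentwise distances
>>> r = squared_exponential(theta, dx)
>>> np.allclose(r, np.exp(-0.5 * dx[:, 0] ** 2))
True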
sklearn.gaussian_process.correlation_models.generalized_exponential
sklearn.gaussian_process.correlation_models.generalized_exponential(theta, d)
Generalized exponential correlation model. (Useful when one does not know the smoothness of the function to
be predicted.):
theta, dx --> r(theta, dx) = exp( - sum_{i=1..n} theta_i * |dx_i|^p )
Parameters theta : array_like
An array with shape 1+1 (isotropic) or n+1 (anisotropic) giving the autocorrelation pa-
rameter(s) (theta, p).
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) with the values of the autocorrelation model.
sklearn.gaussian_process.correlation_models.pure_nugget
sklearn.gaussian_process.correlation_models.pure_nugget(theta, d)
Spatial independence correlation model (pure nugget). (Useful when one wants to solve an ordinary least squares
problem!):
theta, dx --> r(theta, dx) = 1 if sum_{i=1..n} |dx_i| == 0
                             0 otherwise
Parameters theta : array_like
None.
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) with the values of the autocorrelation model.
sklearn.gaussian_process.correlation_models.cubic
sklearn.gaussian_process.correlation_models.cubic(theta, d)
Cubic correlation model:
theta, dx --> r(theta, dx) = prod_{j=1..n} max(0, 1 - 3(theta_j * d_ij)^2 + 2(theta_j * d_ij)^3) ,   i = 1,...,m
Parameters theta : array_like
An array with shape 1 (isotropic) or n (anisotropic) giving the autocorrelation parame-
ter(s).
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) with the values of the autocorrelation model.
sklearn.gaussian_process.correlation_models.linear
sklearn.gaussian_process.correlation_models.linear(theta, d)
Linear correlation model:
theta, dx --> r(theta, dx) = prod_{j=1..n} max(0, 1 - theta_j * d_ij) ,   i = 1,...,m
Parameters theta : array_like
An array with shape 1 (isotropic) or n (anisotropic) giving the autocorrelation parame-
ter(s).
dx : array_like
An array with shape (n_eval, n_features) giving the componentwise distances between
locations x and x' at which the correlation model should be evaluated.
Returns r : array_like
An array with shape (n_eval, ) with the values of the autocorrelation model.
sklearn.gaussian_process.regression_models.constant
sklearn.gaussian_process.regression_models.constant(x)
Zero order polynomial (constant, p = 1) regression model.
x --> f(x) = 1
Parameters x : array_like
An array with shape (n_eval, n_features) giving the locations x at which the regression
model should be evaluated.
Returns f : array_like
An array with shape (n_eval, p) with the values of the regression model.
sklearn.gaussian_process.regression_models.linear
sklearn.gaussian_process.regression_models.linear(x)
First order polynomial (linear, p = n+1) regression model.
x --> f(x) = [ 1, x_1, ..., x_n ].T
Parameters x : array_like
An array with shape (n_eval, n_features) giving the locations x at which the regression
model should be evaluated.
Returns f : array_like
An array with shape (n_eval, p) with the values of the regression model.
sklearn.gaussian_process.regression_models.quadratic
sklearn.gaussian_process.regression_models.quadratic(x)
Second order polynomial (quadratic, p = n*(n-1)/2+n+1) regression model.
x --> f(x) = [ 1, { x_i, i = 1,...,n }, { x_i * x_j, (i,j) = 1,...,n, i > j } ].T
Parameters x : array_like
An array with shape (n_eval, n_features) giving the locations x at which the regression
model should be evaluated.
Returns f : array_like
An array with shape (n_eval, p) with the values of the regression model.
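A minimal sketch showing the output shapes of the constant and linear regression models for a 2-feature input (the evaluation points are arbitrary):
>>> import numpy as np
>>> from sklearn.gaussian_process.regression_models import constant, linear
>>> x = np.array([[1.0, 2.0],
...               [3.0, 4.0]])      # n_eval=2 points with n_features=2
>>> constant(x).shape               # p = 1
(2, 1)
>>> linear(x).shape                 # p = n + 1
(2, 3)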
1.8.10 sklearn.grid_search: Grid Search
The sklearn.grid_search module includes utilities to fine-tune the parameters of an estimator.
User guide: See the Grid Search: setting estimator parameters section for further details.
grid_search.GridSearchCV(estimator, param_grid) Grid search on the parameters of a classifier
grid_search.IterGrid(param_grid) Generator over the combinations of the various parameter lists given
sklearn.grid_search.GridSearchCV
class sklearn.grid_search.GridSearchCV(estimator, param_grid, loss_func=None,
score_func=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
pre_dispatch='2*n_jobs')
Grid search on the parameters of a classifier
Important members are fit, predict.
GridSearchCV implements a fit method and a predict method like any classifier except that the parameters
of the classifier used to predict are optimized by cross-validation.
Parameters estimator: object type that implements the fit and predict methods :
An object of that type is instantiated for each grid point.
param_grid: dict :
Dictionary with parameters names (string) as keys and lists of parameter settings to try
as values.
loss_func: callable, optional :
A function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (small is good); if None is passed, the score of the estimator is maximized.
score_func: callable, optional :
A function that takes 2 arguments and compares them in order to evaluate the perfor-
mance of prediction (high is good). If None is passed, the score of the estimator is
maximized.
fit_params : dict, optional
parameters to pass to the fit method
n_jobs: int, optional :
number of jobs to run in parallel (default 1)
pre_dispatch: int, or string, optional :
Controls the number of jobs that get dispatched during parallel execution. Reducing
this number can be useful to avoid an explosion of memory consumption when more
jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for
lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the
jobs
An int, giving the exact number of total jobs that are spawned
A string, giving an expression as a function of n_jobs, as in 2*n_jobs
iid: boolean, optional :
If True, the data is assumed to be identically distributed across the folds, and the loss
minimized is the total loss per sample, and not the mean loss across the folds.
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3). Specific cross-validation objects
can be passed, see sklearn.cross_validation module for the list of possible objects
refit: boolean :
Refit the best estimator with the entire dataset. If False, it is impossible to make
predictions using this GridSearchCV instance after fitting.
verbose: integer :
Controls the verbosity: the higher, the more messages.
See Also:
IterGrid : generates all the combinations of a hyperparameter grid.
sklearn.cross_validation.train_test_split : utility function to split the data into a development
set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
Notes
The parameters selected are those that maximize the score of the left out data, unless an explicit score_func is
passed in which case it is used instead. If a loss function loss_func is passed, it overrides the score functions
and is minimized.
If n_jobs was set to a value higher than one, the data is copied for each point in the grid (and not n_jobs times).
This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is
large and not enough memory is available. A workaround in this case is to set pre_dispatch. Then, the memory
is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.
Examples
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> svr = svm.SVC()
>>> clf = grid_search.GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None,
estimator=SVC(C=1.0, cache_size=..., coef0=..., degree=...,
gamma=..., kernel='rbf', probability=False,
shrinking=True, tol=...),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid=...,
...)
Attributes
grid_scores_ : dict of any to float
Contains scores for all parameter combinations in param_grid.
best_estimator_ : estimator
Estimator that was chosen by grid search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data.
best_score_ : float
Score of best_estimator on the left out data.
best_params_ : dict
Parameter setting that gave the best results on the hold out data.
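A minimal sketch of how these attributes are typically inspected, continuing the iris example above (the output of fit is abbreviated):
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> clf = grid_search.GridSearchCV(svm.SVC(), parameters)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(cv=None, estimator=SVC(...),
...)
>>> sorted(clf.best_params_.keys())
['C', 'kernel']
>>> y_pred = clf.best_estimator_.predict(iris.data)   # refit=True, so the best model was refitted on the full data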
Methods
fit(X[, y]) Run fit with all sets of parameters
get_params([deep]) Get parameters for the estimator
score(X[, y])
set_params(**params) Set the parameters of the estimator.
__init__(estimator, param_grid, loss_func=None, score_func=None, fit_params=None, n_jobs=1,
iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs')
best_estimator
DEPRECATED: GridSearchCV.best_estimator is deprecated and will be removed in version 0.12. Please
use GridSearchCV.best_estimator_ instead.
best_score
DEPRECATED: GridSearchCV.best_score is deprecated and will be removed in version 0.12. Please use
GridSearchCV.best_score_ instead.
fit(X, y=None, **params)
Run fit with all sets of parameters
Returns the best classifier
Parameters X: array, [n_samples, n_features] :
Training vector, where n_samples is the number of samples and n_features is the number of features.
y: array-like, shape = [n_samples], optional :
Target vector relative to X for classification; None for unsupervised learning.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.grid_search.IterGrid
class sklearn.grid_search.IterGrid(param_grid)
Generator over the combinations of the various parameter lists given
Parameters param_grid: dict of string to sequence :
The parameter grid to explore, as a dictionary mapping estimator parameters to se-
quences of allowed values.
Returns params: dict of string to any :
Yields dictionaries mapping each estimator parameter to one of its allowed values.
See Also:
GridSearchCV : uses IterGrid to perform a full parallelized grid search.
Examples
>>> from sklearn.grid_search import IterGrid
>>> param_grid = {'a': [1, 2], 'b': [True, False]}
>>> list(IterGrid(param_grid))
[{'a': 1, 'b': True}, {'a': 1, 'b': False},
{'a': 2, 'b': True}, {'a': 2, 'b': False}]
__init__(param_grid)
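A minimal sketch of a hand-rolled grid search driven by IterGrid; scoring on the training data here is only for brevity:
>>> from sklearn.grid_search import IterGrid
>>> from sklearn.svm import SVC
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> scores = []
>>> for params in IterGrid(param_grid):
...     clf = SVC(**params).fit(iris.data, iris.target)
...     scores.append((params, clf.score(iris.data, iris.target)))
>>> len(scores)
4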
1.8.11 sklearn.hmm: Hidden Markov Models
The sklearn.hmm module implements hidden Markov models.
Warning: sklearn.hmm is orphaned, undocumented and has known numerical stability issues. If nobody volun-
teers to write documentation and make it more stable, this module will be removed in version 0.11.
User guide: See the Hidden Markov Models section for further details.
hmm.GaussianHMM([n_components, ...]) Hidden Markov Model with Gaussian emissions
hmm.MultinomialHMM([n_components, ...]) Hidden Markov Model with multinomial (discrete) emissions
hmm.GMMHMM([n_components, n_mix, startprob, ...]) Hidden Markov Model with Gaussian mixture emissions
sklearn.hmm.GaussianHMM
class sklearn.hmm.GaussianHMM(n_components=1, covariance_type='diag', startprob=None,
transmat=None, startprob_prior=None, transmat_prior=None, algorithm='viterbi',
means_prior=None, means_weight=0, covars_prior=0.01, covars_weight=1,
random_state=None, n_iter=10, thresh=0.01,
params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
Hidden Markov Model with Gaussian emissions
Representation of a hidden Markov model probability distribution. This class allows for easy evaluation of,
sampling from, and maximum-likelihood estimation of the parameters of an HMM.
Parameters n_components : int
Number of states.
_covariance_type : string
String describing the type of covariance parameters to use. Must be one of 'spherical',
'tied', 'diag', 'full'. Defaults to 'diag'.
See Also:
GMM : Gaussian mixture model
Examples
>>> from sklearn.hmm import GaussianHMM
>>> GaussianHMM(n_components=2)
...
GaussianHMM(algorithm='viterbi',...
Attributes
_covariance_type : string
String describing the type of covariance parameters used by the model. Must be one of 'spherical', 'tied', 'diag', 'full'.
n_features : int
Dimensionality of the Gaussian emissions.
n_components : int
Number of states in the model.
transmat : array, shape (n_components, n_components)
Matrix of transition probabilities between states.
startprob : array, shape (n_components,)
Initial state occupation distribution.
means : array, shape (n_components, n_features)
Mean parameters for each state.
covars : array
Covariance parameters for each state. The shape depends on _covariance_type: (n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'.
random_state : RandomState or an int seed (0 by default)
A random number generator instance.
n_iter : int, optional
Number of iterations to perform.
thresh : float, optional
Convergence threshold.
params : string, optional
Controls which parameters are updated in the training process. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
init_params : string, optional
Controls which parameters are initialized prior to training. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
Methods
decode(obs[, algorithm]) Find most likely state sequence corresponding to obs.
eval(obs) Compute the log probability under the model and compute posteriors
fit(obs, **kwargs) Estimate model parameters.
get_params([deep]) Get parameters for the estimator
predict(obs[, algorithm]) Find most likely state sequence corresponding to obs.
predict_proba(obs) Compute the posterior probability for each state in the model
rvs(*args, **kwargs) DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample([n, random_state]) Generate random samples from the model.
score(obs) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, covariance_type='diag', startprob=None, transmat=None,
startprob_prior=None, transmat_prior=None, algorithm='viterbi', means_prior=None,
means_weight=0, covars_prior=0.01, covars_weight=1, random_state=None, n_iter=10, thresh=0.01,
params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
algorithm
decoder algorithm
covariance_type
Covariance type of the model.
Must be one of 'spherical', 'tied', 'diag', 'full'.
covars_
Return covars as a full matrix.
decode(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs. Uses the selected algorithm for decoding.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
algorithm : string, one of the decoder_algorithms
decoder algorithm to be used
Returns logprob : float
Log probability of the maximum likelihood path through the HMM
state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
See Also:
eval : Compute the log probability under the model and posteriors
score : Compute the log probability under the model
eval(obs)
Compute the log probability under the model and compute posteriors
Implements rank and beam pruning in the forward-backward algorithm to speed up inference in large
models.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single point
in the sequence.
Returns logprob : float
Log likelihood of the sequence obs
posteriors: array_like, shape (n, n_components) :
Posterior probabilities of each state for each observation
402 Chapter 1. User Guide
scikit-learn user guide, Release 0.12-git
See Also:
score : Compute the log probability under the model
decode : Find most likely state sequence corresponding to obs
fit(obs, **kwargs)
Estimate model parameters.
An initialization step is performed before entering the EM algorithm. If you want to avoid this step,
set the keyword argument init_params to the empty string ''. Likewise, if you would like just to do an
initialization, call this method with n_iter=0.
Parameters obs : list
List of array-like observation sequences (shape (n_i, n_features)).
Notes
In general, logprob should be non-decreasing unless aggressive pruning is used. Decreasing logprob is
generally a sign of overfitting (e.g. a covariance parameter getting too small). You can fix this by getting
more training data, or decreasing covars_prior.
Please note that setting parameters in the fit method is deprecated and will be removed in the next
release. Set it on initialization instead.
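Given the warning above about the state of this module, the following is only a rough sketch of fit and decode on synthetic data; the sequences, n_components and n_iter values are arbitrary:
>>> import numpy as np
>>> from sklearn.hmm import GaussianHMM
>>> np.random.seed(0)
>>> seq1 = np.random.randn(100, 2)           # one observation sequence with 2-d emissions
>>> seq2 = np.random.randn(80, 2) + 3.0      # a second sequence centered elsewhere
>>> model = GaussianHMM(n_components=2, covariance_type='diag', n_iter=5)
>>> model = model.fit([seq1, seq2])          # EM training on a list of sequences
>>> logprob, states = model.decode(seq1)     # Viterbi path for one sequence
>>> len(states)
100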
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
means_
Mean parameters for each state.
predict(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
predict_proba(obs)
Compute the posterior probability for each state in the model
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns T : array-like, shape (n, n_components)
Returns the probability of the sample for each state in the model.
rvs(*args, **kwargs)
DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample(n=1, random_state=None)
Generate random samples from the model.
Parameters n : int
Number of samples to generate.
random_state: RandomState or an int seed (0 by default) :
A random number generator instance. If None is given, the object's random_state is
used
Returns (obs, hidden_states) :
obs : array_like, length n List of samples
hidden_states : array_like, length n List of hidden states
score(obs)
Compute the log probability under the model.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single data
point.
Returns logprob : float
Log likelihood of the obs
See Also:
eval : Compute the log probability under the model and posteriors
decode : Find most likely state sequence corresponding to obs
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
startprob_
Mixing startprob for each state.
transmat_
Matrix of transition probabilities.
sklearn.hmm.MultinomialHMM
class sklearn.hmm.MultinomialHMM(n_components=1, startprob=None, transmat=None,
startprob_prior=None, transmat_prior=None, algo-
rithm=viterbi, random_state=None, n_iter=10, thresh=0.01,
params=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,
init_params=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
Hidden Markov Model with multinomial (discrete) emissions
See Also:
GaussianHMM : HMM with Gaussian emissions
Examples
>>> from sklearn.hmm import MultinomialHMM
>>> MultinomialHMM(n_components=2)
...
MultinomialHMM(algorithm='viterbi',...
Attributes
n_components : int
Number of states in the model.
n_symbols : int
Number of possible symbols emitted by the model (in the observations).
transmat : array, shape (n_components, n_components)
Matrix of transition probabilities between states.
startprob : array, shape (n_components,)
Initial state occupation distribution.
emissionprob : array, shape (n_components, n_symbols)
Probability of emitting a given symbol when in each state.
random_state : RandomState or an int seed (0 by default)
A random number generator instance.
n_iter : int, optional
Number of iterations to perform.
thresh : float, optional
Convergence threshold.
params : string, optional
Controls which parameters are updated in the training process. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
init_params : string, optional
Controls which parameters are initialized prior to training. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
Methods
decode(obs[, algorithm]) Find most likely state sequence corresponding to obs.
eval(obs) Compute the log probability under the model and compute posteriors
fit(obs, **kwargs) Estimate model parameters.
get_params([deep]) Get parameters for the estimator
predict(obs[, algorithm]) Find most likely state sequence corresponding to obs.
predict_proba(obs) Compute the posterior probability for each state in the model
rvs(*args, **kwargs) DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample([n, random_state]) Generate random samples from the model.
score(obs) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, startprob=None, transmat=None, startprob_prior=None,
transmat_prior=None, algorithm='viterbi', random_state=None, n_iter=10, thresh=0.01,
params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
Create a hidden Markov model with multinomial emissions.
Parameters n_components : int
Number of states.
algorithm
decoder algorithm
decode(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs. Uses the selected algorithm for decoding.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
algorithm : string, one of the decoder_algorithms
decoder algorithm to be used
Returns logprob : float
Log probability of the maximum likelihood path through the HMM
state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
See Also:
eval : Compute the log probability under the model and posteriors
score : Compute the log probability under the model
emissionprob_
Emission probability distribution for each state.
eval(obs)
Compute the log probability under the model and compute posteriors
Implements rank and beam pruning in the forward-backward algorithm to speed up inference in large
models.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single point
in the sequence.
Returns logprob : float
Log likelihood of the sequence obs
posteriors: array_like, shape (n, n_components) :
Posterior probabilities of each state for each observation
See Also:
score : Compute the log probability under the model
decode : Find most likely state sequence corresponding to obs
fit(obs, **kwargs)
Estimate model parameters.
An initialization step is performed before entering the EM algorithm. If you want to avoid this step,
set the keyword argument init_params to the empty string ''. Likewise, if you would like just to do an
initialization, call this method with n_iter=0.
Parameters obs : list
List of array-like observation sequences (shape (n_i, n_features)).
Notes
In general, logprob should be non-decreasing unless aggressive pruning is used. Decreasing logprob is
generally a sign of overfitting (e.g. a covariance parameter getting too small). You can fix this by getting
more training data, or decreasing covars_prior.
Please note that setting parameters in the fit method is deprecated and will be removed in the next
release. Set it on initialization instead.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
predict_proba(obs)
Compute the posterior probability for each state in the model
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns T : array-like, shape (n, n_components)
Returns the probability of the sample for each state in the model.
rvs(*args, **kwargs)
DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample(n=1, random_state=None)
Generate random samples from the model.
Parameters n : int
Number of samples to generate.
random_state: RandomState or an int seed (0 by default) :
A random number generator instance. If None is given, the object's random_state is
used
Returns (obs, hidden_states) :
obs : array_like, length n List of samples
hidden_states : array_like, length n List of hidden states
score(obs)
Compute the log probability under the model.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single data
point.
Returns logprob : float
Log likelihood of the obs
See Also:
eval : Compute the log probability under the model and posteriors
decode : Find most likely state sequence corresponding to obs
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
startprob_
Mixing startprob for each state.
transmat_
Matrix of transition probabilities.
sklearn.hmm.GMMHMM
class sklearn.hmm.GMMHMM(n_components=1, n_mix=1, startprob=None, transmat=None,
startprob_prior=None, transmat_prior=None, algorithm='viterbi', gmms=None,
covariance_type='diag', covars_prior=0.01, random_state=None, n_iter=10, thresh=0.01,
params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
Hidden Markov Model with Gaussian mixture emissions
See Also:
GaussianHMM : HMM with Gaussian emissions
Examples
>>> from sklearn.hmm import GMMHMM
>>> GMMHMM(n_components=2, n_mix=10, covariance_type='diag')
...
GMMHMM(algorithm='viterbi', covariance_type='diag',...
Attributes
init_params : string, optional
Controls which parameters are initialized prior to training. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
params : string, optional
Controls which parameters are updated in the training process. Can contain any combination of 's' for startprob, 't' for transmat, 'm' for means, and 'c' for covars, etc. Defaults to all parameters.
n_components : int
Number of states in the model.
transmat : array, shape (n_components, n_components)
Matrix of transition probabilities between states.
startprob : array, shape (n_components,)
Initial state occupation distribution.
gmms : array of GMM objects, length n_components
GMM emission distributions for each state.
random_state : RandomState or an int seed (0 by default)
A random number generator instance.
n_iter : int, optional
Number of iterations to perform.
thresh : float, optional
Convergence threshold.
Methods
decode(obs[, algorithm]) Find most likely state sequence corresponding to obs.
eval(obs) Compute the log probability under the model and compute posteriors
fit(obs, **kwargs) Estimate model parameters.
get_params([deep]) Get parameters for the estimator
predict(obs[, algorithm]) Find most likely state sequence corresponding to obs.
predict_proba(obs) Compute the posterior probability for each state in the model
rvs(*args, **kwargs) DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample([n, random_state]) Generate random samples from the model.
score(obs) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, n_mix=1, startprob=None, transmat=None, startprob_prior=None,
transmat_prior=None, algorithm='viterbi', gmms=None, covariance_type='diag',
covars_prior=0.01, random_state=None, n_iter=10, thresh=0.01,
params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
Create a hidden Markov model with GMM emissions.
Parameters n_components : int
Number of states.
algorithm
decoder algorithm
covariance_type
Covariance type of the model.
Must be one of 'spherical', 'tied', 'diag', 'full'.
decode(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs. Uses the selected algorithm for decoding.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
algorithm : string, one of the decoder_algorithms
decoder algorithm to be used
Returns logprob : float
Log probability of the maximum likelihood path through the HMM
state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
See Also:
eval : Compute the log probability under the model and posteriors
score : Compute the log probability under the model
eval(obs)
Compute the log probability under the model and compute posteriors
Implements rank and beam pruning in the forward-backward algorithm to speed up inference in large
models.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single point
in the sequence.
Returns logprob : float
Log likelihood of the sequence obs
posteriors: array_like, shape (n, n_components) :
Posterior probabilities of each state for each observation
See Also:
score : Compute the log probability under the model
decode : Find most likely state sequence corresponding to obs
fit(obs, **kwargs)
Estimate model parameters.
An initialization step is performed before entering the EM algorithm. If you want to avoid this step,
set the keyword argument init_params to the empty string ''. Likewise, if you would like just to do an
initialization, call this method with n_iter=0.
Parameters obs : list
List of array-like observation sequences (shape (n_i, n_features)).
Notes
In general, logprob should be non-decreasing unless aggressive pruning is used. Decreasing logprob is
generally a sign of overfitting (e.g. a covariance parameter getting too small). You can fix this by getting
more training data, or decreasing covars_prior.
Please note that setting parameters in the fit method is deprecated and will be removed in the next
release. Set it on initialization instead.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(obs, algorithm=viterbi)
Find most likely state sequence corresponding to obs.
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns state_sequence : array_like, shape (n,)
Index of the most likely states for each observation
predict_proba(obs)
Compute the posterior probability for each state in the model
Parameters obs : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns T : array-like, shape (n, n_components)
Returns the probability of the sample for each state in the model.
rvs(*args, **kwargs)
DEPRECATED: rvs is deprecated in 0.11 and will be removed in 0.13; use sample instead
sample(n=1, random_state=None)
Generate random samples from the model.
Parameters n : int
Number of samples to generate.
random_state: RandomState or an int seed (0 by default) :
A random number generator instance. If None is given, the object's random_state is
used
Returns (obs, hidden_states) :
obs : array_like, length n List of samples
hidden_states : array_like, length n List of hidden states
score(obs)
Compute the log probability under the model.
Parameters obs : array_like, shape (n, n_features)
Sequence of n_features-dimensional data points. Each row corresponds to a single data
point.
Returns logprob : float
Log likelihood of the obs
See Also:
eval : Compute the log probability under the model and posteriors
decode : Find most likely state sequence corresponding to obs
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
startprob_
Mixing startprob for each state.
transmat_
Matrix of transition probabilities.
1.8.12 sklearn.kernel_approximation: Kernel Approximation
The sklearn.kernel_approximation module implements several approximate kernel feature maps based on
Fourier transforms.
User guide: See the Kernel Approximation section for further details.
kernel_approximation.RBFSampler([gamma, ...]) Approximates feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform
kernel_approximation.AdditiveChi2Sampler([...]) Approximate feature map for additive chi-squared kernel.
kernel_approximation.SkewedChi2Sampler([...]) Approximates feature map of the skewed chi-squared kernel by Monte Carlo approximation of its Fourier transform
sklearn.kernel_approximation.RBFSampler
class sklearn.kernel_approximation.RBFSampler(gamma=1.0, n_components=100.0, random_state=None)
Approximates feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.
Parameters gamma: float :
parameter of the RBF kernel: exp(-gamma * x**2)
n_components: int :
number of Monte Carlo samples per original feature. Equals the dimensionality of the
computed feature space.
random_state : {int, RandomState}, optional
If int, random_state is the seed used by the random number generator; if RandomState
instance, random_state is the random number generator.
Notes
See Random Features for Large-Scale Kernel Machines by A. Rahimi and Benjamin Recht.
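A minimal usage sketch (the data and parameter values are illustrative, not from the original documentation):
>>> import numpy as np
>>> from sklearn.kernel_approximation import RBFSampler
>>> X = np.random.RandomState(0).randn(10, 4) # 10 samples, 4 features
>>> sampler = RBFSampler(gamma=1.0, n_components=50, random_state=0)
>>> X_features = sampler.fit_transform(X) # maps X into the 50-dimensional approximate feature space
>>> X_features.shape
(10, 50)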
Methods
fit(X[, y]) Fit the model with X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Apply the approximate feature map to X.
__init__(gamma=1.0, n_components=100.0, random_state=None)
fit(X, y=None)
Fit the model with X.
Samples random projection according to n_features.
Parameters X: {array-like, sparse matrix}, shape (n_samples, n_features) :
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns self : object
Returns the transformer.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Apply the approximate feature map to X.
Parameters X: {array-like, sparse matrix}, shape (n_samples, n_features) :
New data, where n_samples is the number of samples and n_features is the number of features.
Returns X_new: array-like, shape (n_samples, n_components) :
sklearn.kernel_approximation.AdditiveChi2Sampler
class sklearn.kernel_approximation.AdditiveChi2Sampler(sample_steps=2, sample_interval=None)
Approximate feature map for the additive chi-squared kernel.
Uses sampling of the Fourier transform of the kernel characteristic at regular intervals.
Since the kernel that is to be approximated is additive, the components of the input vectors can be treated
separately. Each entry in the original space is transformed into 2*sample_steps + 1 features, where sample_steps
is a parameter of the method. Typical values of sample_steps include 1, 2 and 3.
Optimal choices for the sampling interval for certain data ranges can be computed (see the reference). The
default values should be reasonable.
Parameters sample_steps: int, optional :
Gives the number of (complex) sampling points.
sample_interval: float, optional :
Sampling interval. Must be specified when sample_steps is not in {1, 2, 3}.
Notes
See "Efficient additive kernels via explicit feature maps" by Vedaldi, A. and Zisserman, A., Computer Vision and
Pattern Recognition 2010.
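A minimal usage sketch (illustrative data, not from the original documentation); per the description above, the output has n_features * (2*sample_steps + 1) columns:
>>> import numpy as np
>>> from sklearn.kernel_approximation import AdditiveChi2Sampler
>>> X = np.abs(np.random.RandomState(0).randn(5, 3)) # chi-squared kernels expect non-negative input
>>> sampler = AdditiveChi2Sampler(sample_steps=2)
>>> X_transformed = sampler.fit_transform(X) # approximate feature map of X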
Methods
fit(X[, y]) Set parameters.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Apply approximate feature map to X.
__init__(sample_steps=2, sample_interval=None)
fit(X, y=None)
Set parameters.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Apply approximate feature map to X.
Parameters X: array-like, shape (n_samples, n_features) :
Returns X_new: array-like, shape (n_samples, n_features * (2n + 1)) :
sklearn.kernel_approximation.SkewedChi2Sampler
class sklearn.kernel_approximation.SkewedChi2Sampler(skewedness=1.0, n_components=100, random_state=None)
Approximates feature map of the skewed chi-squared kernel by Monte Carlo approximation of its Fourier
transform.
Parameters skewedness: float :
skewedness parameter of the kernel. Needs to be cross-validated.
n_components: int :
number of Monte Carlo samples per original feature. Equals the dimensionality of the
computed feature space.
random_state : {int, RandomState}, optional
If int, random_state is the seed used by the random number generator; if RandomState
instance, random_state is the random number generator.
Notes
See Random Fourier Approximations for Skewed Multiplicative Histogram Kernels by Fuxin Li, Catalin
Ionescu and Cristian Sminchisescu.
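A minimal usage sketch (illustrative data and parameters, not from the original documentation):
>>> import numpy as np
>>> from sklearn.kernel_approximation import SkewedChi2Sampler
>>> X = np.abs(np.random.RandomState(0).randn(8, 4)) # all entries must be larger than -skewedness
>>> sampler = SkewedChi2Sampler(skewedness=1.0, n_components=50, random_state=0)
>>> X_features = sampler.fit_transform(X)
>>> X_features.shape
(8, 50)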
Methods
fit(X[, y]) Fit the model with X.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y]) Apply the approximate feature map to X.
__init__(skewedness=1.0, n_components=100, random_state=None)
fit(X, y=None)
Fit the model with X.
Samples random projection according to n_features.
Parameters X: array-like, shape (n_samples, n_features) :
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns self : object
Returns the transformer.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, y=None)
Apply the approximate feature map to X.
Parameters X: array-like, shape (n_samples, n_features) :
New data, where n_samples is the number of samples and n_features is the number of features.
Returns X_new: array-like, shape (n_samples, n_components) :
1.8.13 sklearn.semi_supervised Semi-Supervised Learning
The sklearn.semi_supervised module implements semi-supervised learning algorithms. These algorithms
utilize small amounts of labeled data and large amounts of unlabeled data for classification tasks. This module
includes Label Propagation.
User guide: See the Semi-Supervised section for further details.
semi_supervised.LabelPropagation([kernel, ...]) Label Propagation classifier
semi_supervised.LabelSpreading([kernel, ...]) LabelSpreading model for semi-supervised learning
sklearn.semi_supervised.LabelPropagation
class sklearn.semi_supervised.LabelPropagation(kernel='rbf', gamma=20, n_neighbors=7, alpha=1, max_iters=30, tol=0.001)
Label Propagation classifier
Parameters kernel : {'knn', 'rbf'}
String identifier for kernel function to use. Only 'rbf' and 'knn' kernels are currently
supported.
gamma : float
parameter for rbf kernel
n_neighbors : integer > 0
parameter for knn kernel
alpha : float
clamping factor
max_iters : float
maximum number of iterations allowed
tol : float
Convergence tolerance: threshold to consider the system at steady state
See Also:
LabelSpreading : Alternate label propagation strategy, more robust to noise
References
Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with la-
bel propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002
http://pages.cs.wisc.edu/~jerryzhu/pub/CMU-CALD-02-107.pdf
Examples
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelPropagation
>>> label_prop_model = LabelPropagation()
>>> iris = datasets.load_iris()
>>> random_unlabeled_points = np.where(np.random.random_integers(0, 1,
... size=len(iris.target)))
>>> labels = np.copy(iris.target)
>>> labels[random_unlabeled_points] = -1
>>> label_prop_model.fit(iris.data, labels)
...
LabelPropagation(...)
Methods
fit(X, y) Fit a semi-supervised label propagation model based on the input data
get_params([deep]) Get parameters for the estimator
predict(X) Performs inductive inference across the model.
predict_proba(X) Predict probability for each possible outcome.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(kernel='rbf', gamma=20, n_neighbors=7, alpha=1, max_iters=30, tol=0.001)
fit(X, y)
Fit a semi-supervised label propagation model based on the input data.
All the input data is provided as matrix X (labeled and unlabeled), together with the corresponding label
array y with a dedicated marker value for unlabeled samples.
Parameters X : array-like, shape = [n_samples, n_features]
A {n_samples by n_samples} size matrix will be created from this
y : array_like, shape = [n_samples]
n_labeled_samples (unlabeled points are marked as -1) All unlabeled samples will be
transductively assigned labels
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Performs inductive inference across the model.
Parameters X : array_like, shape = [n_samples, n_features]
Returns y : array_like, shape = [n_samples]
Predictions for input data
predict_proba(X)
Predict probability for each possible outcome.
Compute the probability estimates for each single sample in X and each possible outcome seen during
training (categorical distribution).
Parameters X : array_like, shape = [n_samples, n_features]
Returns probabilities : array, shape = [n_samples, n_classes]
Normalized probability distributions across class labels
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.semi_supervised.LabelSpreading
class sklearn.semi_supervised.LabelSpreading(kernel='rbf', gamma=20, n_neighbors=7, alpha=0.2, max_iters=30, tol=0.001)
LabelSpreading model for semi-supervised learning
This model is similar to the basic Label Propagation algorithm, but uses an affinity matrix based on the normalized
graph Laplacian and soft clamping across the labels.
Parameters kernel : {'knn', 'rbf'}
String identifier for kernel function to use. Only 'rbf' and 'knn' kernels are currently
supported.
gamma : float
parameter for rbf kernel
n_neighbors : integer > 0
parameter for knn kernel
alpha : float
clamping factor
max_iters : float
maximum number of iterations allowed
tol : float
Convergence tolerance: threshold to consider the system at steady state
See Also:
LabelPropagation : Unregularized graph-based semi-supervised learning
References
Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schölkopf. Learning with local
and global consistency (2004) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3219
Examples
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.semi_supervised import LabelSpreading
>>> label_prop_model = LabelSpreading()
>>> iris = datasets.load_iris()
>>> random_unlabeled_points = np.where(np.random.random_integers(0, 1,
... size=len(iris.target)))
>>> labels = np.copy(iris.target)
>>> labels[random_unlabeled_points] = -1
>>> label_prop_model.fit(iris.data, labels)
...
LabelSpreading(...)
Methods
fit(X, y) Fit a semi-supervised label propagation model based on the input data
get_params([deep]) Get parameters for the estimator
predict(X) Performs inductive inference across the model.
predict_proba(X) Predict probability for each possible outcome.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(kernel='rbf', gamma=20, n_neighbors=7, alpha=0.2, max_iters=30, tol=0.001)
fit(X, y)
Fit a semi-supervised label propagation model based on the input data.
All the input data is provided as matrix X (labeled and unlabeled), together with the corresponding label
array y with a dedicated marker value for unlabeled samples.
Parameters X : array-like, shape = [n_samples, n_features]
A {n_samples by n_samples} size matrix will be created from this
y : array_like, shape = [n_samples]
n_labeled_samples (unlabeled points are marked as -1) All unlabeled samples will be
transductively assigned labels
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Performs inductive inference across the model.
Parameters X : array_like, shape = [n_samples, n_features]
Returns y : array_like, shape = [n_samples]
Predictions for input data
predict_proba(X)
Predict probability for each possible outcome.
Compute the probability estimates for each single sample in X and each possible outcome seen during
training (categorical distribution).
Parameters X : array_like, shape = [n_samples, n_features]
Returns probabilities : array, shape = [n_samples, n_classes]
Normalized probability distributions across class labels
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
1.8.14 sklearn.lda: Linear Discriminant Analysis
The sklearn.lda module implements Linear Discriminant Analysis (LDA).
User guide: See the Linear and Quadratic Discriminant Analysis section for further details.
lda.LDA([n_components, priors]) Linear Discriminant Analysis (LDA)
sklearn.lda.LDA
class sklearn.lda.LDA(n_components=None, priors=None)
Linear Discriminant Analysis (LDA)
A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using
Bayes' rule.
The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
The fitted model can also be used to reduce the dimensionality of the input, by projecting it to the most discriminative directions.
Parameters n_components: int :
Number of components (< n_classes - 1) for dimensionality reduction
priors : array, optional, shape = [n_classes]
Priors on classes
See Also:
sklearn.qda.QDA : Quadratic discriminant analysis
Examples
>>> import numpy as np
>>> from sklearn.lda import LDA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LDA()
>>> clf.fit(X, y)
LDA(n_components=None, priors=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]
Attributes
means_ array-like, shape = [n_classes, n_features] Class means
xbar_ float, shape = [n_features] Overall mean
priors_ array-like, shape = [n_classes] Class priors (sum to 1)
covariance_ array-like, shape = [n_features, n_features] Covariance matrix (shared by all classes)
Methods
decision_function(X) This function returns the decision function values related to each class
fit(X, y[, store_covariance, tol]) Fit the LDA model according to the given training data and parameters.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) This function does classification on an array of test vectors X.
predict_log_proba(X) This function returns posterior log-probabilities of classification
predict_proba(X) This function returns posterior probabilities of classification
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X) Project the data so as to maximize class separation (large separation between projected class means and small variance within each class).
__init__(n_components=None, priors=None)
decision_function(X)
This function returns the decision function values related to each class on an array of test vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples, n_classes]
fit(X, y, store_covariance=False, tol=0.0001)
Fit the LDA model according to the given training data and parameters.
Parameters X : array-like, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array, shape = [n_samples]
Target values (integers)
store_covariance : boolean
If True the covariance matrix (shared by all classes) is computed and stored in
self.covariance_ attribute.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
This function does classification on an array of test vectors X.
The predicted class C for each sample in X is returned.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
This function returns posterior log-probabilities of classification according to each class on an array of test
vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples, n_classes]
predict_proba(X)
This function returns posterior probabilities of classification according to each class on an array of test
vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples, n_classes]
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Project the data so as to maximize class separation (large separation between projected class means and
small variance within each class).
Parameters X : array-like, shape = [n_samples, n_features]
Returns X_new : array, shape = [n_samples, n_components]
1.8.15 sklearn.linear_model: Generalized Linear Models
The sklearn.linear_model module implements generalized linear models. It includes Ridge regression,
Bayesian Regression, Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent.
It also implements Stochastic Gradient Descent related algorithms.
User guide: See the Generalized Linear Models section for further details.
For dense data
linear_model.LinearRegression([...]) Ordinary least squares Linear Regression.
linear_model.Ridge([alpha, fit_intercept, ...]) Linear least squares with l2 regularization.
linear_model.RidgeClassifier([alpha, ...]) Classifier using Ridge regression.
linear_model.RidgeClassifierCV([alphas, ...]) Ridge classifier with built-in cross-validation.
linear_model.RidgeCV([alphas, ...]) Ridge regression with built-in cross-validation.
linear_model.Lasso([alpha, fit_intercept, ...]) Linear Model trained with L1 prior as regularizer (aka the Lasso)
linear_model.LassoCV([eps, n_alphas, ...]) Lasso linear model with iterative fitting along a regularization path
linear_model.ElasticNet([alpha, rho, ...]) Linear Model trained with L1 and L2 prior as regularizer
linear_model.ElasticNetCV([rho, eps, ...]) Elastic Net model with iterative fitting along a regularization path
linear_model.Lars([fit_intercept, verbose, ...]) Least Angle Regression model a.k.a. LAR
linear_model.LassoLars([alpha, ...]) Lasso model fit with Least Angle Regression a.k.a. Lars
linear_model.LarsCV([fit_intercept, ...]) Cross-validated Least Angle Regression model
linear_model.LassoLarsCV([fit_intercept, ...]) Cross-validated Lasso, using the LARS algorithm
linear_model.LassoLarsIC([criterion, ...]) Lasso model fit with Lars using BIC or AIC for model selection
linear_model.LogisticRegression([penalty, ...]) Logistic Regression (aka logit, MaxEnt) classifier.
linear_model.OrthogonalMatchingPursuit([...]) Orthogonal Matching Pursuit model (OMP)
linear_model.Perceptron([penalty, alpha, ...]) Perceptron
linear_model.SGDClassifier([loss, penalty, ...]) Linear model fitted by minimizing a regularized empirical loss with SGD.
linear_model.SGDRegressor([loss, penalty, ...]) Linear model fitted by minimizing a regularized empirical loss with SGD
linear_model.BayesianRidge([n_iter, tol, ...]) Bayesian ridge regression
linear_model.ARDRegression([n_iter, tol, ...]) Bayesian ARD regression.
linear_model.RandomizedLasso([alpha, ...]) Randomized Lasso
linear_model.RandomizedLogisticRegression([...]) Randomized Logistic Regression
sklearn.linear_model.LinearRegression
class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True)
Ordinary least squares Linear Regression.
Parameters fit_intercept : boolean, optional
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
Notes
From the implementation point of view, this is just plain Ordinary Least Squares (numpy.linalg.lstsq) wrapped
as a predictor object.
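A minimal usage sketch (illustrative data, not from the original documentation):
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1.0], [2.0], [3.0]]) # single feature
>>> y = np.array([2.0, 4.0, 6.0]) # y = 2 * x
>>> reg = LinearRegression()
>>> reg.fit(X, y)
...
LinearRegression(...)
>>> y_pred = reg.predict([[4.0]]) # predicted value close to 8.0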
Attributes
coef_ array Estimated coefficients for the linear regression problem.
intercept_ array Independent term in the linear model.
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, n_jobs]) Fit linear model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, normalize=False, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, n_jobs=1)
Fit linear model.
Parameters X : numpy array or sparse matrix of shape [n_samples,n_features]
Training data
y : numpy array of shape [n_samples, n_responses]
Target values
n_jobs : The number of jobs to use for the computation.
If -1 all CPUs are used. This will only provide speedup for n_response > 1 and sufficiently large problems.
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.Ridge
class sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, tol=0.001)
Linear least squares with l2 regularization.
This model solves a regression model where the loss function is the linear least squares function and regularization
is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has
built-in support for multi-variate regression (i.e., when y is a 2d-array of shape [n_samples, n_responses]).
Parameters alpha : float
Small positive values of alpha improve the conditioning of the problem and reduce the
variance of the estimates. Alpha corresponds to (2*C)^-1 in other linear models such as
LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
tol: float :
Precision of the solution.
See Also:
RidgeClassifier, RidgeCV
Examples
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = Ridge(alpha=1.0)
>>> clf.fit(X, y)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, normalize=False,
tol=0.001)
Attributes
coef_ array, shape = [n_features] or [n_responses, n_features] Weight vector(s).
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, sample_weight, solver]) Fit Ridge regression model
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, tol=0.001)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, sample_weight=1.0, solver='auto')
Fit Ridge regression model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : array-like, shape = [n_samples] or [n_samples, n_responses]
Target values
sample_weight : float or numpy array of shape [n_samples]
Individual weights for each sample
solver : {'auto', 'dense_cholesky', 'sparse_cg'}
Solver to use in the computational routines. 'dense_cholesky' will use the standard
scipy.linalg.solve function, 'sparse_cg' will use the conjugate gradient solver as found
in scipy.sparse.linalg.cg, while 'auto' will choose the most appropriate depending on the
matrix X.
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.RidgeClassifier
class sklearn.linear_model.RidgeClassifier(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, tol=0.001, class_weight=None)
Classifier using Ridge regression.
Parameters alpha : float
Small positive values of alpha improve the conditioning of the problem and reduce the
variance of the estimates. Alpha corresponds to (2*C)^-1 in other linear models such as
LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
tol: float :
Precision of the solution.
class_weight : dict, optional
Weights associated with classes in the form {class_label : weight}. If not given, all
classes are supposed to have weight one.
See Also:
Ridge, RidgeClassifierCV
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
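A minimal usage sketch (illustrative data, not from the original documentation):
>>> import numpy as np
>>> from sklearn.linear_model import RidgeClassifier
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([0, 0, 1, 1])
>>> clf = RidgeClassifier(alpha=1.0)
>>> clf.fit(X, y)
...
RidgeClassifier(...)
>>> y_pred = clf.predict([[3, 2]]) # expected to predict class 1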
Attributes
coef_ array, shape = [n_features] or [n_classes, n_features] Weight vector(s).
Methods
decision_function(X)
fit(X, y[, solver]) Fit Ridge regression model.
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values according to the fitted model.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, tol=0.001, class_weight=None)
fit(X, y, solver='auto')
Fit Ridge regression model.
Parameters X : {array-like, sparse matrix}, shape = [n_samples,n_features]
Training data
y : array-like, shape = [n_samples]
Target values
solver : {'auto', 'dense_cholesky', 'sparse_cg'}
Solver to use in the computational routines. 'dense_cholesky' will use the standard
scipy.linalg.solve function, 'sparse_cg' will use the conjugate gradient solver as found
in scipy.sparse.linalg.cg, while 'auto' will choose the most appropriate depending on the
matrix X.
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values according to the fitted model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns y : array, shape = [n_samples]
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.RidgeClassifierCV
class sklearn.linear_model.RidgeClassifierCV(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None, loss_func=None, cv=None, class_weight=None)
Ridge classifier with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.
Currently, only the n_features > n_samples case is handled efficiently.
Parameters alphas: numpy array of shape [n_alpha] :
Array of alpha values to try. Small positive values of alpha improve the conditioning of
the problem and reduce the variance of the estimates. Alpha corresponds to (2*C)^-1 in
other linear models such as LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
score_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (big is good); if None is passed, the score of the estimator is maximized
loss_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (small is good); if None is passed, the score of the estimator is maximized
cv : cross-validation generator, optional
If None, Generalized Cross-Validation (efficient Leave-One-Out) will be used.
class_weight : dict, optional
Weights associated with classes in the form {class_label : weight}. If not given, all
classes are supposed to have weight one.
See Also:
Ridge : Ridge regression
RidgeClassifier : Ridge classifier
RidgeCV : Ridge regression with built-in cross validation
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
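A minimal usage sketch (illustrative data and alpha grid, not from the original documentation):
>>> import numpy as np
>>> from sklearn.linear_model import RidgeClassifierCV
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([0, 0, 1, 1])
>>> clf = RidgeClassifierCV(alphas=np.array([0.1, 1.0, 10.0]))
>>> clf.fit(X, y)
...
RidgeClassifierCV(...)
>>> y_pred = clf.predict([[3, 2]]) # expected to predict class 1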
Methods
decision_function(X)
fit(X, y[, sample_weight, class_weight]) Fit the ridge classifier.
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values according to the fitted model.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None, loss_func=None, cv=None, class_weight=None)
fit(X, y, sample_weight=1.0, class_weight=None)
Fit the ridge classifier.
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
sample_weight : float or numpy array of shape [n_samples]
Sample weight
class_weight : dict, optional
Weights associated with classes in the form {class_label : weight}. If not given, all
classes are supposed to have weight one.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values according to the fitted model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns y : array, shape = [n_samples]
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.RidgeCV
class sklearn.linear_model.RidgeCV(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None, loss_func=None, cv=None, gcv_mode=None)
Ridge regression with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.
Parameters alphas: numpy array of shape [n_alpha] :
Array of alpha values to try. Small positive values of alpha improve the conditioning of
the problem and reduce the variance of the estimates. Alpha corresponds to (2*C)^-1
in other linear models such as LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
score_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (big is good); if None is passed, the score of the estimator is maximized
loss_func: callable, optional :
function that takes 2 arguments and compares them in order to evaluate the performance
of prediction (small is good); if None is passed, the score of the estimator is maximized
cv : cross-validation generator, optional
If None, Generalized Cross-Validation (efficient Leave-One-Out) will be used.
See Also:
Ridge : Ridge regression
RidgeClassifier : Ridge classifier
RidgeCV : Ridge regression with built-in cross validation
Attributes
coef_ array, shape = [n_features] or [n_classes, n_features] Weight vector(s).
gcv_mode {None, 'auto', 'svd', 'eigen'}, optional
Flag indicating which strategy to use when performing Generalized Cross-Validation. Options are:
'auto' : use svd if n_samples > n_features, otherwise use eigen
'svd' : force computation via singular value decomposition of X
'eigen' : force computation via eigendecomposition of X^T X
The 'auto' mode is the default and is intended to pick the cheaper option of the two depending upon the
shape of the training data.
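A minimal usage sketch (illustrative data, not from the original documentation):
>>> import numpy as np
>>> from sklearn.linear_model import RidgeCV
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(20, 3)
>>> y = X[:, 0] + 0.1 * rng.randn(20) # target mostly explained by the first feature
>>> reg = RidgeCV(alphas=np.array([0.1, 1.0, 10.0]))
>>> reg.fit(X, y)
...
RidgeCV(...)
>>> y_pred = reg.predict(X[:2]) # predictions for the first two samples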
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, sample_weight]) Fit Ridge regression model
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alphas=array([ 0.1, 1., 10. ]), fit_intercept=True, normalize=False, score_func=None, loss_func=None, cv=None, gcv_mode=None)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, sample_weight=1.0)
Fit Ridge regression model
Parameters X : array-like, shape = [n_samples, n_features]
Training data
y : array-like, shape = [n_samples] or [n_samples, n_responses]
Target values
sample_weight : float or array-like of shape [n_samples]
Sample weight
Returns self : Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.Lasso
class sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute='auto', copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False)
Linear Model trained with L1 prior as regularizer (aka the Lasso)
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Technically the Lasso model is optimizing the same objective function as the Elastic Net with rho=1.0 (no L2
penalty).
Parameters alpha : float, optional
Constant that multiplies the L1 term. Defaults to 1.0
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: int, optional :
The maximum number of iterations
tol: float, optional :
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization; otherwise,
just erase the previous solution.
positive: bool, optional :
When set to True, forces the coefficients to be positive.
See Also:
lars_path, lasso_path, LassoLars, LassoCV, LassoLarsCV,
sklearn.decomposition.sparse_encode
Notes
The algorithm used to fit the model is coordinate descent.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a fortran
contiguous numpy array.
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=auto, tol=0.0001,
warm_start=False)
>>> print(clf.coef_)
[ 0.85 0. ]
>>> print(clf.intercept_)
0.15
Attributes
coef_ array, shape = [n_features] parameter vector (w in the formulation formula)
intercept_ float independent term in decision function.
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, Xy, coef_init]) Fit Elastic Net model with coordinate descent
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_intercept=True, normalize=False, precompute='auto', copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, Xy=None, coef_init=None)
Fit Elastic Net model with coordinate descent
Parameters X: ndarray, (n_samples, n_features) :
Data
y: ndarray, (n_samples) :
Target
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
coef_init: ndarray of shape n_features :
The initial coefficients to warm-start the optimization
Notes
Coordinate descent is an algorithm that considers each column of data at a time hence it will automatically
convert the X input as a fortran contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.LassoCV
class sklearn.linear_model.LassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False)
Lasso linear model with iterative fitting along a regularization path
The best model is selected by cross-validation.
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: int, optional :
The maximum number of iterations
tol: float, optional :
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3). Specific cross-validation
objects can be passed, see the sklearn.cross_validation module for the list of possible objects
verbose : bool or integer
amount of verbosity
See Also:
lars_path, lasso_path, LassoLars, Lasso, LassoLarsCV
Notes
See examples/linear_model/lasso_path_with_crossvalidation.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a fortran
contiguous numpy array.
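A minimal usage sketch (illustrative data, not from the original documentation):
>>> import numpy as np
>>> from sklearn.linear_model import LassoCV
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(50, 5)
>>> y = 3.0 * X[:, 0] + 0.5 * rng.randn(50) # only the first feature is informative
>>> reg = LassoCV(cv=3)
>>> reg.fit(X, y)
...
LassoCV(...)
>>> best_alpha = reg.alpha_ # regularization strength selected by cross-validation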
Attributes
alpha_: float The amount of penalization chosen by cross validation
coef_ array, shape = [n_features] parameter vector (w in the formulation formula)
intercept_ float independent term in decision function.
mse_path_: array, shape = [n_alphas, n_folds] mean square error for the test set on each fold, varying alpha
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit linear model with coordinate descent along decreasing alphas
get_params([deep]) Get parameters for the estimator
path(X, y[, eps, n_alphas, alphas, ...]) Compute Lasso path with coordinate descent
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit linear model with coordinate descent along decreasing alphas using cross-validation
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
static path(X, y, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, fit_intercept=True, normalize=False, copy_X=True, verbose=False, **params)
Compute Lasso path with coordinate descent
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as fortran contiguous data to avoid unnecessary memory
duplication
y : numpy array of shape [n_samples]
Target values
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
fit_intercept : bool
Whether or not to fit an intercept
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity
params : kwargs
keyword arguments passed to the Lasso objects
Returns models : a list of models along the regularization path
See Also:
lars_path, Lasso, LassoLars, LassoCV, LassoLarsCV,
sklearn.decomposition.sparse_encode
Notes
See examples/linear_model/plot_lasso_coordinate_descent_path.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
fortran contiguous numpy array.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.ElasticNet
class sklearn.linear_model.ElasticNet(alpha=1.0, rho=0.5, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False)
Linear Model trained with L1 and L2 prior as regularizer
Minimizes the objective function:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * rho * ||w||_1 + 0.5 * alpha * (1 - rho) * ||w||^2_2
If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2
where:
alpha = a + b and rho = a / (a + b)
The parameter rho corresponds to alpha in the glmnet R package while alpha corresponds to the lambda parameter
in glmnet. Specifically, rho = 1 is the lasso penalty. Currently, rho <= 0.01 is not reliable, unless you supply
your own sequence of alpha.
Parameters alpha : float
Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact
mathematical meaning of this parameter.
rho : float
The ElasticNet mixing parameter, with 0 < rho <= 1. For rho = 0 the penalty is an L2
penalty. For rho = 1 it is an L1 penalty. For 0 < rho < 1, the penalty is a combination of
L1 and L2.
fit_intercept: bool :
Whether the intercept should be estimated or not. If False, the data is assumed to be
already centered.
normalize : boolean, optional
If True, the regressors X are normalized
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter: int, optional :
The maximum number of iterations
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
tol: float, optional :
The tolerance for the optimization: if the updates are smaller than tol, the optimization
code checks the dual gap for optimality and continues until it is smaller than tol.
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization; otherwise,
just erase the previous solution.
positive: bool, optional :
When set to True, forces the coefcients to be positive.
Notes
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a Fortran-contiguous numpy array.
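Examples
The original entry has no Examples section; the following is only an illustrative sketch (synthetic data, arbitrary alpha and rho values) of the constructor parameters described above:

>>> import numpy as np
>>> from sklearn import linear_model
>>> X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
>>> y = np.array([0.0, 1.0, 2.0])
>>> # rho close to 1 weights the L1 term, rho close to 0 the L2 term
>>> clf = linear_model.ElasticNet(alpha=0.1, rho=0.7)
>>> clf = clf.fit(X, y)
>>> clf.coef_.shape
(2,)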
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, Xy, coef_init]) Fit Elastic Net model with coordinate descent
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, rho=0.5, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, Xy=None, coef_init=None)
Fit Elastic Net model with coordinate descent
Parameters X: ndarray, (n_samples, n_features) :
Data
y: ndarray, (n_samples) :
Target
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
coef_init : ndarray of shape n_features
The initial coefficients to warm-start the optimization
Notes
Coordinate descent is an algorithm that considers each column of data at a time, hence it will automatically convert the X input to a Fortran-contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
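As a minimal sketch of this advice (illustrative only, not part of the original documentation), the conversion can be done once with numpy before fitting:

>>> import numpy as np
>>> from sklearn import linear_model
>>> X = np.asfortranarray([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
>>> y = np.array([0.0, 1.0, 2.0])
>>> X.flags["F_CONTIGUOUS"]
True
>>> _ = linear_model.ElasticNet(alpha=0.5).fit(X, y)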
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.ElasticNetCV
class sklearn.linear_model.ElasticNetCV(rho=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1)
Elastic Net model with iterative fitting along a regularization path
The best model is selected by cross-validation.
Parameters rho : float, optional
float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties). For rho = 0 the penalty is an L2 penalty. For rho = 1 it is an L1 penalty. For 0 < rho < 1, the penalty is a combination of L1 and L2. This parameter can be a list, in which case the different values are tested by cross-validation and the one giving the best prediction score is used. Note that a good choice of list of values for rho is often to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be passed as argument.
max_iter : int, optional
The maximum number of iterations
tol : float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
cv : integer or cross-validation generator, optional
If an integer is passed, it is the number of folds (default 3). Specific cross-validation objects can be passed, see the sklearn.cross_validation module for the list of possible objects
verbose : bool or integer
Amount of verbosity
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs. Note that
this is used only if multiple values for rho are given.
See Also:
enet_path, ElasticNet
Notes
See examples/linear_model/lasso_path_with_crossvalidation.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a Fortran-contiguous numpy array.
The parameter rho corresponds to alpha in the glmnet R package while alpha corresponds to the lambda parameter in glmnet. More specifically, the optimization objective is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * rho * ||w||_1 + 0.5 * alpha * (1 - rho) * ||w||^2_2
If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2
for:
alpha = a + b and rho = a / (a + b)
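Examples
The original entry has no Examples section; the following is only an illustrative sketch (synthetic data, an arbitrary grid of rho values as suggested above, not part of the original documentation):

>>> import numpy as np
>>> from sklearn import linear_model
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(50, 5)
>>> y = X[:, 0] + 0.1 * rng.randn(50)
>>> clf = linear_model.ElasticNetCV(rho=[.1, .5, .7, .9, .95, .99, 1], cv=3)
>>> _ = clf.fit(X, y)
>>> 0 < clf.rho_ <= 1
True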
Attributes
alpha_ : float
The amount of penalization chosen by cross-validation
rho_ : float
The compromise between l1 and l2 penalization chosen by cross-validation
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
mse_path_ : array, shape = [n_rho, n_alpha, n_folds]
mean square error for the test set on each fold, varying rho and alpha
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit linear model with coordinate descent along decreasing alphas
get_params([deep]) Get parameters for the estimator
path(X, y[, rho, eps, n_alphas, alphas, ...]) Compute Elastic-Net path with coordinate descent
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(rho=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit linear model with coordinate descent along decreasing alphas using cross-validation
Parameters X : numpy array of shape [n_samples, n_features]
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory duplication
y : numpy array of shape [n_samples]
Target values
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
static path(X, y, rho=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, fit_intercept=True, normalize=False, copy_X=True, verbose=False, **params)
Compute Elastic-Net path with coordinate descent
The Elastic Net optimization function is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * rho * ||w||_1 + 0.5 * alpha * (1 - rho) * ||w||^2_2
Parameters X : numpy array of shape [n_samples, n_features]
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory duplication
y : numpy array of shape [n_samples]
Target values
rho : float, optional
float between 0 and 1 passed to ElasticNet (scaling between l1 and l2 penalties). rho=1 corresponds to the Lasso
eps : float
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
fit_intercept : bool
Whether to fit an intercept
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity
params : kwargs
keyword arguments passed to the Lasso objects
Returns models : a list of models along the regularization path
See Also:
ElasticNet, ElasticNetCV
Notes
See examples/plot_lasso_coordinate_descent_path.py for an example.
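As a rough, illustrative sketch of how the returned list of models can be used (assuming, per the 0.12 behaviour documented above, one fitted ElasticNet per alpha on the default grid; not part of the original documentation):

>>> import numpy as np
>>> from sklearn import linear_model
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(30, 4)
>>> y = X[:, 0] + 0.01 * rng.randn(30)
>>> models = linear_model.ElasticNetCV.path(X, y, rho=0.5, n_alphas=5)
>>> alphas = [m.alpha for m in models]  # one model per alpha along the path
>>> len(models)
5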
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.Lars
class sklearn.linear_model.Lars(fit_intercept=True, verbose=False, normalize=True, precompute='auto', n_nonzero_coefs=500, eps=2.2204460492503131e-16, copy_X=True)
Least Angle Regression model a.k.a. LAR
Parameters n_nonzero_coefs : int, optional
Target number of non-zero coefficients. Use np.inf for no limit.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to false, no intercept will be
used in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
See Also:
lars_path, LarsCV, sklearn.decomposition.sparse_encode
http://en.wikipedia.org/wiki/Least_angle_regression
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.Lars(n_nonzero_coefs=1)
>>> clf.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111])
...
Lars(copy_X=True, eps=..., fit_intercept=True, n_nonzero_coefs=1,
normalize=True, precompute='auto', verbose=False)
>>> print(clf.coef_)
[ 0. -1.11...]
Attributes
coef_ array, shape = [n_features] parameter vector (w in the formulation formula)
intercept_ float independent term in decision function.
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, verbose=False, normalize=True, precompute='auto', n_nonzero_coefs=500, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.LassoLars
class sklearn.linear_model.LassoLars(alpha=1.0, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, copy_X=True)
Lasso model fit with Least Angle Regression, a.k.a. Lars
It is a Linear Model trained with an L1 prior as regularizer.
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
max_iter: integer, optional :
Maximum number of iterations to perform.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
See Also:
lars_path, lasso_path, Lasso, LassoCV, LassoLarsCV, sklearn.decomposition.sparse_encode
http://en.wikipedia.org/wiki/Least_angle_regression
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.LassoLars(alpha=0.01)
>>> clf.fit([[-1, 1], [0, 0], [1, 1]], [-1, 0, -1])
...
LassoLars(alpha=0.01, copy_X=True, eps=..., fit_intercept=True,
max_iter=500, normalize=True, precompute='auto', verbose=False)
>>> print(clf.coef_)
[ 0. -0.963257...]
Attributes
coef_ array, shape = [n_features] parameter vector (w in the formulation formula)
intercept_ float independent term in decision function.
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.LarsCV
class sklearn.linear_model.LarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
Cross-validated Least Angle Regression model
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
max_iter : integer, optional
Maximum number of iterations to perform.
cv : cross-validation generator, optional
see sklearn.cross_validation module. If None is passed, default to a 5-fold strategy
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-validation
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems.
See Also:
lars_path, LassoLars, LassoLarsCV
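Examples
No example appears in the original entry; the following is only an illustrative sketch with synthetic data (arbitrary sizes, not part of the original documentation):

>>> import numpy as np
>>> from sklearn import linear_model
>>> rng = np.random.RandomState(42)
>>> X = rng.randn(40, 6)
>>> y = X[:, 1] + 0.05 * rng.randn(40)
>>> clf = linear_model.LarsCV()
>>> _ = clf.fit(X, y)
>>> clf.coef_.shape
(6,)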
Attributes
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
coef_path : array, shape = [n_features, n_alpha]
the varying values of the coefficients along the path
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
Training data.
y : array-like, shape = [n_samples]
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.LassoLarsCV
class sklearn.linear_model.LassoLarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
Cross-validated Lasso, using the LARS algorithm
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
max_iter : integer, optional
Maximum number of iterations to perform.
cv : cross-validation generator, optional
see sklearn.cross_validation module. If None is passed, default to a 5-fold strategy
max_n_alphas : integer, optional
The maximum number of points on the path used to compute the residuals in the cross-validation
n_jobs : integer, optional
Number of CPUs to use during the cross validation. If -1, use all the CPUs
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal factors. Increase this for very ill-conditioned systems.
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
See Also:
lars_path, LassoLars, LarsCV, LassoCV
Notes
The object solves the same problem as the LassoCV object. However, unlike the LassoCV, it finds the relevant alpha values by itself. In general, because of this property, it will be more stable. However, it is more fragile on heavily multicollinear datasets.
It is more efficient than the LassoCV if only a small number of features are selected compared to the total number, for instance if there are very few samples compared to the number of features.
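Examples
No example appears in the original entry; the following is only an illustrative sketch with synthetic data (arbitrary sizes, not part of the original documentation):

>>> import numpy as np
>>> from sklearn import linear_model
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(60, 8)
>>> y = X[:, 2] + 0.1 * rng.randn(60)
>>> clf = linear_model.LassoLarsCV()
>>> _ = clf.fit(X, y)
>>> clf.coef_.shape
(8,)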
Attributes
coef_ : array, shape = [n_features]
parameter vector (w in the formulation formula)
intercept_ : float
independent term in decision function.
coef_path : array, shape = [n_features, n_alpha]
the varying values of the coefficients along the path
alphas_ : array, shape = [n_alpha]
the different values of alpha along the path
cv_alphas : array, shape = [n_cv_alphas]
all the values of alpha along the path for the different folds
cv_mse_path_ : array, shape = [n_folds, n_cv_alphas]
the mean square error on left-out for each fold along the path (alpha values given by cv_alphas)
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute='auto', cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
Training data.
y : array-like, shape = [n_samples]
Target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.LassoLarsIC
class sklearn.linear_model.LassoLarsIC(criterion='aic', fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, copy_X=True)
Lasso model fit with Lars using BIC or AIC for model selection
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
AIC is the Akaike information criterion and BIC is the Bayes Information criterion. Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model. A good model should explain the data well while being simple.
Parameters criterion : 'bic' | 'aic'
The type of criterion to use.
fit_intercept : boolean
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
precompute : True | False | auto | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to auto let
us decide. The Gram matrix can also be passed as argument.
max_iter: integer, optional :
Maximum number of iterations to perform. Can be used for early stopping.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
See Also:
lars_path, LassoLars, LassoLarsCV
Notes
The estimation of the number of degrees of freedom is given by:
On the degrees of freedom of the lasso Hui Zou, Trevor Hastie, and Robert Tibshirani Ann. Statist. Volume
35, Number 5 (2007), 2173-2192.
http://en.wikipedia.org/wiki/Akaike_information_criterion http://en.wikipedia.org/wiki/Bayesian_information_criterion
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.LassoLarsIC(criterion='bic')
>>> clf.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111])
...
LassoLarsIC(copy_X=True, criterion='bic', eps=..., fit_intercept=True,
max_iter=500, normalize=True, precompute='auto',
verbose=False)
>>> print(clf.coef_)
[ 0. -1.11...]
Attributes
coef_ array, shape = [n_features] parameter vector (w in the formulation formula)
intercept_ float independent term in decision function.
alpha_ float the alpha parameter chosen by the information criterion
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, copy_X]) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(criterion='aic', fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, copy_X=True)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, copy_X=True)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.LogisticRegression
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None)
Logistic Regression (aka logit, MaxEnt) classifier.
In the multiclass case, the training algorithm uses a one-vs.-all (OvA) scheme, rather than the true multinomial
LR.
This class implements L1 and L2 regularized logistic regression using the liblinear library. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
Parameters penalty : string, 'l1' or 'l2'
Used to specify the norm used in the penalization
dual : boolean
Dual or primal formulation. Dual formulation is only implemented for l2 penalty. Prefer dual=False when n_samples > n_features.
C : float, optional (default=1.0)
Inverse of the regularization strength; the smaller it is, the stronger the regularization.
fit_intercept : bool, default: True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
intercept_scaling : float, default: 1
when self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a synthetic feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note: the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased
tol : float, optional
tolerance for stopping criteria
See Also:
LinearSVC
Notes
The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
References:
LIBLINEAR - A Library for Large Linear Classification: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75. http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
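Examples
The original entry has no Examples section; the following is only an illustrative sketch (tiny synthetic dataset, default-like parameters, not part of the original documentation):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
>>> y = np.array([0, 0, 1, 1])
>>> clf = LogisticRegression(C=1.0, penalty='l2')
>>> _ = clf.fit(X, y)
>>> int(clf.predict([[-0.8, -1.0]])[0])
0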
Attributes
coef_ : array, shape = [n_classes-1, n_features]
Coefficient of the features in the decision function. coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.
intercept_ : array, shape = [n_classes-1]
intercept (a.k.a. bias) added to the decision function. It is available only when the parameter fit_intercept is set to True.
Methods
decision_function(X) Decision function value for X according to the trained model.
fit(X, y[, class_weight]) Fit the model according to the given training data.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values of X according to the fitted model.
predict_log_proba(X) Log of Probability estimates.
predict_proba(X) Probability estimates.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None)
decision_function(X)
Decision function value for X according to the trained model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_class]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None)
Fit the model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target vector relative to X
class_weight : {dict, auto}, optional
Weights associated with classes. If not given, all classes are supposed to have weight
one.
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values of X according to the fitted model.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Log of Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
predict_proba(X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.linear_model.OrthogonalMatchingPursuit
class sklearn.linear_model.OrthogonalMatchingPursuit(copy_X=True, copy_Gram=True, copy_Xy=True, n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute_gram=False)
Orthogonal Matching Pursuit model (OMP)
Parameters n_nonzero_coefs : int, optional
Desired number of non-zero entries in the solution. If None (by default) this value is set
to 10% of n_features.
tol : float, optional
Maximum norm of the residual. If not None, overrides n_nonzero_coefs.
fit_intercept : boolean, optional
whether to calculate the intercept for this model. If set to false, no intercept will be used
in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional
If False, the regressors X are assumed to be already normalized.
precompute_gram : {True, False, 'auto'}
Whether to use a precomputed Gram and Xy matrix to speed up calculations. Improves performance when n_targets or n_samples is very large. Note that if you already have such matrices, you can pass them directly to the fit method.
copy_X : bool, optional
Whether the design matrix X must be copied by the algorithm. A false value is only
helpful if X is already Fortran-ordered, otherwise a copy is made anyway.
copy_Gram : bool, optional
Whether the gram matrix must be copied by the algorithm. A false value is only helpful
if X is already Fortran-ordered, otherwise a copy is made anyway.
copy_Xy : bool, optional
Whether the covariance vector Xy must be copied by the algorithm. If False, it may be
overwritten.
See Also:
orthogonal_mp, orthogonal_mp_gram, lars_path, Lars, LassoLars,
decomposition.sparse_encode, decomposition.sparse_encode_parallel
Notes
Orthogonal matching pursuit was introduced in G. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit, Technical Report - CS Technion, April 2008. http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
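Examples
No example appears in the original entry; the following is only an illustrative sketch (noiseless synthetic signal built from two columns, not part of the original documentation):

>>> import numpy as np
>>> from sklearn.linear_model import OrthogonalMatchingPursuit
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(30, 10)
>>> y = X[:, 3] + 0.5 * X[:, 7]
>>> omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2)
>>> _ = omp.fit(X, y)
>>> int((omp.coef_ != 0).sum())
2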
Attributes
coef_ array, shape = (n_features,) or (n_features, n_targets) parameter vector (w in the formulation formula)
intercept_ float or array, shape = (n_targets,) independent term in decision function.
Methods
decision_function(X) Decision function of the linear model
fit(X, y[, Gram, Xy]) Fit the model using X, y as training data.
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(copy_X=True, copy_Gram=True, copy_Xy=True, n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute_gram=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y, Gram=None, Xy=None)
Fit the model using X, y as training data.
Parameters X: array-like, shape = (n_samples, n_features) :
Training data.
y: array-like, shape = (n_samples,) or (n_samples, n_targets) :
Target values.
Gram: array-like, shape = (n_features, n_features) (optional) :
Gram matrix of the input data: X.T * X
Xy: array-like, shape = (n_features,) or (n_features, n_targets) :
(optional) Input targets multiplied by X: X.T * y
Returns self: object :
returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.linear_model.Perceptron
class sklearn.linear_model.Perceptron(penalty=None, alpha=0.0001, fit_intercept=True, n_iter=5, shuffle=False, verbose=0, eta0=1.0, n_jobs=1, seed=0, class_weight=None, warm_start=False)
Perceptron
Parameters penalty : None, 'l2' or 'l1' or 'elasticnet'
The penalty (aka regularization term) to be used. Defaults to None.
alpha : float
Constant that multiplies the regularization term if regularization is used. Defaults to 0.0001
fit_intercept : bool
Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
n_iter : int, optional
The number of passes over the training data (aka epochs). Defaults to 5.
shuffle : bool, optional
Whether or not the training data should be shuffled after each epoch. Defaults to False.
seed : int, optional
The seed of the pseudo random number generator to use when shuffling the data.
verbose : integer, optional
The verbosity level
n_jobs : integer, optional
The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means all CPUs. Defaults to 1.
eta0 : double
Constant by which the updates are multiplied. Defaults to 1.
class_weight : dict {class_label: weight}, 'auto' or None, optional
Preset for the class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The 'auto' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
See Also:
SGDClassifier
Notes
Perceptron and SGDClassier share the same underlying implementation. In fact, Perceptron() is equivalent to
SGDClassier(loss=perceptron, eta0=1, learning_rate=constant, penalty=None).
References
http://en.wikipedia.org/wiki/Perceptron and references therein.
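Examples
No example appears in the original entry; the following is only an illustrative sketch (tiny linearly separable dataset, not part of the original documentation):

>>> import numpy as np
>>> from sklearn.linear_model import Perceptron
>>> X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
>>> y = np.array([0, 0, 1, 1])
>>> clf = Perceptron(n_iter=10)
>>> _ = clf.fit(X, y)
>>> clf.score(X, y)
1.0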
Attributes
coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
Constants in decision function.
Methods
decision_function(X) Predict signed distance to the hyperplane (aka confidence score)
fit(X, y[, coef_init, intercept_init, ...]) Fit linear model with Stochastic Gradient Descent.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X, y[, classes, class_weight, ...]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
predict_proba(X) Predict class membership probability
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(penalty=None, alpha=0.0001, fit_intercept=True, n_iter=5, shuffle=False, verbose=0, eta0=1.0, n_jobs=1, seed=0, class_weight=None, warm_start=False)
classes
DEPRECATED: to be removed in v0.12; use classes_ instead.
decision_function(X)
Predict signed distance to the hyperplane (aka confidence score)
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples,n_classes] :
The signed distances to the hyperplane(s).
fit(X, y, coef_init=None, intercept_init=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : numpy array of shape [n_samples]
Target values
coef_init : array, shape = [n_classes, n_features]
The initial coefficients to warm-start the optimization.
intercept_init : array, shape = [n_classes]
The initial intercept to warm-start the optimization.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y, classes=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Subset of the training data
y : numpy array of shape [n_samples]
Subset of the target values
classes : array, shape = [n_classes]
Classes across all calls to partial_fit. Can be obtained via np.unique(y_all), where y_all is the target vector of the entire dataset. This argument is required for the first call to partial_fit and can be omitted in the subsequent calls. Note that y doesn't need to contain all labels in classes.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Array containing the predicted class labels.
predict_proba(X)
Predict class membership probability
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples, :
n_classes] :
Contains the membership probabilities of the positive class.
References
The justification for the formula in the loss="modified_huber" case is in appendix B of:
http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.linear_model.SGDClassifier
class sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, rho=0.85, fit_intercept=True, n_iter=5, shuffle=False, verbose=0, epsilon=0.1, n_jobs=1, seed=0, learning_rate='optimal', eta0=0.0, power_t=0.5, class_weight=None, warm_start=False)
Linear model fitted by minimizing a regularized empirical loss with SGD.
SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).
The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.
This implementation works with data represented as dense numpy arrays of floating point values for the features.
Parameters loss : str, 'hinge' or 'log' or 'modified_huber'
The loss function to be used. Defaults to 'hinge'. The hinge loss is a margin loss used by standard linear SVM models. The log loss is the loss of logistic regression models and can be used for probability estimation in binary classifiers. 'modified_huber' is another smooth loss that brings tolerance to outliers.
penalty : str, 'l2' or 'l1' or 'elasticnet'
The penalty (aka regularization term) to be used. Defaults to 'l2' which is the standard regularizer for linear SVM models. 'l1' and 'elasticnet' might bring sparsity to the model (feature selection) not achievable with 'l2'.
alpha : float
Constant that multiplies the regularization term. Defaults to 0.0001
rho : float
The Elastic Net mixing parameter, with 0 < rho <= 1. Defaults to 0.85.
fit_intercept : bool
Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
n_iter : int, optional
The number of passes over the training data (aka epochs). Defaults to 5.
shuffle : bool, optional
Whether or not the training data should be shuffled after each epoch. Defaults to False.
seed : int, optional
The seed of the pseudo random number generator to use when shuffling the data.
verbose : integer, optional
The verbosity level
n_jobs : integer, optional
The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means all CPUs. Defaults to 1.
learning_rate : string, optional
The learning rate schedule:
'constant': eta = eta0
'optimal': eta = 1.0 / (t + t0) [default]
'invscaling': eta = eta0 / pow(t, power_t)
eta0 : double
The initial learning rate, used by the 'constant' and 'invscaling' schedules [default 0.0].
power_t : double
The exponent for the inverse scaling learning rate [default 0.5].
class_weight : dict {class_label: weight}, 'auto' or None, optional
Preset for the class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The 'auto' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
See Also:
LinearSVC, LogisticRegression, Perceptron
Examples
>>> import numpy as np
>>> from sklearn import linear_model
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> Y = np.array([1, 1, 2, 2])
>>> clf = linear_model.SGDClassifier()
>>> clf.fit(X, Y)
...
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, learning_rate='optimal', loss='hinge',
n_iter=5, n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
shuffle=False, verbose=0, warm_start=False)
>>> print(clf.predict([[-0.8, -1]]))
[1]
Attributes
coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
Constants in decision function.
Methods
decision_function(X) Predict signed distance to the hyperplane (aka confidence score)
fit(X, y[, coef_init, intercept_init, ...]) Fit linear model with Stochastic Gradient Descent.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X, y[, classes, class_weight, ...]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
predict_proba(X) Predict class membership probability
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(loss='hinge', penalty='l2', alpha=0.0001, rho=0.85, fit_intercept=True, n_iter=5,
shuffle=False, verbose=0, epsilon=0.1, n_jobs=1, seed=0, learning_rate='optimal', eta0=0.0,
power_t=0.5, class_weight=None, warm_start=False)
classes
DEPRECATED: to be removed in v0.12; use classes_ instead.
decision_function(X)
Predict signed distance to the hyperplane (aka confidence score)
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples,n_classes] :
The signed distances to the hyperplane(s).
fit(X, y, coef_init=None, intercept_init=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : numpy array of shape [n_samples]
Target values
coef_init : array, shape = [n_classes,n_features]
The initial coefficients to warm-start the optimization.
intercept_init : array, shape = [n_classes]
The initial intercept to warm-start the optimization.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y, classes=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Subset of the training data
y : numpy array of shape [n_samples]
Subset of the target values
classes : array, shape = [n_classes]
Classes across all calls to partial_fit. Can be obtained via np.unique(y_all), where
y_all is the target vector of the entire dataset. This argument is required for the first call
to partial_fit and can be omitted in the subsequent calls. Note that y doesn't need to
contain all labels in classes.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
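Since the classes argument is only needed on the first call, incremental training can look like the following sketch (the mini-batches X1/y1 and X2/y2 and their values are illustrative, not part of the API):
>>> import numpy as np
>>> from sklearn import linear_model
>>> X1, y1 = np.array([[-1., -1.], [1., 1.]]), np.array([1, 2])
>>> X2, y2 = np.array([[-2., -1.], [2., 1.]]), np.array([1, 2])
>>> clf = linear_model.SGDClassifier()
>>> # first call: announce every class that will ever appear
>>> clf = clf.partial_fit(X1, y1, classes=np.array([1, 2]))
>>> # later calls on further mini-batches may omit classes
>>> clf = clf.partial_fit(X2, y2)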
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Array containing the predicted class labels.
predict_proba(X)
Predict class membership probability
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples, n_classes] :
Contains the membership probabilities of the positive class.
References
The justification for the formula in the loss='modified_huber' case is given in appendix B of:
http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
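As an illustration of the <component>__<parameter> convention, a small sketch (the pipeline and the step names 'pca' and 'sgd' below are made up for the example):
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import SGDClassifier
>>> pipe = Pipeline([('pca', PCA()), ('sgd', SGDClassifier())])
>>> # update parameters of the nested steps through the pipeline
>>> pipe = pipe.set_params(pca__n_components=2, sgd__alpha=0.001)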
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
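A small sketch of how transform can be used for feature selection after an L1-penalized fit (the toy data and the 'mean' threshold are illustrative):
>>> import numpy as np
>>> from sklearn import linear_model
>>> X = np.array([[-1., 1.], [-2., 1.], [1., 1.], [2., 1.]])
>>> y = np.array([1, 1, 2, 2])
>>> clf = linear_model.SGDClassifier(penalty='l1').fit(X, y)
>>> # keep only the features whose importance is at least the mean importance
>>> X_reduced = clf.transform(X, threshold='mean')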
sklearn.linear_model.SGDRegressor
class sklearn.linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001,
rho=0.85, fit_intercept=True, n_iter=5, shuffle=False,
verbose=0, epsilon=0.1, p=None, seed=0,
learning_rate='invscaling', eta0=0.01, power_t=0.25,
warm_start=False)
Linear model fitted by minimizing a regularized empirical loss with SGD
SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the
model is updated along the way with a decreasing strength schedule (aka learning rate).
The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector
using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If
the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for
learning sparse models and achieve online feature selection.
This implementation works with data represented as dense numpy arrays of floating point values for the features.
Parameters loss : str, 'squared_loss' or 'huber'
The loss function to be used. Defaults to 'squared_loss' which refers to the ordinary
least squares fit. 'huber' is an epsilon insensitive loss function for robust regression.
penalty : str, 'l2' or 'l1' or 'elasticnet'
The penalty (aka regularization term) to be used. Defaults to 'l2', which is the standard
regularizer for linear SVM models. 'l1' and 'elasticnet' might bring sparsity to the model
(feature selection) not achievable with 'l2'.
alpha : float
Constant that multiplies the regularization term. Defaults to 0.0001.
rho : float
The Elastic Net mixing parameter, with 0 < rho <= 1. Defaults to 0.85.
fit_intercept : bool
Whether the intercept should be estimated or not. If False, the data is assumed to be
already centered. Defaults to True.
n_iter : int, optional
The number of passes over the training data (aka epochs). Defaults to 5.
shuffle : bool, optional
Whether or not the training data should be shuffled after each epoch. Defaults to False.
seed : int, optional
The seed of the pseudo random number generator to use when shuffling the data.
verbose : integer, optional
The verbosity level.
epsilon : float
Epsilon in the epsilon-insensitive huber loss function; only if loss=='huber'.
learning_rate : string, optional
The learning rate schedule:
constant: eta = eta0
optimal: eta = 1.0 / (t + t0)
invscaling: eta = eta0 / pow(t, power_t) [default]
eta0 : double, optional
The initial learning rate [default 0.01].
power_t : double, optional
The exponent for inverse scaling learning rate [default 0.25].
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise,
just erase the previous solution.
See Also:
Ridge, ElasticNet, Lasso, SVR
Examples
>>> import numpy as np
>>> from sklearn import linear_model
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = linear_model.SGDRegressor()
>>> clf.fit(X, y)
SGDRegressor(alpha=0.0001, epsilon=0.1, eta0=0.01, fit_intercept=True,
learning_rate='invscaling', loss='squared_loss', n_iter=5, p=None,
penalty='l2', power_t=0.25, rho=0.85, seed=0, shuffle=False,
verbose=0, warm_start=False)
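Predictions on new samples can then be obtained with predict; a minimal continuation of the example above (the all-zeros query point is arbitrary):
>>> X_new = np.zeros((1, n_features))
>>> y_new = clf.predict(X_new)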
Attributes
coef_ array, shape = [n_features] Weights assigned to the features.
intercept_ array, shape = [1] The intercept term.
Methods
decision_function(X) Predict using the linear model
fit(X, y[, coef_init, intercept_init, ...]) Fit linear model with Stochastic Gradient Descent.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X, y[, sample_weight]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(loss='squared_loss', penalty='l2', alpha=0.0001, rho=0.85, fit_intercept=True, n_iter=5,
shuffle=False, verbose=0, epsilon=0.1, p=None, seed=0, learning_rate='invscaling',
eta0=0.01, power_t=0.25, warm_start=False)
decision_function(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Predicted target values per element in X.
fit(X, y, coef_init=None, intercept_init=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : numpy array of shape [n_samples]
Target values
coef_init : array, shape = [n_features]
The initial coefficients to warm-start the optimization.
intercept_init : array, shape = [1]
The initial intercept to warm-start the optimization.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Subset of training data
y : numpy array of shape [n_samples]
Subset of target values
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Predicted target values per element in X.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
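Equivalently, writing y_i for the true targets, \hat{y}_i for the predictions and \bar{y} for the mean of the true targets, the definition above restates as:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}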
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.linear_model.BayesianRidge
class sklearn.linear_model.BayesianRidge(n_iter=300, tol=0.001, alpha_1=1e-06,
alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06,
compute_score=False, fit_intercept=True,
normalize=False, copy_X=True, verbose=False)
Bayesian ridge regression
Fit a Bayesian ridge model and optimize the regularization parameters lambda (precision of the weights) and
alpha (precision of the noise).
Parameters X : array, shape = (n_samples, n_features)
Training vectors.
y : array, shape = (n_samples)
Target values for training vectors.
n_iter : int, optional
Maximum number of iterations. Default is 300.
tol : float, optional
Stop the algorithm if w has converged. Default is 1.e-3.
alpha_1 : float, optional
Hyper-parameter : shape parameter for the Gamma distribution prior over the alpha
parameter. Default is 1.e-6.
alpha_2 : float, optional
Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution
prior over the alpha parameter. Default is 1.e-6.
lambda_1 : float, optional
Hyper-parameter : shape parameter for the Gamma distribution prior over the lambda
parameter. Default is 1.e-6.
lambda_2 : float, optional
Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution
prior over the lambda parameter. Default is 1.e-6.
compute_score : boolean, optional
If True, compute the objective function at each step of the model. Default is False.
fit_intercept : boolean, optional
Whether to calculate the intercept for this model. If set to False, no intercept will be used
in calculations (e.g. data is expected to be already centered). Default is True.
normalize : boolean, optional, default False
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : boolean, optional, default False
Verbose mode when fitting the model.
Notes
See examples/linear_model/plot_bayesian_ridge.py for an example.
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.BayesianRidge()
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
...
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06,
n_iter=300, normalize=False, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.])
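After fitting, the estimated coefficients and the optimized precisions described above are available as attributes; continuing the example (the variable names on the left are illustrative):
>>> clf.coef_.shape
(2,)
>>> noise_precision, weight_precision = clf.alpha_, clf.lambda_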
Attributes
coef_ array, shape = (n_features) Coefficients of the regression model (mean of distribution)
alpha_ float estimated precision of the noise.
lambda_ array, shape = (n_features) estimated precisions of the weights.
scores_ float if computed, value of the objective function (to be maximized)
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the model
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06,
compute_score=False, fit_intercept=True, normalize=False, copy_X=True, verbose=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the model
Parameters X : numpy array of shape [n_samples,n_features]
Training data
y : numpy array of shape [n_samples]
Target values
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.ARDRegression
class sklearn.linear_model.ARDRegression(n_iter=300, tol=0.001, alpha_1=1e-06,
alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06,
compute_score=False, threshold_lambda=10000.0,
fit_intercept=True, normalize=False, copy_X=True,
verbose=False)
Bayesian ARD regression.
Fit the weights of a regression model, using an ARD prior. The weights of the regression model are assumed
to be Gaussian distributed. Also estimate the parameters lambda (precisions of the distributions of the
weights) and alpha (precision of the distribution of the noise). The estimation is done by an iterative procedure
(Evidence Maximization).
Parameters X : array, shape = (n_samples, n_features)
Training vectors.
y : array, shape = (n_samples)
Target values for training vectors
n_iter : int, optional
Maximum number of iterations. Default is 300
tol : float, optional
Stop the algorithm if w has converged. Default is 1.e-3.
alpha_1 : float, optional
Hyper-parameter : shape parameter for the Gamma distribution prior over the alpha
parameter. Default is 1.e-6.
alpha_2 : float, optional
Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution
prior over the alpha parameter. Default is 1.e-6.
lambda_1 : float, optional
Hyper-parameter : shape parameter for the Gamma distribution prior over the lambda
parameter. Default is 1.e-6.
lambda_2 : float, optional
Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution
prior over the lambda parameter. Default is 1.e-6.
compute_score : boolean, optional
If True, compute the objective function at each step of the model. Default is False.
threshold_lambda : float, optional
Threshold for removing (pruning) weights with high precision from the computation.
Default is 1.e+4.
fit_intercept : boolean, optional
Whether to calculate the intercept for this model. If set to False, no intercept will be used
in calculations (e.g. data is expected to be already centered). Default is True.
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True.
If True, X will be copied; else, it may be overwritten.
verbose : boolean, optional, default False
Verbose mode when fitting the model.
Notes
See examples/linear_model/plot_ard.py for an example.
Examples
>>> from sklearn import linear_model
>>> clf = linear_model.ARDRegression()
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
...
ARDRegression(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06,
n_iter=300, normalize=False, threshold_lambda=10000.0, tol=0.001,
verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.])
Attributes
coef_ array, shape = (n_features) Coefficients of the regression model (mean of distribution)
alpha_ float estimated precision of the noise.
lambda_ array, shape = (n_features) estimated precisions of the weights.
sigma_ array, shape = (n_features, n_features) estimated variance-covariance matrix of the weights
scores_ float if computed, value of the objective function (to be maximized)
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit the ARDRegression model according to the given training data
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06,
compute_score=False, threshold_lambda=10000.0, fit_intercept=True, normalize=False,
copy_X=True, verbose=False)
decision_function(X)
Decision function of the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
fit(X, y)
Fit the ARDRegression model according to the given training data and parameters.
Iterative procedure to maximize the evidence
Parameters X : array-like, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array, shape = [n_samples]
Target values (integers)
Returns self : returns an instance of self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.RandomizedLasso
class sklearn.linear_model.RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75,
n_resampling=200, selection_threshold=0.25,
fit_intercept=True, verbose=False,
normalize=True, precompute='auto',
max_iter=500, eps=2.2204460492503131e-16,
random_state=None, n_jobs=1,
pre_dispatch='3*n_jobs',
memory=Memory(cachedir=None))
Randomized Lasso
Randomized Lasso works by resampling the training data and computing a Lasso on each resampling. In short, the
features selected more often are good features. It is also known as stability selection.
Parameters alpha : float, 'aic', or 'bic'
The regularization parameter alpha parameter in the Lasso. Warning: this is not the
alpha parameter in the stability selection article which is scaling.
scaling : float
The alpha parameter in the stability selection article used to randomly scale the features.
Should be between 0 and 1.
sample_fraction : float
The fraction of samples to be used in each randomized design. Should be between 0
and 1. If 1, all samples are used.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to False, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
precompute : True | False | 'auto'
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
max_iter : integer, optional
Maximum number of iterations to perform in the Lars algorithm.
eps : float, optional
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems. Unlike the tol parameter in some
iterative optimization-based algorithms, this parameter does not control the tolerance of
the optimization.
n_jobs : integer, optional
Number of CPUs to use during the resampling. If -1, use all the CPUs
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing
this number can be useful to avoid an explosion of memory consumption when more
jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for
lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the
jobs
An int, giving the exact number of total jobs that are spawned
A string, giving an expression as a function of n_jobs, as in 2*n_jobs
memory : Instance of joblib.Memory or string
Used for internal caching. By default, no caching is done. If a string is given, it is
the path to the caching directory.
See Also:
RandomizedLogisticRegression, LogisticRegression
Notes
See examples/linear_model/plot_sparse_recovery.py for an example.
References
Stability selection Nicolai Meinshausen, Peter Buhlmann Journal of the Royal Statistical Society: Series B
Volume 72, Issue 4, pages 417-473, September 2010 DOI: 10.1111/j.1467-9868.2010.00740.x
Examples
>>> from sklearn.linear_model import RandomizedLasso
>>> randomized_lasso = RandomizedLasso()
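A sketch of the typical workflow, fitting on data and reading off the selection mask and stability scores (the random data below, the alpha value and the variable names are illustrative):
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.randn(50, 5)
>>> y = X[:, 0] + 0.1 * np.random.randn(50)    # only the first feature is informative
>>> rlasso = RandomizedLasso(alpha=0.025, random_state=0).fit(X, y)
>>> selected = rlasso.get_support()            # boolean mask over the 5 features
>>> scores = rlasso.scores_                    # stability scores between 0 and 1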
Attributes
scores_ array, shape = [n_features]
Feature scores between 0 and 1.
all_scores_ array, shape = [n_features, n_reg_parameter]
Feature scores between 0 and 1 for all values of the regularization parameter. The
reference article suggests scores_ is the max of all_scores_.
Methods
fit(X, y) Fit the model using X, y as training data.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(alpha='aic', scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25,
fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500,
eps=2.2204460492503131e-16, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs',
memory=Memory(cachedir=None))
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that its possible to update each component
of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
sklearn.linear_model.RandomizedLogisticRegression
class sklearn.linear_model.RandomizedLogisticRegression(C=1, scaling=0.5,
sample_fraction=0.75,
n_resampling=200,
selection_threshold=0.25,
tol=0.001, fit_intercept=True,
verbose=False, normalize=True,
random_state=None, n_jobs=1,
pre_dispatch='3*n_jobs',
memory=Memory(cachedir=None))
Randomized Logistic Regression
Randomized Logistic Regression works by resampling the training data and computing a LogisticRegression on
each resampling. In short, the features selected more often are good features. It is also known as stability selection.
Parameters C : float
The regularization parameter C in the LogisticRegression.
scaling : float
The alpha parameter in the stability selection article used to randomly scale the features.
Should be between 0 and 1.
sample_fraction : float
The fraction of samples to be used in each randomized design. Should be between 0
and 1. If 1, all samples are used.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set to False, no intercept will be used
in calculations (e.g. data is expected to be already centered).
verbose : boolean or integer, optional
Sets the verbosity amount
normalize : boolean, optional
If True, the regressors X are normalized
tol : float, optional
Tolerance for stopping criteria of LogisticRegression.
n_jobs : integer, optional
Number of CPUs to use during the resampling. If -1, use all the CPUs
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing
this number can be useful to avoid an explosion of memory consumption when more
jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for
lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the
jobs
An int, giving the exact number of total jobs that are spawned
A string, giving an expression as a function of n_jobs, as in 2*n_jobs
memory : Instance of joblib.Memory or string
Used for internal caching. By default, no caching is done. If a string is given, it is
the path to the caching directory.
See Also:
RandomizedLasso, Lasso, ElasticNet
Notes
See examples/linear_model/plot_randomized_lasso.py for an example.
References
Stability selection Nicolai Meinshausen, Peter Buhlmann Journal of the Royal Statistical Society: Series B
Volume 72, Issue 4, pages 417-473, September 2010 DOI: 10.1111/j.1467-9868.2010.00740.x
Examples
>>> from sklearn.linear_model import RandomizedLogisticRegression
>>> randomized_logistic = RandomizedLogisticRegression()
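The workflow mirrors RandomizedLasso: fit on labelled data, then inspect the selection mask (the random data and variable names below are illustrative):
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.randn(60, 4)
>>> y = (X[:, 0] > 0).astype(int)              # the label depends only on the first feature
>>> rlog = RandomizedLogisticRegression(C=1.0, random_state=0).fit(X, y)
>>> selected = rlog.get_support()              # boolean mask over the 4 features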
Attributes
scores_ array, shape = [n_features]
Feature scores between 0 and 1.
all_scores_ array, shape = [n_features, n_reg_parameter]
Feature scores between 0 and 1 for all values of the regularization parameter. The
reference article suggests scores_ is the max of all_scores_.
Methods
fit(X, y) Fit the model using X, y as training data.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
get_support([indices]) Return a mask, or list, of the features/indices selected.
inverse_transform(X) Transform a new matrix using the selected features
set_params(**params) Set the parameters of the estimator.
transform(X) Transform a new matrix using the selected features
__init__(C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25,
tol=0.001, fit_intercept=True, verbose=False, normalize=True, random_state=None,
n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
fit(X, y)
Fit the model using X, y as training data.
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
Returns self : object
returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
get_support(indices=False)
Return a mask, or list, of the features/indices selected.
inverse_transform(X)
Transform a new matrix using the selected features
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that its possible to update each component
of a nested object.
Returns self :
transform(X)
Transform a new matrix using the selected features
linear_model.lasso_path(X, y[, eps, ...]) Compute Lasso path with coordinate descent
linear_model.lars_path(X, y[, Xy, Gram, ...]) Compute Least Angle Regression and Lasso path
linear_model.orthogonal_mp(X, y[, ...]) Orthogonal Matching Pursuit (OMP)
linear_model.orthogonal_mp_gram(Gram, Xy[, ...]) Gram Orthogonal Matching Pursuit (OMP)
linear_model.lasso_stability_path(X, y[, ...]) Stability path based on randomized Lasso estimates
sklearn.linear_model.lasso_path
sklearn.linear_model.lasso_path(X, y, eps=0.001, n_alphas=100, alphas=None, precompute='auto',
Xy=None, fit_intercept=True, normalize=False,
copy_X=True, verbose=False, **params)
Compute Lasso path with coordinate descent
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters X : numpy array of shape [n_samples,n_features]
Training data. Pass directly as Fortran-contiguous data to avoid unnecessary memory
duplication.
y : numpy array of shape [n_samples]
Target values
eps : float, optional
Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3
n_alphas : int, optional
Number of alphas along the regularization path
alphas : numpy array, optional
List of alphas where to compute the models. If None alphas are set automatically
precompute : True | False | 'auto' | array-like
Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let
us decide. The Gram matrix can also be passed as argument.
Xy : array-like, optional
Xy = np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is
precomputed.
fit_intercept : bool
Whether to fit an intercept or not.
normalize : boolean, optional
If True, the regressors X are normalized
copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.
verbose : bool or integer
Amount of verbosity
params : kwargs
keyword arguments passed to the Lasso objects
Returns models : a list of models along the regularization path
See Also:
lars_path, Lasso, LassoLars, LassoCV, LassoLarsCV, sklearn.decomposition.sparse_encode
Notes
See examples/linear_model/plot_lasso_coordinate_descent_path.py for an example.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
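An illustrative call on random data, consistent with the return value documented above (in this release each element of the returned list is a Lasso model fitted for one alpha; the variable names are illustrative):
>>> import numpy as np
>>> from sklearn.linear_model import lasso_path
>>> np.random.seed(0)
>>> X = np.random.randn(20, 3)
>>> y = X[:, 0] + 0.01 * np.random.randn(20)
>>> models = lasso_path(X, y, eps=1e-2, n_alphas=10)
>>> alphas = [m.alpha for m in models]           # regularization strengths along the path
>>> coefs = np.array([m.coef_ for m in models])  # one coefficient vector per alpha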
sklearn.linear_model.lars_path
sklearn.linear_model.lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0,
method='lar', copy_X=True, eps=2.2204460492503131e-16,
copy_Gram=True, verbose=False)
Compute Least Angle Regression and Lasso path
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Parameters X: array, shape: (n_samples, n_features) :
Input data
y: array, shape: (n_samples) :
Input targets
max_iter: integer, optional :
Maximum number of iterations to perform, set to infinity for no limit.
Gram: None, 'auto', array, shape: (n_features, n_features), optional :
Precomputed Gram matrix (X' * X), if 'auto', the Gram matrix is precomputed from
the given X, if there are more samples than features.
alpha_min: float, optional :
Minimum correlation along the path. It corresponds to the regularization parameter
alpha parameter in the Lasso.
method: {'lar', 'lasso'} :
Specifies the returned model. Select 'lar' for Least Angle Regression, 'lasso' for the
Lasso.
eps: float, optional :
The machine-precision regularization in the computation of the Cholesky diagonal fac-
tors. Increase this for very ill-conditioned systems.
copy_X: bool :
If False, X is overwritten.
copy_Gram: bool :
If False, Gram is overwritten.
Returns alphas: array, shape: (max_features + 1,) :
Maximum of covariances (in absolute value) at each iteration.
active: array, shape (max_features,) :
Indices of active variables at the end of the path.
coefs: array, shape (n_features, max_features + 1) :
Coefficients along the path
See Also:
lasso_path, LassoLars, Lars, LassoLarsCV, LarsCV, sklearn.decomposition.sparse_encode
Notes
http://en.wikipedia.org/wiki/Least-angle_regression
http://en.wikipedia.org/wiki/Lasso_(statistics)#LASSO_method
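An illustrative call on random data, matching the return values documented above:
>>> import numpy as np
>>> from sklearn.linear_model import lars_path
>>> np.random.seed(0)
>>> X = np.random.randn(20, 4)
>>> y = X[:, 1] + 0.01 * np.random.randn(20)
>>> alphas, active, coefs = lars_path(X, y, method='lasso')
>>> coefs.shape[0]                             # one row of coefficients per feature
4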
sklearn.linear_model.orthogonal_mp
sklearn.linear_model.orthogonal_mp(X, y, n_nonzero_coefs=None, tol=None,
precompute_gram=False, copy_X=True)
Orthogonal Matching Pursuit (OMP)
Solves n_targets Orthogonal Matching Pursuit problems. An instance of the problem has the form:
When parametrized by the number of non-zero coefficients using n_nonzero_coefs:
argmin ||y - Xgamma||^2 subject to ||gamma||_0 <= n_nonzero_coefs
When parametrized by error using the parameter tol:
argmin ||gamma||_0 subject to ||y - Xgamma||^2 <= tol
Parameters X: array, shape = (n_samples, n_features) :
Input data. Columns are assumed to have unit norm.
y: array, shape = (n_samples,) or (n_samples, n_targets) :
Input targets
n_nonzero_coefs: int :
Desired number of non-zero entries in the solution. If None (by default) this value is set
to 10% of n_features.
tol: float :
Maximum norm of the residual. If not None, overrides n_nonzero_coefs.
precompute_gram: {True, False, auto}, :
Whether to perform precomputations. Improves performance when n_targets or
n_samples is very large.
copy_X: bool, optional :
Whether the design matrix X must be copied by the algorithm. A false value is only
helpful if X is already Fortran-ordered, otherwise a copy is made anyway.
Returns coef: array, shape = (n_features,) or (n_features, n_targets) :
Coefficients of the OMP solution
See Also:
OrthogonalMatchingPursuit, orthogonal_mp_gram, lars_path,
decomposition.sparse_encode, decomposition.sparse_encode_parallel
Notes
Orthogonal matching pursuit was introduced in G. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of
the K-SVD Algorithm using Batch Orthogonal Matching Pursuit Technical Report - CS Technion, April 2008.
http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
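A small sketch of the n_nonzero_coefs-constrained form on a random dictionary with unit-norm columns (as assumed above); the data and variable names are illustrative:
>>> import numpy as np
>>> from sklearn.linear_model import orthogonal_mp
>>> np.random.seed(0)
>>> X = np.random.randn(30, 10)
>>> X /= np.sqrt((X ** 2).sum(axis=0))         # normalize columns to unit norm
>>> gamma = np.zeros(10)
>>> gamma[[2, 7]] = [1.5, -2.0]                # a 2-sparse coefficient vector
>>> y = np.dot(X, gamma)
>>> coef = orthogonal_mp(X, y, n_nonzero_coefs=2)
>>> support = np.flatnonzero(coef)             # indices of the recovered non-zeros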
sklearn.linear_model.orthogonal_mp_gram
sklearn.linear_model.orthogonal_mp_gram(Gram, Xy, n_nonzero_coefs=None, tol=None,
norms_squared=None, copy_Gram=True,
copy_Xy=True)
Gram Orthogonal Matching Pursuit (OMP)
Solves n_targets Orthogonal Matching Pursuit problems using only the Gram matrix X.T * X and the product
X.T * y.
Parameters Gram: array, shape = (n_features, n_features) :
Gram matrix of the input data: X.T * X
Xy: array, shape = (n_features,) or (n_features, n_targets) :
Input targets multiplied by X: X.T * y
n_nonzero_coefs: int :
Desired number of non-zero entries in the solution. If None (by default) this value is set
to 10% of n_features.
tol: float :
Maximum norm of the residual. If not None, overrides n_nonzero_coefs.
norms_squared: array-like, shape = (n_targets,) :
Squared L2 norms of the lines of y. Required if tol is not None.
copy_Gram: bool, optional :
Whether the gram matrix must be copied by the algorithm. A false value is only helpful
if it is already Fortran-ordered, otherwise a copy is made anyway.
copy_Xy: bool, optional :
Whether the covariance vector Xy must be copied by the algorithm. If False, it may be
overwritten.
Returns coef: array, shape = (n_features,) or (n_features, n_targets) :
Coefficients of the OMP solution
See Also:
OrthogonalMatchingPursuit, orthogonal_mp, lars_path, decomposition.sparse_encode,
decomposition.sparse_encode_parallel
Notes
Orthogonal matching pursuit was introduced in G. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of
the K-SVD Algorithm using Batch Orthogonal Matching Pursuit Technical Report - CS Technion, April 2008.
http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
sklearn.linear_model.lasso_stability_path
sklearn.linear_model.lasso_stability_path(X, y, scaling=0.5, random_state=None,
n_resampling=200, n_grid=100, sample_fraction=0.75,
eps=8.8817841970012523e-16, n_jobs=1,
verbose=False)
Stability path based on randomized Lasso estimates
Parameters X : array-like, shape = [n_samples, n_features]
training data.
y : array-like, shape = [n_samples]
target values.
scaling : float
The alpha parameter in the stability selection article used to randomly scale the features.
Should be between 0 and 1.
random_state : integer or numpy.RandomState, optional
The generator used to randomize the design.
n_resampling : int
Number of randomized models.
n_grid : int
Number of grid points. The path is linearly reinterpolated on a grid between 0 and 1
before computing the scores.
sample_fraction : float
The fraction of samples to be used in each randomized design. Should be between 0
and 1. If 1, all samples are used.
eps : float
Smallest value of alpha / alpha_max considered
n_jobs : integer, optional
Number of CPUs to use during the resampling. If -1, use all the CPUs
verbose : boolean or integer, optional
Sets the verbosity amount
Returns alphas_grid : array, shape ~ [n_grid]
The grid points between 0 and 1: alpha/alpha_max
scores_path : array, shape = [n_features, n_grid]
The scores for each feature along the path.
Notes
See examples/linear_model/plot_randomized_lasso.py for an example.
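An illustrative call on random data, matching the documented return values (scores_path has one row per feature):
>>> import numpy as np
>>> from sklearn.linear_model import lasso_stability_path
>>> np.random.seed(0)
>>> X = np.random.randn(50, 5)
>>> y = X[:, 0] + 0.1 * np.random.randn(50)
>>> alphas_grid, scores_path = lasso_stability_path(X, y, random_state=0)
>>> scores_path.shape[0]                       # one row of scores per feature
5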
For sparse data
The sklearn.linear_model.sparse submodule is the sparse counterpart of the sklearn.linear_model
module.
User guide: See the Generalized Linear Models section for further details.
linear_model.sparse.Lasso([alpha, ...]) Linear Model trained with L1 prior as regularizer
linear_model.sparse.ElasticNet([alpha, rho, ...]) Linear Model trained with L1 and L2 prior as regularizer
linear_model.sparse.SGDClassifier(*args, ...)
linear_model.sparse.SGDRegressor(*args, **kwargs)
linear_model.LogisticRegression([penalty, ...]) Logistic Regression (aka logit, MaxEnt) classier.
sklearn.linear_model.sparse.Lasso
class sklearn.linear_model.sparse.Lasso(alpha=1.0, fit_intercept=False, normalize=False,
max_iter=1000, tol=0.0001, positive=False)
Linear Model trained with L1 prior as regularizer
This implementation works on scipy.sparse X and dense coef_. Technically this is the same as Elastic Net with
the L2 penalty set to zero.
Parameters alpha : float
Constant that multiplies the L1 term. Defaults to 1.0.
coef_ : ndarray of shape n_features
The initial coefficients to warm-start the optimization.
fit_intercept : bool
Whether the intercept should be estimated or not. If False, the data is assumed to be
already centered.
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit current model with coordinate descent
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_intercept=False, normalize=False, max_iter=1000, tol=0.0001,
positive=False)
decision_function(X)
Decision function of the linear model
Parameters X : scipy.sparse matrix of shape [n_samples, n_features]
Returns array, shape = [n_samples] with the predicted real values :
fit(X, y)
Fit current model with coordinate descent
X is expected to be a sparse matrix. For maximum efficiency, use a sparse matrix in CSC format
(scipy.sparse.csc_matrix).
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.sparse.ElasticNet
class sklearn.linear_model.sparse.ElasticNet(alpha=1.0, rho=0.5, fit_intercept=False,
normalize=False, max_iter=1000, tol=0.0001,
positive=False)
Linear Model trained with L1 and L2 prior as regularizer
This implementation works on scipy.sparse X and dense coef_.
rho=1 is the lasso penalty. Currently, rho <= 0.01 is not reliable, unless you supply your own sequence of alpha.
Parameters alpha : float
Constant that multiplies the L1 term. Defaults to 1.0.
rho : float
The ElasticNet mixing parameter, with 0 < rho <= 1.
fit_intercept : bool
Whether the intercept should be estimated or not. If False, the data is assumed to be
already centered.
TODO: fit_intercept=True is not yet implemented
Notes
The parameter rho corresponds to alpha in the glmnet R package while alpha corresponds to the lambda param-
eter in glmnet.
Methods
decision_function(X) Decision function of the linear model
fit(X, y) Fit current model with coordinate descent
get_params([deep]) Get parameters for the estimator
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, rho=0.5, fit_intercept=False, normalize=False, max_iter=1000, tol=0.0001,
positive=False)
decision_function(X)
Decision function of the linear model
Parameters X : scipy.sparse matrix of shape [n_samples, n_features]
Returns array, shape = [n_samples] with the predicted real values :
fit(X, y)
Fit current model with coordinate descent
X is expected to be a sparse matrix. For maximum efficiency, use a sparse matrix in CSC format
(scipy.sparse.csc_matrix).
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict using the linear model
Parameters X : numpy array of shape [n_samples, n_features]
Returns C : array, shape = [n_samples]
Returns predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.linear_model.sparse.SGDClassifier
class sklearn.linear_model.sparse.SGDClassifier(*args, **kwargs)
Methods
decision_function(X) Predict signed distance to the hyperplane (aka confidence score)
fit(X, y[, coef_init, intercept_init, ...]) Fit linear model with Stochastic Gradient Descent.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X, y[, classes, class_weight, ...]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
predict_proba(X) Predict class membership probability
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(*args, **kwargs)
DEPRECATED: to be removed in v0.12; use sklearn.linear_model.SGDClassifier directly
classes
DEPRECATED: to be removed in v0.12; use classes_ instead.
decision_function(X)
Predict signed distance to the hyperplane (aka confidence score)
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples,n_classes] :
The signed distances to the hyperplane(s).
fit(X, y, coef_init=None, intercept_init=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : numpy array of shape [n_samples]
Target values
coef_init : array, shape = [n_classes,n_features]
The initial coefficients to warm-start the optimization.
intercept_init : array, shape = [n_classes]
The initial intercept to warm-start the optimization.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y, classes=None, class_weight=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Subset of the training data
y : numpy array of shape [n_samples]
Subset of the target values
classes : array, shape = [n_classes]
Classes across all calls to partial_fit. Can be obtained via np.unique(y_all), where
y_all is the target vector of the entire dataset. This argument is required for the first call
to partial_fit and can be omitted in the subsequent calls. Note that y doesn't need to
contain all labels in classes.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Array containing the predicted class labels.
predict_proba(X)
Predict class membership probability
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] if n_classes == 2 else [n_samples, n_classes] :
Contains the membership probabilities of the positive class.
References
The justification for the formula in the loss='modified_huber' case is given in appendix B of:
http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.linear_model.sparse.SGDRegressor
class sklearn.linear_model.sparse.SGDRegressor(*args, **kwargs)
Methods
decision_function(X) Predict using the linear model
fit(X, y[, coef_init, intercept_init, ...]) Fit linear model with Stochastic Gradient Descent.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
partial_fit(X, y[, sample_weight]) Fit linear model with Stochastic Gradient Descent.
predict(X) Predict using the linear model
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(*args, **kwargs)
DEPRECATED: to be removed in v0.12; use sklearn.linear_model.SGDRegressor directly
decision_function(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Predicted target values per element in X.
fit(X, y, coef_init=None, intercept_init=None, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training data
y : numpy array of shape [n_samples]
Target values
coef_init : array, shape = [n_features]
The initial coefficients to warm-start the optimization.
intercept_init : array, shape = [1]
The initial intercept to warm-start the optimization.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
partial_fit(X, y, sample_weight=None)
Fit linear model with Stochastic Gradient Descent.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Subset of training data
y : numpy array of shape [n_samples]
Subset of target values
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples. If not provided, uniform weights are assumed.
Returns self : returns an instance of self.
predict(X)
Predict using the linear model
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns array, shape = [n_samples] :
Predicted target values per element in X.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : oat
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.linear_model.LogisticRegression
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None)
Logistic Regression (aka logit, MaxEnt) classier.
In the multiclass case, the training algorithm uses a one-vs.-all (OvA) scheme, rather than the true multinomial
LR.
This class implements L1 and L2 regularized logistic regression using the liblinear library. It can handle both
dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance;
any other input format will be converted (and copied).
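A minimal usage sketch (the iris data is used purely for illustration):
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> iris = load_iris()
>>> clf = LogisticRegression(C=1.0, penalty='l2').fit(iris.data, iris.target)
>>> clf.score(iris.data, iris.target) > 0.9  # training accuracy, well above chance
True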
Parameters penalty : string, 'l1' or 'l2'
Used to specify the norm used in the penalization
dual : boolean
Dual or primal formulation. Dual formulation is only implemented for the 'l2' penalty. Prefer
dual=False when n_samples > n_features.
C : float, optional (default=1.0)
Specifies the strength of the regularization. The smaller it is, the stronger the regularization.
fit_intercept : bool, default: True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
intercept_scaling : float, default: 1
When self.fit_intercept is True, the instance vector x becomes [x, self.intercept_scaling],
i.e. a synthetic feature with constant value equal to intercept_scaling is appended to
the instance vector. The intercept becomes intercept_scaling * synthetic feature weight.
Note: the synthetic feature weight is subject to l1/l2 regularization as all other features.
To lessen the effect of regularization on the synthetic feature weight (and therefore on the
intercept), intercept_scaling has to be increased.
tol : float, optional :
tolerance for stopping criteria
See Also:
LinearSVC
Notes
The underlying C implementation uses a random number generator to select features when fitting the model.
It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a
smaller tol parameter.
References:
LIBLINEAR - A Library for Large Linear Classification. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent methods for logistic
regression and maximum entropy models. Machine Learning 85(1-2):41-75.
http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
Attributes
coef_ : array, shape = [n_classes-1, n_features]
Coefficient of the features in the decision function. coef_ is a readonly property derived from raw_coef_
that follows the internal memory layout of liblinear.
intercept_ : array, shape = [n_classes-1]
Intercept (a.k.a. bias) added to the decision function. It is available only when the parameter fit_intercept
is set to True.
Methods
decision_function(X) Decision function value for X according to the trained model.
fit(X, y[, class_weight]) Fit the model according to the given training data.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values of X according to the fitted model.
predict_log_proba(X) Log of Probability estimates.
predict_proba(X) Probability estimates.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None)
decision_function(X)
Decision function value for X according to the trained model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_class]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None)
Fit the model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target vector relative to X
class_weight : {dict, 'auto'}, optional
Weights associated with classes. If not given, all classes are supposed to have weight
one.
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values of X according to the fitted model.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Log of Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
predict_proba(X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
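Continuing the hypothetical iris sketch above, the returned array has one column per class:
>>> probas = clf.predict_proba(iris.data[:5])
>>> probas.shape
(5, 3)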
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
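As a sketch of threshold-based selection, continuing the hypothetical iris example above, with an L1 penalty so that some coefficients are driven towards zero:
>>> clf_l1 = LogisticRegression(penalty='l1', C=0.1).fit(iris.data, iris.target)
>>> X_reduced = clf_l1.transform(iris.data, threshold='mean')
>>> X_reduced.shape[1] <= iris.data.shape[1]
True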
1.8.16 sklearn.manifold: Manifold Learning
The sklearn.manifold module implements data embedding techniques.
User guide: See the Manifold learning section for further details.
manifold.LocallyLinearEmbedding([...]) Locally Linear Embedding
manifold.Isomap([n_neighbors, n_components, ...]) Isomap Embedding
manifold.MDS([n_components, metric, n_init, ...]) Multidimensional scaling
sklearn.manifold.LocallyLinearEmbedding
class sklearn.manifold.LocallyLinearEmbedding(n_neighbors=5, n_components=2, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, neighbors_algorithm='auto', random_state=None, out_dim=None)
Locally Linear Embedding
Parameters n_neighbors : integer
number of neighbors to consider for each point.
n_components : integer
number of coordinates for the manifold
reg : float
regularization constant, multiplies the trace of the local covariance matrix of the dis-
tances.
eigen_solver : string, {'auto', 'arpack', 'dense'}
'auto' : algorithm will attempt to choose the best method for input data
'arpack' : use Arnoldi iteration in shift-invert mode. For this method, M may be a dense
matrix, sparse matrix, or general linear operator.
'dense' : use standard dense matrix operations for the eigenvalue decomposition. For this
method, M must be an array or matrix type. This method should be avoided for large
problems.
tol : float, optional
Tolerance for 'arpack' method. Not used if eigen_solver == 'dense'.
max_iter : integer
maximum number of iterations for the arpack solver. Not used if
eigen_solver == 'dense'.
method : string, {'standard', 'hessian', 'modified', 'ltsa'}
'standard' : use the standard locally linear embedding algorithm. See reference [1]
'hessian' : use the Hessian eigenmap method. This method requires
n_neighbors > n_components * (1 + (n_components + 1) / 2). See reference [2]
'modified' : use the modified locally linear embedding algorithm. See reference [3]
'ltsa' : use the local tangent space alignment algorithm. See reference [4]
hessian_tol : float, optional
Tolerance for Hessian eigenmapping method. Only used if method == 'hessian'
modified_tol : float, optional
Tolerance for modified LLE method. Only used if method == 'modified'
neighbors_algorithm : string, {'auto', 'brute', 'kd_tree', 'ball_tree'}
algorithm to use for nearest neighbors search, passed to neighbors.NearestNeighbors
instance
random_state: numpy.RandomState, optional :
The generator used to initialize the centers. Defaults to numpy.random. Used to deter-
mine the starting vector for arpack iterations
References
[R63], [R64], [R65], [R66]
Attributes
embedding_vectors_ : array-like, shape [n_components, n_samples]
Stores the embedding vectors
reconstruction_error_ : float
Reconstruction error associated with embedding_vectors_
nbrs_ : NearestNeighbors object
Stores nearest neighbors instance, including BallTree or KDTree if applicable.
Methods
fit(X[, y]) Compute the embedding vectors for data X
fit_transform(X[, y]) Compute the embedding vectors for data X and transform X.
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X) Transform new points into embedding space.
__init__(n_neighbors=5, n_components=2, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, neighbors_algorithm='auto', random_state=None, out_dim=None)
fit(X, y=None)
Compute the embedding vectors for data X
Parameters X : array-like of shape [n_samples, n_features]
training set.
Returns self : returns an instance of self.
fit_transform(X, y=None)
Compute the embedding vectors for data X and transform X.
Parameters X : array-like of shape [n_samples, n_features]
training set.
Returns X_new: array-like, shape (n_samples, n_components) :
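A minimal sketch on made-up data:
>>> import numpy as np
>>> from sklearn.manifold import LocallyLinearEmbedding
>>> X = np.random.RandomState(0).rand(100, 3)
>>> lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
>>> X_embedded = lle.fit_transform(X)
>>> X_embedded.shape
(100, 2)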
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Transform new points into embedding space.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X_new : array, shape = [n_samples, n_components]
Notes
Because of scaling performed by this method, it is discouraged to use it together with methods that are not
scale-invariant (like SVMs).
sklearn.manifold.Isomap
class sklearn.manifold.Isomap(n_neighbors=5, n_components=2, eigen_solver='auto', tol=0, max_iter=None, path_method='auto', neighbors_algorithm='auto', out_dim=None)
Isomap Embedding
Non-linear dimensionality reduction through Isometric Mapping
Parameters n_neighbors : integer
number of neighbors to consider for each point.
n_components : integer
number of coordinates for the manifold
eigen_solver : {'auto', 'arpack', 'dense'}
'auto' : attempt to choose the most efficient solver for the given problem.
'arpack' : use Arnoldi decomposition to find the eigenvalues and eigenvectors. Note
that arpack can handle both dense and sparse data efficiently.
'dense' : use a direct solver (i.e. LAPACK) for the eigenvalue decomposition.
tol : float
convergence tolerance passed to arpack or lobpcg. Not used if eigen_solver == 'dense'
max_iter : integer
maximum number of iterations for the arpack solver. Not used if eigen_solver == 'dense'
path_method : string, {'auto', 'FW', 'D'}
method to use in finding the shortest path. 'auto' : attempt to choose the best algorithm
automatically. 'FW' : Floyd-Warshall algorithm. 'D' : Dijkstra algorithm with Fibonacci
Heaps.
neighbors_algorithm : string, {'auto', 'brute', 'kd_tree', 'ball_tree'}
algorithm to use for nearest neighbors search, passed to neighbors.NearestNeighbors
instance
References
[1] Tenenbaum, J.B.; De Silva, V.; & Langford, J.C. A global geometric framework for nonlinear
dimensionality reduction. Science 290 (5500)
Attributes
embedding_ : array-like, shape (n_samples, n_components)
Stores the embedding vectors
kernel_pca_ : KernelPCA object
Used to implement the embedding
training_data_ : array-like, shape (n_samples, n_features)
Stores the training data
nbrs_ : sklearn.neighbors.NearestNeighbors instance
Stores nearest neighbors instance, including BallTree or KDTree if applicable.
dist_matrix_ : array-like, shape (n_samples, n_samples)
Stores the geodesic distance matrix of training data
Methods
fit(X[, y]) Compute the embedding vectors for data X
fit_transform(X[, y]) Fit the model from data in X and transform X.
get_params([deep]) Get parameters for the estimator
reconstruction_error() Compute the reconstruction error for the embedding.
set_params(**params) Set the parameters of the estimator.
transform(X) Transform X.
__init__(n_neighbors=5, n_components=2, eigen_solver='auto', tol=0, max_iter=None, path_method='auto', neighbors_algorithm='auto', out_dim=None)
fit(X, y=None)
Compute the embedding vectors for data X
Parameters X : {array-like, sparse matrix, BallTree, cKDTree, NearestNeighbors}
Sample data, shape = (n_samples, n_features), in the form of a numpy array, sparse
array, precomputed tree, or NearestNeighbors object.
Returns self : returns an instance of self.
fit_transform(X, y=None)
Fit the model from data in X and transform X.
Parameters X: {array-like, sparse matrix, BallTree, cKDTree} :
Training vector, where n_samples is the number of samples and n_features is the num-
ber of features.
Returns X_new: array-like, shape (n_samples, n_components) :
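A minimal sketch using the first 100 digits samples (a small subset, chosen only to keep the example fast):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import Isomap
>>> X = load_digits().data[:100]
>>> X_iso = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
>>> X_iso.shape
(100, 2)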
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
reconstruction_error()
Compute the reconstruction error for the embedding.
Returns reconstruction_error : oat
Notes
The cost function of an isomap embedding is
E = frobenius_norm[K(D) - K(D_fit)] / n_samples
where D is the matrix of distances for the input data X, D_fit is the matrix of distances for the output
embedding X_fit, and K is the isomap kernel:
K(D) = -0.5 * (I - 1/n_samples) * D^2 * (I - 1/n_samples)
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X)
Transform X.
This is implemented by linking the points X into the graph of geodesic distances of the training data. First
the n_neighbors nearest neighbors of X are found in the training data, and from these the shortest geodesic
distances from each point in X to each point in the training data are computed in order to construct the
kernel. The embedding of X is the projection of this kernel onto the embedding vectors of the training set.
Parameters X: array-like, shape (n_samples, n_features) :
Returns X_new: array-like, shape (n_samples, n_components) :
sklearn.manifold.MDS
class sklearn.manifold.MDS(n_components=2, metric=True, n_init=4, max_iter=300, verbose=0,
eps=0.001, n_jobs=1, random_state=None)
Multidimensional scaling
Parameters metric : boolean, optional, default: True
compute metric or nonmetric SMACOF (Scaling by Majorizing a Complicated Func-
tion) algorithm
n_components : int, optional, default: 2
number of dimensions in which to immerse the similarities; overridden if an initial array is
provided.
n_init : int, optional, default: 4
Number of times the SMACOF algorithm will be run with different initialisations. The final
result will be the best output of the n_init consecutive runs in terms of stress.
max_iter : int, optional, default: 300
Maximum number of iterations of the SMACOF algorithm for a single run
verbose : int, optional, default: 0
level of verbosity
eps : float, optional, default: 1e-3
relative tolerance with respect to stress at which to declare convergence
n_jobs : int, optional, default: 1
The number of jobs to use for the computation. This works by breaking down the
pairwise matrix into n_jobs even slices and computing them in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which
is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for
n_jobs = -2, all CPUs but one are used.
random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is given, it fixes the seed.
Defaults to the global numpy random number generator.
Notes
Modern Multidimensional Scaling - Theory and Applications Borg, I.; Groenen P. Springer Series in Statistics
(1997)
Nonmetric multidimensional scaling: a numerical method Kruskal, J. Psychometrika, 29 (1964)
Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis Kruskal, J. Psychometrika,
29, (1964)
Attributes
embedding_ : array-like, shape [n_components, n_samples]
Stores the position of the dataset in the embedding space
stress_ : float
The final value of the stress (sum of squared distances of the disparities and the distances for all constrained points)
Methods
fit(X[, init, y]) Computes the position of the points in the embedding space
fit_transform(X[, init, y]) Fit the data from X, and returns the embedded coordinates
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
__init__(n_components=2, metric=True, n_init=4, max_iter=300, verbose=0, eps=0.001, n_jobs=1,
random_state=None)
fit(X, init=None, y=None)
Computes the position of the points in the embedding space
Parameters X: array, shape=[n_samples, n_samples], symmetric :
Proximity matrix
init: {None or ndarray, shape (n_samples,)} :
if None, randomly chooses the initial configuration; if ndarray, initializes the SMACOF
algorithm with this array
fit_transform(X, init=None, y=None)
Fit the data from X, and returns the embedded coordinates
Parameters X: array, shape=[n_samples, n_samples], symmetric :
Proximity matrix
init: {None or ndarray, shape (n_samples,)} :
if None, randomly chooses the initial configuration; if ndarray, initializes the SMACOF
algorithm with this array
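A minimal sketch on a made-up proximity matrix built from random points:
>>> import numpy as np
>>> from sklearn.manifold import MDS
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = np.random.RandomState(0).rand(10, 3)
>>> similarities = euclidean_distances(X)
>>> mds = MDS(n_components=2, random_state=0)
>>> pos = mds.fit_transform(similarities)
>>> pos.shape
(10, 2)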
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
manifold.locally_linear_embedding(X, ...[, ...]) Perform a Locally Linear Embedding analysis on the data.
sklearn.manifold.locally_linear_embedding
sklearn.manifold.locally_linear_embedding(X, n_neighbors, n_components, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, random_state=None, out_dim=None)
Perform a Locally Linear Embedding analysis on the data.
Parameters X : {array-like, sparse matrix, BallTree, cKDTree, NearestNeighbors}
Sample data, shape = (n_samples, n_features), in the form of a numpy array, sparse
array, precomputed tree, or NearestNeighbors object.
n_neighbors : integer
number of neighbors to consider for each point.
n_components : integer
number of coordinates for the manifold.
reg : oat
regularization constant, multiplies the trace of the local covariance matrix of the dis-
tances.
eigen_solver : string, {'auto', 'arpack', 'dense'}
'auto' : algorithm will attempt to choose the best method for input data
'arpack' : use Arnoldi iteration in shift-invert mode. For this method, M may be a dense
matrix, sparse matrix, or general linear operator.
'dense' : use standard dense matrix operations for the eigenvalue decomposition. For this
method, M must be an array or matrix type. This method should be avoided for large
problems.
tol : float, optional
Tolerance for 'arpack' method. Not used if eigen_solver == 'dense'.
max_iter : integer
maximum number of iterations for the arpack solver.
method : {'standard', 'hessian', 'modified', 'ltsa'}
'standard' : use the standard locally linear embedding algorithm. See reference [R67]
'hessian' : use the Hessian eigenmap method. This method requires
n_neighbors > n_components * (1 + (n_components + 1) / 2). See reference [R68]
'modified' : use the modified locally linear embedding algorithm. See reference [R69]
'ltsa' : use the local tangent space alignment algorithm. See reference [R70]
hessian_tol : float, optional
Tolerance for Hessian eigenmapping method. Only used if method == 'hessian'
modified_tol : float, optional
Tolerance for modified LLE method. Only used if method == 'modified'
random_state: numpy.RandomState, optional :
The generator used to initialize the centers. Defaults to numpy.random.
Returns Y : array-like, shape [n_samples, n_components]
Embedding vectors.
squared_error : float
Reconstruction error for the embedding vectors. Equivalent to norm(Y - W Y, 'fro') ** 2,
where W are the reconstruction weights.
References
[R67], [R68], [R69], [R70]
1.8.17 sklearn.metrics: Metrics
The sklearn.metrics module includes score functions, performance metrics and pairwise metrics and distance
computations.
Classication metrics
metrics.confusion_matrix(y_true, y_pred[, ...]) Compute confusion matrix to evaluate the accuracy of a classification
metrics.roc_curve(y_true, y_score) Compute Receiver operating characteristic (ROC)
metrics.auc(x, y) Compute Area Under the Curve (AUC) using the trapezoidal rule
metrics.precision_score(y_true, y_pred[, ...]) Compute the precision
metrics.recall_score(y_true, y_pred[, ...]) Compute the recall
metrics.fbeta_score(y_true, y_pred, beta[, ...]) Compute fbeta score
metrics.f1_score(y_true, y_pred[, labels, ...]) Compute f1 score
metrics.precision_recall_fscore_support(...) Compute precisions, recalls, f-measures and support for each class
metrics.classification_report(y_true, y_pred) Build a text report showing the main classification metrics
metrics.precision_recall_curve(y_true, ...) Compute precision-recall pairs for different probability thresholds
metrics.zero_one_score(y_true, y_pred) Zero-one classification score (accuracy)
metrics.zero_one(y_true, y_pred) Zero-One classification loss
metrics.hinge_loss(y_true, pred_decision[, ...]) Cumulated hinge loss (non-regularized).
sklearn.metrics.confusion_matrix
sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None)
Compute confusion matrix to evaluate the accuracy of a classification
By definition a confusion matrix cm is such that cm[i, j] is equal to the number of observations known to be in
group i but predicted to be in group j.
Parameters y_true : array, shape = [n_samples]
true targets
y_pred : array, shape = [n_samples]
estimated targets
labels : array, shape = [n_classes]
lists all labels occurring in the dataset. If none is given, those that appear at least once in
y_true or y_pred are used.
Returns CM : array, shape = [n_classes, n_classes]
confusion matrix
References
http://en.wikipedia.org/wiki/Confusion_matrix
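Examples
A small worked example with made-up labels, following the cm[i, j] convention above:
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])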
sklearn.metrics.roc_curve
sklearn.metrics.roc_curve(y_true, y_score)
Compute Receiver operating characteristic (ROC)
Note: this implementation is restricted to the binary classification task.
Parameters y_true : array, shape = [n_samples]
true binary labels
y_score : array, shape = [n_samples]
target scores, can either be probability estimates of the positive class, confidence values,
or binary decisions.
Returns fpr : array, shape = [>2]
False Positive Rates
tpr : array, shape = [>2]
True Positive Rates
thresholds : array, shape = [>2]
Thresholds on y_score used to compute fpr and tpr
References
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Examples
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores)
>>> fpr
array([ 0. , 0.5, 0.5, 1. ])
sklearn.metrics.auc
sklearn.metrics.auc(x, y)
Compute Area Under the Curve (AUC) using the trapezoidal rule
Parameters x : array, shape = [n]
x coordinates
y : array, shape = [n]
y coordinates
Returns auc : float
Examples
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred)
>>> metrics.auc(fpr, tpr)
0.75
sklearn.metrics.precision_score
sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='weighted')
Compute the precision
The precision is the ratio tp/(tp + fp) where tp is the number of true positives and fp the number of false
positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The best value is 1 and the worst value is 0.
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Predicted targets
labels : array
Integer array of labels
pos_label : int
In the binary classification case, give the label of the positive class (default is 1). Ev-
erything else but pos_label is considered to belong to the negative class. Set to None
in the case of multiclass classification.
average : string, [None, 'micro', 'macro', 'weighted' (default)]
In the multiclass classification case, this determines the type of averaging performed on
the data.
'macro': Average over classes (does not take imbalance into account).
'micro': Average over instances (takes imbalance into account). This implies that
precision == recall == f1
'weighted': Average weighted by support (takes imbalance into account). Can result in
f1 score that is not between precision and recall.
Returns precision : float
Precision of the positive class in binary classification or weighted average of the preci-
sion of each class for the multiclass task
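A minimal sketch with made-up binary labels (here tp = 2 and fp = 1, so the positive-class precision is 2/3):
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 1, 0, 1]
>>> y_pred = [1, 1, 1, 0, 0]
>>> p = precision_score(y_true, y_pred)  # precision of the positive class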
sklearn.metrics.recall_score
sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='weighted')
Compute the recall
The recall is the ratio tp/(tp + fn) where tp is the number of true positives and fn the number of false negatives.
The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Predicted targets
labels : array
Integer array of labels
pos_label : int
In the binary classification case, give the label of the positive class (default is 1). Ev-
erything else but pos_label is considered to belong to the negative class. Set to None
in the case of multiclass classification.
average : string, [None, 'micro', 'macro', 'weighted' (default)]
In the multiclass classification case, this determines the type of averaging performed on
the data.
'macro': Average over classes (does not take imbalance into account).
'micro': Average over instances (takes imbalance into account). This implies that
precision == recall == f1
'weighted': Average weighted by support (takes imbalance into account). Can result in
f1 score that is not between precision and recall.
Returns recall : float
Recall of the positive class in binary classification or weighted average of the recall of
each class for the multiclass task.
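A minimal sketch with made-up binary labels (2 of the 3 positive samples are found, so the recall is 2/3):
>>> from sklearn.metrics import recall_score
>>> r = recall_score([0, 1, 1, 0, 1], [1, 1, 1, 0, 0])  # recall of the positive class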
sklearn.metrics.fbeta_score
sklearn.metrics.fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1, average='weighted')
Compute fbeta score
The F_beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its
worst value at 0.
The beta parameter determines the weight of precision in the combined score. beta < 1 lends more weight
to precision, while beta > 1 favors recall (beta == 0 considers only precision, beta == inf only
recall).
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Predicted targets
beta : float :
Weight of precision in harmonic mean.
labels : array
Integer array of labels
pos_label : int
In the binary classification case, give the label of the positive class (default is 1). Ev-
erything else but pos_label is considered to belong to the negative class. Set to None
in the case of multiclass classification.
average : string, [None, 'micro', 'macro', 'weighted' (default)]
In the multiclass classification case, this determines the type of averaging performed on
the data.
'macro': Average over classes (does not take imbalance into account).
'micro': Average over instances (takes imbalance into account). This implies that
precision == recall == f1
'weighted': Average weighted by support (takes imbalance into account). Can result in
f1 score that is not between precision and recall.
Returns fbeta_score : float
fbeta_score of the positive class in binary classification or weighted average of the
fbeta_score of each class for the multiclass task.
References
R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 327-328.
http://en.wikipedia.org/wiki/F1_score
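A minimal sketch with made-up labels for which precision is 1.0 and recall is 1/3, so the beta=0.5 score is higher than the beta=2 score:
>>> from sklearn.metrics import fbeta_score
>>> y_true = [1, 1, 1, 0]
>>> y_pred = [1, 0, 0, 0]
>>> f_half = fbeta_score(y_true, y_pred, beta=0.5)  # leans towards precision
>>> f_two = fbeta_score(y_true, y_pred, beta=2.0)   # leans towards recall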
sklearn.metrics.f1_score
sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted')
Compute f1 score
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its
best value at 1 and worst score at 0. The relative contribution of precision and recall to the f1 score are equal.
The formula for the F_1 score is:
F_1 = 2 * (precision * recall) / (precision + recall)
See: http://en.wikipedia.org/wiki/F1_score
In the multi-class case, this is the weighted average of the f1-score of each class.
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Predicted targets
labels : array
Integer array of labels
pos_label : int
In the binary classification case, give the label of the positive class (default is 1). Ev-
erything else but pos_label is considered to belong to the negative class. Set to None
in the case of multiclass classification.
average : string, [None, 'micro', 'macro', 'weighted' (default)]
In the multiclass classification case, this determines the type of averaging performed on
the data.
'macro': Average over classes (does not take imbalance into account).
'micro': Average over instances (takes imbalance into account). This implies that
precision == recall == f1
'weighted': Average weighted by support (takes imbalance into account). Can result in
f1 score that is not between precision and recall.
Returns f1_score : float
f1_score of the positive class in binary classification or weighted average of the
f1_scores of each class for the multiclass task
References
http://en.wikipedia.org/wiki/F1_score
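A minimal sketch with made-up binary labels (precision and recall are both 2/3 here, so the F1 score is also 2/3):
>>> from sklearn.metrics import f1_score
>>> f1 = f1_score([0, 1, 1, 1], [1, 1, 1, 0])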
sklearn.metrics.precision_recall_fscore_support
sklearn.metrics.precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, pos_label=1, average=None)
Compute precisions, recalls, f-measures and support for each class
The precision is the ratio tp/(tp + fp) where tp is the number of true positives and fp the number of false
positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp/(tp + fn) where tp is the number of true positives and fn the number of false negatives.
The recall is intuitively the ability of the classifier to find all the positive samples.
The F_beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F_beta
score reaches its best value at 1 and worst score at 0.
The F_beta score weights recall beta times as much as precision. beta = 1.0 means recall and precision are equally
important.
The support is the number of occurrences of each class in y_true.
If pos_label is None, this function returns the average precision, recall and f-measure if average is one of
'micro', 'macro', 'weighted'.
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Predicted targets
beta : oat, 1.0 by default
The strength of recall versus precision in the f-score.
labels : array
Integer array of labels
pos_label : int
In the binary classification case, give the label of the positive class (default is 1). Ev-
erything else but pos_label is considered to belong to the negative class. Set to None
in the case of multiclass classification.
average : string, [None, 'micro', 'macro', 'weighted' (default)]
In the multiclass classification case, this determines the type of averaging performed on
the data.
'macro': Average over classes (does not take imbalance into account).
'micro': Average over instances (takes imbalance into account). This implies that
precision == recall == f1
'weighted': Average weighted by support (takes imbalance into account). Can result in
f1 score that is not between precision and recall.
Returns precision: array, shape = [n_unique_labels], dtype = np.double :
recall: array, shape = [n_unique_labels], dtype = np.double :
f1_score: array, shape = [n_unique_labels], dtype = np.double :
support: array, shape = [n_unique_labels], dtype = np.long :
References
http://en.wikipedia.org/wiki/Precision_and_recall
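A minimal sketch on made-up multiclass labels; with average=None the four return values are arrays with one entry per class:
>>> from sklearn.metrics import precision_recall_fscore_support
>>> y_true = [0, 0, 1, 1, 2]
>>> y_pred = [0, 1, 1, 1, 2]
>>> p, r, f, s = precision_recall_fscore_support(y_true, y_pred, average=None)
>>> len(s)  # one support value per class
3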
sklearn.metrics.classification_report
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None)
Build a text report showing the main classification metrics
Parameters y_true : array, shape = [n_samples]
True targets
y_pred : array, shape = [n_samples]
Estimated targets
labels : array, shape = [n_labels]
Optional list of label indices to include in the report
target_names : list of strings
Optional display names matching the labels (same order)
Returns report : string
Text summary of the precision, recall, f1-score for each class
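A minimal sketch with made-up labels; the returned value is a multi-line string with one row per class plus an averaged total:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 2, 0]
>>> report = classification_report(y_true, y_pred, target_names=['class 0', 'class 1', 'class 2'])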
sklearn.metrics.precision_recall_curve
sklearn.metrics.precision_recall_curve(y_true, probas_pred)
Compute precision-recall pairs for different probability thresholds
Note: this implementation is restricted to the binary classification task.
The precision is the ratio tp/(tp + fp) where tp is the number of true positives and fp the number of false
positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp/(tp + fn) where tp is the number of true positives and fn the number of false negatives.
The recall is intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This
ensures that the graph starts on the x axis.
Parameters y_true : array, shape = [n_samples]
True targets of binary classification in range {-1, 1} or {0, 1}
probas_pred : array, shape = [n_samples]
Estimated probabilities
Returns precision : array, shape = [n + 1]
Precision values
recall : array, shape = [n + 1]
Recall values
thresholds : array, shape = [n]
Thresholds on probas_pred used to compute precision and recall
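A minimal sketch with made-up scores, illustrating the documented shapes of the returned arrays:
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> probas_pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
>>> precision.shape[0] == recall.shape[0] == thresholds.shape[0] + 1
True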
sklearn.metrics.zero_one_score
sklearn.metrics.zero_one_score(y_true, y_pred)
Zero-one classification score (accuracy)
Return the fraction of correct predictions in y_pred (a float between 0 and 1). The best performance is 1.
Parameters y_true : array-like, shape = n_samples
Gold standard labels.
y_pred : array-like, shape = n_samples
Predicted labels, as returned by a classier.
Returns score : float
sklearn.metrics.zero_one
sklearn.metrics.zero_one(y_true, y_pred)
Zero-One classification loss
Positive integer (number of misclassifications). The best performance is 0.
Return the number of errors
Parameters y_true : array-like
y_pred : array-like
Returns loss : float
sklearn.metrics.hinge_loss
sklearn.metrics.hinge_loss(y_true, pred_decision, pos_label=1, neg_label=-1)
Cumulated hinge loss (non-regularized).
Assuming labels in y_true are encoded with +1 and -1, when a prediction mistake is made, margin = y_true *
pred_decision is always negative (since the signs disagree), therefore 1 - margin is always greater than 1. The
cumulated hinge loss therefore upper bounds the number of mistakes made by the classifier.
Parameters y_true : array, shape = [n_samples]
True target (integers)
pred_decision : array, shape = [n_samples] or [n_samples, n_classes]
Predicted decisions, as output by decision_function (floats)
Regression metrics
metrics.r2_score(y_true, y_pred) R^2 (coefficient of determination) regression score function
metrics.mean_squared_error(y_true, y_pred) Mean squared error regression loss
sklearn.metrics.r2_score
sklearn.metrics.r2_score(y_true, y_pred)
R^2 (coefficient of determination) regression score function
Best possible score is 1.0, lower values are worse.
Parameters y_true : array-like
y_pred : array-like
Returns z : float
The R^2 score
Notes
This is not a symmetric function.
References
http://en.wikipedia.org/wiki/Coefficient_of_determination
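A minimal sketch with made-up values (the prediction tracks the target closely, so the score is close to 1.0):
>>> from sklearn.metrics import r2_score
>>> z = r2_score([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])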
sklearn.metrics.mean_squared_error
sklearn.metrics.mean_squared_error(y_true, y_pred)
Mean squared error regression loss
Return a positive floating point value (the best value is 0.0).
Parameters y_true : array-like
y_pred : array-like
Returns loss : float
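A minimal sketch with made-up values; the squared errors are 0.25, 0.25, 0 and 1, so their mean is 0.375:
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
0.375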
Clustering metrics
See the Clustering section of the user guide for further details. The sklearn.metrics.cluster submodule
contains evaluation metrics for cluster analysis results. There are two forms of evaluation:
supervised, which uses ground truth class values for each sample.
unsupervised, which does not and measures the quality of the model itself.
metrics.adjusted_mutual_info_score(...) Adjusted Mutual Information between two clusterings
metrics.adjusted_rand_score(labels_true, ...) Rand index adjusted for chance
metrics.completeness_score(labels_true, ...) Completeness metric of a cluster labeling given a ground truth
metrics.homogeneity_completeness_v_measure(...) Compute the homogeneity and completeness and V-measure scores at once
metrics.homogeneity_score(labels_true, ...) Homogeneity metric of a cluster labeling given a ground truth
metrics.mutual_info_score(labels_true, ...) Mutual Information between two clusterings
metrics.normalized_mutual_info_score(...) Normalized Mutual Information between two clusterings
metrics.silhouette_score(X, labels[, ...]) Compute the mean Silhouette Coefficient of all samples.
metrics.v_measure_score(labels_true, labels_pred) V-Measure cluster labeling given a ground truth.
sklearn.metrics.adjusted_mutual_info_score
sklearn.metrics.adjusted_mutual_info_score(labels_true, labels_pred)
Adjusted Mutual Information between two clusterings
Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for
chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of
clusters, regardless of whether there is actually more information shared. For two clusterings U and V, the AMI
is given as:
AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score value.
This can be useful to measure the agreement of two independent label assignment strategies on the same dataset
when the real ground truth is not known.
Be mindful that this function is an order of magnitude slower than other metrics, such as the Adjusted Rand
Index.
Parameters labels_true : int array, shape = [n_samples]
A clustering of the data into disjoint subsets.
labels_pred : array, shape = [n_samples]
A clustering of the data into disjoint subsets.
Returns ami : float :
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See Also:
adjusted_rand_score : Adjusted Rand Index
mutual_info_score : Mutual Information (not adjusted for chance)
References
[R41], [R42]
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
>>> from sklearn.metrics.cluster import adjusted_mutual_info_score
>>> adjusted_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
If class members are completely split across different clusters, the assignment is totally incomplete, hence
the AMI is null:
>>> adjusted_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
sklearn.metrics.adjusted_rand_score
sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)
Rand index adjusted for chance
The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and
counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.
The raw RI score is then adjusted for chance into the ARI score using the following scheme:
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the
number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).
ARI is a symmetric measure:
adjusted_rand_score(a, b) == adjusted_rand_score(b, a)
Parameters labels_true : int array, shape = [n_samples]
Ground truth class labels to be used as a reference
labels_pred : array, shape = [n_samples]
Cluster labels to evaluate
Returns ari : float :
Similarity score between -1.0 and 1.0. Random labelings have an ARI close to 0.0. 1.0
stands for perfect match.
See Also:
adjusted_mutual_info_score : Adjusted Mutual Information
References
[Hubert1985], [wk]
Examples
Perfectly matching labelings have a score of 1, even with permuted labels:
>>> from sklearn.metrics.cluster import adjusted_rand_score
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
Labelings that assign all class members to the same clusters are complete but not always pure, hence penalized:
>>> adjusted_rand_score([0, 0, 1, 2], [0, 0, 1, 1])
0.57...
ARI is symmetric, so labelings that have pure clusters with members coming from the same classes but unnec-
essary splits are penalized:
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
0.57...
If classes members are completely split across different clusters, the assignment is totally incomplete, hence the
ARI is very low:
>>> adjusted_rand_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
sklearn.metrics.completeness_score
sklearn.metrics.completeness_score(labels_true, labels_pred)
Completeness metric of a cluster labeling given a ground truth
A clustering result satisfies completeness if all the data points that are members of a given class are elements of
the same cluster.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the homogeneity_score which
will be different in general.
Parameters labels_true : int array, shape = [n_samples]
ground truth class labels to be used as a reference
labels_pred : array, shape = [n_samples]
cluster labels to evaluate
Returns completeness : float :
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See Also:
homogeneity_score, v_measure_score
References
Andrew Rosenberg and Julia Hirschberg V-Measure: A conditional entropy-based external cluster evalu-
ation measure, 2007 http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf
Examples
Perfect labelings are complete:
>>> from sklearn.metrics.cluster import completeness_score
>>> completeness_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
Non-perfect labelings that assign all class members to the same clusters are still complete:
>>> completeness_score([0, 0, 1, 1], [0, 0, 0, 0])
1.0
>>> completeness_score([0, 1, 2, 3], [0, 0, 1, 1])
1.0
If class members are split across different clusters, the assignment cannot be complete:
>>> completeness_score([0, 0, 1, 1], [0, 1, 0, 1])
0.0
>>> completeness_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
sklearn.metrics.homogeneity_completeness_v_measure
sklearn.metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
Compute the homogeneity and completeness and V-measure scores at once
Those metrics are based on normalized conditional entropy measures of the clustering labeling to evaluate given
the knowledge of a Ground Truth class labels of the same samples.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a
single class.
A clustering result satisfies completeness if all the data points that are members of a given class are elements of
the same cluster.
Both scores have positive values between 0.0 and 1.0, larger values being desirable.
Those 3 metrics are independent of the absolute values of the labels: a permutation of the class or cluster label
values won't change the score values in any way.
V-Measure is furthermore symmetric: swapping labels_true and label_pred will give the same score. This does
not hold for homogeneity and completeness.
Parameters labels_true : int array, shape = [n_samples]
ground truth class labels to be used as a reference
labels_pred : array, shape = [n_samples]
cluster labels to evaluate
Returns homogeneity : float :
score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling
completeness : float :
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
v_measure : float :
harmonic mean of the first two
See Also:
homogeneity_score, completeness_score, v_measure_score
sklearn.metrics.homogeneity_score
sklearn.metrics.homogeneity_score(labels_true, labels_pred)
Homogeneity metric of a cluster labeling given a ground truth
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a
single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the completeness_score which
will be different in general.
Parameters labels_true : int array, shape = [n_samples]
ground truth class labels to be used as a reference
labels_pred : array, shape = [n_samples]
cluster labels to evaluate
Returns homogeneity : float :
score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling
See Also:
completeness_score, v_measure_score
References
Andrew Rosenberg and Julia Hirschberg V-Measure: A conditional entropy-based external cluster evalu-
ation measure, 2007 http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf
Examples
Perfect labelings are homogeneous:
>>> from sklearn.metrics.cluster import homogeneity_score
>>> homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
Non-perfect labelings that further split classes into more clusters can be perfectly homogeneous:
>>> homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2])
1.0
>>> homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])
1.0
Clusters that include samples from different classes do not make for a homogeneous labeling:
>>> homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1])
0.0
>>> homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0])
0.0
sklearn.metrics.mutual_info_score
sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)
Mutual Information between two clusterings
The Mutual Information is a measure of the similarity between two labels of the same data. Where P(i) is the
probability of a random sample occurring in cluster U_i and P(j) is the probability of a random sample occurring
in cluster V_j, the Mutual Information between clusterings U and V is given as:
MI(U, V) = sum_{i=1}^{R} sum_{j=1}^{C} P(i, j) * log( P(i, j) / (P(i) * P(j)) )
This is equal to the Kullback-Leibler divergence of the joint distribution with the product distribution of the
marginals.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score value.
This can be useful to measure the agreement of two independent label assignment strategies on the same dataset
when the real ground truth is not known.
Parameters labels_true : int array, shape = [n_samples]
A clustering of the data into disjoint subsets.
labels_pred : array, shape = [n_samples]
A clustering of the data into disjoint subsets.
contingency: None or array, shape = [n_classes_true, n_classes_pred] :
A contingency matrix given by the contingency_matrix function. If value is None, it
will be computed, otherwise the given value is used, with labels_true and labels_pred
ignored.
Returns mi : float :
Mutual information, a non-negative value
See Also:
adjusted_mutual_info_score : Adjusted against chance Mutual Information
normalized_mutual_info_score : Normalized Mutual Information
sklearn.metrics.normalized_mutual_info_score
sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred)
Normalized Mutual Information between two clusterings
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the
results between 0 (no mutual information) and 1 (perfect correlation).
This measure is not adjusted for chance. Therefore adjusted_mutual_info_score might be preferred.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score value.
This can be useful to measure the agreement of two independent label assignment strategies on the same dataset
when the real ground truth is not known.
Parameters labels_true : int array, shape = [n_samples]
A clustering of the data into disjoint subsets.
labels_pred : array, shape = [n_samples]
A clustering of the data into disjoint subsets.
Returns nmi : float
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See Also:
adjusted_rand_score : Adjusted Rand Index
adjusted_mutual_info_score : Adjusted Mutual Information (adjusted against chance)
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
>>> from sklearn.metrics.cluster import normalized_mutual_info_score
>>> normalized_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
If class members are completely split across different clusters, the assignment is totally incomplete, hence
the NMI is null:
>>> normalized_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
sklearn.metrics.silhouette_score
sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds)
Compute the mean Silhouette Coefficient of all samples.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster
distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify,
b is the distance between a sample and the nearest cluster that the sample is not a part of.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample,
use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values
generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Parameters X : array [n_samples_a, n_samples_a] if metric == precomputed, or, [n_samples_a,
n_features] otherwise
Array of pairwise distances between samples, or a feature array.
labels : array, shape = [n_samples]
label values for each sample
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string, it must
be one of the options allowed by metrics.pairwise.pairwise_distances. If X is the distance array itself, use
metric='precomputed'.
sample_size : int or None
The size of the sample to use when computing the Silhouette Coefficient. If sample_size
is None, no sampling is used.
random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is given, it fixes the seed.
Defaults to the global numpy random number generator.
**kwds : optional keyword parameters
Any further parameters are passed directly to the distance function. If using a
scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy
docs for usage examples.
Returns silhouette : float
Mean Silhouette Coefficient for all samples.
References
Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster
Analysis. Computational and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7.
http://en.wikipedia.org/wiki/Silhouette_(clustering)
sklearn.metrics.v_measure_score
sklearn.metrics.v_measure_score(labels_true, labels_pred)
V-Measure cluster labeling given a ground truth.
This score is identical to normalized_mutual_info_score.
The V-Measure is the harmonic mean between homogeneity and completeness:
v = 2 * (homogeneity * completeness) / (homogeneity + completeness)
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won't change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score value.
This can be useful to measure the agreement of two independent label assignment strategies on the same dataset
when the real ground truth is not known.
Parameters labels_true : int array, shape = [n_samples]
ground truth class labels to be used as a reference
labels_pred : array, shape = [n_samples]
cluster labels to evaluate
Returns v_measure : float
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See Also:
homogeneity_score, completeness_score
References
[Rosenberg2007]
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
>>> from sklearn.metrics.cluster import v_measure_score
>>> v_measure_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
Labelings that assign all class members to the same clusters are complete but not homogeneous, hence penalized:
>>> v_measure_score([0, 0, 1, 2], [0, 0, 1, 1])
0.8...
>>> v_measure_score([0, 1, 2, 3], [0, 0, 1, 1])
0.66...
Labelings that have pure clusters with members coming from the same classes are homogeneous, but unnecessary
splits harm completeness and thus penalize the V-measure as well:
>>> v_measure_score([0, 0, 1, 1], [0, 0, 1, 2])
0.8...
>>> v_measure_score([0, 0, 1, 1], [0, 1, 2, 3])
0.66...
If class members are completely split across different clusters, the assignment is totally incomplete, hence
the V-measure is null:
>>> v_measure_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
Clusters that include samples from totally different classes totally destroy the homogeneity of the labeling,
hence:
>>> v_measure_score([0, 0, 1, 1], [0, 0, 0, 0])
0.0
Pairwise metrics
The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of
sets of samples.
This module contains both distance metrics and kernels. A brief summary is given on the two here.
Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered more similar
than objects a and c. Two objects exactly alike would have a distance of zero. One of the most popular examples is
Euclidean distance. To be a true metric, it must obey the following four conditions:
1. d(a, b) >= 0, for all a and b
2. d(a, b) == 0, if and only if a = b, positive definiteness
3. d(a, b) == d(b, a), symmetry
4. d(a, c) <= d(a, b) + d(b, c), the triangle inequality
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered more similar than
objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be
the distance, and S be the kernel:
1. S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
2. S = 1. / (D / np.max(D))
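For instance, a minimal sketch of the first conversion (the small array and the gamma heuristic below are purely illustrative):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = np.array([[0., 1.], [1., 1.], [2., 0.]])
>>> D = euclidean_distances(X)     # pairwise Euclidean distances, zeros on the diagonal
>>> gamma = 1.0 / X.shape[1]       # heuristic: 1 / num_features
>>> S = np.exp(-D * gamma)         # kernel-like similarities in (0, 1], ones on the diagonal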
metrics.pairwise.euclidean_distances(X[, Y, ...]) Considering the rows of X (and Y=X) as vectors, compute the
metrics.pairwise.manhattan_distances(X[, Y, ...]) Compute the L1 distances between the vectors in X and Y.
metrics.pairwise.linear_kernel(X[, Y]) Compute the linear kernel between X and Y.
metrics.pairwise.polynomial_kernel(X[, Y, ...]) Compute the polynomial kernel between X and Y:
metrics.pairwise.rbf_kernel(X[, Y, gamma]) Compute the rbf (gaussian) kernel between X and Y:
metrics.pairwise.distance_metrics() Valid metrics for pairwise_distances
metrics.pairwise.pairwise_distances(X[, Y, ...]) Compute the distance matrix from a vector array X and optional Y.
metrics.pairwise.kernel_metrics() Valid metrics for pairwise_kernels
metrics.pairwise.pairwise_kernels(X[, Y, ...]) Compute the kernel between arrays X and optional array Y.
sklearn.metrics.pairwise.euclidean_distances
sklearn.metrics.pairwise.euclidean_distances(X, Y=None, Y_norm_squared=None,
squared=False)
Considering the rows of X (and Y=X) as vectors, compute the distance matrix between each pair of vectors.
For efficiency reasons, the euclidean distance between a pair of row vectors x and y is computed as:
dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))
This formulation has two main advantages. First, it is computationally efficient when dealing with sparse data.
Second, if x varies but y remains unchanged, then the right-most dot-product dot(y, y) can be pre-computed.
Parameters X : {array-like, sparse matrix}, shape = [n_samples_1, n_features]
Y : {array-like, sparse matrix}, shape = [n_samples_2, n_features]
Y_norm_squared : array-like, shape = [n_samples_2], optional
Pre-computed dot-products of vectors in Y (e.g., (Y ** 2).sum(axis=1))
squared : boolean, optional
Return squared Euclidean distances.
Returns distances : {array, sparse matrix}, shape = [n_samples_1, n_samples_2]
Examples
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = [[0, 1], [1, 1]]
>>> # distance between rows of X
>>> euclidean_distances(X, X)
array([[ 0.,  1.],
       [ 1.,  0.]])
>>> # get distance to origin
>>> euclidean_distances(X, [[0, 0]])
array([[ 1.        ],
       [ 1.41421356]])
sklearn.metrics.pairwise.manhattan_distances
sklearn.metrics.pairwise.manhattan_distances(X, Y=None, sum_over_features=True)
Compute the L1 distances between the vectors in X and Y.
With sum_over_features equal to False it returns the componentwise distances.
Parameters X : array_like
An array with shape (n_samples_X, n_features).
Y : array_like, optional
An array with shape (n_samples_Y, n_features).
sum_over_features : bool, default=True
If True the function returns the pairwise distance matrix else it returns the component-
wise L1 pairwise-distances.
Returns D : array
If sum_over_features is False shape is (n_samples_X * n_samples_Y, n_features) and D
contains the componentwise L1 pairwise-distances (i.e. absolute difference), else shape
is (n_samples_X, n_samples_Y) and D contains the pairwise l1 distances.
Examples
>>> from sklearn.metrics.pairwise import manhattan_distances
>>> manhattan_distances(3, 3)
array([[ 0.]])
>>> manhattan_distances(3, 2)
array([[ 1.]])
>>> manhattan_distances(2, 3)
array([[ 1.]])
>>> manhattan_distances([[1, 2], [3, 4]], [[1, 2], [0, 3]])
array([[ 0., 2.],
[ 4., 4.]])
>>> import numpy as np
>>> X = np.ones((1, 2))
>>> y = 2 * np.ones((2, 2))
>>> manhattan_distances(X, y, sum_over_features=False)
array([[ 1.,  1.],
       [ 1.,  1.]]...)
sklearn.metrics.pairwise.linear_kernel
sklearn.metrics.pairwise.linear_kernel(X, Y=None)
Compute the linear kernel between X and Y.
Parameters X : array of shape (n_samples_1, n_features)
Y : array of shape (n_samples_2, n_features)
Returns Gram matrix : array of shape (n_samples_1, n_samples_2)
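A minimal sketch (the toy array is illustrative only):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import linear_kernel
>>> X = np.array([[0., 1.], [1., 1.]])
>>> K = linear_kernel(X, X)     # identical to np.dot(X, X.T): [[1., 1.], [1., 2.]]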
sklearn.metrics.pairwise.polynomial_kernel
sklearn.metrics.pairwise.polynomial_kernel(X, Y=None, degree=3, gamma=0, coef0=1)
Compute the polynomial kernel between X and Y:
K(X, Y) = (gamma <X, Y> + coef0)^degree
Parameters X : array of shape (n_samples_1, n_features)
Y : array of shape (n_samples_2, n_features)
degree : int
Returns Gram matrix : array of shape (n_samples_1, n_samples_2)
sklearn.metrics.pairwise.rbf_kernel
sklearn.metrics.pairwise.rbf_kernel(X, Y=None, gamma=0)
Compute the rbf (gaussian) kernel between X and Y:
K(X, Y) = exp(-gamma ||X-Y||^2)
Parameters X : array of shape (n_samples_1, n_features)
Y : array of shape (n_samples_2, n_features)
gamma : float
Returns Gram matrix : array of shape (n_samples_1, n_samples_2)
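A minimal sketch; gamma is passed explicitly here so the kernel values are easy to check by hand:
>>> import numpy as np
>>> from sklearn.metrics.pairwise import rbf_kernel
>>> X = np.array([[0., 0.], [1., 0.]])
>>> K = rbf_kernel(X, X, gamma=0.5)   # K[i, j] = exp(-0.5 * ||X[i] - X[j]||^2)
>>> # diagonal entries are exactly 1.0; the off-diagonal entries here equal exp(-0.5) ~= 0.61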
sklearn.metrics.pairwise.distance_metrics
sklearn.metrics.pairwise.distance_metrics()
Valid metrics for pairwise_distances
This function simply returns the valid pairwise distance metrics. It exists, however, to allow for a verbose
description of the mapping for each of the valid strings.
The valid distance metrics, and the function they map to, are:
metric Function
cityblock sklearn.pairwise.manhattan_distances
euclidean sklearn.pairwise.euclidean_distances
l1 sklearn.pairwise.manhattan_distances
l2 sklearn.pairwise.euclidean_distances
manhattan sklearn.pairwise.manhattan_distances
sklearn.metrics.pairwise.pairwise_distances
sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)
Compute the distance matrix from a vector array X and optional Y.
This method takes either a vector array or a distance matrix, and returns a distance matrix. If the input is a vector
array, the distances are computed. If the input is a distances matrix, it is returned instead.
This method provides a safe way to take a distance matrix as input, while preserving compatibility with many
other algorithms that take a vector array.
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X
and Y.
Please note that support for sparse matrices is currently limited to those metrics listed in
pairwise.pairwise_distance_functions.
Valid values for metric are:
from scikit-learn: ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock']
from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming',
'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'] See the documentation for
scipy.spatial.distance for details on these metrics.
Note in the case of euclidean and cityblock (which are valid scipy.spatial.distance metrics), the values will use
the scikit-learn implementation, which is faster and has support for sparse matrices. For a verbose description
of the metrics from scikit-learn, see the __doc__ of the sklearn.pairwise.distance_metrics function.
Parameters X : array [n_samples_a, n_samples_a] if metric == precomputed, or, [n_samples_a,
n_features] otherwise
Array of pairwise distances between samples, or a feature array.
Y : array [n_samples_b, n_features]
A second feature array only if X has shape [n_samples_a, n_features].
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If
metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist
for its metric parameter, or a metric listed in pairwise.pairwise_distance_functions. If
metric is precomputed, X is assumed to be a distance matrix. Alternatively, if metric
is a callable function, it is called on each pair of instances (rows) and the resulting
value recorded. The callable should take two arrays from X as input and return a value
indicating the distance between them.
n_jobs : int
The number of jobs to use for the computation. This works by breaking down the
pairwise matrix into n_jobs even slices and computing them in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which
is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used. Thus for
n_jobs = -2, all CPUs but one are used.
**kwds : optional keyword parameters
Any further parameters are passed directly to the distance function. If using a
scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy
docs for usage examples.
Returns D : array [n_samples_a, n_samples_a] or [n_samples_a, n_samples_b]
A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of
the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between
the ith array from X and the jth array from Y.
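A minimal sketch (the toy arrays are illustrative only):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import pairwise_distances
>>> X = np.array([[0, 1], [1, 1]])
>>> Y = np.array([[0, 0]])
>>> D = pairwise_distances(X, Y, metric='manhattan')   # shape (2, 1): [[1.], [2.]]
>>> D2 = pairwise_distances(pairwise_distances(X), metric='precomputed')  # a distance matrix is returned unchanged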
sklearn.metrics.pairwise.kernel_metrics
sklearn.metrics.pairwise.kernel_metrics()
Valid metrics for pairwise_kernels
This function simply returns the valid pairwise kernel metrics. It exists, however, to allow for a verbose
description of the mapping for each of the valid strings.
The valid kernel functions, and the function they map to, are:
metric Function
linear sklearn.pairwise.linear_kernel
poly sklearn.pairwise.polynomial_kernel
polynomial sklearn.pairwise.polynomial_kernel
rbf sklearn.pairwise.rbf_kernel
sigmoid sklearn.pairwise.sigmoid_kernel
sklearn.metrics.pairwise.pairwise_kernels
sklearn.metrics.pairwise.pairwise_kernels(X, Y=None, metric='linear', filter_params=False, n_jobs=1, **kwds)
Compute the kernel between arrays X and optional array Y.
This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector
array, the kernels are computed. If the input is a kernel matrix, it is returned instead.
This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many
other algorithms that take a vector array.
If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X
and Y.
Valid values for metric are: ['rbf', 'sigmoid', 'polynomial', 'poly', 'linear']
Parameters X : array [n_samples_a, n_samples_a] if metric == precomputed, or, [n_samples_a,
n_features] otherwise
Array of pairwise kernels between samples, or a feature array.
Y : array [n_samples_b, n_features]
A second feature array only if X has shape [n_samples_a, n_features].
metric : string, or callable
The metric to use when calculating kernel between instances in a feature array. If metric
is a string, it must be one of the metrics in pairwise.pairwise_kernel_functions. If metric
is 'precomputed', X is assumed to be a kernel matrix. Alternatively, if metric is a callable
function, it is called on each pair of instances (rows) and the resulting value recorded.
The callable should take two arrays from X as input and return a value indicating the
distance between them.
n_jobs : int
The number of jobs to use for the computation. This works by breaking down the
pairwise matrix into n_jobs even slices and computing them in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which
is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used. Thus for
n_jobs = -2, all CPUs but one are used.
filter_params : boolean
Whether to filter invalid parameters or not.
**kwds : optional keyword parameters
Any further parameters are passed directly to the kernel function.
Returns K : array [n_samples_a, n_samples_a] or [n_samples_a, n_samples_b]
A kernel matrix K such that K_{i, j} is the kernel between the ith and jth vectors of the
given matrix X, if Y is None. If Y is not None, then K_{i, j} is the kernel between the
ith array from X and the jth array from Y.
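A minimal sketch (the toy array is illustrative only):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.array([[0., 1.], [1., 1.]])
>>> K = pairwise_kernels(X, metric='linear')              # same result as linear_kernel(X, X)
>>> K_rbf = pairwise_kernels(X, metric='rbf', gamma=0.5)  # extra keyword arguments are forwarded to the kernel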
1.8.18 sklearn.mixture: Gaussian Mixture Models
The sklearn.mixture module implements mixture modeling algorithms.
User guide: See the Gaussian mixture models section for further details.
mixture.GMM([n_components, covariance_type, ...]) Gaussian Mixture Model
mixture.DPGMM([n_components, ...]) Variational Inference for the Infinite Gaussian Mixture Model.
mixture.VBGMM([n_components, ...]) Variational Inference for the Gaussian Mixture Model
sklearn.mixture.GMM
class sklearn.mixture.GMM(n_components=1, covariance_type='diag', random_state=None,
thresh=0.01, min_covar=0.001, n_iter=100, n_init=1, params='wmc',
init_params='wmc')
Gaussian Mixture Model
Representation of a Gaussian mixture model probability distribution. This class allows for easy evaluation of,
sampling from, and maximum-likelihood estimation of the parameters of a GMM distribution.
Initializes parameters such that every mixture component has zero mean and identity covariance.
Parameters n_components : int, optional
Number of mixture components. Defaults to 1.
covariance_type : string, optional
String describing the type of covariance parameters to use. Must be one of spherical,
tied, diag, full. Defaults to diag.
random_state: RandomState or an int seed (0 by default) :
A random number generator instance
min_covar : float, optional
Floor on the diagonal of the covariance matrix to prevent overfitting. Defaults to 1e-3.
thresh : float, optional
Convergence threshold.
n_iter : int, optional
Number of EM iterations to perform.
n_init : int, optional
Number of initializations to perform. The best result is kept.
params : string, optional
Controls which parameters are updated in the training process. Can contain any combi-
nation of w for weights, m for means, and c for covars. Defaults to wmc.
init_params : string, optional
Controls which parameters are updated in the initialization process. Can contain any
combination of w for weights, m for means, and c for covars. Defaults to wmc.
See Also:
DPGMM : Infinite Gaussian mixture model, using the Dirichlet process, fit with a variational algorithm
VBGMM : Finite Gaussian mixture model fit with a variational algorithm, better for situations where there might be
too little data to get a good estimate of the covariance matrix.
Examples
>>> import numpy as np
>>> from sklearn import mixture
>>> np.random.seed(1)
>>> g = mixture.GMM(n_components=2)
>>> # Generate random observations with two modes centered on 0
>>> # and 10 to use for training.
>>> obs = np.concatenate((np.random.randn(100, 1),
... 10 + np.random.randn(300, 1)))
>>> g.fit(obs)
GMM(covariance_type=None, init_params='wmc', min_covar=0.001,
n_components=2, n_init=1, n_iter=100, params='wmc',
random_state=None, thresh=0.01)
>>> np.round(g.weights_, 2)
array([ 0.75, 0.25])
>>> np.round(g.means_, 2)
array([[ 10.05],
[ 0.06]])
>>> np.round(g.covars_, 2)
array([[[ 1.02]],
[[ 0.96]]])
>>> g.predict([[0], [2], [9], [10]])
array([1, 1, 0, 0]...)
>>> np.round(g.score([[0], [2], [9], [10]]), 2)
array([-2.19, -4.58, -1.75, -1.21])
>>> # Refit the model on new data (initial parameters remain the
>>> # same), this time with an even split between the two modes.
>>> g.fit(20 * [[0]] + 20 * [[10]])
GMM(covariance_type=None, init_params='wmc', min_covar=0.001,
n_components=2, n_init=1, n_iter=100, params='wmc',
random_state=None, thresh=0.01)
>>> np.round(g.weights_, 2)
array([ 0.5, 0.5])
Attributes
weights_ : array, shape (n_components,)
This attribute stores the mixing weights for each mixture component.
means_ : array, shape (n_components, n_features)
Mean parameters for each mixture component.
covars_ : array
Covariance parameters for each mixture component. The shape depends on covariance_type:
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
converged_ : bool
True when convergence was reached in fit(), False otherwise.
Methods
aic(X) Akaike information criterion for the current model fit
bic(X) Bayesian information criterion for the current model fit
decode(*args, **kwargs) DEPRECATED: will be removed in v0.12;
eval(X) Evaluate the model on data
fit(X, **kwargs) Estimate model parameters with the expectation-maximization algorithm.
get_params([deep]) Get parameters for the estimator
predict(X) Predict label for data.
predict_proba(X) Predict posterior probability of data under each Gaussian
rvs(*args, **kwargs) DEPRECATED: will be removed in v0.12;
sample([n_samples, random_state]) Generate random samples from the model.
score(X) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, covariance_type='diag', random_state=None, thresh=0.01,
min_covar=0.001, n_iter=100, n_init=1, params='wmc', init_params='wmc')
aic(X)
Akaike information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns aic : float (the lower the better)
bic(X)
Bayesian information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns bic : float (the lower the better)
decode(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Find most likely mixture components for each point in X.
DEPRECATED IN VERSION 0.10; WILL BE REMOVED IN VERSION 0.12 use the score or
predict method instead, depending on the question.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprobs : array_like, shape (n_samples,)
Log probability of each point in obs under the model.
components[array_like, shape (n_samples,)] Index of the most likely mixture components
for each observation
eval(X)
Evaluate the model on data
Compute the log probability of X under the model and return the posterior distribution (responsibilities)
of each mixture component for each element of X.
Parameters X: array_like, shape (n_samples, n_features) :
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob: array_like, shape (n_samples,) :
Log probabilities of each data point in X
responsibilities: array_like, shape (n_samples, n_components) :
Posterior probabilities of each mixture component for each observation
fit(X, **kwargs)
Estimate model parameters with the expectation-maximization algorithm.
An initialization step is performed before entering the EM algorithm. If you want to avoid this step, set the
keyword argument init_params to the empty string '' when creating the GMM object. Likewise, if you
would like just to do an initialization, set n_iter=0.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict label for data.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = (n_samples,)
predict_proba(X)
Predict posterior probability of data under each Gaussian in the model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns responsibilities : array-like, shape = (n_samples, n_components)
Returns the probability of the sample for each Gaussian (state) in the model.
rvs(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Generate random samples from the model.
DEPRECATED IN VERSION 0.11; WILL BE REMOVED IN VERSION 0.12 use sample in-
stead
sample(n_samples=1, random_state=None)
Generate random samples from the model.
Parameters n_samples : int, optional
Number of samples to generate. Defaults to 1.
Returns X : array_like, shape (n_samples, n_features)
List of samples
score(X)
Compute the log probability under the model.
Parameters X : array_like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob : array_like, shape (n_samples,)
Log probabilities of each data point in X
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.mixture.DPGMM
class sklearn.mixture.DPGMM(n_components=1, covariance_type='diag', alpha=1.0, random_state=None,
thresh=0.01, verbose=False, min_covar=None, n_iter=10, params='wmc', init_params='wmc')
Variational Inference for the Infinite Gaussian Mixture Model.
DPGMM stands for Dirichlet Process Gaussian Mixture Model, and it is an infinite mixture model with the
Dirichlet Process as a prior distribution on the number of clusters. In practice the approximate inference
algorithm uses a truncated distribution with a fixed maximum number of components, but almost always the
number of components actually used depends on the data.
Stick-breaking Representation of a Gaussian mixture model probability distribution. This class allows for easy
and efficient inference of an approximate posterior distribution over the parameters of a Gaussian mixture model
with a variable number of components (smaller than the truncation parameter n_components).
Initialization is with normally-distributed means and identity covariance, for proper convergence.
Parameters n_components: int, optional :
Number of mixture components. Defaults to 1.
covariance_type: string, optional :
String describing the type of covariance parameters to use. Must be one of spherical,
tied, diag, full. Defaults to diag.
alpha : float, optional
Real number representing the concentration parameter of the dirichlet process. Intuitively, the Dirichlet
Process is as likely to start a new cluster for a point as it is to add that point to a cluster with alpha
elements. A higher alpha means more clusters, as the expected number of clusters is alpha * log(N).
Defaults to 1.
thresh : float, optional
Convergence threshold.
n_iter : int, optional
Maximum number of iterations to perform before convergence.
params : string, optional
Controls which parameters are updated in the training process. Can contain any combi-
nation of w for weights, m for means, and c for covars. Defaults to wmc.
init_params : string, optional
Controls which parameters are updated in the initialization process. Can contain any
combination of w for weights, m for means, and c for covars. Defaults to wmc.
See Also:
GMM : Finite Gaussian mixture model fit with EM
VBGMM : Finite Gaussian mixture model fit with a variational algorithm, better for situations where there might be
too little data to get a good estimate of the covariance matrix.
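A minimal fitting sketch; the synthetic two-mode data and the chosen n_components, alpha and n_iter values are illustrative only:
>>> import numpy as np
>>> from sklearn import mixture
>>> np.random.seed(1)
>>> X = np.concatenate((np.random.randn(100, 1), 10 + np.random.randn(100, 1)))
>>> dpgmm = mixture.DPGMM(n_components=5, alpha=1.0, n_iter=50)
>>> dpgmm = dpgmm.fit(X)
>>> # although the truncation allows up to 5 components, only a few entries of
>>> # dpgmm.weights_ end up carrying significant mass for this two-mode data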
Attributes
covariance_type : string
String describing the type of covariance parameters used by the DP-GMM. Must be one of 'spherical', 'tied', 'diag', 'full'.
n_components : int
Number of mixture components.
weights_ : array, shape (n_components,)
Mixing weights for each mixture component.
means_ : array, shape (n_components, n_features)
Mean parameters for each mixture component.
precisions_ : array
Precision (inverse covariance) parameters for each mixture component. The shape depends on covariance_type:
(n_components, n_features) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
converged_ : bool
True when convergence was reached in fit(), False otherwise.
Methods
aic(X) Akaike information criterion for the current model fit
bic(X) Bayesian information criterion for the current model fit
decode(*args, **kwargs) DEPRECATED: will be removed in v0.12;
eval(X) Evaluate the model on data
fit(X, **kwargs) Estimate model parameters with the variational algorithm.
get_params([deep]) Get parameters for the estimator
lower_bound(X, z) returns a lower bound on model evidence based on X and membership
predict(X) Predict label for data.
predict_proba(X) Predict posterior probability of data under each Gaussian
rvs(*args, **kwargs) DEPRECATED: will be removed in v0.12;
sample([n_samples, random_state]) Generate random samples from the model.
score(X) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, covariance_type='diag', alpha=1.0, random_state=None, thresh=0.01,
verbose=False, min_covar=None, n_iter=10, params='wmc', init_params='wmc')
aic(X)
Akaike information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns aic : float (the lower the better)
bic(X)
Bayesian information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns bic : float (the lower the better)
decode(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Find most likely mixture components for each point in X.
DEPRECATED IN VERSION 0.10; WILL BE REMOVED IN VERSION 0.12 use the score or
predict method instead, depending on the question.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprobs : array_like, shape (n_samples,)
Log probability of each point in obs under the model.
components[array_like, shape (n_samples,)] Index of the most likely mixture components
for each observation
eval(X)
Evaluate the model on data
Compute the bound on log probability of X under the model and return the posterior distribution (respon-
sibilities) of each mixture component for each element of X.
This is done by computing the parameters for the mean-field of z for each observation.
Parameters X : array_like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob : array_like, shape (n_samples,)
Log probabilities of each data point in X
responsibilities: array_like, shape (n_samples, n_components) :
Posterior probabilities of each mixture component for each observation
fit(X, **kwargs)
Estimate model parameters with the variational algorithm.
For a full derivation and description of the algorithm see doc/dp-derivation/dp-derivation.tex
An initialization step is performed before entering the EM algorithm. If you want to avoid this step, set the
keyword argument init_params to the empty string '' when creating the object. Likewise, if you
would like just to do an initialization, set n_iter=0.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
lower_bound(X, z)
returns a lower bound on model evidence based on X and membership
predict(X)
Predict label for data.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = (n_samples,)
predict_proba(X)
Predict posterior probability of data under each Gaussian in the model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns responsibilities : array-like, shape = (n_samples, n_components)
Returns the probability of the sample for each Gaussian (state) in the model.
rvs(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Generate random samples from the model.
DEPRECATED IN VERSION 0.11; WILL BE REMOVED IN VERSION 0.12 use sample in-
stead
sample(n_samples=1, random_state=None)
Generate random samples from the model.
Parameters n_samples : int, optional
Number of samples to generate. Defaults to 1.
Returns X : array_like, shape (n_samples, n_features)
List of samples
score(X)
Compute the log probability under the model.
Parameters X : array_like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob : array_like, shape (n_samples,)
Log probabilities of each data point in X
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.mixture.VBGMM
class sklearn.mixture.VBGMM(n_components=1, covariance_type='diag', alpha=1.0, random_state=None,
thresh=0.01, verbose=False, min_covar=None, n_iter=10, params='wmc', init_params='wmc')
Variational Inference for the Gaussian Mixture Model
Variational inference for a Gaussian mixture model probability distribution. This class allows for easy and
efficient inference of an approximate posterior distribution over the parameters of a Gaussian mixture model
with a fixed number of components.
Initialization is with normally-distributed means and identity covariance, for proper convergence.
Parameters n_components: int, optional :
Number of mixture components. Defaults to 1.
covariance_type: string, optional :
String describing the type of covariance parameters to use. Must be one of spherical,
tied, diag, full. Defaults to diag.
alpha : float, optional
Real number representing the concentration parameter of the dirichlet distribution. Intu-
itively, the higher the value of alpha the more likely the variational mixture of Gaussians
model will use all components it can. Defaults to 1.
See Also:
GMM : Finite Gaussian mixture model fit with EM
DPGMM : Infinite Gaussian mixture model, using the Dirichlet process, fit with a variational algorithm
Attributes
covariance_type : string
String describing the type of covariance parameters used by the DP-GMM. Must be one of 'spherical', 'tied', 'diag', 'full'.
n_features : int
Dimensionality of the Gaussians.
n_components : int (read-only)
Number of mixture components.
weights_ : array, shape (n_components,)
Mixing weights for each mixture component.
means_ : array, shape (n_components, n_features)
Mean parameters for each mixture component.
precisions_ : array
Precision (inverse covariance) parameters for each mixture component. The shape depends on covariance_type:
(n_components, n_features) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
converged_ : bool
True when convergence was reached in fit(), False otherwise.
Methods
aic(X) Akaike information criterion for the current model fit
bic(X) Bayesian information criterion for the current model fit
decode(*args, **kwargs) DEPRECATED: will be removed in v0.12;
eval(X) Evaluate the model on data
fit(X, **kwargs) Estimate model parameters with the variational algorithm.
get_params([deep]) Get parameters for the estimator
lower_bound(X, z) returns a lower bound on model evidence based on X and membership
predict(X) Predict label for data.
predict_proba(X) Predict posterior probability of data under each Gaussian
rvs(*args, **kwargs) DEPRECATED: will be removed in v0.12;
sample([n_samples, random_state]) Generate random samples from the model.
score(X) Compute the log probability under the model.
set_params(**params) Set the parameters of the estimator.
__init__(n_components=1, covariance_type='diag', alpha=1.0, random_state=None, thresh=0.01,
verbose=False, min_covar=None, n_iter=10, params='wmc', init_params='wmc')
aic(X)
Akaike information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns aic : float (the lower the better)
bic(X)
Bayesian information criterion for the current model fit and the proposed data
Parameters X : array of shape(n_samples, n_dimensions)
Returns bic : float (the lower the better)
decode(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Find most likely mixture components for each point in X.
DEPRECATED IN VERSION 0.10; WILL BE REMOVED IN VERSION 0.12 use the score or
predict method instead, depending on the question.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprobs : array_like, shape (n_samples,)
Log probability of each point in obs under the model.
components[array_like, shape (n_samples,)] Index of the most likely mixture components
for each observation
eval(X)
Evaluate the model on data
Compute the bound on log probability of X under the model and return the posterior distribution (respon-
sibilities) of each mixture component for each element of X.
This is done by computing the parameters for the mean-field of z for each observation.
Parameters X : array_like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob : array_like, shape (n_samples,)
Log probabilities of each data point in X
responsibilities: array_like, shape (n_samples, n_components) :
Posterior probabilities of each mixture component for each observation
fit(X, **kwargs)
Estimate model parameters with the variational algorithm.
For a full derivation and description of the algorithm see doc/dp-derivation/dp-derivation.tex
An initialization step is performed before entering the EM algorithm. If you want to avoid this step, set the
keyword argument init_params to the empty string '' when creating the object. Likewise, if you
would like just to do an initialization, set n_iter=0.
Parameters X : array_like, shape (n, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
lower_bound(X, z)
returns a lower bound on model evidence based on X and membership
predict(X)
Predict label for data.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = (n_samples,)
predict_proba(X)
Predict posterior probability of data under each Gaussian in the model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns responsibilities : array-like, shape = (n_samples, n_components)
Returns the probability of the sample for each Gaussian (state) in the model.
rvs(*args, **kwargs)
DEPRECATED: will be removed in v0.12; use the score or predict method instead, depending on the
question
Generate random samples from the model.
DEPRECATED IN VERSION 0.11; WILL BE REMOVED IN VERSION 0.12 use sample in-
stead
sample(n_samples=1, random_state=None)
Generate random samples from the model.
Parameters n_samples : int, optional
Number of samples to generate. Defaults to 1.
Returns X : array_like, shape (n_samples, n_features)
List of samples
score(X)
Compute the log probability under the model.
Parameters X : array_like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
Returns logprob : array_like, shape (n_samples,)
Log probabilities of each data point in X
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
1.8.19 sklearn.multiclass: Multiclass and multilabel classification
Multiclass and multilabel classification strategies
This module implements multiclass learning algorithms:
one-vs-the-rest / one-vs-all
one-vs-one
error correcting output codes
The estimators provided in this module are meta-estimators: they require a base estimator to be provided in their
constructor. For example, it is possible to use these estimators to turn a binary classifier or a regressor into a multiclass
classifier. It is also possible to use these estimators with multiclass estimators in the hope that their accuracy or runtime
performance improves.
User guide: See the Multiclass and multilabel algorithms section for further details.
multiclass.OneVsRestClassifier(estimator) One-vs-the-rest (OvR) multiclass/multilabel strategy
multiclass.OneVsOneClassifier(estimator) One-vs-one multiclass strategy
multiclass.OutputCodeClassifier(estimator[, ...]) (Error-Correcting) Output-Code multiclass strategy
sklearn.multiclass.OneVsRestClassifier
class sklearn.multiclass.OneVsRestClassifier(estimator)
One-vs-the-rest (OvR) multiclass/multilabel strategy
Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class
is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are
needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one
classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is
the most commonly used strategy for multiclass classification and is a fair default choice.
This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for
instance, by fitting on a sequence of sequences of labels (e.g., a list of tuples) rather than a single target vector.
For multilabel learning, the number of classes must be at least three, since otherwise OvR reduces to binary
classification.
Parameters estimator : estimator object
An estimator object implementing fit and one of decision_function or predict_proba.
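A minimal fitting sketch, using LinearSVC as an arbitrary base estimator on toy data (both choices are illustrative only):
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> X = [[1, 1], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
>>> ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
>>> predictions = ovr.predict(X)   # one underlying LinearSVC was trained per class (3 here)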
Attributes
estimators_ : list of n_classes estimators
Estimators used for predictions.
label_binarizer_ : LabelBinarizer object
Object used to transform multiclass labels to binary labels and vice-versa.
multilabel_ : boolean
Whether a OneVsRestClassifier is a multilabel classifier.
Methods
fit(X, y) Fit underlying estimators.
get_params([deep]) Get parameters for the estimator
predict(X) Predict multi-class targets using underlying estimators.
score(X, y)
set_params(**params) Set the parameters of the estimator.
__init__(estimator)
fit(X, y)
Fit underlying estimators.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Data.
y : array-like, shape = [n_samples], or sequence of sequences, len = n_samples
Multi-class targets. A sequence of sequences turns on multilabel classification.
Returns self :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
multilabel_
Whether this is a multilabel classifier
predict(X)
Predict multi-class targets using underlying estimators.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data.
Returns y : array-like, shape = [n_samples]
Predicted multi-class targets.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.multiclass.OneVsOneClassifier
class sklearn.multiclass.OneVsOneClassifier(estimator)
One-vs-one multiclass strategy
This strategy consists in fitting one classifier per class pair. At prediction time, the class which received the
most votes is selected. Since it requires to fit n_classes * (n_classes - 1) / 2 classifiers, this method is usually
slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous
for algorithms such as kernel algorithms which don't scale well with n_samples. This is because each individual
learning problem only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is
used n_classes times.
Parameters estimator : estimator object
An estimator object implementing fit and predict.
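A minimal fitting sketch, again using LinearSVC and toy data purely for illustration:
>>> from sklearn.multiclass import OneVsOneClassifier
>>> from sklearn.svm import LinearSVC
>>> X = [[1, 1], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
>>> ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
>>> len(ovo.estimators_)   # n_classes * (n_classes - 1) / 2 = 3 pairwise classifiers
3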
Attributes
estimators_ : list of n_classes * (n_classes - 1) / 2 estimators
Estimators used for predictions.
classes_ : numpy array of shape [n_classes]
Array containing labels.
Methods
fit(X, y) Fit underlying estimators.
get_params([deep]) Get parameters for the estimator
predict(X) Predict multi-class targets using underlying estimators.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(estimator)
fit(X, y)
Fit underlying estimators.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data.
y : numpy array of shape [n_samples]
Multi-class targets.
Returns self :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict multi-class targets using underlying estimators.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Data.
Returns y : numpy array of shape [n_samples]
Predicted multi-class targets.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.multiclass.OutputCodeClassifier
class sklearn.multiclass.OutputCodeClassifier(estimator, code_size=1.5, random_state=None)
(Error-Correcting) Output-Code multiclass strategy
Output-code based strategies consist in representing each class with a binary code (an array of 0s and 1s). At
fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to
project new points in the class space and the class closest to the points is chosen. The main advantage of these
strategies is that the number of classifiers used can be controlled by the user, either for compressing the model
(0 < code_size < 1) or for making the model more robust to errors (code_size > 1). See the documentation for
more details.
Parameters estimator : estimator object
An estimator object implementing fit and one of decision_function or predict_proba.
code_size : float
Percentage of the number of classes to be used to create the code book. A number
between 0 and 1 will require fewer classifiers than one-vs-the-rest. A number greater
than 1 will require more classifiers than one-vs-the-rest.
random_state : numpy.RandomState, optional
The generator used to initialize the codebook. Defaults to numpy.random.
References
[R73], [R74], [R75]
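A minimal fitting sketch; LinearSVC, the toy data and the chosen code_size/random_state values are illustrative only:
>>> from sklearn.multiclass import OutputCodeClassifier
>>> from sklearn.svm import LinearSVC
>>> X = [[1, 1], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
>>> ecoc = OutputCodeClassifier(LinearSVC(), code_size=2, random_state=0)
>>> ecoc = ecoc.fit(X, y)
>>> # the code book has int(3 * 2) = 6 bits, so 6 binary classifiers were trained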
Attributes
estimators_ : list of int(n_classes * code_size) estimators
Estimators used for predictions.
classes_ : numpy array of shape [n_classes]
Array containing labels.
code_book_ : numpy array of shape [n_classes, code_size]
Binary array containing the code of each class.
Methods
fit(X, y) Fit underlying estimators.
get_params([deep]) Get parameters for the estimator
predict(X) Predict multi-class targets using underlying estimators.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(estimator, code_size=1.5, random_state=None)
fit(X, y)
Fit underlying estimators.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data.
y : numpy array of shape [n_samples]
Multi-class targets.
Returns self :
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict multi-class targets using underlying estimators.
Parameters X: {array-like, sparse matrix}, shape = [n_samples, n_features] :
Data.
Returns y : numpy array of shape [n_samples]
Predicted multi-class targets.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
multiclass.fit_ovr(estimator, X, y) Fit a one-vs-the-rest strategy.
multiclass.predict_ovr(estimators, ...) Make predictions using the one-vs-the-rest strategy.
multiclass.fit_ovo(estimator, X, y) Fit a one-vs-one strategy.
multiclass.predict_ovo(estimators, classes, X) Make predictions using the one-vs-one strategy.
multiclass.fit_ecoc(estimator, X, y[, ...]) Fit an error-correcting output-code strategy.
multiclass.predict_ecoc(estimators, classes, ...) Make predictions using the error-correcting output-code strategy.
sklearn.multiclass.fit_ovr
sklearn.multiclass.fit_ovr(estimator, X, y)
Fit a one-vs-the-rest strategy.
sklearn.multiclass.predict_ovr
sklearn.multiclass.predict_ovr(estimators, label_binarizer, X)
Make predictions using the one-vs-the-rest strategy.
sklearn.multiclass.fit_ovo
sklearn.multiclass.fit_ovo(estimator, X, y)
Fit a one-vs-one strategy.
sklearn.multiclass.predict_ovo
sklearn.multiclass.predict_ovo(estimators, classes, X)
Make predictions using the one-vs-one strategy.
sklearn.multiclass.fit_ecoc
sklearn.multiclass.fit_ecoc(estimator, X, y, code_size=1.5, random_state=None)
Fit an error-correcting output-code strategy.
Parameters estimator : estimator object
An estimator object implementing fit and one of decision_function or predict_proba.
code_size : float, optional
Percentage of the number of classes to be used to create the code book.
random_state: numpy.RandomState, optional :
The generator used to initialize the codebook. Defaults to numpy.random.
Returns estimators : list of int(n_classes * code_size) estimators
Estimators used for predictions.
classes : numpy array of shape [n_classes]
Array containing labels.
code_book_: numpy array of shape [n_classes, code_size] :
Binary array containing the code of each class.
sklearn.multiclass.predict_ecoc
sklearn.multiclass.predict_ecoc(estimators, classes, code_book, X)
Make predictions using the error-correcting output-code strategy.
1.8.20 sklearn.naive_bayes: Naive Bayes
The sklearn.naive_bayes module implements Naive Bayes algorithms. These are supervised learning methods
based on applying Bayes' theorem with strong (naive) feature independence assumptions.
User guide: See the Naive Bayes section for further details.
naive_bayes.GaussianNB Gaussian Naive Bayes (GaussianNB)
naive_bayes.MultinomialNB([alpha, fit_prior]) Naive Bayes classifier for multinomial models
naive_bayes.BernoulliNB([alpha, binarize, ...]) Naive Bayes classifier for multivariate Bernoulli models.
sklearn.naive_bayes.GaussianNB
class sklearn.naive_bayes.GaussianNB
Gaussian Naive Bayes (GaussianNB)
Parameters X : array-like, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array, shape = [n_samples]
Target vector relative to X
Examples
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB()
>>> print(clf.predict([[-0.8, -1]]))
[1]
Attributes
class_prior_ : array, shape = [n_classes]
Probability of each class.
theta_ : array, shape = [n_classes, n_features]
Mean of each feature per class.
sigma_ : array, shape = [n_classes, n_features]
Variance of each feature per class.
Methods
fit(X, y) Fit Gaussian Naive Bayes according to X, y
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification on an array of test vectors X.
predict_log_proba(X) Return log-probability estimates for the test vector X.
predict_proba(X) Return probability estimates for the test vector X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__()
x.__init__(...) initializes x; see help(type(x)) for signature
class_prior
DEPRECATED: GaussianNB.class_prior is deprecated and will be removed in version 0.12. Please use
GaussianNB.class_prior_ instead.
fit(X, y)
Fit Gaussian Naive Bayes according to X, y
Parameters X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification on an array of test vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
Predicted target values for X
predict_log_proba(X)
Return log-probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the log-probability of the sample for each class in the model, where classes are
ordered arithmetically.
predict_proba(X)
Return probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered arithmetically.
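Continuing the GaussianNB example above, a minimal sketch of how these estimates can be inspected (the exact probabilities depend on the fitted Gaussians and are therefore not shown):
>>> probs = clf.predict_proba([[-0.8, -1]])
>>> probs.shape            # one row per sample, one column per class (here: 1 and 2)
(1, 2)
>>> probs.argmax(axis=1)   # column 0 is the arithmetically smallest class, i.e. class 1
array([0])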
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sigma
DEPRECATED: GaussianNB.sigma is deprecated and will be removed in version 0.12. Please use
GaussianNB.sigma_ instead.
theta
DEPRECATED: GaussianNB.theta is deprecated and will be removed in version 0.12. Please use
GaussianNB.theta_ instead.
sklearn.naive_bayes.MultinomialNB
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True)
Naive Bayes classifier for multinomial models
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for
text classification). The multinomial distribution normally requires integer feature counts. However, in practice,
fractional counts such as tf-idf may also work.
Parameters alpha : float, optional (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
fit_prior : boolean
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
Notes
For the rationale behind the names coef_ and intercept_, i.e. naive Bayes as a linear classifier, see J. Rennie et
al. (2003), Tackling the poor assumptions of naive Bayes text classifiers, ICML.
Examples
>>> import numpy as np
>>> X = np.random.randint(5, size=(6, 100))
>>> Y = np.array([1, 2, 3, 4, 5, 6])
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> clf.fit(X, Y)
MultinomialNB(alpha=1.0, fit_prior=True)
>>> print(clf.predict(X[2]))
[3]
Attributes
intercept_, class_log_prior_ : array, shape = [n_classes]
Smoothed empirical log probability for each class.
feature_log_prob_, coef_ : array, shape = [n_classes, n_features]
Empirical log probability of features given a class, P(x_i|y).
(intercept_ and coef_ are properties referring to class_log_prior_ and feature_log_prob_, respectively.)
Methods
fit(X, y[, sample_weight, class_prior]) Fit Naive Bayes classifier according to X, y
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification on an array of test vectors X.
predict_log_proba(X) Return log-probability estimates for the test vector X.
predict_proba(X) Return probability estimates for the test vector X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, fit_prior=True)
fit(X, y, sample_weight=None, class_prior=None)
Fit Naive Bayes classifier according to X, y
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
class_prior : array, shape [n_classes]
Custom prior probability per class. Overrides the fit_prior parameter.
Returns self : object
Returns self.
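For instance, the class_prior argument documented above can be used to force a uniform prior at fit time. A minimal sketch reusing the X and Y arrays from the example further up (the prior values are arbitrary and only for illustration):
>>> import numpy as np
>>> uniform_prior = np.ones(6) / 6.0            # one entry per class, summing to 1
>>> clf_uniform = MultinomialNB()
>>> clf_uniform = clf_uniform.fit(X, Y, class_prior=uniform_prior)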
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification on an array of test vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
Predicted target values for X
predict_log_proba(X)
Return log-probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the log-probability of the sample for each class in the model, where classes are
ordered arithmetically.
predict_proba(X)
Return probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered arithmetically.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.naive_bayes.BernoulliNB
class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True)
Naive Bayes classifier for multivariate Bernoulli models.
Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB
works with occurrence counts, BernoulliNB is designed for binary/boolean features.
Parameters alpha : float, optional (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
binarize : float or None, optional
Threshold for binarizing (mapping to booleans) of sample features. If None, input is
presumed to already consist of binary vectors (see the sketch below).
fit_prior : boolean
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
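A minimal sketch of the binarize threshold referenced above (the data and the threshold are arbitrary, chosen only for illustration):
>>> import numpy as np
>>> from sklearn.naive_bayes import BernoulliNB
>>> X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.7]])
>>> y = np.array([0, 1, 1, 0])
>>> clf = BernoulliNB(binarize=0.5)     # features above 0.5 are treated as 1, others as 0
>>> clf = clf.fit(X, y)
>>> print(clf.predict([[0.95, 0.05]]))
[1]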
References
C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University
Press, pp. 234-265.
A. McCallum and K. Nigam (1998). A comparison of event models for naive Bayes text classification. Proc.
AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with naive Bayes - Which naive Bayes?
3rd Conf. on Email and Anti-Spam (CEAS).
Examples
>>> import numpy as np
>>> X = np.random.randint(2, size=(6, 100))
>>> Y = np.array([1, 2, 3, 4, 4, 5])
>>> from sklearn.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(X, Y)
BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True)
>>> print(clf.predict(X[2]))
[3]
Attributes
class_log_prior_ : array, shape = [n_classes]
Log probability of each class (smoothed).
feature_log_prob_ : array, shape = [n_classes, n_features]
Empirical log probability of features given a class, P(x_i|y).
Methods
fit(X, y[, sample_weight, class_prior]) Fit Naive Bayes classifier according to X, y
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification on an array of test vectors X.
predict_log_proba(X) Return log-probability estimates for the test vector X.
predict_proba(X) Return probability estimates for the test vector X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(alpha=1.0, binarize=0.0, fit_prior=True)
fit(X, y, sample_weight=None, class_prior=None)
Fit Naive Bayes classifier according to X, y
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values.
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
class_prior : array, shape [n_classes]
Custom prior probability per class. Overrides the fit_prior parameter.
Returns self : object
Returns self.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification on an array of test vectors X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
Predicted target values for X
predict_log_proba(X)
Return log-probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the log-probability of the sample for each class in the model, where classes are
ordered arithmetically.
predict_proba(X)
Return probability estimates for the test vector X.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered arithmetically.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
1.8.21 sklearn.neighbors: Nearest Neighbors
The sklearn.neighbors module implements the k-nearest neighbors algorithm.
User guide: See the Nearest Neighbors section for further details.
neighbors.NearestNeighbors([n_neighbors, ...]) Unsupervised learner for implementing neighbor searches.
neighbors.KNeighborsClassifier([...]) Classifier implementing the k-nearest neighbors vote.
neighbors.RadiusNeighborsClassifier([...]) Classifier implementing a vote among neighbors within a given radius
neighbors.KNeighborsRegressor([n_neighbors, ...]) Regression based on k-nearest neighbors.
neighbors.RadiusNeighborsRegressor([radius, ...]) Regression based on neighbors within a fixed radius.
neighbors.BallTree Ball Tree for fast nearest-neighbor searches
neighbors.NearestCentroid([metric, ...]) Nearest centroid classifier.
sklearn.neighbors.NearestNeighbors
class sklearn.neighbors.NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto',
leaf_size=30, warn_on_equidistant=True, p=2)
Unsupervised learner for implementing neighbor searches.
Parameters n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for k_neighbors queries.
radius : float, optional (default = 1.0)
Range of parameter space to use by default for radius_neighbors queries.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use scipy.spatial.cKDtree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values
passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends
on the nature of the problem.
warn_on_equidistant : boolean, optional. Defaults to True.
Generate a warning if equidistant neighbors are discarded. For classification or regression
based on k-neighbors, if neighbor k and neighbor k+1 have identical distances but
different labels, then the result will be dependent on the ordering of the training data. If
the fit method is 'kd_tree', no warnings will be generated.
p : integer, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
See Also:
KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor,
RadiusNeighborsRegressor, BallTree
Notes
See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and
leaf_size.
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
>>> from sklearn.neighbors import NearestNeighbors
>>> samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
>>> neigh = NearestNeighbors(2, 0.4)
>>> neigh.fit(samples)
NearestNeighbors(...)
>>> neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
array([[2, 0]]...)
>>> neigh.radius_neighbors([0, 0, 1.3], 0.4, return_distance=False)
array([[2]])
Methods
fit(X[, y]) Fit the model using X as training data
get_params([deep]) Get parameters for the estimator
kneighbors(X[, n_neighbors, return_distance]) Finds the K-neighbors of a point.
kneighbors_graph(X[, n_neighbors, mode]) Computes the (weighted) graph of k-Neighbors for points in X
radius_neighbors(X[, radius, return_distance]) Finds the neighbors of a point within a given radius.
radius_neighbors_graph(X[, radius, mode]) Computes the (weighted) graph of Neighbors for points in X
set_params(**params) Set the parameters of the estimator.
__init__(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, warn_on_equidistant=True,
p=2)
fit(X, y=None)
Fit the model using X as training data
Parameters X : {array-like, sparse matrix, BallTree, cKDTree}
Training data. If array or matrix, shape = [n_samples, n_features]
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
kneighbors(X, n_neighbors=None, return_distance=True)
Finds the K-neighbors of a point.
Parameters X : array-like, last dimension same as that of fit data
The new point.
n_neighbors : int
Number of neighbors to get (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data set
and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.kneighbors([1., 1., 1.]))
(array([[ 0.5]]), array([[2]]...))
As you can see, it returns [[0.5]], and [[2]], which means that the element is at distance 0.5 and is the third
element of samples (indexes start at 0). You can also query for multiple points:
>>> X = [[0., 1., 0.], [1., 0., 1.]]
>>> neigh.kneighbors(X, return_distance=False)
array([[1],
[2]]...)
kneighbors_graph(X, n_neighbors=None, mode='connectivity')
Computes the (weighted) graph of k-Neighbors for points in X
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
n_neighbors : int
Number of neighbors for each sample. (default is value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
n_samples_fit is the number of samples in the fitted data A[i, j] is assigned the weight
of edge that connects i to j.
See Also:
NearestNeighbors.radius_neighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=2)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.kneighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 1.],
[ 1., 0., 1.]])
radius_neighbors(X, radius=None, return_distance=True)
Finds the neighbors of a point within a given radius.
Parameters X : array-like, last dimension same as that of fit data
The new point.
radius : float
Limiting distance of neighbors to return. (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data
set and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.6)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.radius_neighbors([1., 1., 1.]))
(array([[ 1.5, 0.5]]...), array([[1, 2]]...)
The first array returned contains the distances to all points which are closer than 1.6, while the second array
returned contains their indices. In general, multiple points can be queried at the same time. Because the
number of neighbors of each point is not necessarily equal, radius_neighbors returns an array of objects,
where each object is a 1D array of indices.
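For example, a minimal sketch reusing the neigh estimator fitted above (the two query points have different numbers of neighbors within radius 1.6, which is exactly why an object array is returned):
>>> ind = neigh.radius_neighbors([[0., 0., 0.], [1., 1., 1.]], return_distance=False)
>>> [len(i) for i in ind]   # 3 neighbors for the first query point, 2 for the second
[3, 2]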
radius_neighbors_graph(X, radius=None, mode='connectivity')
Computes the (weighted) graph of Neighbors for points in X
Neighborhoods are restricted to points at a distance lower than radius.
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
radius : float
Radius of neighborhoods. (default is the value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples]
A[i, j] is assigned the weight of edge that connects i to j.
See Also:
kneighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.5)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.radius_neighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 0.],
[ 1., 0., 1.]])
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.neighbors.KNeighborsClassifier
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform',
algorithm='auto', leaf_size=30,
warn_on_equidistant=True, p=2)
Classifier implementing the k-nearest neighbors vote.
Parameters n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for k_neighbors queries.
weights : str or callable
weight function used in prediction. Possible values:
'uniform' : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors
of a query point will have a greater influence than neighbors which are further
away.
[callable] : a user-defined function which accepts an array of distances, and returns
an array of the same shape containing the weights (see the sketch after the Examples
section below).
Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use scipy.spatial.cKDtree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values
passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends
on the nature of the problem.
warn_on_equidistant : boolean, optional. Defaults to True.
Generate a warning if equidistant neighbors are discarded. For classification or regression
based on k-neighbors, if neighbor k and neighbor k+1 have identical distances but
different labels, then the result will be dependent on the ordering of the training data. If
the fit method is 'kd_tree', no warnings will be generated.
p : integer, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
See Also:
RadiusNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsRegressor,
NearestNeighbors
Notes
See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and
leaf_size.
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=2)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.5]]))
[0]
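As referenced under the weights parameter, a user-defined callable can be passed instead of a string. A minimal sketch (the Gaussian-style weighting function below is only an illustration, not a scikit-learn built-in):
>>> import numpy as np
>>> def gaussian_weights(distances):
...     # return one weight per distance, with the same shape as the input array
...     return np.exp(-distances ** 2)
...
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=2, weights=gaussian_weights)
>>> neigh = neigh.fit(X, y)
>>> print(neigh.predict([[1.1]]))
[0]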
Methods
fit(X, y) Fit the model using X as training data and y as target values
get_params([deep]) Get parameters for the estimator
kneighbors(X[, n_neighbors, return_distance]) Finds the K-neighbors of a point.
kneighbors_graph(X[, n_neighbors, mode]) Computes the (weighted) graph of k-Neighbors for points in X
predict(X) Predict the class labels for the provided data
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30,
warn_on_equidistant=True, p=2)
fit(X, y)
Fit the model using X as training data and y as target values
Parameters X : {array-like, sparse matrix, BallTree, cKDTree}
Training data. If array or matrix, then the shape is [n_samples, n_features]
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of integer values.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
kneighbors(X, n_neighbors=None, return_distance=True)
Finds the K-neighbors of a point.
Parameters X : array-like, last dimension same as that of fit data
The new point.
n_neighbors : int
Number of neighbors to get (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data set
and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.kneighbors([1., 1., 1.]))
(array([[ 0.5]]), array([[2]]...))
As you can see, it returns [[0.5]], and [[2]], which means that the element is at distance 0.5 and is the third
element of samples (indexes start at 0). You can also query for multiple points:
>>> X = [[0., 1., 0.], [1., 0., 1.]]
>>> neigh.kneighbors(X, return_distance=False)
array([[1],
[2]]...)
kneighbors_graph(X, n_neighbors=None, mode='connectivity')
Computes the (weighted) graph of k-Neighbors for points in X
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
n_neighbors : int
Number of neighbors for each sample. (default is value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
n_samples_fit is the number of samples in the fitted data A[i, j] is assigned the weight
of edge that connects i to j.
See Also:
NearestNeighbors.radius_neighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=2)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.kneighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 1.],
[ 1., 0., 1.]])
predict(X)
Predict the class labels for the provided data
Parameters X: array :
A 2-D array representing the test points.
Returns labels: array :
List of class labels (one for each data sample).
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.neighbors.RadiusNeighborsClassifier
class sklearn.neighbors.RadiusNeighborsClassifier(radius=1.0, weights='uniform',
algorithm='auto', leaf_size=30, p=2,
outlier_label=None)
Classifier implementing a vote among neighbors within a given radius
Parameters radius : float, optional (default = 1.0)
Range of parameter space to use by default for radius_neighbors queries.
weights : str or callable
weight function used in prediction. Possible values:
'uniform' : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors
of a query point will have a greater influence than neighbors which are further
away.
[callable] : a user-defined function which accepts an array of distances, and returns
an array of the same shape containing the weights.
Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use scipy.spatial.cKDtree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values
passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends
on the nature of the problem.
p : integer, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
outlier_label : int, optional (default = None)
Label given to outlier samples (samples with no neighbors within the given radius); see
the sketch after the Examples section below. If set to None, a ValueError is raised when
an outlier is detected.
See Also:
KNeighborsClassifier, RadiusNeighborsRegressor, KNeighborsRegressor,
NearestNeighbors
Notes
See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and
leaf_size.
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import RadiusNeighborsClassifier
>>> neigh = RadiusNeighborsClassifier(radius=1.0)
>>> neigh.fit(X, y)
RadiusNeighborsClassifier(...)
>>> print(neigh.predict([[1.5]]))
[0]
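A minimal sketch of the outlier_label behaviour referenced above, reusing X and y from the example (-1 is an arbitrary label chosen purely for illustration):
>>> neigh = RadiusNeighborsClassifier(radius=1.0, outlier_label=-1)
>>> neigh = neigh.fit(X, y)
>>> print(neigh.predict([[10.0]]))   # no training point lies within radius 1.0 of 10.0
[-1]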
Methods
fit(X, y) Fit the model using X as training data and y as target values
get_params([deep]) Get parameters for the estimator
predict(X) Predict the class labels for the provided data
radius_neighbors(X[, radius, return_distance]) Finds the neighbors of a point within a given radius.
radius_neighbors_graph(X[, radius, mode]) Computes the (weighted) graph of Neighbors for points in X
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, outlier_label=None)
fit(X, y)
Fit the model using X as training data and y as target values
Parameters X : {array-like, sparse matrix, BallTree, cKDTree}
Training data. If array or matrix, then the shape is [n_samples, n_features]
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of integer values.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict the class labels for the provided data
Parameters X: array :
A 2-D array representing the test points.
Returns labels: array :
List of class labels (one for each data sample).
radius_neighbors(X, radius=None, return_distance=True)
Finds the neighbors of a point within a given radius.
Parameters X : array-like, last dimension same as that of fit data
The new point.
radius : float
Limiting distance of neighbors to return. (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data
set and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.6)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.radius_neighbors([1., 1., 1.]))
(array([[ 1.5, 0.5]]...), array([[1, 2]]...)
The first array returned contains the distances to all points which are closer than 1.6, while the second array
returned contains their indices. In general, multiple points can be queried at the same time. Because the
number of neighbors of each point is not necessarily equal, radius_neighbors returns an array of objects,
where each object is a 1D array of indices.
radius_neighbors_graph(X, radius=None, mode='connectivity')
Computes the (weighted) graph of Neighbors for points in X
Neighborhoods are restricted to points at a distance lower than radius.
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
radius : float
Radius of neighborhoods. (default is the value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples]
A[i, j] is assigned the weight of edge that connects i to j.
See Also:
kneighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.5)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.radius_neighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 0.],
[ 1., 0., 1.]])
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.neighbors.KNeighborsRegressor
class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform',
algorithm='auto', leaf_size=30,
warn_on_equidistant=True, p=2)
Regression based on k-nearest neighbors.
The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
Parameters n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for k_neighbors queries.
weights : str or callable
weight function used in prediction. Possible values:
'uniform' : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors
of a query point will have a greater influence than neighbors which are further
away.
[callable] : a user-defined function which accepts an array of distances, and returns
an array of the same shape containing the weights.
Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use scipy.spatial.cKDtree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values
passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends
on the nature of the problem.
warn_on_equidistant : boolean, optional. Defaults to True.
Generate a warning if equidistant neighbors are discarded. For classification or regression
based on k-neighbors, if neighbor k and neighbor k+1 have identical distances but
different labels, then the result will be dependent on the ordering of the training data. If
the fit method is 'kd_tree', no warnings will be generated.
p : integer, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
See Also:
NearestNeighbors, RadiusNeighborsRegressor, KNeighborsClassifier,
RadiusNeighborsClassifier
Notes
See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and
leaf_size.
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsRegressor
>>> neigh = KNeighborsRegressor(n_neighbors=2)
>>> neigh.fit(X, y)
KNeighborsRegressor(...)
>>> print(neigh.predict([[1.5]]))
[ 0.5]
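With weights='distance', the interpolation described above becomes an inverse-distance weighted average. A minimal sketch reusing X and y from the example (the value 0.8 follows from weighting the two neighbors, at distances 0.2 and 0.8, by 1/distance):
>>> from sklearn.neighbors import KNeighborsRegressor
>>> neigh = KNeighborsRegressor(n_neighbors=2, weights='distance')
>>> neigh = neigh.fit(X, y)
>>> print(neigh.predict([[1.8]]))
[ 0.8]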
Methods
fit(X, y) Fit the model using X as training data and y as target values
get_params([deep]) Get parameters for the estimator
kneighbors(X[, n_neighbors, return_distance]) Finds the K-neighbors of a point.
kneighbors_graph(X[, n_neighbors, mode]) Computes the (weighted) graph of k-Neighbors for points in X
predict(X) Predict the target for the provided data
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30,
warn_on_equidistant=True, p=2)
fit(X, y)
Fit the model using X as training data and y as target values
Parameters X : {array-like, sparse matrix, BallTree, cKDTree}
Training data. If array or matrix, then the shape is [n_samples, n_features]
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of oat values.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
kneighbors(X, n_neighbors=None, return_distance=True)
Finds the K-neighbors of a point.
Parameters X : array-like, last dimension same as that of fit data
The new point.
n_neighbors : int
Number of neighbors to get (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data set
and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.kneighbors([1., 1., 1.]))
(array([[ 0.5]]), array([[2]]...))
As you can see, it returns [[0.5]], and [[2]], which means that the element is at distance 0.5 and is the third
element of samples (indexes start at 0). You can also query for multiple points:
>>> X = [[0., 1., 0.], [1., 0., 1.]]
>>> neigh.kneighbors(X, return_distance=False)
array([[1],
[2]]...)
kneighbors_graph(X, n_neighbors=None, mode='connectivity')
Computes the (weighted) graph of k-Neighbors for points in X
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
n_neighbors : int
Number of neighbors for each sample. (default is value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
n_samples_fit is the number of samples in the fitted data A[i, j] is assigned the weight
of edge that connects i to j.
See Also:
NearestNeighbors.radius_neighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=2)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.kneighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 1.],
[ 1., 0., 1.]])
predict(X)
Predict the target for the provided data
Parameters X : array
A 2-D array representing the test data.
Returns y: array :
List of target values (one for each data sample).
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.neighbors.RadiusNeighborsRegressor
class sklearn.neighbors.RadiusNeighborsRegressor(radius=1.0, weights='uniform', algorithm='auto',
leaf_size=30, p=2)
Regression based on neighbors within a fixed radius.
The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
Parameters radius : float, optional (default = 1.0)
Range of parameter space to use by default for radius_neighbors queries.
weights : str or callable
weight function used in prediction. Possible values:
'uniform' : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors
of a query point will have a greater influence than neighbors which are further
away.
[callable] : a user-defined function which accepts an array of distances, and returns
an array of the same shape containing the weights.
Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'ball_tree' will use BallTree
'kd_tree' will use scipy.spatial.cKDtree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values
passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends
on the nature of the problem.
p : integer, optional (default = 2)
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
See Also:
NearestNeighbors, KNeighborsRegressor, KNeighborsClassifier,
RadiusNeighborsClassifier
Notes
See Nearest Neighbors in the online documentation for a discussion of the choice of algorithm and
leaf_size.
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import RadiusNeighborsRegressor
>>> neigh = RadiusNeighborsRegressor(radius=1.0)
>>> neigh.fit(X, y)
RadiusNeighborsRegressor(...)
>>> print(neigh.predict([[1.5]]))
[ 0.5]
Methods
fit(X, y) Fit the model using X as training data and y as target values
get_params([deep]) Get parameters for the estimator
predict(X) Predict the target for the provided data
radius_neighbors(X[, radius, return_distance]) Finds the neighbors of a point within a given radius.
radius_neighbors_graph(X[, radius, mode]) Computes the (weighted) graph of Neighbors for points in X
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2)
fit(X, y)
Fit the model using X as training data and y as target values
Parameters X : {array-like, sparse matrix, BallTree, cKDTree}
Training data. If array or matrix, then the shape is [n_samples, n_features]
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of oat values.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict the target for the provided data
Parameters X : array
A 2-D array representing the test data.
Returns y: array :
List of target values (one for each data sample).
radius_neighbors(X, radius=None, return_distance=True)
Finds the neighbors of a point within a given radius.
Parameters X : array-like, last dimension same as that of fit data
The new point.
radius : float
Limiting distance of neighbors to return. (default is the value passed to the constructor).
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns dist : array
Array representing the lengths to point, only present if return_distance=True
ind : array
Indices of the nearest points in the population matrix.
Examples
In the following example, we construct a NearestNeighbors class from an array representing our data
set and ask which is the closest point to [1,1,1]:
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.6)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.radius_neighbors([1., 1., 1.]))
(array([[ 1.5, 0.5]]...), array([[1, 2]]...)
The first array returned contains the distances to all points which are closer than 1.6, while the second array
returned contains their indices. In general, multiple points can be queried at the same time. Because the
number of neighbors of each point is not necessarily equal, radius_neighbors returns an array of objects,
where each object is a 1D array of indices.
radius_neighbors_graph(X, radius=None, mode='connectivity')
Computes the (weighted) graph of Neighbors for points in X
Neighborhoods are restricted to points at a distance lower than radius.
Parameters X : array-like, shape = [n_samples, n_features]
Sample data
radius : float
Radius of neighborhoods. (default is the value passed to the constructor).
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, in 'distance' the edges are Euclidean distance between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples]
A[i, j] is assigned the weight of edge that connects i to j.
See Also:
kneighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(radius=1.5)
>>> neigh.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> A = neigh.radius_neighbors_graph(X)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 0.],
[ 1., 0., 1.]])
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is
1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
sklearn.neighbors.BallTree
class sklearn.neighbors.BallTree
Ball Tree for fast nearest-neighbor searches
BallTree(X, leaf_size=20, p=2.0)
Parameters X : array-like, shape = [n_samples, n_features]
n_samples is the number of points in the data set, and n_features is the dimension of
the parameter space. Note: if X is a C-contiguous array of doubles then data will not be
copied. Otherwise, an internal copy will be made.
leaf_size : positive integer (default = 20)
Number of points at which to switch to brute-force. Changing leaf_size will not affect
the results of a query, but can significantly impact the speed of a query and the
memory required to store the built ball tree. The amount of memory needed to store
the tree scales as 2 ** (1 + floor(log2((n_samples - 1) / leaf_size))) - 1. For a specified
leaf_size, a leaf node is guaranteed to satisfy leaf_size <= n_points <= 2 * leaf_size,
except in the case that n_samples < leaf_size.
p : distance metric for the BallTree. p encodes the Minkowski
p-distance:
D = sum((X[i] - X[j]) ** p) ** (1. / p)
p must be greater than or equal to 1, so that the triangle inequality will hold. If p ==
np.inf, then the distance is equivalent to:
D = max(X[i] - X[j])
Examples
Query for k-nearest neighbors
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10,3)) # 10 points in 3 dimensions
>>> ball_tree = BallTree(X, leaf_size=2)
>>> dist, ind = ball_tree.query(X[0], k=3)
>>> print ind # indices of 3 closest neighbors
[0 3 1]
>>> print dist # distances to 3 closest neighbors
[ 0. 0.19662693 0.29473397]
Pickle and Unpickle a ball tree (using protocol = 2). Note that the state of the tree is saved in the pickle operation:
the tree is not rebuilt on un-pickling
>>> import numpy as np
>>> import pickle
>>> np.random.seed(0)
>>> X = np.random.random((10,3)) # 10 points in 3 dimensions
>>> ball_tree = BallTree(X, leaf_size=2)
>>> s = pickle.dumps(ball_tree, protocol=2)
>>> ball_tree_copy = pickle.loads(s)
>>> dist, ind = ball_tree_copy.query(X[0], k=3)
>>> print ind # indices of 3 closest neighbors
[0 3 1]
>>> print dist # distances to 3 closest neighbors
[ 0. 0.19662693 0.29473397]
Attributes
data
warning_flag
Methods
query(X[, k, return_distance]) query the Ball Tree for the k nearest neighbors
query_radius query_radius(self, X, r, count_only = False):
__init__()
x.__init__(...) initializes x; see help(type(x)) for signature
query(X, k=1, return_distance=True)
query the Ball Tree for the k nearest neighbors
Parameters X : array-like, last dimension self.dim
An array of points to query
k : integer (default = 1)
The number of nearest neighbors to return
return_distance : boolean (default = True)
if True, return a tuple (d,i) if False, return array i
Returns i : if return_distance == False
(d,i) : if return_distance == True
d : array of doubles - shape: x.shape[:-1] + (k,)
each entry gives the list of distances to the neighbors of the corresponding point (note
that distances are not sorted)
i : array of integers - shape: x.shape[:-1] + (k,)
each entry gives the list of indices of neighbors of the corresponding point (note that
neighbors are not sorted)
Examples
Query for k-nearest neighbors
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10,3)) # 10 points in 3 dimensions
>>> ball_tree = BallTree(X, leaf_size=2)
>>> dist, ind = ball_tree.query(X[0], k=3)
>>> print ind # indices of 3 closest neighbors
[0 3 1]
>>> print dist # distances to 3 closest neighbors
[ 0. 0.19662693 0.29473397]
query_radius()
query_radius(self, X, r, count_only = False):
query the Ball Tree for neighbors within a ball of size r
Parameters X : array-like, last dimension self.dim
An array of points to query
r : distance within which neighbors are returned
r can be a single value, or an array of values of shape x.shape[:-1] if different radii are
desired for each point.
return_distance : boolean (default = False)
if True, return distances to neighbors of each point; if False, return only neighbors. Note
that unlike BallTree.query(), setting return_distance=True adds to the computation time.
Not all distances need to be calculated explicitly for return_distance=False. Results are
not sorted by default: see sort_results keyword.
count_only : boolean (default = False)
if True, return only the count of points within distance r; if False, return the indices of all
points within distance r. If return_distance==True, setting count_only=True will result
in an error.
sort_results : boolean (default = False)
if True, the distances and indices will be sorted before being returned. If False, the
results will not be sorted. If return_distance == False, setting sort_results = True will
result in an error.
Returns count : if count_only == True
ind : if count_only == False and return_distance == False
(ind, dist) : if count_only == False and return_distance == True
count : array of integers, shape = X.shape[:-1]
each entry gives the number of neighbors within a distance r of the corresponding point.
ind : array of objects, shape = X.shape[:-1]
each element is a numpy integer array listing the indices of neighbors of the correspond-
ing point. Note that unlike the results of BallTree.query(), the returned neighbors are
not sorted by distance
dist : array of objects, shape = X.shape[:-1]
each element is a numpy double array listing the distances corresponding to indices in
i.
Examples
Query for neighbors in a given radius
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10,3)) # 10 points in 3 dimensions
>>> ball_tree = BallTree(X, leaf_size=2)
>>> print ball_tree.query_radius(X[0], r=0.3, count_only=True)
3
>>> ind = ball_tree.query_radius(X[0], r=0.3)
>>> print ind # indices of neighbors within distance 0.3
[3 0 1]
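Distances can be requested together with the indices using the return_distance and sort_results keywords documented above. A minimal sketch continuing the example (outputs are not shown here):
>>> ind, dist = ball_tree.query_radius(X[0], r=0.3, return_distance=True,
...                                    sort_results=True)
>>> # ind lists the neighbors of X[0] and dist the matching distances,
>>> # sorted by increasing distance because sort_results=True.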
sklearn.neighbors.NearestCentroid
class sklearn.neighbors.NearestCentroid(metric='euclidean', shrink_threshold=None)
Nearest centroid classifier.
Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.
Parameters metric : string, or callable
The metric to use when calculating distance between instances in a feature array.
If metric is a string or callable, it must be one of the options allowed by
metrics.pairwise.pairwise_distances for its metric parameter.
shrink_threshold : float, optional
Threshold for shrinking centroids to remove features (see the sketch after the Examples
section below).
See Also:
sklearn.neighbors.KNeighborsClassifier : nearest neighbors classifier
Notes
When used for text classification with tf-idf vectors, this classifier is also known as the Rocchio classifier.
References
Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken
centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America,
99(10), 6567-6572. The National Academy of Sciences.
Examples
>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print clf.predict([[-0.8, -1]])
[1]
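A minimal sketch of the shrink_threshold option referenced above (the threshold value is arbitrary and chosen only for illustration; with these well-separated classes the prediction is unchanged):
>>> clf_shrunk = NearestCentroid(shrink_threshold=0.1)
>>> clf_shrunk = clf_shrunk.fit(X, y)
>>> print clf_shrunk.predict([[-0.8, -1]])
[1]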
Attributes
centroids_ : array-like, shape = [n_classes, n_features]
Centroid of each class.
Methods
fit(X, y) Fit the NearestCentroid model according to the given training data.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification on an array of test vectors X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(metric='euclidean', shrink_threshold=None)
fit(X, y)
Fit the NearestCentroid model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features. Note that centroid shrinking cannot be used with sparse matrices.
y : array, shape = [n_samples]
Target values (integers)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification on an array of test vectors X.
The predicted class C for each sample in X is returned.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
Notes
If the metric constructor parameter is 'precomputed', X is assumed to be the distance matrix between the data to be predicted and self.centroids_.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
neighbors.kneighbors_graph(X, n_neighbors[, ...]) Computes the (weighted) graph of k-Neighbors for points in X
neighbors.radius_neighbors_graph(X, radius) Computes the (weighted) graph of Neighbors for points in X
sklearn.neighbors.kneighbors_graph
sklearn.neighbors.kneighbors_graph(X, n_neighbors, mode='connectivity')
Computes the (weighted) graph of k-Neighbors for points in X
Parameters X : array-like or BallTree, shape = [n_samples, n_features]
Sample data, in the form of a numpy array or a precomputed BallTree.
n_neighbors : int
Number of neighbors for each sample.
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, and 'distance' will return the edges as Euclidean distances between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples]
A[i, j] is assigned the weight of edge that connects i to j.
See Also:
radius_neighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import kneighbors_graph
>>> A = kneighbors_graph(X, 2)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 1.],
[ 1., 0., 1.]])
sklearn.neighbors.radius_neighbors_graph
sklearn.neighbors.radius_neighbors_graph(X, radius, mode='connectivity')
Computes the (weighted) graph of Neighbors for points in X
Neighborhoods are restricted to points at a distance lower than radius.
Parameters X : array-like or BallTree, shape = [n_samples, n_features]
Sample data, in the form of a numpy array or a precomputed BallTree.
radius : float
Radius of neighborhoods.
mode : {'connectivity', 'distance'}, optional
Type of returned matrix: 'connectivity' will return the connectivity matrix with ones
and zeros, and 'distance' will return the edges as Euclidean distances between points.
Returns A : sparse matrix in CSR format, shape = [n_samples, n_samples]
A[i, j] is assigned the weight of edge that connects i to j.
See Also:
kneighbors_graph
Examples
>>> X = [[0], [3], [1]]
>>> from sklearn.neighbors import radius_neighbors_graph
>>> A = radius_neighbors_graph(X, 1.5)
>>> A.todense()
matrix([[ 1., 0., 1.],
[ 0., 1., 0.],
[ 1., 0., 1.]])
1.8.22 sklearn.pls: Partial Least Squares
The sklearn.pls module implements Partial Least Squares (PLS).
User guide: See the Partial Least Squares section for further details.
pls.PLSRegression([n_components, scale, ...]) PLS regression
pls.PLSCanonical([n_components, scale, ...]) PLSCanonical implements the 2 blocks canonical PLS of the original Wold
pls.CCA([n_components, scale, max_iter, ...]) CCA Canonical Correlation Analysis. CCA inherits from PLS with
pls.PLSSVD([n_components, scale, copy]) Partial Least Square SVD
sklearn.pls.PLSRegression
class sklearn.pls.PLSRegression(n_components=2, scale=True, max_iter=500, tol=1e-06,
copy=True)
PLS regression
PLSRegression implements the PLS 2 blocks regression known as PLS2 or PLS1 in case of one-dimensional response. This class inherits from _PLS with mode="A", deflation_mode="regression", norm_y_weights=False and algorithm="nipals".
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q]
Training vectors, where n_samples is the number of samples and q is the number of response variables.
n_components : int, (default 2)
Number of components to keep.
scale : boolean, (default True)
whether to scale the data
max_iter : an integer, (default 500)
the maximum number of iterations of the NIPALS inner loop (used only if algorithm="nipals")
tol : non-negative real
Tolerance used in the iterative algorithm, default 1e-06.
copy : boolean, default True
Whether the deflation should be done on a copy. Leave the default value to True unless you don't care about side effects.
Notes
For each component k, find weights u, v that optimize: max corr(Xk u, Yk v) * var(Xk u) * var(Yk v), such that |u| = 1.
Note that it maximizes both the correlations between the scores and the intra-block variances.
The residual matrix of X (Xk+1) block is obtained by the deflation on the current X score: x_score.
The residual matrix of Y (Yk+1) block is obtained by deflation on the current X score. This performs the PLS regression known as PLS2. This mode is prediction oriented.
This implementation provides the same results as the 3 PLS packages provided in the R language (R-project):
mixOmics with function pls(X, Y, mode = "regression")
plspm with function plsreg2(X, Y)
pls with function oscorespls.fit(X, Y)
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
In French but still a reference: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Examples
>>> from sklearn.pls import PLSCanonical, PLSRegression, CCA
>>> X = [[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [2.,5.,4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> pls2 = PLSRegression(n_components=2)
>>> pls2.fit(X, Y)
...
PLSRegression(copy=True, max_iter=500, n_components=2, scale=True,
tol=1e-06)
>>> Y_pred = pls2.predict(X)
Attributes
x_weights_ array, [p, n_components] X block weights vectors.
y_weights_ array, [q, n_components] Y block weights vectors.
x_loadings_ array, [p, n_components] X block loadings vectors.
y_loadings_ array, [q, n_components] Y block loadings vectors.
x_scores_ array, [n_samples, n_components] X scores.
y_scores_ array, [n_samples, n_components] Y scores.
x_rotations_ array, [p, n_components] X block to latents rotations.
y_rotations_ array, [q, n_components] Y block to latents rotations.
coefs: array, [p, q] The coefficients of the linear model: Y = X coefs + Err
Methods
fit(X, Y)
get_params([deep]) Get parameters for the estimator
predict(X[, copy]) Apply the dimension reduction learned on the train data.
set_params(**params) Set the parameters of the estimator.
transform(X[, Y, copy]) Apply the dimension reduction learned on the train data.
__init__(n_components=2, scale=True, max_iter=500, tol=1e-06, copy=True)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, Y=None, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q], optional
Training vectors, where n_samples is the number of samples and q is the number of response variables.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Returns x_scores if Y is not given, (x_scores, y_scores) otherwise. :
sklearn.pls.PLSCanonical
class sklearn.pls.PLSCanonical(n_components=2, scale=True, algorithm="nipals", max_iter=500, tol=1e-06, copy=True)
PLSCanonical implements the 2 blocks canonical PLS of the original Wold algorithm [Tenenhaus 1998] p.204, referred to as PLS-C2A in [Wegelin 2000].
This class inherits from PLS with mode="A", deflation_mode="canonical", norm_y_weights=True and algorithm="nipals", but svd should provide similar results up to numerical errors.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q]
Training vectors, where n_samples is the number of samples and q is the number of response variables.
n_components : int, number of components to keep. (default 2).
scale : boolean, whether to scale the data (default True)
algorithm : string, "nipals" or "svd"
The algorithm used to estimate the weights. It will be called n_components times, i.e.
once for each iteration of the outer loop.
max_iter : an integer, (default 500)
the maximum number of iterations of the NIPALS inner loop (used only if algorithm="nipals")
tol : non-negative real, default 1e-06
the tolerance used in the iterative algorithm
copy : boolean, default True
Whether the deflation should be done on a copy. Leave the default value to True unless you don't care about side effects.
See Also:
CCA, PLSSVD
Notes
For each component k, find weights u, v that optimize: max corr(Xk u, Yk v) * var(Xk u) * var(Yk v), such that |u| = |v| = 1.
Note that it maximizes both the correlations between the scores and the intra-block variances.
The residual matrix of X (Xk+1) block is obtained by the deflation on the current X score: x_score.
The residual matrix of Y (Yk+1) block is obtained by deflation on the current Y score. This performs a canonical symmetric version of the PLS regression, slightly different from CCA. This mode is mostly used for modeling.
This implementation provides the same results as the plspm package provided in the R language (R-project), using the function plsca(X, Y). Results are equal or collinear with the function pls(..., mode = "canonical") of the mixOmics package. The difference lies in the fact that the mixOmics implementation does not exactly implement the Wold algorithm since it does not normalize y_weights to one.
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Examples
>>> from sklearn.pls import PLSCanonical, PLSRegression, CCA
>>> X = [[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [2.,5.,4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> plsca = PLSCanonical(n_components=2)
>>> plsca.fit(X, Y)
...
PLSCanonical(algorithm='nipals', copy=True, max_iter=500, n_components=2,
scale=True, tol=1e-06)
>>> X_c, Y_c = plsca.transform(X, Y)
Attributes
x_weights_ array, shape = [p, n_components] X block weights vectors.
y_weights_ array, shape = [q, n_components] Y block weights vectors.
x_loadings_ array, shape = [p, n_components] X block loadings vectors.
y_loadings_ array, shape = [q, n_components] Y block loadings vectors.
x_scores_ array, shape = [n_samples, n_components] X scores.
y_scores_ array, shape = [n_samples, n_components] Y scores.
x_rotations_ array, shape = [p, n_components] X block to latents rotations.
y_rotations_ array, shape = [q, n_components] Y block to latents rotations.
Methods
fit(X, Y)
get_params([deep]) Get parameters for the estimator
predict(X[, copy]) Apply the dimension reduction learned on the train data.
set_params(**params) Set the parameters of the estimator.
transform(X[, Y, copy]) Apply the dimension reduction learned on the train data.
__init__(n_components=2, scale=True, algorithm=nipals, max_iter=500, tol=1e-06, copy=True)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, Y=None, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q], optional
Training vectors, where n_samples is the number of samples and q is the number of response variables.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Returns x_scores if Y is not given, (x_scores, y_scores) otherwise. :
sklearn.pls.CCA
class sklearn.pls.CCA(n_components=2, scale=True, max_iter=500, tol=1e-06, copy=True)
CCA Canonical Correlation Analysis. CCA inherits from PLS with mode="B" and deflation_mode="canonical".
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q]
Training vectors, where n_samples is the number of samples and q is the number of response variables.
n_components : int, (default 2).
number of components to keep.
scale : boolean, (default True)
whether to scale the data
max_iter : an integer, (default 500)
the maximum number of iterations of the NIPALS inner loop (used only if algorithm="nipals")
tol : non-negative real, default 1e-06.
the tolerance used in the iterative algorithm
copy : boolean
Whether the deflation should be done on a copy. Leave the default value to True unless you don't care about side effects.
See Also:
PLSCanonical, PLSSVD
Notes
For each component k, find the weights u, v that maximize max corr(Xk u, Yk v), such that |u| = |v| = 1.
Note that it maximizes only the correlations between the scores.
The residual matrix of X (Xk+1) block is obtained by the deflation on the current X score: x_score.
The residual matrix of Y (Yk+1) block is obtained by deflation on the current Y score.
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
In French but still a reference: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Examples
>>> from sklearn.pls import PLSCanonical, PLSRegression, CCA
>>> X = [[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [3.,5.,4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> cca = CCA(n_components=1)
>>> cca.fit(X, Y)
...
CCA(copy=True, max_iter=500, n_components=1, scale=True, tol=1e-06)
>>> X_c, Y_c = cca.transform(X, Y)
Attributes
x_weights_ array, [p, n_components] X block weights vectors.
y_weights_ array, [q, n_components] Y block weights vectors.
x_loadings_ array, [p, n_components] X block loadings vectors.
y_loadings_ array, [q, n_components] Y block loadings vectors.
x_scores_ array, [n_samples, n_components] X scores.
y_scores_ array, [n_samples, n_components] Y scores.
x_rotations_ array, [p, n_components] X block to latents rotations.
y_rotations_ array, [q, n_components] Y block to latents rotations.
Methods
fit(X, Y)
get_params([deep]) Get parameters for the estimator
predict(X[, copy]) Apply the dimension reduction learned on the train data.
set_params(**params) Set the parameters of the estimator.
transform(X[, Y, copy]) Apply the dimension reduction learned on the train data.
__init__(n_components=2, scale=True, max_iter=500, tol=1e-06, copy=True)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, Y=None, copy=True)
Apply the dimension reduction learned on the train data.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vectors, where n_samples is the number of samples and p is the number of predictors.
Y : array-like of response, shape = [n_samples, q], optional
Training vectors, where n_samples is the number of samples and q is the number of response variables.
copy : boolean
Whether to copy X and Y, or perform in-place normalization.
Returns x_scores if Y is not given, (x_scores, y_scores) otherwise. :
sklearn.pls.PLSSVD
class sklearn.pls.PLSSVD(n_components=2, scale=True, copy=True)
Partial Least Square SVD
Simply performs an SVD on the cross-covariance matrix X'Y. There is no iterative deflation here.
Parameters X : array-like of predictors, shape = [n_samples, p]
Training vector, where n_samples is the number of samples and p is the number of predictors. X will be centered before any analysis.
Y : array-like of response, shape = [n_samples, q]
Training vector, where n_samples is the number of samples and q is the number of response variables. Y will be centered before any analysis.
n_components : int, (default 2).
number of components to keep.
scale : boolean, (default True)
scale X and Y
See Also:
PLSCanonical, CCA
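Examples
Unlike the other PLS estimators above, PLSSVD has no doctest in this reference; the following is an illustrative sketch (not part of the original documentation) reusing the same toy data as the PLSRegression example:
>>> import numpy as np
>>> from sklearn.pls import PLSSVD
>>> X = np.array([[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]])
>>> Y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
>>> pls_svd = PLSSVD(n_components=2).fit(X, Y)   # one SVD of the cross-covariance matrix
>>> X_scores, Y_scores = pls_svd.transform(X, Y)
>>> X_scores.shape
(4, 2)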
Attributes
x_weights_ array, [p, n_components] X block weights vectors.
y_weights_ array, [q, n_components] Y block weights vectors.
x_scores_ array, [n_samples, n_components] X scores.
y_scores_ array, [n_samples, n_components] Y scores.
Methods
fit(X, Y)
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, Y]) Apply the dimension reduction learned on the train data.
__init__(n_components=2, scale=True, copy=True)
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The former have
parameters of the form <component>__<parameter> so that its possible to update each component
of a nested object.
Returns self :
transform(X, Y=None)
Apply the dimension reduction learned on the train data.
1.8.23 sklearn.pipeline: Pipeline
The sklearn.pipeline module implements utilities to build a composite estimator, as a chain of transforms and estimators.
pipeline.Pipeline(steps) Pipeline of transforms with a final estimator.
sklearn.pipeline.Pipeline
class sklearn.pipeline.Pipeline(steps)
Pipeline of transforms with a final estimator.
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be transforms, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting differ-
ent parameters. For this, it enables setting parameters of the various steps using their names and the parameter
name separated by a __, as in the example below.
Parameters steps: list :
List of (name, transform) tuples (implementing fit/transform) that are chained, in the
order in which they are chained, with the last object an estimator.
Examples
>>> from sklearn import svm
>>> from sklearn.datasets import samples_generator
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import f_regression
>>> from sklearn.pipeline import Pipeline
>>> # generate some data to play with
>>> X, y = samples_generator.make_classification(
... n_informative=5, n_redundant=0, random_state=42)
>>> # ANOVA SVM-C
>>> anova_filter = SelectKBest(f_regression, k=5)
>>> clf = svm.SVC(kernel='linear')
>>> anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
>>> # You can set the parameters using the names issued
>>> # For instance, fit using a k of 10 in the SelectKBest
>>> # and a parameter C of the svm
>>> anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
...
Pipeline(steps=[...])
>>> prediction = anova_svm.predict(X)
>>> anova_svm.score(X, y)
0.75
Attributes
steps list of (name, object) List of the named objects that compose the pipeline, in the order in which they are applied to the data.
Methods
decision_function(X) Applies transforms to the data, and the decision_function method of the final estimator.
fit(X[, y]) Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_transform(X[, y]) Fit all the transforms one after the other and transform the data, then use fit_transform on transformed data using the final estimator.
get_params([deep])
inverse_transform(X)
predict(X) Applies transforms to the data, and the predict method of the final estimator.
predict_log_proba(X)
predict_proba(X) Applies transforms to the data, and the predict_proba method of the final estimator.
score(X[, y]) Applies transforms to the data, and the score method of the final estimator.
set_params(**params) Set the parameters of the estimator.
transform(X) Applies transforms to the data, and the transform method of the final estimator.
__init__(steps)
decision_function(X)
Applies transforms to the data, and the decision_function method of the final estimator. Valid only if the final estimator implements decision_function.
fit(X, y=None, **fit_params)
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
fit_transform(X, y=None, **fit_params)
Fit all the transforms one after the other and transform the data, then use fit_transform on transformed data using the final estimator. Valid only if the final estimator implements fit_transform.
predict(X)
Applies transforms to the data, and the predict method of the final estimator. Valid only if the final estimator implements predict.
predict_proba(X)
Applies transforms to the data, and the predict_proba method of the final estimator. Valid only if the final estimator implements predict_proba.
score(X, y=None)
Applies transforms to the data, and the score method of the final estimator. Valid only if the final estimator implements score.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X)
Applies transforms to the data, and the transform method of the final estimator. Valid only if the final estimator implements transform.
1.8.24 sklearn.preprocessing: Preprocessing and Normalization
User guide: See the Preprocessing data section for further details.
preprocessing.Scaler([copy, with_mean, with_std]) Standardize features by removing the mean and scaling to unit variance
preprocessing.Normalizer([norm, copy]) Normalize samples individually to unit norm
preprocessing.Binarizer([threshold, copy]) Binarize data (set feature values to 0 or 1) according to a threshold
preprocessing.LabelBinarizer([neg_label, ...]) Binarize labels in a one-vs-all fashion
preprocessing.KernelCenterer Center a kernel matrix
sklearn.preprocessing.Scaler
class sklearn.preprocessing.Scaler(copy=True, with_mean=True, with_std=True)
Standardize features by removing the mean and scaling to unit variance
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
Parameters with_mean : boolean, True by default
If True, center the data before scaling.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
copy : boolean, optional, default is True
set to False to perform inplace row normalization and avoid a copy (if the input is
already a numpy array or a scipy.sparse CSR matrix and if axis is 1).
See Also:
sklearn.preprocessing.scale, sklearn.decomposition.RandomizedPCA
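Examples
Scaler has no doctest in this reference; a minimal illustrative sketch (not from the original documentation) of the fit/transform workflow on a small dense array:
>>> import numpy as np
>>> from sklearn.preprocessing import Scaler
>>> X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> scaler = Scaler().fit(X_train)               # stores mean_ and std_ per feature
>>> X_scaled = scaler.transform(X_train)         # zero mean, unit variance per feature
>>> np.allclose(X_scaled.mean(axis=0), 0.), np.allclose(X_scaled.std(axis=0), 1.)
(True, True)
>>> scaler.transform(np.array([[-1., 1., 0.]]))  # reuse the training statistics on new data
array([[-2.44...,  1.22..., -0.26...]])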
Attributes
mean_ array of floats with shape [n_features] The mean value for each feature in the training set.
std_ array of floats with shape [n_features] The standard deviation for each feature in the training set.
Methods
fit(X[, y]) Compute the mean and std to be used for later scaling
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
inverse_transform(X[, copy]) Scale back the data to the original representation
set_params(**params) Set the parameters of the estimator.
transform(X[, y, copy]) Perform standardization by centering and scaling
__init__(copy=True, with_mean=True, with_std=True)
fit(X, y=None)
Compute the mean and std to be used for later scaling
Parameters X : array-like or CSR matrix with shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along
the features axis.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(X, copy=None)
Scale back the data to the original representation
Parameters X : array-like with shape [n_samples, n_features]
The data used to scale along the features axis.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, y=None, copy=None)
Perform standardization by centering and scaling
Parameters X : array-like with shape [n_samples, n_features]
The data used to scale along the features axis.
sklearn.preprocessing.Normalizer
class sklearn.preprocessing.Normalizer(norm='l2', copy=True)
Normalize samples individually to unit norm
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently
of other samples so that its norm (l1 or l2) equals one.
This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you
want to avoid the burden of a copy / conversion).
Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
Parameters norm : 'l1' or 'l2', optional ('l2' by default)
The norm to use to normalize each non zero sample.
copy : boolean, optional, default is True
set to False to perform inplace row normalization and avoid a copy (if the input is
already a numpy array or a scipy.sparse CSR matrix).
See Also:
sklearn.preprocessing.normalize
Notes
This estimator is stateless (besides constructor parameters); the fit method does nothing but is useful when used in a pipeline.
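Examples
An illustrative sketch (not part of the original reference) showing per-sample l2 normalization; fit is a no-op kept for pipeline compatibility:
>>> import numpy as np
>>> from sklearn.preprocessing import Normalizer
>>> X = np.array([[4., 0., 3.], [1., -1., 2.]])
>>> normalizer = Normalizer(norm='l2').fit(X)         # fit does nothing, returns the estimator
>>> X_normalized = normalizer.transform(X)
>>> np.allclose((X_normalized ** 2).sum(axis=1), 1.)  # every row now has unit l2 norm
True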
Methods
fit(X[, y]) Do nothing and return the estimator unchanged
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y, copy]) Scale each non zero row of X to unit norm
__init__(norm='l2', copy=True)
fit(X, y=None)
Do nothing and return the estimator unchanged
This method is just there to implement the usual API and hence work in pipelines.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, y=None, copy=None)
Scale each non zero row of X to unit norm
Parameters X : array or scipy.sparse matrix with shape [n_samples, n_features]
The data to normalize, row by row. scipy.sparse matrices should be in CSR format to
avoid an un-necessary copy.
sklearn.preprocessing.Binarizer
class sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)
Binarize data (set feature values to 0 or 1) according to a threshold
The default threshold is 0.0, so that values strictly greater than zero are set to 1.0 and zero or negative values are set to 0.0.
Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.
It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modeled
using the Bernoulli distribution in a Bayesian setting).
Parameters threshold : float, optional (0.0 by default)
The lower bound that triggers feature values to be replaced by 1.0.
copy : boolean, optional, default is True
set to False to perform inplace binarization and avoid a copy (if the input is already a
numpy array or a scipy.sparse CSR matrix).
Notes
If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.
This estimator is stateless (besides constructor parameters); the fit method does nothing but is useful when used in a pipeline.
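Examples
An illustrative sketch (not part of the original reference); values strictly greater than the threshold become 1.0, the rest 0.0:
>>> import numpy as np
>>> from sklearn.preprocessing import Binarizer
>>> X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> binarizer = Binarizer(threshold=0.0).fit(X)   # fit does nothing, returns the estimator
>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])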
Methods
fit(X[, y]) Do nothing and return the estimator unchanged
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(X[, y, copy]) Binarize each element of X
__init__(threshold=0.0, copy=True)
fit(X, y=None)
Do nothing and return the estimator unchanged
This method is just there to implement the usual API and hence work in pipelines.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, y=None, copy=None)
Binarize each element of X
Parameters X : array or scipy.sparse matrix with shape [n_samples, n_features]
The data to binarize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.
sklearn.preprocessing.LabelBinarizer
class sklearn.preprocessing.LabelBinarizer(neg_label=0, pos_label=1)
Binarize labels in a one-vs-all fashion
Several regression and binary classification algorithms are available in the scikit. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.
At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels to binary labels (belong or does not belong to the class). LabelBinarizer makes this process easy with the transform method.
At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform method.
Parameters neg_label: int (default: 0) :
Value with which negative labels must be encoded.
pos_label: int (default: 1) :
Value with which positive labels must be encoded.
Examples
>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
>>> lb.fit_transform([(1, 2), (3,)])
array([[ 1., 1., 0.],
[ 0., 0., 1.]])
>>> lb.classes_
array([1, 2, 3])
Attributes
classes_: array of shape [n_class] Holds the label for each class.
Methods
fit(y) Fit label binarizer
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
inverse_transform(Y[, threshold]) Transform binary labels back to multi-class labels
set_params(**params) Set the parameters of the estimator.
transform(y) Transform multi-class labels to binary labels
__init__(neg_label=0, pos_label=1)
fit(y)
Fit label binarizer
Parameters y : numpy array of shape [n_samples] or sequence of sequences
Target values. In the multilabel case the nested sequences can have variable lengths.
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
inverse_transform(Y, threshold=None)
Transform binary labels back to multi-class labels
Parameters Y : numpy array of shape [n_samples, n_classes]
Target values.
threshold : float or None
Threshold used in the binary and multi-label cases.
Use 0 when:
Y contains the output of decision_function (classifier)
Use 0.5 when:
Y contains the output of predict_proba
If None, the threshold is assumed to be half way between neg_label and pos_label.
Returns y : numpy array of shape [n_samples] or sequence of sequences
Target values. In the multilabel case the nested sequences can have variable lengths.
Notes
In the case when the binary labels are fractional (probabilistic), inverse_transform chooses the class with the greatest value. Typically, this allows using the output of a linear model's decision_function method directly as the input of inverse_transform.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(y)
Transform multi-class labels to binary labels
The output of transform is sometimes referred to by some authors as the 1-of-K coding scheme.
Parameters y : numpy array of shape [n_samples] or sequence of sequences
Target values. In the multilabel case the nested sequences can have variable lengths.
Returns Y : numpy array of shape [n_samples, n_classes]
sklearn.preprocessing.KernelCenterer
class sklearn.preprocessing.KernelCenterer
Center a kernel matrix
This is equivalent to centering phi(X) with sklearn.preprocessing.Scaler(with_std=False).
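Examples
An illustrative sketch (not part of the original reference) using a linear kernel matrix built by hand; centering in feature space makes the kernel's row and column sums vanish:
>>> import numpy as np
>>> from sklearn.preprocessing import KernelCenterer
>>> X = np.array([[1., -2., 2.], [-2., 1., 3.], [4., 1., -2.]])
>>> K = np.dot(X, X.T)                          # linear kernel matrix of X
>>> K_centered = KernelCenterer().fit(K).transform(K)
>>> np.allclose(K_centered.sum(axis=0), 0.)
True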
Methods
fit(K) Fit KernelCenterer
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
set_params(**params) Set the parameters of the estimator.
transform(K[, copy]) Center kernel
__init__()
x.__init__(...) initializes x; see help(type(x)) for signature
fit(K)
Fit KernelCenterer
Parameters K : numpy array of shape [n_samples, n_samples]
Kernel matrix.
Returns self : returns an instance of self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(K, copy=True)
Center kernel
Parameters K : numpy array of shape [n_samples1, n_samples2]
Kernel matrix.
Returns K_new : numpy array of shape [n_samples1, n_samples2]
preprocessing.scale(X[, axis, with_mean, ...]) Standardize a dataset along any axis
preprocessing.normalize(X[, norm, axis, copy]) Normalize a dataset along any axis
preprocessing.binarize(X[, threshold, copy]) Boolean thresholding of array-like or scipy.sparse matrix
sklearn.preprocessing.scale
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Standardize a dataset along any axis
Center to the mean and component-wise scale to unit variance.
Parameters X : array-like or CSR matrix.
The data to center and scale.
axis : int (0 by default)
axis used to compute the means and standard deviations along. If 0, independently
standardize each feature, otherwise (if 1) standardize each sample.
with_mean : boolean, True by default
If True, center the data before scaling.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
copy : boolean, optional, default is True
set to False to perform inplace row normalization and avoid a copy (if the input is
already a numpy array or a scipy.sparse CSR matrix and if axis is 1).
See Also:
sklearn.preprocessing.Scaler, sklearn.pipeline.Pipeline
Notes
This implementation will refuse to center scipy.sparse matrices since it would make them non-sparse and would
potentially crash the program with memory exhaustion problems.
Instead the caller is expected to either set explicitly with_mean=False (in that case, only variance scaling will be performed on the features of the CSR matrix) or to call X.toarray() if he/she expects the materialized dense array to fit in memory.
To avoid memory copy the caller should pass a CSR matrix.
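Examples
An illustrative sketch (not part of the original reference); with the default axis=0 each feature is centered and scaled independently:
>>> import numpy as np
>>> from sklearn.preprocessing import scale
>>> X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> X_scaled = scale(X)
>>> np.allclose(X_scaled.mean(axis=0), 0.), np.allclose(X_scaled.std(axis=0), 1.)
(True, True)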
sklearn.preprocessing.normalize
sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True)
Normalize a dataset along any axis
Parameters X : array or scipy.sparse matrix with shape [n_samples, n_features]
The data to normalize, element by element. scipy.sparse matrices should be in CSR
format to avoid an un-necessary copy.
norm : 'l1' or 'l2', optional ('l2' by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is
0).
axis : 0 or 1, optional (1 by default)
axis used to normalize the data along. If 1, independently normalize each sample, oth-
erwise (if 0) normalize each feature.
copy : boolean, optional, default is True
set to False to perform inplace row normalization and avoid a copy (if the input is
already a numpy array or a scipy.sparse CSR matrix and if axis is 1).
See Also:
sklearn.preprocessing.Normalizer, sklearn.pipeline.Pipeline
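Examples
An illustrative sketch (not part of the original reference) of row-wise normalization with both supported norms:
>>> import numpy as np
>>> from sklearn.preprocessing import normalize
>>> X = np.array([[4., 0., 3.], [1., -1., 2.]])
>>> X_l2 = normalize(X, norm='l2')               # each row rescaled to unit l2 norm
>>> np.allclose((X_l2 ** 2).sum(axis=1), 1.)
True
>>> X_l1 = normalize(X, norm='l1')               # with 'l1', absolute row sums become one
>>> np.allclose(np.abs(X_l1).sum(axis=1), 1.)
True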
sklearn.preprocessing.binarize
sklearn.preprocessing.binarize(X, threshold=0.0, copy=True)
Boolean thresholding of array-like or scipy.sparse matrix
Parameters X : array or scipy.sparse matrix with shape [n_samples, n_features]
The data to binarize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.
threshold : float, optional (0.0 by default)
The lower bound that triggers feature values to be replaced by 1.0.
copy : boolean, optional, default is True
set to False to perform inplace binarization and avoid a copy (if the input is already a
numpy array or a scipy.sparse CSR matrix and if axis is 1).
See Also:
sklearn.preprocessing.Binarizer, sklearn.pipeline.Pipeline
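Examples
An illustrative sketch (not part of the original reference); raising the threshold keeps only the values strictly above it:
>>> import numpy as np
>>> from sklearn.preprocessing import binarize
>>> X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> binarize(X, threshold=1.5)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])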
1.8.25 sklearn.qda: Quadratic Discriminant Analysis
Quadratic Discriminant Analysis
User guide: See the Linear and Quadratic Discriminant Analysis section for further details.
qda.QDA([priors]) Quadratic Discriminant Analysis (QDA)
sklearn.qda.QDA
class sklearn.qda.QDA(priors=None)
Quadratic Discriminant Analysis (QDA)
A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.
The model fits a Gaussian density to each class.
Parameters priors : array, optional, shape = [n_classes]
Priors on classes
See Also:
sklearn.lda.LDA : Linear discriminant analysis
Examples
>>> from sklearn.qda import QDA
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = QDA()
>>> clf.fit(X, y)
QDA(priors=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]
Attributes
means_ array-like, shape = [n_classes, n_features] Class means
priors_ array-like, shape = [n_classes] Class priors (sum to 1)
covariances_ list of array-like, shape = [n_features, n_features] Covariance matrices of each class
Methods
decision_function(X) Apply decision function to an array of samples.
fit(X, y[, store_covariances, tol]) Fit the QDA model according to the given training data and parameters.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification on an array of test vectors X.
predict_log_proba(X) Return posterior probabilities of classification.
predict_proba(X) Return posterior probabilities of classification.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(priors=None)
decision_function(X)
Apply decision function to an array of samples.
Parameters X : array-like, shape = [n_samples, n_features]
Array of samples (test vectors).
Returns C : array, shape = [n_samples, n_classes]
Decision function values related to each class, per sample.
fit(X, y, store_covariances=False, tol=0.0001)
Fit the QDA model according to the given training data and parameters.
Parameters X : array-like, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array, shape = [n_samples]
Target values (integers)
store_covariances : boolean
If True, the covariance matrices are computed and stored in the self.covariances_ attribute.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification on an array of test vectors X.
The predicted class C for each sample in X is returned.
Parameters X : array-like, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Return posterior probabilities of classification.
Parameters X : array-like, shape = [n_samples, n_features]
Array of samples/test vectors.
Returns C : array, shape = [n_samples, n_classes]
Posterior log-probabilities of classification per class.
predict_proba(X)
Return posterior probabilities of classification.
Parameters X : array-like, shape = [n_samples, n_features]
Array of samples/test vectors.
Returns C : array, shape = [n_samples, n_classes]
Posterior probabilities of classification per class.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
1.8.26 sklearn.svm: Support Vector Machines
The sklearn.svm module includes Support Vector Machine algorithms.
User guide: See the Support Vector Machines section for further details.
Estimators
svm.SVC([C, kernel, degree, gamma, coef0, ...]) C-Support Vector Classification.
svm.LinearSVC([penalty, loss, dual, tol, C, ...]) Linear Support Vector Classification.
svm.NuSVC([nu, kernel, degree, gamma, ...]) Nu-Support Vector Classification.
svm.SVR([kernel, degree, gamma, coef0, tol, ...]) epsilon-Support Vector Regression.
svm.NuSVR([nu, C, kernel, degree, gamma, ...]) Nu Support Vector Regression.
svm.OneClassSVM([kernel, degree, gamma, ...]) Unsupervised Outliers Detection.
sklearn.svm.SVC
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False)
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to datasets with more than a couple of 10000 samples.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.
Parameters C : float or None, optional (default=None)
Penalty parameter C of the error term. If None then C is set to n_samples.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly',
'rbf', 'sigmoid' or 'precomputed'. If none is given, 'rbf' will be used.
degree : int, optional (default=3)
Degree of kernel function. It is significant only in 'poly' and 'sigmoid'.
gamma : float, optional (default=0.0)
Kernel coefficient for 'rbf' and 'poly'. If gamma is 0.0 then 1/n_features will be used
instead.
coef0 : float, optional (default=0.0)
Independent term in kernel function. It is only significant in 'poly' and 'sigmoid'.
probability: boolean, optional (default=False) :
Whether to enable probability estimates. This must be enabled prior to calling predict_proba.
shrinking: boolean, optional (default=True) :
Whether to use the shrinking heuristic.
tol: float, optional (default=1e-3) :
Tolerance for stopping criterion.
cache_size: float, optional :
Specify the size of the kernel cache (in MB)
class_weight : {dict, 'auto'}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'auto' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in libsvm that, if enabled, may not work properly in a multithreaded context.
See Also:
SVR : Support Vector Machine for Regression implemented using libsvm.
LinearSVC : Scalable Linear Support Vector Machine for classification implemented using liblinear. Check the See Also section of LinearSVC for more comparison elements.
Examples
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, shrinking=True,
tol=0.001, verbose=False)
>>> print(clf.predict([[-0.8, -1]]))
[ 1.]
Attributes
support_ array-like, shape = [n_SV] Index of support vectors.
support_vectors_ array-like, shape = [n_SV, n_features] Support vectors.
n_support_ array-like, dtype=int32, shape = [n_class] Number of support vectors for each class.
dual_coef_ array, shape = [n_class-1, n_SV] Coefficients of the support vector in the decision function. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the section about multi-class classification in the SVM section of the User Guide for details.
coef_ array, shape = [n_class-1, n_features] Weights assigned to the features (coefficients in the primal problem). This is only available in the case of linear kernel. coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ array, shape = [n_class * (n_class-1) / 2] Constants in decision function.
Methods
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X, y[, class_weight, sample_weight]) Fit the SVM model according to the given training data.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification or regression on samples in X.
predict_log_proba(X) Compute the log likelihoods of each possible outcome for samples in X.
predict_proba(X) Compute the likelihoods of each possible outcome for samples in X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False)
decision_function(X)
Distance of the samples X to the separating hyperplane.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_class * (n_class-1) / 2]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None, sample_weight=None)
Fit the SVM model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression)
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : object
Returns self.
Notes
If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse matrices as input.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification or regression on samples in X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value for each sample in X is returned.
For a one-class model, +1 or -1 is returned.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Compute the log likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those obtained by predict. Also, it will produce meaningless results on very small datasets.
predict_proba(X)
Compute the likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those
obtained by predict. Also, it will produce meaningless results on very small datasets.
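As an illustration, a short sketch (assuming the iris dataset, not part of the original docstring) of enabling probability estimates; since the exact values depend on the internal cross validation, only the output shape is shown.
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC(probability=True).fit(iris.data, iris.target)
>>> clf.predict_proba(iris.data[:3]).shape    # one column per class, in arithmetical order
(3, 3)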
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
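As a sketch of the nested form (assuming sklearn.pipeline.Pipeline and a hypothetical step name 'svc'), svc__C reaches the C parameter of the SVC inside the pipeline:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> pipe = Pipeline([('svc', SVC())])
>>> pipe = pipe.set_params(svc__C=10.0)
>>> pipe.get_params()['svc__C']
10.0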
sklearn.svm.LinearSVC
class sklearn.svm.LinearSVC(penalty='l2', loss='l2', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0)
Linear Support Vector Classication.
Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better (to large numbers of samples).
This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.
Parameters C : float or None, optional (default=None)
Penalty parameter C of the error term. If None then C is set to n_samples.
loss : string, 'l1' or 'l2' (default='l2')
Specifies the loss function. 'l1' is the hinge loss (standard SVM) while 'l2' is the squared hinge loss.
penalty : string, 'l1' or 'l2' (default='l2')
Specifies the norm used in the penalization. The 'l2' penalty is the standard used in SVC. The 'l1' leads to coef_ vectors that are sparse.
dual : bool, (default=True)
Select the algorithm to either solve the dual or primal optimization problem. Prefer
dual=False when n_samples > n_features.
tol : float, optional (default=1e-4)
Tolerance for stopping criteria.
multi_class : string, 'ovr' or 'crammer_singer' (default='ovr')
Determines the multi-class strategy if y contains more than two classes. 'ovr' trains n_classes one-vs-rest classifiers, while 'crammer_singer' optimizes a joint objective over all classes. While 'crammer_singer' is interesting from a theoretical perspective as it is consistent, it is seldom used in practice, rarely leads to better accuracy and is more expensive to compute. If 'crammer_singer' is chosen, the options loss, penalty and dual will be ignored.
fit_intercept : boolean, optional (default=True)
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
intercept_scaling : float, optional (default=1)
When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a synthetic feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
class_weight : {dict, 'auto'}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'auto' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
verbose : int, default: 0
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in liblinear that, if enabled, may not work properly in a multithreaded context.
See Also:
SVC : Implementation of Support Vector Machine classifier using libsvm: the kernel can be non-linear but its SMO algorithm does not scale to large numbers of samples as LinearSVC does. Furthermore SVC multi-class mode is implemented using the one-vs-one scheme while LinearSVC uses one-vs-the-rest. It is possible to implement one-vs-the-rest with SVC by using the sklearn.multiclass.OneVsRestClassifier wrapper. Finally SVC can fit dense data without memory copy if the input is C-contiguous. Sparse data will still incur a memory copy though.
sklearn.linear_model.SGDClassifier : SGDClassifier can optimize the same cost function as LinearSVC by adjusting the penalty and loss parameters. Furthermore SGDClassifier is scalable to large numbers of samples as it uses a Stochastic Gradient Descent optimizer. Finally SGDClassifier can fit both dense and sparse data without memory copy if the input is C-contiguous or CSR.
Notes
The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
The underlying implementation (liblinear) uses a sparse internal representation for the data that will incur a memory copy.
References: LIBLINEAR: A Library for Large Linear Classification
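A minimal usage sketch on the same toy data as the SVC example above (the fitted estimator's repr is omitted by assigning the result):
>>> import numpy as np
>>> from sklearn.svm import LinearSVC
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> clf = LinearSVC(C=1.0).fit(X, y)
>>> int(clf.predict([[-0.8, -1]])[0])
1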
Attributes
coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
Constants in decision function.
Methods
decision_function(X) Decision function value for X according to the trained model.
fit(X, y[, class_weight]) Fit the model according to the given training data.
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict target values of X according to the fitted model.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(penalty='l2', loss='l2', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0)
decision_function(X)
Decision function value for X according to the trained model.
Parameters X : array-like, shape = [n_samples, n_features]
Returns T : array-like, shape = [n_samples, n_class]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None)
Fit the model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target vector relative to X
class_weight : {dict, auto}, optional
Weights associated with classes. If not given, all classes are supposed to have weight
one.
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict target values of X according to the fitted model.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
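As a sketch (assuming the iris dataset), an l1-penalized LinearSVC yields sparse coef_, and transform then keeps only the features whose importance reaches the threshold (here the mean, the default); the number of retained columns depends on the fit.
>>> from sklearn import datasets
>>> from sklearn.svm import LinearSVC
>>> iris = datasets.load_iris()
>>> clf = LinearSVC(C=0.1, penalty='l1', dual=False).fit(iris.data, iris.target)
>>> X_reduced = clf.transform(iris.data)
>>> X_reduced.shape[0]
150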
sklearn.svm.NuSVC
class sklearn.svm.NuSVC(nu=0.5, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, verbose=False)
Nu-Support Vector Classication.
Similar to SVC but uses a parameter to control the number of support vectors.
The implementation is based on libsvm.
Parameters nu : float, optional (default=0.5)
An upper bound on the fraction of training errors and a lower bound of the fraction of
support vectors. Should be in the interval (0, 1].
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm. One of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. If none is given, 'rbf' will be used.
degree : int, optional (default=3)
Degree of the kernel function; significant only in 'poly', 'rbf', 'sigmoid'.
gamma : float, optional (default=0.0)
Kernel coefficient for 'rbf' and 'poly'. If gamma is 0.0 then 1/n_features will be taken.
coef0 : float, optional (default=0.0)
Independent term in the kernel function. It is only significant in 'poly'/'sigmoid'.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling predict_proba.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
class_weight : {dict, 'auto'}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'auto' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in libsvm that, if enabled, may not work properly in a multithreaded context.
See Also:
SVC : Support Vector Machine for classification using libsvm.
LinearSVC : Scalable linear Support Vector Machine for classification using liblinear.
Examples
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import NuSVC
>>> clf = NuSVC()
>>> clf.fit(X, y)
NuSVC(cache_size=200, coef0=0.0, degree=3, gamma=0.5, kernel='rbf', nu=0.5,
      probability=False, shrinking=True, tol=0.001, verbose=False)
>>> print(clf.predict([[-0.8, -1]]))
[ 1.]
Attributes
support_ : array-like, shape = [n_SV]
Index of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
n_support_ : array-like, dtype=int32, shape = [n_class]
Number of support vectors for each class.
dual_coef_ : array, shape = [n_class-1, n_SV]
Coefficients of the support vectors in the decision function. For multiclass, coefficients for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the section about multi-class classification in the SVM section of the User Guide for details.
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
Methods
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X, y[, class_weight, sample_weight]) Fit the SVM model according to the given training data.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification or regression on samples in X.
predict_log_proba(X) Compute the log likelihoods of each possible outcome for samples in X.
predict_proba(X) Compute the likelihoods of each possible outcome for samples in X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
__init__(nu=0.5, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, verbose=False)
decision_function(X)
Distance of the samples X to the separating hyperplane.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_class * (n_class-1) / 2]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None, sample_weight=None)
Fit the SVM model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression)
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : object
Returns self.
Notes
If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse matrices as input.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification or regression on samples in X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value for each sample in X is returned.
For a one-class model, +1 or -1 is returned.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Compute the log likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those obtained by predict. Also, it will produce meaningless results on very small datasets.
predict_proba(X)
Compute the likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those
obtained by predict. Also, it will produce meaningless results on very small datasets.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.svm.SVR
class sklearn.svm.SVR(kernel='rbf', degree=3, gamma=0.0, coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, probability=False, cache_size=200, verbose=False)
epsilon-Support Vector Regression.
The free parameters in the model are C and epsilon.
The implementation is based on libsvm.
Parameters C : float or None, optional (default=None)
Penalty parameter C of the error term. If None then C is set to n_samples.
epsilon : float, optional (default=0.1)
Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm. One of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. If none is given, 'rbf' will be used.
degree : int, optional (default=3)
Degree of the kernel function; significant only in 'poly', 'rbf', 'sigmoid'.
gamma : float, optional (default=0.0)
Kernel coefficient for 'rbf' and 'poly'. If gamma is 0.0 then 1/n_features will be taken.
coef0 : float, optional (default=0.0)
Independent term in the kernel function. It is only significant in 'poly'/'sigmoid'.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling predict_proba.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in libsvm that, if enabled, may not work properly in a multithreaded context.
See Also:
NuSVR : Support Vector Machine for regression implemented using libsvm using a parameter to control the number of support vectors.
Examples
>>> from sklearn.svm import SVR
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = SVR(C=1.0, epsilon=0.2)
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma=0.2,
    kernel='rbf', probability=False, shrinking=True, tol=0.001,
    verbose=False)
Attributes
support_ : array-like, shape = [n_SV]
Index of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
dual_coef_ : array, shape = [n_classes-1, n_SV]
Coefficients of the support vectors in the decision function.
coef_ : array, shape = [n_classes-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
Methods
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X, y[, class_weight, sample_weight]) Fit the SVM model according to the given training data.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification or regression on samples in X.
predict_log_proba(X) Compute the log likelihoods of each possible outcome for samples in X.
predict_proba(X) Compute the likelihoods of each possible outcome for samples in X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(kernel='rbf', degree=3, gamma=0.0, coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, probability=False, cache_size=200, verbose=False)
decision_function(X)
Distance of the samples X to the separating hyperplane.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_class * (n_class-1) / 2]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None, sample_weight=None)
Fit the SVM model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression)
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : object
Returns self.
Notes
If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse matrices as input.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification or regression on samples in X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value for each sample in X is returned.
For a one-class model, +1 or -1 is returned.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Compute the log likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those obtained by predict. Also, it will produce meaningless results on very small datasets.
predict_proba(X)
Compute the likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those
obtained by predict. Also, it will produce meaningless results on very small datasets.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
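For illustration, a sketch reusing clf, X and y from the Examples block above to check score against this definition; only the agreement is shown, since the exact value depends on the data.
>>> y_pred = clf.predict(X)
>>> u = ((y - y_pred) ** 2).sum()
>>> v = ((y - y.mean()) ** 2).sum()
>>> abs(clf.score(X, y) - (1 - u / v)) < 1e-10
True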
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.svm.NuSVR
class sklearn.svm.NuSVR(nu=0.5, C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, verbose=False)
Nu Support Vector Regression.
Similar to NuSVC, for regression, uses a parameter nu to control the number of support vectors. However, unlike NuSVC, where nu replaces C, here nu replaces the parameter epsilon of SVR.
The implementation is based on libsvm.
Parameters C : float or None, optional (default=None)
Penalty parameter C of the error term. If None then C is set to n_samples.
nu : float, optional
An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken. Only available if impl='nu_svc'.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm. One of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. If none is given, 'rbf' will be used.
degree : int, optional (default=3)
Degree of the kernel function; significant only in 'poly', 'rbf', 'sigmoid'.
gamma : float, optional (default=0.0)
Kernel coefficient for 'rbf' and 'poly'. If gamma is 0.0 then 1/n_features will be taken.
coef0 : float, optional (default=0.0)
Independent term in the kernel function. It is only significant in 'poly'/'sigmoid'.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling predict_proba.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in libsvm that, if enabled, may not work properly in a multithreaded context.
See Also:
NuSVC : Support Vector Machine for classification implemented with libsvm with a parameter to control the number of support vectors.
SVR : epsilon Support Vector Machine for regression implemented with libsvm.
Examples
>>> from sklearn.svm import NuSVR
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = NuSVR(C=1.0, nu=0.1)
>>> clf.fit(X, y)
NuSVR(C=1.0, cache_size=200, coef0=0.0, degree=3, gamma=0.2, kernel='rbf',
      nu=0.1, probability=False, shrinking=True, tol=0.001, verbose=False)
Attributes
support_ : array-like, shape = [n_SV]
Index of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
dual_coef_ : array, shape = [n_classes-1, n_SV]
Coefficients of the support vectors in the decision function.
coef_ : array, shape = [n_classes-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
Methods
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X, y[, class_weight, sample_weight]) Fit the SVM model according to the given training data.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification or regression on samples in X.
predict_log_proba(X) Compute the log likelihoods of each possible outcome for samples in X.
predict_proba(X) Compute the likelihoods of each possible outcome for samples in X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
__init__(nu=0.5, C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, verbose=False)
decision_function(X)
Distance of the samples X to the separating hyperplane.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_class * (n_class-1) / 2]
Returns the decision function of the sample for each class in the model.
fit(X, y, class_weight=None, sample_weight=None)
Fit the SVM model according to the given training data.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the num-
ber of features.
y : array-like, shape = [n_samples]
Target values (integers in classification, real numbers in regression)
sample_weight : array-like, shape = [n_samples], optional
Weights applied to individual samples (1. for unweighted).
Returns self : object
Returns self.
Notes
If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse matrices as input.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification or regression on samples in X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value for each sample in X is returned.
For a one-class model, +1 or -1 is returned.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Compute the log likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those obtained by predict. Also, it will produce meaningless results on very small datasets.
predict_proba(X)
Compute the likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those
obtained by predict. Also, it will produce meaningless results on very small datasets.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
sklearn.svm.OneClassSVM
class sklearn.svm.OneClassSVM(kernel='rbf', degree=3, gamma=0.0, coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False)
Unsupervised Outliers Detection.
Estimate the support of a high-dimensional distribution.
The implementation is based on libsvm.
Parameters kernel : string, optional
Specifies the kernel type to be used in the algorithm. Can be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. If none is given, 'rbf' will be used.
nu : float, optional
An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
degree : int, optional
Degree of the kernel function. Significant only in 'poly', 'rbf', 'sigmoid'.
gamma : float, optional (default=0.0)
Kernel coefficient for 'rbf' and 'poly'. If gamma is 0.0 then 1/n_features will be taken.
coef0 : float, optional
Independent term in the kernel function. It is only significant in 'poly'/'sigmoid'.
tol : float, optional
Tolerance for stopping criterion.
shrinking : boolean, optional
Whether to use the shrinking heuristic.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a per-process runtime
setting in libsvm that, if enabled, may not work properly in a multithreaded context.
Attributes
support_ : array-like, shape = [n_SV]
Index of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
dual_coef_ : array, shape = [n_classes-1, n_SV]
Coefficients of the support vectors in the decision function.
coef_ : array, shape = [n_classes-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_classes-1]
Constants in decision function.
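A minimal usage sketch on synthetic data (the parameter values are illustrative); predict returns +1 for points inside the learned support and -1 for outliers.
>>> import numpy as np
>>> from sklearn.svm import OneClassSVM
>>> np.random.seed(0)
>>> X = 0.3 * np.random.randn(100, 2)                    # training points clustered at the origin
>>> clf = OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1).fit(X)
>>> labels = clf.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
>>> int(labels[0]), int(labels[1])                       # inlier near the cluster, outlier far away
(1, -1)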
Methods
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X[, sample_weight]) Detects the soft boundary of the set of samples X.
get_params([deep]) Get parameters for the estimator
predict(X) Perform classification or regression on samples in X.
predict_log_proba(X) Compute the log likelihoods of each possible outcome for samples in X.
predict_proba(X) Compute the likelihoods of each possible outcome for samples in X.
set_params(**params) Set the parameters of the estimator.
__init__(kernel='rbf', degree=3, gamma=0.0, coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False)
decision_function(X)
Distance of the samples X to the separating hyperplane.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_class * (n_class-1) / 2]
Returns the decision function of the sample for each class in the model.
fit(X, sample_weight=None, **params)
Detects the soft boundary of the set of samples X.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Set of samples, where n_samples is the number of samples and n_features is the number
of features.
Returns self : object
Returns self.
Notes
If X is not a C-ordered contiguous array it is copied.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Perform classification or regression on samples in X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value for each sample in X is returned.
For a one-class model, +1 or -1 is returned.
Parameters X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Returns C : array, shape = [n_samples]
predict_log_proba(X)
Compute the log likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the log-probabilities of the sample for each class in the model, where classes
are ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those obtained by predict. Also, it will produce meaningless results on very small datasets.
predict_proba(X)
Compute the likelihoods of each possible outcome for samples in X.
The model needs to have probability information computed at training time: fit with attribute probability set to True.
Parameters X : array-like, shape = [n_samples, n_features]
Returns X : array-like, shape = [n_samples, n_classes]
Returns the probability of the sample for each class in the model, where classes are
ordered by arithmetical order.
Notes
The probability model is created using cross validation, so the results can be slightly different than those
obtained by predict. Also, it will produce meaningless results on very small datasets.
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
svm.l1_min_c(X, y[, loss, fit_intercept, ...]) Return the lowest bound for C such that for C in (l1_min_C, infinity)
sklearn.svm.l1_min_c
sklearn.svm.l1_min_c(X, y, loss='l2', fit_intercept=True, intercept_scaling=1.0)
Return the lowest bound for C such that for C in (l1_min_C, infinity) the model is guaranteed not to be empty. This applies to l1 penalized classifiers, such as LinearSVC with penalty='l1' and linear_model.LogisticRegression with penalty='l1'.
This value is valid if the class_weight parameter in fit() is not set.
Parameters X : array-like or sparse matrix, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array, shape = [n_samples]
Target vector relative to X
loss : {'l2', 'log'}, default to 'l2'
Specifies the loss function. With 'l2' it is the l2 loss (a.k.a. squared hinge loss). With 'log' it is the loss of logistic regression models.
fit_intercept : bool, default: True
Specifies if the intercept should be fitted by the model. It must match the fit() method parameter.
intercept_scaling : float, default: 1
When fit_intercept is True, instance vector x becomes [x, intercept_scaling], i.e. a synthetic feature with constant value equal to intercept_scaling is appended to the instance vector. It must match the fit() method parameter.
Returns l1_min_c : float
Minimum value for C.
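A usage sketch (assuming the iris dataset): the returned bound can serve as the lower end of a grid of C values for an l1-penalized model.
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.svm import l1_min_c
>>> iris = datasets.load_iris()
>>> c_min = l1_min_c(iris.data, iris.target, loss='l2')
>>> c_min > 0
True
>>> Cs = c_min * np.logspace(0, 3, 10)    # illustrative grid of C values starting at the bound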
Low-level methods
svm.libsvm.fit Train the model using libsvm (low-level method)
svm.libsvm.decision_function Predict margin (libsvm name for this is predict_values)
svm.libsvm.predict Predict target values of X given a model (low-level method)
svm.libsvm.predict_proba Predict probabilities svm_model stores all parameters needed to predict a given value.
svm.libsvm.cross_validation Binding of the cross-validation routine (low-level routine)
sklearn.svm.libsvm.fit
sklearn.svm.libsvm.fit()
Train the model using libsvm (low-level method)
Parameters X : array-like, dtype=float64, size=[n_samples, n_features]
Y : array, dtype=float64, size=[n_samples]
target vector
svm_type : {0, 1, 2, 3, 4}
Type of SVM: C_SVC, NuSVC, OneClassSVM, EpsilonSVR or NuSVR respectively.
kernel : {'linear', 'rbf', 'poly', 'sigmoid', 'precomputed'}
Kernel to use in the model: linear, polynomial, RBF, sigmoid or precomputed.
degree : int32
Degree of the polynomial kernel (only relevant if kernel is set to polynomial)
gamma : float64
Gamma parameter in RBF kernel (only relevant if kernel is set to RBF)
coef0 : float64
Independent parameter in poly/sigmoid kernel.
tol : float64
Stopping criteria.
C : float64
C parameter in C-Support Vector Classification
nu : float64
cache_size : float64
Returns support : array, shape=[n_support]
index of support vectors
support_vectors : array, shape=[n_support, n_features]
support vectors (equivalent to X[support]). Will return an empty array in the case of
precomputed kernel.
n_class_SV : array
number of support vectors in each class.
sv_coef : array
coefficients of support vectors in decision function.
intercept : array
intercept in decision function
label : labels for different classes (only relevant in classification).
probA, probB : array
probability estimates, empty array for probability=False
sklearn.svm.libsvm.decision_function
sklearn.svm.libsvm.decision_function()
Predict margin (libsvm name for this is predict_values)
We have to reconstruct model and parameters to make sure we stay in sync with the python object.
sklearn.svm.libsvm.predict
sklearn.svm.libsvm.predict()
Predict target values of X given a model (low-level method)
Parameters X : array-like, dtype=float, size=[n_samples, n_features]
svm_type : {0, 1, 2, 3, 4}
Type of SVM: C SVC, nu SVC, one class, epsilon SVR, nu SVR
kernel : {'linear', 'rbf', 'poly', 'sigmoid', 'precomputed'}
Kernel to use in the model: linear, polynomial, RBF, sigmoid or precomputed.
degree : int
Degree of the polynomial kernel (only relevant if kernel is set to polynomial)
gamma : float
Gamma parameter in RBF kernel (only relevant if kernel is set to RBF)
coef0 : float
Independent parameter in poly/sigmoid kernel.
eps : float
Stopping criteria.
C : float
C parameter in C-Support Vector Classification
Returns dec_values : array
predicted values.
TODO: probably there's no point in setting some parameters, like cache_size or weights.
sklearn.svm.libsvm.predict_proba
sklearn.svm.libsvm.predict_proba()
Predict probabilities
svm_model stores all parameters needed to predict a given value.
For speed, all real work is done at the C level in function copy_predict (libsvm_helper.c).
We have to reconstruct model and parameters to make sure we stay in sync with the python object.
See sklearn.svm.predict for a complete list of parameters.
Parameters X : array-like, dtype=float
Y : array
target vector
kernel : {'linear', 'rbf', 'poly', 'sigmoid', 'precomputed'}
Returns dec_values : array
predicted values.
sklearn.svm.libsvm.cross_validation
sklearn.svm.libsvm.cross_validation()
Binding of the cross-validation routine (low-level routine)
Parameters X : array-like, dtype=float, size=[n_samples, n_features]
Y : array, dtype=float, size=[n_samples]
target vector
svm_type : {0, 1, 2, 3, 4}
Type of SVM: C SVC, nu SVC, one class, epsilon SVR, nu SVR
kernel : {'linear', 'rbf', 'poly', 'sigmoid', 'precomputed'}
Kernel to use in the model: linear, polynomial, RBF, sigmoid or precomputed.
degree : int
Degree of the polynomial kernel (only relevant if kernel is set to polynomial)
gamma : float
Gamma parameter in RBF kernel (only relevant if kernel is set to RBF)
coef0 : float
Independent parameter in poly/sigmoid kernel.
tol : float
Stopping criteria.
C : float
C parameter in C-Support Vector Classification
nu : float
cache_size : float
Returns target : array, float
1.8.27 sklearn.tree: Decision Trees
The sklearn.tree module includes decision tree-based models for classification and regression.
User guide: See the Decision Trees section for further details.
tree.DecisionTreeClassifier([criterion, ...]) A decision tree classifier.
tree.DecisionTreeRegressor([criterion, ...]) A tree regressor.
tree.ExtraTreeClassifier([criterion, ...]) An extremely randomized tree classifier.
tree.ExtraTreeRegressor([criterion, ...]) An extremely randomized tree regressor.
sklearn.tree.DecisionTreeClassifier
class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features=None, compute_importances=False, random_state=None)
A decision tree classifier.
Parameters criterion : string, optional (default='gini')
The function to measure the quality of a split. Supported criteria are 'gini' for the Gini impurity and 'entropy' for the information gain.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls below this threshold the mask is recomputed and the input data is packed, which results in data copying. If min_density equals one, the partitions are always represented as copies of the original data. Otherwise, partitions are represented as bit masks (aka sample masks).
max_features : int, string or None, optional (default=None)
The number of features to consider when looking for the best split. If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems. If 'sqrt', then max_features=sqrt(n_features). If 'log2', then max_features=log2(n_features). If None, then max_features=n_features.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
See Also:
DecisionTreeRegressor
References
[R76], [R77], [R78], [R79]
Examples
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
Attributes
tree_ : Tree object
The underlying Tree object.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature). The importance I(f) of a feature f is computed as the (normalized) total reduction of error brought by that feature. It is also known as the Gini importance [R79].
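As a sketch (assuming the iris dataset), fitting with compute_importances=True fills the attribute with one value per feature:
>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> iris = load_iris()
>>> clf = DecisionTreeClassifier(compute_importances=True, random_state=0)
>>> clf = clf.fit(iris.data, iris.target)
>>> clf.feature_importances_.shape    # one importance value per feature
(4,)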
Methods
fit(X, y[, sample_mask, X_argsorted]) Build a decision tree from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class or regression target for X.
predict_log_proba(X) Predict class log-probabilities of the input samples X.
predict_proba(X) Predict class probabilities of the input samples X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features=None, compute_importances=False, random_state=None)
fit(X, y, sample_mask=None, X_argsorted=None)
Build a decision tree from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class or regression target for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes, or the predicted values.
predict_log_proba(X)
Predict class log-probabilities of the input samples X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples, n_classes]
The class log-probabilities of the input samples. Classes are ordered by arithmetical
order.
predict_proba(X)
Predict class probabilities of the input samples X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples, n_classes]
The class probabilities of the input samples. Classes are ordered by arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.tree.DecisionTreeRegressor
class sklearn.tree.DecisionTreeRegressor(criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features=None, compute_importances=False, random_state=None)
A tree regressor.
Parameters criterion : string, optional (default='mse')
The function to measure the quality of a split. The only supported criterion is 'mse' for the mean squared error.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.
min_samples_split : integer, optional (default=1)
The minimum number of samples required to split an internal node.
min_samples_leaf : integer, optional (default=1)
The minimum number of samples required to be at a leaf node.
min_density : float, optional (default=0.1)
This parameter controls a trade-off in an optimization heuristic. It controls the minimum density of the sample_mask (i.e. the fraction of samples in the mask). If the density falls below this threshold the mask is recomputed and the input data is packed, which results in data copying. If min_density equals one, the partitions are always represented as copies of the original data. Otherwise, partitions are represented as bit masks (aka sample masks).
max_features : int, string or None, optional (default=None)
The number of features to consider when looking for the best split. If 'auto', then max_features=sqrt(n_features) on classification tasks and max_features=n_features on regression problems. If 'sqrt', then max_features=sqrt(n_features). If 'log2', then max_features=log2(n_features). If None, then max_features=n_features.
compute_importances : boolean, optional (default=False)
Whether feature importances are computed and stored into the feature_importances_ attribute when calling fit.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState
instance, random_state is the random number generator; If None, the random number
generator is the RandomState instance used by np.random.
See Also:
DecisionTreeClassifier
References
[R80], [R81], [R82], [R83]
Examples
>>> from sklearn.datasets import load_boston
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeRegressor
>>> boston = load_boston()
>>> regressor = DecisionTreeRegressor(random_state=0)
R2 scores (a.k.a. coefficient of determination) over 10-fold CV:
>>> cross_val_score(regressor, boston.data, boston.target, cv=10)
array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
Attributes
tree_ : Tree object
The underlying Tree object.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature). The importance I(f) of a feature f is computed as the (normalized) total reduction of error brought by that feature. It is also known as the Gini importance [R83].
Methods
fit(X, y[, sample_mask, X_argsorted]) Build a decision tree from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class or regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1, min_density=0.1, max_features=None, compute_importances=False, random_state=None)
fit(X, y, sample_mask=None, X_argsorted=None)
Build a decision tree from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class or regression target for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model,
the predicted value based on X is returned.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes, or the predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0; lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
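The sketch below is not part of the original reference; it recomputes R^2 by hand from the formula above
and checks it against score. The toy data and the max_depth setting are illustrative assumptions only.
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = np.array([[0.], [1.], [2.], [3.]])
>>> y = np.array([0., 1., 2., 2.])
>>> reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
>>> y_pred = reg.predict(X)
>>> u = ((y - y_pred) ** 2).sum()     # residual sum of squares
>>> v = ((y - y.mean()) ** 2).sum()   # total sum of squares
>>> abs((1 - u / v) - reg.score(X, y)) < 1e-12
True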
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
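As a minimal sketch (not part of the original text) of the <component>__<parameter> syntax, assume a
Pipeline with step names 'pca' and 'tree'; the chosen parameter values are illustrative only.
>>> from sklearn.decomposition import PCA
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.tree import DecisionTreeRegressor
>>> pipe = Pipeline([('pca', PCA()), ('tree', DecisionTreeRegressor())])
>>> pipe = pipe.set_params(pca__n_components=2, tree__max_depth=3)  # nested parameters
>>> pipe.get_params()['tree__max_depth']
3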
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
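A minimal sketch (not part of the original text) of transform used for feature selection after fitting; the toy
data, the compute_importances flag and the "mean" threshold are illustrative assumptions.
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = np.array([[0., 5.], [1., 5.], [2., 5.], [3., 5.]])  # second feature is constant
>>> y = np.array([0., 1., 2., 3.])
>>> reg = DecisionTreeRegressor(compute_importances=True, random_state=0).fit(X, y)
>>> X_r = reg.transform(X, threshold="mean")  # keep features with importance >= mean importance
>>> X_r.shape
(4, 1)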
sklearn.tree.ExtraTreeClassifier
class sklearn.tree.ExtraTreeClassifier(criterion='gini', max_depth=None,
min_samples_split=1, min_samples_leaf=1,
min_density=0.1, max_features='auto',
compute_importances=False, random_state=None)
An extremely randomized tree classifier.
Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate
the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected
features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally
random decision tree.
Warning: Extra-trees should only be used within ensemble methods.
See Also:
ExtraTreeRegressor, ExtraTreesClassifier, ExtraTreesRegressor
References
[R84]
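Following the warning above, a minimal sketch (not part of the original text) of using extremely randomized
trees through the ExtraTreesClassifier ensemble rather than a single ExtraTreeClassifier; the iris data, the
number of estimators and the random seed are illustrative assumptions.
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> iris = load_iris()
>>> clf = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)
>>> clf.predict(iris.data[:1])  # first sample belongs to class 0
array([0])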
Methods
fit(X, y[, sample_mask, X_argsorted]) Build a decision tree from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class or regression target for X.
predict_log_proba(X) Predict class log-probabilities of the input samples X.
predict_proba(X) Predict class probabilities of the input samples X.
score(X, y) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(criterion='gini', max_depth=None, min_samples_split=1, min_samples_leaf=1,
min_density=0.1, max_features='auto', compute_importances=False,
random_state=None)
fit(X, y, sample_mask=None, X_argsorted=None)
Build a decision tree from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class or regression target for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model,
the predicted value based on X is returned.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes, or the predicted values.
predict_log_proba(X)
Predict class log-probabilities of the input samples X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples, n_classes]
The class log-probabilities of the input samples. Classes are ordered in arithmetical
order.
predict_proba(X)
Predict class probabilities of the input samples X.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns p : array of shape = [n_samples, n_classes]
The class probabilities of the input samples. Classes are ordered in arithmetical order.
score(X, y)
Returns the mean accuracy on the given test data and labels.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Labels for X.
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
sklearn.tree.ExtraTreeRegressor
class sklearn.tree.ExtraTreeRegressor(criterion='mse', max_depth=None,
min_samples_split=1, min_samples_leaf=1,
min_density=0.1, max_features='auto',
compute_importances=False, random_state=None)
An extremely randomized tree regressor.
Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate
the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected
features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally
random decision tree.
Warning: Extra-trees should only be used within ensemble methods.
See Also:
ExtraTreeClassifier : A classifier based on extremely randomized trees
sklearn.ensemble.ExtraTreesClassifier : An ensemble of extra-trees for classification
sklearn.ensemble.ExtraTreesRegressor : An ensemble of extra-trees for regression
References
[R85]
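Similarly, a minimal sketch (not part of the original text) of using extremely randomized trees for regression
through the sklearn.ensemble.ExtraTreesRegressor ensemble; the Boston housing data, the number of
estimators and the random seed are illustrative assumptions.
>>> from sklearn.datasets import load_boston
>>> from sklearn.ensemble import ExtraTreesRegressor
>>> boston = load_boston()
>>> reg = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(boston.data, boston.target)
>>> reg.score(boston.data, boston.target) > 0.9  # training-set R^2; expect a high value
True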
Methods
fit(X, y[, sample_mask, X_argsorted]) Build a decision tree from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it
get_params([deep]) Get parameters for the estimator
predict(X) Predict class or regression target for X.
score(X, y) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of the estimator.
transform(X[, threshold]) Reduce X to its most important features.
__init__(criterion='mse', max_depth=None, min_samples_split=1, min_samples_leaf=1,
min_density=0.1, max_features='auto', compute_importances=False,
random_state=None)
fit(X, y, sample_mask=None, X_argsorted=None)
Build a decision tree from the training set (X, y).
Parameters X : array-like of shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (integers that correspond to classes in classification, real numbers in
regression).
Returns self : object
Returns self.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
Notes
This method just calls fit and transform consecutively, i.e., it is not an optimized implementation of
fit_transform, unlike other transformers such as PCA.
get_params(deep=True)
Get parameters for the estimator
Parameters deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are
estimators.
predict(X)
Predict class or regression target for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model,
the predicted value based on X is returned.
Parameters X : array-like of shape = [n_samples, n_features]
The input samples.
Returns y : array of shape = [n_samples]
The predicted classes, or the predicted values.
score(X, y)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0; lower values are worse.
Parameters X : array-like, shape = [n_samples, n_features]
Training set.
y : array-like, shape = [n_samples]
Returns z : float
set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it's possible to update each component
of a nested object.
Returns self :
transform(X, threshold=None)
Reduce X to its most important features.
Parameters X : array or scipy sparse matrix of shape [n_samples, n_features]
The input samples.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater
or equal are kept while the others are discarded. If "median" (resp. "mean"), then the
threshold value is the median (resp. the mean) of the feature importances. A scaling
factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute
threshold is used. Otherwise, "mean" is used by default.
Returns X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
tree.export_graphviz(decision_tree[, ...]) Export a decision tree in DOT format.
sklearn.tree.export_graphviz
sklearn.tree.export_graphviz(decision_tree, out_file=None, feature_names=None)
Export a decision tree in DOT format.
This function generates a GraphViz representation of the decision tree, which is then written into out_file. Once
exported, graphical renderings can be generated using, for example:
$ dot -Tps tree.dot -o tree.ps (PostScript format)
$ dot -Tpng tree.dot -o tree.png (PNG format)
Parameters decision_tree : decision tree classifier
The decision tree to be exported to graphviz.
out_file : file object or string, optional (default=None)
Handle or name of the output file.
feature_names : list of strings, optional (default=None)
Names of each of the features.
Returns out_file : file object
The file object to which the tree was exported. The user is expected to close() this object
when done with it.
Examples
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> iris = load_iris()
>>> clf = clf.fit(iris.data, iris.target)
>>> import tempfile
>>> out_file = tree.export_graphviz(clf, out_file=tempfile.TemporaryFile())
>>> out_file.close()
1.8.28 sklearn.utils: Utilities
The sklearn.utils module includes various utilities.
Developer guide: See the Utilities for Developers page for further details.
utils.check_random_state(seed) Turn seed into a np.random.RandomState instance
utils.resample(*arrays, **options) Resample arrays or sparse matrices in a consistent way
utils.shuffle(*arrays, **options) Shuffle arrays or sparse matrices in a consistent way
sklearn.utils.check_random_state
sklearn.utils.check_random_state(seed)
Turn seed into a np.random.RandomState instance
If seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new Ran-
domState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise
ValueError.
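A minimal sketch (not part of the original text) illustrating the three cases described above:
>>> import numpy as np
>>> from sklearn.utils import check_random_state
>>> rng = check_random_state(0)                  # int seed -> new RandomState
>>> isinstance(rng, np.random.RandomState)
True
>>> check_random_state(rng) is rng               # RandomState instance -> returned as-is
True
>>> isinstance(check_random_state(None), np.random.RandomState)  # None -> np.random singleton
True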
sklearn.utils.resample
sklearn.utils.resample(*arrays, **options)
Resample arrays or sparse matrices in a consistent way
The default strategy implements one step of the bootstrapping procedure.
Parameters *arrays : sequence of arrays or scipy.sparse matrices with same shape[0]
replace : boolean, True by default
Implements resampling with replacement. If False, this will implement (sliced) random
permutations.
n_samples : int, None by default
Number of samples to generate. If left to None this is automatically set to the first
dimension of the arrays.
random_state : int or RandomState instance
Control the shuffling for reproducible behavior.
Returns Sequence of resampled views of the collections. The original arrays are
not impacted.
See Also:
sklearn.cross_validation.Bootstrap, sklearn.utils.shuffle
Examples
It is possible to mix sparse and dense arrays in the same run:
>>> X = [[1., 0.], [2., 1.], [0., 0.]]
>>> y = np.array([0, 1, 2])
>>> from scipy.sparse import coo_matrix
>>> X_sparse = coo_matrix(X)
>>> from sklearn.utils import resample
>>> X, X_sparse, y = resample(X, X_sparse, y, random_state=0)
>>> X
array([[ 1., 0.],
[ 2., 1.],
[ 1., 0.]])
>>> X_sparse
<3x2 sparse matrix of type <... numpy.float64>
with 4 stored elements in Compressed Sparse Row format>
>>> X_sparse.toarray()
array([[ 1., 0.],
[ 2., 1.],
[ 1., 0.]])
>>> y
array([0, 1, 0])
>>> resample(y, n_samples=2, random_state=0)
array([0, 1])
sklearn.utils.shuffle
sklearn.utils.shuffle(*arrays, **options)
Shuffle arrays or sparse matrices in a consistent way
This is a convenience alias to resample(*arrays, replace=False) to do random permutations of the
collections.
Parameters *arrays : sequence of arrays or scipy.sparse matrices with same shape[0]
random_state : int or RandomState instance
Control the shuffling for reproducible behavior.
n_samples : int, None by default
Number of samples to generate. If left to None this is automatically set to the first
dimension of the arrays.
Returns Sequence of shuffled views of the collections. The original arrays are
not impacted.
See Also:
sklearn.utils.resample
Examples
It is possible to mix sparse and dense arrays in the same run:
>>> X = [[1., 0.], [2., 1.], [0., 0.]]
>>> y = np.array([0, 1, 2])
>>> from scipy.sparse import coo_matrix
>>> X_sparse = coo_matrix(X)
>>> from sklearn.utils import shuffle
>>> X, X_sparse, y = shuffle(X, X_sparse, y, random_state=0)
>>> X
array([[ 0., 0.],
[ 2., 1.],
[ 1., 0.]])
>>> X_sparse
<3x2 sparse matrix of type <... numpy.float64>
with 3 stored elements in Compressed Sparse Row format>
>>> X_sparse.toarray()
array([[ 0., 0.],
[ 2., 1.],
[ 1., 0.]])
>>> y
array([2, 1, 0])
>>> shuffle(y, n_samples=2, random_state=0)
array([0, 1])
CHAPTER
TWO
EXAMPLE GALLERY
2.1 Examples
2.1.1 General examples
General-purpose and introductory examples for the scikit.
Figure 2.1: Plot classification probability
Plot classification probability
Plot the classification probability for different classifiers. We use a 3-class dataset, and we classify it with a Support
Vector classifier, as well as L1 and L2 penalized logistic regression.
The logistic regression is not a multiclass classifier out of the box. As a result it can identify only the first class.
Script output:
classif_rate for Linear SVC : 82.000000
classif_rate for L1 logistic : 79.333333
classif_rate for L2 logistic : 76.666667
Python source code: plot_classification_probability.py
print __doc__
# Author: Alexandre Gramfort <[email protected]>
# License: BSD Style.
import pylab as pl
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:2] # we only take the first two features for visualization
y = iris.target
n_features = X.shape[1]
C = 1.0
# Create different classifiers. The logistic regression cannot do
# multiclass out of the box.
classifiers = {
'L1 logistic': LogisticRegression(C=C, penalty='l1'),
'L2 logistic': LogisticRegression(C=C, penalty='l2'),
'Linear SVC': SVC(kernel='linear', C=C, probability=True),
}
n_classifiers = len(classifiers)
pl.figure(figsize=(3 * 2, n_classifiers * 2))
pl.subplots_adjust(bottom=.2, top=.95)
for index, (name, classifier) in enumerate(classifiers.iteritems()):
classifier.fit(X, y)
y_pred = classifier.predict(X)
classif_rate = np.mean(y_pred.ravel() == y.ravel()) * 100
print "classif_rate for %s : %f " % (name, classif_rate)
# View probabilities
xx = np.linspace(3, 9, 100)
yy = np.linspace(1, 5, 100).T
xx, yy = np.meshgrid(xx, yy)
Xfull = np.c_[xx.ravel(), yy.ravel()]
probas = classifier.predict_proba(Xfull)
n_classes = np.unique(y_pred).size
for k in range(n_classes):
pl.subplot(n_classifiers, n_classes, index * n_classes + k + 1)
pl.title("Class %d" % k)
if k == 0:
pl.ylabel(name)
imshow_handle = pl.imshow(probas[:, k].reshape((100, 100)),
extent=(3, 9, 1, 5), origin='lower')
pl.xticks(())
pl.yticks(())
idx = (y_pred == k)
if idx.any():
pl.scatter(X[idx, 0], X[idx, 1], marker='o', c='k')
ax = pl.axes([0.15, 0.04, 0.7, 0.05])
pl.title("Probability")
pl.colorbar(imshow_handle, cax=ax, orientation='horizontal')
pl.show()
Figure 2.2: Confusion matrix
Confusion matrix
Example of confusion matrix usage to evaluate the quality of the output of a classifier.
Script output:
[[25 0 0]
[ 0 28 2]
[ 0 1 19]]
Python source code: plot_confusion_matrix.py
print __doc__
import random
import pylab as pl
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
n_samples, n_features = X.shape
p = range(n_samples)
random.seed(0)
random.shuffle(p)
X, y = X[p], y[p]
half = int(n_samples / 2)
# Run classifier
classifier = svm.SVC(kernel='linear')
y_ = classifier.fit(X[:half], y[:half]).predict(X[half:])
# Compute confusion matrix
cm = confusion_matrix(y[half:], y_)
print cm
# Show confusion matrix
pl.matshow(cm)
pl.title('Confusion matrix')
pl.colorbar()
pl.show()
Figure 2.3: Recognizing hand-written digits
Recognizing hand-written digits
An example showing how scikit-learn can be used to recognize images of hand-written digits.
This example is commented in the tutorial section of the user manual.
Script output:
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.001, kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False):
precision recall f1-score support
0 1.00 0.99 0.99 88
1 0.99 0.97 0.98 91
2 0.99 0.99 0.99 86
3 0.98 0.87 0.92 91
4 0.99 0.96 0.97 92
5 0.95 0.97 0.96 91
6 0.99 0.99 0.99 91
7 0.96 0.99 0.97 89
8 0.94 1.00 0.97 88
9 0.93 0.98 0.95 92
avg / total 0.97 0.97 0.97 899
Confusion matrix:
[[87 0 0 0 1 0 0 0 0 0]
[ 0 88 1 0 0 0 0 0 1 1]
[ 0 0 85 1 0 0 0 0 0 0]
[ 0 0 0 79 0 3 0 4 5 0]
[ 0 0 0 0 88 0 0 0 0 4]
[ 0 0 0 0 0 88 1 0 0 2]
[ 0 1 0 0 0 0 90 0 0 0]
[ 0 0 0 0 0 1 0 88 0 0]
[ 0 0 0 0 0 0 0 0 88 0]
[ 0 0 0 1 0 1 0 0 0 90]]
Python source code: plot_digits_classification.py
print __doc__
# Author: Gael Varoquaux <gael dot varoquaux at normalesup dot org>
# License: Simplified BSD
# Standard scientific Python imports
import pylab as pl
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
# The digits dataset
digits = datasets.load_digits()
# The data that we are interested in is made of 8x8 images of digits,
# let's have a look at the first 3 images, stored in the 'images'
# attribute of the dataset. If we were working from image files, we
# could load them using pylab.imread. For these images, we know which
# digit they represent: it is given in the 'target' of the dataset.
for index, (image, label) in enumerate(zip(digits.images, digits.target)[:4]):
pl.subplot(2, 4, index + 1)
pl.axis('off')
pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
pl.title('Training: %i' % label)
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples / 2], digits.target[:n_samples / 2])
# Now predict the value of the digit on the second half:
expected = digits.target[n_samples / 2:]
predicted = classifier.predict(data[n_samples / 2:])
print "Classification report for classifier %s:\n%s\n" % (
classifier, metrics.classification_report(expected, predicted))
print "Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted)
for index, (image, prediction) in enumerate(
zip(digits.images[n_samples / 2:], predicted)[:4]):
pl.subplot(2, 4, index + 5)
pl.axis('off')
pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
pl.title('Prediction: %i' % prediction)
pl.show()
Figure 2.4: Pipelining: chaining a PCA and a logistic regression
Pipelining: chaining a PCA and a logistic regression
The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
We use a GridSearchCV to set the dimensionality of the PCA.
Python source code: plot_digits_pipe.py
print __doc__
# Code source: Gael Varoqueux
# Modified for Documentation merge by Jaques Grobler
# License: BSD
import numpy as np
import pylab as pl
from sklearn import linear_model, decomposition, datasets
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)
pl.figure(1, figsize=(4, 3))
pl.clf()
pl.axes([.2, .2, .7, .7])
pl.plot(pca.explained_variance_, linewidth=2)
pl.axis('tight')
pl.xlabel('n_components')
pl.ylabel('explained_variance_')
###############################################################################
# Prediction
from sklearn.grid_search import GridSearchCV
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
#Parameters of pipelines can be set using __ separated parameter names:
estimator = GridSearchCV(pipe,
dict(pca__n_components=n_components,
logistic__C=Cs))
estimator.fit(X_digits, y_digits)
pl.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
pl.legend(prop=dict(size=12))
pl.show()
Figure 2.5: Univariate Feature Selection
Univariate Feature Selection
An example showing univariate feature selection.
Noisy (non informative) features are added to the iris data and univariate feature selection is applied. For each feature,
we plot the p-values for the univariate feature selection and the corresponding weights of an SVM. We can see that
univariate feature selection selects the informative features and that these have larger SVM weights.
In the total set of features, only the first 4 are significant. We can see that they have the highest score with
univariate feature selection. The SVM assigns small weights to these features, but these weights are non-zero.
Applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features,
and will thus improve classification.
Python source code: plot_feature_selection.py
print __doc__
import numpy as np
import pylab as pl
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
###############################################################################
# import some data to play with
# The IRIS dataset
iris = datasets.load_iris()
# Some noisy data not correlated
E = np.random.normal(size=(len(iris.data), 35))
# Add the noisy data to the informative features
x = np.hstack((iris.data, E))
y = iris.target
###############################################################################
pl.figure(1)
pl.clf()
x_indices = np.arange(x.shape[-1])
###############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function: the 10% most significant features
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(x, y)
scores = -np.log10(selector.scores_)
scores /= scores.max()
pl.bar(x_indices - .45, scores, width=.3,
label=r'Univariate score ($-Log(p_{value})$)',
color='g')
###############################################################################
# Compare to the weights of an SVM
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
pl.bar(x_indices - .15, svm_weights, width=.3, label='SVM weight',
color='r')
pl.title("Comparing feature selection")
pl.xlabel('Feature number')
pl.yticks(())
pl.axis('tight')
pl.legend(loc='upper right')
pl.show()
Figure 2.6: Demonstration of sampling from HMM
Demonstration of sampling from HMM
This script shows how to sample points from a Hidden Markov Model (HMM): we use a 4-component model with
specified means and covariances.
The plot shows the sequence of observations generated, with the transitions between them. We can see that, as specified
by our transition matrix, there are no transitions between components 1 and 3.
Python source code: plot_hmm_sampling.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn import hmm
##############################################################
# Prepare parameters for a 4-component HMM
# Initial population probability
start_prob = np.array([0.6, 0.3, 0.1, 0.0])
# The transition matrix, note that there are no transitions possible
# between components 1 and 3
trans_mat = np.array([[0.7, 0.2, 0.0, 0.1],
[0.3, 0.5, 0.2, 0.0],
[0.0, 0.3, 0.5, 0.2],
[0.2, 0.0, 0.2, 0.6]])
# The means of each component
means = np.array([[0.0, 0.0],
[0.0, 11.0],
[9.0, 10.0],
[11.0, -1.0],
])
# The covariance of each component
covars = .5 * np.tile(np.identity(2), (4, 1, 1))
# Build an HMM instance and set parameters
model = hmm.GaussianHMM(4, "full", start_prob, trans_mat,
random_state=42)
# Instead of fitting it from the data, we directly set the estimated
# parameters, the means and covariance of the components
model.means_ = means
model.covars_ = covars
###############################################################
# Generate samples
X, Z = model.sample(500)
# Plot the sampled data
plt.plot(X[:, 0], X[:, 1], "-o", label="observations", ms=6,
mfc="orange", alpha=0.7)
# Indicate the component numbers
for i, m in enumerate(means):
plt.text(m[0], m[1], 'Component %i' % (i + 1),
size=17, horizontalalignment='center',
bbox=dict(alpha=.7, facecolor='w'))
plt.legend(loc='best')
plt.show()
Figure 2.7: Gaussian HMM of stock data
Gaussian HMM of stock data
This script shows how to use Gaussian HMM. It uses stock price data, which can be obtained from Yahoo Finance. For
more information on how to get stock prices with matplotlib, please refer to date_demo1.py of matplotlib.
Script output:
fitting to HMM and decoding ... done
Transition matrix
[[ 9.76719299e-01 1.35417228e-16 2.38997332e-03 2.08907155e-02
1.18773340e-08]
[ 2.56709643e-15 6.27458268e-01 3.26816051e-02 2.40445128e-02
3.15815615e-01]
[ 8.32867819e-04 2.92086856e-02 8.20163873e-01 1.35374694e-05
1.49781036e-01]
[ 2.62989391e-01 3.24149388e-01 3.61148574e-18 4.12861221e-01
1.07421560e-16]
[ 3.94120552e-03 1.18350712e-01 1.54841511e-01 3.55404724e-03
7.19312524e-01]]
means and vars of each hidden state
0th hidden state
mean = [ 2.86671252e-02 4.96912888e+07]
var = [ 9.36505203e-01 2.50416506e+14]
1th hidden state
mean = [ 3.82710228e-02 1.10461347e+08]
var = [ 2.07797740e-01 8.81745732e+14]
2th hidden state
mean = [ 6.45173011e-03 4.91151802e+07]
var = [ 5.33155033e-02 1.09532022e+14]
3th hidden state
mean = [ -7.94680418e-01 1.49185466e+08]
var = [ 6.50069278e+00 1.02490114e+16]
4th hidden state
mean = [ 1.20905487e-02 6.99175140e+07]
var = [ 1.31030113e-01 1.52865824e+14]
Python source code: plot_hmm_stock_analysis.py
print __doc__
import datetime
import numpy as np
import pylab as pl
from matplotlib.finance import quotes_historical_yahoo
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
from sklearn.hmm import GaussianHMM
###############################################################################
# Downloading the data
date1 = datetime.date(1995, 1, 1) # start date
date2 = datetime.date(2012, 1, 6) # end date
# get quotes from yahoo finance
quotes = quotes_historical_yahoo("INTC", date1, date2)
if len(quotes) == 0:
raise SystemExit
# unpack quotes
dates = np.array([q[0] for q in quotes], dtype=int)
close_v = np.array([q[2] for q in quotes])
volume = np.array([q[5] for q in quotes])[1:]
# take diff of close value
# this makes len(diff) = len(close_t) - 1
# therefore, others quantity also need to be shifted
diff = close_v[1:] - close_v[:-1]
dates = dates[1:]
close_v = close_v[1:]
# pack diff and volume for training
X = np.column_stack([diff, volume])
###############################################################################
# Run Gaussian HMM
print "fitting to HMM and decoding ...",
n_components = 5
# make an HMM instance and execute fit
model = GaussianHMM(n_components, covariance_type="diag", n_iter=1000)
model.fit([X])
# predict the optimal sequence of internal hidden state
hidden_states = model.predict(X)
print "done\n"
###############################################################################
# print trained parameters and plot
print "Transition matrix"
print model.transmat_
print ""
print "means and vars of each hidden state"
for i in xrange(n_components):
print "%dth hidden state" % i
print "mean = ", model.means_[i]
print "var = ", np.diag(model.covars_[i])
print ""
years = YearLocator() # every year
months = MonthLocator() # every month
yearsFmt = DateFormatter('%Y')
fig = pl.figure()
ax = fig.add_subplot(111)
for i in xrange(n_components):
# use fancy indexing to plot data in each state
idx = (hidden_states == i)
ax.plot_date(dates[idx], close_v[idx], 'o', label="%dth hidden state" % i)
ax.legend()
# format the ticks
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
ax.xaxis.set_minor_locator(months)
ax.autoscale_view()
# format the coords message box
ax.fmt_xdata = DateFormatter('%Y-%m-%d')
ax.fmt_ydata = lambda x: '$%1.2f' % x
ax.grid(True)
fig.autofmt_xdate()
pl.show()
Figure 2.8: Classifiers Comparison
Classifiers Comparison
A comparison of K-nearest-neighbours, Logistic Regression and a Linear SVC classifying the iris dataset.
Script output:
Computing adjusted_rand_score for 10 values of n_clusters and n_samples=100
done in 0.253s
Computing v_measure_score for 10 values of n_clusters and n_samples=100
done in 2.110s
Computing adjusted_mutual_info_score for 10 values of n_clusters and n_samples=100
done in 7.992s
Computing mutual_info_score for 10 values of n_clusters and n_samples=100
done in 0.033s
Computing adjusted_rand_score for 10 values of n_clusters and n_samples=1000
done in 0.432s
Computing v_measure_score for 10 values of n_clusters and n_samples=1000
done in 0.932s
Computing adjusted_mutual_info_score for 10 values of n_clusters and n_samples=1000
done in 21.824s
Computing mutual_info_score for 10 values of n_clusters and n_samples=1000
done in 0.155s
Python source code: plot_adjusted_for_chance_measures.py
print __doc__
# Author: Olivier Grisel <[email protected]>
# License: Simplified BSD
import numpy as np
import pylab as pl
from time import time
from sklearn import metrics
def uniform_labelings_scores(score_func, n_samples, n_clusters_range,
fixed_n_classes=None, n_runs=5, seed=42):
"""Compute score for 2 random uniform cluster labelings.
Both random labelings have the same number of clusters for each possible
value in n_clusters_range.
When fixed_n_classes is not None the first labeling is considered a ground
truth class assignment with a fixed number of classes.
"""
random_labels = np.random.RandomState(seed).random_integers
scores = np.zeros((len(n_clusters_range), n_runs))
if fixed_n_classes is not None:
labels_a = random_labels(low=0, high=fixed_n_classes - 1,
size=n_samples)
for i, k in enumerate(n_clusters_range):
for j in range(n_runs):
if fixed_n_classes is None:
labels_a = random_labels(low=0, high=k - 1, size=n_samples)
labels_b = random_labels(low=0, high=k - 1, size=n_samples)
scores[i, j] = score_func(labels_a, labels_b)
return scores
score_funcs = [
metrics.adjusted_rand_score,
metrics.v_measure_score,
metrics.adjusted_mutual_info_score,
metrics.mutual_info_score,
]
# 2 independent random clusterings with equal cluster number
n_samples = 100
n_clusters_range = np.linspace(2, n_samples, 10).astype(np.int)
pl.figure(1)
plots = []
names = []
for score_func in score_funcs:
print "Computing %s for %d values of n_clusters and n_samples=%d" % (
score_func.__name__, len(n_clusters_range), n_samples)
t0 = time()
scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)
print "done in %0.3fs" % (time() - t0)
plots.append(pl.errorbar(
n_clusters_range, np.median(scores, axis=1), scores.std(axis=1))[0])
names.append(score_func.__name__)
pl.title("Clustering measures for 2 random uniform labelings\n"
"with equal number of clusters")
pl.xlabel('Number of clusters (Number of samples is fixed to %d)' % n_samples)
pl.ylabel('Score value')
pl.legend(plots, names)
pl.ylim(ymin=-0.05, ymax=1.05)
# Random labeling with varying n_clusters against ground class labels
# with fixed number of clusters
n_samples = 1000
n_clusters_range = np.linspace(2, 100, 10).astype(np.int)
n_classes = 10
pl.figure(2)
plots = []
names = []
for score_func in score_funcs:
print "Computing %s for %d values of n_clusters and n_samples=%d" % (
score_func.__name__, len(n_clusters_range), n_samples)
t0 = time()
scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range,
fixed_n_classes=n_classes)
print "done in %0.3fs" % (time() - t0)
plots.append(pl.errorbar(
n_clusters_range, scores.mean(axis=1), scores.std(axis=1))[0])
names.append(score_func.__name__)
pl.title("Clustering measures for random uniform labeling\n"
"against reference assignement with %d classes" % n_classes)
pl.xlabel(Number of clusters (Number of samples is fixed to %d) % n_samples)
pl.ylabel(Score value)
pl.ylim(ymin=-0.05, ymax=1.05)
pl.legend(plots, names)
pl.show()
Figure 2.35: Demo of affinity propagation clustering algorithm
Demo of affinity propagation clustering algorithm
Reference: Brendan J. Frey and Delbert Dueck, "Clustering by Passing Messages Between Data Points", Science Feb.
2007
Script output:
Estimated number of clusters: 3
Homogeneity: 0.885
Completeness: 0.885
V-measure: 0.885
Adjusted Rand Index: 0.922
Adjusted Mutual Information: 0.884
Silhouette Coefficient: 0.774
Python source code: plot_affinity_propagation.py
print __doc__
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5)
##############################################################################
# Compute similarities
X_norms = np.sum(X ** 2, axis=1)
S = - X_norms[:, np.newaxis] - X_norms[np.newaxis, :] + 2 * np.dot(X, X.T)
p = 10 * np.median(S)
##############################################################################
# Compute Affinity Propagation
af = AffinityPropagation().fit(S, p)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print 'Estimated number of clusters: %d' % n_clusters_
print "Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels)
print "Completeness: %0.3f" % metrics.completeness_score(labels_true, labels)
print "V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels)
print "Adjusted Rand Index: %0.3f" % \
metrics.adjusted_rand_score(labels_true, labels)
print "Adjusted Mutual Information: %0.3f" % \
metrics.adjusted_mutual_info_score(labels_true, labels)
D = (S / np.min(S))
print ("Silhouette Coefficient: %0.3f" %
metrics.silhouette_score(D, labels, metric='precomputed'))
##############################################################################
# Plot result
import pylab as pl
from itertools import cycle
pl.close('all')
pl.figure(1)
pl.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
class_members = labels == k
cluster_center = X[cluster_centers_indices[k]]
pl.plot(X[class_members, 0], X[class_members, 1], col + '.')
pl.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
for x in X[class_members]:
pl.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()
Comparing different clustering algorithms on toy datasets
This example aims at showing characteristics of different clustering algorithms on datasets that are interesting but
still in 2D. The last dataset is an example of a null situation for clustering: the data is homogeneous, and there is no
good clustering.
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional
data.
The results could be improved by tweaking the parameters for each clustering strategy, for instance setting the number
of clusters for the methods that need this parameter specified. Note that affinity propagation has a tendency to create
many clusters. Thus in this example its two parameters (damping and per-point preference) were set to mitigate this
behavior.
Figure 2.36: Comparing different clustering algorithms on toy datasets
Python source code: plot_cluster_comparison.py
print __doc__
import time
import numpy as np
import pylab as pl
from sklearn import cluster, datasets
from sklearn.metrics import euclidean_distances
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import Scaler
np.random.seed(0)
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
pl.figure(figsize=(14, 9.5))
pl.subplots_adjust(left=.001, right=.999, bottom=.001, top=.96, wspace=.05,
hspace=.01)
plot_num = 1
for i_dataset, dataset in enumerate([noisy_circles, noisy_moons, blobs,
no_structure]):
X, y = dataset
# normalize dataset for easier parameter selection
X = Scaler().fit_transform(X)
# estimate bandwidth for mean shift
bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=10)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
# Compute distances
distances = euclidean_distances(X)
# create clustering estimators
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=2)
ward_five = cluster.Ward(n_clusters=2, connectivity=connectivity)
spectral = cluster.SpectralClustering(n_clusters=2, mode='arpack')
dbscan = cluster.DBSCAN(eps=.2)
affinity_propagation = cluster.AffinityPropagation(damping=.9)
for algorithm in [two_means, affinity_propagation, ms, spectral,
ward_five, dbscan]:
# predict cluster memberships
t0 = time.time()
if algorithm == spectral:
algorithm.fit(connectivity)
elif algorithm == affinity_propagation:
# Set a low preference to avoid creating too many
# clusters. This parameter is hard to set in practice
algorithm.fit(-distances, p=-50 * distances.max())
else:
algorithm.fit(X)
t1 = time.time()
if hasattr(algorithm, 'labels_'):
y_pred = algorithm.labels_.astype(np.int)
else:
y_pred = algorithm.predict(X)
# plot
pl.subplot(4, 6, plot_num)
if i_dataset == 0:
pl.title(str(algorithm).split('(')[0], size=18)
pl.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
if hasattr(algorithm, 'cluster_centers_'):
centers = algorithm.cluster_centers_
center_colors = colors[:len(centers)]
pl.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
pl.xlim(-2, 2)
pl.ylim(-2, 2)
pl.xticks(())
pl.yticks(())
pl.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
transform=pl.gca().transAxes, size=15,
horizontalalignment='right')
plot_num += 1
pl.show()
Figure 2.37: K-means Clustering
K-means Clustering
The plots display firstly what a K-means algorithm would yield using three clusters. It is then shown what the effect
of a bad initialization is on the classification process: by setting n_init to only 1 (default is 10), the number of times
that the algorithm will be run with different centroid seeds is reduced. The next plot displays what using eight clusters
would deliver, and finally the ground truth.
Script output:
Fitting estimator on a small sub-sample of the data
done in 1.165s.
Predicting color indices on the full image (k-means)
done in 1.024s.
Predicting color indices on the full image (random)
done in 1.131s.
Python source code: plot_color_quantization.py
# Authors: Robert Layton <[email protected]>
# Olivier Grisel <[email protected]>
# Mathieu Blondel <[email protected]>
#
# License: BSD
print __doc__
import numpy as np
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.metrics import euclidean_distances
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time
n_colors = 64
# Load the Summer Palace photo
china = load_sample_image("china.jpg")
# Convert to floats instead of the default 8 bits integer coding. Dividing by
# 255 is important so that pl.imshow works well on float data (it needs to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255
# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))
print "Fitting estimator on a small sub-sample of the data"
t0 = time()
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print "done in %0.3fs." % (time() - t0)
# Get labels for all points
print "Predicting color indices on the full image (k-means)"
t0 = time()
labels = kmeans.predict(image_array)
print "done in %0.3fs." % (time() - t0)
codebook_random = shuffle(image_array, random_state=0)[:n_colors + 1]
print "Predicting color indices on the full image (random)"
t0 = time()
dist = euclidean_distances(codebook_random, image_array, squared=True)
labels_random = dist.argmin(axis=0)
print "done in %0.3fs." % (time() - t0)
def recreate_image(codebook, labels, w, h):
"""Recreate the (compressed) image from the code book & labels"""
d = codebook.shape[1]
image = np.zeros((w, h, d))
label_idx = 0
for i in range(w):
for j in range(h):
image[i][j] = codebook[labels[label_idx]]
label_idx += 1
return image
# Display all results, alongside original image
pl.figure(1)
pl.clf()
ax = pl.axes([0, 0, 1, 1])
pl.axis('off')
pl.title('Original image (96,615 colors)')
pl.imshow(china)
pl.figure(2)
pl.clf()
ax = pl.axes([0, 0, 1, 1])
pl.axis('off')
pl.title('Quantized image (64 colors, K-Means)')
pl.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))
pl.figure(3)
pl.clf()
ax = pl.axes([0, 0, 1, 1])
pl.axis('off')
pl.title('Quantized image (64 colors, Random)')
pl.imshow(recreate_image(codebook_random, labels_random, w, h))
pl.show()
Demo of DBSCAN clustering algorithm
Finds core samples of high density and expands clusters from them.
Figure 2.39: Demo of DBSCAN clustering algorithm
Script output:
Estimated number of clusters: 2
Homogeneity: 0.517
Completeness: 0.660
V-measure: 0.580
Adjusted Rand Index: 0.501
Adjusted Mutual Information: 0.516
Silhouette Coefficient: 0.381
Python source code: plot_dbscan.py
print __doc__
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)
##############################################################################
# Compute similarities
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
core_samples = db.core_sample_indices_
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print 'Estimated number of clusters: %d' % n_clusters_
print "Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels)
print "Completeness: %0.3f" % metrics.completeness_score(labels_true, labels)
print "V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels)
print "Adjusted Rand Index: %0.3f" % \
metrics.adjusted_rand_score(labels_true, labels)
print "Adjusted Mutual Information: %0.3f" % \
metrics.adjusted_mutual_info_score(labels_true, labels)
print ("Silhouette Coefficient: %0.3f" %
metrics.silhouette_score(D, labels, metric='precomputed'))
##############################################################################
# Plot result
import pylab as pl
from itertools import cycle
pl.close('all')
pl.figure(1)
pl.clf()
# Black removed and is used for noise instead.
colors = cycle('bgrcmybgrcmybgrcmybgrcmy')
for k, col in zip(set(labels), colors):
if k == -1:
# Black used for noise.
col = 'k'
markersize = 6
class_members = [index[0] for index in np.argwhere(labels == k)]
cluster_core_samples = [index for index in core_samples
if labels[index] == k]
for index in class_members:
x = X[index]
if index in core_samples and k != -1:
markersize = 14
else:
markersize = 6
pl.plot(x[0], x[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=markersize)
pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()
Figure 2.40: Feature agglomeration
Feature agglomeration
These images show how similar features are merged together using feature agglomeration.
Python source code: plot_digits_agglomeration.py
print __doc__
# Code source: Gael Varoqueux
# Modified for Documentation merge by Jaques Grobler
# License: BSD
import numpy as np
import pylab as pl
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.WardAgglomeration(connectivity=connectivity,
n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)
pl.figure(1, figsize=(4, 3.5))
pl.clf()
pl.subplots_adjust(left=.01, right=.99, bottom=.01, top=.91)
for i in range(4):
pl.subplot(3, 4, i + 1)
pl.imshow(images[i], cmap=pl.cm.gray,
vmax=16, interpolation='nearest')
pl.xticks(())
pl.yticks(())
if i == 1:
pl.title('Original data')
pl.subplot(3, 4, 4 + i + 1)
pl.imshow(images_restored[i],
cmap=pl.cm.gray, vmax=16, interpolation='nearest')
if i == 1:
pl.title('Agglomerated data')
pl.xticks(())
pl.yticks(())
pl.subplot(3, 4, 10)
pl.imshow(np.reshape(agglo.labels_, images[0].shape),
interpolation='nearest', cmap=pl.cm.spectral)
pl.xticks(())
pl.yticks(())
pl.title('Labels')
Figure 2.41: Feature agglomeration vs. univariate selection
Feature agglomeration vs. univariate selection
This example compares 2 dimensionality reduction strategies:
univariate feature selection with Anova
feature agglomeration with Ward hierarchical clustering
Both methods are compared in a regression problem using a BayesianRidge as supervised estimator.
Script output:
________________________________________________________________________________
[Memory] Calling sklearn.cluster.hierarchical.ward_tree...
ward_tree(array([[-0.451933, ..., -0.675318],
...,
[ 0.275706, ..., -1.085711]]),
<1600x1600 sparse matrix of type <type numpy.int32>
with 7840 stored elements in COOrdinate format>, copy=True, n_components=1)
________________________________________________________ward_tree - 0.3s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.cluster.hierarchical.ward_tree...
ward_tree(array([[ 0.905206, ..., 0.161245],
...,
[-0.849835, ..., -1.091621]]),
<1600x1600 sparse matrix of type <type numpy.int32>
with 7840 stored elements in COOrdinate format>, copy=True, n_components=1)
________________________________________________________ward_tree - 0.3s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.cluster.hierarchical.ward_tree...
ward_tree(array([[ 0.905206, ..., -0.675318],
...,
[-0.849835, ..., -1.085711]]),
<1600x1600 sparse matrix of type <type numpy.int32>
with 7840 stored elements in COOrdinate format>, copy=True, n_components=1)
________________________________________________________ward_tree - 0.3s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.f_regression...
f_regression(array([[-0.451933, ..., 0.275706],
...,
[-0.675318, ..., -1.085711]]),
array([ 25.267703, ..., -25.026711]))
_____________________________________________________f_regression - 0.0s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.f_regression...
f_regression(array([[ 0.905206, ..., -0.849835],
...,
[ 0.161245, ..., -1.091621]]),
array([ -27.447268, ..., -112.638768]))
_____________________________________________________f_regression - 0.0s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.f_regression...
f_regression(array([[ 0.905206, ..., -0.849835],
...,
[-0.675318, ..., -1.085711]]),
array([-27.447268, ..., -25.026711]))
_____________________________________________________f_regression - 0.0s, 0.0min
Python source code: plot_feature_agglomeration_vs_univariate_selection.py
# Author: Alexandre Gramfort <[email protected]>
# License: BSD Style.
print __doc__
import shutil
import tempfile
import numpy as np
import pylab as pl
from scipy import linalg, ndimage
from sklearn.feature_extraction.image import grid_to_graph
from sklearn import feature_selection
from sklearn.cluster import WardAgglomeration
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.externals.joblib import Memory
from sklearn.cross_validation import KFold
###############################################################################
# Generate data
n_samples = 200
size = 40 # image size
roi_size = 15
snr = 5.
np.random.seed(0)
mask = np.ones([size, size], dtype=np.bool)
coef = np.zeros((size, size))
coef[0:roi_size, 0:roi_size] = -1.
coef[-roi_size:, -roi_size:] = 1.
X = np.random.randn(n_samples, size ** 2)
for x in X: # smooth data
x[:] = ndimage.gaussian_filter(x.reshape(size, size), sigma=1.0).ravel()
X -= X.mean(axis=0)
X /= X.std(axis=0)
y = np.dot(X, coef.ravel())
noise = np.random.randn(y.shape[0])
noise_coef = (linalg.norm(y, 2) / np.exp(snr / 20.)) / linalg.norm(noise, 2)
y += noise_coef * noise  # add noise
###############################################################################
# Compute the coefs of a Bayesian Ridge with GridSearch
cv = KFold(len(y), 2) # cross-validation generator for model selection
ridge = BayesianRidge()
cachedir = tempfile.mkdtemp()
mem = Memory(cachedir=cachedir, verbose=1)
2.1. Examples 767
scikit-learn user guide, Release 0.12-git
# Ward agglomeration followed by BayesianRidge
A = grid_to_graph(n_x=size, n_y=size)
ward = WardAgglomeration(n_clusters=10, connectivity=A, memory=mem,
n_components=1)
clf = Pipeline([(ward, ward), (ridge, ridge)])
# Select the optimal number of parcels with grid search
clf = GridSearchCV(clf, {ward__n_clusters: [10, 20, 30]}, n_jobs=1, cv=cv)
clf.fit(X, y) # set the best parameters
coef_ = clf.best_estimator_.steps[-1][1].coef_
coef_ = clf.best_estimator_.steps[0][1].inverse_transform(coef_)
coef_agglomeration_ = coef_.reshape(size, size)
# Anova univariate feature selection followed by BayesianRidge
f_regression = mem.cache(feature_selection.f_regression) # caching function
anova = feature_selection.SelectPercentile(f_regression)
clf = Pipeline([(anova, anova), (ridge, ridge)])
# Select the optimal percentage of features with grid search
clf = GridSearchCV(clf, {anova__percentile: [5, 10, 20]}, cv=cv)
clf.fit(X, y) # set the best parameters
coef_ = clf.best_estimator_.steps[-1][1].coef_
coef_ = clf.best_estimator_.steps[0][1].inverse_transform(coef_)
coef_selection_ = coef_.reshape(size, size)
###############################################################################
# Inverse the transformation to plot the results on an image
pl.close(all)
pl.figure(figsize=(7.3, 2.7))
pl.subplot(1, 3, 1)
pl.imshow(coef, interpolation="nearest", cmap=pl.cm.RdBu_r)
pl.title("True weights")
pl.subplot(1, 3, 2)
pl.imshow(coef_selection_, interpolation="nearest", cmap=pl.cm.RdBu_r)
pl.title("Feature Selection")
pl.subplot(1, 3, 3)
pl.imshow(coef_agglomeration_, interpolation="nearest", cmap=pl.cm.RdBu_r)
pl.title("Feature Agglomeration")
pl.subplots_adjust(0.04, 0.0, 0.98, 0.94, 0.16, 0.26)
pl.show()
# Attempt to remove the temporary cachedir, but dont worry if it fails
shutil.rmtree(cachedir, ignore_errors=True)
Figure 2.42: A demo of K-Means clustering on the handwritten digits data
A demo of K-Means clustering on the handwritten digits data
In this example we compare the various initialization strategies for K-means in terms of runtime and quality of the
results.
As the ground truth is known here, we also apply different cluster quality metrics to judge the goodness of fit of the
cluster labels to the ground truth.
Cluster quality metrics evaluated (see Clustering performance evaluation for definitions and discussions of the metrics):
Shorthand    Full name
homo         homogeneity score
compl        completeness score
v-meas       V measure
ARI          adjusted Rand index
AMI          adjusted mutual information
silhouette   silhouette coefficient
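As a quick illustration (not part of the original example), the shorthands above map to the following sklearn.metrics functions. The two label vectors below are made up for the illustration; the silhouette coefficient additionally needs the data matrix, as in the bench_k_means helper further down.
# Minimal sketch: the metrics behind the shorthands, on made-up labelings
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print "homo  ", metrics.homogeneity_score(labels_true, labels_pred)
print "compl ", metrics.completeness_score(labels_true, labels_pred)
print "v-meas", metrics.v_measure_score(labels_true, labels_pred)
print "ARI   ", metrics.adjusted_rand_score(labels_true, labels_pred)
print "AMI   ", metrics.adjusted_mutual_info_score(labels_true, labels_pred)
# silhouette: metrics.silhouette_score(data, labels_pred) also needs the data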
Script output:
n_digits: 10, n_samples 1797, n_features 64
_______________________________________________________________________________
init time inertia homo compl v-meas ARI AMI silhouette
k-means++ 1.90s 69432 0.602 0.650 0.625 0.465 0.598 0.146
random 1.80s 69694 0.669 0.710 0.689 0.553 0.666 0.147
PCA-based 0.14s 71820 0.673 0.715 0.693 0.567 0.670 0.150
_______________________________________________________________________________
Python source code: plot_kmeans_digits.py
print __doc__

from time import time
import numpy as np
import pylab as pl

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print "n_digits: %d, \t n_samples %d, \t n_features %d" % (n_digits,
                                                           n_samples, n_features)

print 79 * '_'
print ('% 9s' % 'init'
       '    time  inertia    homo   compl  v-meas     ARI AMI  silhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print '% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f' % (
        name, (time() - t0), estimator.inertia_,
        metrics.homogeneity_score(labels, estimator.labels_),
        metrics.completeness_score(labels, estimator.labels_),
        metrics.v_measure_score(labels, estimator.labels_),
        metrics.adjusted_rand_score(labels, estimator.labels_),
        metrics.adjusted_mutual_info_score(labels, estimator.labels_),
        metrics.silhouette_score(data, estimator.labels_,
                                 metric='euclidean',
                                 sample_size=sample_size),
    )

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)

# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
print 79 * '_'

###############################################################################
# Visualize the results on PCA-reduced data

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() + 1, reduced_data[:, 0].max() - 1
y_min, y_max = reduced_data[:, 1].min() + 1, reduced_data[:, 1].max() - 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(1)
pl.clf()
pl.imshow(Z, interpolation='nearest',
          extent=(xx.min(), xx.max(), yy.min(), yy.max()),
          cmap=pl.cm.Paired,
          aspect='auto', origin='lower')

pl.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
pl.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='w', zorder=10)
pl.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
         'Centroids are marked with white cross')
pl.xlim(x_min, x_max)
pl.ylim(y_min, y_max)
pl.xticks(())
pl.yticks(())
pl.show()
Figure 2.43: Empirical evaluation of the impact of k-means initialization
Empirical evaluation of the impact of k-means initialization
Evaluate the ability of k-means initialization strategies to make the algorithm convergence robust, as measured by the
relative standard deviation of the inertia of the clustering (i.e. the sum of distances to the nearest cluster center).
The first plot shows the best inertia reached for each combination of the model (KMeans or MiniBatchKMeans)
and the init method (init="random" or init="k-means++") for increasing values of the n_init parameter
that controls the number of initializations.
The second plot demonstrates a single run of the MiniBatchKMeans estimator using init="random" and
n_init=1. This run leads to a bad convergence (local optimum), with estimated centers stuck between
ground truth clusters.
The dataset used for evaluation is a 2D grid of isotropic Gaussian clusters widely spaced.
Script output:
Evaluation of KMeans with k-means++ init
Evaluation of KMeans with random init
Evaluation of MiniBatchKMeans with k-means++ init
Evaluation of MiniBatchKMeans with random init
Python source code: plot_kmeans_stability_low_dim_dense.py
print __doc__

# Author: Olivier Grisel <[email protected]>
# License: Simplified BSD

import numpy as np
import pylab as pl
import matplotlib.cm as cm

from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans

random_state = np.random.RandomState(0)

# Number of runs (with randomly generated dataset) for each strategy so as
# to be able to compute an estimate of the standard deviation
n_runs = 5

# k-means models can do several random inits so as to be able to trade
# CPU time for convergence robustness
n_init_range = np.array([1, 5, 10, 15, 20])

# Datasets generation parameters
n_samples_per_center = 100
grid_size = 3
scale = 0.1
n_clusters = grid_size ** 2


def make_data(random_state, n_samples_per_center, grid_size, scale):
    random_state = check_random_state(random_state)
    centers = np.array([[i, j]
                        for i in range(grid_size)
                        for j in range(grid_size)])
    n_clusters_true, n_features = centers.shape

    noise = random_state.normal(
        scale=scale, size=(n_samples_per_center, centers.shape[1]))

    X = np.concatenate([c + noise for c in centers])
    y = np.concatenate([[i] * n_samples_per_center
                        for i in range(n_clusters_true)])
    return shuffle(X, y, random_state=random_state)

# Part 1: Quantitative evaluation of various init methods

fig = pl.figure()
plots = []
legends = []

cases = [
    (KMeans, 'k-means++', {}),
    (KMeans, 'random', {}),
    (MiniBatchKMeans, 'k-means++', {'max_no_improvement': 3}),
    (MiniBatchKMeans, 'random', {'max_no_improvement': 3, 'init_size': 500}),
]

for factory, init, params in cases:
    print "Evaluation of %s with %s init" % (factory.__name__, init)
    inertia = np.empty((len(n_init_range), n_runs))

    for run_id in range(n_runs):
        X, y = make_data(run_id, n_samples_per_center, grid_size, scale)
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=n_clusters, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(X)
            inertia[i, run_id] = km.inertia_
    p = pl.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))

pl.xlabel('n_init')
pl.ylabel('inertia')
pl.legend(plots, legends)
pl.title("Mean inertia for various k-means init across %d runs" % n_runs)

# Part 2: Qualitative visual inspection of the convergence

X, y = make_data(random_state, n_samples_per_center, grid_size, scale)
km = MiniBatchKMeans(n_clusters=n_clusters, init='random', n_init=1,
                     random_state=random_state).fit(X)

fig = pl.figure()
for k in range(n_clusters):
    my_members = km.labels_ == k
    color = cm.spectral(float(k) / n_clusters, 1)
    pl.plot(X[my_members, 0], X[my_members, 1], 'o', marker='.', c=color)
    cluster_center = km.cluster_centers_[k]
    pl.plot(cluster_center[0], cluster_center[1], 'o',
            markerfacecolor=color, markeredgecolor='k', markersize=6)
pl.title("Example cluster allocation with a single random init\n"
         "with MiniBatchKMeans")

pl.show()
Figure 2.44: Vector Quantization Example
Vector Quantization Example
The classic image-processing example, Lena, an 8-bit grayscale image of size 512 x 512, is used here to
illustrate how k-means can be used for vector quantization.
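The full source of this example is not reproduced here. The following minimal sketch (an illustration only, not the original script) shows the idea of vector quantization with k-means: the grey levels of the image are clustered into a handful of values and every pixel is replaced by its cluster center. It reuses scipy.misc.lena, as other examples in this gallery do; the choice of 5 clusters is an arbitrary assumption.
# Minimal sketch: k-means vector quantization of the Lena grey levels
import numpy as np
from scipy.misc import lena
from sklearn.cluster import KMeans

image = lena().astype(np.float64)
X = image.reshape((-1, 1))  # one sample per pixel, one feature (grey value)
k_means = KMeans(n_clusters=5, n_init=4).fit(X)
values = k_means.cluster_centers_.ravel()
labels = k_means.labels_
image_compressed = values[labels].reshape(image.shape)  # only 5 grey levels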
Script output:
Compute unstructured hierarchical clustering...
Elapsed time: 0.912338972092
Number of points: 1000
Compute structured hierarchical clustering...
Elapsed time: 0.172855138779
Number of points: 1000
Python source code: plot_ward_structured_vs_unstructured.py
# Authors : Vincent Michel, 2010
#           Alexandre Gramfort, 2010
#           Gael Varoquaux, 2010
# License: BSD

print __doc__

import time as time
import numpy as np
import pylab as pl
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.cluster import Ward
from sklearn.datasets.samples_generator import make_swiss_roll

###############################################################################
# Generate data (swiss roll dataset)
n_samples = 1000
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise)
# Make it thinner
X[:, 1] *= .5

###############################################################################
# Compute clustering
print "Compute unstructured hierarchical clustering..."
st = time.time()
ward = Ward(n_clusters=6).fit(X)
label = ward.labels_
print "Elapsed time: ", time.time() - st
print "Number of points: ", label.size

###############################################################################
# Plot result
fig = pl.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
    ax.plot3D(X[label == l, 0], X[label == l, 1], X[label == l, 2],
              'o', color=pl.cm.jet(np.float(l) / np.max(label + 1)))
pl.title('Without connectivity constraints')

###############################################################################
# Define the structure A of the data. Here a 10 nearest neighbors
from sklearn.neighbors import kneighbors_graph
connectivity = kneighbors_graph(X, n_neighbors=10)

###############################################################################
# Compute clustering
print "Compute structured hierarchical clustering..."
st = time.time()
ward = Ward(n_clusters=6, connectivity=connectivity).fit(X)
label = ward.labels_
print "Elapsed time: ", time.time() - st
print "Number of points: ", label.size

###############################################################################
# Plot result
fig = pl.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
    ax.plot3D(X[label == l, 0], X[label == l, 1], X[label == l, 2],
              'o', color=pl.cm.jet(float(l) / np.max(label + 1)))
pl.title('With connectivity constraints')

pl.show()
2.1.4 Covariance estimation
Examples concerning the sklearn.covariance package.
Figure 2.51: Ledoit-Wolf vs Covariance simple estimation
Ledoit-Wolf vs Covariance simple estimation
The usual covariance maximum likelihood estimate can be regularized using shrinkage. Ledoit and Wolf proposed a
closed formula to compute the asymptotically optimal shrinkage parameter (minimizing a MSE criterion), yielding the
Ledoit-Wolf covariance estimate.
Chen et al. proposed an improvement of the Ledoit-Wolf shrinkage parameter, the OAS coefficient, whose convergence
is significantly better under the assumption that the data are Gaussian.
In this example, we compute the likelihood of unseen data for different values of the shrinkage parameter, highlighting
the LW and OAS estimates. The Ledoit-Wolf estimate stays close to the likelihood-optimal value, which is
an artifact of the method since it is asymptotic and we are working with a small number of observations. The OAS
estimate deviates from the likelihood-optimal value but better approximates the MSE-optimal value, especially
for a small number of observations.
Python source code: plot_covariance_estimation.py
print __doc__

import numpy as np
import pylab as pl
from scipy import linalg

###############################################################################
# Generate sample data
n_features, n_samples = 30, 20
base_X_train = np.random.normal(size=(n_samples, n_features))
base_X_test = np.random.normal(size=(n_samples, n_features))

# Color samples
coloring_matrix = np.random.normal(size=(n_features, n_features))
X_train = np.dot(base_X_train, coloring_matrix)
X_test = np.dot(base_X_test, coloring_matrix)

###############################################################################
# Compute Ledoit-Wolf and Covariances on a grid of shrinkages
from sklearn.covariance import LedoitWolf, OAS, ShrunkCovariance, \
    log_likelihood, empirical_covariance

# Ledoit-Wolf optimal shrinkage coefficient estimate
lw = LedoitWolf()
loglik_lw = lw.fit(X_train, assume_centered=True).score(
    X_test, assume_centered=True)

# OAS coefficient estimate
oa = OAS()
loglik_oa = oa.fit(X_train, assume_centered=True).score(
    X_test, assume_centered=True)

# spanning a range of possible shrinkage coefficient values
shrinkages = np.logspace(-3, 0, 30)
negative_logliks = [-ShrunkCovariance(shrinkage=s).fit(
        X_train, assume_centered=True).score(X_test, assume_centered=True)
        for s in shrinkages]

# getting the likelihood under the real model
real_cov = np.dot(coloring_matrix.T, coloring_matrix)
emp_cov = empirical_covariance(X_train)
loglik_real = -log_likelihood(emp_cov, linalg.inv(real_cov))

###############################################################################
# Plot results
pl.figure()
pl.title("Regularized covariance: likelihood and shrinkage coefficient")
pl.xlabel('Shrinkage')
pl.ylabel('Negative log-likelihood')
# range shrinkage curve
pl.loglog(shrinkages, negative_logliks)

# real likelihood reference
# BUG: hlines(..., linestyle='--') breaks on some older versions of matplotlib
#pl.hlines(loglik_real, pl.xlim()[0], pl.xlim()[1], color='red',
#          label="real covariance likelihood", linestyle='--')
pl.plot(pl.xlim(), 2 * [loglik_real], '--r',
        label="real covariance likelihood")

# adjust view
lik_max = np.amax(negative_logliks)
lik_min = np.amin(negative_logliks)
ylim0 = lik_min - 5. * np.log((pl.ylim()[1] - pl.ylim()[0]))
ylim1 = lik_max + 10. * np.log(lik_max - lik_min)
# LW likelihood
pl.vlines(lw.shrinkage_, ylim0, -loglik_lw, color='g',
          linewidth=3, label='Ledoit-Wolf estimate')
# OAS likelihood
pl.vlines(oa.shrinkage_, ylim0, -loglik_oa, color='orange',
          linewidth=3, label='OAS estimate')
pl.ylim(ylim0, ylim1)
pl.xlim(shrinkages[0], shrinkages[-1])
pl.legend()
pl.show()
Figure 2.52: Ledoit-Wolf vs OAS estimation
Ledoit-Wolf vs OAS estimation
The usual covariance maximum likelihood estimate can be regularized using shrinkage. Ledoit and Wolf proposed a
closed formula to compute the asymptotically optimal shrinkage parameter (minimizing a MSE criterion), yielding the
Ledoit-Wolf covariance estimate.
Chen et al. proposed an improvement of the Ledoit-Wolf shrinkage parameter, the OAS coefficient, whose convergence
is significantly better under the assumption that the data are Gaussian.
This example, inspired by Chen's publication [1], shows a comparison of the estimated MSE of the LW and OAS
methods, using Gaussian distributed data.
[1] "Shrinkage Algorithms for MMSE Covariance Estimation", Chen et al., IEEE Trans. on Sign. Proc., Volume 58,
Issue 10, October 2010.
Python source code: plot_lw_vs_oas.py
print __doc__

import numpy as np
import pylab as pl
from scipy.linalg import toeplitz, cholesky
from sklearn.covariance import LedoitWolf, OAS

###############################################################################
n_features = 100
# simulation covariance matrix (AR(1) process)
r = 0.1
real_cov = toeplitz(r ** np.arange(n_features))
coloring_matrix = cholesky(real_cov)

n_samples_range = np.arange(6, 31, 1)
repeat = 100
lw_mse = np.zeros((n_samples_range.size, repeat))
oa_mse = np.zeros((n_samples_range.size, repeat))
lw_shrinkage = np.zeros((n_samples_range.size, repeat))
oa_shrinkage = np.zeros((n_samples_range.size, repeat))
for i, n_samples in enumerate(n_samples_range):
    for j in range(repeat):
        X = np.dot(
            np.random.normal(size=(n_samples, n_features)), coloring_matrix.T)

        lw = LedoitWolf(store_precision=False)
        lw.fit(X, assume_centered=True)
        lw_mse[i, j] = lw.error_norm(real_cov, scaling=False)
        lw_shrinkage[i, j] = lw.shrinkage_

        oa = OAS(store_precision=False)
        oa.fit(X, assume_centered=True)
        oa_mse[i, j] = oa.error_norm(real_cov, scaling=False)
        oa_shrinkage[i, j] = oa.shrinkage_

# plot MSE
pl.subplot(2, 1, 1)
pl.errorbar(n_samples_range, lw_mse.mean(1), yerr=lw_mse.std(1),
            label='Ledoit-Wolf', color='g')
pl.errorbar(n_samples_range, oa_mse.mean(1), yerr=oa_mse.std(1),
            label='OAS', color='r')
pl.ylabel("Squared error")
pl.legend(loc="upper right")
pl.title("Comparison of covariance estimators")
pl.xlim(5, 31)

# plot shrinkage coefficient
pl.subplot(2, 1, 2)
pl.errorbar(n_samples_range, lw_shrinkage.mean(1), yerr=lw_shrinkage.std(1),
            label='Ledoit-Wolf', color='g')
pl.errorbar(n_samples_range, oa_shrinkage.mean(1), yerr=oa_shrinkage.std(1),
            label='OAS', color='r')
pl.xlabel("n_samples")
pl.ylabel("Shrinkage")
pl.legend(loc="lower right")
pl.ylim(pl.ylim()[0], 1. + (pl.ylim()[1] - pl.ylim()[0]) / 10.)
pl.xlim(5, 31)

pl.show()
Figure 2.53: Robust covariance estimation and Mahalanobis distances relevance
Robust covariance estimation and Mahalanobis distances relevance
For Gaussian distributed data, the distance of an observation x_i to the mode of the distribution can be computed using
its Mahalanobis distance: d_(μ,Σ)(x_i)^2 = (x_i - μ)' Σ^(-1) (x_i - μ), where μ and Σ are the location and the covariance of
the underlying Gaussian distribution.
In practice, μ and Σ are replaced by some estimates. The usual covariance maximum likelihood estimate is very
sensitive to the presence of outliers in the data set, and so are the corresponding Mahalanobis distances. One
should rather use a robust estimator of covariance to guarantee that the estimation is resistant to erroneous
observations in the data set and that the associated Mahalanobis distances accurately reflect the true organisation of
the observations.
The Minimum Covariance Determinant estimator is a robust, high-breakdown point estimator of covariance (i.e. it can
be used to estimate the covariance matrix of highly contaminated datasets, up to (n_samples - n_features - 1) / 2
outliers). The idea is to find (n_samples + n_features + 1) / 2 observations whose empirical covariance
has the smallest determinant, yielding a pure subset of observations from which to compute standard estimates of
location and covariance.
The Minimum Covariance Determinant estimator (MCD) was introduced by P. J. Rousseeuw in [1].
This example illustrates how the Mahalanobis distances are affected by outlying data: observations drawn from a
contaminating distribution are not distinguishable from the observations coming from the real, Gaussian distribution
one may want to work with. Using MCD-based Mahalanobis distances, the two populations become distinguishable.
Associated applications are outlier detection, observation ranking, clustering, etc. For visualisation purposes, the
cube root of the Mahalanobis distances is represented in the boxplot, as Wilson and Hilferty suggest [2].
[1] P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square. Proceedings of the National
Academy of Sciences of the United States of America, 17, 684-688.
Python source code: plot_mahalanobis_distances.py
print __doc__

import numpy as np
import pylab as pl

from sklearn.covariance import EmpiricalCovariance, MinCovDet

n_samples = 125
n_outliers = 25
n_features = 2

# generate data
gen_cov = np.eye(n_features)
gen_cov[0, 0] = 2.
X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
# add some outliers
outliers_cov = np.eye(n_features)
outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)

# fit a Minimum Covariance Determinant (MCD) robust estimator to data
robust_cov = MinCovDet().fit(X)

# compare estimators learnt from the full data set with true parameters
emp_cov = EmpiricalCovariance().fit(X)

###############################################################################
# Display results
fig = pl.figure()
pl.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)

# Show data set
subfig1 = pl.subplot(3, 1, 1)
inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
                              color='black', label='inliers')
outlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],
                               color='red', label='outliers')
subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
subfig1.set_title("Mahalanobis distances of a contaminated data set:")

# Show contours of the distance functions
xx, yy = np.meshgrid(np.linspace(pl.xlim()[0], pl.xlim()[1], 100),
                     np.linspace(pl.ylim()[0], pl.ylim()[1], 100))
zz = np.c_[xx.ravel(), yy.ravel()]

mahal_emp_cov = emp_cov.mahalanobis(zz)
mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
                                  cmap=pl.cm.PuBu_r,
                                  linestyles='dashed')

mahal_robust_cov = robust_cov.mahalanobis(zz)
mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
                                 cmap=pl.cm.YlOrBr_r,
                                 linestyles='dotted')

subfig1.legend([emp_cov_contour.collections[1],
                robust_contour.collections[1], inlier_plot, outlier_plot],
               ['MLE dist', 'robust dist', 'inliers', 'outliers'],
               loc="upper right", borderaxespad=0)
pl.xticks(())
pl.yticks(())

# Plot the scores for each point
emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
subfig2 = pl.subplot(2, 2, 3)
subfig2.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)
subfig2.plot(1.26 * np.ones(n_samples - n_outliers),
             emp_mahal[:-n_outliers], '+k', markeredgewidth=1)
subfig2.plot(2.26 * np.ones(n_outliers),
             emp_mahal[-n_outliers:], '+k', markeredgewidth=1)
subfig2.axes.set_xticklabels(('inliers', 'outliers'), size=15)
subfig2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
subfig2.set_title("1. from non-robust estimates\n(Maximum Likelihood)")
pl.yticks(())

robust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)
subfig3 = pl.subplot(2, 2, 4)
subfig3.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],
                widths=.25)
subfig3.plot(1.26 * np.ones(n_samples - n_outliers),
             robust_mahal[:-n_outliers], '+k', markeredgewidth=1)
subfig3.plot(2.26 * np.ones(n_outliers),
             robust_mahal[-n_outliers:], '+k', markeredgewidth=1)
subfig3.axes.set_xticklabels(('inliers', 'outliers'), size=15)
subfig3.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
subfig3.set_title("2. from robust estimates\n(Minimum Covariance Determinant)")
pl.yticks(())

pl.show()
Figure 2.54: Outlier detection with several methods.
Outlier detection with several methods.
This example illustrates two ways of performing Novelty and Outlier Detection when the amount of contamination is
known:
- based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs
better than the One-Class SVM in that case;
- using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the
data is strongly non-Gaussian, i.e. with two well-separated clusters.
The ground truth about inliers and outliers is given by the point colors, while the orange-filled area indicates which
points are reported as outliers by each method.
Here, we assume that we know the fraction of outliers in the datasets. Thus, rather than using the predict method of
the objects, we set the threshold on the decision_function to separate out the corresponding fraction.
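The full source of this example is not reproduced here. The sketch below (an illustration only, with made-up data and an assumed 25% contamination) shows the thresholding trick described above with a One-Class SVM: the threshold is taken as the score at the contamination percentile of decision_function.
# Minimal sketch: flag the known fraction of lowest-scoring points as outliers
import numpy as np
from scipy import stats
from sklearn import svm

np.random.seed(42)
X = np.r_[np.random.randn(100, 2),                       # inliers
          np.random.uniform(-6, 6, size=(25, 2))]        # contamination
outliers_fraction = 0.25

clf = svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, kernel="rbf",
                      gamma=0.1)
clf.fit(X)
scores = clf.decision_function(X).ravel()
# points below the score at the contamination percentile are reported as outliers
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
is_outlier = scores < threshold
print "flagged %d of %d points as outliers" % (is_outlier.sum(), len(X))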
Script output:
Dataset consists of 400 faces
Extracting the top 6 Eigenfaces - RandomizedPCA...
done in 0.491s
Extracting the top 6 Non-negative components - NMF...
done in 2.500s
Extracting the top 6 Independent components - FastICA...
done in 1.971s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 1.763s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 1.350s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.463s
Python source code: plot_faces_decomposition.py
print __doc__

# Authors: Vlad Niculae, Alexandre Gramfort
# License: BSD

import logging
from time import time

from numpy.random import RandomState
import pylab as pl

from sklearn.datasets import fetch_olivetti_faces
from sklearn.cluster import MiniBatchKMeans
from sklearn import decomposition

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)
rng = RandomState(0)

###############################################################################
# Load faces data
dataset = fetch_olivetti_faces(shuffle=True, random_state=rng)
faces = dataset.data

n_samples, n_features = faces.shape

# global centering
faces_centered = faces - faces.mean(axis=0)

# local centering
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)

print "Dataset consists of %d faces" % n_samples


###############################################################################
def plot_gallery(title, images):
    pl.figure(figsize=(2. * n_col, 2.26 * n_row))
    pl.suptitle(title, size=16)
    for i, comp in enumerate(images):
        pl.subplot(n_row, n_col, i + 1)
        vmax = max(comp.max(), -comp.min())
        pl.imshow(comp.reshape(image_shape), cmap=pl.cm.gray,
                  interpolation='nearest',
                  vmin=-vmax, vmax=vmax)
        pl.xticks(())
        pl.yticks(())
    pl.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)

###############################################################################
# List of the different estimators, whether to center and transpose the
# problem, and whether the transformer uses the clustering API.
estimators = [
    ('Eigenfaces - RandomizedPCA',
     decomposition.RandomizedPCA(n_components=n_components, whiten=True),
     True),

    ('Non-negative components - NMF',
     decomposition.NMF(n_components=n_components, init='nndsvda', beta=5.0,
                       tol=5e-3, sparseness='components'),
     False),

    ('Independent components - FastICA',
     decomposition.FastICA(n_components=n_components, whiten=True,
                           max_iter=10),
     True),

    ('Sparse comp. - MiniBatchSparsePCA',
     decomposition.MiniBatchSparsePCA(n_components=n_components, alpha=0.8,
                                      n_iter=100, chunk_size=3,
                                      random_state=rng),
     True),

    ('MiniBatchDictionaryLearning',
     decomposition.MiniBatchDictionaryLearning(n_atoms=15, alpha=0.1,
                                               n_iter=50, chunk_size=3,
                                               random_state=rng),
     True),

    ('Cluster centers - MiniBatchKMeans',
     MiniBatchKMeans(n_clusters=n_components, tol=1e-3, batch_size=20,
                     max_iter=50, random_state=rng),
     True)
]

###############################################################################
# Plot a sample of the input data
plot_gallery("First centered Olivetti faces", faces_centered[:n_components])

###############################################################################
# Do the estimation and plot it
for name, estimator, center in estimators:
    print "Extracting the top %d %s..." % (n_components, name)
    t0 = time()
    data = faces
    if center:
        data = faces_centered
    estimator.fit(data)
    train_time = (time() - t0)
    print "done in %0.3fs" % train_time
    if hasattr(estimator, 'cluster_centers_'):
        components_ = estimator.cluster_centers_
    else:
        components_ = estimator.components_
    plot_gallery('%s - Train time %.1fs' % (name, train_time),
                 components_[:n_components])

pl.show()
Figure 2.61: Blind source separation using FastICA
Blind source separation using FastICA
Independent component analysis (ICA) is used to estimate sources given noisy measurements. Imagine 2 instruments
playing simultaneously and 2 microphones recording the mixed signals. ICA is used to recover the sources, i.e. what is
played by each instrument.
Python source code: plot_ica_blind_source_separation.py
print __doc__

import numpy as np
import pylab as pl

from sklearn.decomposition import FastICA

###############################################################################
# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 10, n_samples)
s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal

S = np.c_[s1, s2]
S += 0.2 * np.random.normal(size=S.shape)  # Add noise

S /= S.std(axis=0)  # Standardize data
# Mix data
A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
X = np.dot(S, A.T)  # Generate observations

# Compute ICA
ica = FastICA()
S_ = ica.fit(X).transform(X)  # Get the estimated sources
A_ = ica.get_mixing_matrix()  # Get estimated mixing matrix
assert np.allclose(X, np.dot(S_, A_.T))

###############################################################################
# Plot results
pl.figure()
pl.subplot(3, 1, 1)
pl.plot(S)
pl.title('True Sources')
pl.subplot(3, 1, 2)
pl.plot(X)
pl.title('Observations (mixed signal)')
pl.subplot(3, 1, 3)
pl.plot(S_)
pl.title('ICA estimated sources')
pl.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.36)
pl.show()
Figure 2.62: FastICA on 2D point clouds
FastICA on 2D point clouds
Illustrate visually the results of Independent component analysis (ICA) vs Principal component analysis (PCA) in the
feature space.
Representing ICA in the feature space gives the view of geometric ICA: ICA is an algorithm that finds directions in
the feature space corresponding to projections with high non-Gaussianity. These directions need not be orthogonal in
the original feature space, but they are orthogonal in the whitened feature space, in which all directions correspond to
the same variance.
PCA, on the other hand, finds orthogonal directions in the raw feature space that correspond to directions accounting
for maximum variance.
Here we simulate independent sources using a highly non-Gaussian process, two Student t variables with a low number
of degrees of freedom (top left figure). We mix them to create observations (top right figure). In this raw observation
space, directions identified by PCA are represented by green vectors. We represent the signal in the PCA space, after
whitening by the variance corresponding to the PCA vectors (lower left). Running ICA corresponds to finding a rotation
in this space to identify the directions of largest non-Gaussianity (lower right).
Python source code: plot_ica_vs_pca.py
print __doc__

# Authors: Alexandre Gramfort, Gael Varoquaux
# License: BSD

import numpy as np
import pylab as pl

from sklearn.decomposition import PCA, FastICA

###############################################################################
# Generate sample data
rng = np.random.RandomState(42)
S = rng.standard_t(1.5, size=(20000, 2))
S[:, 0] *= 2.

# Mix data
A = np.array([[1, 1], [0, 2]])  # Mixing matrix

X = np.dot(S, A.T)  # Generate observations

pca = PCA()
S_pca_ = pca.fit(X).transform(X)

ica = FastICA(random_state=rng)
S_ica_ = ica.fit(X).transform(X)  # Estimate the sources

S_ica_ /= S_ica_.std(axis=0)


###############################################################################
# Plot results
def plot_samples(S, axis_list=None):
    pl.scatter(S[:, 0], S[:, 1], s=2, marker='o', linewidths=0, zorder=10)
    if axis_list is not None:
        colors = [(0, 0.6, 0), (0.6, 0, 0)]
        for color, axis in zip(colors, axis_list):
            axis /= axis.std()
            x_axis, y_axis = axis
            # Trick to get legend to work
            pl.plot(0.1 * x_axis, 0.1 * y_axis, linewidth=2, color=color)
            # pl.quiver(x_axis, y_axis, x_axis, y_axis, zorder=11, width=0.01,
            pl.quiver(0, 0, x_axis, y_axis, zorder=11, width=0.01,
                      scale=6, color=color)

    pl.hlines(0, -3, 3)
    pl.vlines(0, -3, 3)
    pl.xlim(-3, 3)
    pl.ylim(-3, 3)
    pl.xlabel('x')
    pl.ylabel('y')

pl.subplot(2, 2, 1)
plot_samples(S / S.std())
pl.title('True Independent Sources')

axis_list = [pca.components_.T, ica.get_mixing_matrix()]
pl.subplot(2, 2, 2)
plot_samples(X / np.std(X), axis_list=axis_list)
pl.legend(['PCA', 'ICA'], loc='upper left')
pl.title('Observations')

pl.subplot(2, 2, 3)
plot_samples(S_pca_ / np.std(S_pca_, axis=0))
pl.title('PCA scores')

pl.subplot(2, 2, 4)
plot_samples(S_ica_ / np.std(S_ica_))
pl.title('ICA estimated sources')

pl.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.26)

pl.show()
Figure 2.63: Image denoising using dictionary learning
Image denoising using dictionary learning
An example comparing the effect of reconstructing noisy fragments of the Lena image using online Dictionary Learning
and various transform methods.
The dictionary is fitted on the non-distorted left half of the image, and subsequently used to reconstruct the right half.
A common practice for evaluating the results of image denoising is to look at the difference between the reconstruction
and the original image. If the reconstruction is perfect, this difference will look like Gaussian noise.
It can be seen from the plots that the result of Orthogonal Matching Pursuit (OMP) with two non-zero coefficients is
a bit less biased than when keeping only one (the edges look less prominent). It is, in addition, closer to the ground
truth in Frobenius norm.
The result of Least Angle Regression is much more strongly biased: the difference is reminiscent of the local intensity
value of the original image.
Thresholding is clearly not useful for denoising, but it is here to show that it can produce a suggestive output with
very high speed, and thus be useful for other tasks such as object classification, where performance is not necessarily
related to visualisation.
Script output:
Distorting image...
Extracting clean patches...
done in 0.27s.
Learning the dictionary...
done in 9.67s.
Extracting noisy patches...
done in 0.19s.
Orthogonal Matching Pursuit
1 atom ...
done in 6.15s.
Orthogonal Matching Pursuit
2 atoms ...
done in 9.46s.
Least-angle regression
5 atoms ...
done in 50.56s.
Thresholding
alpha=0.1 ...
done in 0.97s.
Python source code: plot_image_denoising.py
print __doc__

from time import time

import pylab as pl
import numpy as np

from scipy.misc import lena

from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.feature_extraction.image import reconstruct_from_patches_2d

###############################################################################
# Load Lena image and extract patches

lena = lena() / 256.0

# downsample for higher speed
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
lena /= 4.0
height, width = lena.shape

# Distort the right half of the image
print 'Distorting image...'
distorted = lena.copy()
distorted[:, height / 2:] += 0.075 * np.random.randn(width, height / 2)

# Extract all clean patches from the left half of the image
print 'Extracting clean patches...'
t0 = time()
patch_size = (7, 7)
data = extract_patches_2d(distorted[:, :height / 2], patch_size)
data = data.reshape(data.shape[0], -1)
data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)
print 'done in %.2fs.' % (time() - t0)

###############################################################################
# Learn the dictionary from clean patches

print 'Learning the dictionary...'
t0 = time()
dico = MiniBatchDictionaryLearning(n_atoms=100, alpha=1, n_iter=500)
V = dico.fit(data).components_
dt = time() - t0
print 'done in %.2fs.' % dt

pl.figure(figsize=(4.2, 4))
for i, comp in enumerate(V[:100]):
    pl.subplot(10, 10, i + 1)
    pl.imshow(comp.reshape(patch_size), cmap=pl.cm.gray_r,
              interpolation='nearest')
    pl.xticks(())
    pl.yticks(())
pl.suptitle('Dictionary learned from Lena patches\n' +
            'Train time %.1fs on %d patches' % (dt, len(data)),
            fontsize=16)
pl.subplots_adjust(0.08, 0.02, 0.92, 0.85, 0.08, 0.23)


###############################################################################
# Display the distorted image

def show_with_diff(image, reference, title):
    """Helper function to display denoising"""
    pl.figure(figsize=(5, 3.3))
    pl.subplot(1, 2, 1)
    pl.title('Image')
    pl.imshow(image, vmin=0, vmax=1, cmap=pl.cm.gray, interpolation='nearest')
    pl.xticks(())
    pl.yticks(())
    pl.subplot(1, 2, 2)
    difference = image - reference

    pl.title('Difference (norm: %.2f)' % np.sqrt(np.sum(difference ** 2)))
    pl.imshow(difference, vmin=-0.5, vmax=0.5, cmap=pl.cm.PuOr,
              interpolation='nearest')
    pl.xticks(())
    pl.yticks(())
    pl.suptitle(title, size=16)
    pl.subplots_adjust(0.02, 0.02, 0.98, 0.79, 0.02, 0.2)

show_with_diff(distorted, lena, 'Distorted image')

###############################################################################
# Extract noisy patches and reconstruct them using the dictionary

print 'Extracting noisy patches...'
t0 = time()
data = extract_patches_2d(distorted[:, height / 2:], patch_size)
data = data.reshape(data.shape[0], -1)
intercept = np.mean(data, axis=0)
data -= intercept
print 'done in %.2fs.' % (time() - t0)

transform_algorithms = [
    ('Orthogonal Matching Pursuit\n1 atom', 'omp',
     {'transform_n_nonzero_coefs': 1}),
    ('Orthogonal Matching Pursuit\n2 atoms', 'omp',
     {'transform_n_nonzero_coefs': 2}),
    ('Least-angle regression\n5 atoms', 'lars',
     {'transform_n_nonzero_coefs': 5}),
    ('Thresholding\n alpha=0.1', 'threshold', {'transform_alpha': .1})]

reconstructions = {}
for title, transform_algorithm, kwargs in transform_algorithms:
    print title, '...'
    reconstructions[title] = lena.copy()
    t0 = time()
    dico.set_params(transform_algorithm=transform_algorithm, **kwargs)
    code = dico.transform(data)
    patches = np.dot(code, V)

    if transform_algorithm == 'threshold':
        patches -= patches.min()
        patches /= patches.max()

    patches += intercept
    patches = patches.reshape(len(data), *patch_size)
    if transform_algorithm == 'threshold':
        patches -= patches.min()
        patches /= patches.max()
    reconstructions[title][:, height / 2:] = reconstruct_from_patches_2d(
        patches, (width, height / 2))
    dt = time() - t0
    print 'done in %.2fs.' % dt
    show_with_diff(reconstructions[title], lena,
                   title + ' (time: %.1fs)' % dt)

pl.show()
Figure 2.64: Kernel PCA
Kernel PCA
This example shows that Kernel PCA is able to find a projection of the data that makes the data linearly separable.
Python source code: plot_kernel_pca.py
print __doc__

# Authors: Mathieu Blondel
#          Andreas Mueller
# License: BSD

import numpy as np
import pylab as pl

from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

np.random.seed(0)

X, y = make_circles(n_samples=400, factor=.3, noise=.05)

kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=10)
X_kpca = kpca.fit_transform(X)
X_back = kpca.inverse_transform(X_kpca)
pca = PCA()
X_pca = pca.fit_transform(X)

# Plot results
pl.figure()
pl.subplot(2, 2, 1, aspect='equal')
pl.title("Original space")
reds = y == 0
blues = y == 1

pl.plot(X[reds, 0], X[reds, 1], "ro")
pl.plot(X[blues, 0], X[blues, 1], "bo")
pl.xlabel("$x_1$")
pl.ylabel("$x_2$")

X1, X2 = np.meshgrid(np.linspace(-1.5, 1.5, 50), np.linspace(-1.5, 1.5, 50))
X_grid = np.array([np.ravel(X1), np.ravel(X2)]).T
# projection on the first principal component (in the phi space)
Z_grid = kpca.transform(X_grid)[:, 0].reshape(X1.shape)
pl.contour(X1, X2, Z_grid, colors='grey', linewidths=1, origin='lower')

pl.subplot(2, 2, 2, aspect='equal')
pl.plot(X_pca[reds, 0], X_pca[reds, 1], "ro")
pl.plot(X_pca[blues, 0], X_pca[blues, 1], "bo")
pl.title("Projection by PCA")
pl.xlabel("1st principal component")
pl.ylabel("2nd component")

pl.subplot(2, 2, 3, aspect='equal')
pl.plot(X_kpca[reds, 0], X_kpca[reds, 1], "ro")
pl.plot(X_kpca[blues, 0], X_kpca[blues, 1], "bo")
pl.title("Projection by KPCA")
pl.xlabel("1st principal component in space induced by $\phi$")
pl.ylabel("2nd component")

pl.subplot(2, 2, 4, aspect='equal')
pl.plot(X_back[reds, 0], X_back[reds, 1], "ro")
pl.plot(X_back[blues, 0], X_back[blues, 1], "bo")
pl.title("Original space after inverse transform")
pl.xlabel("$x_1$")
pl.ylabel("$x_2$")

pl.subplots_adjust(0.02, 0.10, 0.98, 0.94, 0.04, 0.35)

pl.show()
Figure 2.65: Principal Component Analysis
Principal Component Analysis
These figures aid in illustrating how a point cloud can be very flat in one direction, which is where PCA comes in to
choose a direction that is not flat.
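The script listed below this figure is the related PCA vs. LDA comparison on the iris data; the code that produced the flat-point-cloud figures is not reproduced here. The following minimal sketch (made-up anisotropic data, an assumption for illustration only) shows how PCA identifies the flat direction as the component with the smallest explained variance.
# Minimal sketch: a cloud that is flat along one axis, and the PCA directions
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 3D cloud: large spread in x and y, almost no spread ("flat") in z
X = rng.randn(500, 3) * np.array([5., 2., 0.1])

pca = PCA(n_components=3).fit(X)
print "explained variance ratio:", pca.explained_variance_ratio_
print "principal axes:"
print pca.components_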
Script output:
explained variance ratio (first two components): [ 0.92461621 0.05301557]
Python source code: plot_pca_vs_lda.py
print __doc__

import pylab as pl

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.lda import LDA

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LDA(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained for each components
print 'explained variance ratio (first two components):', \
    pca.explained_variance_ratio_

pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('PCA of IRIS dataset')

pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('LDA of IRIS dataset')

pl.show()
Figure 2.68: Sparse coding with a precomputed dictionary
Sparse coding with a precomputed dictionary
Transform a signal as a sparse combination of Ricker wavelets. This example visually compares different sparse coding
methods using the sklearn.decomposition.SparseCoder estimator. The Ricker wavelet (also known as the Mexican
hat or the second derivative of a Gaussian) is not a particularly good kernel to represent piecewise constant signals
like this one. It can therefore be seen how much adding atoms of different widths matters, which motivates
learning a dictionary that best fits your type of signals.
The richer dictionary on the right is not larger in size; heavier subsampling is performed in order to stay on the same
order of magnitude.
Python source code: plot_sparse_coding.py
print __doc__

import numpy as np
import matplotlib.pylab as pl

from sklearn.decomposition import SparseCoder


def ricker_function(resolution, center, width):
    """Discrete sub-sampled Ricker (mexican hat) wavelet"""
    x = np.linspace(0, resolution - 1, resolution)
    x = (2 / ((np.sqrt(3 * width) * np.pi ** 1 / 4))) * (
        1 - ((x - center) ** 2 / width ** 2)) * np.exp(
            (-(x - center) ** 2) / (2 * width ** 2))
    return x


def ricker_matrix(width, resolution, n_atoms):
    """Dictionary of Ricker (mexican hat) wavelets"""
    centers = np.linspace(0, resolution - 1, n_atoms)
    D = np.empty((n_atoms, resolution))
    for i, center in enumerate(centers):
        D[i] = ricker_function(resolution, center, width)
    D /= np.sqrt(np.sum(D ** 2, axis=1))[:, np.newaxis]
    return D


resolution = 1024
subsampling = 3  # subsampling factor
width = 100
n_atoms = resolution / subsampling

# Compute a wavelet dictionary
D_fixed = ricker_matrix(width=width, resolution=resolution, n_atoms=n_atoms)
D_multi = np.r_[tuple(ricker_matrix(width=w, resolution=resolution,
                                    n_atoms=np.floor(n_atoms / 5))
                      for w in (10, 50, 100, 500, 1000))]

# Generate a signal
y = np.linspace(0, resolution - 1, resolution)
first_quarter = y < resolution / 4
y[first_quarter] = 3.
y[np.logical_not(first_quarter)] = -1.

# List the different sparse coding methods in the following format:
# (title, transform_algorithm, transform_alpha, transform_n_nozero_coefs)
estimators = [('OMP', 'omp', None, 15),
              ('Lasso', 'lasso_cd', 2, None),
              ]

pl.figure(figsize=(13, 6))
for subplot, (D, title) in enumerate(zip((D_fixed, D_multi),
                                         ('fixed width', 'multiple widths'))):
    pl.subplot(1, 2, subplot + 1)
    pl.title('Sparse coding against %s dictionary' % title)
    pl.plot(y, ls='dotted', label='Original signal')
    # Do a wavelet approximation
    for title, algo, alpha, n_nonzero in estimators:
        coder = SparseCoder(dictionary=D, transform_n_nonzero_coefs=n_nonzero,
                            transform_alpha=alpha, transform_algorithm=algo)
        x = coder.transform(y)
        density = len(np.flatnonzero(x))
        x = np.ravel(np.dot(x, D))
        squared_error = np.sum((y - x) ** 2)
        pl.plot(x, label='%s: %s nonzero coefs,\n%.2f error' %
                (title, density, squared_error))

    # Soft thresholding debiasing
    coder = SparseCoder(dictionary=D, transform_algorithm='threshold',
                        transform_alpha=20)
    x = coder.transform(y)
    _, idx = np.where(x != 0)
    x[0, idx], _, _, _ = np.linalg.lstsq(D[idx, :].T, y)
    x = np.ravel(np.dot(x, D))
    squared_error = np.sum((y - x) ** 2)
    pl.plot(x,
            label='Thresholding w/ debiasing:\n%d nonzero coefs, %.2f error' %
            (len(idx), squared_error))
    pl.axis('tight')
    pl.legend()
pl.subplots_adjust(.04, .07, .97, .90, .09, .2)
pl.show()
2.1.7 Ensemble methods
Examples concerning the sklearn.ensemble package.
Figure 2.69: Feature importances with forests of trees
Feature importances with forests of trees
This example shows the use of forests of trees to evaluate the importance of features on an artificial classification task.
The red plots are the feature importances of each individual tree, and the blue plot is the feature importance of the
whole forest.
As expected, the knee in the blue plot suggests that 3 features are informative, while the remaining features are not.
Script output:
Feature ranking:
1. feature 1 (0.245865)
2. feature 0 (0.194416)
3. feature 2 (0.174455)
4. feature 7 (0.057138)
5. feature 8 (0.055967)
6. feature 4 (0.055516)
7. feature 5 (0.055179)
8. feature 9 (0.054639)
9. feature 3 (0.053921)
10. feature 6 (0.052904)
Python source code: plot_forest_importances.py
print __doc__

import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              compute_importances=True,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print "Feature ranking:"

for f in xrange(10):
    print "%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]])

# Plot the feature importances of the trees and of the forest
import pylab as pl
pl.figure()
pl.title("Feature importances")

for tree in forest.estimators_:
    pl.plot(xrange(10), tree.feature_importances_[indices], "r")

pl.plot(xrange(10), importances[indices], "b")
pl.show()
Figure 2.70: Pixel importances with a parallel forest of trees
Pixel importances with a parallel forest of trees
This example shows the use of forests of trees to evaluate the importance of the pixels in an image classification task
(faces). The hotter the pixel, the more important it is.
The code below also illustrates how the construction and the computation of the predictions can be parallelized within
multiple jobs.
Script output:
Fitting ExtraTreesClassifier on faces data with 1 cores...
done in 25.886s
Python source code: plot_forest_importances_faces.py
print __doc__

from time import time
import pylab as pl

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Loading the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

mask = y < 5  # Limit to 5 classes
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print "Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              compute_importances=True,
                              n_jobs=n_jobs,
                              random_state=0)

forest.fit(X, y)
print "done in %0.3fs" % (time() - t0)
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances
pl.matshow(importances, cmap=pl.cm.hot)
pl.title("Pixel importances with forests of trees")
pl.show()
Figure 2.71: Plot the decision surfaces of ensembles of trees on the iris dataset
Plot the decision surfaces of ensembles of trees on the iris dataset
Plot the decision surfaces of forests of randomized trees trained on pairs of features of the iris dataset.
This plot compares the decision surfaces learned by a decision tree classifier (first column), by a random forest
classifier (second column) and by an extra-trees classifier (third column).
In the first row, the classifiers are built using the sepal width and the sepal length features only, in the second row
using the petal length and the sepal length only, and in the third row using the petal width and the petal length only.
Python source code: plot_forest_iris.py
print __doc__

import numpy as np
import pylab as pl

from sklearn import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
n_estimators = 30
plot_colors = "bry"
plot_step = 0.02

# Load data
iris = load_iris()

plot_idx = 1

for pair in ([0, 1], [0, 2], [2, 3]):
    for model in (DecisionTreeClassifier(),
                  RandomForestClassifier(n_estimators=n_estimators),
                  ExtraTreesClassifier(n_estimators=n_estimators)):
        # We only take the two corresponding features
        X = iris.data[:, pair]
        y = iris.target

        # Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(13)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std

        # Train
        clf = clone(model)
        clf = model.fit(X, y)

        # Plot the decision boundary
        pl.subplot(3, 3, plot_idx)

        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                             np.arange(y_min, y_max, plot_step))

        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = pl.contourf(xx, yy, Z, cmap=pl.cm.Paired)
        else:
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = pl.contourf(xx, yy, Z, alpha=0.1, cmap=pl.cm.Paired)

        pl.axis("tight")

        # Plot the training points
        for i, c in zip(xrange(n_classes), plot_colors):
            idx = np.where(y == i)
            pl.scatter(X[idx, 0], X[idx, 1], c=c, label=iris.target_names[i],
                       cmap=pl.cm.Paired)

        pl.axis("tight")

        plot_idx += 1

pl.suptitle("Decision surfaces of a decision tree, of a random forest, and of "
            "an extra-trees classifier")
pl.show()
Figure 2.72: Gradient Boosting regression
Gradient Boosting regression
Demonstrate Gradient Boosting on the Boston housing dataset.
This example fits a Gradient Boosting model with least squares loss and 500 regression trees of depth 4.
Script output:
MSE: 6.2736
Python source code: plot_gradient_boosting_regression.py
print __doc__
# Author: Peter Prettenhofer <[email protected]>
#
# License: BSD
import numpy as np
import pylab as pl
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
###############################################################################
840 Chapter 2. Example Gallery
scikit-learn user guide, Release 0.12-git
# Load data
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0]
*
0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
###############################################################################
# Fit regression model
params = {n_estimators: 500, max_depth: 4, min_samples_split: 1,
learn_rate: 0.01, loss: ls}
clf = ensemble.GradientBoostingRegressor(
**
params)
clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)
###############################################################################
# Plot training deviance
# compute test set deviance
test_score = np.zeros((params[n_estimators],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
test_score[i] = clf.loss_(y_test, y_pred)
pl.figure(figsize=(12, 6))
pl.subplot(1, 2, 1)
pl.title(Deviance)
pl.plot(np.arange(params[n_estimators]) + 1, clf.train_score_, b-,
label=Training Set Deviance)
pl.plot(np.arange(params[n_estimators]) + 1, test_score, r-,
label=Test Set Deviance)
pl.legend(loc=upper right)
pl.xlabel(Boosting Iterations)
pl.ylabel(Deviance)
###############################################################################
# Plot feature importance
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
pl.subplot(1, 2, 2)
pl.barh(pos, feature_importance[sorted_idx], align='center')
pl.yticks(pos, boston.feature_names[sorted_idx])
pl.xlabel('Relative Importance')
pl.title('Variable Importance')
pl.show()
Gradient Boosting regularization
Illustration of the effect of different regularization strategies for Gradient Boosting. The example is taken from Hastie
et al. 2009.
Figure 2.73: Gradient Boosting regularization
The loss function used is binomial deviance. In combination with shrinkage, stochastic gradient boosting (Sample 0.5)
can produce more accurate models. Subsampling without shrinkage usually does poorly.
Python source code: plot_gradient_boosting_regularization.py
print __doc__
# Author: Peter Prettenhofer <[email protected]>
#
# License: BSD
import numpy as np
import pylab as pl
from sklearn import ensemble
from sklearn import datasets
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
original_params = {'n_estimators': 1000, 'max_depth': 2, 'random_state': 1,
'min_samples_split': 5}
pl.figure()
for label, color, setting in [('No shrinkage', 'orange',
{'learn_rate': 1.0, 'subsample': 1.0}),
('Shrink=0.1', 'turquoise',
{'learn_rate': 0.1, 'subsample': 1.0}),
('Sample=0.5', 'blue',
{'learn_rate': 1.0, 'subsample': 0.5}),
('Shrink=0.1, Sample=0.5', 'gray',
{'learn_rate': 0.1, 'subsample': 0.5})]:
params = dict(original_params)
params.update(setting)
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
# compute test set deviance
test_deviance = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
test_deviance[i] = clf.loss_(y_test, y_pred)
pl.plot(np.arange(test_deviance.shape[0]) + 1, test_deviance, '-',
color=color, label=label)
pl.title('Deviance')
pl.legend(loc='upper left')
pl.xlabel('Boosting Iterations')
pl.ylabel('Test Set Deviance')
pl.show()
2.1.8 Tutorial exercises
Exercises for the tutorials
Figure 2.74: Cross-validation on diabetes Dataset Exercise
Cross-validation on diabetes Dataset Exercise
This exercise is used in the Cross-validated estimators part of the Model selection: choosing estimators and their
parameters section of the A tutorial on statistical-learning for scientific data processing.
Script output:
[0.10000000000000001, 0.10000000000000001, 0.10000000000000001]
Python source code: plot_cv_diabetes.py
print __doc__
import numpy as np
import pylab as pl
from sklearn import cross_validation, datasets, linear_model
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
lasso = linear_model.Lasso()
alphas = np.logspace(-4, -1, 20)
scores = list()
scores_std = list()
for alpha in alphas:
lasso.alpha = alpha
this_scores = cross_validation.cross_val_score(lasso, X, y, n_jobs=1)
scores.append(np.mean(this_scores))
scores_std.append(np.std(this_scores))
pl.figure(1, figsize=(2.5, 2))
pl.clf()
pl.axes([.1, .25, .8, .7])
pl.semilogx(alphas, scores)
pl.semilogx(alphas, np.array(scores) + np.array(scores_std) / 20, 'b--')
pl.semilogx(alphas, np.array(scores) - np.array(scores_std) / 20, 'b--')
pl.yticks(())
pl.ylabel('CV score')
pl.xlabel('alpha')
pl.axhline(np.max(scores), linestyle='--', color='.5')
pl.text(2e-4, np.max(scores) + 1e-4, '.489')
##############################################################################
# Bonus: how much can you trust the selection of alpha?
k_fold = cross_validation.KFold(len(X), 3)
print [lasso.fit(X[train], y[train]).alpha for train, _ in k_fold]
Figure 2.75: Cross-validation on Digits Dataset Exercise
Cross-validation on Digits Dataset Exercise
This exercise is used in the Cross-validation generators part of the Model selection: choosing estimators and their
parameters section of the A tutorial on statistical-learning for scientific data processing.
Python source code: plot_cv_digits.py
print __doc__
import numpy as np
from sklearn import cross_validation, datasets, svm
digits = datasets.load_digits()
X = digits.data
y = digits.target
svc = svm.SVC()
C_s = np.logspace(1, 10, 10)
scores = list()
scores_std = list()
for C in C_s:
svc.C = C
this_scores = cross_validation.cross_val_score(svc, X, y, n_jobs=1)
scores.append(np.mean(this_scores))
scores_std.append(np.std(this_scores))
import pylab as pl
pl.figure(1, figsize=(2.5, 2))
pl.clf()
pl.axes([.1, .25, .8, .7])
pl.semilogx(C_s, scores)
pl.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
pl.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
pl.yticks(())
pl.ylabel('CV score')
pl.xlabel('Parameter C')
pl.ylim(0, 1.1)
#pl.axhline(np.max(scores), linestyle='--', color='.5')
pl.text(C_s[np.argmax(scores)], .9 * np.max(scores), '%.3f' % np.max(scores),
verticalalignment='top', horizontalalignment='center')
pl.show()
Figure 2.76: Digits Classification Exercise
Digits Classification Exercise
This exercise is used in the Classification part of the Supervised learning: predicting an output variable from high-
dimensional observations section of the A tutorial on statistical-learning for scientific data processing.
Script output:
KNN score: 0.961111
LogisticRegression score: 0.938889
Python source code: plot_digits_classification_exercise.py
print __doc__
from sklearn import datasets, neighbors, linear_model
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_samples = len(X_digits)
X_train = X_digits[:.9 * n_samples]
y_train = y_digits[:.9 * n_samples]
X_test = X_digits[.9 * n_samples:]
y_test = y_digits[.9 * n_samples:]
knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression()
print('KNN score: %f' %
knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f' %
logistic.fit(X_train, y_train).score(X_test, y_test))
Figure 2.77: SVM Exercise
SVM Exercise
This exercise is used in the Using kernels part of the Supervised learning: predicting an output variable from high-
dimensional observations section of the A tutorial on statistical-learning for scientific data processing.
Script output:
Computing regularization path using the lasso...
Computing regularization path using the positive lasso...
Computing regularization path using the elastic net...
Computing regularization path using the positive elastic net...
Python source code: plot_lasso_coordinate_descent_path.py
print __doc__
# Author: Alexandre Gramfort <[email protected]>
# License: BSD Style.
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X /= X.std(0) # Standardize data (easier to set the rho parameter)
###############################################################################
# Compute paths
eps = 5e-3 # the smaller it is the longer is the path
print "Computing regularization path using the lasso..."
models = lasso_path(X, y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
print "Computing regularization path using the positive lasso..."
models = lasso_path(X, y, eps=eps, positive=True)
alphas_positive_lasso = np.array([model.alpha for model in models])
coefs_positive_lasso = np.array([model.coef_ for model in models])
print "Computing regularization path using the elastic net..."
models = enet_path(X, y, eps=eps, rho=0.8)
alphas_enet = np.array([model.alpha for model in models])
coefs_enet = np.array([model.coef_ for model in models])
print "Computing regularization path using the positve elastic net..."
models = enet_path(X, y, eps=eps, rho=0.8, positive=True)
alphas_positive_enet = np.array([model.alpha for model in models])
coefs_positive_enet = np.array([model.coef_ for model in models])
###############################################################################
# Display results
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_lasso)
l2 = pl.plot(coefs_enet, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Lasso and Elastic-Net Paths')
pl.legend((l1[-1], l2[-1]), ('Lasso', 'Elastic-Net'), loc='lower left')
pl.axis('tight')
pl.figure(2)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_lasso)
l2 = pl.plot(coefs_positive_lasso, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Lasso and positive Lasso')
pl.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')
pl.axis('tight')
pl.figure(3)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_enet)
l2 = pl.plot(coefs_positive_enet, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Elastic-Net and positive Elastic-Net')
pl.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'),
loc='lower left')
pl.axis('tight')
pl.show()
Figure 2.86: Lasso path using LARS
Lasso path using LARS
Computes the Lasso path along the regularization parameter using the LARS algorithm on the diabetes dataset. Each
color represents a different feature of the coefficient vector, and this is displayed as a function of the regularization
parameter.
Script output:
Computing regularization path using the LARS ...
.
Python source code: plot_lasso_lars.py
print __doc__
# Author: Fabian Pedregosa <[email protected]>
# Alexandre Gramfort <[email protected]>
# License: BSD Style.
import numpy as np
import pylab as pl
from sklearn import linear_model
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
print "Computing regularization path using the LARS ..."
alphas, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]
pl.plot(xx, coefs.T)
ymin, ymax = pl.ylim()
pl.vlines(xx, ymin, ymax, linestyle='dashed')
pl.xlabel('|coef| / max|coef|')
pl.ylabel('Coefficients')
pl.title('LASSO Path')
pl.axis('tight')
pl.show()
Figure 2.87: Lasso model selection: Cross-Validation / AIC / BIC
Lasso model selection: Cross-Validation / AIC / BIC
Use the Akaike information criterion (AIC), the Bayes Information criterion (BIC) and cross-validation to select an
optimal value of the regularization parameter alpha of the Lasso estimator.
Results obtained with LassoLarsIC are based on AIC/BIC criteria.
Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom, is
derived for large samples (asymptotic results) and assumes the model is correct, i.e. that the data are actually generated
by this model. It also tends to break down when the problem is badly conditioned (more features than samples).
For cross-validation, we use 20-fold with 2 algorithms to compute the Lasso path: coordinate descent, as implemented
by the LassoCV class, and Lars (least angle regression) as implemented by the LassoLarsCV class. Both algorithms
give roughly the same results. They differ with regards to their execution speed and sources of numerical errors.
Lars computes a path solution only for each kink in the path. As a result, it is very efficient when there are only a
few kinks, which is the case if there are few features or samples. It is also able to compute the full path without
setting any meta-parameter. In contrast, coordinate descent computes the path points on a pre-specified grid (here
we use the default). Thus it is more efficient if the number of grid points is smaller than the number of kinks in the
path. Such a strategy can be interesting if the number of features is really large and there are enough samples to select
a large amount. In terms of numerical errors, for heavily correlated variables, Lars will accumulate more errors, while
the coordinate descent algorithm will only sample the path on a grid.
Note how the optimal value of alpha varies for each fold. This illustrates why nested cross-validation is necessary
when trying to evaluate the performance of a method for which a parameter is chosen by cross-validation: this choice
of parameter may not be optimal for unseen data.
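A minimal sketch of such a nested cross-validation loop, written against the same 0.12-era cross_validation API used throughout these examples (the fold counts and variable names are illustrative, not part of the example above): the inner LassoCV selects alpha on the training folds only, while the outer loop measures generalization.
import numpy as np
from sklearn import cross_validation, datasets
from sklearn.linear_model import LassoCV

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Outer loop: held-out evaluation; inner loop: alpha chosen by LassoCV on the
# training folds only, so the reported score is not biased by that choice.
outer_scores = []
for train, test in cross_validation.KFold(len(y), 5):
    model = LassoCV(cv=5).fit(X[train], y[train])
    outer_scores.append(model.score(X[test], y[test]))
print "Nested CV R^2: %.3f (+/- %.3f)" % (np.mean(outer_scores), np.std(outer_scores))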
Script output:
Computing regularization path using the coordinate descent lasso...
Computing regularization path using the Lars lasso...
Python source code: plot_lasso_model_selection.py
print __doc__
# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort
# License: BSD Style.
import time
import numpy as np
import pylab as pl
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features
# normalize data as done by Lars to allow for comparison
X /= np.sqrt(np.sum(X ** 2, axis=0))
##############################################################################
# LassoLarsIC: least angle regression with BIC/AIC criterion
model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_
def plot_ic_criterion(model, name, color):
alpha_ = model.alpha_
alphas_ = model.alphas_
criterion_ = model.criterion_
pl.plot(-np.log10(alphas_), criterion_, '--', color=color,
linewidth=3, label='%s criterion' % name)
pl.axvline(-np.log10(alpha_), color=color,
linewidth=3, label='alpha: %s estimate' % name)
pl.xlabel('-log(lambda)')
pl.ylabel('criterion')
pl.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
pl.legend()
pl.title('Information-criterion for model selection (training time %.3fs)'
% t_bic)
##############################################################################
# LassoCV: coordinate descent
# Compute paths
print "Computing regularization path using the coordinate descent lasso..."
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1
# Display results
m_log_alphas = -np.log10(model.alphas)
pl.figure()
ymin, ymax = 2300, 3800
pl.plot(m_log_alphas, model.mse_path_, ':')
pl.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
pl.axvline(-np.log10(model.alpha), linestyle='--', color='k',
label='alpha: CV estimate')
pl.legend()
pl.xlabel('-log(lambda)')
pl.ylabel('Mean square error')
pl.title('Mean square error on each fold: coordinate descent '
'(train time: %.2fs)' % t_lasso_cv)
pl.axis('tight')
pl.ylim(ymin, ymax)
##############################################################################
# LassoLarsCV: least angle regression
# Compute paths
print "Computing regularization path using the Lars lasso..."
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1
# Display results
m_log_alphas = -np.log10(model.cv_alphas)
pl.figure()
pl.plot(m_log_alphas, model.cv_mse_path_, ':')
pl.plot(m_log_alphas, model.cv_mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
pl.axvline(-np.log10(model.alpha), linestyle='--', color='k',
label='alpha CV')
pl.legend()
pl.xlabel('-log(lambda)')
pl.ylabel('Mean square error')
pl.title('Mean square error on each fold: Lars (train time: %.2fs)' %
t_lasso_lars_cv)
pl.axis('tight')
pl.ylim(ymin, ymax)
pl.show()
Figure 2.88: Logit function
Logit function
Shown in the plot is how logistic regression would, on this synthetic dataset, classify values as either 0 or 1, i.e. class
one or two, using the logit curve.
Python source code: plot_logistic.py
print __doc__
# Code source: Gael Varoquaux
# License: BSD
import numpy as np
import pylab as pl
from sklearn import linear_model
# this is our test set, it's just a straight line with some
# Gaussian noise
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(np.float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)
X = X[:, np.newaxis]
# run the classifier
clf = linear_model.LogisticRegression(C=1e5)
clf.fit(X, y)
# and plot the result
pl.figure(1, figsize=(4, 3))
pl.clf()
pl.scatter(X.ravel(), y, color='black', zorder=20)
X_test = np.linspace(-5, 10, 300)
def model(x):
return 1 / (1 + np.exp(-x))
loss = model(X_test * clf.coef_ + clf.intercept_).ravel()
pl.plot(X_test, loss, color='blue', linewidth=3)
ols = linear_model.LinearRegression()
ols.fit(X, y)
pl.plot(X_test, ols.coef_ * X_test + ols.intercept_, linewidth=1)
pl.axhline(.5, color='.5')
pl.ylabel('y')
pl.xlabel('X')
pl.xticks(())
pl.yticks(())
pl.ylim(-.25, 1.25)
pl.xlim(-4, 10)
pl.show()
Figure 2.89: L1 Penalty and Sparsity in Logistic Regression
L1 Penalty and Sparsity in Logistic Regression
Comparison of the sparsity (percentage of zero coefficients) of solutions when L1 and L2 penalty are used for different
values of C. We can see that large values of C give more freedom to the model. Conversely, smaller values of C
constrain the model more. In the L1 penalty case, this leads to sparser solutions.
We classify 8x8 images of digits into two classes: 0-4 against 5-9. The visualization shows coefficients of the models
for varying C.
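As a standalone sketch of that relationship (the values of C here are illustrative, not the ones used in the example below), the zero coefficients can simply be counted after fitting an L1-penalized model:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
X, y = digits.data, (digits.target > 4).astype(np.int)

# Smaller C means stronger regularization, hence more coefficients driven to zero.
for C in (0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, penalty='l1', tol=0.01).fit(X, y)
    print "C=%g, zero coefficients: %.1f%%" % (C, 100 * np.mean(clf.coef_ == 0))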
Script output:
C=10.000000
Sparsity with L1 penalty: 6.250000
score with L1 penalty: 0.910406
Sparsity with L2 penalty: 4.687500
score with L2 penalty: 0.909293
C=100.000000
Sparsity with L1 penalty: 4.687500
score with L1 penalty: 0.908737
Sparsity with L2 penalty: 4.687500
score with L2 penalty: 0.909850
C=1000.000000
Sparsity with L1 penalty: 4.687500
score with L1 penalty: 0.910406
Sparsity with L2 penalty: 4.687500
score with L2 penalty: 0.909850
Python source code: plot_logistic_l1_l2_sparsity.py
print __doc__
# Authors: Alexandre Gramfort <[email protected]>
# Mathieu Blondel <[email protected]>
# Andreas Mueller <[email protected]>
# License: BSD Style.
import numpy as np
import pylab as pl
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import Scaler
digits = datasets.load_digits()
X, y = digits.data, digits.target
X = Scaler().fit_transform(X)
# classify small against large digits
y = (y > 4).astype(np.int)
# Set regularization parameter
for i, C in enumerate(10. ** np.arange(1, 4)):
# turn down tolerance for short training time
clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01)
clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01)
clf_l1_LR.fit(X, y)
clf_l2_LR.fit(X, y)
coef_l1_LR = clf_l1_LR.coef_.ravel()
coef_l2_LR = clf_l2_LR.coef_.ravel()
# coef_l1_LR contains zeros due to the
# L1 sparsity inducing norm
sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100
print "C=%f" % C
print "Sparsity with L1 penalty: %f" % sparsity_l1_LR
print "score with L1 penalty: %f" % clf_l1_LR.score(X, y)
print "Sparsity with L2 penalty: %f" % sparsity_l2_LR
print "score with L2 penalty: %f" % clf_l2_LR.score(X, y)
l1_plot = pl.subplot(3, 2, 2 * i + 1)
l2_plot = pl.subplot(3, 2, 2 * (i + 1))
if i == 0:
l1_plot.set_title("L1 penalty")
l2_plot.set_title("L2 penalty")
l1_plot.imshow(np.abs(coef_l1_LR.reshape(8, 8)), interpolation='nearest',
cmap='binary', vmax=1, vmin=0)
l2_plot.imshow(np.abs(coef_l2_LR.reshape(8, 8)), interpolation='nearest',
cmap='binary', vmax=1, vmin=0)
pl.text(-8, 3, "C = %d" % C)
l1_plot.set_xticks(())
l1_plot.set_yticks(())
l2_plot.set_xticks(())
l2_plot.set_yticks(())
pl.show()
Figure 2.90: Path with L1- Logistic Regression
Path with L1- Logistic Regression
Computes path on IRIS dataset.
Script output:
Computing regularization path ...
This took 0:00:00.024074
Python source code: plot_logistic_path.py
print __doc__
# Author: Alexandre Gramfort <[email protected]>
# License: BSD Style.
from datetime import datetime
import numpy as np
import pylab as pl
from sklearn import linear_model
from sklearn import datasets
from sklearn.svm import l1_min_c
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 2]
y = y[y != 2]
X -= np.mean(X, 0)
###############################################################################
# Demo path functions
cs = l1_min_c(X, y, loss='log') * np.logspace(0, 3)
print "Computing regularization path ..."
start = datetime.now()
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
coefs_ = []
for c in cs:
clf.set_params(C=c)
clf.fit(X, y)
coefs_.append(clf.coef_.ravel().copy())
print "This took ", datetime.now() - start
coefs_ = np.array(coefs_)
pl.plot(np.log10(cs), coefs_)
ymin, ymax = pl.ylim()
pl.xlabel('log(C)')
pl.ylabel('Coefficients')
pl.title('Logistic Regression Path')
pl.axis('tight')
pl.show()
Figure 2.91: Linear Regression Example
Linear Regression Example
This example uses only the first feature of the diabetes dataset, in order to illustrate a two-dimensional plot of
this regression technique. The straight line can be seen in the plot, showing how linear regression attempts to draw a
straight line that will best minimize the residual sum of squares between the observed responses in the dataset and the
responses predicted by the linear approximation.
The coefficients, the residual sum of squares and the variance score are also calculated.
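For reference, the variance score reported below is the coefficient of determination R^2 = 1 - SS_res / SS_tot; a small hand-computed sketch (the helper name is made up for illustration, it is not part of the example):
import numpy as np

def variance_score(y_true, y_pred):
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1. - ss_res / ss_tot

# Perfect predictions give 1.0; always predicting the mean gives 0.0.
print variance_score(np.array([1., 2., 3.]), np.array([1., 2., 3.]))
print variance_score(np.array([1., 2., 3.]), np.array([2., 2., 2.]))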
Script output:
Coefficients:
[ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47
Python source code: plot_ols.py
print __doc__
# Code source: Jaques Grobler
# License: BSD
import pylab as pl
import numpy as np
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis]
diabetes_X_temp = diabetes_X[:, :, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X_temp[:-20]
diabetes_X_test = diabetes_X_temp[-20:]
from sklearn.datasets.samples_generator import make_regression
# this is our test set, it's just a straight line with some
# Gaussian noise
X, Y = make_regression(n_samples=100, n_features=1, n_informative=1,\
random_state=0, noise=35)
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print 'Coefficients: \n', regr.coef_
# The mean square error
print ("Residual sum of squares: %.2f" %
np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print ('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
# Plot outputs
pl.scatter(diabetes_X_test, diabetes_y_test, color='black')
pl.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
linewidth=3)
pl.xticks(())
pl.yticks(())
pl.show()
Figure 2.92: Sparsity Example: Fitting only features 1 and 2
Sparsity Example: Fitting only features 1 and 2
Features 1 and 2 of the diabetes dataset are fitted and plotted below. It illustrates that although feature 2 has a strong
coefficient on the full model, it does not give us much regarding y when compared to just feature 1.
Script output:
Computing random projection
Computing PCA projection
Computing LDA projection
Computing Isomap embedding
Done.
Computing LLE embedding
Done. Reconstruction error: 1.28555e-06
Computing modified LLE embedding
Done. Reconstruction error: 0.359782
Computing Hessian LLE embedding
Done. Reconstruction error: 0.212118
Computing LTSA embedding
Done. Reconstruction error: 0.212075
Computing MDS embedding
Done. Stress: 143525262.393712
Python source code: plot_lle_digits.py
# Authors: Fabian Pedregosa <[email protected]>
# Olivier Grisel <[email protected]>
# Mathieu Blondel <[email protected]>
# License: BSD, (C) INRIA 2011
print __doc__
from time import time
import numpy as np
import pylab as pl
from matplotlib import offsetbox
from sklearn.utils.fixes import qr_economic
from sklearn import manifold, datasets, decomposition, lda
from sklearn.metrics import euclidean_distances
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
#----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
x_min, x_max = np.min(X, 0), np.max(X, 0)
X = (X - x_min) / (x_max - x_min)
pl.figure()
ax = pl.subplot(111)
for i in range(X.shape[0]):
pl.text(X[i, 0], X[i, 1], str(digits.target[i]),
color=pl.cm.Set1(y[i] / 10.),
fontdict={'weight': 'bold', 'size': 9})
if hasattr(offsetbox, 'AnnotationBbox'):
# only print thumbnails with matplotlib > 1.0
shown_images = np.array([[1., 1.]]) # just something big
for i in range(digits.data.shape[0]):
dist = np.sum((X[i] - shown_images) ** 2, 1)
if np.min(dist) < 4e-3:
# don't show points that are too close
continue
shown_images = np.r_[shown_images, [X[i]]]
imagebox = offsetbox.AnnotationBbox(
offsetbox.OffsetImage(digits.images[i], cmap=pl.cm.gray_r),
X[i])
ax.add_artist(imagebox)
pl.xticks([]), pl.yticks([])
if title is not None:
pl.title(title)
#----------------------------------------------------------------------
# Plot images of the digits
N = 20
img = np.zeros((10 * N, 10 * N))
for i in range(N):
ix = 10 * i + 1
for j in range(N):
iy = 10 * j + 1
img[ix:ix + 8, iy:iy + 8] = X[i * N + j].reshape((8, 8))
pl.imshow(img, cmap=pl.cm.binary)
pl.xticks([])
pl.yticks([])
pl.title('A selection from the 64-dimensional digits dataset')
#----------------------------------------------------------------------
# Random 2D projection using a random unitary matrix
print "Computing random projection"
rng = np.random.RandomState(42)
Q, _ = qr_economic(rng.normal(size=(n_features, 2)))
X_projected = np.dot(Q.T, X.T).T
plot_embedding(X_projected, "Random Projection of the digits")
#----------------------------------------------------------------------
# Projection on to the first 2 principal components
print "Computing PCA projection"
t0 = time()
X_pca = decomposition.RandomizedPCA(n_components=2).fit_transform(X)
plot_embedding(X_pca,
"Principal Components projection of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# Projection on to the first 2 linear discriminant components
print "Computing LDA projection"
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01 # Make X invertible
t0 = time()
X_lda = lda.LDA(n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
"Linear Discriminant projection of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# Isomap projection of the digits dataset
print "Computing Isomap embedding"
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print "Done."
plot_embedding(X_iso,
"Isomap projection of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# Locally linear embedding of the digits dataset
print "Computing LLE embedding"
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print "Done. Reconstruction error: %g" % clf.reconstruction_error_
plot_embedding(X_lle,
"Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# Modified Locally linear embedding of the digits dataset
print "Computing modified LLE embedding"
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print "Done. Reconstruction error: %g" % clf.reconstruction_error_
plot_embedding(X_mlle,
"Modified Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# HLLE embedding of the digits dataset
print "Computing Hessian LLE embedding"
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print "Done. Reconstruction error: %g" % clf.reconstruction_error_
plot_embedding(X_hlle,
"Hessian Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# LTSA embedding of the digits dataset
print "Computing LTSA embedding"
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print "Done. Reconstruction error: %g" % clf.reconstruction_error_
plot_embedding(X_ltsa,
"Local Tangent Space Alignment of the digits (time %.2fs)" %
(time() - t0))
#----------------------------------------------------------------------
# MDS embedding of the digits dataset
print "Computing MDS embedding"
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(euclidean_distances(X))
print "Done. Stress: %f" % clf.stress_
plot_embedding(X_mds,
"MDS embedding of the digits (time %.2fs)" %
(time() - t0))
pl.show()
Figure 2.108: Multi-dimensional scaling
Multi-dimensional scaling
An illustration of the metric and non-metric MDS on generated noisy data.
The reconstructed points using the metric MDS and non metric MDS are slightly shifted to avoid overlapping.
Python source code: plot_mds.py
# Author: Nelle Varoquaux <[email protected]>
# Licence: BSD
print __doc__
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.collections import LineCollection
from sklearn import manifold
from sklearn.metrics import euclidean_distances
from sklearn.decomposition import PCA
n_samples = 20
seed = np.random.RandomState(seed=3)
X_true = seed.randint(0, 20, 2 * n_samples).astype(np.float)
X_true = X_true.reshape((n_samples, 2))
# Center the data
X_true -= X_true.mean()
similarities = euclidean_distances(X_true)
# Add noise to the similarities
noise = np.random.rand(n_samples, n_samples)
noise = noise + noise.T
noise[np.arange(noise.shape[0]), np.arange(noise.shape[0])] = 0
similarities += noise
mds = manifold.MDS(n_components=2, max_iter=3000,
eps=1e-9, random_state=seed,
n_jobs=1)
pos = mds.fit(similarities).embedding_
nmds = manifold.MDS(n_components=2, metric=False,
max_iter=3000,
eps=1e-9, random_state=seed, n_jobs=1)
npos = nmds.fit_transform(similarities)
# Rotate the data
clf = PCA(n_components=2)
X_true = clf.fit_transform(X_true)
pos = clf.fit_transform(pos)
npos = clf.fit_transform(npos)
fig = plt.figure(1)
ax = plt.axes([0., 0., 1., 1.])
plt.scatter(X_true[:, 0], X_true[:, 1], c='r', s=20)
plt.scatter(pos[:, 0] + 0.2, pos[:, 1] + 0.2, s=20, c='g')
plt.scatter(npos[:, 0] - 0.2, npos[:, 1] - 0.2, s=20, c='b')
plt.legend(('True position', 'MDS', 'NMDS'), loc='best')
similarities = similarities.max() / similarities * 100
similarities[np.isinf(similarities)] = 0
# Plot the edges
start_idx, end_idx = np.where(pos)
# a sequence of (*line0*, *line1*, *line2*), where::
#     linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[pos[i, :], pos[j, :]]
for i in range(len(pos)) for j in range(len(pos))]
values = np.abs(similarities)
lc = LineCollection(segments,
zorder=0, cmap=plt.cm.hot_r,
norm=plt.Normalize(0, values.max()))
lc.set_array(similarities.flatten())
lc.set_linewidths(0.5 * np.ones(len(segments)))
ax.add_collection(lc)
plt.show()
Figure 2.109: Swiss Roll reduction with LLE
Swiss Roll reduction with LLE
An illustration of Swiss Roll reduction with locally linear embedding
Script output:
Computing LLE embedding
Done. Reconstruction error: 9.68564e-08
Python source code: plot_swissroll.py
# Author: Fabian Pedregosa -- <[email protected]>
# License: BSD, (C) INRIA 2011
print __doc__
import pylab as pl
# This import is needed to modify the way figure behaves
from mpl_toolkits.mplot3d import Axes3D
Axes3D
#----------------------------------------------------------------------
# Locally linear embedding of the swiss roll
from sklearn import manifold, datasets
X, color = datasets.samples_generator.make_swiss_roll(n_samples=1500)
print "Computing LLE embedding"
X_r, err = manifold.locally_linear_embedding(X, n_neighbors=12,
n_components=2)
print "Done. Reconstruction error: %g" % err
#----------------------------------------------------------------------
# Plot result
fig = pl.figure()
try:
# compatibility matplotlib < 1.0
ax = fig.add_subplot(211, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=pl.cm.Spectral)
except:
ax = fig.add_subplot(211)
ax.scatter(X[:, 0], X[:, 2], c=color, cmap=pl.cm.Spectral)
ax.set_title("Original data")
ax = fig.add_subplot(212)
ax.scatter(X_r[:, 0], X_r[:, 1], c=color, cmap=pl.cm.Spectral)
pl.axis('tight')
pl.xticks([]), pl.yticks([])
pl.title('Projected data')
pl.show()
2.1.12 Gaussian Mixture Models
Examples concerning the sklearn.mixture package.
Figure 2.110: Gaussian Mixture Model Ellipsoids
Gaussian Mixture Model Ellipsoids
Plot the confidence ellipsoids of a mixture of two Gaussians with EM and a variational Dirichlet process.
Both models have access to five components with which to fit the data. Note that the EM model will necessarily use all
five components while the DP model will effectively only use as many as are needed for a good fit. This is a property
of the Dirichlet process prior. Here we can see that the EM model splits some components arbitrarily, because it is
trying to fit too many components, while the Dirichlet process model adapts its number of states automatically.
This example doesn't show it, as we're in a low-dimensional space, but another advantage of the Dirichlet process
model is that it can fit full covariance matrices effectively even when there are fewer examples per cluster than there are
dimensions in the data, due to regularization properties of the inference algorithm.
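A quick way to see this effect is to fit a DPGMM on clearly clustered data and count how many of its components actually receive samples; a self-contained sketch, separate from the example below (the data and settings are illustrative):
import numpy as np
from sklearn import mixture

np.random.seed(0)
# Two well separated blobs, but the model is given five components.
X = np.r_[np.random.randn(200, 2), np.random.randn(200, 2) + 5]

dpgmm = mixture.DPGMM(n_components=5, covariance_type='full')
dpgmm.fit(X)
# The Dirichlet process prior leaves the unneeded components unused.
used = np.unique(dpgmm.predict(X))
print "components actually used: %d of %d" % (len(used), 5)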
Python source code: plot_gmm.py
import itertools
import numpy as np
from scipy import linalg
import pylab as pl
import matplotlib as mpl
from sklearn import mixture
# Number of samples per component
n_samples = 500
# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
.7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]
# Fit a mixture of gaussians with EM using five components
gmm = mixture.GMM(n_components=5, covariance_type='full')
gmm.fit(X)
# Fit a dirichlet process mixture of gaussians using five components
dpgmm = mixture.DPGMM(n_components=5, covariance_type='full')
dpgmm.fit(X)
color_iter = itertools.cycle(['r', 'g', 'b', 'c', 'm'])
for i, (clf, title) in enumerate([(gmm, 'GMM'),
(dpgmm, 'Dirichlet Process GMM')]):
splot = pl.subplot(2, 1, 1 + i)
Y_ = clf.predict(X)
for i, (mean, covar, color) in enumerate(zip(
clf.means_, clf._get_covars(), color_iter)):
v, w = linalg.eigh(covar)
u = w[0] / linalg.norm(w[0])
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant
# components.
if not np.any(Y_ == i):
continue
pl.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
# Plot an ellipse to show the Gaussian component
angle = np.arctan(u[1] / u[0])
angle = 180 * angle / np.pi  # convert to degrees
ell = mpl.patches.Ellipse(mean, v[0], v[1], 180 + angle, color=color)
ell.set_clip_box(splot.bbox)
ell.set_alpha(0.5)
splot.add_artist(ell)
pl.xlim(-10, 10)
pl.ylim(-3, 6)
pl.xticks(())
pl.yticks(())
pl.title(title)
pl.show()
Figure 2.111: GMM classification
GMM classification
Demonstration of Gaussian mixture models for classification.
Plots predicted labels on both training and held out test data using a variety of GMM classifiers on the iris dataset.
Compares GMMs with spherical, diagonal, full, and tied covariance matrices in increasing order of performance.
Although one would expect full covariance to perform best in general, it is prone to overfitting on small datasets and
does not generalize well to held out test data.
On the plots, train data is shown as dots, while test data is shown as crosses. The iris dataset is four-dimensional. Only
the first two dimensions are shown here, and thus some points are separated in other dimensions.
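The covariance types differ mainly in how many free covariance parameters they estimate; a back-of-the-envelope sketch for k components in d dimensions (the helper function is illustrative, not part of the example):
def n_covariance_params(covariance_type, k, d):
    # spherical: one variance per component; diag: d variances per component;
    # tied: one full covariance shared by all; full: one full covariance each.
    return {'spherical': k,
            'diag': k * d,
            'tied': d * (d + 1) // 2,
            'full': k * d * (d + 1) // 2}[covariance_type]

for ct in ('spherical', 'diag', 'tied', 'full'):
    print ct, n_covariance_params(ct, k=3, d=4)  # iris: 3 classes, 4 features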
Python source code: plot_gmm_classifier.py
print __doc__
# Author: Ron Weiss <[email protected]>, Gael Varoquaux
# License: BSD Style.
# $Id$
import pylab as pl
import matplotlib as mpl
import numpy as np
from sklearn import datasets
from sklearn.cross_validation import StratifiedKFold
from sklearn.mixture import GMM
def make_ellipses(gmm, ax):
for n, color in enumerate('rgb'):
v, w = np.linalg.eigh(gmm._get_covars()[n][:2, :2])
u = w[0] / np.linalg.norm(w[0])
angle = np.arctan2(u[1], u[0])
angle = 180 * angle / np.pi  # convert to degrees
v *= 9
ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
180 + angle, color=color)
ell.set_clip_box(ax.bbox)
ell.set_alpha(0.5)
ax.add_artist(ell)
iris = datasets.load_iris()
# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = StratifiedKFold(iris.target, k=4)
# Only take the first fold.
train_index, test_index = skf.__iter__().next()
X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]
n_classes = len(np.unique(y_train))
# Try GMMs using different types of covariances.
classifiers = dict((covar_type, GMM(n_components=n_classes,
covariance_type=covar_type, init_params='wc', n_iter=20))
for covar_type in ['spherical', 'diag', 'tied', 'full'])
n_classifiers = len(classifiers)
pl.figure(figsize=(3 * n_classifiers / 2, 6))
pl.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
left=.01, right=.99)
for index, (name, classifier) in enumerate(classifiers.iteritems()):
# Since we have class labels for the training data, we can
# initialize the GMM parameters in a supervised manner.
classifier.means_ = np.array([X_train[y_train == i].mean(axis=0)
for i in xrange(n_classes)])
# Train the other parameters using the EM algorithm.
classifier.fit(X_train)
h = pl.subplot(2, n_classifiers / 2, index + 1)
make_ellipses(classifier, h)
for n, color in enumerate('rgb'):
data = iris.data[iris.target == n]
pl.scatter(data[:, 0], data[:, 1], 0.8, color=color,
label=iris.target_names[n])
# Plot the test data with crosses
for n, color in enumerate('rgb'):
data = X_test[y_test == n]
pl.plot(data[:, 0], data[:, 1], 'x', color=color)
y_train_pred = classifier.predict(X_train)
train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
pl.text(0.05, 0.9, 'Train accuracy: %.1f' % train_accuracy,
transform=h.transAxes)
y_test_pred = classifier.predict(X_test)
test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
pl.text(0.05, 0.8, 'Test accuracy: %.1f' % test_accuracy,
transform=h.transAxes)
pl.xticks(())
pl.yticks(())
pl.title(name)
pl.legend(loc='lower right', prop=dict(size=12))
pl.show()
Figure 2.112: Density Estimation for a mixture of Gaussians
Density Estimation for a mixture of Gaussians
Plot the density estimation of a mixture of two Gaussians. Data is generated from two Gaussians with different centers
and covariance matrices.
Python source code: plot_gmm_pdf.py
import numpy as np
import pylab as pl
from sklearn import mixture
n_samples = 300
# generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.7], [3.5, .7]])
X_train = np.r_[np.dot(np.random.randn(n_samples, 2), C),
np.random.randn(n_samples, 2) + np.array([20, 20])]
clf = mixture.GMM(n_components=2, covariance_type='full')
clf.fit(X_train)
x = np.linspace(-20.0, 30.0)
y = np.linspace(-20.0, 40.0)
X, Y = np.meshgrid(x, y)
XX = np.c_[X.ravel(), Y.ravel()]
Z = np.log(-clf.eval(XX)[0])
Z = Z.reshape(X.shape)
CS = pl.contour(X, Y, Z)
CB = pl.colorbar(CS, shrink=0.8, extend='both')
pl.scatter(X_train[:, 0], X_train[:, 1], .8)
pl.axis('tight')
pl.show()
Figure 2.113: Gaussian Mixture Model Selection
Gaussian Mixture Model Selection
This example shows that model selection can be performed with Gaussian Mixture Models using information-theoretic
criteria (BIC). Model selection concerns both the covariance type and the number of components in the model. In that
case, AIC also provides the right result (not shown to save time), but BIC is better suited if the problem is to identify
the right model. Unlike Bayesian procedures, such inferences are prior-free.
In that case, the model with 2 components and full covariance (which corresponds to the true generative model) is
selected.
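The AIC comparison mentioned above (not shown in the example) can be obtained the same way, assuming the aic method available alongside bic on the GMM object; a short standalone sketch on data generated like the example's (the sample size and component range are illustrative):
import numpy as np
from sklearn import mixture

np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(300, 2), C),
          .7 * np.random.randn(300, 2) + np.array([-6, 3])]

# Both criteria should reach their minimum near the true number of components (2).
for n_components in range(1, 5):
    gmm = mixture.GMM(n_components=n_components, covariance_type='full')
    gmm.fit(X)
    print "n=%d  BIC=%.1f  AIC=%.1f" % (n_components, gmm.bic(X), gmm.aic(X))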
Python source code: plot_gmm_selection.py
print __doc__
import itertools
import numpy as np
from scipy import linalg
import pylab as pl
import matplotlib as mpl
from sklearn import mixture
# Number of samples per component
n_samples = 500
# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
.7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]
lowest_bic = np.infty
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
for n_components in n_components_range:
# Fit a mixture of gaussians with EM
gmm = mixture.GMM(n_components=n_components, covariance_type=cv_type)
gmm.fit(X)
bic.append(gmm.bic(X))
if bic[-1] < lowest_bic:
lowest_bic = bic[-1]
best_gmm = gmm
bic = np.array(bic)
color_iter = itertools.cycle(['k', 'r', 'g', 'b', 'c', 'm', 'y'])
clf = best_gmm
bars = []
# Plot the BIC scores
spl = pl.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
xpos = np.array(n_components_range) + .2 * (i - 2)
bars.append(pl.bar(xpos, bic[i * len(n_components_range):
(i + 1) * len(n_components_range)],
width=.2, color=color))
pl.xticks(n_components_range)
pl.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
pl.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
.2 * np.floor(bic.argmin() / len(n_components_range))
pl.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
# Plot the winner
splot = pl.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, covar, color) in enumerate(zip(clf.means_, clf.covars_,
color_iter)):
v, w = linalg.eigh(covar)
if not np.any(Y_ == i):
continue
pl.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
# Plot an ellipse to show the Gaussian component
angle = np.arctan2(w[0][1], w[0][0])
angle = 180 * angle / np.pi  # convert to degrees
v *= 4
ell = mpl.patches.Ellipse(mean, v[0], v[1], 180 + angle, color=color)
ell.set_clip_box(splot.bbox)
ell.set_alpha(.5)
splot.add_artist(ell)
pl.xlim(-10, 10)
pl.ylim(-3, 6)
pl.xticks(())
pl.yticks(())
pl.title('Selected GMM: full model, 2 components')
pl.subplots_adjust(hspace=.35, bottom=.02)
pl.show()
Figure 2.114: Gaussian Mixture Model Sine Curve
Gaussian Mixture Model Sine Curve
This example highlights the advantages of the Dirichlet Process: complexity control and dealing with sparse data. The
dataset is formed by 100 points loosely spaced following a noisy sine curve. The fit by the GMM class, using the
expectation-maximization algorithm to fit a mixture of 10 Gaussian components, finds too-small components and very
little structure. The fits by the Dirichlet process, however, show that the model can either learn a global structure for the
data (small alpha) or easily interpolate to finding relevant local structure (large alpha), never falling into the problems
shown by the GMM class.
Python source code: plot_gmm_sin.py
import itertools
import numpy as np
from scipy import linalg
import pylab as pl
import matplotlib as mpl
from sklearn import mixture
# Number of samples per component
n_samples = 100
# Generate random sample following a sine curve
np.random.seed(0)
X = np.zeros((n_samples, 2))
step = 4 * np.pi / n_samples
for i in xrange(X.shape[0]):
x = i * step - 6
X[i, 0] = x + np.random.normal(0, 0.1)
X[i, 1] = 3 * (np.sin(x) + np.random.normal(0, .2))
color_iter = itertools.cycle(['r', 'g', 'b', 'c', 'm'])
for i, (clf, title) in enumerate([
(mixture.GMM(n_components=10, covariance_type='full', n_iter=100), \
"Expectation-maximization"),
(mixture.DPGMM(n_components=10, covariance_type='full',
alpha=0.01, n_iter=100),
"Dirichlet Process,alpha=0.01"),
(mixture.DPGMM(n_components=10, covariance_type='diag',
alpha=100., n_iter=100),
"Dirichlet Process,alpha=100.")
]):
clf.fit(X)
splot = pl.subplot(3, 1, 1 + i)
Y_ = clf.predict(X)
for i, (mean, covar, color) in enumerate(zip(
clf.means_, clf._get_covars(), color_iter)):
v, w = linalg.eigh(covar)
u = w[0] / linalg.norm(w[0])
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant
# components.
if not np.any(Y_ == i):
continue
pl.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
# Plot an ellipse to show the Gaussian component
angle = np.arctan(u[1] / u[0])
angle = 180 * angle / np.pi  # convert to degrees
ell = mpl.patches.Ellipse(mean, v[0], v[1], 180 + angle, color=color)
ell.set_clip_box(splot.bbox)
ell.set_alpha(0.5)
splot.add_artist(ell)
pl.xlim(-6, 4 * np.pi - 6)
pl.ylim(-5, 5)
pl.title(title)
pl.xticks(())
pl.yticks(())
pl.show()
2.1.13 Nearest Neighbors
Examples concerning the sklearn.neighbors package.
Figure 2.115: Nearest Neighbors Classification
Nearest Neighbors Classification
Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.
Script output:
None 0.813333333333
0.1 0.826666666667
Python source code: plot_nearest_centroid.py
print __doc__
import numpy as np
import pylab as pl
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import NearestCentroid
n_neighbors = 15
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for shrinkage in [None, 0.1]:
# we create an instance of Neighbours Classifier and fit the data.
clf = NearestCentroid(shrink_threshold=shrinkage)
clf.fit(X, y)
y_pred = clf.predict(X)
print shrinkage, np.mean(y == y_pred)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure()
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.title("3-Class classification (shrink_threshold=%r)"
% shrinkage)
pl.axis('tight')
pl.show()
Figure 2.117: Nearest Neighbors regression
Nearest Neighbors regression
Demonstrate the resolution of a regression problem using a k-Nearest Neighbor and the interpolation of the target
using both barycenter and constant weights.
Python source code: plot_regression.py
print __doc__
# Author: Alexandre Gramfort <[email protected]>
# Fabian Pedregosa <[email protected]>
#
# License: BSD, (C) INRIA
###############################################################################
# Generate sample data
import numpy as np
import pylab as pl
from sklearn import neighbors
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))
###############################################################################
# Fit regression model
n_neighbors = 5
for i, weights in enumerate(['uniform', 'distance']):
knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
y_ = knn.fit(X, y).predict(T)
pl.subplot(2, 1, i + 1)
pl.scatter(X, y, c='k', label='data')
pl.plot(T, y_, c='g', label='prediction')
pl.axis('tight')
pl.legend()
pl.title("KNeighborsRegressor (k = %i, weights = %s)" % (n_neighbors,
weights))
pl.show()
2.1.14 Semi Supervised Classification
Examples concerning the sklearn.semi_supervised package.
Figure 2.118: Label Propagation digits: Demonstrating performance
Label Propagation digits: Demonstrating performance
This example demonstrates the power of semi-supervised learning by training a Label Spreading model to classify
handwritten digits with sets of very few labels.
The handwritten digit dataset has 1797 total points. The model will be trained using all points, but only 30 will be
labeled. Results in the form of a confusion matrix and a series of metrics over each class will be very good.
At the end, the top 10 most uncertain predictions will be shown.
Script output:
Label Spreading model: 30 labeled & 300 unlabeled points (330 total)
precision recall f1-score support
0 1.00 1.00 1.00 23
1 0.58 0.54 0.56 28
2 0.96 0.93 0.95 29
3 0.00 0.00 0.00 28
4 0.91 0.80 0.85 25
5 0.96 0.79 0.87 33
6 0.97 0.97 0.97 36
7 0.89 1.00 0.94 34
8 0.48 0.83 0.61 29
9 0.54 0.77 0.64 35
avg / total 0.73 0.77 0.74 300
Confusion matrix
[[23 0 0 0 0 0 0 0 0]
[ 0 15 1 0 0 1 0 11 0]
[ 0 0 27 0 0 0 2 0 0]
[ 0 5 0 20 0 0 0 0 0]
[ 0 0 0 0 26 0 0 1 6]
[ 0 1 0 0 0 35 0 0 0]
[ 0 0 0 0 0 0 34 0 0]
[ 0 5 0 0 0 0 0 24 0]
[ 0 0 0 2 1 0 2 3 27]]
Python source code: plot_label_propagation_digits.py
print __doc__
# Authors: Clay Woolam <[email protected]>
# Licence: BSD
import numpy as np
import pylab as pl
from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
from sklearn.metrics import metrics
from sklearn.metrics.metrics import confusion_matrix
digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]
n_total_samples = len(y)
n_labeled_points = 30
indices = np.arange(n_total_samples)
unlabeled_set = indices[n_labeled_points:]
# shuffle everything around
y_train = np.copy(y)
y_train[unlabeled_set] = -1
###############################################################################
# Learn with LabelSpreading
lp_model = label_propagation.LabelSpreading(gamma=0.25, max_iters=5)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_set]
true_labels = y[unlabeled_set]
cm = confusion_matrix(true_labels, predicted_labels,
labels=lp_model.classes_)
print "Label Spreading model: %d labeled & %d unlabeled points (%d total)" % \
(n_labeled_points, n_total_samples - n_labeled_points, n_total_samples)
print metrics.classification_report(true_labels, predicted_labels)
print "Confusion matrix"
print cm
# calculate uncertainty values for each transduced distribution
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)
# pick the top 10 most uncertain labels
uncertainty_index = np.argsort(pred_entropies)[-10:]
###############################################################################
# plot
f = pl.figure(figsize=(7, 5))
for index, image_index in enumerate(uncertainty_index):
image = images[image_index]
sub = f.add_subplot(2, 5, index + 1)
sub.imshow(image, cmap=pl.cm.gray_r)
pl.xticks([])
pl.yticks([])
sub.set_title('predict: %i\ntrue: %i' % (
lp_model.transduction_[image_index], y[image_index]))
f.suptitle('Learning with small amount of labeled data')
pl.show()
Figure 2.119: Label Propagation digits active learning
Label Propagation digits active learning
Demonstrates an active learning technique to learn handwritten digits using label propagation.
We start by training a label propagation model with only 10 labeled points, then we select the top five most uncertain
points to label. Next, we train with 15 labeled points (original 10 + 5 new ones). We repeat this process four times to
have a model trained with 30 labeled examples.
A plot will appear showing the top 5 most uncertain digits for each iteration of training. These may or may not contain
mistakes, but we will train the next model with their true labels.
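The selection step at the heart of that loop, ranking unlabeled samples by the entropy of their transduced label distributions, can be sketched on its own with dummy probabilities (the random array here merely stands in for lp_model.label_distributions_ used in the code below):
import numpy as np
from scipy import stats

# Rows are per-sample class probability distributions (dummy data here).
rng = np.random.RandomState(0)
label_distributions = rng.dirichlet(np.ones(10), size=100)

# Higher entropy = more uncertain; ask for labels on the five most uncertain.
pred_entropies = stats.distributions.entropy(label_distributions.T)
most_uncertain = np.argsort(pred_entropies)[-5:]
print "query these sample indices next:", most_uncertain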
Script output:
Iteration 0 ______________________________________________________________________
Label Spreading model: 10 labeled & 320 unlabeled (330 total)
precision recall f1-score support
0 0.00 0.00 0.00 24
1 0.49 0.90 0.63 29
2 0.88 0.97 0.92 31
3 0.00 0.00 0.00 28
4 0.00 0.00 0.00 27
5 0.89 0.49 0.63 35
6 0.86 0.95 0.90 40
7 0.75 0.92 0.83 36
8 0.54 0.79 0.64 33
9 0.41 0.86 0.56 37
avg / total 0.52 0.63 0.55 320
Confusion matrix
[[26 1 0 0 1 0 1]
[ 1 30 0 0 0 0 0]
[ 0 0 17 6 0 2 10]
[ 2 0 0 38 0 0 0]
[ 0 3 0 0 33 0 0]
[ 7 0 0 0 0 26 0]
[ 0 0 2 0 0 3 32]]
Iteration 1 ______________________________________________________________________
Label Spreading model: 15 labeled & 315 unlabeled (330 total)
precision recall f1-score support
0 1.00 1.00 1.00 23
1 0.61 0.59 0.60 29
2 0.91 0.97 0.94 31
3 1.00 0.56 0.71 27
4 0.79 0.88 0.84 26
5 0.89 0.46 0.60 35
6 0.86 0.95 0.90 40
7 0.97 0.92 0.94 36
8 0.54 0.84 0.66 31
9 0.70 0.81 0.75 37
avg / total 0.82 0.80 0.79 315
Confusion matrix
[[23 0 0 0 0 0 0 0 0 0]
[ 0 17 1 0 2 0 0 1 7 1]
[ 0 1 30 0 0 0 0 0 0 0]
[ 0 0 0 15 0 0 0 0 10 2]
[ 0 3 0 0 23 0 0 0 0 0]
[ 0 0 0 0 1 16 6 0 2 10]
[ 0 2 0 0 0 0 38 0 0 0]
[ 0 0 2 0 1 0 0 33 0 0]
[ 0 5 0 0 0 0 0 0 26 0]
[ 0 0 0 0 2 2 0 0 3 30]]
Iteration 2 ______________________________________________________________________
Label Spreading model: 20 labeled & 310 unlabeled (330 total)
precision recall f1-score support
0 1.00 1.00 1.00 23
1 0.68 0.59 0.63 29
2 0.91 0.97 0.94 31
3 0.96 1.00 0.98 23
4 0.81 1.00 0.89 25
5 0.89 0.46 0.60 35
6 0.86 0.95 0.90 40
7 0.97 0.92 0.94 36
8 0.68 0.84 0.75 31
9 0.75 0.81 0.78 37
avg / total 0.85 0.84 0.83 310
Confusion matrix
[[23 0 0 0 0 0 0 0 0 0]
[ 0 17 1 0 2 0 0 1 7 1]
[ 0 1 30 0 0 0 0 0 0 0]
[ 0 0 0 23 0 0 0 0 0 0]
[ 0 0 0 0 25 0 0 0 0 0]
[ 0 0 0 1 1 16 6 0 2 9]
[ 0 2 0 0 0 0 38 0 0 0]
[ 0 0 2 0 1 0 0 33 0 0]
[ 0 5 0 0 0 0 0 0 26 0]
[ 0 0 0 0 2 2 0 0 3 30]]
Iteration 3 ______________________________________________________________________
Label Spreading model: 25 labeled & 305 unlabeled (330 total)
precision recall f1-score support
0 1.00 1.00 1.00 23
1 0.70 0.85 0.77 27
2 1.00 0.90 0.95 31
3 1.00 1.00 1.00 23
4 1.00 1.00 1.00 25
5 0.96 0.74 0.83 34
6 1.00 0.95 0.97 40
7 0.90 1.00 0.95 35
8 0.83 0.81 0.82 31
9 0.75 0.83 0.79 36
avg / total 0.91 0.90 0.90 305
Confusion matrix
[[23 0 0 0 0 0 0 0 0 0]
[ 0 23 0 0 0 0 0 0 4 0]
[ 0 1 28 0 0 0 0 2 0 0]
[ 0 0 0 23 0 0 0 0 0 0]
[ 0 0 0 0 25 0 0 0 0 0]
[ 0 0 0 0 0 25 0 0 0 9]
[ 0 2 0 0 0 0 38 0 0 0]
[ 0 0 0 0 0 0 0 35 0 0]
[ 0 5 0 0 0 0 0 0 25 1]
[ 0 2 0 0 0 1 0 2 1 30]]
Iteration 4 ______________________________________________________________________
Label Spreading model: 30 labeled & 300 unlabeled (330 total)
precision recall f1-score support
0 1.00 1.00 1.00 23
1 0.77 0.88 0.82 26
2 1.00 0.90 0.95 31
3 1.00 1.00 1.00 23
4 1.00 1.00 1.00 25
5 0.94 0.97 0.95 32
6 1.00 0.97 0.99 39
7 0.90 1.00 0.95 35
8 0.89 0.81 0.85 31
9 0.94 0.89 0.91 35
avg / total 0.94 0.94 0.94 300
Confusion matrix
[[23 0 0 0 0 0 0 0 0 0]
[ 0 23 0 0 0 0 0 0 3 0]
[ 0 1 28 0 0 0 0 2 0 0]
[ 0 0 0 23 0 0 0 0 0 0]
[ 0 0 0 0 25 0 0 0 0 0]
[ 0 0 0 0 0 31 0 0 0 1]
[ 0 1 0 0 0 0 38 0 0 0]
[ 0 0 0 0 0 0 0 35 0 0]
[ 0 5 0 0 0 0 0 0 25 1]
[ 0 0 0 0 0 2 0 2 0 31]]
Python source code: plot_label_propagation_digits_active_learning.py
print __doc__

# Authors: Clay Woolam <[email protected]>
# Licence: BSD

import numpy as np
import pylab as pl
from scipy import stats

from sklearn import datasets
from sklearn.semi_supervised import label_propagation
from sklearn.metrics import classification_report, confusion_matrix

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 10
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = pl.figure()

for i in range(5):
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = label_propagation.LabelSpreading(gamma=0.25, max_iters=5)
    lp_model.fit(X, y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print ('Iteration %i ' + 70 * '_') % i
    print "Label Spreading model: %d labeled & %d unlabeled (%d total)" % \
        (n_labeled_points, n_total_samples - n_labeled_points, n_total_samples)

    print classification_report(true_labels, predicted_labels)

    print "Confusion matrix"
    print cm

    # compute the entropies of transduced label distributions
    pred_entropies = stats.distributions.entropy(
        lp_model.label_distributions_.T)

    # select five digit examples that the classifier is most uncertain about
    uncertainty_index = np.argsort(pred_entropies)[-5:]

    # keep track of indices that we get labels for
    delete_indices = np.array([])

    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" % ((i + 1), i * 5 + 10), size=10)
    for index, image_index in enumerate(uncertainty_index):
        image = images[image_index]

        sub = f.add_subplot(5, 5, index + 1 + (5 * i))
        sub.imshow(image, cmap=pl.cm.gray_r)
        sub.set_title('predict: %i\ntrue: %i' % (
            lp_model.transduction_[image_index], y[image_index]), size=10)
        sub.axis('off')

        # labeling 5 points, remote from labeled set
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += 5

f.suptitle("Active learning with Label Propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model.")
pl.subplots_adjust(0.12, 0.03, 0.9, 0.8, 0.2, 0.45)
pl.show()
Figure 2.120: Label Propagation learning a complex structure
Label Propagation learning a complex structure
Example of LabelPropagation learning a complex internal structure to demonstrate manifold learning. The outer
circle should be labeled red and the inner circle blue. Because both label groups lie inside their own distinct
shape, we can see that the labels propagate correctly around the circle.
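Stripped of the plotting code, the essential steps of the example below are just the following (a minimal sketch; the accuracy check on the last line is ours and is not part of the original script):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import label_propagation

n_samples = 200
X, y = make_circles(n_samples=n_samples, shuffle=False)

labels = -np.ones(n_samples)  # every point starts out unlabeled (-1) ...
labels[0] = 0                 # ... except one point on the outer circle
labels[-1] = 1                # ... and one point on the inner circle

label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=1.0)
label_spread.fit(X, labels)

# fraction of points whose propagated label matches the generating circle
print((label_spread.transduction_ == y).mean())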
Python source code: plot_label_propagation_structure.py
print __doc__

# Authors: Clay Woolam <[email protected]>
#          Andreas Mueller <[email protected]>
# Licence: BSD

import numpy as np
import pylab as pl
from sklearn.semi_supervised import label_propagation
from sklearn.datasets import make_circles

# generate ring with inner box
n_samples = 200
X, y = make_circles(n_samples=n_samples, shuffle=False)
outer, inner = 0, 1
labels = -np.ones(n_samples)
labels[0] = outer
labels[-1] = inner

###############################################################################
# Learn with LabelSpreading
label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=1.0)
label_spread.fit(X, labels)

###############################################################################
# Plot output labels
output_labels = label_spread.transduction_
pl.figure(figsize=(8.5, 4))
pl.subplot(1, 2, 1)
plot_outer_labeled, = pl.plot(X[labels == outer, 0],
                              X[labels == outer, 1], 'rs')
plot_unlabeled, = pl.plot(X[labels == -1, 0], X[labels == -1, 1], 'g.')
plot_inner_labeled, = pl.plot(X[labels == inner, 0],
                              X[labels == inner, 1], 'bs')
pl.legend((plot_outer_labeled, plot_inner_labeled, plot_unlabeled),
          ('Outer Labeled', 'Inner Labeled', 'Unlabeled'), 'upper left',
          numpoints=1, shadow=False)
pl.title("Raw data (2 classes=red and blue)")

pl.subplot(1, 2, 2)
output_label_array = np.asarray(output_labels)
outer_numbers = np.where(output_label_array == outer)[0]
inner_numbers = np.where(output_label_array == inner)[0]
plot_outer, = pl.plot(X[outer_numbers, 0], X[outer_numbers, 1], 'rs')
plot_inner, = pl.plot(X[inner_numbers, 0], X[inner_numbers, 1], 'bs')
pl.legend((plot_outer, plot_inner), ('Outer Learned', 'Inner Learned'),
          'upper left', numpoints=1, shadow=False)
pl.title("Labels learned with Label Spreading (KNN)")

pl.subplots_adjust(left=0.07, bottom=0.07, right=0.93, top=0.92)
pl.show()
Figure 2.121: Decision boundary of label propagation versus SVM on the Iris dataset
Decision boundary of label propagation versus SVM on the Iris dataset
Comparison of the decision boundaries generated on the iris dataset by Label Propagation and an SVM.
This demonstrates that Label Propagation learns a good boundary even with a small amount of labeled data.
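The comparison boils down to fitting a semi-supervised model on a partially hidden label vector and a fully supervised SVC on the complete one (a minimal sketch; the name y_partial is ours and the plotting code of the full example is omitted):

import numpy as np
from sklearn import datasets, svm
from sklearn.semi_supervised import label_propagation

rng = np.random.RandomState(0)
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# randomly hide part of the labels; -1 marks a point as unlabeled
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) < 0.3] = -1

lp = label_propagation.LabelSpreading().fit(X, y_partial)  # semi-supervised
svc = svm.SVC(kernel='rbf').fit(X, y)                      # fully supervised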
Python source code: plot_label_propagation_versus_svm_iris.py
print __doc__

# Authors: Clay Woolam <[email protected]>
# Licence: BSD

import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn import svm
from sklearn.semi_supervised import label_propagation

rng = np.random.RandomState(0)

iris = datasets.load_iris()

X = iris.data[:, :2]
y = iris.target

# step size in the mesh
h = .02

y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -1
y_50 = np.copy(y)
y_50[rng.rand(len(y)) < 0.5] = -1
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
ls30 = (label_propagation.LabelSpreading().fit(X, y_30), y_30)
ls50 = (label_propagation.LabelSpreading().fit(X, y_50), y_50)
ls100 = (label_propagation.LabelSpreading().fit(X, y), y)
rbf_svc = (svm.SVC(kernel='rbf').fit(X, y), y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['Label Spreading 30% data',
          'Label Spreading 50% data',
          'Label Spreading 100% data',
          'SVC with rbf kernel']

color_map = {-1: (1, 1, 1), 0: (0, 0, .9), 1: (1, 0, 0), 2: (.8, .6, 0)}

for i, (clf, y_train) in enumerate((ls30, ls50, ls100, rbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    pl.subplot(2, 2, i + 1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.contourf(xx, yy, Z, cmap=pl.cm.Paired)
    pl.axis('off')

    # Plot also the training points
    colors = [color_map[y] for y in y_train]
    pl.scatter(X[:, 0], X[:, 1], c=colors, cmap=pl.cm.Paired)

    pl.title(titles[i])

pl.text(.90, 0, "Unlabeled points are colored white")
pl.show()
2.1.15 Support Vector Machines
Examples concerning the sklearn.svm package.
SVM with custom kernel
Simple usage of Support Vector Machines to classify a sample. It will plot the decision surface and the support vectors.
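The key point of the example is that SVC accepts any Python callable as its kernel argument, as long as the callable returns the kernel matrix between its two arguments (a minimal sketch; the function name linear_scaled_kernel is ours):

import numpy as np
from sklearn import svm, datasets


def linear_scaled_kernel(x, y):
    # a linear kernel that weights the first feature twice as much
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(x, M), y.T)

iris = datasets.load_iris()
X, Y = iris.data[:, :2], iris.target

clf = svm.SVC(kernel=linear_scaled_kernel)
clf.fit(X, Y)
print(clf.predict(X[:5]))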
Figure 2.122: SVM with custom kernel
Python source code: plot_custom_kernel.py
print __doc__

import numpy as np
import pylab as pl
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
Y = iris.target


def my_kernel(x, y):
    """
    We create a custom kernel:

                 (2  0)
    k(x, y) = x  (    ) y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(x, M), y.T)


h = .02  # step size in the mesh

# we create an instance of SVM and fit our data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

# Plot also the training points
pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)
pl.title('3-Class classification using Support Vector Machine with custom'
         ' kernel')
pl.axis('tight')
pl.show()
Figure 2.123: Plot different SVM classifiers in the iris dataset
Plot different SVM classifiers in the iris dataset
Comparison of several SVM classifiers on the iris dataset (using only the first two features). It will plot the decision
surface for four different SVM classifiers.
Python source code: plot_iris.py
print __doc__

import numpy as np
import pylab as pl
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
Y = iris.target

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, Y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, Y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, Y)
lin_svc = svm.LinearSVC(C=C).fit(X, Y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel',
          'LinearSVC (linear kernel)']

for i, clf in enumerate((svc, rbf_svc, poly_svc, lin_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    pl.subplot(2, 2, i + 1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.contourf(xx, yy, Z, cmap=pl.cm.Paired)
    pl.axis('off')

    # Plot also the training points
    pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)

    pl.title(titles[i])

pl.show()
Figure 2.124: One-class SVM with non-linear kernel (RBF)
One-class SVM with non-linear kernel (RBF)
One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new
data as similar or different to the training set.
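In code, the essential pattern is to fit on unlabeled "normal" data and then classify new points; predict returns +1 for points judged similar to the training set and -1 for points judged different (a minimal sketch with illustrative data, not the figure's exact setup):

import numpy as np
from sklearn import svm

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2) + 2        # "normal" training cloud
X_new = np.array([[2.1, 1.9], [-3.5, 3.0]])  # a likely inlier and a likely outlier

clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)           # note: no labels are passed
print(clf.predict(X_new))  # +1 = similar to training data, -1 = different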
Python source code: plot_oneclass.py
print __doc__

import numpy as np
import pylab as pl
import matplotlib.font_manager
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))

# Generate train data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

# plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

pl.title("Novelty Detection")
pl.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=pl.cm.Blues_r)
a = pl.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
pl.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
b1 = pl.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = pl.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = pl.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red')
pl.axis('tight')
pl.xlim((-5, 5))
pl.ylim((-5, 5))
pl.legend([a.collections[0], b1, b2, c],
          ["learned frontier", "training observations",
           "new regular observations", "new abnormal observations"],
          loc="upper left",
          prop=matplotlib.font_manager.FontProperties(size=11))
pl.xlabel(
    "error train: %d/200 ; errors novel regular: %d/40 ; "
    "errors novel abnormal: %d/20"
    % (n_error_train, n_error_test, n_error_outliers))
pl.show()
Figure 2.125: RBF SVM parameters
RBF SVM parameters
This example illustrates the effect of the parameters gamma and C of the RBF-kernel SVM.
Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values
meaning 'far' and high values meaning 'close'. The C parameter trades off misclassification of training examples
against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at
classifying all training examples correctly.
Two plots are generated. The first is a visualization of the decision function for a variety of parameter values, and the
second is a heatmap of the classifier's cross-validation accuracy as a function of C and gamma.
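The heatmap in the second plot comes from a plain grid search over C and gamma; a condensed version of that part of the script looks like this (a minimal sketch using the same cross-validation and grid-search API as the listing below, but with narrower, purely illustrative parameter ranges):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

iris = load_iris()
X, Y = iris.data, iris.target

# logarithmic grids are a good first choice for C and gamma
param_grid = dict(C=10.0 ** np.arange(-2, 4),
                  gamma=10.0 ** np.arange(-4, 1))

grid = GridSearchCV(SVC(), param_grid=param_grid,
                    cv=StratifiedKFold(y=Y, k=3))
grid.fit(X, Y)
print(grid.best_estimator_)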
Script output:
('The best classifier is: ', SVC(C=1000000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0001, kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False))
Python source code: plot_rbf_parameters.py
print __doc__

import numpy as np
import pylab as pl

from sklearn.svm import SVC
from sklearn.preprocessing import Scaler
from sklearn.datasets import load_iris
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

##############################################################################
# Load and prepare data set
#
# dataset for grid search
iris = load_iris()
X = iris.data
Y = iris.target

# dataset for decision function visualization
X_2d = X[:, :2]
X_2d = X_2d[Y > 0]
Y_2d = Y[Y > 0]
Y_2d -= 1

# It is usually a good idea to scale the data for SVM training.
# We are cheating a bit in this example in scaling all of the data,
# instead of fitting the transformation on the training set and
# just applying it on the test set.
scaler = Scaler()

X = scaler.fit_transform(X)
X_2d = scaler.fit_transform(X_2d)

##############################################################################
# Train classifier
#
# For an initial search, a logarithmic grid with basis
# 10 is often helpful. Using a basis of 2, a finer
# tuning can be achieved but at a much higher cost.
C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
param_grid = dict(gamma=gamma_range, C=C_range)

grid = GridSearchCV(SVC(), param_grid=param_grid, cv=StratifiedKFold(y=Y, k=3))
grid.fit(X, Y)

print("The best classifier is: ", grid.best_estimator_)

# Now we need to fit a classifier for all parameters in the 2d version
# (we use a smaller set of parameters here because it takes a while to train)
C_2d_range = [1, 1e2, 1e4]
gamma_2d_range = [1e-1, 1, 1e1]
classifiers = []
for C in C_2d_range:
    for gamma in gamma_2d_range:
        clf = SVC(C=C, gamma=gamma)
        clf.fit(X_2d, Y_2d)
        classifiers.append((C, gamma, clf))

##############################################################################
# visualization
#
# draw visualization of parameter effects
pl.figure(figsize=(8, 6))
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
for (k, (C, gamma, clf)) in enumerate(classifiers):
    # evaluate decision function in a grid
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # visualize decision function for these parameters
    pl.subplot(len(C_2d_range), len(gamma_2d_range), k + 1)
    pl.title("gamma 10^%d, C 10^%d" % (np.log10(gamma), np.log10(C)),
             size='medium')

    # visualize parameter's effect on decision function
    pl.pcolormesh(xx, yy, -Z, cmap=pl.cm.jet)
    pl.scatter(X_2d[:, 0], X_2d[:, 1], c=Y_2d, cmap=pl.cm.jet)
    pl.xticks(())
    pl.yticks(())
    pl.axis('tight')

# plot the scores of the grid
# grid_scores_ contains parameter settings and scores
score_dict = grid.grid_scores_

# We extract just the scores
scores = [x[1] for x in score_dict]
scores = np.array(scores).reshape(len(C_range), len(gamma_range))

# draw heatmap of accuracy as a function of gamma and C
pl.figure(figsize=(8, 6))
pl.subplots_adjust(left=0.05, right=0.95, bottom=0.15, top=0.95)
pl.imshow(scores, interpolation='nearest', cmap=pl.cm.spectral)
pl.xlabel('gamma')
pl.ylabel('C')
pl.colorbar()
pl.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
pl.yticks(np.arange(len(C_range)), C_range)

pl.show()
Figure 2.126: SVM: Maximum margin separating hyperplane
SVM: Maximum margin separating hyperplane
Plot the maximum margin separating hyperplane within a two-class separable dataset using a Support Vector Machine
classifier with a linear kernel.
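For a linear kernel the fitted model exposes the hyperplane directly: clf.coef_[0] holds the normal vector w and clf.intercept_[0] the offset b, so the boundary w.x + b = 0 can be drawn as a line in 2-D (a minimal sketch of that bookkeeping; the variable names slope, xs and ys are ours, while the data matches the listing below):

import numpy as np
from sklearn import svm

np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear').fit(X, Y)

w, b = clf.coef_[0], clf.intercept_[0]
slope = -w[0] / w[1]
xs = np.linspace(-5, 5)
ys = slope * xs - b / w[1]        # points on the separating line

print(clf.support_vectors_)       # the points that pin down the margin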
Python source code: plot_separating_hyperplane.py
print __doc__

import numpy as np
import pylab as pl
from sklearn import svm

# we create 40 separable points
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0] * 20 + [1] * 20

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# plot the line, the points, and the nearest vectors to the plane
pl.plot(xx, yy, 'k-')
pl.plot(xx, yy_down, 'k--')
pl.plot(xx, yy_up, 'k--')

pl.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=80, facecolors='none')
pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)

pl.axis('tight')
pl.show()
Figure 2.127: SVM: Separating hyperplane for unbalanced classes
SVM: Separating hyperplane for unbalanced classes
Find the optimal separating hyperplane using an SVC for classes that are unbalanced.
We first find the separating hyperplane with a plain SVC and then plot (dashed) the separating hyperplane with an
automatic correction for unbalanced classes.
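The correction is obtained simply by passing a class_weight dictionary to SVC, which scales the penalty C per class; giving the rare class a larger weight pushes the hyperplane away from it (a minimal sketch using the same data-generation scheme as the listing below):

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.r_[1.5 * rng.randn(1000, 2),            # 1000 points of class 0
          0.5 * rng.randn(100, 2) + [2, 2]]    # 100 points of class 1
y = [0] * 1000 + [1] * 100

plain = svm.SVC(kernel='linear', C=1.0).fit(X, y)
# penalize errors on the rare class 10 times more heavily
weighted = svm.SVC(kernel='linear', class_weight={1: 10}).fit(X, y)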
Python source code: plot_separating_hyperplane_unbalanced.py
print __doc__

import numpy as np
import pylab as pl
from sklearn import svm

# we create two clusters of random points: 1000 and 100 samples
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]

# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot separating hyperplanes and samples
h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()

pl.axis('tight')
pl.show()
Figure 2.128: SVM-Anova: SVM with univariate feature selection
SVM-Anova: SVM with univariate feature selection
This example shows how to perform univariate feature selection before running an SVC (support vector classifier) to
improve the classification scores.
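The feature selection and the classifier are chained in a Pipeline, so the percentile of kept features can be tuned like any other parameter (a minimal sketch of that construction; the 10% value is just an illustration, while the full listing below sweeps a whole range of percentiles):

from sklearn import svm, datasets, feature_selection, cross_validation
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()
X, y = digits.data[:200], digits.target[:200]

# univariate (ANOVA F-test) feature selection followed by an SVC
anova = feature_selection.SelectPercentile(feature_selection.f_classif)
clf = Pipeline([('anova', anova), ('svc', svm.SVC(C=1.0))])

clf.set_params(anova__percentile=10)   # keep only the 10% best features
print(cross_validation.cross_val_score(clf, X, y).mean())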
Python source code: plot_svm_anova.py
print __doc__

import numpy as np
import pylab as pl
from sklearn import svm, datasets, feature_selection, cross_validation
from sklearn.pipeline import Pipeline

###############################################################################
# Import some data to play with
digits = datasets.load_digits()
y = digits.target
# Throw away data, to be in the curse of dimension settings
y = y[:200]
X = digits.data[:200]
n_samples = len(y)
X = X.reshape((n_samples, -1))
# add 200 non-informative features
X = np.hstack((X, 2 * np.random.random((n_samples, 200))))

###############################################################################
# Create a feature-selection transform and an instance of SVM that we
# combine together to have a full-blown estimator
transform = feature_selection.SelectPercentile(feature_selection.f_classif)

clf = Pipeline([('anova', transform), ('svc', svm.SVC(C=1.0))])

###############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    # Compute cross-validation score for this percentile of features
    this_scores = cross_validation.cross_val_score(clf, X, y, n_jobs=1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())

pl.errorbar(percentiles, score_means, np.array(score_stds))

pl.title(
    'Performance of the SVM-Anova varying the percentile of features selected')
pl.xlabel('Percentile')
pl.ylabel('Prediction rate')

pl.axis('tight')
pl.show()
Figure 2.129: SVM-SVC (Support Vector Classification)
SVM-SVC (Support Vector Classification)
The classification application of the SVM is used below. The Iris dataset is used for this example.
The decision boundaries are shown with all the points in the training set.
Python source code: plot_svm_iris.py
print __doc__

# Code source: Gael Varoquaux
# Modified for Documentation merge by Jaques Grobler
# License: BSD

import numpy as np
import pylab as pl
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

h = .02  # step size in the mesh

clf = svm.SVC(C=1.0, kernel='linear')

# we create an instance of SVM Classifier and fit the data.
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(1, figsize=(4, 3))
pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

# Plot also the training points
pl.scatter(X[:, 0], X[:, 1], c=Y, cmap=pl.cm.Paired)
pl.xlabel('Sepal length')
pl.ylabel('Sepal width')

pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.xticks(())
pl.yticks(())
pl.show()
Figure 2.130: SVM-Kernels
SVM-Kernels
Three different types of SVM-Kernels are displayed below. The polynomial and RBF kernels are especially useful when the
data points are not linearly separable.
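The three kernels are selected purely through the kernel argument of SVC (a minimal sketch on the iris features used in the earlier examples; the figure's own dataset and parameters may differ):

from sklearn import svm, datasets

iris = datasets.load_iris()
X, Y = iris.data[:, :2], iris.target

linear_svc = svm.SVC(kernel='linear').fit(X, Y)
poly_svc = svm.SVC(kernel='poly', degree=3).fit(X, Y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7).fit(X, Y)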