ADVANCED PROGRAMMING
STATISTICS FOR DATA SCIENCE
Ricardo Aler
A Tutorial on Scikit Learn
What is Scikit Learn?
• It is the standard Python library for doing machine learning
from sklearn import ...
• Collection of machine learning algorithms and tools in
Python.
• BSD Licensed, used in academia and industry (Spotify, bit.ly,
Evernote).
• ~20 core developers.
• http://scikit-learn.org/stable/
• Other packages for Machine Learning in Python:
Pylearn2, PyBrain, ...
The Machine Learning
workflow
• Knowledge about the main ideas of
Machine Learning / Statistical Learning is
assumed
• The workflow:
– Data preprocessing
– Training:
• Training the model
• Hyper-parameter tuning
– Model evaluation (holdout, crossvalidation)
The input: the dataset
• Datasets for sklearn are numpy numeric matrices:
– This implies that categorical attributes/variables must
be represented as:
• Integers
• One-hot-encoding / dummy variables
• However, there is a trend towards integrating Pandas dataframes with scikit-learn
• Missing values are represented as np.nan
Example of dataset: iris
• It is a dataset for classification of plants
– Attributes / features:
• ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
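• As a quick, minimal sketch (variable names are just illustrative), the iris dataset can be loaded directly from sklearn:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data              # numpy matrix with the four numeric attributes
y = iris.target            # numpy vector with the class, encoded as integers 0, 1, 2
print(iris.feature_names)  # the attribute names listed above
print(X.shape)             # (150, 4)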
[Figure: KNN illustration with K = 1 on a Height/Weight toy dataset (classes: Child, Adult, Aged); the test instance is classified according to its nearest neighbour.]
Models: decision tree
Training a decision tree
In [93]: from sklearn import tree
# Here, we define the type of training method (nothing happens yet)
In [94]: clf = tree.DecisionTreeClassifier()
# Now, we train (fit) the method on the (X,y) dataset
In [95]: clf = clf.fit(X, y)
# clf contains the trained model
In [96]: clf
Out[96]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
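• Once fitted, the model can be used to make predictions with .predict. A minimal sketch, assuming X and y are the iris attributes and classes:

y_pred = clf.predict(X)                  # predictions for the training instances
clf.predict([[5.1, 3.5, 1.4, 0.2]])      # prediction for a single new instance (4 attribute values)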
Training and evaluating models
with a test partition (holdout)
[Diagram: the available data (attributes + class) is split into a training partition (2/3), used by the ML method to build the Model, and a test partition (1/3, holdout). The model's predictions ŷT for the test attributes xT are compared with the actual classes yT to compute the success rate (performance measure).]
Rule: never evaluate a model with the same data used for training it
Training and evaluating models
with a test partition (holdout)
• First, we create the train / test partitions
In []: from sklearn.model_selection import train_test_split
In []: from sklearn import preprocessing
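• A minimal holdout sketch with the iris data (the classifier, the 2/3-1/3 split and random_state are illustrative choices, not the exact code of the slide):

from sklearn import datasets, tree, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 2/3 of the data for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)                        # train only with the training partition
y_test_pred = clf.predict(X_test)                      # predictions for the test partition
print(metrics.accuracy_score(y_test, y_test_pred))     # success rate on the test partition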
3-fold cross-validation evaluation

[Diagram sequence: the available data is split into three folds X, Y and Z. The method is trained three times, each time on two of the folds, and evaluated on the remaining one (80% on fold X, 81% on fold Y, 78% on fold Z). The evaluation (estimation of future performance) is the average: (80% + 81% + 78%) / 3 = 79.7%. Finally, the method is trained on all the available data to obtain the final model, and 79.7% is reported as the estimation of its future performance.]
Estimating performance
(evaluation) with crossvalidation
In []: from sklearn.model_selection import cross_val_score, KFold
X = boston.data
y = boston.target
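• The actual call is not shown on the slide; a minimal sketch of how cross_val_score might be used here (the regressor, the number of folds and the scoring choice are assumptions):

from sklearn import tree

reg = tree.DecisionTreeRegressor()
scores = cross_val_score(reg, X, y,
                         cv=KFold(n_splits=3, shuffle=True, random_state=0),
                         scoring='neg_mean_squared_error')
print(scores)           # one (negated) MSE per fold
print(scores.mean())    # the cross-validation estimate is the average across folds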
Exercise: regression
• Use train (75%) / test (25%) for training / evaluating a decision tree regression model:
  – tree.DecisionTreeRegressor()
  – metrics.mean_squared_error
• Do the same with KNN: KNeighborsRegressor (find it yourself in the scikit docs: https://scikit-learn.org/)
• Now, do the evaluation with 5-fold crossvalidation, with scoring='neg_mean_squared_error'
Hyper-parameters
• All machine learning methods have hyper-
parameters that control their behavior
• For example, KNN has K = number of
neighbors:
– n_neighbors
• For example, decision trees have (at least):
– max_depth
– min_samples_split
MAX-DEPTH HYPER-PARAMETER FOR
DECISION TREES
• With max_depth = 1, boundary is a line.
[Figure: a 2D dataset (attributes X, Y); the decision boundary is a single axis-parallel split.]
MAX-DEPTH HYPER-PARAMETER FOR
DECISION TREES
• With max_depth = 2, boundary is non-linear
[Figure: the decision boundary is now made of several axis-parallel splits (at Y = 4, X = 2 and X = 6).]
MAX-DEPTH HYPER-PARAMETER FOR
DECISION TREES
• With max_depth = 3, boundary is non-linear and more complex than with
max_depth = 2
• And so on
[Figure: a more complex axis-parallel boundary separating the red and blue classes.]
NUMBER OF NEIGHBORS IN KNN
Hyper-parameters
• It is possible to set them by hand when the
method is defined:
In [191]: clf = tree.DecisionTreeClassifier()
In [192]: clf
Out[192]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
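• For instance, to set two hyper-parameters by hand at construction time (the concrete values are only an example):

clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_split=4)
clf = clf.fit(X, y)      # the model is trained with the chosen hyper-parameter values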
Random search: test randomly only some of the combinations (Budget=4, in this
case).
               maxdepth
minsplit      2      4      6      8
   2         70%    75%    76%    68%
   4         72%    73%    81%    70%
   6         68%    70%    71%    67%
Random search
budget = 100   # budget is the maximum number of hyper-parameter combinations to try
while budget > 0:
    budget = budget - 1                           # decrease budget
    (maxdepth, minsplit) = "get a random combination of hyper-parameter values"
    model = train(train_set, maxdepth, minsplit)
    evaluation = "evaluate model with validation set"
"Return the (maxdepth, minsplit) of the model with the best evaluation"
Automatic Hyper-parameter
tuning
• In general, hyper-parameter tuning is a search in a parameter space
for a particular machine learning method (or estimator).
Therefore, it is necessary to define:
– The search space (the hyper-parameters of the method and their
allowed values)
– The search method: so far, grid-search or random-search, but
there are more (such as model based optimization)
– The evaluation method: basically, validation set (holdout) or
crossvalidation
– The performance measure (or score function): misclassification error, balanced accuracy, RMSE, …
Defining the search space for
grid-search
• For grid search, we must specify the list of actual values to be checked for each hyper-parameter (equivalently, as a dictionary or a list of dictionaries):
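• The grid itself is shown as an image on the original slide; a minimal sketch for a decision tree (the hyper-parameter names follow sklearn, the values are illustrative):

param_grid = {
    'max_depth': [2, 4, 6, 8],
    'min_samples_split': [2, 4, 6]
}

# Equivalently, a list of dictionaries can be used (e.g. to define several sub-grids):
param_grid = [{'max_depth': [2, 4, 6, 8], 'min_samples_split': [2, 4, 6]}]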
Defining the search space for
random search
• For random search, we can also specify the list of values to be checked
• But also, the statistical distribution out of which values can be sampled
(this is preferred):
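• A minimal sketch of a distribution-based search space, as accepted by RandomizedSearchCV (the distributions and ranges are illustrative):

from scipy.stats import randint

param_distributions = {
    'max_depth': randint(2, 9),             # integers sampled uniformly from 2 to 8
    'min_samples_split': randint(2, 7)      # integers sampled uniformly from 2 to 6
}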
HYPER-PARAMETER tuning
with crossvalidation
• Now, we are going to use 3-fold cross-validation for hyper-parameter tuning, but train/test (holdout) for model evaluation (a.k.a. estimation of future performance).
• The train partition is split into three folds A, B and C. For each candidate max depth (1, 2, 3), three models are trained and validated: first, we train with A and B and validate with C; then we train with A and C and validate with B; finally, we train with B and C and validate with A.

[Diagram: validation scores per fold for each candidate max depth:

                 max depth = 1   max depth = 2   max depth = 3
  fold 1:             60%             93%             71%
  fold 2:             61%             90%             69%
  fold 3:             60%             92%             70%
  averages:         60.33%          91.66%            70%
]

• The best value is max depth = 2 (average validation score 91.66%).
• A model is trained with the whole train partition, with the best max depth, and evaluated on the test partition (90.5% in this example).
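• A minimal sketch of this workflow in sklearn (GridSearchCV performs the inner 3-fold cross-validation; the dataset, the grid and the split are illustrative assumptions):

from sklearn import datasets, tree
from sklearn.model_selection import train_test_split, GridSearchCV

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.33, random_state=0)

# Inner 3-fold cross-validation on the train partition to tune max_depth
clf_grid = GridSearchCV(tree.DecisionTreeClassifier(),
                        param_grid={'max_depth': [1, 2, 3]},
                        cv=3)
clf_grid = clf_grid.fit(X_train, y_train)
print(clf_grid.best_params_)             # best hyper-parameter value found
print(clf_grid.score(X_test, y_test))    # the refitted best model is evaluated on the test partition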
SMOTE (balance minority
classes)
• https://imbalanced-learn.readthedocs.io/
• imblearn.under_sampling.EditedNearestNeighbours (undersampling the majority class)
• imblearn.over_sampling.SMOTE (oversampling to balance the minority classes)
Standardization and normalization
to a range
• Different attributes may have different ranges (e.g. height: 0m-2m,
weight: 0kg-100kg, …)
• The aim is that all attributes have the same range or spread
– Important for some methods such as KNN, Support Vector
Machines, and neural networks. Not important for Decision trees.
• If xi is an attribute / feature (i.e. a column in a data matrix)
• Normalization: xi = (xi-min(xi)) / (max(xi) – min(xi))
– New range = 0-1
• Standardization: xi = (xi - mean(xi)) / std(xi)
  – New spread: mean 0, standard deviation 1
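• A minimal sketch with sklearn's scalers (fit the scaler on the train partition only, then apply it to both partitions):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization to the 0-1 range
scaler = MinMaxScaler()
scaler = scaler.fit(X_train)              # min and max are computed on the train data only
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)

# Standardization (zero mean, unit standard deviation)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)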
Imputation
• Imputation = replacing missing values (np.nan)
• Some methods are able to deal with missing values (e.g. trees), but some methods
aren't (e.g. KNN, SVM, ...)
• Strategies:
– Remove instances with np.nan 's
– Remove attributes with np.nan 's
– Univariate: replace np.nan 's with mean, median, or mode (categorical
attributes):
• sklearn.impute.SimpleImputer
– Multivariate: use a machine learning method to compute models of an
attribute in terms of the other attributes. Use the model to impute each
attribute, in turn.
• sklearn.impute.IterativeImputer
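• A minimal sketch of both strategies (X_train and X_test are assumed to be numeric matrices containing np.nan's; note that IterativeImputer requires an explicit experimental import):

import numpy as np
from sklearn.impute import SimpleImputer

# Univariate: replace each nan with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)
X_train_imp = imp.transform(X_train)
X_test_imp = imp.transform(X_test)

# Multivariate: each attribute is modelled in terms of the other attributes
from sklearn.experimental import enable_iterative_imputer   # enables the experimental class
from sklearn.impute import IterativeImputer
imp = IterativeImputer().fit(X_train)
X_train_imp = imp.transform(X_train)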
Encoding categorical variables: one-
hot-encoding (dummy variables)
• Some machine learning methods are not able to deal with
categorical/discrete attributes
• Most commonly used: dummy variables or one-hot-
encoding (typically, only N-1 columns are kept)
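• A minimal sketch with OneHotEncoder (the toy column is illustrative; drop='first' keeps only N-1 dummy columns per variable):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([['red'], ['green'], ['blue'], ['green']])
enc = OneHotEncoder(drop='first')          # keep N-1 columns
enc = enc.fit(X_cat)
print(enc.transform(X_cat).toarray())      # one row per instance, one column per kept category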
Encoding categorical variables:
frequency and integer
• However, one-hot-encoding generates too many columns for variables with many
values.
• Alternatives: integer/label encoding
• Problem: an artificial (false) order is introduced
Label/integer encoding
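• A minimal sketch of integer/label encoding with OrdinalEncoder (toy data; the artificial order it introduces is exactly the problem mentioned above):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_cat = np.array([['red'], ['green'], ['blue'], ['green']])
enc = OrdinalEncoder().fit(X_cat)
print(enc.transform(X_cat))                # each category is replaced by an integer code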
Feature selection
• sklearn.feature_selection.SelectKBest (selects the K best attributes according to a scoring function)
Pipelines in Scikit Learn
• Sometimes training a model involves applying a sequence of methods, in most
cases involving some preprocessing steps.
• For example, we might want to do:
1. Imputation (to remove missing values)
2. Attribute selection (to select the most relevant features)
3. Model training
[Diagram: the available data is split into Training and Test; preprocessing (e.g. imputation, selecting the relevant attributes) is applied, the Method is trained, and the Evaluation on the test partition yields 90%.]
How to do preprocessing
correctly?
• We shouldn’t use test data for training the model, in any way
[Same diagram as above: if the preprocessing is fitted on all the available data, the test partition is (indirectly) used for training, which violates the rule above.]
How to do preprocessing
correctly?
• It is better to create a pipeline
• E.g. for attribute selection:
[Diagram: the DATA with attributes x1, x2, x3, x4, x5 and class y goes through attribute selection (keeping x1 and x3); a decision tree is then trained on the reduced DATA, producing the Model.]
How to do preprocessing
correctly?
[Diagram: the attribute selection is fitted using only the training partition of the available data (selected attributes: x1, x5, x20).]
Estimators
# Load the boston dataset and create the train / test partitions
from sklearn import datasets
from sklearn.model_selection import train_test_split

boston = datasets.load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)
Classifier / regressor: fit
• The fit method trains a model
[Diagram: FIT — the training partition and the training algorithm produce the Model.]

Classifier / regressor: predict
• The predict method applies the trained model to new data (here, the test partition) to obtain the predictions ŷ1, ŷ2, ŷ3, ...

y_test_pred = clf.predict(X_test)
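• A minimal end-to-end sketch with the boston train/test split created above (the choice of regressor is an arbitrary illustration):

from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

clf = KNeighborsRegressor()
clf = clf.fit(X_train, y_train)                           # fit: train the model
y_test_pred = clf.predict(X_test)                         # predict: apply the model to the test data
print(metrics.mean_squared_error(y_test, y_test_pred))    # evaluate the predictions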
Now, let's go to the transformers ...
• But let's put a nan in X_train and another one in X_test, for illustration purposes
X_train[1, 1] = np.nan
X_test[1, 1] = np.nan

In [63]: X_train
Out[63]:
array([[2.9 e-01, 0.0 e+00, 6.2 e+00, ...],
       [5.0 e-02,      nan, 6.0 e+00, ...],
       [1.3 e+01, 0.0 e+00, 1.8 e+01, ...],
       ...,
       [4.5 e-02, 0.0 e+00, 1.3 e+01, ...],
       [5.2 e+00, 0.0 e+00, 1.8 e+01, ...],
       [1.2 e+00, 0.0 e+00, 8.1 e+00, ...]])

In [64]: X_test
Out[64]:
array([[9.2 e-02, 0.0 e+00, 2.5 e+01, ...],
       [2.5 e+01,      nan, 1.8 e+01, ...],
       [7.0 e+00, 0.0 e+00, 1.8 e+01, ...],
       ...,
       [1.5 e+01, 0.0 e+00, 1.8 e+01, ...],
       [2.0 e-01, 2.2 e+01, 5.8 e+00, ...],
       [3.4 e-01, 0.0 e+00, 7.3 e+00, ...]])
Transformer: fit

# The transformer is defined and then fitted on the train data
# (fit computes the column means, ignoring the nan's)
trf = SimpleImputer(strategy='mean')
trf = trf.fit(X_train)

# Note that fit does not modify X_train: the nan is still there
In [63]: X_train
Out[63]:
array([[2.9 e-01, 0.0 e+00, 6.2 e+00, ...],
       [5.0 e-02,      nan, 6.0 e+00, ...],
       [1.3 e+01, 0.0 e+00, 1.8 e+01, ...],
       ...,
       [4.5 e-02, 0.0 e+00, 1.3 e+01, ...],
       [5.2 e+00, 0.0 e+00, 1.8 e+01, ...],
       [1.2 e+00, 0.0 e+00, 8.1 e+00, ...]])
Transformer: transform

# X_train before transforming (it still contains the nan)
In [63]: X_train
Out[63]:
array([[2.9 e-01, 0.0 e+00, 6.2 e+00, ...],
       [5.0 e-02,      nan, 6.0 e+00, ...],
       [1.3 e+01, 0.0 e+00, 1.8 e+01, ...],
       ...,
       [4.5 e-02, 0.0 e+00, 1.3 e+01, ...],
       [5.2 e+00, 0.0 e+00, 1.8 e+01, ...],
       [1.2 e+00, 0.0 e+00, 8.1 e+00, ...]])

# Transforming replaces the nan with the mean of its column
X_train = trf.transform(X_train)
array([[2.9 e-01, 0.0 e+00, 6.2 e+00, ...],
       [5.0 e-02, 1.1 e+01, 6.0 e+00, ...],
       [1.3 e+01, 0.0 e+00, 1.8 e+01, ...],
       ...,
       [4.5 e-02, 0.0 e+00, 1.3 e+01, ...],
       [5.2 e+00, 0.0 e+00, 1.8 e+01, ...],
       [1.2 e+00, 0.0 e+00, 8.1 e+00, ...]])

# The values computed by fit (and used for the imputation): the column means
trf.statistics_
Out[78]:
array([3.2 e+00, 1.1 e+01, 1.0 e+01, 6.7 e-02, 5.4 e-01, 6.3 e+00, 6.8 e+01, 3.8 e+00, 8.7 e+00, 3.8 e+02, 1.8 e+01, 3.6 e+02, 1.2 e+01])

# The same (already fitted) transformer is also applied to the test partition
X_test = trf.transform(X_test)
Transformer: transform
• Notice that the same transformation is applied to
train and test
In [64]: X_test
Out[64]:
array([
[9.2 e-02, 0.0 e+00, 2.5 e+01, ...],
[2.5 e+01, nan, 1.8 e+01, ...],
[7.0 e+00, 0.0 e+00, 1.8 e+01, ...],
...,
[1.5 e+01, 0.0 e+00, 1.8 e+01, ...],
[2.0 e-01, 2.2 e+01, 5.8 e+00, ...],
[3.4 e-01, 0.0 e+00, 7.3 e+00, ...]])
X_test = trf.transform(X_test)
array([
[9.2 e-02, 0.0 e+00, 2.5 e+01, ...],
[2.5 e+01, 1.1 e+01, 1.8 e+01, ...],
[7.0 e+00, 0.0 e+00, 1.8 e+01, ...],
...,
[1.5 e+01, 0.0 e+00, 1.8 e+01, ...],
[2.0 e-01, 2.2 e+01, 5.8 e+00, ...],
[3.4 e-01, 0.0 e+00, 7.3 e+00, ...]])
# Complete code
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
boston = datasets.load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)
X_train[1, 1] = np.nan
X_test[1, 1] = np.nan
X_train
X_test
trf = SimpleImputer(strategy='mean')
trf = trf.fit(X_train)
trf.statistics_
X_train = trf.transform(X_train)
X_test = trf.transform(X_test)
Pipelines in Scikit Learn
• Let's put transformers and classifiers/regressors together: pipelines
• A sequence of transformers IS a transformer:
  – transformer + transformer + ... + transformer ≡ transformer
  – that means that it has the .fit and .transform methods
trf = Pipeline([
('impute', imputer),
('select', selector)])
trf is a sequence of transformers; therefore, trf IS a transformer (with .fit and .transform methods).

[Diagram: trf = impute → select]
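• The imputer and selector used in this pipeline are defined a few slides later; for completeness, a sketch of the whole construction (k=3 and the 'mean' strategy are the values used in this tutorial):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

imputer = SimpleImputer(strategy='mean')
selector = SelectKBest(f_regression, k=3)

trf = Pipeline([
    ('impute', imputer),
    ('select', selector)])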
A transformer pipeline:
Accessing the individual steps
• We can fit the transformer pipeline and then access each step (tab completes the step names), either by name with .named_steps, with the shorter [] syntax, or by integer position:

trf = trf.fit(X_train, y_train)

trf.named_steps['impute']    # or, shorter: trf['impute'], or by position: trf[0]
trf.named_steps['select']    # or, shorter: trf['select'], or by position: trf[1]

# The imputation step
In [36]: trf['impute']
Out[36]:
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

# The feature selection step
In [37]: trf['select']
Out[37]: SelectKBest(k=3, score_func=<function f_regression at 0x0000018F017A3E58>)
A transformer pipeline:
Getting the properties of each individual step

# The imputation step: these values will be used for imputation
In [126]: trf.named_steps['impute'].statistics_
Out[126]:
array([3.2 e+00, 1.1 e+01, 1.0 e+01, ...])

# The feature selection step: which attributes have been selected
trf.named_steps['select'].get_support()

# X_train before the transformation (all 13 attributes, with a nan)
[[2.9 e-01 0.0 e+00 6.2 e+00 ... 1.7 e+01 3.7 e+02 3.9 e+00]
 [5.0 e-02      nan 6.0 e+00 ... 1.6 e+01 3.9 e+02 1.2 e+01]
 [1.3 e+01 0.0 e+00 1.8 e+01 ... 2.0 e+01 1.3 e+02 1.3 e+01]
 ...
 [4.5 e-02 0.0 e+00 1.3 e+01 ... 1.6 e+01 3.9 e+02 1.3 e+01]
 [5.2 e+00 0.0 e+00 1.8 e+01 ... 2.0 e+01 3.7 e+02 1.8 e+01]
 [1.2 e+00 0.0 e+00 8.1 e+00 ... 2.1 e+01 3.7 e+02 2.1 e+01]]

X_train = trf.transform(X_train)

# Attributes 5, 10 and 12 have been selected, and the np.nan's have been imputed
[[ 7.68 17.4   3.92]
 [ 5.70 16.9  12.43]
 [ 3.86 20.2  13.33]
 ...
 [ 5.88 16.4  13.51]
 [ 6.05 20.2  18.76]
 [ 5.57 21.   21.02]]
Hyper-parameter tuning of
pipelines

# The three steps of the clf pipeline (impute → select → knn_regression)
imputer = SimpleImputer(strategy='mean')
selector = SelectKBest(f_regression, k=3)
knn = KNeighborsRegressor()

# Defining the hyper-parameter space
from sklearn.model_selection import GridSearchCV
param_grid = {
    'select__k': [2,3,4],
    'knn_regression__n_neighbors': [1,3,5]
}

# The best hyper-parameter values (and their scores) can be accessed
clf_grid.best_params_
Out[]: {'knn_regression__n_neighbors': 5, 'select__k': 3}

clf_grid.best_score_
Out[]: -20.14685427728613
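• The construction of clf and clf_grid is not shown on these slides; a plausible sketch consistent with the outputs above (the cv value and the scoring choice are assumptions):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# The full pipeline: imputation + feature selection + KNN regression
clf = Pipeline([
    ('impute', imputer),
    ('select', selector),
    ('knn_regression', knn)])

clf_grid = GridSearchCV(clf, param_grid,
                        cv=5, scoring='neg_mean_squared_error')
clf_grid = clf_grid.fit(X_train, y_train)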
Hyper-parameter tuning of
pipelines
• We can even get the optimized pipeline itself:
clf_grid.best_estimator_
Out[]:
Pipeline(memory=None,
steps=[('impute',
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=nan, strategy='mean',
verbose=0)),
('select',
SelectKBest(k=3,
score_func=<function f_regression at 0x0000012D3D2FC798>)),
('knn_regression',
KNeighborsRegressor(algorithm='auto', leaf_size=30,
metric='minkowski', metric_params=None,
n_jobs=None, n_neighbors=5, p=2,
weights='uniform'))],
verbose=False)
Hyper-parameter tuning of
pipelines
• Note: if needed, all pipeline hyper-parameters can
be obtained with method .get_params()
clf.get_params()
'impute__add_indicator': False,
'impute__copy': True,
'impute__fill_value': None,
'impute__missing_values': nan,
'impute__strategy': 'mean',
'impute__verbose': 0,
'select__k': 10,
'select__score_func': <function sklearn.feature_selection.univariate_selection.f_regression(X, y, center=True)>,
'knn_regression__algorithm': 'auto',
'knn_regression__leaf_size': 30,
'knn_regression__metric': 'minkowski',
'knn_regression__metric_params': None,
'knn_regression__n_jobs': None,
'knn_regression__n_neighbors': 5,
'knn_regression__p': 2,
'knn_regression__weights': 'uniform'}
Hyper-parameter tuning of
pipelines
• and they can also be set with .set_params, like this:
clf = clf.set_params(**{'knn_regression__n_neighbors':10})
clf.get_params()
'impute__add_indicator': False,
'impute__copy': True,
'impute__fill_value': None,
'impute__missing_values': nan,
'impute__strategy': 'mean',
'impute__verbose': 0,
'select__k': 10,
'select__score_func': <function sklearn.feature_selection.univariate_selection.f_regression(X, y, center=True)>,
'knn_regression__algorithm': 'auto',
'knn_regression__leaf_size': 30,
'knn_regression__metric': 'minkowski',
'knn_regression__metric_params': None,
'knn_regression__n_jobs': None,
'knn_regression__n_neighbors': 10,
'knn_regression__p': 2,
'knn_regression__weights': 'uniform'}
Caching steps in a pipeline
• For hyper-parameter tuning, some of the transformers in
the pipeline should be fitted just once
• For example, ordering the features should be done only once (in principle, the same ordering of the features is going to be obtained every time); see the caching sketch after the param_grid below.
param_grid = {
'select__k': [2,3,4],
'knn_regression__n_neighbors': [1,3,5]
}
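• A minimal caching sketch (assuming the same imputer/selector/knn pipeline as before; Pipeline's memory argument caches fitted transformers on disk, so identical fits during the grid search are reused):

from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

cachedir = mkdtemp()                          # directory where fitted transformers are cached
cached_pipe = Pipeline([('impute', imputer),
                        ('select', selector),
                        ('knn_regression', knn)],
                       memory=cachedir)

clf_grid = GridSearchCV(cached_pipe, param_grid, cv=5,
                        scoring='neg_mean_squared_error')
clf_grid = clf_grid.fit(X_train, y_train)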
Feature union
Feature Unions
# Just importing modules and preparing the data
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)
Feature Unions
# Now, we prepare the two sources of features/attributes: PCA and Feature Selection
# We compute two features from each
pca = PCA(n_components=2)
selection = SelectKBest(k=2)
Feature Unions
Feature Unions can be used as a standalone transformer. We fit it
with the training data and use it to transform both training and test.
# ...
# Build estimator from PCA and selection:
combined_features = FeatureUnion([("pca", pca),
                                  ("select", selection)])

# Fit it with the training data
combined_features = combined_features.fit(X_train, y_train)

# ... and use it to transform both train and test
X_train_new = combined_features.transform(X_train)
X_test_new = combined_features.transform(X_test)

[Diagram: the Data flows into both pca and select; their outputs are concatenated into the new feature matrix.]
Feature Unions
Feature Unions can also be used as a transformer step in a pipeline.
# ...
# Build estimator from PCA and selection:
combined_features = FeatureUnion([("pca", pca),
                                  ("select", selection)])

knn = KNeighborsRegressor()

# Construct the pipeline of pca&select + knn
pca_sel_knn = Pipeline([("features", combined_features),
                        ("knn", knn)])

# Fit it
pca_sel_knn = pca_sel_knn.fit(X_train, y_train)

# And use it for making predictions for the train and test datasets
pred_train = pca_sel_knn.predict(X_train)
pred_test = pca_sel_knn.predict(X_test)

[Diagram: pca_sel_knn = features (pca + select) → knn; X_train flows through the features step and into knn to produce the model.]
Feature Unions
We can still access each one of the steps in the pipeline:

pca_sel_knn['features'].transformer_list[0]
Out[]: ('pca',
        PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
            svd_solver='auto', tol=0.0, whiten=False))

pca_sel_knn['features'].transformer_list[1]
Out[]: ('select',
        SelectKBest(k=2, score_func=<function f_classif at 0x0000012D3D2F7EE8>))

pca_sel_knn['knn']
Out[]: KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                           weights='uniform')

[Diagram: features = pca (transformer_list[0]) + select (transformer_list[1]), followed by knn.]
Feature Unions
... and use the individual steps to transform data!

X_train_transformed = pca_sel_knn['features'].transform(X_train)
print(X_train_transformed[:5,:])
Out[]:
[[-0.0 -0.2  3.6  1.3]
 [ 2.0  0.0  5.5  1.8]
 [-2.1  0.7  1.7  0.4]
 [ 1.3  0.6  4.7  1.4]
 [ 1.6 -0.5  5.1  2.4]]

X_train_transformed = pca_sel_knn['features'].transformer_list[0][1].transform(X_train)
print(X_train_transformed[:5,:])
Out[]:
[[-0.0 -0.2]
 [ 2.0  0.0]
 [-2.1  0.7]
 [ 1.3  0.6]
 [ 1.6 -0.5]]
Feature Unions: exercise
• Create a FeatureUnion that selects the single most relevant attribute according to each of three ranking methods:
  – f_classif
  – mutual_info_classif
  – chi2
• First, use it as a standalone transformer and check that it works (i.e. that when it is used to transform a dataset (X_test, for instance), three features are created).
• Then use it in a pipeline together with KNN. Fit the pipeline and check that the three features are being created. You will need to access the FeatureUnion step in the pipeline and use it to transform a dataset (X_test, for instance), and see that three features are created.
• This transformer is not very useful, as the three methods will usually select the same attribute. It is just for practising.
Feature Unions: exercise
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, chi2
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=33)
...
Feature Unions: exercise
...
# Combine the three features:
combined_features = FeatureUnion([("f1", first_selector),
("f2", second_selector),
("f3", third_selector)])
Out[]:
array([[4.2, 1.3, 4.2],
[4.4, 1.4, 4.4],
[1.6, 0.2, 1.6],
[4.6, 1.5, 4.6],
[5.6, 1.4, 5.6]])
Feature Unions: exercise
...
# Combine the three features:
combined_features = FeatureUnion([("f1", first_selector),
("f2", second_selector),
("f3", third_selector)])
knn = KNeighborsRegressor()
# Construct the pipeline
f1f2f3_knn = Pipeline([("features", combined_features),
("knn", knn)])
# Fit it
f1f2f3_knn = f1f2f3_knn.fit(X_train, y_train)
# We access to the 'features' step of the trained pipeline and use it to transform the test set
new_X_test = f1f2f3_knn['features'].transform(X_test)
# We see that the new data matrix has three features
print(new_X_test[:5,:])
Out[]:
array([[4.2, 1.3, 4.2],
[4.4, 1.4, 4.4],
[1.6, 0.2, 1.6],
[4.6, 1.5, 4.6],
[5.6, 1.4, 5.6]])
Transforming individual features
• Up to now, all pre-processing steps process all attributes in the
dataset
• But in some cases, different attributes/features need to follow
different pre-processing steps.
• For instance, categorical attributes should undergo some pre-
processing and numerical attributes some other pre-processing.
• ColumnTransformer can be used for that
• Important: all pre-processing steps in a pipeline transform numpy
arrays into numpy arrays, but ColumnTransformer can start from
Pandas dataframes (and transform them into numpy arrays)
Transforming individual features
• Let's suppose that we start with the titanic
dataset which is a Pandas dataframe
[Table: the titanic dataframe, with categorical attributes, numerical attributes and the class y.]
Transforming individual features
• Each attribute or each type of attribute (numeric, categorical, ...) can be
transformed in a different way
– https://scikit-learn.org/stable/modules/compose.html#pipeline
[Diagram: clf = preprocessor → classifier. The preprocessor handles two groups of columns: 'num' (age, fare) → imputer (median) → scaler; 'cat' (embarked, sex, pclass) → imputer (constant) → onehot.]
Transforming individual features
• The nested hyper-parameters of each step can be accessed (e.g. for grid search) with the usual double-underscore syntax, for instance:
  preprocessor__num__imputer__strategy
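• A sketch of the pipeline in the diagram, following the standard sklearn ColumnTransformer pattern (the column names come from the diagram; the final classifier is an assumption, since the diagram only says "classifier"):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())])

# Nested hyper-parameters can then be tuned, e.g.:
# param_grid = {'preprocessor__num__imputer__strategy': ['mean', 'median']}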
Function transformers
# A FunctionTransformer turns a plain function into a transformer usable in a pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor

def drop_first_column(X):
    return X[:, 1:]

knn = KNeighborsRegressor()
remove_column_1 = FunctionTransformer(drop_first_column)

pipe = Pipeline([
    ('drop_col_1', remove_column_1),
    ('knn', knn)
])
Creating new transformers for
pipelines
• There are cases where you want to do some pre-processing, but sklearn does not provide that operation to be included in your pipeline.
• And FunctionTransformer cannot be used (for instance, because the transformation has to learn something from the training data).
• But you can extend sklearn by creating your own new pre-processing steps.
• We are going to program a transformer for "getting just the first column" (although this is so simple that it could also be achieved via FunctionTransformer).
A simple (not very useful) transformer

class get_one_col(TransformerMixin):
    def __init__(self):
        pass

• Now, my transformer is initialized:
one_col_trans = get_one_col()
• ... and then fitted:
one_col_trans.fit(X, y)
A simple transformer
(selecting first column)

class get_one_col(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return(self)

    def transform(self, X):
        return(X[:,[0]])

• fit is the operation that trains the transformer. This particular transformer always selects column 0, independently of the training data. Therefore, .fit just returns the transformer (self) without changing it. That is, .fit does nothing.
• .transform is the operation that transforms the data. In this case, we just select column 0.
A simple transformer
(selecting first column)
• Let's try it
• We first import:

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.base import TransformerMixin

class get_one_col(TransformerMixin):
    def __init__(self):
        pass
    ...

• Now, my transformer is initialized:
one_col_trans = get_one_col()
• ... and then fitted (in this simple case, fitting does nothing):
one_col_trans.fit(X, y)

# Transform the data: only the first column is kept
XX = one_col_trans.transform(X)
print(XX[:10,:])
[[5.1]
 [4.9]
 [4.7]
 ...]
Using our transformer in a pipeline

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)

class get_one_col(TransformerMixin):
    def __init__(self):
        pass
    ...

new_X_test[:5,:]
Out[]:
array([[14.1],
       [15.6],
       [ 9.7],
       [15.4],
       [15.7]])
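• The construction of the pipeline is not shown on this slide; a minimal sketch of how our transformer could be combined with KNN (the variable names are illustrative):

knn = KNeighborsRegressor()
one_col_knn = Pipeline([
    ('first_col', get_one_col()),
    ('knn', knn)])

one_col_knn = one_col_knn.fit(X_train, y_train)
# Access the transformer step and use it to transform the test partition
new_X_test = one_col_knn['first_col'].transform(X_test)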
A more complicated transformer
• This one is going to do imputation of numerical attributes, but using the first quartile instead of the median or the mean.
• In this case, fitting the transformer puts some information inside the transformer (self).

from sklearn.base import TransformerMixin
import numpy as np

class SimpleImputerQuartile(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # nanquantile computes quantiles, while ignoring nan
        self.statistics_ = np.nanquantile(X, 0.25, axis = 0)
        return(self)

    def transform(self, X):
        for j in range(X.shape[1]):
            for i in range(X.shape[0]):
                if(np.isnan(X[i,j])):
                    X[i,j]=self.statistics_[j]
        return(X)
A more complicated transformer
• Let's analyze the .fit method:

def fit(self, X, y=None):
    self.statistics_ = np.nanquantile(X, 0.25, axis = 0)
    return(self)

# The transformer is created and fitted first
my_quartile_imputer = SimpleImputerQuartile()
my_quartile_imputer = my_quartile_imputer.fit(X)

# input X (with nan's)
print(X[:5,:])
[[nan 3.5 1.4 0.2]
 [4.9 nan 1.4 0.2]
 [4.7 3.2 nan 0.2]
 [4.6 3.1 1.5 nan]
 [5.  3.6 1.4 0.2]]

# after transforming, each nan is replaced by the first quartile of its column
XX = my_quartile_imputer.transform(X)
print(XX[:5,:])
[[5.1 3.5 1.4 0.2]
 [4.9 2.8 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.6 3.1 1.5 0.3]
 [5.  3.6 1.4 0.2]]
A more complicated transformer
• We can also use our SimpleImputerQuartile in a
pipeline:
quartile_imputer = SimpleImputerQuartile()
knn = KNeighborsRegressor()
qi_knn = Pipeline([
('quartile_imputer', quartile_imputer),
('knn', knn)
])
pipe = qi_knn.fit(X,y)
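• Once fitted, the pipeline is used like any other estimator, for instance:

y_pred = pipe.predict(X)    # imputes with the first quartile and then applies KNN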