
Model Assessment and Selection
Generalisation
• Generalization is how well a model trained on the training set predicts the right output for new instances.
• The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions on data the model has never seen.
• Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms.
• Underfitting is the production of a machine learning model that is not complex enough to accurately capture the relationships between a dataset's features and the target variable.
• Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
• A high-bias model results from not learning the data well enough; future predictions will be unrelated to the data and therefore incorrect.
• A high-variance model results from learning the data too well; it therefore becomes impossible to predict the next point accurately.
High bias (underfitting):
• Both training and test errors are large.
• Adding more training examples won't help, as the model itself is too simple.
• Remedy: use a more complex model or increase the number of features.

High variance (overfitting):
• Small training error and large test error.
• Adding more training examples can possibly help, because we possibly have a complex model but not enough training data to learn it well.
• Remedy: regularize the model to prevent overfitting, use fewer features, or switch to a simpler model.

Bias-Variance trade-off: to balance, keep both the bias and the variance low.
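As an illustration of this trade-off (a minimal sketch, not taken from the slides; the synthetic data and polynomial degrees are assumptions), training and test errors can be compared for models of increasing complexity:

# Minimal sketch: compare train/test error for models of increasing complexity
# to see underfitting (degree 1) and overfitting (degree 15) on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):          # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))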
Addressing overfitting
• Reduce the number of features.
• Regularization: keep all the features, but reduce the magnitude of the parameters θ.
• Cost function optimization for regularization: penalize large parameter values so that some of the θ parameters become very small; with regularization, the cost function is modified so that it shrinks all the parameters.
• This controls a trade-off between two goals:
  • fitting the training set well, and
  • keeping the parameters small.
Ridge Regression (L2 Regularization)
To overcome overfitting, the cost function of the regression model is modified as

J(\theta) = \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2

where λ is a hyperparameter (the regularization parameter).

The penalty term keeps the parameter values θ_j small, which discourages overly complex fits and hence reduces overfitting.
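As a quick illustration (not part of the slides), ridge regression is available in scikit-learn, where the parameter alpha plays the role of λ:

# Minimal sketch: ridge regression in scikit-learn; alpha corresponds to λ above.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0)             # larger alpha => stronger shrinkage of θ
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))   # R^2 on held-out data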


Lasso Regression (L1 Regularization)
• Least Absolute Shrinkage and Selection Operator.
• Prevents overfitting and also helps with feature selection, since it can drive some coefficients exactly to zero.
• The cost function is

J(\theta) = \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|

where λ is a hyperparameter (the regularization parameter).


Elastic-Net Regression
• Combines both L1 and L2 regularization.
• The cost function is

J(\theta) = \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda_1 \sum_{j=1}^{m} \theta_j^2 + \lambda_2 \sum_{j=1}^{m} |\theta_j|

where λ₁ and λ₂ are hyperparameters (regularization parameters) weighting the L2 and L1 penalties.
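A similar sketch (not from the slides) for L1 and combined L1/L2 regularization in scikit-learn; in ElasticNet, alpha sets the overall penalty strength and l1_ratio sets the mix between the L1 and L2 terms:

# Minimal sketch: L1 and combined L1/L2 regularization in scikit-learn.
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: some coefficients become exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties

print((lasso.coef_ == 0).sum(), "coefficients zeroed by Lasso")
print((enet.coef_ == 0).sum(), "coefficients zeroed by Elastic-Net")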


Testing generalisation: k-fold Cross-validation
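In k-fold cross-validation the data is split into k folds; each fold serves once as the validation set while the remaining k − 1 folds are used for training, and the k scores are averaged. A minimal scikit-learn sketch (the model and dataset are illustrative assumptions, not from the slides):

# Minimal sketch: estimate generalisation performance with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, then the average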
Testing generalisation: Bootstrap sampling
It uses the technique of Simple Random Sampling with Replacement (SRSWR).
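A small NumPy sketch (not in the slides) of drawing a bootstrap sample with replacement:

# Minimal sketch: simple random sampling with replacement (SRSWR).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                               # pretend dataset of 10 examples
bootstrap = rng.choice(data, size=data.size, replace=True)
print(bootstrap)                                   # some examples repeat, some are left out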
Ensemble Learning Methods
• Enhance the performance of machine learning models.
• Individual models (known as weak learners) have either high bias or high variance, so they cannot learn efficiently and perform poorly.
• Multiple machine learning models are combined to obtain a more accurate model.
• Ensemble learning tries to balance the bias-variance trade-off.
• Bagging, boosting and stacking are the three most popular ensemble learning techniques.
Bagging (Bootstrap Aggregating)
• Proposed by Leo Breiman in 1994.
• Bagging combines weak learners that have high variance.
• It thus aims to produce a model with lower variance than the individual weak models.
• The weak learners are generally homogeneous.
• Bagging is the combination of Bootstrapping and Aggregation.
Bootstrapping
• Involves resampling subsets of data with replacement from an initial dataset.
• These subsets of data are called bootstrapped datasets or, simply, bootstraps.
• Resampling 'with replacement' means an individual data point can be sampled multiple times.
• Each bootstrap dataset is used to train a weak learner.
• Both row sampling and feature sampling can be used in the bagging technique.
Aggregating
• The individual weak learners are trained independently from
each other. Each learner makes independent predictions.
• The results of those predictions are aggregated at the end to get
the overall prediction.
• Max Voting is commonly used for classification problems. It
consists of taking the mode of the predictions (the most
occurring prediction).
• Averaging is generally used for regression problems. It involves
taking the average of the predictions.
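A tiny NumPy sketch (not from the slides; the prediction arrays are made up) of both aggregation rules:

# Minimal sketch: aggregating predictions from three hypothetical weak learners.
import numpy as np

clf_preds = np.array([[1, 0, 1],       # class predictions of 3 classifiers for 3 samples
                      [1, 1, 0],
                      [0, 0, 1]])
# max voting: column-wise mode of the predicted class labels
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
print(majority)                        # -> [1 0 1]

reg_preds = np.array([[2.0, 3.1],      # regression predictions of 3 models for 2 samples
                      [1.8, 2.9],
                      [2.2, 3.0]])
print(reg_preds.mean(axis=0))          # averaging -> [2. 3.]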
As an example, the snippet below illustrates a bagging ensemble (using BaggingClassifier) of KNeighborsClassifier estimators, each built on random subsets of 50% of the samples and 50% of the features.

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)
The following example illustrates how the decision regions may change when a soft VotingClassifier is used, based on a Decision Tree, a K-nearest-neighbour classifier, and a Support Vector Machine (RBF kernel):
>>> from sklearn import datasets
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import SVC
>>> from itertools import product
>>> from sklearn.ensemble import VotingClassifier
>>> # Loading some example data
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [0, 2]]
>>> y = iris.target
>>> # Training classifiers
>>> clf1 = DecisionTreeClassifier(max_depth=4)
>>> clf2 = KNeighborsClassifier(n_neighbors=7)
>>> clf3 = SVC(kernel='rbf', probability=True)
>>> eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)], voting='soft', weights=[2, 1, 2])
>>> clf1 = clf1.fit(X, y)
>>> clf2 = clf2.fit(X, y)
>>> clf3 = clf3.fit(X, y)
>>> eclf = eclf.fit(X, y)
Random Forest
• Decision trees are prone to overfitting; because of this, a single decision tree can't be relied on for making predictions.
• A random forest is a bagging technique built from multiple decision trees (the weak learners are homogeneous).
• Both row sampling and feature sampling can be used.
• Pruning may be done.
• It improves the prediction accuracy of decision trees.
• The resulting random forest has a lower variance compared to the individual trees.
• Maximum voting is used for classification and averaging for regression.
• n_estimators: the required number of trees in the random forest; predictions are then obtained with y_pred = classifier.predict(x_test) (see the sketch below).
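A minimal scikit-learn sketch (the dataset and the names classifier and x_test are illustrative, not from the slides):

# Minimal sketch: random forest classifier; n_estimators is the number of trees.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)        # majority vote across the trees
print(classifier.score(x_test, y_test))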
Boosting
• Boosting is used for combining weak learners with high bias.
• It produces a model with a lower bias than that of the individual models.
• Boosting involves sequentially training weak learners; each subsequent learner improves on the errors of the previous learners in the sequence.
• A sample of data is first taken from the initial dataset and used to train the first model; the model makes its predictions, which may be correct or incorrect.
• The samples that are wrongly predicted are reused for training the next model.
• In this way, subsequent models can improve on the errors of previous models.
• Unlike bagging, which aggregates prediction results at the end, boosting aggregates the results at each step, using max voting or weighted averaging.
Adaboost (Adaptive Boosting)
• The most common weak learner used is a decision tree with one level, i.e. a single split (called a decision stump).
• Wherever the model performs badly, it needs improvement and more focus compared to where it performs better.
• Example dataset:

  X1: 1 2 3 4 5 6 7
  X2: 7 6 5 4 3 2 1
  Y : R B R R B R B
FIRST WEAK LEARNER: say the decision stump splits at X2 >= 4 (YES → RED, NO → BLUE).

  Actual Y   Predicted Y
  R          R
  B          R
  R          R
  R          R
  B          B
  R          B
  B          B

Two of the seven examples are misclassified. (The slide also shows the points on an X1-X2 scatter plot.)

INPUT TO THE SECOND WEAK LEARNER:
• The dataset is given weights for sampling (more weight is given to examples with a wrong prediction).
• Multiply the weight of wrongly predicted examples by e^(λk) and of correctly predicted examples by e^(−λk).
• λ is a learning rate (here chosen between 0 and 1).
• k = (1/2) · ln((1 − error)/error); with error = 2/7 = 0.286, k = (1/2) · ln((1 − 0.286)/0.286) = 0.4581.
• With λ = 1: e^(λk) = e^(0.4581) = 1.5811 and e^(−λk) = 0.6325.
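A tiny check (not in the slides) of this weight-update arithmetic:

# Minimal sketch: AdaBoost-style weight-update factors for the first stump.
import math

error = 2 / 7                                  # two of seven examples misclassified
k = 0.5 * math.log((1 - error) / error)        # "amount of say" of this stump
lam = 1.0                                      # learning rate chosen in the slides
up, down = math.exp(lam * k), math.exp(-lam * k)
print(round(k, 4), round(up, 4), round(down, 4))   # 0.4581 1.5811 0.6325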

Normalized Random
Actual Y Predicted Y Weight weight Sampling
R R (1/7)×(0.6325)=0.0903 0.1 [0.00 – 0.10]
B R (1/7)×(1.5811)=0.2259 0.25 [0.10 – 0.35]
R R (1/7)×(0.6325)=0.0903 0.1 [0.35 – 0.45]
R R (1/7)×(0.6325)=0.0903 0.1 [0.45 – 0.55]
B B (1/7)×(0.6325)=0.0903 0.1 [0.55 – 0.65]
R B (1/7)×(1.5811)=0.2259 0.25 [0.65 – 0.90]
B B (1/7)×(0.6325)=0.0903 0.1 [0.90 – 1.00]
Resampled dataset (drawn using the random numbers and the sampling ranges above):

  Rand #    X1  X2  Y
  0.612092  5   3   B
  0.932126  7   1   B
  0.32812   2   6   B
  0.127205  2   6   B
  0.172399  2   6   B
  0.473324  4   4   R
  0.238094  2   6   B

(The slide also shows these points on an X1-X2 scatter plot.)
SECOND WEAK LEARNER: say the decision stump splits at X1 < 3 (YES → BLUE, NO → RED); this gives 2 misclassifications.

• λ is a learning rate (here chosen between 0 and 1).
• k = (1/2) · ln((1 − error)/error); with error = 2/7 = 0.286, k = 0.4581.
• With λ = 1: e^(λk) = e^(0.4581) = 1.5811 and e^(−λk) = 0.6325.

  Actual Y  Predicted Y  Weight                   Normalized weight  Random sampling range
  B         R            0.10 × 1.5811 = 0.1581   0.15625            [0.00 – 0.16]
  B         R            0.10 × 1.5811 = 0.1581   0.15625            [0.16 – 0.31]
  B         B            0.25 × 0.6325 = 0.1581   0.15625            [0.31 – 0.47]
  B         B            0.25 × 0.6325 = 0.1581   0.15625            [0.47 – 0.63]
  B         B            0.25 × 0.6325 = 0.1581   0.15625            [0.63 – 0.78]
  R         R            0.10 × 0.6325 = 0.0632   0.0625             [0.78 – 0.84]
  B         B            0.25 × 0.6325 = 0.1581   0.15625            [0.84 – 1.00]
Resampled dataset for the third weak learner:

  Rand #  X1  X2  Y
  0.1465  5   3   B
  0.2749  7   1   B
  0.6467  2   6   B
  0.0360  5   3   B
  0.0472  5   3   B
  0.8494  2   6   B
  0.0953  5   3   B

(The slide again shows these points on an X1-X2 scatter plot.)

THIRD WEAK LEARNER: say the decision stump splits at X1 + X2 <= 8 (YES → BLUE, NO → RED).
There is no misclassification, so every example has an equal probability of being selected for the next sample.
The three weak learners (decision stumps):

  Stump 1: X2 >= 4 ?      YES → RED,  NO → BLUE
  Stump 2: X1 < 3 ?       YES → BLUE, NO → RED
  Stump 3: X1 + X2 <= 8 ? YES → BLUE, NO → RED

Max Voting across the three stumps:
  Say X1 = 1, X2 = 7: RED, BLUE, BLUE → Class: BLUE
  Say X1 = 3, X2 = 3: BLUE, RED, BLUE → Class: BLUE
  Say X1 = 5, X2 = 5: RED, RED, RED → Class: RED
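For completeness, a hedged scikit-learn sketch (not part of the worked example above) of AdaBoost on the same toy data; the default base learner in AdaBoostClassifier is a depth-1 decision tree, i.e. a decision stump:

# Minimal sketch: AdaBoost with three boosting rounds on the toy dataset.
from sklearn.ensemble import AdaBoostClassifier

X = [[1, 7], [2, 6], [3, 5], [4, 4], [5, 3], [6, 2], [7, 1]]   # (X1, X2) pairs from the example
y = ['R', 'B', 'R', 'R', 'B', 'R', 'B']

ada = AdaBoostClassifier(n_estimators=3, learning_rate=1.0, random_state=0)
ada.fit(X, y)
print(ada.predict([[1, 7], [3, 3], [5, 5]]))   # predictions for the three query points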
Gradient Boosting
• Gradient tree boosting is a very useful algorithm for both regression and classification problems.
• It handles mixed data types and is quite robust to outliers.
• Additionally, it has better predictive power than many other algorithms.
• However, its sequential architecture makes it unsuitable for parallelisation, and therefore it does not scale well to large datasets. For datasets with a large number of classes, it is recommended to use RandomForestClassifier instead.
• Gradient boosting typically uses decision trees to build a prediction model based on an ensemble of weak learners, applying an optimization algorithm to the cost function.
MODEL 1

  IQ   CGPA  Salary  Predicted 1  Pseudo 1
  90   8     3       4.8          -1.8
  100  7     4       4.8          -0.8
  110  6     8       4.8           3.2
  120  9     6       4.8           1.2
  80   5     3       4.8          -1.8

• The initial prediction is the mean salary = 4.8.
• Pseudo residual = Actual y − Predicted y.
• A regression tree is then fitted to the pseudo residuals. In the slide it splits on IQ < 105, then on IQ < 95 (YES branch) and CGPA < 7.5 (NO branch), giving leaf values -1.8, -0.8, 3.2 and 1.2.


MODEL 2

  IQ   CGPA  Salary  Predicted 1  Pseudo 1  Predicted 2  Pseudo 2  Predicted 3
  90   8     3       4.8          -1.8      4.62         -1.62     4.458
  100  7     4       4.8          -0.8      4.72         -0.72     4.648
  110  6     8       4.8           3.2      5.12          2.88     5.408
  120  9     6       4.8           1.2      4.92          1.08     5.028
  80   5     3       4.8          -1.8      4.62         -1.62     4.458

• The second tree has the same structure (IQ < 105, then IQ < 95 and CGPA < 7.5) and is fitted to the new pseudo residuals, giving leaf values -1.62, -0.72, 2.88 and 1.08.
• Prediction after model 1 = Mean + α × (output of tree 1), where α is the learning rate (say 0.1).
• Prediction after model 2 = Mean + 0.1 × (output of tree 1) + 0.1 × (output of tree 2).
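A tiny numeric check (not from the slides) of this update rule for the first row of the table (IQ = 90, Salary = 3):

# Minimal sketch: gradient-boosting update for one example (IQ=90, Salary=3).
mean = 4.8                 # initial prediction (mean salary)
lr = 0.1                   # learning rate α
tree1_leaf = 3 - mean      # pseudo residual learned by tree 1: -1.8
pred_after_1 = mean + lr * tree1_leaf             # 4.62
tree2_leaf = 3 - pred_after_1                     # new pseudo residual: -1.62
pred_after_2 = pred_after_1 + lr * tree2_leaf     # 4.458
print(round(pred_after_1, 3), round(pred_after_2, 3))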
• AdaBoost: uses decision stumps; prioritizes models based on their weights.
• Gradient Boost: uses decision trees with 8 to 32 leaf nodes; uses the learning-rate concept.
• For classification problems, instead of the mean, log(odds) is used to compute the pseudo residuals.

  IQ   CGPA  Salary  Salary (encoded)  Pseudo 1
  90   8     L       0                 -0.41
  100  7     H       1                  0.59
  110  6     H       1                  0.59
  120  9     H       1                  0.59
  80   5     L       0                 -0.41

log(odds) = ln(3/2) = 0.41; these pseudo residuals are then used to construct the decision trees.
Gradient Boosting
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913
XgBoost (eXtreme Gradient Boosting)
• Developed by Tianqi Chen; focuses on an efficient implementation of the gradient boosting algorithm.
• It is one of the most widely used ensemble learning methods among the various tools available from the Distributed Machine Learning Community, more commonly referred to as DMLC.
• Supports various features of the scikit-learn implementation, along with new additions such as regularization.
• The model supports these three forms of gradient boosting:
  • Gradient Boosting Algorithm, including a learning rate.
  • Stochastic Gradient Boosting, with sub-sampling at the column and row levels.
  • Regularized Gradient Boosting, with both L1 and L2 regularization. This allows the algorithm to avoid overfitting.
• This algorithm is known for its accuracy and efficiency.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Encoding categorical data
# (Note: OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer is the current way to do this.)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features=[1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]   # drop one dummy column to avoid the dummy-variable trap

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting XGBoost to the training data
import xgboost as xgb
my_model = xgb.XGBClassifier()
my_model.fit(X_train, y_train)

# Predicting the Test set results
y_pred = my_model.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Stacking
• Improves the prediction accuracy of strong learners.
• Stacking aims to create a single robust model from multiple heterogeneous strong learners.
• Stacking differs from bagging and boosting in that:
  • It combines strong learners.
  • It combines heterogeneous models.
  • It consists of creating a metamodel: a model trained on a new dataset.
• The individual heterogeneous models are trained on the initial dataset. Their predictions form a single new dataset, which is used to train the metamodel that makes the final prediction. The predictions are combined using weighted averaging.
• It can combine bagged or boosted models.
The StackingClassifier and StackingRegressor provide such strategies, which can be applied to classification and regression problems.

>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> from sklearn.neighbors import KNeighborsRegressor
>>> estimators = [('ridge', RidgeCV()),
...               ('lasso', LassoCV(random_state=42)),
...               ('knr', KNeighborsRegressor(n_neighbors=20, metric='euclidean'))]
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import StackingRegressor
>>> final_estimator = GradientBoostingRegressor(n_estimators=25, subsample=0.5,
...     min_samples_leaf=25, max_features=1, random_state=42)
>>> reg = StackingRegressor(estimators=estimators, final_estimator=final_estimator)
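The snippet above only builds the stacking regressor; a short continuation (not in the slides; the diabetes dataset is used purely for illustration) to fit and score it:

# Minimal sketch: fit the stacked regressor above and evaluate it on held-out data.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg.fit(X_train, y_train)              # trains the base learners, then the metamodel
print(reg.score(X_test, y_test))       # R^2 of the final stacked prediction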
