Bagging and Boosting Regression Algorithms

This document discusses various ensemble learning techniques used to combine multiple machine learning models to improve predictive performance. It describes bagging, boosting, and other ensemble methods like max voting, averaging, weighted averaging, stacking, and blending. Bagging involves training base models on randomly sampled subsets of data and combining their predictions. Stacking uses predictions from base models as inputs to train a new meta-learning model, while blending is similar but uses a hold-out validation set instead of the full training set. These ensemble techniques aim to reduce variance and bias through diversification and aggregation of multiple models.


Bagging and Boosting Regression Algorithms

Introduction to Ensemble Learning:


 When you want to buy a new bike or cycle, will you go to the shop and purchase it right away?
 The straight answer is no. You will browse a hundred reviews before buying a new product.
 Ensemble models in machine learning operate on a similar idea.
 They combine the decisions from multiple models to improve overall performance.
 There are several algorithms used in ensemble learning.
 A diverse set of models generally performs better than any single model. This diversification in machine learning is achieved by a technique called ensemble learning.


Ensemble Techniques:
 Max Voting.
 Averaging.
 Weighted Averaging.
Max Voting:
 This voting method is generally used for classification problems.
 In this technique, multiple models are used to make a prediction for each data point.
 The prediction made by each model is treated as a vote.
 The prediction given by the majority of the models is used as the final prediction.
 For example, suppose we asked 5 colleagues to rate a movie: 3 of them rated it as 4 and two of them rated it as 5.
 Since the majority rating is 4, this will be taken as the final rating.
 We can consider this as taking the mode of all the predictions.


Max Voting - Result
Colleague 1: 5   Colleague 2: 4   Colleague 3: 5   Colleague 4: 4   Colleague 5: 4
Final Rating: 4 (the mode of the five ratings)
EXAMPLE CODE:
 x_train consists of the independent variables of the training data.
 y_train is the target variable for the training data.
 The validation set is x_test (independent variables) and y_test (target variable).
EXAMPLE CODE:
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from statistics import mode
import numpy as np

model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

# Fit each base model on the training data.
model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# Each model casts a "vote" for every observation in the validation set.
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# The final prediction is the mode (majority vote) of the three predictions.
final_pred = np.array([])
for i in range(len(x_test)):
    final_pred = np.append(final_pred, mode([pred1[i], pred2[i], pred3[i]]))

EXAMPLE CODE:
 Alternatively, the VotingClassifier module in sklearn can be used:

from sklearn.ensemble import VotingClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(random_state=1)
model2 = tree.DecisionTreeClassifier(random_state=1)

# voting='hard' applies majority voting on the predicted class labels.
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train, y_train)
model.score(x_test, y_test)
AVERAGING:
 In averaging, multiple predictions are made for each data point.
 We take the average of the predictions from all the models and use it as the final prediction.
 Averaging can be used for making predictions in regression problems, or for averaging predicted probabilities in classification problems.
 In the case below, the averaging method takes the average of all the values,
 i.e. (5 + 4 + 5 + 4 + 4) / 5 = 4.4.


Output:
Colleague 1: 5   Colleague 2: 4   Colleague 3: 5   Colleague 4: 4   Colleague 5: 4
Final Rating: 4.4 (the mean of the five ratings)
Sample Code:
# The base model classes are imported as in the max-voting example above.
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# Average the predicted class probabilities of the three models.
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
finalpred = (pred1 + pred2 + pred3) / 3
Weighted Average:
 This is an extension of the averaging method.
 All the models are assigned different weights, defining the importance of each model for the prediction.
 If two of the colleagues are critics while the others have no prior experience in this field, then the answers given by these two colleagues are given more importance than those of the other people.
 The result is calculated as (5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18) = 4.41.
Output:
          Colleague 1   Colleague 2   Colleague 3   Colleague 4   Colleague 5   Final Rating
Weight        0.23          0.23          0.18          0.18          0.18
Rating        5             4             5             4             4             4.41
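 The same weighted-average arithmetic can be checked directly with NumPy; this quick sketch only reuses the ratings and weights from the table above.

import numpy as np

ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])  # weights sum to 1.0

# np.average computes sum(ratings * weights) / sum(weights), which is approximately 4.41.
final_rating = np.average(ratings, weights=weights)
print(final_rating)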
Sample Code:
# The base model classes are imported as in the max-voting example above.
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

# Weight each model's predicted probabilities (the weights sum to 1).
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)
finalpred = (pred1 * 0.3 + pred2 * 0.3 + pred3 * 0.4)
Advanced Ensemble Techniques
 Stacking:
 Stacking is an ensemble learning technique that uses the predictions from multiple models (for example a decision tree, k-NN and SVM) to build a new model.
 This new model is used for making predictions on the test set.
 The following steps are used to create a stacked ensemble:
Steps:
1. The train set is split into 10 parts.
2. A base model is fitted on 9 parts and predictions are made on the 10th part. This is done for each part of the train set.
3. The base model (e.g. a decision tree) is then fitted on the whole train dataset.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model (say k-NN), resulting in another set of predictions for the train set and the test set.
6. The predictions from the train set are used as features to build a new (meta) model.
7. This meta model is used to make the final predictions on the test prediction set.
Sample Code:
 First, define a function to make predictions on n folds of the train and test datasets.
 This function returns the predictions for train and test for each model.
Sample Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def Stacking(model, train, y, test, n_fold):
    # shuffle=True is required when a random_state is passed to StratifiedKFold.
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)
    test_pred = np.empty((0, 1), float)
    train_pred = np.empty((0, 1), float)
    for train_indices, val_indices in folds.split(train, y.values):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train, y_val = y.iloc[train_indices], y.iloc[val_indices]
        model.fit(X=x_train, y=y_train)
        # Out-of-fold predictions become the meta-features for the train set.
        train_pred = np.append(train_pred, model.predict(x_val))
        # Test-set predictions from each fold are collected and averaged below.
        test_pred = np.append(test_pred, model.predict(test))
    test_pred = test_pred.reshape(n_fold, -1).mean(axis=0)
    return test_pred.reshape(-1, 1), train_pred
Sample Code:
 Create two base models – decision tree and k-NN:

model1 = tree.DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = Stacking(model=model1, n_fold=10,
                                   train=x_train, test=x_test, y=y_train)
train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)
Sample Code:
model2 = KNeighborsClassifier()
test_pred2, train_pred2 = Stacking(model=model2, n_fold=10,
                                   train=x_train, test=x_test, y=y_train)
train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)
Sample Code:
 Create a third model, a logistic regression, on the predictions of the decision tree and k-NN models.

df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)

model = LogisticRegression(random_state=1)
model.fit(df, y_train)
model.score(df_test, y_test)
Sample Code:
 The stacking model we have created has only two levels.
 The decision tree and k-NN models are built at level zero and the logistic regression model is built at level one.
 Multiple levels can be created in a stacking model.
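 For newer versions of scikit-learn, the same two-level design can be expressed more compactly with the built-in StackingClassifier. The snippet below is only an illustrative sketch that assumes the same x_train/y_train and x_test/y_test data used earlier; it is not part of the original example.

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Level-0 base models and a level-1 logistic regression meta-model;
# cv=10 mirrors the 10-fold scheme described in the stacking steps.
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=1)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(random_state=1),
    cv=10)
stack.fit(x_train, y_train)
print(stack.score(x_test, y_test))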
Blending:
 Blending follows the same approach as stacking, but uses only a hold-out (validation) set from the train set to make predictions.
 In the case of blending, the predictions are made on the validation set only.
 The hold-out set and its predictions are used to build a model, which is then run on the test set.
 The following is the process involved in blending:
 The train set is split into training and validation sets.
 Models are fitted on the training set.
 Predictions are made on the validation set and the test set.
 The validation set and its predictions are used as features to build a new model.
 This model is used to make the final predictions on the test set, using the test-set predictions as meta-features.
Sample Code:
 Build two models, a decision tree and k-NN, on the training set in order to make predictions on the validation set.

# x_train/y_train, x_val/y_val and x_test/y_test come from a prior
# train/validation/test split of the data.
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = model1.predict(x_val)
test_pred1 = model1.predict(x_test)
val_pred1 = pd.DataFrame(val_pred1)
test_pred1 = pd.DataFrame(test_pred1)
Sample Code:
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = model2.predict(x_val)
test_pred2 = model2.predict(x_test)
val_pred2 = pd.DataFrame(val_pred2)
test_pred2 = pd.DataFrame(test_pred2)
Blending:
 Combining the meta-features and the validation set, a logistic regression model is built to make predictions on the test set.

# Reset the indices so the original features align row-wise with the prediction columns.
df_val = pd.concat([x_val.reset_index(drop=True), val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test.reset_index(drop=True), test_pred1, test_pred2], axis=1)

model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)
Bagging:
 The main idea behind bagging is to combine the results of multiple models (for example, decision trees) to get a generalized result.
 If all the models are trained on the same input data, there is a high chance that they will give the same result.
 One of the techniques used to solve this problem is bootstrapping.
 Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.
 The size of each subset is the same as the size of the original set.
 Bagging uses these subsets (bags) to get a fair idea of the distribution (complete set).
 The size of the subsets created for bagging may also be less than the original set.
Bagging:
1. Multiple subsets are created from the original dataset by selecting observations with replacement.
2. A base model is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all the models, as in the sketch below.
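 As a concrete illustration of these steps, scikit-learn's BaggingRegressor performs the bootstrap sampling, parallel model fitting, and aggregation automatically. The snippet below is a minimal sketch on synthetic data, not code from the original slides.

from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 base models (decision trees by default), each fitted on a bootstrap
# sample (drawn with replacement) of 80% of the training set;
# the final prediction is the average of the individual predictions.
bag = BaggingRegressor(n_estimators=100, max_samples=0.8,
                       bootstrap=True, random_state=0)
bag.fit(x_tr, y_tr)
print(bag.score(x_te, y_te))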
Boosting:
 Boosting techniques are another family of methods used to build an ensemble model.
 Boosting methods build an ensemble model in an incremental way.
 The main principle is to build the model incrementally by training each base estimator sequentially.
 In order to build a powerful ensemble, these methods combine several weak learners which are sequentially trained over multiple iterations of the training data.
 The sklearn.ensemble module provides two boosting methods: AdaBoost and Gradient Tree Boosting.
AdaBoost:
 It is one of the most powerful boosting ensemble methods.
 The key lies in the way weights are given to the instances in the dataset.
 Instances that are predicted correctly receive less weight, so the algorithm pays more attention to the difficult (misclassified) instances while constructing subsequent models.
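 The snippet below is a from-scratch sketch of this reweighting idea (an illustration, not scikit-learn's exact implementation): after each round, misclassified samples receive larger weights so the next weak learner focuses on them.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, sample_weight):
    # One boosting round: fit a depth-1 stump, then up-weight its mistakes.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weight)
    miss = (stump.predict(X) != y)

    # Weighted error rate and the classic AdaBoost estimator weight alpha.
    err = np.sum(sample_weight[miss]) / np.sum(sample_weight)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

    # Misclassified samples are scaled by exp(+alpha), the rest by exp(-alpha).
    sample_weight = sample_weight * np.exp(alpha * np.where(miss, 1, -1))
    sample_weight /= sample_weight.sum()
    return stump, alpha, sample_weight

# Starting from uniform weights, calling adaboost_round repeatedly builds the ensemble.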
Classification With AdaBoost:
 The scikit-learn module provides sklearn.ensemble.AdaBoostClassifier.
 While building this classifier, the main parameter this module uses is base_estimator.
 base_estimator is the base estimator from which the boosted ensemble is built.
 If we set this parameter's value to None, the base estimator is DecisionTreeClassifier(max_depth=1).
Implementation Example:
 In the following example, we build an AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier and also predict and check its score.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)

ADBclf = AdaBoostClassifier(n_estimators=100, random_state=0)
ADBclf.fit(X, y)
Output:
 AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0, n_estimators=100, random_state=0)

 Example:
 Once fitted, we can predict new values as follows:
 print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
 Output:
 [1]
Example:
 Now, we can check the score as follows:
 ADBclf.score(X, y)
 Output:
 0.995
Example:
 We can also build the AdaBoost classifier on an external dataset.
 In the example below, we use the Pima Indians Diabetes dataset.
Example:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values

Example:
X = array[:, 0:8]
Y = array[:, 8]
seed = 5

# shuffle=True is required when a random_state is passed to KFold.
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 100

# AdaBoostClassifier has no max_features parameter, so only the
# number of estimators (trees) is set here.
ADBclf = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(ADBclf, X, Y, cv=kfold)
print(results.mean())
Output:
 0.7851435406698566
Regression With AdaBoost:
 For creating a regressor with the AdaBoost method, the scikit-learn library provides sklearn.ensemble.AdaBoostRegressor.
 While building the regressor, it uses the same parameters as sklearn.ensemble.AdaBoostClassifier.
Implementation Example:
 In the following example, we build an AdaBoostRegressor and also predict new values by using the predict() method.

from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)

ADBregr = AdaBoostRegressor(random_state=0, n_estimators=100)
ADBregr.fit(X, y)
Output:
 AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear', n_estimators=100, random_state=0)

 Example:
 Once fitted, we can predict from the regression model as follows:
 print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
 Output:
 [85.50955817]
Gradient Tree Boosting:
 It is also called Gradient Boosted Regression Trees (GBRT).
 It is a generalization of boosting to arbitrary differentiable loss functions.
 It produces a prediction model in the form of an ensemble of weak prediction models.
 It can be used for regression as well as classification problems.
 Its main advantage lies in the fact that it naturally handles mixed-type data.
Classification With Gradient Tree Boost:
 For creating a Gradient Tree Boost classifier, the scikit-learn module provides sklearn.ensemble.GradientBoostingClassifier.
 While building this classifier, the main parameter the module uses is 'loss'.
 'loss' is the loss function to be optimized.
 If we choose loss = 'deviance', it refers to deviance for classification with probabilistic outputs.
 If we set the parameter's value to 'exponential', then it recovers the AdaBoost algorithm.
 The parameter n_estimators controls the number of weak learners.
 A hyperparameter named learning_rate (in the range (0.0, 1.0]) controls overfitting via shrinkage.
Implementation Example:
 In the following example, we build a Gradient Boosting classifier by using sklearn.ensemble.GradientBoostingClassifier.
 This classifier is fitted with 50 weak learners.

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:5000], X[5000:]
y_train, y_test = y[:5000], y[5000:]

GDBclf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
                                    max_depth=1, random_state=0).fit(X_train, y_train)
GDBclf.score(X_test, y_test)
Output:
 0.8724285714285714
Example:
 The classifier can also be built on the Pima Indians Diabetes dataset used earlier:

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
Example:
seed = 5
# shuffle=True is required when a random_state is passed to KFold.
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 100
max_features = 5

GDBclf = GradientBoostingClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(GDBclf, X, Y, cv=kfold)
print(results.mean())
Output:
 0.7946582356674234
Regression With Gradient Tree Boost:
 For creating a regressor with the Gradient Tree Boost method, the scikit-learn library provides sklearn.ensemble.GradientBoostingRegressor.
 It can specify the loss function for the regressor via the parameter named 'loss'.
 The default value for loss is 'ls' (least squares).
Implementation Example:
 A Gradient Boosting regressor is built by using sklearn.ensemble.GradientBoostingRegressor.
 The mean squared error is found by using the mean_squared_error() method.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

GDBreg = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1, max_depth=1,
                                   random_state=0, loss='ls').fit(X_train, y_train)
Output:
 Once fitted, we can find the mean squared error as follows:
 mean_squared_error(y_test, GDBreg.predict(X_test))
 Output:
 5.391246106657164
Gradient Boosting:
 It is an ensemble machine learning algorithm that works well for regression as well as classification problems.
 It uses the boosting technique.
 It combines a number of weak learners to form a strong learner.
 Regression trees are used as the base learner.
 Each subsequent tree in the series is built on the errors calculated by the previous tree.
 Suppose we need to predict the age group of people using the following data.
Gradient Boosting:
 The mean age is assumed to be the predicted value for all the observations in the dataset.
 The errors are calculated using this mean prediction and the actual values of age.
Gradient Boosting:
 A tree model is created using the errors calculated above as the target variable.
 The main objective is to find the best split in order to minimize the error.
Gradient Boosting:
 The predictions made by this model are combined with the initial (mean) predictions.
 The value calculated above becomes the new prediction.
 New errors are calculated using this predicted value and the actual value.
Gradient Boosting:
 Steps 2 to 6 are repeated until the maximum number of iterations is reached, or until the error function stops changing.

 Code:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(learning_rate=0.01, random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
# 0.81621621621621621
Sample Code – Regression Problem
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
Parameters
 min_samples_split:
 Defines the minimum number of samples required in a node for it to be considered for splitting.
 It is used to control overfitting.
 min_samples_leaf:
 Defines the minimum number of samples required in a leaf node.
 Lower values should be chosen for imbalanced class problems.
Parameters
 min_weight_fraction_leaf:
 It is defined as a fraction of the total number of observations instead of an integer (as in min_samples_leaf).
 max_depth:
 It denotes the maximum depth of a tree.
 It is used to control overfitting, since a higher depth will allow the model to learn relations that are very specific to a particular sample.
Parameters
 max_leaf_nodes:
 The maximum number of terminal nodes (leaves) in a tree.
 It can be defined in place of max_depth.
 max_features:
 The maximum number of features to be considered while searching for the best split.
 The square root of the total number of features generally works well.
 Higher values can lead to overfitting.
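 These parameters are passed directly to scikit-learn's gradient boosting estimators. The grid search sketch below uses illustrative values only and assumes the same x_train/y_train data as the earlier examples.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values are hypothetical; in practice they are tuned per dataset.
param_grid = {
    'min_samples_split': [2, 10, 50],
    'min_samples_leaf': [1, 5, 20],
    'max_depth': [2, 3, 5],
    'max_features': ['sqrt', None],   # 'sqrt' = square root of the feature count
}
search = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_)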


XGBoost (Extreme Gradient Boosting):
 It is an advanced implementation of the gradient boosting algorithm.
 It is one of the most powerful algorithms used for making all kinds of predictions.
 XGBoost has high predictive power and is about 10 times faster than other gradient boosting techniques.
 It also includes a variety of regularization techniques, which reduce overfitting and improve overall performance.
 Hence it is also known as a regularized boosting technique.
XGBoost (Extreme Gradient Boosting):
 It compares favourably with other techniques in several ways:
 Regularization:
 XGBoost has better regularization than a standard GBM implementation, which helps reduce overfitting.
 Parallel Processing:
 It implements parallel processing and is faster than GBM.
XGBoost (Extreme Gradient Boosting):
 Parallel Processing:
 It also supports implementation on Hadoop.
 High Flexibility:
 It allows users to define custom optimization objectives.
 Handling Missing Values:
 XGBoost has built-in routines to handle missing values.

XGBoost (Extreme Gradient Boosting):
 Tree Pruning:
 XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards.
 It removes splits beyond which there is no positive gain.
 Built-in Cross-Validation:
 XGBoost allows the user to run a cross-validation at each iteration of the boosting process.
 This makes it easy to get the exact optimum number of boosting iterations in a single run, as in the sketch below.
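 A minimal sketch of XGBoost's built-in cross-validation, assuming a DMatrix built from the same x_train/y_train (binary target) used elsewhere in these slides; the parameter values are illustrative only.

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 3}

# The evaluation metric is reported at every boosting round; training stops
# once the metric has not improved for 10 consecutive rounds.
cv_results = xgb.cv(params, dtrain, num_boost_round=200,
                    nfold=5, early_stopping_rounds=10)
print(len(cv_results))   # number of boosting rounds actually kept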
XGBoost: Sample Code
import xgboost as xgb

model = xgb.XGBClassifier(random_state=1, learning_rate=0.01)
model.fit(x_train, y_train)
model.score(x_test, y_test)
# 0.82702702702702702
XGBoost: Sample Code
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
XGBoost: Parameters
 nthread:
 It is used for parallel processing; the number of cores in the system should be entered.
 If you wish to run on all cores, do not enter a value; the algorithm will detect it automatically.
 eta:
 Analogous to the learning rate in GBM.
 Makes the model more robust by shrinking the weights at each step.
XGBoost: Parameters
 min_child_weight:
 Defines the minimum sum of weights of all observations required in a child.
 It is used to avoid overfitting: higher values prevent the model from learning relations that might be highly specific to a particular sample.
 max_depth:
 It is used to define the maximum depth of a tree.
 A higher depth will allow the model to learn relations that are very specific to a particular sample.
XGBoost: Parameters
 max_leaf_nodes:
 The maximum number of terminal nodes or leaves in a tree.
 It can be defined in place of max_depth: since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
XGBoost: Parameters
 gamma:
 A node is split only when the resulting split gives a positive reduction in the loss function. gamma specifies the minimum loss reduction required to make a split.
 It makes the algorithm conservative; the values can vary depending on the loss function and should be tuned.

XGBoost: Parameters
 subsample:
 Same as the subsample parameter of GBM; denotes the fraction of observations to be randomly sampled for each tree.
 Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
XGBoost: Parameters
 colsample_bytree:
 It is similar to max_features in GBM; denotes the fraction of columns to be randomly sampled for each tree.
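 The parameters listed above map directly onto XGBClassifier keyword arguments. The values below are illustrative only and assume the same x_train/y_train and x_test/y_test as the earlier XGBoost examples.

import xgboost as xgb

model = xgb.XGBClassifier(
    nthread=4,             # parallel processing: number of cores to use
    learning_rate=0.1,     # eta: shrinks each step's contribution
    min_child_weight=1,    # minimum sum of instance weights in a child
    max_depth=3,           # maximum tree depth
    gamma=0,               # minimum loss reduction required to split
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    random_state=1)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))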
