
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY, RAMAPURAM CAMPUS

Parallelism of Statistics and Machine Learning
&
Logistic Regression Versus Random Forest

Dr. S. Veena, Associate Professor / CSE
Parallelism of Statistics and Machine Learning

Logistic Regression Versus Random Forest

Comparison between Regression and Machine Learning models
• Linear regression and machine learning models both try to solve the same problem in different ways.
• Consider the simple example of a two-variable equation, fitting the best possible plane:
  – Regression models try to fit the best possible hyperplane by minimizing the errors between the hyperplane and the actual observations.
  – Machine learning converts the same problem into an optimization problem in which the errors are modeled in squared form and minimized by iteratively altering the weights (a minimal sketch of this view follows below).
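The weight-update view can be illustrated with a small gradient-descent sketch. This is an illustration only; the data and variable names are assumptions, not taken from the slides.

```python
# Fit a plane y ≈ w·x + b by repeatedly altering the weights to reduce squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # two explanatory variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

w, b, lr = np.zeros(2), 0.0, 0.1                   # weights, bias, learning rate
for _ in range(500):
    err = X @ w + b - y                            # residuals of the current plane
    w -= lr * (2 / len(y)) * (X.T @ err)           # gradient of mean squared error w.r.t. w
    b -= lr * (2 / len(y)) * err.sum()             # gradient w.r.t. b

print(w, b)                                        # approaches [3, -2] and 1
```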

Dr.S.Veena,Associate Professor/CSE
4
Comparison between Regression and Machine Learning models

• In statistical modeling, samples are drawn from the population and the model is fitted on the sampled data.
• In machine learning, however, even a small number of observations, such as 30, is good enough to update the weights at the end of each iteration.

Comparison between Regression and
Machine Learning models
• Statistical models are parametric in nature, which means a model
will have parameters on which diagnostics are performed to check
the validity of the model.
• Machine learning models, in contrast, are non-parametric: they do not have any parameters or curve assumptions; these models learn by themselves from the provided data and come up with complex and intricate functions rather than fitting a predefined function.
• Multi-collinearity checks are required in statistical modeling.
• In machine learning, by contrast, the weights automatically get adjusted to compensate for the multicollinearity problem.
Note : Multicollinearity occurs when two or more independent variables are highly
correlated with one another in a regression model. This means that an
independent variable can be predicted from another independent variable in a
regression model.

Compensating Factors In Machine Learning Model

Statistical model
• Two-point validation is performed on the training data in the statistical modeling methodology:
  – overall model accuracy
  – significance tests on the individual parameters.
Machine learning model
• In machine learning, statistical diagnostics on individual variables are not performed.
• Instead, the data is split into three parts (train data - 50 percent, validation data - 25 percent, test data - 25 percent) rather than the two parts used in the statistical methodology.
• Machine learning models are developed on the training data, and their hyperparameters are tuned on the validation data to ensure an equivalent of the two-point validation.
• Thus the robustness of the models is ensured without diagnostics performed at the individual variable level.

Assumptions of linear regression
Linear regression has the following assumptions, failing
which the linear regression model does not hold true:
• The dependent variable should be a linear
combination of independent variables
• No autocorrelation in error terms
• Errors should have zero mean and be normally
distributed
• No or little multi-collinearity
• Error terms should be homoscedastic

Assumptions of linear regression

The dependent variable should be a linear combination of independent variables:
• Y should be a linear combination of the X variables. Note that even if one of the X variables (say X2) is raised to the power of 2, the equation still holds the assumption of a linear combination of variables, since the model remains linear in its coefficients.

Assumptions of linear regression

• In the preceding sample graph, linear regression was applied first and the errors show a pattern rather than being pure white noise; in this case, it is simply showing the presence of non-linearity.
• After increasing the degree of the polynomial, the errors simply look like white noise.

Assumptions of linear regression

No autocorrelation in error terms:
• The presence of correlation in the error terms penalizes model accuracy.
• How to diagnose (a minimal sketch follows below):
  – Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated.
  – d can lie between 0 and 4:
    • d ≈ 2 indicates no autocorrelation,
    • 0 < d < 2 implies positive autocorrelation,
    • 2 < d < 4 indicates negative autocorrelation.
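A minimal sketch (assumed example data, not from the slides) of computing the Durbin-Watson statistic on the residuals of a fitted OLS model:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 1)))     # intercept + one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)

residuals = sm.OLS(y, X).fit().resid
print(round(durbin_watson(residuals), 3))          # d close to 2 suggests no autocorrelation
```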

Assumptions of linear regression

Errors should have zero mean and be normally distributed:
• Errors should have zero mean for the model to produce an unbiased estimate.
• Plotting the errors will show their distribution.
• If the error terms are not normally distributed, the confidence intervals become too wide or too narrow, which leads to difficulty in estimating coefficients based on the minimization of least squares.

Assumptions of linear regression

How to diagnose:
• A Q-Q plot and the Kolmogorov-Smirnov test are helpful (a minimal sketch of both follows below).
• In the Q-Q plots, it is evident that the first chart shows normally distributed errors, as the residuals do not deviate much from the diagonal line.
• The right-hand chart clearly shows that the errors are not normally distributed.
• In these scenarios, we need to re-evaluate the variables, for example by taking log transformations, to make the residuals look as they do in the left-hand chart.
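A minimal sketch (assumed residuals) of the two normality diagnostics named above: a Q-Q plot of the residuals and a Kolmogorov-Smirnov test against a normal distribution with the residuals' mean and standard deviation.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=300)                   # stand-in for model residuals

stats.probplot(residuals, dist="norm", plot=plt)   # Q-Q plot against the normal distribution
plt.show()

stat, p_value = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std()))
print(p_value)   # a small p-value suggests the residuals are not normally distributed
```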

Assumptions of linear regression

No or little multi-collinearity:
• Multi-collinearity is the case in which independent variables are correlated with each other; this situation creates unstable models by inflating the magnitude of the coefficients/estimates.
• It also becomes difficult to determine which variable is contributing to the prediction of the response variable.
• The variance inflation factor (VIF) is calculated for each independent variable from the R-squared value obtained by regressing it on all the other independent variables (VIF = 1 / (1 − R²)); the variable with the highest VIF is then eliminated, one variable at a time (a minimal sketch follows below).
• A VIF <= 4 suggests no multi-collinearity; in banking scenarios, people use VIF <= 2.
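A minimal sketch (synthetic, deliberately collinear data; column names are assumptions) of the VIF check using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=500)    # deliberately collinear with x1
x3 = rng.normal(size=500)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x2 show a large VIF; drop the largest one and recompute
```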

Assumptions of linear regression

Errors should be homoscedastic:
• Errors should have constant variance with respect to the independent variable; when they do not (heteroscedasticity), the confidence intervals for the estimates become impractically wide or narrow, which degrades the model's performance.
• One reason homoscedasticity fails to hold is the presence of outliers in the data, which drag the model fit toward them with higher weights.
Assumptions of linear regression

How to diagnose:
• Look into the plot of residuals versus the dependent variable;
• if any cone-shaped or diverging pattern exists, it indicates that the errors do not have constant variance, which impacts the predictions.

Steps applied in linear regression modeling

The following steps are applied in linear regression modeling in industry:
1. Missing value and outlier treatment
2. Correlation check of independent variables
3. Random split of the data into train and test sets
4. Fit the model on the train data
5. Evaluate the model on the test data

Example of simple linear regression from first principles

• The wine quality data from the UCI machine learning repository is used: https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
• Simple linear regression is a straightforward approach for predicting the dependent/response variable Y given the independent/predictor variable X. It assumes a linear relationship between X and Y: Y ≈ β0 + β1X.
• β0 and β1 are two unknown constants, the intercept and slope parameters respectively. Once we determine these constants, we can utilize them for the prediction of the dependent variable.
• The ith residual is the difference between the ith observed response value and the ith response value predicted from the model; the residual sum of squares (RSS) adds up the squares of these residuals. The least squares approach chooses the estimates that minimize the RSS (a first-principles sketch follows below).
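A minimal first-principles sketch on the wine quality data (the file name and the choice of predictor column are assumptions): the least squares estimates are b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄, which minimize the residual sum of squares.

```python
import numpy as np
import pandas as pd

wine = pd.read_csv("winequality-red.csv", sep=";")   # file from the UCI repository
x = wine["alcohol"].to_numpy()                        # assumed predictor
y = wine["quality"].to_numpy()                        # response

x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()   # slope
b0 = y_bar - b1 * x_bar                                             # intercept

residuals = y - (b0 + b1 * x)
rss = (residuals ** 2).sum()                                        # residual sum of squares
print(b0, b1, rss)
```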

Example of simple linear regression from first principles

• In order to prove statistically that the linear regression is significant, we have to perform hypothesis testing.
• We start with the null hypothesis that there is no significant relationship between X and Y.
• If β1 = 0, the model shows no association between the two variables (Y = β0 + ε); this is the null hypothesis assumption.
• In order to prove this assumption right or wrong, we need to determine whether the estimate of β1 is sufficiently far from 0 that we can be confident β1 is nonzero and that there is a significant relationship between the two variables.

Example of simple linear regression from first principles

• This depends on the distribution of the estimate of β1, that is, on its mean and standard error (analogous to a standard deviation).
• If the standard error is small, even relatively small estimated values may provide strong evidence that β1 ≠ 0, and hence that there is a relationship between X and Y.
• In contrast, if SE(β1) is large, then the estimate of β1 must be large in absolute value in order for us to reject the null hypothesis.

Example of simple linear regression from first principles

• We usually perform the t-test to check how many standard deviations β1 is away from the value 0:
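The statistic has the standard form t = β̂1 / SE(β̂1), i.e., the estimated slope divided by its standard error, with n − 2 degrees of freedom in simple linear regression.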
• With this t value, we calculate the probability of observing any value equal to |t| or larger, assuming β1 = 0; this probability is known as the p-value.
• If the p-value < 0.05, β1 is significantly far from 0, hence we can reject the null hypothesis and conclude that there exists a strong relationship.
• If the p-value > 0.05, we fail to reject the null hypothesis and conclude that there is no significant relationship between the two variables.

Example of simple linear regression from first principles

• Finally, we predict the dependent variable and check the R-squared value:
• if the value is >= 0.7, the model is good enough to deploy on unseen data;
• if it is not such a good value (< 0.6), we conclude that this model is not good enough to deploy.

Machine learning models - ridge and lasso regression

• In linear regression, only the residual sum of squares (RSS) is minimized.
• In ridge and lasso regression, a penalty (also known as a shrinkage penalty) is applied to the coefficient values to regularize the coefficients, with the tuning parameter λ.
• When λ = 0, the penalty has no impact and ridge/lasso produce the same result as linear regression, whereas as λ -> ∞ the coefficients are shrunk towards zero.

Machine learning models - ridge and lasso regression
Lagrangian multipliers
● Definition of the Lagrangian: a function that describes the state of a dynamic system in terms of position coordinates and their time derivatives.
● The Lagrange multiplier, λ, measures the increase in the objective function f(x, y) that is obtained through a marginal relaxation in the constraint (an increase in k). For this reason, the Lagrange multiplier is often termed a shadow price.
• The method of Lagrange multipliers in machine learning is a simple and elegant method of finding the local minima or local maxima of a function subject to equality or inequality constraints. Lagrange multipliers are also called undetermined multipliers.

Machine learning models - ridge and lasso regression
Lagrangian multipliers
• The objective is the RSS, subject to a cost constraint (budget) s on the coefficients.
• For every value of λ, there is an s that yields the equivalent equations, i.e., the overall objective function with a penalty factor, as written below.
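For reference, the two equivalent formulations (a standard statement, with βj denoting the coefficients) are:

  Ridge: minimize RSS subject to Σ βj² ≤ s   ⇔   minimize RSS + λ Σ βj²
  Lasso: minimize RSS subject to Σ |βj| ≤ s  ⇔   minimize RSS + λ Σ |βj|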

Machine learning models - ridge and lasso regression

• For any fixed value of λ, ridge regression fits only a single model, and the model-fitting procedure can be performed very quickly.
• One disadvantage of ridge regression:
  – When the number of predictors is significantly large, ridge may provide accuracy, but it includes all the variables, which is not desired in a compact representation of the model.
• Lasso, in contrast, sets the weights of unnecessary variables to zero.

Machine learning models - ridge and lasso regression

What is Ridge Regression?
• Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity. It performs L2 regularization. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, and this results in predicted values being far away from the actual values.
• The cost function for ridge regression:

  Min(||Y – X(theta)||^2 + λ||theta||^2)

• Lambda (λ) is the penalty term; it is denoted by the alpha parameter in the ridge function. So, by changing the value of alpha, we control the penalty term: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
● It shrinks the parameters and is therefore used to counter multicollinearity.
● It reduces model complexity by coefficient shrinkage.

Machine learning models - ridge and lasso regression
Lasso Meaning
• The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a statistical technique for the regularisation of data models and feature selection.
Regularization
• Regularization is an important concept used to avoid overfitting of the data, especially when the training and test data vary considerably.
• Regularization is implemented by adding a “penalty” term to the best fit derived from the training data, to achieve a lower variance on the test data; it also restricts the influence of the predictor variables over the output variable by compressing their coefficients.

Machine learning models - ridge and lasso regression
The key difference is in how they assign penalty to the
coefficients:
1. Ridge Regression:
○ Performs L2 regularization, i.e. adds penalty equivalent
to square of the magnitude of coefficients
○ Minimization objective = LS Obj + α * (sum of square of
coefficients)
2. Lasso Regression:
○ Performs L1 regularization, i.e. adds penalty equivalent
to absolute value of the magnitude of coefficients
○ Minimization objective = LS Obj + α * (sum of absolute
value of coefficients)

Example of ridge regression machine learning

● Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we simply fit the model on the training data and check the accuracy of the fit on the test data. Here, the scikit-learn package is used (a minimal sketch follows below).
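A minimal sketch (the data loading, column names, and alpha value are assumptions) of fitting ridge regression with scikit-learn and checking the fit on held-out test data:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

wine = pd.read_csv("winequality-red.csv", sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ridge = Ridge(alpha=1.0)                           # alpha plays the role of the penalty term lambda
ridge.fit(X_train, y_train)
print(r2_score(y_test, ridge.predict(X_test)))     # accuracy of the fit on test data
print(dict(zip(X.columns, ridge.coef_.round(4))))  # none of the coefficients are exactly zero
```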

Example of Lasso regression machine learning
Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are penalized rather than their squares.

Example of Lasso regression machine learning

The following results show the coefficient values of both methods; the coefficient of density has
been set to 0 in lasso regression, whereas the density value is -5.5672 in ridge regression; also,
none of the coefficients in ridge regression are zero values:
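A minimal sketch (assumed data and alpha values) contrasting the two methods: lasso can shrink some coefficients exactly to zero, while ridge keeps all of them non-zero.

```python
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

wine = pd.read_csv("winequality-red.csv", sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.01).fit(X_train, y_train)

coefs = pd.DataFrame({"ridge": ridge.coef_, "lasso": lasso.coef_}, index=X.columns)
print(coefs.round(4))    # lasso rows with 0.0 are the variables it dropped
```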

Logistic Regression

● Logistic regression is a process of modeling the probability of a discrete outcome given an input variable.
● It is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation. This type of analysis can help you predict the likelihood of an event happening or a choice being made.
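For reference, the standard form of that equation models the log-odds as a linear function of the inputs: log(p / (1 − p)) = β0 + β1x1 + … + βkxk, or equivalently p = 1 / (1 + e^−(β0 + β1x1 + … + βkxk)).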

Maximum Likelihood

● Maximum likelihood estimation is a method of estimating the parameters of a model given observations, by finding the parameter values that maximize the likelihood of making those observations; this means finding parameters that maximize the probability p of event 1 and (1 - p) of non-event 0, where

  probability (event) + probability (non-event) = 1

Maximum Likelihood

Example: The sample (0, 1, 0, 0, 1, 0) is drawn from a binomial (Bernoulli) distribution. What is the maximum likelihood estimate of µ?
For this distribution P(X=1) = µ and P(X=0) = 1 - µ, where µ is the parameter.
The log is applied to both sides of the likelihood equation for mathematical convenience; also, maximizing the likelihood is the same as maximizing the log of the likelihood.

Maximum Likelihood

• We determine the maximizing value of µ by equating the derivative to zero.
• We then take the second derivative to determine whether the stationary point obtained from equating the derivative to zero is a maximum or a minimum.
• If the µ value is a maximum, the second derivative of log(L(µ)) should be negative (a worked version follows below).
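A worked version of this derivation for the sample above (two 1s and four 0s):

  L(µ) = µ² (1 − µ)⁴
  log L(µ) = 2 log µ + 4 log(1 − µ)
  d/dµ log L(µ) = 2/µ − 4/(1 − µ) = 0  ⟹  2(1 − µ) = 4µ  ⟹  µ = 1/3
  d²/dµ² log L(µ) = −2/µ² − 4/(1 − µ)², which is negative, so µ = 1/3 is a maximum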

Maximum Likelihood

Even without substituting the value of µ into the second derivative, we can see that it is negative, as the denominators are squared and both terms carry a negative sign. Substituting the value confirms this.

It has therefore been proven that the value µ = 1/3 maximizes the likelihood; substituting this value into the log-likelihood function gives the maximum of the log likelihood.

So, logistic regression tries to find its parameters by maximizing the likelihood with respect to the individual parameters.

Terminology involved in Logistic regression

Information value (IV)
• This is very useful in the preliminary filtering of variables prior to including them in the model.
• IV is mainly used in industry for eliminating the bulk of the variables in the first step, prior to fitting the model, as the number of variables present in the final model would be about 10.
• Hence, initial processing is needed to reduce the variables from 400+ or so in number.

Terminology involved in Logistic regression
Example: In the example table, a continuous variable (price) has been broken down into deciles (10 bins) based on the price range, the number of events and non-events has been counted in each bin, and the information value has been calculated for each segment and added together. The total value is 0.0356, meaning price is a weak predictor for classifying events (a minimal sketch of the calculation follows below).
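A minimal sketch (synthetic bin counts, not the table from the slide) of the information value calculation: for each bin, IV_bin = (%events − %non-events) × ln(%events / %non-events), and the total IV is the sum over all bins.

```python
import numpy as np
import pandas as pd

bins = pd.DataFrame({
    "events":     [10, 12,  9, 11, 10,  8, 12, 11,  9,  8],   # assumed counts per decile
    "non_events": [90, 88, 91, 89, 90, 92, 88, 89, 91, 92],
})
pct_events = bins["events"] / bins["events"].sum()
pct_non_events = bins["non_events"] / bins["non_events"].sum()

iv = ((pct_events - pct_non_events) * np.log(pct_events / pct_non_events)).sum()
print(round(iv, 4))   # small values (e.g. < 0.1) indicate a weak predictor
```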

Terminology involved in Logistic regression

Akaike information criterion (AIC):
• The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data.
• It measures the relative quality of a statistical model for a given set of data.
• When comparing two models, the model with the lower AIC is preferred.

Terminology involved in Logistic regression
Receiver operating characteristic (ROC) curve:
• This is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.
• The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values.
• A threshold is a real value between 0 and 1, used to convert the predicted probability of the output into a class.
• Ideally, the threshold should be set in a way that trades off between both categories and produces higher overall accuracy.
• Optimum threshold = the threshold at which the maximum of (sensitivity + specificity) is attained (a minimal sketch follows below).
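A minimal sketch (assumed labels and scores) of picking the threshold that maximizes sensitivity + specificity from the ROC curve:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.55, 0.35])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr + (1 - fpr)                     # sensitivity + specificity at each threshold
best_threshold = thresholds[np.argmax(j)]
print(best_threshold)                   # the "optimum threshold" described above
```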
Confusion matrix

Terminology involved in Logistic regression

Rank ordering: After sorting observations in descending order by predicted probability, deciles are created (10 equal bins with 10 percent of the total observations in each bin). By adding up the number of events in each decile, we get the aggregated events for each decile; this number should be in decreasing order, else it is a serious violation of the logistic regression methodology.

Concordance/c-statistic: The C-statistic (sometimes called the “concordance” statistic or C-index) is a measure of goodness of fit for binary outcomes in a logistic regression model. It is the proportion of pairs in which the predicted event probability is higher for the actual event than for the non-event.

Terminology involved in Logistic regression

In the example table, both actual and predicted values are shown for a sample of seven rows. Actual is the true category (default or not), whereas predicted is the predicted probability from the logistic regression model; the task is to calculate the concordance value.
To calculate concordance, we split the table into two (one table with actual value 1 and one with actual value 0) and apply the Cartesian product of each row from both tables to form pairs.

Terminology involved in Logistic regression

The complete Cartesian product is calculated, and a pair is classified as concordant whenever the predicted probability of the actual-1 observation is higher than the predicted probability of the actual-0 observation (a minimal sketch follows below).
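A minimal sketch (assumed sample values, not the table from the slide) of the concordance calculation: pair every actual-1 observation with every actual-0 observation and count the pairs where the predicted probability of the 1 is higher.

```python
import numpy as np

actual = np.array([1, 0, 1, 0, 0, 1, 0])
predicted = np.array([0.86, 0.12, 0.70, 0.35, 0.41, 0.63, 0.74])

p_ones = predicted[actual == 1]
p_zeros = predicted[actual == 0]

pairs = [(p1, p0) for p1 in p_ones for p0 in p_zeros]        # Cartesian product
concordant = sum(p1 > p0 for p1, p0 in pairs)
ties = sum(p1 == p0 for p1, p0 in pairs)
c_statistic = (concordant + 0.5 * ties) / len(pairs)
print(round(c_statistic, 5))
```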

Terminology involved in Logistic regression

C-statistic: Here it is 0.83315, or 83.315 percent; any value greater than 0.7 (70 percent) is considered a good model to use for practical purposes.

Divergence: The distance between the average score of default accounts and the average score of non-default accounts. The greater the distance, the more effective the scoring system is at segregating good and bad observations.

K-S statistic: This is the maximum distance between the two population distributions. It helps with discriminating default accounts from non-default accounts.

Terminology involved in Logistic regression

Population stability index (PSI): This is the metric used to check whether the current population, on which the credit scoring model will be used, has drifted with respect to the population at development time:
● PSI <= 0.1: no change in the characteristics of the current population with respect to the development population
● 0.1 < PSI <= 0.25: some change has taken place and warrants attention, but the model can still be used
● PSI > 0.25: a large shift in the score distribution of the current population compared with development time

Terminology involved in Logistic regression

To calculate the PSI, we first divide the initial population range into 10 buckets (an arbitrary but common choice), count the number of values in each of those buckets for the initial and new populations, and then divide these counts by the total number of values in each population to get the percentage in each bucket. Plotting the percentages ends up looking like a discretized version of the original chart (a minimal sketch of the calculation follows below).
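A minimal sketch (assumed bucket percentages) of the PSI calculation: PSI = Σ (actual% − expected%) × ln(actual% / expected%) over the buckets, where "expected" is the development population and "actual" is the current one.

```python
import numpy as np

expected_pct = np.array([0.10] * 10)                                    # development population
actual_pct   = np.array([0.12, 0.11, 0.10, 0.09, 0.10,
                         0.08, 0.11, 0.10, 0.09, 0.10])                 # current population

psi = ((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)).sum()
print(round(psi, 4))   # <= 0.1: no change; 0.1-0.25: some change; > 0.25: large shift
```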

Applying steps in logistic regression modeling

The following steps are applied in logistic regression modeling in industry:
1. Exclusion criteria and good-bad definition finalization
2. Initial data preparation and univariate analysis
3. Derived/dummy variable creation
4. Fine classing and coarse classing
5. Fitting the logistic model on the training data
6. Evaluating the model on test data

Random forest
• Random Forest is a popular machine learning algorithm that belongs
to the supervised learning technique.
• It can be used for both Classification and Regression problems in ML.
• It is based on the concept of ensemble learning, which is a process of
combining multiple classifiers to solve a complex problem and to
improve the performance of the model
• Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset.
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.
• A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.

Random forest
• RF samples both observations and variables of the training data to develop independent decision trees, and takes majority voting for classification and averaging for regression problems respectively.
• In contrast, bagging samples only observations at random and uses all the columns, which has the deficiency that the same significant variables appear at the root of all the decision trees.
• This makes the trees dependent on each other, for which accuracy is penalized.
• The following are a few rules of thumb when selecting sub-samples from observations using random forest (a commonly used set of defaults is noted below).
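A commonly cited set of defaults (standard random forest practice, not necessarily the exact values on the original slides) is to grow each tree on a bootstrap sample of roughly two-thirds of the observations and, at each split, to consider about sqrt(p) randomly chosen variables for classification and about p/3 for regression, where p is the total number of predictors.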

Example of random forest using German credit data
● The same German credit data is utilized to illustrate the random forest model.
● A very significant difference compared with logistic regression is that the effort spent on data preprocessing decreases drastically.
● In RF, we have not removed variables one by one from the analysis based on significance and VIF values, as significance tests are not applicable to ML models. However, five-fold cross-validation has been performed on the training data to ensure the model's robustness.
● In RF, we have not removed the extra dummy variable from the analysis, as the model automatically takes care of multi-collinearity.
● Random forest requires much less human effort and intervention to train the model.

Example of random forest using German credit data
The test accuracy produced from the random forest is 0.855.

Grid search on Random Forest
Grid search has been performed by changing various hyperparameters with the following settings (a minimal sketch follows below). However, readers are encouraged to try other parameters to explore further in this space.

• Number of trees: (1000, 2000, 3000)
• Maximum depth: (100, 200, 300)
• Minimum samples per split: (2, 3)
• Minimum samples per leaf node: (1, 2)
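A minimal sketch of this grid search using scikit-learn's GridSearchCV with the listed hyperparameter values; the data below is a synthetic stand-in for the prepared German credit training data, not the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the prepared German credit training data (assumed, for illustration only).
X_train, y_train = make_classification(n_samples=700, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [1000, 2000, 3000],
    "max_depth": [100, 200, 300],
    "min_samples_split": [2, 3],
    "min_samples_leaf": [1, 2],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```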

Variable importance plot
The variable importance plot provides a list of the most significant variables in descending order of mean decrease in Gini. The top variables contribute more to the model than the bottom ones and have high predictive power in classifying default and non-default customers.
Grid search does not expose variable importance functionality in Python scikit-learn, hence we use the best parameters from the grid search and plot the variable importance graph with the plain random forest scikit-learn function (a minimal sketch follows below). In R, that provision exists, hence the R code would be more compact.
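A minimal sketch (assumed best parameters and stand-in data and variable names) of the variable importance plot: fit a plain RandomForestClassifier with the best grid-search parameters and plot feature_importances_ (mean decrease in Gini) in sorted order.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data and variable names (assumed, for illustration only).
X, y = make_classification(n_samples=700, n_features=10, random_state=42)
columns = [f"var_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=1000, max_depth=100,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=42).fit(X, y)

importances = pd.Series(rf.feature_importances_, index=columns).sort_values()
importances.plot(kind="barh")            # top variables contribute most to the model
plt.tight_layout()
plt.show()
```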

Comparison of Logistic regression with Random Forest
In the comparison table, the explanatory variables of both models have been put in descending order based on their importance to the model contribution.
For the logistic regression model, importance is measured by the p-value (a smaller value means a better predictor), and for the random forest by the mean decrease in Gini (a larger value means a better predictor).
Many of the variables match closely in importance, such as status_exs_accnt_A14, credit_hist_A34, Installment_rate_in_percentage_of_disposable_income, property_A_24, Credit_amount, and Duration_in_month.

