Parallelism of Statistics and Machine Learning & Logistic Regression Versus Random Forest
RAMAPURAM CAMPUS
Dr. S. Veena, Associate Professor/CSE
Parallelism of Statistics and Machine Learning
Logistic Regression Versus Random Forest
Comparison between Regression and Machine Learning models
• Linear regression and machine learning models both try to solve the same problem, in different ways.
• In statistical modeling, samples are drawn from the population and the model is fitted on the sampled data.
• In machine learning, however, even a small number of observations, such as 30, is good enough to update the weights at the end of each iteration.
• Statistical models are parametric in nature, which means a model has parameters on which diagnostics are performed to check the validity of the model.
• Machine learning models, by contrast, are non-parametric: they have no predefined parameter or curve assumptions; these models learn by themselves from the provided data and come up with complex, intricate functions rather than fitting a predefined function.
• Multi-collinearity checks are required to be performed in statistical modeling.
• In machine learning, by contrast, the weights automatically get adjusted to compensate for the multicollinearity problem.
Note: Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model, meaning that one independent variable can be predicted from another independent variable.
Compensating Factors in Machine Learning Models
Statistical model
• Two-point validation is performed on the training data in the statistical modeling methodology:
– overall model accuracy
– significance tests on individual parameters
Machine learning model
• Statistical diagnostics on individual variables are not performed in machine learning.
• Instead of the two-part split of the statistical methodology, the data is split into three parts (train data - 50 percent, validation data - 25 percent, test data - 25 percent), as sketched below.
• Machine learning models should be developed on the training data, and their hyperparameters tuned on the validation data, to ensure the equivalent of the two-point validation.
• Thus the robustness of the model is ensured without diagnostics performed at an individual variable level.
Assumptions of linear regression
Linear regression has the following assumptions, failing
which the linear regression model does not hold true:
• The dependent variable should be a linear
combination of independent variables
• No autocorrelation in error terms
• Errors should have zero mean and be normally
distributed
• No or little multi-collinearity
• Error terms should be homoscedastic
[Figures illustrating these assumptions appeared on the original slides]
How to diagnose (normality of errors):
• A Q-Q plot and the Kolmogorov-Smirnov test are helpful.
• In the Q-Q plots shown on the original slide, the first chart shows errors that are normally distributed, as the residuals do not deviate much from the diagonal line.
• The right-hand chart clearly shows that the errors are not normally distributed.
• In such scenarios, we need to re-express the variables, by taking log transformations and so on, to make the residuals look as they do in the left-hand chart.
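A minimal sketch of these diagnostics with SciPy and matplotlib, assuming the model's residuals are available (synthetic residuals are used here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic residuals standing in for a fitted model's errors.
residuals = np.random.normal(loc=0.0, scale=1.0, size=200)

# Q-Q plot: points hugging the diagonal suggest normally distributed errors.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()

# Kolmogorov-Smirnov test against a standard normal; residuals should be
# standardized first in practice. A large p-value means normality is not rejected.
statistic, p_value = stats.kstest(residuals, "norm")
print(statistic, p_value)
```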
No or little multi-collinearity:
• Multi-collinearity is the case in which independent variables are correlated with each other; this situation creates unstable models by inflating the magnitude of the coefficients/estimates.
• It also becomes difficult to determine which variable is contributing to the prediction of the response variable.
• The variance inflation factor (VIF) is calculated for each independent variable from the R-squared value obtained by regressing it on all the other independent variables (VIF = 1 / (1 - R²)), and the variable with the highest VIF is eliminated, one at a time.
• VIF <= 4 suggests no multi-collinearity; in banking scenarios, a stricter threshold of VIF <= 2 is used.
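A minimal sketch of the VIF check using statsmodels; the DataFrame here is hypothetical, with x3 constructed to be nearly collinear with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is nearly collinear with x1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.random(100), "x2": rng.random(100)})
df["x3"] = 0.9 * df["x1"] + 0.1 * rng.random(100)

# VIF per column: regress it on all the others and compute 1 / (1 - R^2).
vif = pd.Series(
    [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
    index=df.columns,
)
print(vif)  # drop the highest-VIF variable (e.g. > 4) and recompute
```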
How to diagnose (homoscedasticity of errors):
• Look at the plot of residuals versus fitted values;
• if any cone or divergence pattern exists, it indicates that the errors do not have constant variance, which impacts the model's predictions.
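A minimal sketch of this check with matplotlib, using synthetic fitted values and residuals whose spread grows with the fitted value (a cone pattern):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: residual spread grows with the fitted value,
# producing the cone shape that signals heteroscedasticity.
rng = np.random.default_rng(0)
fitted = np.linspace(1, 10, 200)
residuals = rng.normal(0, 0.2 * fitted)  # variance increases with fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```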
Steps applied in linear regression modeling
[The workflow appeared as a figure on the original slide]
Example of simple linear regression from first principles
[The worked example's calculations appeared as figures on the original slides]
• Predict the dependent value and check the R-squared value:
• if the value is >= 0.7, the model is good enough to deploy on unseen data;
• if it is not such a good value (< 0.6), we can conclude that the model is not good enough to deploy.
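A minimal sketch of this deployment check with scikit-learn, on synthetic data standing in for the worked example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data standing in for the worked example's train/test sets.
rng = np.random.default_rng(0)
X_train = rng.random((80, 1))
y_train = 3 * X_train[:, 0] + rng.normal(0, 0.1, 80)
X_test = rng.random((20, 1))
y_test = 3 * X_test[:, 0] + rng.normal(0, 0.1, 20)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
# Rule of thumb from above: >= 0.7 is good enough to deploy; < 0.6 is not.
print("R-squared on unseen data:", r2)
```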
Machine learning models - ridge and lasso regression
Lagrangian multipliers
● Definition of Lagrangian: a function that describes the state of a dynamic system in terms of position coordinates and their time derivatives.
● The Lagrange multiplier, λ, measures the increase in the objective function f(x, y) that is obtained through a marginal relaxation in the constraint (an increase in k). For this reason, the Lagrange multiplier is often termed a shadow price.
● The method of Lagrange multipliers in machine learning is a simple and elegant method of finding the local minima or local maxima of a function subject to equality or inequality constraints. Lagrange multipliers are also called undetermined multipliers.
• The objective is to minimize the RSS subject to a cost (budget) constraint s.
• For every value of λ, there is an s that yields the equivalent equations, as shown for the overall objective function with a penalty factor.
• For any fixed value of λ, ridge regression fits only a single model, and the model-fitting procedure can be performed very quickly.
• One disadvantage of ridge regression:
– in a situation where the number of predictors is significantly large, ridge may provide accuracy, but it includes all the variables, which is not desired in a compact representation of the model.
• Lasso, by contrast, sets the weights of unnecessary variables to zero.
Lasso meaning
• The word LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of data models and feature selection.
Regularization
• Regularization is an important concept used to avoid overfitting of the data, especially when the training and test data vary considerably.
• Regularization is implemented by adding a "penalty" term to the best fit derived from the training data, to achieve lower variance on the test data; it also restricts the influence of the predictor variables on the output variable by compressing their coefficients.
The key difference is in how they assign a penalty to the coefficients (see the sketch below):
1. Ridge regression:
○ Performs L2 regularization, i.e. adds a penalty equivalent to the square of the magnitude of the coefficients
○ Minimization objective = LS Obj + α * (sum of squares of coefficients)
2. Lasso regression:
○ Performs L1 regularization, i.e. adds a penalty equivalent to the absolute value of the magnitude of the coefficients
○ Minimization objective = LS Obj + α * (sum of absolute values of coefficients)
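A minimal sketch contrasting the two penalties in scikit-learn; the data is synthetic, with only the first two features truly relevant, and the alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data in which only x0 and x1 actually drive the target.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: penalty = alpha * sum(coef**2)
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: penalty = alpha * sum(|coef|)

print(ridge.coef_)  # all coefficients shrunk, none exactly zero
print(lasso.coef_)  # irrelevant coefficients driven exactly to zero
```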
Example of ridge regression machine learning
● Ridge regression is a machine learning model in which we do not perform any statistical diagnostics on the independent variables; we simply fit the model and check the accuracy of the fit on test data. Here, the scikit-learn package has been used.
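The original slides show this as screenshots; below is a minimal sketch of the same idea. The file name winequality.csv and the target column quality are assumptions standing in for the dataset used on the slides:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical CSV with predictor columns and a "quality" target.
data = pd.read_csv("winequality.csv")
X = data.drop(columns=["quality"])
y = data["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Fit without any per-variable statistical diagnostics, then score on test data.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("Test R-squared:", r2_score(y_test, ridge.predict(X_test)))
```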
Example of Lasso regression machine learning
Lasso regression is a close cousin of ridge regression, in which the absolute values of the coefficients are minimized rather than their squares.
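A matching sketch for lasso under the same assumptions as the ridge example above; note how some coefficients come out exactly zero:

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Same hypothetical CSV as in the ridge sketch.
data = pd.read_csv("winequality.csv")
X = data.drop(columns=["quality"])
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# Penalizing |coefficients| instead of their squares zeroes out weak variables.
lasso = Lasso(alpha=0.01).fit(X_train, y_train)
print(dict(zip(X.columns, lasso.coef_)))
```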
The results (shown as a figure on the original slide) give the coefficient values of both methods: the coefficient of density has been set to 0 in lasso regression, whereas it is -5.5672 in ridge regression; also, none of the coefficients in ridge regression are zero.
Logistic Regression
Maximum Likelihood
Maximum Likelihood
log is applied to both sides of the equation for mathematical convenience; also, maximizing
likelihood is the same as the maximizing log of likelihood
Dr.S.Veena,Associate Professor/CSE 42
Even without substituting the value of µ into the second derivative, we can determine that it is negative, as the denominators are squared and both terms carry a negative sign. Substituting confirms this; a reconstruction of the computation is sketched below.
It has thus been proven that µ = 1/3 maximizes the likelihood, and substituting this value into the log-likelihood function gives its maximum.
So, logistic regression tries to find the parameters by maximizing the likelihood with respect to the individual parameters.
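The equations on the original slide are not reproduced here; the following is a plausible reconstruction, assuming the standard example of one success in three Bernoulli trials, which yields the µ = 1/3 quoted above:

```latex
L(\mu) = \mu\,(1-\mu)^2, \qquad
\log L(\mu) = \log \mu + 2\log(1-\mu)

\frac{d \log L}{d\mu} = \frac{1}{\mu} - \frac{2}{1-\mu} = 0
\;\Rightarrow\; \mu = \tfrac{1}{3}

\frac{d^2 \log L}{d\mu^2} = -\frac{1}{\mu^2} - \frac{2}{(1-\mu)^2}
\;\Big|_{\mu = 1/3} = -9 - 4.5 = -13.5 < 0

\log L\!\left(\tfrac{1}{3}\right) = \log\tfrac{1}{3} + 2\log\tfrac{2}{3} \approx -1.909
```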
Terminology involved in Logistic regression
Example: In the table shown on the original slide, a continuous variable (price) has been broken into deciles (10 bins) based on the price range, the number of events and non-events in each bin has been counted, and the information value (IV) has been calculated for all the segments and summed. The total value is 0.0356, meaning price is a weak predictor for classifying events.
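A minimal sketch of the IV computation, assuming the common convention IV = Σ (% non-events − % events) × WOE with WOE = ln(% non-events / % events); the per-bin counts below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical counts of events/non-events per price decile.
bins = pd.DataFrame({
    "events":     [10, 12,  8, 15, 11,  9, 13, 10, 12, 10],
    "non_events": [90, 88, 92, 85, 89, 91, 87, 90, 88, 90],
})
pct_events = bins["events"] / bins["events"].sum()
pct_non_events = bins["non_events"] / bins["non_events"].sum()

woe = np.log(pct_non_events / pct_events)         # weight of evidence per bin
iv = ((pct_non_events - pct_events) * woe).sum()  # information value
print(iv)  # values around 0.02-0.1 are conventionally weak predictors
```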
Receiver operating characteristic (ROC) curve:
• This is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.
• The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values.
• A threshold is a real value between 0 and 1 used to convert the predicted probability of the output into a class.
• Ideally, the threshold should be set so that it trades off between both categories and produces the highest overall accuracy.
• Optimum threshold = the threshold at which the maximum of (sensitivity + specificity) is achieved.
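A minimal sketch of finding this optimum threshold with scikit-learn's roc_curve; the labels and probabilities here are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.3, 0.7, 0.8, 0.4, 0.9, 0.2, 0.6, 0.55, 0.35])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# sensitivity = TPR and specificity = 1 - FPR, so maximizing
# (sensitivity + specificity) means maximizing tpr + (1 - fpr).
best = thresholds[np.argmax(tpr + (1 - fpr))]
print("Optimum threshold:", best)
```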
Confusion matrix: [shown as a figure on the original slide]
Concordance: In the table shown on the original slide, both actual and predicted values are given for a sample of seven rows. Actual is the true category (default or not), whereas predicted is the predicted probability from the logistic regression model. The task is to calculate the concordance value.
To calculate concordance, we split the table into two (one table with actual value 1, the other with actual value 0) and take the Cartesian product of the rows of the two tables to form pairs.
The complete Cartesian product has been calculated, and a pair is classified as concordant whenever the predicted probability for the 1 category is higher than the predicted probability for the 0 category.
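A minimal sketch of this pairing logic on a hypothetical seven-row sample (the slide's actual values are not reproduced):

```python
from itertools import product

import pandas as pd

# Hypothetical sample: actual default flag and predicted probability.
df = pd.DataFrame({
    "actual": [1, 0, 1, 0, 0, 1, 0],
    "prob":   [0.80, 0.30, 0.60, 0.50, 0.20, 0.70, 0.40],
})
ones = df.loc[df["actual"] == 1, "prob"]
zeros = df.loc[df["actual"] == 0, "prob"]

# Cartesian product of the 1-table and the 0-table.
pairs = list(product(ones, zeros))
concordant = sum(p1 > p0 for p1, p0 in pairs)
ties = sum(p1 == p0 for p1, p0 in pairs)

# C-statistic: concordant pairs plus half of the ties, over all pairs.
c_stat = (concordant + 0.5 * ties) / len(pairs)
print(c_stat)
```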
C-statistic: Here it is 0.83315, or 83.315 percent; any value greater than 0.7 (70 percent) is considered a good model to use for practical purposes.
Divergence: The distance between the average score of default accounts and
the average score of non-default accounts. The greater the distance, the more
effective the scoring system is at segregating good and bad observations.
Population stability index (PSI): To calculate the PSI, we first divide the initial population range into 10 buckets (an arbitrary choice), count the number of values in each bucket for the initial and new populations, and divide those counts by the total size of each population to get the percentage in each bucket. As expected, plotting the percentages ends up looking like a discretized version of the original chart.
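A minimal sketch of the PSI computation under the usual formula PSI = Σ (actual% − expected%) × ln(actual% / expected%); the two score samples here are synthetic:

```python
import numpy as np

# Synthetic initial (expected) and new (actual) score populations.
rng = np.random.default_rng(0)
expected = rng.normal(600, 50, 10_000)
actual = rng.normal(610, 55, 10_000)

# 10 buckets over the initial population's range.
edges = np.histogram_bin_edges(expected, bins=10)
exp_pct = np.histogram(expected, bins=edges)[0] / expected.size
act_pct = np.histogram(actual, bins=edges)[0] / actual.size

# Sum over buckets; zero-count buckets need smoothing in practice.
psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
print(psi)  # common rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift
```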
Applying steps in logistic regression modeling
[The workflow appeared as a figure on the original slide]
Random forest
• Random forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It can be used for both classification and regression problems in ML.
• It is based on the concept of ensemble learning, a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• A random forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and averages them to improve the predictive accuracy on that dataset.
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output.
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
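A minimal sketch of a random forest classifier in scikit-learn; synthetic data stands in for a real credit dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for credit data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees; the final class is the majority
# vote across the trees' predictions.
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```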
• RF samples both the observations and the variables of the training data to develop independent decision trees, then takes majority voting for classification and averaging for regression problems, respectively.
• In contrast, bagging samples only the observations at random and uses all the columns; this has the deficiency that the same significant variables appear at the root of all the decision trees.
• This makes the trees dependent on each other, for which accuracy is penalized.
• A few rules of thumb for selecting sub-samples of observations when using random forest follow (they appeared as figures on the original slides).
Example of random forest using German credit data
● The same German credit data is utilized to illustrate the random forest model.
● A very significant difference from logistic regression is that the effort applied to data preprocessing decreases drastically.
● In RF, we have not removed variables one by one from the analysis based on significance and VIF values, as significance tests are not applicable to ML models. However, five-fold cross-validation has been performed on the training data to ensure the model's robustness.
● In RF, we have not removed the extra dummy variable from the analysis, as the model automatically takes care of multi-collinearity.
● Random forest requires much less human effort and intervention to train the model.
The test accuracy produced from the random forest is 0.855.
Grid search on Random Forest
Grid search has been performed by varying several hyperparameters; the exact settings appeared as a figure on the original slide, and a sketch of the mechanics follows below. Readers are encouraged to try other parameters to explore this space further.
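The slide's exact settings are not reproduced here; the following sketch uses a hypothetical grid with five-fold cross-validation to show the mechanics:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hypothetical hyperparameter grid; the original slide's values may differ.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```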
Variable importance plot
The variable importance plot provides a list of the most significant variables in descending order of mean decrease in Gini. The top variables contribute more to the model than the bottom ones and also have high predictive power in classifying default and non-default customers.
Grid search does not have variable importance functionality in Python scikit-learn, hence we use the best parameters from grid search and plot the variable importance graph with the plain random forest scikit-learn function, as sketched below. In R, that provision exists, hence the R code would be more compact.
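A minimal sketch of the plot using feature_importances_ (scikit-learn's mean decrease in impurity); the feature names are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; in practice, refit with the best parameters from grid search.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity per feature.
importances = pd.Series(
    rf.feature_importances_,
    index=[f"var_{i}" for i in range(X.shape[1])],
).sort_values()

importances.plot(kind="barh")
plt.xlabel("Mean decrease in Gini")
plt.title("Variable importance")
plt.show()
```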
Comparison of Logistic regression with Random Forest
In the comparison table shown on the original slide, the explanatory variables of both models are placed in descending order of their importance to the model contribution.
In the logistic regression model the measure is the p-value (smaller means a better predictor), and for the random forest it is the mean decrease in Gini (larger means a better predictor).
Many of the variables match closely in importance, such as status_exs_accnt_A14, credit_hist_A34, Installment_rate_in_percentage_of_disposable_income, property_A_24, Credit_amount, and Duration_in_month.