
Regression Analysis in Machine Learning
Regression analysis is a statistical method for modeling the relationship between a
dependent (target) variable and one or more independent (predictor) variables. More
specifically, regression analysis helps us understand how the value of the dependent
variable changes with respect to one independent variable when the other independent
variables are held fixed. It predicts continuous/real values such as temperature, age,
salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A that runs various advertisements
every year and gets sales from them. The below list shows the advertisements made by
the company in the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertisement in the year 2019 and wants
to predict the sales for that year. To solve such prediction problems in machine
learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation
between variables and enables us to predict the continuous output variable based on
one or more predictor variables. It is mainly used for prediction, forecasting, time
series modeling, and determining the cause-effect relationship between variables.

In regression, we plot a graph between the variables which best fits the given
datapoints; using this plot, the machine learning model can make predictions about the
data. In simple words, "Regression shows a line or curve that passes through all the
datapoints on the target-predictor graph in such a way that the vertical distance
between the datapoints and the regression line is minimum." The distance between
the datapoints and the line tells whether the model has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving

Terminologies Related to Regression Analysis:
o Dependent Variable: The main factor in regression analysis that we want to
predict or understand is called the dependent variable. It is also called the target
variable.
o Independent Variable: The factors which affect the dependent variable, or which
are used to predict its values, are called independent variables, also called
predictors.
o Outliers: An outlier is an observation which contains either a very low or a very
high value in comparison to the other observed values. An outlier may hamper the
result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each
other, this condition is called multicollinearity. It should not be present in the
dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training
dataset but not with the test dataset, the problem is called overfitting. And if our
algorithm does not perform well even with the training dataset, the problem is
called underfitting.

Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous
variable. There are various real-world scenarios where we need future predictions,
such as weather conditions, sales, marketing trends, etc. For such cases we need a
technique which can make predictions accurately, and that technique is regression
analysis, a statistical method used in machine learning and data science. Below are
some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.

Types of Regression
There are various types of regression used in data science and machine learning. Each
type has its own importance in different scenarios, but at the core, all regression
methods analyze the effect of the independent variables on the dependent variable.
Here we discuss some important types of regression, which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows
the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence it is called linear regression.
o If there is only one input variable (x), it is called simple linear regression; if there
is more than one input variable, it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be
explained using the below image, where we predict the salary of an employee on
the basis of years of experience.

o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
and a and b are the linear coefficients.
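Below is a minimal sketch of estimating a and b with NumPy's least-squares fit; the advertisement and sales figures are hypothetical stand-ins, since the original table appears only as an image:

import numpy as np

# hypothetical yearly advertisement spend and corresponding sales
X = np.array([90.0, 120.0, 150.0, 100.0, 130.0])
Y = np.array([1000.0, 1300.0, 1800.0, 1200.0, 1380.0])

# degree-1 least-squares fit returns the slope a and intercept b
a, b = np.polyfit(X, Y, deg=1)
print("Predicted sales for a $200 advertisement:", a * 200 + b)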

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic
Logistic Regression:

o Logistic regression is another supervised learning algorithm, which is used to
solve classification problems. In classification problems, we have a dependent
variable in a binary or discrete format, such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1,
Yes or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.
o Logistic regression uses the sigmoid function (logistic function) to model the
data. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output, between the values 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0.
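As a quick illustration, here is a minimal NumPy sketch of the sigmoid and a 0.5 threshold (the input scores and the threshold choice are assumptions for the example):

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): maps any real value into (0, 1)
    return 1 / (1 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])  # toy model outputs
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)  # above the threshold -> 1, below -> 0
print(probs, labels)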

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:

o Polynomial regression is a type of regression which models a non-linear dataset
using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints lie in a non-linear fashion; in such a
case, linear regression will not fit those datapoints well. To cover such datapoints,
we need polynomial regression.
o In polynomial regression, the original features are transformed into
polynomial features of a given degree and then modeled using a linear
model. This means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear equation Y = b0 + b1x is transformed into the polynomial
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression
coefficients; x is our independent/input variable.
o The model is still linear because the coefficients b0, ..., bn enter linearly; only the
feature x is raised to higher powers.

Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the
same degree.
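To make this concrete, below is a minimal sketch using scikit-learn, where PolynomialFeatures builds the transformed features and an ordinary linear model is fitted on them; the toy data and the degree of 2 are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)                    # a single input feature
y = 2 + 0.5 * x.ravel() ** 2 + np.random.randn(10)  # non-linear toy target

x_poly = PolynomialFeatures(degree=2).fit_transform(x)  # columns [1, x, x^2]
model = LinearRegression().fit(x_poly, y)               # still linear in b0..bn
print(model.predict(x_poly[:3]))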

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. When we use it for regression problems,
it is termed Support Vector Regression (SVR).

Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords used in Support Vector Regression:

o Kernel: A function used to map lower-dimensional data into higher-dimensional
data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in
SVR, it is the line which helps to predict the continuous variable and covers most
of the datapoints.
o Boundary lines: The two lines on either side of the hyperplane, which create a
margin for the datapoints.
o Support vectors: The datapoints which are nearest to the hyperplane and of the
opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that the
maximum number of datapoints is covered within that margin. The main goal of SVR
is to include as many datapoints as possible within the boundary lines, and the
hyperplane (best-fit line) must pass through as many datapoints as possible. Consider
the below image:

Here, the blue line is called the hyperplane, and the other two lines are known as the
boundary lines.
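Below is a minimal SVR sketch, assuming scikit-learn; the RBF kernel, the epsilon value, and the toy data are illustrative choices, not taken from the text:

import numpy as np
from sklearn.svm import SVR

X = np.sort(np.random.rand(40, 1) * 5, axis=0)  # toy single-feature data
y = np.sin(X).ravel()                           # continuous toy target

# epsilon sets the width of the margin between the boundary lines
svr = SVR(kernel='rbf', epsilon=0.1)
svr.fit(X, y)
print(svr.predict(X[:5]))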

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving
both classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute, each branch represents the result of the test,
and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the
dataset), which splits into left and right child nodes (subsets of the dataset). These
child nodes are further divided into their own children and thereby become the
parent nodes of those nodes. Consider the below image:

The above image shows an example of decision tree regression; here, the model is
trying to predict a person's choice between a sports car and a luxury car.
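Below is a minimal sketch, assuming scikit-learn's DecisionTreeRegressor on toy data (the max_depth value is an illustrative choice):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5]])   # toy feature
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])   # toy continuous target

tree = DecisionTreeRegressor(max_depth=2)  # internal nodes test the attribute
tree.fit(X, y)
# a leaf predicts the mean of the training targets that reached it
print(tree.predict([[2.5]]))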

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms,
capable of performing both regression and classification tasks.
o Random forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output as the average of the
individual tree outputs. The combined decision trees are called base models, and
the ensemble can be represented more formally as:

g(x) = (f1(x) + f2(x) + ... + fn(x)) / n

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble
learning, in which the aggregated decision trees run in parallel and do not
interact with each other.
o With the help of random forest regression, we can prevent overfitting in the
model by creating random subsets of the dataset.
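A minimal sketch, assuming scikit-learn's RandomForestRegressor on synthetic data; the parameter values are illustrative:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# 100 trees fitted on bootstrap samples; predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))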
Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression, in which
a small amount of bias is introduced so that we can get better long-term
predictions.
o The amount of bias added to the model is known as the ridge regression penalty.
We can compute this penalty term by multiplying lambda with the squared
weight of each individual feature.
o The cost function for ridge regression is the usual sum of squared residuals plus
this penalty:

Cost = Σ(y − ŷ)^2 + λ Σ w^2

o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables; to solve such problems, ridge regression can
be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
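A minimal sketch, assuming scikit-learn, where the alpha parameter plays the role of lambda (the synthetic data and alpha value are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=5, random_state=0)

ridge = Ridge(alpha=1.0)  # larger alpha -> stronger L2 penalty on the weights
ridge.fit(X, y)
print(ridge.coef_[:5])    # coefficients are shrunk toward (but not to) zero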

Lasso Regression:

o Lasso regression is another regularization technique used to reduce the
complexity of the model.
o It is similar to ridge regression, except that the penalty term contains the
absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas ridge
regression can only shrink it close to 0.
o It is also called L1 regularization. The cost function for lasso regression is:

Cost = Σ(y − ŷ)^2 + λ Σ |w|
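A minimal sketch, again assuming scikit-learn; note how the L1 penalty can drive some coefficients exactly to zero (the synthetic data and alpha value are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=5, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)        # several entries will be exactly 0.0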

Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified into
Regression and Classification algorithms. Regression algorithms predict the output
for continuous values, but to predict categorical values, we need Classification
algorithms.

What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In classification, a
program learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or
Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.

Unlike regression, the output variable of classification is a category, not a value, such
as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a
supervised learning technique, it takes labeled input data, which means the dataset
contains inputs with the corresponding outputs.

In a classification algorithm, a discrete output function (y) is mapped to an input variable (x):

y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the categorical
data.

Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, class A and Class B. These classes have features
that are similar to each other and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, it is
called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is
called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner's case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, eager learners take
more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be divided mainly into two categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification
Note: We will learn the above algorithms in later chapters.

Evaluating a Classification Model:

Once our model is completed, it is necessary to evaluate its performance, whether it is a
classification or a regression model. For evaluating a classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

where y = actual output and p = predicted probability.
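A minimal sketch, assuming scikit-learn's log_loss function (the labels and probabilities are toy values):

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]            # actual labels
y_prob = [0.9, 0.1, 0.8, 0.6]    # predicted probabilities of class 1
print(log_loss(y_true, y_prob))  # closer to 0 means a better classifier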

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix contains the prediction results in summarized form, with the total numbers
of correct and incorrect predictions. It looks like the below table:

                      Actual Positive     Actual Negative
Predicted Positive    True Positive       False Positive
Predicted Negative    False Negative      True Negative

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of a multi-class classification model, we use the AUC-ROC
curve.
o The ROC curve is plotted with the TPR (True Positive Rate) on the Y-axis and the
FPR (False Positive Rate) on the X-axis.
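A minimal sketch, assuming scikit-learn's roc_curve and roc_auc_score (the labels and scores are toy values):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of class 1
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR/TPR per threshold
print("AUC:", roc_auc_score(y_true, y_score))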

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use
cases of Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drugs Classification
o Biometric Identification, etc.
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence it is called linear regression.
Since linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,

y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, the algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, the algorithm is called Multiple Linear Regression.
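A minimal sketch, assuming scikit-learn: the same LinearRegression class handles both cases, and only the number of feature columns differs (the toy data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([3, 5, 7, 9])

# Simple linear regression: one feature column
X1 = np.array([[1], [2], [3], [4]])
print(LinearRegression().fit(X1, y).coef_)

# Multiple linear regression: several feature columns
X2 = np.array([[1, 0], [2, 1], [3, 1], [4, 2]])
print(LinearRegression().fit(X2, y).coef_)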

Linear Regression Line

A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases
on the X-axis, the relationship is termed a Positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases
on the X-axis, the relationship is called a Negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best-fit line, which
means the error between the predicted values and the actual values should be
minimized. The best-fit line will have the least error.

Different values of the weights or line coefficients (a0, a1) give different lines of
regression, so we need to calculate the best values for a0 and a1 to find the best-fit
line; to calculate these, we use a cost function.

Cost function:
o Different values of the weights or line coefficients (a0, a1) give different lines of
regression, and the cost function is used to estimate the coefficient values for the
best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how
well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as the Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))^2

Where,
N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value
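A minimal NumPy sketch of this computation (the coefficients and data are toy values):

import numpy as np

a0, a1 = 1.0, 2.0                # assumed line: y_hat = a1*x + a0
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.2, 4.9, 7.1])    # actual values
y_hat = a1 * x + a0              # predicted values
mse = np.mean((y - y_hat) ** 2)  # average of the squared errors
print(mse)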

Residuals: The distance between an actual value and the predicted value is called a
residual. If the observed points are far from the regression line, the residuals will be
high, and so the cost function will be high. If the scatter points are close to the
regression line, the residuals will be small, and hence so will the cost function.

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by randomly selecting initial coefficient values and then iteratively
updating them to reach the minimum of the cost function.
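A minimal gradient-descent sketch for simple linear regression; the learning rate, iteration count, and data are illustrative assumptions:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # underlying line: y = 2x + 1

a0, a1 = 0.0, 0.0   # starting coefficients
lr = 0.01           # learning rate

for _ in range(5000):
    y_hat = a1 * x + a0
    # gradients of the MSE with respect to a0 and a1
    d_a0 = -2 * np.mean(y - y_hat)
    d_a1 = -2 * np.mean((y - y_hat) * x)
    a0 -= lr * d_a0
    a1 -= lr * d_a1

print(a0, a1)  # should approach 1 and 2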
Model Performance:
The goodness of fit determines how well the line of regression fits the set of
observations. The process of finding the best model out of various models is called
optimization. It can be achieved by the below method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted and actual
values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:

R-squared = Explained variation / Total variation
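A minimal sketch, assuming scikit-learn's r2_score (toy values):

from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.1, 7.2, 8.9]
print(r2_score(y_actual, y_predicted))  # close to 1 (100%) indicates a good fit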

Assumptions of Linear Regression

Below are some important assumptions of linear regression. These are formal checks
to perform while building a linear regression model, which ensure we get the best
possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors
and the target variable; or we can say it is difficult to determine which predictor variable
is affecting the target variable and which is not. So the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is a situation in which the error term is the same for all values of the
independent variables. With homoscedasticity, there should be no clear pattern in the
distribution of the data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow the normal distribution. If the
error terms are not normally distributed, the confidence intervals will become either too
wide or too narrow, which may cause difficulties in finding coefficients.
This can be checked using a Q-Q plot: if the plot shows a straight line without any
deviation, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs when there is a dependency between the residual errors.

Regression vs. Classification in Machine Learning
Regression and Classification algorithms are Supervised Learning algorithms. Both are
used for prediction in machine learning and work with labeled datasets. The difference
between the two lies in how they are applied to different machine learning problems.

The main difference between Regression and Classification algorithms is that
Regression algorithms are used to predict continuous values such as price, salary, age,
etc., while Classification algorithms are used to predict/classify discrete values such
as Male or Female, True or False, Spam or Not Spam, etc.

Consider the below diagram:


Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained
on the training dataset and based on that training, it categorizes the data into different
classes.

The task of the classification algorithm is to find the mapping function to map the
input(x) to the discrete output(y).

Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different
parameters, and whenever it receives a new email, it identifies whether the email is spam
or not. If the email is spam, then it is moved to the Spam folder.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification

Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the
Regression algorithm. In weather prediction, the model is trained on the past data, and
once the training is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression

Difference between Regression and Classification

o In Regression, the output variable must be of a continuous nature or a real value; in
Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to a continuous
output variable (y); the task of the classification algorithm is to map the input value (x)
to a discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms are
used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more
accurately; in Classification, we try to find the decision boundary, which can divide the
dataset into different classes.
o Regression algorithms can be used to solve regression problems such as weather
prediction, house price prediction, etc.; classification algorithms can be used to solve
classification problems such as identification of spam emails, speech recognition,
identification of cancer cells, etc.
o Regression algorithms can be further divided into linear and non-linear regression;
classification algorithms can be divided into binary classifiers and multi-class classifiers.

Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable; therefore,
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.
o Logistic regression is very similar to linear regression, except in how it is used: linear
regression is used for solving regression problems, whereas logistic regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:

Note: Logistic regression uses the same predictive-modeling concept as regression,
which is why it is called logistic regression; but because it is used to classify samples, it
falls under the classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend to 1, and values below the
threshold tend to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multicollinearity.

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation.
The mathematical steps to get the logistic regression equation are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In logistic regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 − y):

y / (1 − y); 0 for y = 0, and infinity for y = 1

o But we need a range between −infinity and +infinity; taking the logarithm of the
equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for logistic regression.

Type of Logistic Regression:

On the basis of the categories, logistic regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of logistic regression in Python, we will use the
below example:
To understand the implementation of Logistic Regression in Python, we will use the
below example:

Example: There is a dataset which contains information about various users, obtained
from a social networking site. A car manufacturer has recently launched a new SUV,
and the company wants to check how many users from the dataset want to purchase
it.

For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we
will use the same steps as we have done in previous topics of Regression. Below are the
steps:

o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that
we can use it in our code efficiently. It will be the same as we have done in Data pre-
processing topic. The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the
given image:

Now, we will extract the dependent and independent variables from the given dataset.
Below is the code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables, age
and salary, are at indexes 2 and 3. And we have taken 4 for the y variable because our
dependent variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below, as screenshots of the resulting test set and training
set arrays.

In logistic regression, we will do feature scaling because we want accurate prediction
results. Here we will only scale the independent variables, because the dependent
variable has only 0 and 1 values. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

We have prepared our dataset well, and now we will train the model using the training
set. To fit the model to the training set, we will import the LogisticRegression class of
the sklearn library.

After importing the class, we will create a classifier object and use it to fit the logistic
regression model to the training set. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using
test set data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to
purchase or not purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To
create it, we need to import the confusion_matrix function of the sklearn library. After
importing the function, we will call it and store the result in a new variable cm. The
function takes two main parameters: y_true (the actual values) and y_pred (the values
returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:

By executing the above code, a new confusion matrix will be created. Consider the
below image:

We can find the accuracy of the predictions by interpreting the confusion matrix. From
the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11
are incorrect.

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib
library to create the colormap for visualizing the result. We have created two new
variables, x_set and y_set, to replace x_train and y_train. After that, we used the
nm.meshgrid command to create a rectangular grid ranging from the minimum value
minus 1 to the maximum value plus 1 for each feature. The pixel points we have taken
are of 0.01 resolution.

To create a filled contour, we have used the mtp.contourf command; it creates regions
of the provided colors (purple and green). In this function, we have passed
classifier.predict to show the data points as predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:


o In the above graph, we can see that there are some Green points within the green
region and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the
result for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
o The purple point observations are those for which purchased (dependent variable) is
probably 0, i.e., users who did not purchase the SUV car.
o The green point observations are those for which purchased is probably 1, i.e., users
who purchased the SUV car.
o We can also estimate from the graph that younger users with lower salaries mostly did
not purchase the car, whereas older users with higher estimated salaries did.
o But there are some purple points in the green region (buying the car) and some green
points in the purple region (not buying the car), meaning some younger users with a
high estimated salary purchased the car, whereas some older users with a low
estimated salary did not.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our
goal for this classification is to divide the users who purchased the SUV car and who did
not purchase the car. So from the output graph, we can clearly see the two regions
(Purple and Green) with the observation points. The Purple region is for those users who
didn't buy the car, and Green Region is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a straight line, i.e., linear in nature, since
we have used a linear model for logistic regression. In further topics, we will learn
about non-linear classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for
new observations (Test set). The code for the test set will remain same as above except
that here we will use x_test and y_test instead of x_train and y_train. Below is the code
for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The above graph shows the test set result. As we can see, the graph is divided into two
regions (Purple and Green). And Green observations are in the green region, and Purple
observations are in the purple region. So we can say it is a good prediction and model.
Some of the green and purple data points are in different regions, which can be ignored
as we have already calculated this error using the confusion matrix (11 Incorrect output).

Hence our model is pretty good and ready to make new predictions for this
classification problem.
