DAUNIT-3
Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of
the average value of one variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables.
3. No Perfect Multicollinearity
The independent variables must not be perfectly correlated:
rank(X) = p
If perfect multicollinearity exists, the matrix X^T X becomes singular, making
it impossible to compute (X^T X)^-1.
If variables are highly correlated but not perfectly, OLS can still be
used, but the estimates may be unstable.
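As a quick illustration (a minimal sketch with made-up data, not from these notes), base R's qr() can report the rank of a design matrix; a rank smaller than the number of columns signals perfect multicollinearity:
# Sketch: rank check for perfect multicollinearity (illustrative data)
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1            # x2 is an exact linear function of x1
X  <- cbind(1, x1, x2)  # design matrix with an intercept column
qr(X)$rank              # 2, but ncol(X) is 3: rank(X) < p, so X^T X is singular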
5. No Autocorrelation
Error terms should not be correlated:
E[ϵi ϵj] = 0 for all i ≠ j
Note: If autocorrelation exists, OLS is still unbiased but inefficient,
leading to incorrect standard errors and hypothesis tests.
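One informal check (a sketch, not a formal test) is the lag-1 correlation of a fitted model's residuals; a value far from 0 suggests autocorrelated errors. The built-in cars data set is used here purely for illustration.
# Sketch: lag-1 autocorrelation of OLS residuals
fit <- lm(dist ~ speed, data = cars)  # cars is a built-in data set
e   <- residuals(fit)
cor(e[-1], e[-length(e)])             # near 0 => little evidence of autocorrelation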
OLS chooses the coefficients that minimize the sum of squared errors:
S = Σ (yi − (β0 + β1*xi))^2
β0 = mean(y) − β1*mean(x)
If we had multiple input attributes (e.g., x1, x2, x3, etc.), this would be
called multiple linear regression. The procedure for simple linear regression
is different from, and simpler than, that for multiple linear regression.
Let us consider the following example for the equation y = 2*x + 3.
β1 = Σ (xi − mean(x)) * (yi − mean(y)) / Σ (xi − mean(x))^2

  x     y   xi−mean(x)  yi−mean(y)  (xi−mean(x))*(yi−mean(y))  (xi−mean(x))^2
 -3    -3      -4.4        -8.8              38.72                 19.36
 -1     1      -2.4        -4.8              11.52                  5.76
  2     7       0.6         1.2               0.72                  0.36
  4    11       2.6         5.2              13.52                  6.76
  5    13       3.6         7.2              25.92                 12.96
mean(x) = 1.4, mean(y) = 5.8               Sum = 90.4            Sum = 45.2

β0 = mean(y) − β1*mean(x)
From the above formulas we can find:
β1 = 90.4 / 45.2 = 2 and β0 = 5.8 − 2*1.4 = 3
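The hand calculation above can be verified in base R by applying the same formulas directly:
# Verifying the worked example in base R
x  <- c(-3, -1, 2, 4, 5)
y  <- c(-3, 1, 7, 11, 13)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 90.4 / 45.2 = 2
b0 <- mean(y) - b1 * mean(x)                                     # 5.8 - 2*1.4 = 3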
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
        0.4          0.8
> abline(relxy,col="Blue")
> a <- data.frame(x = 7)
> a
  x
1 7
> result <- predict(relxy, a)
> print(result)
1
6
> #Note: you can observe that
> 0.8*7+0.4
[1] 6 #The same calculated using the line equation y= 0.8*x +0.4.
Simple linear regression is the simplest form of regression and the most studied.
Calculating β1 & β0 using Correlations and Standard Deviations:
β1 = corr(x, y) * stdev(y) / stdev(x)
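For the data set used in the R example above, the same coefficients follow from cor() and sd():
# Sketch: slope and intercept from correlation and standard deviations
x  <- c(1, 2, 4, 3, 5)
y  <- c(1, 3, 3, 2, 5)
b1 <- cor(x, y) * sd(y) / sd(x)  # 0.8, matching lm(y ~ x)
b0 <- mean(y) - b1 * mean(x)     # 0.4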
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1. Problem Definition
The first step in constructing a model is to understand the industrial problem in
a more comprehensive way. To identify the purpose of the problem and the
prediction target, we must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the
obstacles first. Remember, excellent results always depend on a better
understanding of the problem.
2. Hypothesis Generation
Hypothesis generation means listing, before studying the data, all the possible
factors that could influence the prediction target; these hypotheses are later
confirmed or rejected during data exploration.
3. Data Collection
Data collection is gathering data from relevant sources for the analytical
problem; from this data we later extract meaningful insights for prediction.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain
unnecessary features, null values, unanticipated small values, or immense
values. So, before applying any algorithmic model to data, we have to explore it
first.
By inspecting the data, we get to understand the explicit and hidden trends in
data. We find the relation between data features and the target variable.
Usually, a data scientist invests 60–70% of the project time on data
exploration alone.
There are several sub-steps involved in data exploration:
o Feature Identification:
You need to analyze which data features are available and which
ones are not.
Identify independent and target variables.
Identify data types and categories of these variables.
o Univariate Analysis:
We inspect each variable one by one. This kind of analysis
depends on the variable type: whether it is categorical or
continuous.
Continuous variable: We mainly look for statistical trends
like mean, median, standard deviation, skewness, and many
more in the dataset.
Categorical variable: We use a frequency table to
understand the spread of data for each category. We can
measure the counts and frequency of occurrence of values.
o Bi-variate/Multi-variate Analysis:
This analysis helps to discover the relation between two
or more variables.
For continuous variables we can compute the correlation; for
categorical variables we look for association and dissociation
between them.
o Filling Null Values:
Usually, the dataset contains null values, which lower the
potential of the model. For a continuous variable, we fill these
null values using the mean or median of that specific column. For
the null values present in a categorical column, we replace them
with the most frequently occurring categorical value (a minimal R
sketch follows this list). Remember, don't delete those rows,
because you may lose information.
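A minimal sketch of this imputation in base R; df, age, and city are hypothetical names used only for illustration:
# Sketch: fill nulls (NA) in a hypothetical data frame 'df'
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)  # numeric column: mean (or median)
most_frequent <- names(which.max(table(df$city)))    # most frequent category
df$city[is.na(df$city)] <- most_frequent             # categorical column: the mode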
5. Predictive Modeling
Predictive modeling is a mathematical approach to creating a statistical model
that forecasts future behavior from input data.
Steps involved in predictive modeling:
Algorithm Selection:
o When we have a structured dataset and want to estimate a
continuous or categorical outcome, we use supervised machine
learning methodologies like regression and classification techniques.
When we have unstructured data and want to predict the clusters of
items to which a particular input test sample belongs, we use
unsupervised algorithms. In practice, a data scientist applies multiple
algorithms to get a more accurate model.
Train Model:
o After choosing the algorithm and getting the data ready, we train our
model on the input data using the preferred algorithm. Training
determines the correspondence between the independent variables
and the prediction targets.
Model Prediction:
o We make predictions by giving the input test data to the trained model.
We measure the accuracy using a cross-validation strategy or an ROC
curve, which work well for assessing model output on test data (a
concrete sketch follows these steps).
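As a concrete sketch of the train/predict steps, using the built-in mtcars data (chosen here only for illustration):
# Sketch: train a model on part of mtcars, predict on the held-out rows
set.seed(1)                               # reproducible split
idx   <- sample(1:nrow(mtcars), 24)       # 24 rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
model <- lm(mpg ~ wt + hp, data = train)  # train the model
pred  <- predict(model, newdata = test)   # predict on unseen test data
mean((test$mpg - pred)^2)                 # mean squared prediction error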
6. Model Deployment
There is nothing better than deploying the model in a real-time environment. It
helps us to gain analytical insights into the decision-making procedure. You
constantly need to update the model with additional features for customer
satisfaction.
To predict business decisions, plan market strategies, and create personalized
customer interests, we integrate the machine learning model into the existing
production domain.
When you browse the Amazon website, you notice product recommendations
based entirely on your interests, and you can see how these services
increase customer involvement. That is how a deployed model changes the
mindset of customers and convinces them to purchase the product.
Definition: Multi-collinearity:
Multicollinearity is a statistical phenomenon in which multiple independent
variables show high correlation with each other, i.e. they are strongly inter-
related.
Multicollinearity, also called collinearity, is an undesired situation for
any statistical regression model, since it diminishes the reliability of the
model itself.
If two or more independent variables are too correlated, the results
obtained from the regression will be distorted, because the independent
variables are effectively dependent on each other.
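The effect of perfect collinearity can be seen directly in R: when one predictor is an exact multiple of another, lm() cannot estimate its coefficient and reports NA. A minimal sketch with made-up data:
# Sketch: perfect collinearity makes one coefficient inestimable
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1                     # x2 duplicates the information in x1
y  <- c(2.1, 3.9, 6.2, 8.1, 9.8)
coef(lm(y ~ x1 + x2))            # the coefficient for x2 is reported as NA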
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear
Regression equation. The mathematical steps to get Logistic Regression
equations are given below:
Instead of a linear function, Logistic Regression uses a more complex
function, known as the 'Sigmoid function' or 'logistic function'.
The hypothesis of logistic regression limits the output to values between
0 and 1. Linear functions fail to represent it, as they can take values
greater than 1 or less than 0, which is not possible per the hypothesis of
logistic regression.
z = sigmoid(y) = σ(y) = 1 / (1 + e^(−y))
Hypothesis Representation
When using linear regression, we used a formula for the line equation as:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
In the above equation, y is the response variable; x1, x2, ..., xn are the
predictor variables; and b0, b1, b2, ..., bn are the coefficients, which are
numeric constants.
z = σ(y) = 1 / (1 + e^−(b0 + b1*x1 + b2*x2 + ... + bn*xn))
Example for Sigmoid Function in R:
> #Example for Sigmoid Function
> y<-c(-10:10);y
 [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
[20]   9  10
> z<-1/(1+exp(-y));z
 [1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
 [9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y,z)
> rm(list=ls())
> attach(mtcars) #attaching a data set to the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460
> model <- lm(mpg ~ disp + hp + wt, data = input)
> print(model)

Call:
lm(formula = mpg ~ disp + hp + wt, data = input)

Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891
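For comparison with the linear fit above, a logistic regression in R uses glm() with family = binomial. A minimal sketch on mtcars, predicting the binary transmission variable am from weight and horsepower (the variable choice is illustrative, not from these notes):
# Sketch: logistic regression with glm() on the built-in mtcars data
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
probs <- predict(logit_model, type = "response")  # sigmoid outputs in (0, 1)
preds <- ifelse(probs > 0.5, 1, 0)                # classify at the 0.5 threshold
table(Predicted = preds, Actual = mtcars$am)      # confusion matrix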
True Positive (TP): predicted positive and actually positive
True Negative (TN): predicted negative and actually negative
False Positive (FP) – Type 1 Error: predicted positive but actually negative
False Negative (FN) – Type 2 Error: predicted negative but actually positive
Precision = TP / (TP + FP)
Precision is a useful metric in cases where False Positives are a higher
concern than False Negatives.
Precision is important in music or video recommendation systems, e-
commerce websites, etc., where wrong results could lead to customer churn
and be harmful to the business.
Recall: (Sensitivity)
Recall is the ratio of correctly predicted positive observations to all
observations in the actual positive class.
Recall = TP / (TP + FN)
Recall is a useful metric in cases where False Negative trumps False
Positive.
Recall is important in medical cases where it doesn’t matter whether
we raise a false alarm but the actual positive cases should not go
undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea about
these two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into
account.
F1 Score = 2 / (1/Precision + 1/Recall) = 2 * (Precision * Recall) / (Precision + Recall)
We can easily calculate Precision and Recall for our model by plugging the
values into the above equations:
Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903
Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918
F1-Score
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (0.903 * 0.918) / (0.903 + 0.918)
         = 2 * (0.8289 / 1.821)
         = 2 * 0.4552 = 0.9104
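The same arithmetic can be reproduced in R:
# Reproducing the worked example above
TP <- 560; FP <- 60; FN <- 50
precision <- TP / (TP + FP)                                 # 0.903
recall    <- TP / (TP + FN)                                 # 0.918
f1        <- 2 * precision * recall / (precision + recall)  # 0.910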
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of
a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined in terms of Specificity, the proportion
of actual negatives correctly identified:
FPR = FP / (FP + TN) = 1 − Specificity
The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis
and FPR is on the x-axis.
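A minimal base-R sketch of an ROC curve, sweeping classification thresholds over hypothetical scores and labels (both made up for illustration):
# Sketch: manual ROC curve from hypothetical model scores and true labels
labels <- c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.7, 0.5)
th  <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(th, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(th, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))
plot(c(0, fpr), c(0, tpr), type = "l", xlab = "FPR", ylab = "TPR", main = "ROC curve")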