
Business Analytics: Advanced
SIMPLE & MULTIPLE LINEAR REGRESSION

Copyright © 2016 Defour Analytics Pvt. Ltd


Recap
 What is Statistics?
 Descriptive Statistics: presenting, organizing and summarizing data; data are summarized via measures of central tendency and measures of dispersion.
 Inferential Statistics: drawing conclusions about a population based on data observed in a sample.


Recap

Standardization of scores

Normal Distribution

Hypothesis Testing

Basic R codes
Outline of the Supervised Learning Program (Professionals)

Day  Topic
 1   Introduction to Analytics and its Applications
 2   Basics of Data/Statistics/R (Analytical tool) - I
 3   Basics of Data/Statistics/R/Alteryx Demo (Analytical tool) - II
 4   Linear Regression
 5   Logistic Regression
 6   Clustering
 7   Decision Tree
 8   Time Series Modelling
 9   Practical Session on Use Cases
10   Market Basket Analysis
11   Text Mining
12   Data Visualization

Plan
 Program duration: 3 months
 Sessions every Saturday & Sunday from 10 am to 1 pm IST
 8 weeks of support after completion of the program (12 hrs, based on pre-booked appointments)
 Changes in dates will be notified in advance as needed



Regression: Agenda
What is regression?

Types of regression analysis

Purpose of regression analysis

Understanding Simple Linear Regression

Understanding Multiple Linear Regression


What is Regression?
 A statistical method that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables):
Y = f(x)
 It consists of three stages:
(1) analysing the correlation and directionality of the data (covariance & correlation),
(2) estimating the model, i.e., fitting the line,
(3) evaluating the validity and usefulness of the model.
Covariance & Correlation
 Covariance is a measure of the joint variability of two random variables. It helps to find the direction of the relationship between two variables, i.e., what happens to Y when X increases or decreases?

 Correlation signifies the strength and direction of a linear relationship (it captures only linear relationships).
Population correlation is denoted by ρ (rho).
Sample correlation is denoted by r.

Covariance is hard to compare across units. E.g., when you compare height and weight in different units (metres-kg vs inches-kg), the covariance will differ. The solution is to normalize the covariance by removing its units, which yields a value between -1 and 1: the correlation.
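
A minimal R sketch of this point, with made-up height and weight values: the covariance changes when the units change, while the correlation does not.

# Illustrative (made-up) heights and weights
height.m  <- c(1.60, 1.68, 1.75, 1.82, 1.90)
weight.kg <- c(55, 62, 70, 80, 91)
height.in <- height.m * 39.37     # the same heights, converted to inches

cov(height.m, weight.kg)          # covariance in metre-kg units
cov(height.in, weight.kg)         # a different number in inch-kg units
cor(height.m, weight.kg)          # correlation is unit free ...
cor(height.in, weight.kg)         # ... so both calls return the same value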
Features of r (correlation)
 It is unit free.
 It ranges between -1 and 1.
a) The closer to 1, the stronger the positive linear relationship.
b) The closer to -1, the stronger the negative linear relationship.
c) The closer to 0, the weaker the linear relationship.
Types of regression analysis

Sr. No.  Type                         Type of Dependent Variable (DV)  Type of Independent Variable (IV)
1        Simple Linear Regression     Interval or ratio                Interval, ratio or dichotomous
2        Multiple Linear Regression   Interval or ratio                Interval, ratio or dichotomous
3        Logistic Regression          Binary                           Interval, ratio or dichotomous
4        Ordinal Regression           Ordinal                          Nominal or dichotomous
5        Multinomial Regression       Nominal                          Interval, ratio or dichotomous


Purpose of regression analysis
 The purpose of regression analysis is to analyse relationships among variables.
 Usually, the investigator seeks to ascertain the causal effect of one variable upon another (causal analysis).
 The analysis is carried out through the estimation of a relationship:

y = f(x1, x2, ..., xk)

 The results serve the following two purposes:
 Answering how much y changes with changes in each of the xs (x1, x2, ..., xk) (estimating the impact of an effect).
 Forecasting or predicting the value of y based on the values of the xs (trend forecasting).
Understanding Simple Linear Regression
In mathematics, a linear equation is:
y = mx + b
We call it linear because the equation represents a straight line.

In regression notation, the simple linear model relating the dependent variable to the independent variable is:

Y = b0 + b1X + e

• Y = dependent variable
• X = independent variable
• b0 = intercept or constant
• b1 = X's slope or coefficient
• e = error term
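
A minimal sketch of fitting this model in R with lm(), on simulated data (the true values b0 = 3 and b1 = 2 are assumed purely for illustration):

set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)   # Y = b0 + b1*X + e, with b0 = 3 and b1 = 2
simple.fit <- lm(y ~ x)      # lm() estimates b0 and b1 by least squares
coef(simple.fit)             # (Intercept) estimates b0; x estimates b1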
Terms used in regression analysis
 Explained variance = R2 (coefficient of determination).
 Unexplained variance = residuals (error).
 Adjusted R-square = reduces R2 by taking into account the sample size and the number of independent variables in the regression model (it becomes smaller as we have fewer observations per independent variable).
 Standard Error of the Estimate (SEE) = a measure of the accuracy of the regression predictions. It estimates the variation of the dependent variable values around the regression line.
Terms used in regression analysis
 Total Sum of Squares (SST) = the total amount of variation that exists to be explained by the independent variables. SST is the sum of SSE and SSR.
 Sum of Squared Errors (SSE) = the variance in the dependent variable not accounted for by the regression model (the residual). The objective is to obtain the smallest possible sum of squared errors as a measure of prediction accuracy.
 Sum of Squares Regression (SSR) = the amount of improvement in the explanation of the dependent variable attributable to the independent variables.
Terms used in regression analysis
Total variation is made up of two parts:

SST = SSE + SSR

where SST is the Total Sum of Squares, SSE is the Sum of Squared Errors, and SSR is the Sum of Squares Regression (also known as the Regression Sum of Squares):

SST = Σ(y − ȳ)²    SSE = Σ(y − ŷ)²    SSR = Σ(ŷ − ȳ)²

where:
y = observed value of the dependent variable
ȳ = average value of the dependent variable
ŷ = estimated value of y for the given x value
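
This decomposition can be checked numerically in R; a sketch on the simulated fit from the earlier example:

set.seed(1); x <- 1:20; y <- 3 + 2 * x + rnorm(20)  # same simulated data as above
simple.fit <- lm(y ~ x)
y.hat <- fitted(simple.fit)          # estimated values of y
SST <- sum((y - mean(y))^2)          # total sum of squares
SSE <- sum((y - y.hat)^2)            # sum of squared errors
SSR <- sum((y.hat - mean(y))^2)      # sum of squares regression
all.equal(SST, SSE + SSR)            # TRUE, up to floating-point error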
Coefficient of Determination (R2)
 The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable:

R2 = SSR / SST = (sum of squares explained by regression) / (total sum of squares), where 0 ≤ R2 ≤ 1

i.e., if R2 = 0.70, then 70% of the variance in Y is explained by X.

 For a model to be good/acceptable, the value should be close to 1.

Note: in the single-independent-variable case, the coefficient of determination satisfies

R2 = r2

where:
R2 = coefficient of determination
r = simple correlation coefficient
Coefficient of determination (R2) and Adjusted R2
 The coefficient of determination (R2) can also be used to test the significance of the coefficients collectively, apart from using the F-test:

R2 = (SST − SSE) / SST = SSR / SST = (sum of squares explained by regression) / (total sum of squares)

 The drawback of the coefficient of determination is that its value always increases as the number of independent variables increases, even if the marginal contribution of the incoming variable is statistically insignificant.
 To take care of this drawback, the coefficient of determination is adjusted for the number of independent variables taken. This adjusted measure is called adjusted R2.
 Adjusted R2 is given by the following formula:

R2a = 1 − [(n − 1) / (n − k − 1)] × (1 − R2)

where
n = number of observations
k = number of independent variables
R2a = adjusted R2
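
A sketch computing adjusted R2 by hand on the simulated fit from earlier and checking it against the value R reports:

set.seed(1); x <- 1:20; y <- 3 + 2 * x + rnorm(20)  # same simulated data as above
simple.fit <- lm(y ~ x)
n  <- length(y)                       # number of observations
k  <- 1                               # number of independent variables
R2 <- summary(simple.fit)$r.squared   # coefficient of determination
R2.adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - R2)
c(R2.adj, summary(simple.fit)$adj.r.squared)  # the two values agree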
Multiple Linear Regression
 "Multiple regression" is a technique that allows additional factors to enter the analysis separately so that the effect of each can be estimated.
 It is valuable for quantifying the impact of various simultaneous influences upon a single dependent variable.
Multiple Linear Regression
Multiple regression:

Y = b0 + b1X1 + b2X2 + ... + bkXk + e

where:
Y = the variable that we are trying to predict (DV)
X1, ..., Xk = the variables that we are using to predict Y (IVs)
b0 = the intercept
b1 = the slope (coefficient of X1), and so on for b2, ..., bk
e = the regression residual (error term)
Assumptions in Multiple Regression Analysis
 Linearity of the phenomenon measured.
 Constant variance of the error terms.
 Independence of the error terms.
 Normality of the error term distribution.
Error term: e = (y − ŷ)
Typical Applications of Regression Analysis
Building models to ascertain the pattern/behaviour of certain performance measures:
 Asset management: predicting performance, maintenance cost
 Production operations: proactive alerts, likelihood of an event (e.g. trip, failure, ...)
 Frequently used statistical tool in asset-liability management, credit scoring, etc.
 Demand forecasting: retail sales, ambulance dispatches, market value of some product
 Modelling residential home prices as a function of the home's living area
 Etc.
Residual Analysis
The following assumptions about the residuals should be checked (see the diagnostic sketch after this list):
 They are independent.
 They are normally distributed.
 They have a constant variance σ2 for all settings of the independent variables (homoscedasticity).
 They have a zero mean.
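
In R, the standard diagnostic plots for an lm fit cover most of these checks; a sketch using the simulated model from the earlier examples:

set.seed(1); x <- 1:20; y <- 3 + 2 * x + rnorm(20)  # same simulated data as above
simple.fit <- lm(y ~ x)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a grid
plot(simple.fit)       # residuals vs fitted (zero mean, constant variance),
                       # normal Q-Q (normality), scale-location, leverage
par(mfrow = c(1, 1))   # restore the single-plot layout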
Multicollinearity
 A significant problem in regression analysis arises when the independent variables, or linear combinations of the independent variables, are correlated with each other.
 This correlation among the independent variables is called multicollinearity, and it creates problems for the t-tests of statistical significance.
 High correlation among the independent variables suggests the presence of multicollinearity, but low correlation values do not rule it out.
 The most common method of correcting multicollinearity is to systematically remove independent variables until multicollinearity is minimized.
 In R we use the function vif (variance inflation factor) to check whether multicollinearity is present, as in the sketch below.
Normally, if vif < 5, then there is no multicollinearity;
if vif > 5, then multicollinearity is present.
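
A minimal sketch using vif() from the car package (assumed installed), fitted on R's built-in mtcars data for illustration:

library(car)                                    # provides vif()
multi.fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(multi.fit)   # values above 5 flag multicollinearity; wt and disp are
                 # strongly correlated here, so expect elevated VIFs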
Heteroskedasticity
 When the requirement of a constant variance is violated, we have a condition of heteroskedasticity.
 We can diagnose heteroskedasticity by plotting the residuals against the predicted y, or by the Breusch-Pagan chi-square test.
 The Breusch-Pagan test statistic follows a chi-square distribution with k degrees of freedom, where k is the number of independent variables:

BP chi-square test statistic = n × R2resid

where
n = number of observations
R2resid = coefficient of determination when the residuals are regressed on the independent variables
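
A sketch of the test on the illustrative mtcars model from the previous example, both via bptest() from the lmtest package (assumed installed) and via the manual n × R2 form; note that in practice the auxiliary regression uses the squared residuals:

library(lmtest)                                       # provides bptest()
multi.fit <- lm(mpg ~ wt + hp + disp, data = mtcars)  # same illustrative model
bptest(multi.fit)                                     # studentized (Koenker) BP test

u2  <- residuals(multi.fit)^2                    # squared residuals
aux <- lm(u2 ~ wt + hp + disp, data = mtcars)    # regress them on the IVs
n.R2 <- nrow(mtcars) * summary(aux)$r.squared    # n * R-squared of the auxiliary fit
n.R2                                             # matches bptest(); chi-square, 3 df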
CASE STUDY



Use Case – Simple Linear Regression
 This sales data set contains sales (in lakhs) and advertising cost (in thousands).
 It is monthly data of sales and advertising cost for a particular product.
 We need to predict, for given new advertising-cost data, how much the sales will be.
R code - Simple Linear Regression
setwd("E:\\Unnati\\Webinar\\Linear Regression")  # Setting the working directory

sales.data = read.csv("Data.csv")  # Reading the sales data
View(sales.data)
head(sales.data)

plot(sales.data)  # Looking at the relationship between advertising cost and sales
cor(sales.data)   # Correlation between the variables

# We want to predict how much sales will be incurred for a given advertising cost
simple.model <- lm(Sales.l ~ Adv.cost.k, sales.data)  # Creating the linear model
summary(simple.model)  # Checking the R-square value and other parameters

new_data = read.csv("new_data.csv")  # Importing new data for prediction
View(new_data)

predictions <- predict.lm(simple.model, newdata = new_data)  # Predicting the values
predictions <- as.data.frame(predictions)  # Converting the predictions to a data frame for the next step
cbind(new_data, predictions)  # Combining the new data and the predicted values

# Checking assumptions
qqnorm(simple.model$residuals)        # Checking normality of residuals (Q-Q plot)
hist(simple.model$residuals)          # Histogram of the residuals
shapiro.test(simple.model$residuals)  # Shapiro-Wilk normality test
Use Case – Multivariate Regression
 The Walmart data set contains 200 entries of customer experience, collected by a survey.
 In this data, we need to find out:
 First, which variables does customer satisfaction depend on?
 Second, using those variables, we need to build a regression model on some training data and validate it on testing data. This model will be used for predicting customer satisfaction.
R code – Multivariate Regression
# Setting the working directory
setwd("E:\\Courses_material\\Modules\\Foundation_Final\\Day 4\\use case")

########################## Multivariate Regression ##################################
# Importing and viewing the dataset
library(sas7bdat)  # provides read.sas7bdat()
walmart.data <- read.sas7bdat(file = "walmart.sas7bdat")
View(walmart.data)  # The dataset contains 200 observations and 14 variables

set.seed(25)  # Fixing the random seed so the train/test split is reproducible
temp <- sample(c("train","test"), size = nrow(walmart.data), replace = T, prob = c(0.8,0.2))
# Creating a vector of size equal to the number of observations - 80% for training & 20% for model validation

# Now creating the training and testing datasets from the original Walmart dataset
training.data <- walmart.data[temp == "train",]
testing.data <- walmart.data[temp == "test",]

dim(training.data)  # This function gives the dimensions of a data frame
dim(testing.data)

indp.var <- colnames(training.data[,2:14])  # Copying the independent-variable names into indp.var
indp.var <- paste(indp.var, collapse = " + ")
equation <- paste("Customer_Satisfaction ~ ", indp.var)
my.formula <- as.formula(equation)  # Converting the text into formula format

cor(walmart.data)  # Inspecting the correlations among all the variables

train.model <- lm(my.formula, training.data)  # Creating the first model with all variables
summary(train.model)

library(car)  # provides vif() and durbinWatsonTest()
vif(train.model)  # Checking for multicollinearity among the variables
# If the vif value is greater than 5 (in our case; standards might differ) for any variable,
# then multicollinearity is present.
# In our case multicollinearity is present, as the values are more than 5 for some variables.

# Now we run stepwise regression and let R decide which variables to select
step.train.model <- step(train.model, direction = "both")
summary(step.train.model)  # The summary of the model looks good

train.model <- lm(Customer_Satisfaction ~ Product_Quality + E_Commerce + Technical_Support +
                    Product_Line + Salesforce_Image + Order_Billing + Price_Flexibility,
                  training.data)  # Refitting with the variables retained by the stepwise selection
summary(train.model)

train.model1 <- lm(Customer_Satisfaction ~ Product_Quality + E_Commerce +
                     Product_Line + Salesforce_Image + Price_Flexibility,
                   training.data)  # A further reduced model after dropping insignificant variables
summary(train.model1)

# Now checking the assumptions of linear regression
vif(train.model1)                    # Checking multicollinearity
qqnorm(train.model1$residuals)       # Checking normality of residuals
hist(train.model1$residuals)
durbinWatsonTest(step.train.model)   # Checking autocorrelation of residuals

test.predict <- predict.lm(train.model1, newdata = testing.data)  # Validating on the testing data
test.predict
Accomplishments today!
What is regression?

Types of regression analysis

Purpose of regression analysis

Understanding Simple Linear Regression

Understanding Multiple Linear Regression

Use case with R


Quiz
1) Which of the graphs below has a very strong positive correlation?

[Three scatter plots labelled A, B and C]
Quiz
2) It is observed that there is a very high correlation between math test scores and amount of physical exercise
done by a student on the test day. What can you infer from this?

1. High correlation implies that after exercise the test scores are high.
2. Correlation does not imply causation.
3. Correlation measures the strength of linear relationship between amount of exercise and test scores.

A) Only 1
B) 1 and 3
C) 2 and 3
D) All the statements are true
Quiz
3) If the correlation coefficient (r) between scores in a math test and the amount of physical exercise done by a student is 0.86, what percentage of the variability in math test scores is explained by the amount of exercise?
A) 86%
B) 74%
C) 14%
D) 26%
4) A regression analysis between weight (y) and height (x) resulted in the following least squares line: y = 120 + 5x. This implies that if the height is increased by 1 inch, the weight is expected to
A) increase by 1 pound
B) increase by 5 pounds
C) increase by 125 pounds
D) None of the above
THANK YOU

Copyright © 2016 Defour Analytics Pvt. Ltd
