Business Analytics: Advanced
SIMPLE & MULTIPLE LINEAR REGRESSION
Descriptive Statistics: presenting, organizing and summarizing data.
Inferential Statistics: drawing conclusions about a population based on data observed in a sample.
Standardization of scores
Normal Distribution
Hypothesis Testing
Basic R codes
Outline of the Supervised Learning Program
Day | Topic (Professionals)
1 | Introduction to Analytics and its Applications
2 | Basics of Data/Statistics/R (Analytical tool) - I
3 | Basics of Data/Statistics/R/Alteryx Demo (Analytical tool) - II
4 | Linear Regression
5 | Logistic Regression
6 | Clustering
7 | Decision Tree
8 | Time Series Modelling
9 | Practical Session on Use Cases
10 | Market Basket Analysis
11 | Text Mining
12 | Data Visualization
Plan:
Program duration: 3 months
Every Saturday & Sunday from 10 am to 1 pm IST
8 weeks of support after the completion of the program (12 hrs, based on pre-booked appointments)
Changes in dates will be notified in advance as needed
Correlation: correlation signifies the strength and direction of a linear relationship between two variables (it captures only linear relationships).
Population correlation is denoted by ρ (rho).
Sample correlation is denoted by r.
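As a quick illustration, correlation can be computed in R (a minimal sketch with made-up advertising and sales vectors, not the course data):

advertising <- c(10, 12, 15, 18, 20, 22, 25)        # hypothetical monthly ad spend
sales       <- c(100, 110, 125, 138, 145, 155, 170) # hypothetical monthly sales
cor(advertising, sales)       # sample correlation r (Pearson by default)
cor.test(advertising, sales)  # tests H0: rho = 0 and gives a confidence interval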
The simple linear regression model is $Y = b_0 + b_1 X + e$, where:
• Y = dependent variable
• X = independent variable
• b0 = intercept or constant
• b1 = slope or coefficient of X
• e = error term
Terms used in regression analysis
Explained variance = R2 (coefficient of determination).
Unexplained variance = residuals (error).
Adjusted R-Square = R2 reduced to take into account the sample size and the number of independent variables in the regression model (it becomes smaller as we have fewer observations per independent variable).
Standard Error of the Estimate (SEE) = a measure of the
accuracy of the regression predictions. It estimates the
variation of the dependent variable values around the
regression line.
Total Sum of Squares (SST) = total amount of variation that exists to be explained by the independent variables. SST = the sum of SSE and SSR.
Sum of Squared Errors (SSE) = the variance in the dependent
variable not accounted for by the regression model = residual.
The objective is to obtain the smallest possible sum of squared
errors as a measure of prediction accuracy.
Sum of Squares Regression (SSR) = the amount of
improvement in explanation of the dependent variable
attributable to the independent variables.
Total variation is made up of two parts: $SST = SSE + SSR$, where

$SST = \sum (y - \bar{y})^2$   $SSE = \sum (y - \hat{y})^2$   $SSR = \sum (\hat{y} - \bar{y})^2$

Where:
$\bar{y}$ = average value of the dependent variable
$y$ = observed values of the dependent variable
$\hat{y}$ = estimated value of y for the given x value
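This decomposition can be verified in R (a sketch using the built-in mtcars data rather than the course data):

fit  <- lm(mpg ~ wt, data = mtcars)   # simple linear regression on built-in data
y    <- mtcars$mpg                    # observed values
yhat <- fitted(fit)                   # estimated values
SST  <- sum((y - mean(y))^2)          # total variation
SSE  <- sum((y - yhat)^2)             # unexplained (residual) variation
SSR  <- sum((yhat - mean(y))^2)       # explained variation
all.equal(SST, SSE + SSR)             # TRUE: the parts add up
sigma(fit)                            # residual standard error (the SEE defined above)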
Coefficient of Determination
The coefficient of determination is the ratio of the variation in the dependent variable explained by the regression model to the total variation. For simple linear regression, $R^2 = r^2$, where:
R2 = coefficient of determination
r = simple correlation coefficient
Coefficient of determination (R2) and Adjusted R2
The coefficient of determination (R2) can also be used to test the significance of the coefficients collectively, apart from using the F-test.

$R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST} = \frac{\text{Sum of Squares explained by regression}}{\text{Total Sum of Squares}}$
The drawback of using the coefficient of determination is that its value always increases as the number of independent variables increases, even if the marginal contribution of the incoming variable is statistically insignificant.
To take care of this drawback, the coefficient of determination is adjusted for the number of independent variables taken. This adjusted measure is called adjusted R2.
Adjusted R2 is given by the following formula:

$R_a^2 = 1 - \frac{n - 1}{n - k - 1}\,(1 - R^2)$

where
n = number of observations
k = number of independent variables
$R_a^2$ = adjusted R2
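The formula is easy to check in R against the value lm() reports (a sketch, again using the built-in mtcars data):

fit <- lm(mpg ~ wt + hp, data = mtcars)  # k = 2 independent variables
n   <- nrow(mtcars)                      # number of observations
k   <- 2                                 # number of independent variables
R2  <- summary(fit)$r.squared
1 - (n - 1) / (n - k - 1) * (1 - R2)     # manual adjusted R-squared
summary(fit)$adj.r.squared               # matches R's reported value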
Multiple Linear Regression
“Multiple regression” is a technique that allows additional
factors to enter the analysis separately so that the effect of
each can be estimated.
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + e$

Where:
Y = the variable that we are trying to predict (DV)
$X_1, \dots, X_k$ = the variables that we are using to predict Y (IVs)
$b_0$ = the intercept
$b_1, \dots, b_k$ = the slopes (coefficients of $X_1, \dots, X_k$)
e = the regression residual (error term)
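In R, such a model is fitted with a single lm() call (a sketch on the built-in mtcars data; the variable names are not from the course case study):

multi.model <- lm(mpg ~ wt + hp + disp, data = mtcars) # three independent variables
summary(multi.model) # b0 and b1..b3 estimates, each with a t-test for significance
coef(multi.model)    # just the estimated coefficients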
Assumptions in Multiple Regression Analysis
Correlation among the independent variables is called multicollinearity, which creates problems for the t-tests of statistical significance of the coefficients.
High correlation among the independent variables suggests the presence of multicollinearity, but low correlation values do not rule out its presence.
where
n = number of observations
$R^2_{resid}$ = coefficient of determination when the residuals are regressed on the independent variables
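A standard multicollinearity diagnostic, which the use-case code below also relies on, is the variance inflation factor (VIF); a minimal sketch using the car package on built-in data:

library(car)                                   # provides vif()
fit <- lm(mpg ~ wt + hp + disp, data = mtcars) # model with several predictors
vif(fit) # VIF per predictor; values above 5 (or 10) commonly flag multicollinearity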
CASE STUDY
This is monthly data of sales and advertising cost for a particular product.
We need to predict, given new data for advertising cost, how much the sales will be.
R code - Simple Linear Regression
setwd("E:\\Unnati\\Webinar\\Linear Regression") # Setting the working directory
# simple.model is assumed to have been fitted earlier in the session, e.g.:
# sales.data   <- read.csv("sales.csv")                     # hypothetical file name
# simple.model <- lm(Sales ~ Advertising, data = sales.data)
# Checking assumptions: residuals should be approximately normally distributed
qqnorm(simple.model$residuals)       # Q-Q plot: points near a straight line indicate normality
hist(simple.model$residuals)         # histogram: should look roughly bell-shaped
shapiro.test(simple.model$residuals) # Shapiro-Wilk test: p > 0.05 fails to reject normality
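For completeness, here is a self-contained sketch of the whole workflow on simulated data (the actual sales file from the case study is not shown in the slides):

set.seed(42)
adv   <- runif(24, 10, 50)                # 24 months of simulated advertising cost
sales <- 50 + 3 * adv + rnorm(24, sd = 5) # simulated linear relationship plus noise
simple.model <- lm(sales ~ adv)           # fit the simple linear regression
summary(simple.model)                     # coefficients, R-squared, residual standard error
predict(simple.model, newdata = data.frame(adv = 30)) # predicted sales for a new ad cost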
Use Case – Multi-variate Regression
The Walmart data set contains 200 entries of customer experience, collected by a survey.
Using the independent variables, we need to build a regression model on some training data and validate it on testing data. This model will be used for predicting customer satisfaction (a sketch of the split-and-validate step follows).
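A minimal sketch of that split-and-validate step (walmart.data is loaded in the code below; the dependent-variable name satisfaction is hypothetical):

set.seed(1)
idx        <- sample(nrow(walmart.data), floor(0.7 * nrow(walmart.data))) # 70/30 split
train.data <- walmart.data[idx, ]
test.data  <- walmart.data[-idx, ]
# train.model <- lm(satisfaction ~ ., data = train.data) # hypothetical DV name
# pred <- predict(train.model, newdata = test.data)      # validate on held-out data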
R code – Multivariate Regression cont.
# Setting the working directory
setwd("E:\\Courses_material\\Modules\\Foundation_Final\\Day 4\\use case")
########################## Multi-variate Regression ##################################
# Importing and viewing the dataset
library(sas7bdat) # provides read.sas7bdat()
walmart.data <- read.sas7bdat(file="walmart.sas7bdat")
View(walmart.data) # the dataset contains 200 observations and 14 variables.
# If the VIF value is greater than 5 for any variable (the threshold may differ from case to case), then multicollinearity is present.
# vif(train.model) # from the car package; train.model assumed fitted above
# In our case multicollinearity is present, as the values are more than 5 for some variables.
# Now we will go for stepwise regression and let R decide which variables to select
train.model1 <- step(train.model, direction = "both") # AIC-based stepwise selection
summary(train.model)  # full model
summary(train.model1) # reduced model after stepwise selection
# Now checking the assumptions of linear regression
hist(train.model1$residuals) # residuals should look approximately normal
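Besides normality, homoscedasticity of the residuals is usually checked as well (a sketch, assuming train.model1 from above):

plot(fitted(train.model1), residuals(train.model1), # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) # residuals should scatter randomly around zero with constant spread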
Quiz
2) It is observed that there is a very high correlation between math test scores and amount of physical exercise
done by a student on the test day. What can you infer from this?
1. High correlation implies that after exercise the test scores are high.
2. Correlation does not imply causation.
3. Correlation measures the strength of linear relationship between amount of exercise and test scores.
A) Only 1
B) 1 and 3
C) 2 and 3
D) All the statements are true
3) If the correlation coefficient (r) between scores in a math test and amount of physical exercise by a student is
0.86, what percentage of variability in math test is explained by the amount of exercise?
A) 86%
B) 74%
C) 14%
D) 26%
4) A regression analysis between weight (y) and height (x) resulted in the following least squares line: y = 120 + 5x. This implies that if the height is increased by 1 inch, the weight is expected to
A) increase by 1 pound
B) increase by 5 pounds
C) increase by 125 pounds
D) None of the above
THANK YOU