Linear Regression Analysis: Gaurav Garg (IIM Lucknow)
Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R2 and Adjusted R2
Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity
Properties of Covariance:
Cov(X, Y) = Cov(Y, X)
Cov(X, X) = Var(X)
Cov(aX + b, cY + d) = ac·Cov(X, Y)
If X and Y are independent, then Cov(X, Y) = 0.
Correlation
Karl Pearson's correlation coefficient is given by
r_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X)·Var(Y))
Scatter Diagram
[Scatter plots of Y against X illustrating positively correlated, negatively correlated, strongly correlated, weakly correlated, and not correlated data.]
Example: For the following data, compute the correlation coefficient.

| x | y | x − x̄ | y − ȳ | (x − x̄)² | (y − ȳ)² | (x − x̄)(y − ȳ) |
|---|---|---|---|---|---|---|
| 1.25 | 125 | −0.90 | 45 | 0.8100 | 2025 | −40.50 |
| 1.75 | 105 | −0.40 | 25 | 0.1600 | 625 | −10.00 |
| 2.25 | 65 | 0.10 | −15 | 0.0100 | 225 | −1.50 |
| 2.00 | 85 | −0.15 | 5 | 0.0225 | 25 | −0.75 |
| 2.50 | 75 | 0.35 | −5 | 0.1225 | 25 | −1.75 |
| 2.25 | 80 | 0.10 | 0 | 0.0100 | 0 | 0.00 |
| 2.70 | 50 | 0.55 | −30 | 0.3025 | 900 | −16.50 |
| 2.50 | 55 | 0.35 | −25 | 0.1225 | 625 | −8.75 |
| 17.20 | 640 | | | 1.560 | 4450 | −79.75 |

Here SSX = Σ(x − x̄)² = 1.560, SSY = Σ(y − ȳ)² = 4450, and SSXY = Σ(x − x̄)(y − ȳ) = −79.75.

r = Cov(X, Y) / √(Var(X)·Var(Y)) = SSXY / √(SSX·SSY) = −79.75 / √(1.56 × 4450) = −0.957
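The same arithmetic can be scripted. Below is a minimal NumPy sketch (not part of the original slides) that reproduces SSX, SSY, SSXY, and r for the data above:

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)                 # 1.56
ssy = np.sum((y - y.mean()) ** 2)                 # 4450
ssxy = np.sum((x - x.mean()) * (y - y.mean()))    # -79.75
r = ssxy / np.sqrt(ssx * ssy)                     # about -0.957
print(ssx, ssy, ssxy, r)
```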
The same quantities can also be obtained from raw sums:
SSX = Σx² − (Σx)²/n,  SSY = Σy² − (Σy)²/n,  SSXY = Σxy − (Σx)(Σy)/n

| x | y | x² | y² | x·y |
|---|---|---|---|---|
| 1.25 | 125 | 1.5625 | 15625 | 156.25 |
| 1.75 | 105 | 3.0625 | 11025 | 183.75 |
| 2.25 | 65 | 5.0625 | 4225 | 146.25 |
| 2.00 | 85 | 4.0000 | 7225 | 170.00 |
| 2.50 | 75 | 6.2500 | 5625 | 187.50 |
| 2.25 | 80 | 5.0625 | 6400 | 180.00 |
| 2.70 | 50 | 7.2900 | 2500 | 135.00 |
| 2.50 | 55 | 6.2500 | 3025 | 137.50 |
| 17.20 | 640 | 38.54 | 55650 | 1296.25 |

SSX = 38.54 − (17.20)²/8 = 1.56
SSY = 55650 − (640)²/8 = 4450
SSXY = 1296.25 − (17.20)(640)/8 = −79.75

r = Cov(X, Y) / √(Var(X)·Var(Y)) = SSXY / √(SSX·SSY) = −79.75 / √(1.56 × 4450) = −0.957
Example: The following data give X and lung capacity (Y) for five subjects.

| X | Lung Capacity (Y) | X² | Y² | XY |
|---|---|---|---|---|
| 0 | 45 | 0 | 2025 | 0 |
| 5 | 42 | 25 | 1764 | 210 |
| 10 | 33 | 100 | 1089 | 330 |
| 15 | 31 | 225 | 961 | 465 |
| 20 | 29 | 400 | 841 | 580 |
| 50 | 180 | 750 | 6680 | 1585 |

r_xy = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
     = [(5)(1585) − (50)(180)] / √{[(5)(750) − (50)²][(5)(6680) − (180)²]}
     = −1075 / √(1250 × 1000) = −0.9615
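A quick check is possible in code; this hedged sketch (not from the slides) simply feeds the five pairs to NumPy's built-in correlation function:

```python
import numpy as np

x = np.array([0, 5, 10, 15, 20], dtype=float)
lung = np.array([45, 42, 33, 31, 29], dtype=float)
print(np.corrcoef(x, lung)[0, 1])   # about -0.9615
```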
Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable we wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.
Types of Relationships
[Scatter plots of Y against X illustrating linear and curvilinear relationships, strong and weak relationships, and no relationship.]
A linear relationship is described by the straight line y = a + b·x, where a is the intercept and b is the slope.
Least Squares Method
The fitted straight line is Ŷ = a + bX. For the i-th observation the error is
e_i = y_i − (a + b·x_i).
The least squares method chooses a and b so as to minimize
SSE = Σ_{i=1}^{n} (Y_i − a − bX_i)².
Setting ∂SSE/∂a = 0:  −2 Σ (Y_i − a − bX_i) = 0  ⇒  Σ Y_i = n·a + b·Σ X_i    ...(1)
Setting ∂SSE/∂b = 0:  −2 Σ (Y_i − a − bX_i)·X_i = 0  ⇒  Σ Y_i X_i = a·Σ X_i + b·Σ X_i²    ...(2)
Solving (1) and (2) gives
b = [n Σ X_i Y_i − (Σ X_i)(Σ Y_i)] / [n Σ X_i² − (Σ X_i)²] = SSXY / SSX,
a = Ȳ − b·X̄.
Also, the correlation coefficient between X and Y is
r_XY = Cov(X, Y) / √(Var(X)·Var(Y)) = SSXY / √(SSX·SSY) = b·√(SSX / SSY).
For the example data above, X̄ = 2.15, Ȳ = 80, and (from the table computed earlier) SSX = Σ(x − x̄)² = 1.560, SSY = Σ(y − ȳ)² = 4450, SSXY = Σ(x − x̄)(y − ȳ) = −79.75. Hence
b = SSXY / SSX = −79.75 / 1.56 ≈ −51.12
a = Ȳ − b·X̄ = 80 − (−51.12)(2.15) ≈ 189.91
so the fitted line is Ŷ = 189.91 − 51.12·X, and the correlation coefficient is
r = SSXY / √(SSX·SSY) = −0.957.
(A short computational sketch follows the figure below.)
[Scatter plot of the example data with the fitted regression line.]
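The least squares formulas can be applied directly to the example data; a small NumPy sketch (illustrative, not part of the slides) is:

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

# b = SSXY / SSX, a = Y-bar - b * X-bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # about -51.12
a = y.mean() - b * x.mean()                                                # about 189.91
print(a, b)
```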
Residuals: e_i = Y_i − Ŷ_i
The residual is the unexplained part of Y.
The smaller the residuals, the better the fit of the regression.
The least squares procedure ensures that the sum of the residuals is always zero.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
Coefficient of Determination
For each observation, (Y_i − Ȳ) = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i); squaring and summing over the n observations gives SST = SSR + SSE, where SST = Σ(Y_i − Ȳ)², SSR = Σ(Ŷ_i − Ȳ)², and SSE = Σ(Y_i − Ŷ_i)².
The coefficient of determination is R² = SSR/SST.
R² = 1 (r = ±1): perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.
Example (continued): With the fitted line Ŷ = 189.91 − 51.12·X, the fitted values and sums of squares are:

| x | y | Ŷ | y − ȳ | y − Ŷ | Ŷ − ȳ | (y − ȳ)² | (y − Ŷ)² | (Ŷ − ȳ)² |
|---|---|---|---|---|---|---|---|---|
| 1.25 | 125 | 126.0 | 45 | −1.0 | 46.0 | 2025 | 1.00 | 2116.00 |
| 1.75 | 105 | 100.5 | 25 | 4.5 | 20.5 | 625 | 20.25 | 420.25 |
| 2.25 | 65 | 74.9 | −15 | −9.9 | −5.1 | 225 | 98.00 | 26.01 |
| 2.00 | 85 | 87.7 | 5 | −2.2 | 7.7 | 25 | 4.84 | 59.29 |
| 2.50 | 75 | 62.1 | −5 | 12.9 | −17.7 | 25 | 166.41 | 313.29 |
| 2.25 | 80 | 74.9 | 0 | 5.1 | −5.1 | 0 | 26.01 | 26.01 |
| 2.70 | 50 | 51.9 | −30 | −1.9 | −28.1 | 900 | 3.61 | 789.61 |
| 2.50 | 55 | 62.1 | −25 | −7.1 | −17.9 | 625 | 50.41 | 320.41 |
| 17.20 | 640 | | | | | 4450 | 370.54 | 4079.46 |

So SST = 4450, SSE = 370.54, SSR = 4079.46, and R² = SSR/SST = 4079.46/4450 ≈ 0.917: about 91.7% of the variation in Y is explained by X.
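A short sketch of the SST/SSR/SSE decomposition for this example (not from the slides; the results differ slightly from the table above, which uses rounded fitted values):

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares, 4450
sse = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
print(ssr / sst, 1 - sse / sst)         # R^2 either way, about 0.92
```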
Example:
Watching television also reduces the amount of physical exercise, causing weight gain.
A sample of fifteen 10-year-old children was taken. The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight). The number of hours of television viewing per week was also recorded. These data are listed here.

| TV (hours/week) | 42 | 34 | 25 | 35 | 37 | 38 | 31 | 33 | 19 | 29 | 38 | 28 | 29 | 36 | 18 |
| Overweight (lb) | 18 | 6 | 0 | −1 | 13 | 14 | 7 | 7 | −9 | 8 | 8 | 5 | 3 | 14 | −7 |

Calculate the sample regression line and describe what it tells, and find R² (a calculation sketch follows the figure below).
[Plot of observed Y and predicted Y for the 15 observations.]
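A minimal sketch of the requested calculation (not from the slides), assuming the TV hours and overweight values pair up in the order listed in the table above:

```python
import numpy as np

tv = np.array([42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18], dtype=float)
overweight = np.array([18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7], dtype=float)

b = np.sum((tv - tv.mean()) * (overweight - overweight.mean())) / np.sum((tv - tv.mean()) ** 2)
a = overweight.mean() - b * tv.mean()
pred = a + b * tv
r2 = 1 - np.sum((overweight - pred) ** 2) / np.sum((overweight - overweight.mean()) ** 2)
print(a, b, r2)   # intercept, slope, and R^2 of the sample regression line
```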
Standard Error
Consider a dataset. All the observations cannot be exactly equal to the arithmetic mean (AM); the variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values. The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √( SSE / (n − 2) ) = √( Σ_{i=1}^{n} (Y_i − Ŷ_i)² / (n − 2) )
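As a small illustrative helper (the function name is my own, not from the slides), the standard error of the estimate can be computed as:

```python
import numpy as np

def std_error_of_estimate(y: np.ndarray, y_hat: np.ndarray) -> float:
    """S_YX = sqrt(SSE / (n - 2)) for a simple linear regression."""
    sse = np.sum((y - y_hat) ** 2)
    return float(np.sqrt(sse / (len(y) - 2)))
```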
Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity): Var(e_i) = σ², where e_i = Y_i − Ŷ_i.
E(e_i) = 0.
No distributional assumption about the errors is required for the least squares method.
Linearity: [residual plots comparing a linear and a non-linear relationship]
Independence: [residual plots comparing independent and non-independent errors]
Equal Variance: [residual plots comparing equal variance (homoscedastic) and unequal variance (heteroscedastic) errors]
[Residual plots for the television viewing example.]
Example:
A distributor of frozen dessert pies wants to evaluate factors which influence demand.
Dependent variable:
Y: Pie sales (units per week)
Independent variables:
X1: Price (in $)
X2: Advertising expenditure (in $100s)

| Week | Pie Sales | Price ($) | Advertising ($100s) |
|---|---|---|---|
| 1 | 350 | 5.50 | 3.3 |
| 2 | 460 | 7.50 | 3.3 |
| 3 | 350 | 8.00 | 3.0 |
| 4 | 430 | 8.00 | 4.5 |
| 5 | 350 | 6.80 | 3.0 |
| 6 | 380 | 7.50 | 4.0 |
| 7 | 430 | 4.50 | 3.0 |
| 8 | 470 | 6.40 | 3.7 |
| 9 | 450 | 7.00 | 3.5 |
| 10 | 490 | 5.00 | 4.0 |
| 11 | 340 | 7.20 | 3.5 |
| 12 | 300 | 7.90 | 3.2 |
| 13 | 440 | 5.90 | 4.0 |
| 14 | 450 | 5.00 | 3.5 |
| 15 | 300 | 7.00 | 2.7 |
The Multiple Linear Regression Model
Y_i = β0 + β1·X_1i + β2·X_2i + ... + βk·X_ki + ε_i,   i = 1, 2, ..., n,
where β0 is the intercept, β1, ..., βk are the slopes, and ε_i is the random error.
The estimated model is
Ŷ_i = b0 + b1·X_1i + b2·X_2i + ... + bk·X_ki,   i = 1, 2, ..., n,
where b0 is the estimate of the intercept and b1, ..., bk are the estimates of the slopes.
[Regression plane of Y on the two regressors X1 and X2.]
In matrix notation, the model can be written as Y = Xβ + ε, where
Y is the n×1 vector of observations on the dependent variable,
X is the n×(k+1) matrix of regressors (with a column of 1s for the intercept),
β is the (k+1)×1 vector of coefficients, and
ε is the n×1 vector of random errors.
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(ε_i) = σ².
In the long run, the mean effect of the random errors is zero: E(ε_i) = 0.
Least Squares Estimates
S(β) = Σ_{i=1}^{n} ε_i² = (Y − Xβ)′(Y − Xβ) = Y′Y − 2β′X′Y + β′X′Xβ
We differentiate S(β) with respect to β and equate to zero, i.e., ∂S/∂β = 0.
This gives the least squares estimate
b = (X′X)⁻¹X′Y
Least squares estimates for the pie sales example:
Intercept (b0) = 306.53
LSE of slope 1: Price (b1) = −24.98
LSE of slope 2: Advertising (b2) = 74.13
Fitted model: Ŷ = 306.53 − 24.98·X1 + 74.13·X2

Prediction: Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350 (X2 = 3.5):
Ŷ = 306.53 − 24.98(5.50) + 74.13(3.5) ≈ 428.62
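A hedged NumPy sketch (not part of the slides) of the matrix computation b = (X′X)⁻¹X′Y for the pie sales data, including the prediction above; variable names are illustrative:

```python
import numpy as np

# Pie sales example: y = weekly sales, x1 = price ($), x2 = advertising ($100s)
y = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
x1 = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
x2 = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones_like(y), x1, x2])   # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)            # b = (X'X)^(-1) X'Y
print(b)                                         # expect roughly [306.53, -24.98, 74.13]

# Prediction at price $5.50 and advertising $350 (X2 = 3.5)
print(np.array([1.0, 5.5, 3.5]) @ b)             # expect about 428.6
```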
| Week | X1 (Price) | X2 (Advertising) | Predicted Ŷ | Residual |
|---|---|---|---|---|
| 1 | 5.5 | 3.3 | 413.77 | −63.80 |
| 2 | 7.5 | 3.3 | 363.81 | 96.15 |
| 3 | 8.0 | 3.0 | 329.08 | 20.88 |
| 4 | 8.0 | 4.5 | 440.28 | −10.31 |
| 5 | 6.8 | 3.0 | 359.06 | −9.09 |
| 6 | 7.5 | 4.0 | 415.70 | −35.74 |
| 7 | 4.5 | 3.0 | 416.51 | 13.47 |
| 8 | 6.4 | 3.7 | 420.94 | 49.03 |
| 9 | 7.0 | 3.5 | 391.13 | 58.84 |
| 10 | 5.0 | 4.0 | 478.15 | 11.83 |
| 11 | 7.2 | 3.5 | 386.13 | −46.16 |
| 12 | 7.9 | 3.2 | 346.40 | −46.44 |
| 13 | 5.9 | 4.0 | 455.67 | −15.70 |
| 14 | 5.0 | 3.5 | 441.09 | 8.89 |
| 15 | 7.0 | 2.7 | 331.82 | −31.85 |

[Plot of observed and predicted pie sales for the 15 weeks.]
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
R² = SSR/SST = 1 − (SSE/SST)
R² is the proportion of variation in Y explained by the regression.
Since SST = SSR + SSE and all three quantities are non-negative, 0 ≤ SSR ≤ SST, so 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase, regardless of the contribution of the newly added regressor.
So an adjusted value of R², called adjusted R², is defined as
R²_Adj = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]
where
SST = Σ_{i=1}^{n} (Y_i − Ȳ)²  and  SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (Y_i − Ŷ_i)².
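A tiny illustrative helper for the adjusted R² formula (the function name is not from the slides):

```python
def adjusted_r2(sse: float, sst: float, n: int, k: int) -> float:
    """R^2_Adj = 1 - [SSE/(n-k-1)] / [SST/(n-1)]."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))
```

For the pie sales ANOVA shown next (SSE = 27033.31, SST = 56493.33, n = 15, k = 2), this gives an adjusted R² of roughly 0.44, compared with R² ≈ 0.52.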
Overall Validity of the Model (F test)

| Source | df | SS | MS | Fc |
|---|---|---|---|---|
| Regression | k | SSR | MSR = SSR/k | MSR/MSE |
| Residual or Error | n − k − 1 | SSE | MSE = SSE/(n − k − 1) | |
| Total | n − 1 | SST | | |

Test statistic: Fc = MSR / MSE ~ F(k, n − k − 1)

For the previous example, we wish to test
H0: β1 = β2 = 0  against  H1: at least one βi ≠ 0

ANOVA Table
| Source | df | SS | MS | Fc | F(2,12)(0.05) |
|---|---|---|---|---|---|
| Regression | 2 | 29460.03 | 14730.01 | 6.5386 | 3.89 |
| Residual or Error | 12 | 27033.31 | 2252.78 | | |
| Total | 14 | 56493.33 | | | |

Since Fc = 6.5386 > 3.89, H0 is rejected at the 5% level.
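The F test can be checked in code; this sketch (not from the slides) uses SciPy's F distribution to obtain the 5% critical value of about 3.89:

```python
from scipy import stats

# Overall F test for the pie sales model (k = 2 regressors, n = 15)
msr = 29460.03 / 2                           # mean square regression
mse = 27033.31 / 12                          # mean square error
f_c = msr / mse                              # observed F statistic, about 6.54
f_crit = stats.f.ppf(0.95, dfn=2, dfd=12)    # 5% critical value, about 3.89
print(f_c, f_crit, f_c > f_crit)             # True -> reject H0: beta1 = beta2 = 0
```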
Testing Individual Regressors (t test)
Test statistic:
Tc = bj / √(σ̂²·Cjj),
where σ̂² = MSE and Cjj is the j-th diagonal element of (X′X)⁻¹; under H0: βj = 0, Tc follows a t distribution with n − k − 1 degrees of freedom.
In our example, σ̂² = MSE = 2252.7755.
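A self-contained sketch (illustrative names, not the slides' code) that computes the Tc statistics from the design matrix, showing where the Cjj terms come from:

```python
import numpy as np

def coef_t_stats(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """t statistics T_c = b_j / sqrt(MSE * C_jj) for each column of the design matrix X
    (X must already contain the intercept column of ones)."""
    n, p = X.shape                           # p = k + 1
    b = np.linalg.solve(X.T @ X, X.T @ y)    # least squares estimates
    resid = y - X @ b
    mse = resid @ resid / (n - p)            # sigma^2-hat = SSE / (n - k - 1)
    c_jj = np.diag(np.linalg.inv(X.T @ X))   # diagonal of (X'X)^(-1)
    return b / np.sqrt(mse * c_jj)
```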
Standard Error
As before, the variability of the Y values around the fitted regression is measured by the STANDARD ERROR OF THE ESTIMATE, which with k regressors is
S_YX = √( SSE / (n − k − 1) ) = √( Σ_{i=1}^{n} (Y_i − Ŷ_i)² / (n − k − 1) ).
Assumption of Linearity: [residual plots comparing a linear and a non-linear relationship]
Assumption of Equal Variance: [residual plots comparing equal and unequal variance]
Assumption of Independence: [residual plots comparing independent and non-independent errors]
Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined using a histogram of the residuals.
[Histograms of residuals illustrating normal and non-normal shapes.]
To compare regressors measured in different units, each variable is standardized using its mean and standard deviation:
Ȳ = (1/n) Σ Y_i,   s_Y = √( Σ (Y_i − Ȳ)² / (n − 1) )
X̄1 = (1/n) Σ X_1i,   s_X1 = √( Σ (X_1i − X̄1)² / (n − 1) )
X̄2 = (1/n) Σ X_2i,   s_X2 = √( Σ (X_2i − X̄2)² / (n − 1) )
Each observation is replaced by (value − mean) / standard deviation.
Standardized Data
| Week | Pie Sales | Price ($) | Advertising ($100s) |
|---|---|---|---|
| 1 | −0.78 | −0.95 | −0.37 |
| 2 | 0.96 | 0.76 | −0.37 |
| 3 | −0.78 | 1.18 | −0.98 |
| 4 | 0.48 | 1.18 | 2.09 |
| 5 | −0.78 | 0.16 | −0.98 |
| 6 | −0.30 | 0.76 | 1.06 |
| 7 | 0.48 | −1.80 | −0.98 |
| 8 | 1.11 | −0.18 | 0.45 |
| 9 | 0.80 | 0.33 | 0.04 |
| 10 | 1.43 | −1.38 | 1.06 |
| 11 | −0.93 | 0.50 | 0.04 |
| 12 | −1.56 | 1.10 | −0.57 |
| 13 | 0.64 | −0.61 | 1.06 |
| 14 | 0.80 | −1.38 | 0.04 |
| 15 | −1.56 | 0.33 | −1.60 |
The fitted standardized regression is
Ŷ = 0 − 0.461·X1 + 0.570·X2
Since |−0.461| < 0.570, X2 (advertising) contributes more to explaining Y.
Note that:
R²_Adj = 1 − (1 − R²)(n − 1) / (n − k − 1)
Fc = (n − k − 1)·R² / [k·(1 − R²)]
Example: Weekly sales together with the number of ads (No_Adv) and the advertising expenditure (Ex_Adv) for 12 periods:

| Sales | Ads (Nos.) (No_Adv) | Ad Expenditure (Ex_Adv) |
|---|---|---|
| 43.6 | 12 | 13.9 |
| 38.0 | 11 | 12 |
| 30.1 | | 9.3 |
| 35.3 | | 9.7 |
| 46.4 | 12 | 12.3 |
| 34.2 | | 11.4 |
| 30.2 | | 9.3 |
| 40.7 | 13 | 14.3 |
| 38.5 | | 10.2 |
| 22.6 | | 8.4 |
| 37.6 | | 11.2 |
| 35.2 | 10 | 11.1 |
ANOVA (SPSS output)
| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 309.986 | 2 | 154.993 | 9.741 | .006 |
| Residual | 143.201 | 9 | 15.911 | | |
| Total | 453.187 | 11 | | | |

Based on the ANOVA F test, H0: β1 = β2 = 0 is rejected (Sig. = .006 < 0.05). Yet the t tests in the coefficients table below do not reject β0 = 0, β1 = 0, or β2 = 0 individually — a CONTRADICTION, which points to multicollinearity.
Coefficients (SPSS output)
| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 6.584 | 8.542 | | .771 | .461 |
| No_Adv | .625 | 1.120 | .234 | .558 | .591 |
| Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180 |
a. Dependent Variable: Sales
Multicollinearity
We assume that the regressors are independent variables. When we regress Y on the regressors X1, X2, ..., Xk, these regressors may themselves be nearly linearly related; this is the problem of multicollinearity.
It can be detected using the Variance Inflation Factor:
VIF_j = 1 / (1 − R_j²),
where R_j² is the coefficient of determination obtained by regressing X_j on the remaining regressors.
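A hedged sketch of the VIF computation (not from the slides), regressing each column of X on the remaining columns; the function name is illustrative:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2); X holds the regressors only (no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, xj, rcond=None)[0]
        resid = xj - others @ beta
        r2_j = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)
    return out
```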
Coefficients with collinearity statistics (SPSS output)
| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | 6.584 | 8.542 | | .771 | .461 | | |
| No_Adv | .625 | 1.120 | .234 | .558 | .591 | .199 | 5.022 |
| Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180 | .199 | 5.022 |
a. Dependent Variable: Sales
Tolerance = 1/VIF. Both VIF values are greater than 5, indicating multicollinearity.
Collinearity Diagnostics (SPSS output)
| Model 1, Dimension | Eigenvalue | Condition Index | Variance Proportion: (Constant) | No_Adv | Ex_Adv |
|---|---|---|---|---|---|
| 1 | 2.966 | 1.000 | .00 | .00 | .00 |
| 2 | .030 | 9.882 | .33 | .17 | .00 |
| 3 | .003 | 30.417 | .67 | .83 | 1.00 |
a. Dependent Variable: Sales
The negligible eigenvalue (.003) and the large condition index (30.417) also point to multicollinearity.
Stepwise Regression
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε

Step 1: Run 5 simple linear regressions:
Y = β0 + β1X1
Y = β0 + β2X2
Y = β0 + β3X3
Y = β0 + β4X4   <== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β5X5

Step 2: Keep X4 and run the 4 two-regressor models:
Y = β0 + β4X4 + β1X1
Y = β0 + β4X4 + β2X2
Y = β0 + β4X4 + β3X3   <== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β4X4 + β5X5

Step 3: Keep X3 and X4 and try each remaining regressor; if no added regressor has p-value < 0.05, STOP.
The best model is the one with X3 and X4 only. (A code sketch of this procedure follows.)
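A rough sketch of this forward stepwise procedure (not from the slides), using the overall ANOVA p-value as the entry criterion as described above; function names are illustrative:

```python
import numpy as np
from scipy import stats

def overall_f_pvalue(Xsub: np.ndarray, y: np.ndarray) -> float:
    """p-value of the overall ANOVA F test for the model y ~ intercept + Xsub."""
    n, k = Xsub.shape
    Xd = np.column_stack([np.ones(n), Xsub])
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]
    resid = y - Xd @ b
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    f = ((sst - sse) / k) / (sse / (n - k - 1))
    return 1.0 - stats.f.cdf(f, k, n - k - 1)

def forward_stepwise(X: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> list:
    """Add, one at a time, the regressor giving the lowest ANOVA p-value below alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {j: overall_f_pvalue(X[:, selected + [j]], y) for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                     # STOP: no remaining regressor qualifies
        selected.append(best)
        remaining.remove(best)
    return selected                   # indices of the chosen regressors
```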
Regression of Sales on No_Adv only:
ANOVA
| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 276.308 | 1 | 276.308 | 15.621 | .003 |
| Residual | 176.879 | 10 | 17.688 | | |
| Total | 453.187 | 11 | | | |
a. Dependent Variable: Sales

Coefficients
| | B | Std. Error | Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 16.937 | 4.982 | | 3.400 | .007 |
| No_Adv | 2.083 | .527 | .781 | 3.952 | .003 |
Regression of Sales on Ex_Adv only:
ANOVA
| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 305.039 | 1 | 305.039 | 20.590 | .001 |
| Residual | 148.148 | 10 | 14.815 | | |
| Total | 453.187 | 11 | | | |
a. Dependent Variable: Sales

Coefficients
| | B | Std. Error | Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 4.173 | 7.109 | | .587 | .570 |
| Ex_Adv | 2.872 | .633 | .820 | 4.538 | .001 |

Of the two simple regressions, the one on Ex_Adv has the lower p-value (.001), so Ex_Adv is selected first.
Regression of Sales on both Ex_Adv and No_Adv:
ANOVA
| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 309.986 | 2 | 154.993 | 9.741 | .006 |
| Residual | 143.201 | 9 | 15.911 | | |
| Total | 453.187 | 11 | | | |
a. Predictors: (Constant), Ex_Adv, No_Adv

Coefficients
| | B | Std. Error | Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 6.584 | 8.542 | | .771 | .461 |
| No_Adv | .625 | 1.120 | .234 | .558 | .591 |
| Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180 |
a. Dependent Variable: Sales
Dummy Variables
Example: Ten service calls, with the type of repair and the repair time recorded.

| Type of Repair | Repair Time (Hours) |
|---|---|
| electrical | 2.9 |
| mechanical | 3.0 |
| electrical | 4.8 |
| mechanical | 1.8 |
| electrical | 2.9 |
| electrical | 4.9 |
| mechanical | 4.2 |
| mechanical | 4.8 |
| electrical | 4.4 |
| electrical | 4.5 |

Regressing repair time on X1 gives
Ŷ = 2.1473 + 0.3041·X1,   R² = 0.534
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable
X2 = 0 if the type of repair is mechanical, and X2 = 1 if the type of repair is electrical.
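A small sketch of the dummy coding (not from the slides), assuming the repair types and times pair up in the order listed above; the X1 values are not shown on the slide, so only the dummy regressor is included here:

```python
import numpy as np

repair_type = np.array(["electrical", "mechanical", "electrical", "mechanical", "electrical",
                        "electrical", "mechanical", "mechanical", "electrical", "electrical"])
y = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])   # repair time in hours

x2 = (repair_type == "electrical").astype(float)   # X2 = 1 for electrical, 0 for mechanical
X = np.column_stack([np.ones_like(y), x2])         # add the X1 column here when its values are available
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)   # b[1] estimates the shift in mean repair time for electrical repairs
```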
Summary
Multiple linear regression model: Y = Xβ + ε
Least squares estimate of β: b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all the βs are zero or not.
A t test is conducted for each regressor separately; using it, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality