Linear Regression and Correlation
Two Main Objectives
• Establish if there is a relationship between two variables.
- More specifically, establish if there is a statistically significant relationship between the two.
- Examples: income and spending, wage and gender, student height and exam scores.
The Model
• The first order linear model
$y = \beta_0 + \beta_1 x + \varepsilon$
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line (rise/run)
$\varepsilon$ = error variable
$\beta_0$ and $\beta_1$ are unknown and, therefore, are estimated from the data.
Estimating the Coefficients
Let us compare two lines fitted to the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).
The first line takes the values 1, 2, 3, 4 at x = 1, 2, 3, 4. The second line is horizontal, at y = 2.5.
First line: Sum of squared differences = $(2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89$
Second line: Sum of squared differences = $(2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99$
The smaller the sum of squared differences, the better the fit of the line to the data.
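The comparison of the two lines can be reproduced with a short sketch. The four points come from the slide; the first line is assumed here to be y = x, which matches the fitted values 1, 2, 3, 4 used in the sums, and the function name `sum_squared_differences` is only illustrative.

```python
# Points read from the slide's scatter plot.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]


def sum_squared_differences(points, predict):
    """Sum of squared vertical distances between each point and a line."""
    return sum((y - predict(x)) ** 2 for x, y in points)


# First line: assumed to be y = x (its values at x = 1..4 are 1..4).
ssd_line_1 = sum_squared_differences(points, lambda x: x)

# Second line: horizontal at y = 2.5.
ssd_line_2 = sum_squared_differences(points, lambda x: 2.5)

print(round(ssd_line_1, 2), round(ssd_line_2, 2))
```

Whichever line gives the smaller value fits these four points better in the least squares sense.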
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:
$b_1 = \dfrac{\mathrm{cov}(X, Y)}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
The regression equation that estimates the equation of the first order linear model is:
$\hat{y} = b_0 + b_1 x$
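As a minimal sketch of these formulas, the code below computes b1 and b0 from the sample covariance and the sample variance of x. The data are the four points from the earlier line comparison, reused here purely for illustration; the helper name `least_squares` is an assumption, not something from the slides.

```python
from statistics import mean, variance


def least_squares(xs, ys):
    """Return (b0, b1) using b1 = cov(X, Y) / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance with the n - 1 divisor, matching the sample variance.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    b1 = cov_xy / variance(xs)
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
b0, b1 = least_squares(xs, ys)

# Sum of squared differences for the fitted line y_hat = b0 + b1 * x.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, sse)
```

Because b0 and b1 minimize the sum of squared differences, `sse` here cannot exceed the value obtained for any other straight line through the same points.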
• Example 17.1 Relationship between odometer
reading and a used car’s selling price.
Independent variable x: odometer reading
Dependent variable y: selling price
• Solution
– Solving by hand
• To calculate b0 and b1 we need to calculate several
statistics first;
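$\bar{x} = 36{,}009.45$; $s_x^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = 43{,}528{,}688$
$\bar{y} = 5{,}411.41$; $\mathrm{cov}(X, Y) = -1{,}356{,}256$
$b_1 = \dfrac{\mathrm{cov}(X, Y)}{s_x^2} = \dfrac{-1{,}356{,}256}{43{,}528{,}688} = -.0312$
$b_0 = \bar{y} - b_1 \bar{x} = 5{,}411.41 - (-.0312)(36{,}009.45) = 6{,}533$
$\hat{y} = b_0 + b_1 x = 6{,}533 - .0312x$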
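The hand calculation can be checked with a few lines of arithmetic. This sketch uses only the summary statistics quoted for Example 17.1, with the covariance taken as negative (price falls as the odometer reading rises); the variable names are illustrative.

```python
# Summary statistics quoted for Example 17.1.
x_bar = 36009.45          # mean odometer reading
y_bar = 5411.41           # mean selling price
var_x = 43528688          # sample variance of odometer readings
cov_xy = -1356256         # sample covariance, taken as negative

b1 = cov_xy / var_x       # slope
b0 = y_bar - b1 * x_bar   # intercept

print(round(b1, 4), round(b0))
```

Carrying the unrounded slope into the intercept calculation avoids the small drift that appears when the rounded value .0312 is used instead.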
– Using the computer (see file Xm17-01.xls)
Tools > Data analysis > Regression > [Shade the y range and the x range] > OK
SUMMARY OUTPUT

Multiple R           0.806308
R Square             0.650132
Adjusted R Square    0.646562
Standard Error       151.5688
Observations         100

ANOVA
             df    SS         MS          F          Significance F
Regression   1     4183528    4183528     182.1056   4.4435E-24
Residual     98    2251362    22973.09
Total        99    6434890

[Figure: scatter plot of Price against Odometer with the fitted line $\hat{y} = 6{,}533 - .0312x$]
[Figure: the same fitted line extended back to Odometer = 0, a region with no data]
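The entries of the summary output are tied together by simple identities, which the sketch below recomputes from the ANOVA sums of squares. The formulas used (R Square = SS regression / SS total, Standard Error = square root of MS residual, F = MS regression / MS residual, and the usual adjusted R Square for one predictor) are standard for simple regression; the variable names are illustrative.

```python
import math

# Values taken from the ANOVA table of the summary output.
ss_regression = 4183528
ss_residual = 2251362
ss_total = 6434890
df_regression = 1
df_residual = 98
n = 100  # observations

r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
# Adjusted R Square for a model with one independent variable.
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)
f_statistic = ms_regression / ms_residual

print(r_square, multiple_r, adjusted_r_square, standard_error, f_statistic)
```

Multiple R is reported as a positive square root, so it does not carry the sign of the slope.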
From the first three assumptions we have: y is normally distributed with mean $E(y) = \beta_0 + \beta_1 x$, and a constant standard deviation.
The standard deviation remains constant, while the mean changes with x:
$E(y|x_1) = \beta_0 + \beta_1 x_1$
$E(y|x_2) = \beta_0 + \beta_1 x_2$
$E(y|x_3) = \beta_0 + \beta_1 x_3$
• Testing the slope
$R^2 = \dfrac{[\mathrm{cov}(X, Y)]^2}{s_x^2 s_y^2}$ or $R^2 = 1 - \dfrac{SSE}{\sum (y_i - \bar{y})^2}$
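The two expressions for R² give the same number when the line is the least squares line. The sketch below checks this on the four points from the earlier line comparison, which are reused purely for illustration; SSE is taken to be the sum of squared differences between the observed y values and the fitted values.

```python
from statistics import mean, variance

xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

# Sample covariance and least squares coefficients.
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
b1 = cov_xy / variance(xs)
b0 = y_bar - b1 * x_bar

# First form: squared covariance over the product of the sample variances.
r2_from_cov = cov_xy ** 2 / (variance(xs) * variance(ys))

# Second form: one minus SSE over the total variation in y.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
total_variation = sum((y - y_bar) ** 2 for y in ys)
r2_from_sse = 1 - sse / total_variation

print(r2_from_cov, r2_from_sse)
```

For these four points R² is close to zero, which agrees with the earlier observation that a horizontal line fits them almost as well as any sloped line.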
– To understand the significance of this coefficient
note: