
Simple Linear Regression and Correlation
Two Main Objectives
• Establish if there is a relationship between two variables.
  – More specifically, establish if there is a statistically significant relationship between the two.
  – Examples: income and spending, wage and gender, student height and exam scores.

• Forecast new observations.
  – Can we use what we know about the relationship to forecast unobserved values?
  – Examples: What will our sales be over the next quarter? What will the ROI of a new store opening be, contingent on store attributes?
Variable's Roles

Dependent variable
• This is the variable whose values we want to explain or forecast.
• Its values depend on something else.
• We denote it as y.

Independent variable
• This is the variable that explains the other one.
• Its values are independent.
• We denote it as x.
The Model
• The first-order linear model:

    y = β0 + β1x + ε

  y  = dependent variable
  x  = independent variable
  β0 = y-intercept
  β1 = slope of the line (rise/run)
  ε  = error variable

  β0 and β1 are unknown and are therefore estimated from the data.
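To make the model concrete, here is a minimal simulation sketch (not from the original slides; the parameter values β0 = 2, β1 = 0.5 and σ = 1 are assumptions for illustration):

```python
import numpy as np

beta0, beta1, sigma = 2.0, 0.5, 1.0       # assumed illustrative parameters

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)          # independent variable
eps = rng.normal(0, sigma, size=100)      # error variable, constant std deviation
y = beta0 + beta1 * x + eps               # first-order linear model
```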
Estimating the Coefficients
• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics, and
  – producing a straight line that cuts into the data.

  [Scatter plot of the sample points.] The question is: which straight line fits best?
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5), (4, 3.2). The first line takes the values 1, 2, 3, 4 at x = 1, 2, 3, 4; the second line is horizontal at height 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
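A quick numerical check of the two sums above (a minimal sketch, not part of the original slides):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 1.5, 3.2])

line1 = x                         # the first candidate line (values 1, 2, 3, 4)
line2 = np.full_like(y, 2.5)      # the horizontal line at 2.5

sse1 = np.sum((y - line1) ** 2)   # 7.89
sse2 = np.sum((y - line2) ** 2)   # 3.99 -> the smaller sum, the better fit
```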
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

    b1 = cov(X, Y) / s_x²
    b0 = ȳ − b1·x̄

The regression equation that estimates the equation of the first-order linear model is:

    ŷ = b0 + b1·x
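A minimal sketch of these formulas in Python (the helper name least_squares_estimates is an assumption for illustration, not from the slides):

```python
import numpy as np

def least_squares_estimates(x, y):
    """Return (b0, b1) for the simple linear regression y-hat = b0 + b1*x."""
    cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of X and Y
    s2_x = np.var(x, ddof=1)              # sample variance of X
    b1 = cov_xy / s2_x
    b0 = np.mean(y) - b1 * np.mean(x)
    return b0, b1

# Applied to the four points from the previous slide:
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 1.5, 3.2])
b0, b1 = least_squares_estimates(x, y)    # b0 = 2.4, b1 = 0.11
```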
• Example 17.1: Relationship between odometer reading and a used car's selling price.
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

  Independent variable: x = odometer reading
  Dependent variable:   y = selling price

    Car   Odometer   Price
     1      37388     5318
     2      44758     5061
     3      45833     5008
     4      30862     5795
     5      31705     5784
     6      34010     5359
     .        .         .
     .        .         .
• Solution
  – Solving by hand
    • To calculate b0 and b1 we first need several sample statistics (n = 100):

        x̄ = 36,009.45;   s_x² = Σ(xi − x̄)² / (n − 1) = 43,528,688
        ȳ = 5,411.41;    cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −1,356,256

    • Then:

        b1 = cov(X, Y) / s_x² = −1,356,256 / 43,528,688 = −.0312
        b0 = ȳ − b1·x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533

        ŷ = b0 + b1·x = 6,533 − .0312x
– Using the computer (see file Xm17-01.xls)
  Tools > Data Analysis > Regression > [Shade the y range and the x range] > OK

  SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.806308
  R Square            0.650132
  Adjusted R Square   0.646562
  Standard Error      151.5688
  Observations        100

  ANOVA
               df        SS         MS          F         Significance F
  Regression    1     4183528    4183528    182.1056      4.4435E-24
  Residual     98     2251362    22973.09
  Total        99     6434890

              Coefficients   Standard Error    t Stat      P-value
  Intercept     6533.383        84.51232      77.30687     1.22E-89
  Odometer     -0.03116         0.002309     -13.4947      4.44E-24

  [Scatter plot of Price (4500–6000) against Odometer (19000–49000) with the fitted line ŷ = 6,533 − .0312x]
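For readers without Excel, an equivalent fit can be obtained in Python. This is a sketch using only the six cars listed earlier, so the numbers will not exactly match the full-sample output above:

```python
import numpy as np
from scipy import stats

# First six cars from the data table (the full sample has 100 cars).
odometer = np.array([37388, 44758, 45833, 30862, 31705, 34010])
price = np.array([5318, 5061, 5008, 5795, 5784, 5359])

result = stats.linregress(odometer, price)
print(result.intercept, result.slope)   # estimates of b0 and b1
print(result.rvalue ** 2)               # coefficient of determination (R Square)
print(result.pvalue)                    # p-value for testing the slope
```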
[Scatter plot of Price against Odometer with the fitted line ŷ = 6,533 − .0312x; the intercept (6,533) lies well to the left of the observed odometer range, where there is no data.]

The intercept is b0 = 6,533. Do not interpret the intercept as the "price of cars that have not been driven": x = 0 lies far outside the range of the sampled odometer readings.

The slope is b1 = −.0312. For each additional mile on the odometer, the price decreases by an average of $0.0312.
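As a usage example of the fitted equation (the odometer value of 40,000 miles is an assumption for illustration, not from the slides):

```python
b0, b1 = 6533.0, -0.0312                 # estimates from Example 17.1

odometer_reading = 40000                 # assumed illustrative value
predicted_price = b0 + b1 * odometer_reading
print(predicted_price)                   # 6533 - 0.0312 * 40000 = 5285.0
```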
From the first three assumptions we have:
y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σε.

[Figure: three normal curves centered at E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]
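A minimal simulation sketch of these assumptions (the parameter values are assumptions for illustration; σε is set near the standard error reported in the Excel output):

```python
import numpy as np

beta0, beta1, sigma_eps = 6533.0, -0.0312, 150.0   # illustrative values only
rng = np.random.default_rng(1)

for x in (20000, 30000, 40000):
    # y | x is normal with mean beta0 + beta1*x and constant std deviation sigma_eps
    y = rng.normal(beta0 + beta1 * x, sigma_eps, size=10000)
    print(x, round(y.mean(), 1), round(y.std(ddof=1), 1))   # mean changes with x; spread does not
```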
• Testing the slope
  – When no linear relationship exists between two variables, the regression line should be horizontal.

  [Two scatter plots: one showing a linear relationship, one showing no linear relationship.]

  Linear relationship: different inputs (x) yield different outputs (y). The slope is not equal to zero.
  No linear relationship: different inputs (x) yield the same output (y). The slope is equal to zero.
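A sketch of how the test of H0: β1 = 0 can be run in Python (again on the six-car subsample, so it will not reproduce the slide's t statistic of −13.49 or p-value of 4.44E-24):

```python
import numpy as np
from scipy import stats

odometer = np.array([37388, 44758, 45833, 30862, 31705, 34010])
price = np.array([5318, 5061, 5008, 5795, 5784, 5359])

result = stats.linregress(odometer, price)
t_stat = result.slope / result.stderr    # t statistic for H0: slope = 0
print(t_stat, result.pvalue)             # reject H0 when the p-value is small
```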
• Coefficient of determination
  – When we want to measure the strength of the linear relationship, we use the coefficient of determination:

        R² = [cov(X, Y)]² / (s_x² · s_y²)     or, equivalently,     R² = 1 − SSE / Σ(yi − ȳ)²
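A minimal numerical sketch showing that both formulas give the same value (computed on the six-car subsample, so the result will differ from the full-sample R Square of 0.650132):

```python
import numpy as np

x = np.array([37388, 44758, 45833, 30862, 31705, 34010])
y = np.array([5318, 5061, 5008, 5795, 5784, 5359])

cov_xy = np.cov(x, y, ddof=1)[0, 1]
r2_a = cov_xy ** 2 / (np.var(x, ddof=1) * np.var(y, ddof=1))

b1 = cov_xy / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)             # sum of squared errors
r2_b = 1 - sse / np.sum((y - y.mean()) ** 2)

print(r2_a, r2_b)   # the two formulas agree
```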
– To understand the significance of this coefficient, note:

  The overall variability in y is explained in part by the regression model; the remaining part stays unexplained and is attributed to the error.