Lecture 11

Chapter 16

Simple Linear Regression and Correlation

• Model

• Estimating the Coefficients

• Error Variable: Required Conditions

• Assessing the Model

1
Regression Analysis…
Our problem objective is to analyze the relationship between
interval variables; regression analysis is the tool we will study.

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).

Dependent variable: denoted as Y

Independent variables: denoted as X1, X2, …, Xk

2
Regression Analysis…

This chapter will examine the relationship between two variables; this technique is sometimes called simple linear regression.

Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.

3
Model Types…
Deterministic Model: an equation or set of equations that allow us
to fully determine the value of the dependent variable from the
values of the independent variables. Deterministic models are
usually unrealistic.
E.g. is it reasonable to believe that we can determine the selling
price of a house solely on the basis of its size?
Contrast this with…

Probabilistic Model: a method used to capture the randomness that is part of a real-life process.

E.g. do all houses of the same size (measured in square feet) sell
for exactly the same price?

4
A Model…
To create a probabilistic model, we start with a deterministic
model that approximates the relationship we want to model
and add a random term that measures the error of the
deterministic component.

Deterministic Model:
The cost of building a new house is about $100 per square foot
and most lots sell for about $100,000. Hence the approximate
selling price (y) would be:
y = $100,000 + ($100/ft²)(x)
(where x is the size of the house in square feet)

5
A Model…
A model of the relationship between house size (independent
variable) and house price (dependent variable) would be:

[Graph: House Price vs. House size. The line Price = 100,000 + 100(Size) rises from an intercept of $100,000 (most lots sell for $100,000) at a slope of $100 per square foot (building a house costs about $100 per square foot).]

In this model, the price of the house is completely determined by the size.
6
A Model…
In real life however, the house cost will vary even among the
same size of house:
[Graph: House Price vs. House size, with points scattered around the line House Price = 100,000 + 100(Size) + ɛ; the scatter shows lower vs. higher variability around the line at different sizes.]

Same square footage, but different price points (e.g. décor options, lot location…)
7
Random Term…
We now represent the price of a house as a function of its size
in this Probabilistic Model:

y = 100,000 + 100x + ɛ

Where ɛ (Greek letter epsilon) is the random term (a.k.a. error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the square footage (i.e. x) remains the same.

8
Simple Linear Regression Model…

A straight line model with one independent variable is called a first order linear model or a simple linear regression model. It is written as:

y = β0 + β1x + ɛ

where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope of the line, and ɛ is the error variable.

9
Simple Linear Regression Model…

Note that both β0 and β1 are population parameters which are usually unknown and hence estimated from the data.

β1 = slope (= rise/run)
β0 = y-intercept

10
Estimating the Coefficients…
In much the same way we base estimates of μ on x̄, we estimate β0 using b0 and β1 using b1, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = b0 + b1x, where b1 = Sxy / Sxx and b0 = ȳ − b1x̄
(with Sxy = Σ(xi − x̄)(yi − ȳ) and Sxx = Σ(xi − x̄)²)

(Recall: this is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)

11
Example 16.1
The annual bonuses ($1,000s) of six employees with
different years of experience were recorded as
follows. We wish to determine the straight line
relationship between annual bonus and years of
experience.

Years of experience (x):  1   2   3   4   5   6
Annual bonus (y):         6   1   9   5   17  12

12
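As a check on the least squares formulas, the coefficients for this small dataset can be computed by hand or with a short Python sketch (using b1 = Sxy/Sxx and b0 = ȳ − b1x̄ applied to the data above):

```python
# Least squares fit for Example 16.1:
# annual bonus ($1,000s) vs. years of experience
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sxy = sum of (xi - x_bar)(yi - y_bar); Sxx = sum of (xi - x_bar)^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # y-intercept

print(f"yhat = {b0:.3f} + {b1:.3f}x")  # yhat = 0.933 + 2.114x
```

The fitted line suggests each additional year of experience adds about $2,114 to the annual bonus in this sample.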
Least Squares Line…

These differences are called residuals.

[Graph: the least squares line drawn through the data points; this line minimizes the sum of the squared differences between the points and the line.]

13
Example 16.2…
Car dealers across North America use the "Red Book" to help
them determine the value of used cars that their customers trade
in when purchasing new cars.
The book, which is published monthly, lists the trade-in values
for all basic models of cars.
It provides alternative values for each car model according to
its condition and optional features.
The values are determined on the basis of the average paid at
recent used-car auctions, the source of supply for many used-
car dealers.

14
Example 16.2…
However, the Red Book does not indicate the value determined by
the odometer reading, despite the fact that a critical factor for
used-car buyers is how far the car has been driven.

To examine this issue, a used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction during the past month.

The dealer recorded the price ($1,000s) and the number of miles (thousands) on the odometer. (Xm16-02)

The dealer wants to find the regression line.

15
Example 16.2…
Click Data, Data Analysis, Regression

16
Example 16.2…
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8052
R Square             0.6483
Adjusted R Square    0.6447
Standard Error       0.3265
Observations         100

ANOVA
             df    SS      MS      F        Significance F
Regression    1    19.26   19.26   180.64   5.75E-24
Residual     98    10.45   0.11
Total        99    29.70

             Coefficients   Standard Error   t Stat    P-value
Intercept    17.25          0.182            94.73     3.57E-98
Odometer     -0.0669        0.0050           -13.44    5.75E-24

Lots of good statistics are calculated for us, but for now, all we’re interested in are the coefficients.

17
Example 16.2… INTERPRET

As you might expect with used cars…

The slope coefficient, b1, is –0.0669; that is, each additional mile on the odometer decreases the price by $.0669, or 6.69¢.

The intercept, b0, is 17,250. One interpretation would be that when x = 0 (no miles on the car) the selling price is $17,250. However, we have no data for cars with fewer than 19,100 miles on them, so this isn’t a correct assessment.

18
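To illustrate how the estimated line is used for prediction, here is a small sketch; the 40 (thousand miles) plugged in is a hypothetical odometer reading, chosen to lie inside the sample’s data range:

```python
# Estimated line from Example 16.2
# (price in $1,000s, odometer in thousands of miles)
b0, b1 = 17.25, -0.0669

def predicted_price(odometer_thousands):
    """Point prediction of auction selling price, in $1,000s."""
    return b0 + b1 * odometer_thousands

# A car with 40,000 miles on the odometer (hypothetical example value)
price = predicted_price(40)
print(round(price, 3))  # 14.574, i.e. about $14,574
```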
Example 16.2… INTERPRET

Selecting “line fit plots” on the Regression dialog box will produce a scatter plot of the data and the regression line…

19
Required Conditions…
For these regression methods to be valid the following four
conditions for the error variable (ɛ) must be met:
• The probability distribution of ɛ is normal.
• The mean of the distribution is 0; that is, E(ɛ) = 0.
• The standard deviation of ɛ is σɛ, which is a constant
regardless of the value of x.
• The value of ɛ associated with any particular value of y is
independent of ɛ associated with any other value of y.

20
Assessing the Model…
The least squares method will always produce a straight line,
even if there is no relationship between the variables, or if the
relationship is something other than linear.

Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it “fits” the data. We’ll see these evaluation methods now. They’re based on the sum of squares for errors (SSE).

21
Sum of Squares for Error (SSE)…
The sum of squares for error is calculated as:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

and is used in the calculation of the standard error of estimate:

sɛ = √( SSE / (n − 2) )

If sɛ is zero, all the points fall on the regression line.

22
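Both quantities can be sketched in a few lines of Python, reusing the Example 16.1 data from earlier in the chapter:

```python
import math

# Example 16.1 data: years of experience (x) vs. annual bonus in $1,000s (y)
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

# Least squares coefficients (b1 = Sxy/Sxx, b0 = y_bar - b1*x_bar)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# SSE: sum of squared differences between each point and the fitted line
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Standard error of estimate: s_eps = sqrt(SSE / (n - 2))
s_eps = math.sqrt(sse / (n - 2))

print(round(sse, 2), round(s_eps, 3))  # 81.1 and 4.503
```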
Standard Error of Estimate…
If sε is small, the fit is excellent and the linear model should be
used for forecasting. If sε is large, the model is poor…
But what is small and what is large?

23
Standard Error of Estimate…

24
Testing the Slope…
If no linear relationship exists between the two variables, we
would expect the regression line to be horizontal, that is, to
have a slope of zero.

We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes:
H 1 : β1 ≠ 0
Thus the null hypothesis becomes:
H 0 : β1 = 0

25
Testing the Slope…
We can use the following test statistic to test our hypotheses:

t = (b1 − β1) / sb1

where sb1, the standard error of b1, is defined as:

sb1 = sɛ / √Sxx

If the error variable (ɛ) is normally distributed, the test statistic


has a Student t-distribution with n–2 degrees of freedom. The
rejection region depends on whether or not we’re doing a one-
or two- tail test (two-tail test is most typical).

26
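A sketch of the test for the small Example 16.1 dataset (the 2.776 used in the note below is the two-tail critical value t.025 with n − 2 = 4 degrees of freedom, from a standard t table):

```python
import math

# Example 16.1 data
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of estimate, then standard error of the slope
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = math.sqrt(sse / (n - 2))
s_b1 = s_eps / math.sqrt(s_xx)

# Test statistic for H0: beta1 = 0 vs. H1: beta1 != 0
t = (b1 - 0) / s_b1
print(round(t, 3))  # 1.964
```

Since |1.964| < t.025,4 = 2.776, this tiny sample cannot reject H0 at the 5% significance level, even though the fitted slope is positive.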
Example 16.4…
Test to determine if there is a linear relationship between the price
& odometer readings… (at 5% significance level)

We want to test:
H0: β1 = 0
H 1 : β1 ≠ 0
(if the null hypothesis is true, no linear relationship exists)
The rejection region is: t < −t.025,98 = −1.984 or t > t.025,98 = 1.984

27
Example 16.4… COMPUTE

We can compute t manually or refer to our Excel output…

We see that the t statistic for “odometer” (i.e. the slope, b1) is –13.44, which falls in the rejection region (|t| = 13.44 > tCritical = 1.984). We also note that the p-value is approximately 0.

There is overwhelming evidence to infer that a linear relationship between odometer reading and price exists.

28
Testing the Slope…
If we wish to test for positive or negative linear relationships we conduct one-tail tests, i.e. our research hypothesis becomes:
H1: β1 < 0 (testing for a negative slope)
or
H1: β1 > 0 (testing for a positive slope)

Of course, the null hypothesis remains: H0: β1 = 0.

29
Coefficient of Determination…
Tests thus far have shown if a linear relationship exists; it is
also useful to measure the strength of the relationship. This is
done by calculating the coefficient of determination (R2).

The coefficient of determination is the square of the coefficient of correlation (r), hence R2 = (r)2.

30
Coefficient of Determination…
As we did with analysis of variance, we can partition the variation
in y into two parts:

Variation in y = SSE + SSR

SSE – Sum of Squares Error – measures the amount of variation in y that remains unexplained (i.e. due to error).
SSR – Sum of Squares Regression – measures the amount of variation in y explained by variation in the independent variable x.

31
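The partition can be verified with the sums of squares from the Example 16.2 Excel output (SSR = 19.26, SSE = 10.45; the output’s total of 29.70 differs slightly because its components were rounded):

```python
# Sums of squares from the Example 16.2 ANOVA table
ssr = 19.26  # variation in y explained by the regression
sse = 10.45  # variation in y left unexplained (error)

ss_total = ssr + sse        # variation in y = SSE + SSR
r_squared = ssr / ss_total  # coefficient of determination

print(round(ss_total, 2), round(r_squared, 4))  # 29.71 0.6483
```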
Coefficient of Determination COMPUTE

We can obtain this with Excel…

32
Coefficient of Determination INTERPRET

R2 has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by the variation in the odometer readings (x). The remaining 35.17% is unexplained, i.e. due to error.

Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.

In general, the higher the value of R2, the better the model fits the data.
R2 = 1: Perfect match between the line and the data points.
R2 = 0: There is no linear relationship between x and y.

33
More on Excel’s Output…
An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source of Variation   degrees of freedom   Sums of Squares   Mean Squares        F-Statistic
Regression            1                    SSR               MSR = SSR/1         F = MSR/MSE
Error                 n – 2                SSE               MSE = SSE/(n – 2)
Total                 n – 1                Variation in y

34
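Using the Example 16.2 sums of squares, the table above gives the F statistic directly (the small difference from Excel’s 180.64 is again rounding in the inputs):

```python
# ANOVA F statistic for simple linear regression (df = 1 and n - 2)
n = 100
ssr, sse = 19.26, 10.45  # from the Example 16.2 output

msr = ssr / 1          # mean square for regression
mse = sse / (n - 2)    # mean square for error
f_stat = msr / mse

print(round(f_stat, 1))  # 180.6
```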
Chapter 17
Multiple Regression

• Model and Required Conditions

• Estimating the Coefficients and Assessing the Model

35
Multiple Regression…
The simple linear regression model was used to analyze how one interval variable (the dependent variable y) is related to one other interval variable (the independent variable x).

Multiple regression allows for any number of independent variables.

We expect to develop models that fit the data better than would
a simple linear regression model.

36
The Model…
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first order linear equation:

y = β0 + β1x1 + β2x2 + … + βkxk + ɛ

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ɛ is the error variable.

In the one variable, two dimensional case we drew a regression line; here we imagine a response surface.

37
Required Conditions…
For these regression methods to be valid the following four
conditions for the error variable (ɛ) must be met:
• The probability distribution of the error variable (ɛ) is normal.
• The mean of the error variable is 0.
• The standard deviation of ɛ is σɛ , which is a constant.
• The errors are independent.

38
Estimating the Coefficients…

The sample regression equation is expressed as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

We will use computer output to:


Assess the model…
How well it fits the data
Is it useful
Employ the model…
Interpreting the coefficients

39
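A minimal sketch of how these coefficients are estimated by least squares, using numpy on a small made-up dataset (the data below are hypothetical, generated without an error term so the fit is exact; they are not from the GSS example):

```python
import numpy as np

# Hypothetical data generated exactly as y = 2 + 3*x1 - 1*x2 (no error term)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2 + 3 * x1 - 1 * x2

# Design matrix: a column of ones for b0, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: minimizes the sum of squared residuals
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # recovers b0, b1, b2 = 2, 3, -1 (up to floating point)
```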
Regression Analysis Steps…
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.

2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.

3. Assess the model’s fit:
• standard error of estimate,
• coefficient of determination,
• F-test of the analysis of variance.

4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.

40
Example 17.1
GENERAL SOCIAL SURVEY: VARIABLES THAT
AFFECT INCOME

In the Chapter 16 opening example we showed, using the General Social Survey, that income and education are linearly related. This raises the question, what other variables affect one’s income? To answer this question we need to expand the simple linear regression technique used in the previous chapter to allow for more than one independent variable. [Xm17-00]

41
Example 17.1
Here is a list of all the interval variables the General Social Survey
created.

Age (AGE)

Years of education of respondent, spouse, father and mother (EDUC, SPEDUC, PAEDUC, MAEDUC)

Number of family members earning money (EARNRS)

Hours of work per week of respondent and of spouse (HRS and SPHRS)
Number of children (CHILDS)

Age when first child was born (AGEKDBRN)

42
Example 17.1
Score on question, Should government reduce
income differences between rich and poor?
(EQWLTH)
Score on question, Should government improve
standard of living of poor people? (HELPPOOR)
Score on question, Should government do more or
less to solve country’s problems? (HELPNOT)
Score on question, Is it government’s responsibility
to help pay for doctor and hospital bills?
(HELPSICK)

43
Example 17.1
Here are the available variables from the General Social Survey of 2012
and the reason why we have selected each one:

Age (AGE): For most people, income increases with age.

Years of education (EDUC): We’ve already shown (Chapter 16 opening example) that education is linearly related to income.

Hours of work per week (HRS1): Obviously, more hours of work should produce more income.

44
Example 17.1
Spouse’s hours of work (SPHRS1): It is possible that, if one’s spouse works more and earns more, the other spouse may choose to work less and thus earn less.

Number of family members earning money (EARNRS): As is the case with SPHRS1, if more family members earn income there may be less pressure on the respondent to work harder.

Number of children (CHILDS): Children are expensive, which may encourage their parents to work harder and thus earn more.

45
Example 17.1
Step 2: Use a computer to compute all the coefficients and other
statistics

46
Model Assessment…
We will assess the model in three ways:

Standard error of estimate,


Coefficient of determination, and
F-test of the analysis of variance.

47
Standard Error of Estimate…
In multiple regression, the standard error of estimate is defined as:

sɛ = √( SSE / (n − k − 1) )

where n is the sample size and k is the number of independent variables in the model.

Standard Error = 35,841

It seems the standard error of estimate is not particularly small. What can we conclude?

48
Coefficient of Determination…

49
Testing the Validity of the Model…

In a multiple regression model (i.e. more than one independent variable), we utilize an analysis of variance technique to test the overall validity of the model. Here’s the idea:

H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.

If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid.
If at least one βi is not equal to 0, the model does have some validity.

50
Testing the Validity of the Model…
ANOVA table for regression analysis…
Source of Variation   degrees of freedom   Sums of Squares   Mean Squares             F-Statistic
Regression            k                    SSR               MSR = SSR/k              F = MSR/MSE
Error                 n – k – 1            SSE               MSE = SSE/(n – k – 1)
Total                 n – 1

A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid. A small value of F indicates that most of the variation in y is unexplained.
51
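The table translates into a small helper function; as a sanity check, plugging in the simple-regression numbers from Chapter 16’s Example 16.2 (k = 1, n = 100, SSR = 19.26, SSE = 10.45) reproduces the F statistic of about 180.6:

```python
def regression_f(ssr, sse, k, n):
    """F statistic for H0: beta1 = beta2 = ... = betak = 0."""
    msr = ssr / k            # mean square for regression
    mse = sse / (n - k - 1)  # mean square for error
    return msr / mse

# With k = 1 this reduces to the simple-regression ANOVA (df = 1 and n - 2)
print(round(regression_f(19.26, 10.45, k=1, n=100), 1))  # 180.6
```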
Testing the Validity of the Model…

Our rejection region is:

F > Fα,k,n-k-1 = F.05,6,439 ≈ 2.10

Excel calculated the F statistic as F = 38.64, with a p-value ≈ 0. Hence, we reject H0 in favor of H1, that is:

“there is a great deal of evidence to infer that the model is valid”

52
Table 17.2… Summary

SSE      sɛ       R2            F        Assessment of Model
0        0        1                      Perfect
small    small    close to 1    large    Good
large    large    close to 0    small    Poor
                  0             0        Invalid

Once we’re satisfied that the model fits the data as well as possible, and that the required conditions are satisfied, we can interpret and test the individual coefficients and use the model to predict and estimate…
53
Interpreting the Coefficients*
Intercept
The intercept is b0 = −108,240. This is the average income when all
the independent variables are zero. As we observed in Chapter
16, it is often misleading to try to interpret this value,
particularly if 0 is outside the range of the values of the
independent variables (as is the case here).
Age
The relationship between income and age is described by b1 = 974.
From this number we learn that in this model, for each additional
year of age, income increases on average by $974, assuming that
the other independent variables in this model are held constant.

*in each case we assume all other variables are held constant…
54
Interpreting the Coefficients*
Education
The coefficient b2 = 5,680 specifies that in this sample for each
additional year of education the income increases on
average by $5,680, assuming the constancy of the other
independent variables.

Hours of work
The relationship between annual income and hours of work per week is expressed by b3 = 1,091. We interpret this number as the average increase in annual income for each additional hour of work per week, keeping the other independent variables fixed in this sample.

*in each case we assume all other variables are held constant…

55
Interpreting the Coefficients*
Spouse’s hours of work
The relationship between annual income and a spouse’s hours of work per week is described in this sample by b4 = −250.90, which we interpret to mean that for each additional hour a spouse works per week, income decreases on average by $250.90 when the other variables are constant.
Number of family members earning income
In this dataset the relationship between annual income and the number of family members who earn money is expressed by b5 = 958.5, which tells us that for each additional family member earning money, annual income increases on average by $958.50, assuming that the other independent variables are constant.

*in each case we assume all other variables are held constant…
56
Interpreting the Coefficients*
Number of children
The relationship between annual income and number
of children is expressed by b6 = −1,765, which
tells us that in this sample for each additional
child annual income decreases on average by
$1,765

*in each case we assume all other variables are held constant…

57
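Putting the estimated coefficients together gives the fitted equation; the sketch below evaluates it for a hypothetical respondent (the profile values are made up for illustration, not taken from the data):

```python
# Estimated coefficients from Example 17.1 (annual income regression)
b0 = -108240  # intercept
b = {
    "AGE":    974,      # per additional year of age
    "EDUC":   5680,     # per additional year of education
    "HRS1":   1091,     # per additional hour of work per week
    "SPHRS1": -250.90,  # per additional hour the spouse works per week
    "EARNRS": 958.5,    # per additional family member earning money
    "CHILDS": -1765,    # per additional child
}

# Hypothetical respondent: 40 years old, 16 years of education, works 40
# hours/week, spouse not working, 1 earner in the family, 2 children
respondent = {"AGE": 40, "EDUC": 16, "HRS1": 40,
              "SPHRS1": 0, "EARNRS": 1, "CHILDS": 2}

predicted_income = b0 + sum(b[v] * respondent[v] for v in b)
print(predicted_income)  # 62668.5
```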
Testing the Coefficients…
For each independent variable, we can test to determine whether
there is enough evidence of a linear relationship between it and the
dependent variable for the entire population…
H0: βi = 0
H1: βi ≠ 0
(for i = 1, 2, …, k) and using:

t = (bi − βi) / sbi

as our test statistic (with n–k–1 degrees of freedom).

58
Testing the Coefficients
Test of β1 (Coefficient of age)
Value of the test statistic: t = 6.28; p-value = 0

Test of β2 (Coefficient of education)


Value of the test statistic: t = 9.35; p-value = 0

Test of β3 (Coefficient of number of hours of work per week)


Value of the test statistic: t = 9.32; p-value = 0

Test of β4 (Coefficient of spouse’s number of hours of work per week)


Value of the test statistic: t = −1.97; p-value = .0496

59
Testing the Coefficients
Test of β5 (Coefficient of number of earners in family)
Value of the test statistic: t = .31; p-value = .7531

Test of β6 (Coefficient of number of children)


Value of the test statistic: t = −1.28; p-value = .2029

60
Testing the Coefficients INTERPRET
There is sufficient evidence at the 5% significance
level to infer that each of the following variables
is linearly related to income

Age
Education
Number of hours of work per week
Spouse’s hours of work

61
Testing the Coefficients INTERPRET

In this model there is not enough evidence to conclude that each of the following variables is linearly related to income:
Number of children
Number of earners in the family

Note that this may mean that there is no evidence of a linear relationship between these two independent variables and income. However, it may also mean that there is a linear relationship between the two variables and income, but because of a condition called multicollinearity, the t-tests revealed no linear relationship.

62
