Delmar's Administrative Medical Assisting, 5th Edition eBook and Test Bank Bundle

The document provides information about the availability of various educational resources, including the Test Bank for Delmar's Administrative Medical Assisting, 5th Edition, and other related ebooks and test banks. It also includes a detailed explanation of Analysis of Variance (ANOVA) and the F-statistic in the context of regression analysis, discussing how to assess the significance of regression models. Additionally, it highlights potential pitfalls in regression analysis, such as extrapolation and the influence of outliers.

Test Bank for Delmar's Administrative Medical Assisting, 5th Edition (eBook and Test Bank bundle download):

https://ebookdownload.blog/product/delmars-administrative-medical-assisting-5th-edition-ebook-and-testbank-bundle/

You will receive this product immediately after placing the order: Delmar's Administrative Medical Assisting, 5th Edition.
ANOVA and the F-Statistic
The key question in simple regression is whether the slope is zero, and we developed a hypothesis test for β₁ = 0. If the null hypothesis is rejected, it means that the predictor variable has a significant (i.e., real) relationship with the outcome variable. But this approach does not generalize perfectly in multiple regression, and so we will need another approach, called Analysis of Variance (ANOVA) for Regression. Since it is easier to understand in the simple regression situation, we introduce it here and then use it to fuller advantage in Chapter 15.
To start with, you can think of this as another test for β₁. But instead of stating the test in terms of the parameter, we will restate it in terms of the model, in a very general way: Is the regression worthwhile? In other words, does the predictor variable contain useful information about the response variable?
Analysis of Variance for Regression
To answer this more general question, we will divide up and explain the sources of variation in the response variable. And that's why this approach is called "analysis of variance." It really should be called "analysis of variation," but nobody asked us!
Consider what variation means here. Ask yourself the question, "Why don't all the y-values equal the mean value of y?" The answer is twofold. First, some y-values have different x-values; and second, even if two y-values have the same x-value, there may be other variables, or simply random error, that explain the difference.
Figure 14.26 The point (x, y) is a typical point on the scatterplot. The vertical line from the point to the lowest horizontal line is the deviation, the distance from the actual y to the mean ȳ. The height represented by the bracket on the left side of the vertical line is the sum of the heights represented by the two brackets on the right side of the vertical line. This demonstrates how the deviation is partitioned into two parts.
A residual, e, is the difference between an actual y and its predicted value, ŷ. The deviation of y from the mean can be written as the sum of two parts: y − ȳ = (ŷ − ȳ) + (y − ŷ). We have "partitioned" the deviation to correspond to the twofold answer above.
If we define "variation" as the square of a deviation (the distance from a data value to its mean), then the total variation of y is defined as the sum of the squared deviations, Σ(y − ȳ)². But this is also equal to (n − 1)s_y², so variation and variance are closely related; variation is (n − 1) multiplied by variance.
Summing and squaring the left-hand side of the equation y − ȳ = (ŷ − ȳ) + (y − ŷ) gives the total variation Σ(y − ȳ)². Let's sum and square the right-hand side, using the algebraic expansion (a + b)² = a² + b² + 2ab that you learned long ago.
A remarkable thing happens. The right-hand side becomes Σ(ŷ − ȳ)² + Σ(y − ŷ)². What happened to the cross-product? It sums to zero. That's one of the great consequences of the decision we made long ago to define variance using squared differences from the mean rather than the absolute value of differences from the mean. (It's a tedious calculation, but not difficult; we won't bore you with it here.)
So we have an elegant partition of variation, Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)², which we can express in words and acronyms as
Sum of Squares Total (SST) = Sum of Squares Model (SSM)
+ Sum of Squares Error (SSE)
In many books and software (including Excel), the words "Model" and "Error" are replaced by Regression and Residual, but then we can't distinguish between SSR (for Regression) and SSR (for Residual). So we prefer SSM and SSE.
Each of the sums of squares is associated with a number of degrees of freedom. If you notice that SST = Σ(y − ȳ)² = (n − 1)s_y² and SSE = Σ(y − ŷ)² = Σe² = (n − 2)s_e², you might have an idea of what the degrees of freedom are: (n − 1) for SST and (n − 2) for SSE. What about SSM? That's the easy part: it's the number of predictor variables, which, in a simple regression, is 1.
When we calculate the variance of sample data, we divide the total sum of squares by its degrees of freedom, (n − 1), giving the mean sum of squares. We can do the same thing for the other sums of squares, and we get

Mean Square Model, MSM = SSM/1

Mean Square Error, MSE = SSE/(n − 2)
We don't bother defining the Mean Square Total, MST, since we already have a name for it (the variance of y, s_y²) and since we won't need it for the rest of the ANOVA table.
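The sums of squares and mean squares are easy to compute directly. Here is a minimal Python sketch (the arrays x and y are hypothetical, not from the text) of the partition SST = SSM + SSE and the two mean squares:

```python
import numpy as np

# Hypothetical data (not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(y)

# Least-squares line: y-hat = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)      # total variation, (n - 1) times the variance of y
SSM = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
SSE = np.sum((y - y_hat) ** 2)         # unexplained (residual) variation

MSM = SSM / 1                          # SSM has 1 degree of freedom in simple regression
MSE = SSE / (n - 2)                    # SSE has n - 2 degrees of freedom

print(SST, SSM + SSE)                  # the partition: SST = SSM + SSE (up to rounding)
print(MSM, MSE)
```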
There's one more thing to compute, something called the F-statistic. More about that in a moment. But first, let's summarize all the previous calculations in an Analysis of Variance table. The underlined letters lead to the convenient and clever acronym ANOVA, coined by John Tukey of stem-and-leaf display and boxplot fame, among other things. Remember that he also coined the term "software."
ANOVA Table

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic
Model | SSM | 1 | MSM = SSM/1 | F = MSM/MSE
Error | SSE | n − 2 | MSE = SSE/(n − 2) |
Total | SST | n − 1 | |
The ratio of mean squares in the last column of the ANOVA table has an F-distribution. It is called F in honour of Sir Ronald Fisher, possibly the greatest statistician of the first half of the twentieth century.

F = MSM/MSE = (Explained mean square)/(Unexplained mean square)
The F-distribution is skewed (like the chi-square distribution), is anchored
at 0 on the left end, and has two values of degrees of freedom, one corresponding
to the numerator and the other corresponding to the denominator of the ratio
that makes up the F-statistic.
The F-statistic is the ratio of explained and unexplained mean squares. So
when we see a high value for the F-statistic, we know that a lot of the variability
in the original data has been explained by the regression. In ANOVA tables from
software, a P-value is also given, indicating whether the F-statistic is high enough
to imply that the regression is significant overall. In other words, is the regression
worthwhile?
The null hypothesis is that the regression model predicts no better than the
mean; that is, the model is not worthwhile. The alternative is that it does; that is,
the model is worthwhile. If the null hypothesis were true, the F-statistic would
be near 1.
The F-Test for Regression
When the conditions are met, we can test the hypotheses H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0 using a test statistic called the F-test for regression, F = MSM/MSE, which follows an F-model with 1 and n − 2 degrees of freedom. We can use the F-model to find the P-value of the test.
So now we have two test statistics to test whether β₁ = 0. It shouldn't surprise you that they have a close relationship. When the numerator degrees of freedom is 1, F = t², so the t-statistic for testing β₁ = 0 is the square root of the F-statistic for testing β₁ = 0.
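As a quick illustration of this relationship (not part of the original text), the following sketch uses scipy to confirm that, with 1 numerator degree of freedom, the F-model is the square of the t-model, so the critical values and P-values of the two tests agree. The t-ratio used here is hypothetical:

```python
from scipy import stats

df_error = 57                             # error degrees of freedom (57, as in the ANOVA table below)

# With 1 numerator df, the F critical value is the square of the two-sided t critical value.
t_crit = stats.t.ppf(0.975, df_error)
f_crit = stats.f.ppf(0.95, 1, df_error)
print(t_crit ** 2, f_crit)                # these agree

# The two P-values match as well: a two-sided t-test equals the F-test with F = t^2.
t_stat = 2.5                              # a hypothetical t-ratio for a slope
p_from_t = 2 * stats.t.sf(abs(t_stat), df_error)
p_from_f = stats.f.sf(t_stat ** 2, 1, df_error)
print(p_from_t, p_from_f)
```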
R² revisited
In Chapter 4 we learned that R² is a measure of how much variation in our data is explained by the regression model, and that it is the square of the correlation, r. In other words, r² = R²! There is another way to compute it:

R² = SSM/SST = (explained variation)/(total variation)

In other words, using an F-test to see whether the model is worthwhile is the same as testing whether the R² is different from zero. Rejecting either version of the hypothesis means the predictor accounts for enough variation in y to distinguish it from noise. It is a "worthwhile predictor."
Let's return to the Canada Goose case we discussed earlier. In the Guided Example the computer output for the regression was displayed as follows:

[Regression output table: Predictor | Coefficient | SE(Coeff) | t-ratio | P-value]
Here is the ANOVA table for the same data.
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic
Model | 328 359.97 | 1 | 328 359.97 | 310.11
Error | 60 354.45 | 57 | 1058.85 |
Total | 388 714.42 | 58 | |
Notice three equivalences:
1. MSE = 1058.85 = (32.54)² = s_e²
2. SSM/SST = 328 359.97/388 714.42 = 0.845 = R²
3. F-statistic = 310.11 = (17.61)² = (t-ratio for slope)²
From software, the P-value for the F-statistic (with 1 and 57 degrees of freedom) is <0.0001, which must be the same as the P-value for the t-test of slope.
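A short sketch (not part of the original example) that checks these three equivalences numerically from the table values:

```python
# Values read from the ANOVA table above
SSM, SSE, SST = 328_359.97, 60_354.45, 388_714.42
df_model, df_error = 1, 57

MSE = SSE / df_error
F = (SSM / df_model) / MSE

print(MSE, 32.54 ** 2)        # equivalence 1: MSE is (32.54)^2, the square of s_e
print(SSM / SST)              # equivalence 2: about 0.845, which is R^2
print(F, 17.61 ** 2)          # equivalence 3: about 310, the square of the slope's t-ratio
```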
WHAT CAN GO WRONG?
With inference, we've put numbers on our estimates and predictions, but these numbers are only as good as the model. Here are the main things to watch out for:
• Don't fit a linear regression to data that aren't straight. This is the most fundamental assumption. If the relationship between x and y isn't approximately linear, there's no sense in fitting a straight line to it.
• Watch out for a changing spread. The common part of confidence and prediction intervals is the estimate of the error standard deviation, the spread around the line. If it changes with x, the estimate won't make sense. Imagine making a prediction interval for these data:
[Scatterplot: the spread of the y-values about the regression line increases as x increases]
When x is small, we can predict y precisely, but as x gets larger, it's much harder to pin y down. Unfortunately, if the spread changes, the single value of s_e won't pick that up. The prediction interval will use the average spread around the line, with the result that we'll be too pessimistic for small x-values and too optimistic for large ones.
• Beware of extrapolating. Beware of extrapolation beyond the x-values that were used to fit the model. Although it's common to use linear models to extrapolate, be cautious.
• Beware of extrapolating far into the future. Be especially cautious about extrapolating far into the future with linear models, especially when the x-variable is time. A linear model assumes that changes over time will continue forever at the same rate you've observed in the past. Predicting the future is particularly tempting and particularly dangerous.
• Look for unusual points. Unusual points always deserve attention and may well reveal more about your data than the rest of the points combined. Always look for them and try to understand why they stand apart. Making a scatterplot of the data is a good way to reveal high-leverage and influential points. A scatterplot of the residuals against the predicted values is a good tool for finding points with large residuals.
• Beware of high-leverage points, especially those that are influential. Influential points can alter the regression model a great deal. The resulting model may say more about one or two points than about the overall relationship.
• Consider setting aside outliers and re-running the regression. To see the impact of outliers on a regression, try running two regressions, one with and one without the extraordinary points, and then discuss the differences.
• Treat unusual points honestly. If you remove enough carefully selected points, you will eventually get a regression with a high R². But it won't get you very far. Some data are not simple enough for a linear model to fit very well. When that happens, report the failure and stop.
• Significant correlation does not necessarily mean strong correlation. A statistically significant correlation means the correlation is not zero. It doesn't say anything about the strength of the association.
• Don't confuse sums of squares and mean squares. The F-statistic is the ratio of mean squares (MSM to MSE), but R² is the ratio of sums of squares (SSM to SST). The numerators both come from the Model line of the ANOVA table, but the denominators are from different lines.
ETHICS IN ACTION
The need for elder care businesses that offer companionship and nonmedical home services is increasing as the Canadian population continues to age. One such ... opening an elder care business in their area. Allen was contacted recently by Kyle Sennefeld, a recent business school graduate with a minor in gerontology, who is interested ...
ETHICAL ISSUE The regression model has a small R², so its predictive ability is questionable. Related to ASA Ethical Guidelines, Items A and B, which can be found at http://www.amstat.org/about/ethicalguidelines.cfm.
ETHICAL SOLUTION Disclose the value of R² along with the prediction results and disclose if the regression is being used to extrapolate outside the range of x-values. Allen should provide a prediction interval as well as an estimate of the profit. Because Kyle will be assessing his franchise's chances for profit from this interval, Allen should make sure it is a prediction interval and not a confidence interval for the mean profit at all similar locations.
WHAT HAVE WE LEARNED?
In this chapter, we have extended our study of inference methods by applying them to regression models. We've found that the same methods we used for means, Student's t-models, work for regression in much the same way they did for means. And we've seen that although this makes the mechanics familiar, we need to check new conditions and be careful when describing the hypotheses we test and the confidence intervals we construct.
• We've learned that under certain assumptions, the sampling distribution for the slope of a regression line can be modelled by a Student's t-model with n − 2 degrees of freedom.
• We've learned to check four conditions before we proceed with inference. We've learned the importance of checking these conditions in order, and we've seen that most of the checks can be made by graphing the data and the residuals.
• We've learned to use the appropriate t-model to test a hypothesis about the slope. If the slope of our regression line is significantly different from zero, we have strong evidence that there is an association between the two variables.
• We've learned to use an alternative method, the F-model and ANOVA, for testing a hypothesis about slope. The F-test examines whether the model is worthwhile, and will be used for the more general situation of multiple regression in the next chapter. We've also seen another derivation of R², based on the ANOVA table calculations.
• We've learned to use the t-model to test for correlation when a linear regression is not available. It is equivalent to the t-model for slope, and a significant test means there is a linear association between the two variables.
We've also seen that there are many ways in which a data set may be unsuitable for a regression analysis:
• The Linearity Condition says that the relationship should be reasonably straight to fit a regression. Paradoxically, it may be easier to see that the relationship is not straight after you fit the regression and examine the residuals.
• The Outlier Condition refers to two ways in which cases can be extraordinary. They can have large residuals or high leverage (or, of course, both). Cases with either kind of extraordinary behaviour can influence the regression model significantly.
Terms

Analysis of Variance (ANOVA) for regression: The total variation of the y-values from their mean is partitioned into sums of squares that represent explained and unexplained variation. Adjusting the sums of squares by their degrees of freedom gives the mean squares of the ANOVA table.

Confidence interval for the mean response: An interval of the form μ̂_ν ± t* × SE(μ̂_ν), where SE(μ̂_ν) = √( SE²(b₁) × (x_ν − x̄)² + s_e²/n ). The critical value t* depends on the specified confidence level and the Student's t-model with n − 2 degrees of freedom.

Extrapolation: Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation. Be cautious when extrapolating.

F-test for regression: An alternative test of the null hypothesis that the slope is zero; it can also be thought of as a test of whether the regression model is worthwhile. It is based on a ratio of mean squares. To test H₀: β₁ = 0, use F = MSM/MSE, where MSM and MSE come from the ANOVA table. We find the P-value from the F-model with 1 and n − 2 degrees of freedom.

Influential: If omitting a point from the data changes the regression model substantially, that point is considered influential.

Leverage: Data points whose x-values are far from the mean of x are said to exert leverage on a linear model. High-leverage points pull the line close to them, so they can have a large effect on the line, sometimes completely determining the slope and intercept. Points with high enough leverage can have deceptively small residuals.

Outlier: Any data point that stands away from the regression line by having a large residual is called an outlier.

Prediction interval for a future observation: A confidence interval for individual values. Prediction intervals are to observations as confidence intervals are to parameters. They predict the distribution of individual values, while confidence intervals specify likely values for a true parameter. The prediction interval takes the form ŷ_ν ± t* × SE(ŷ_ν), where SE(ŷ_ν) = √( SE²(b₁) × (x_ν − x̄)² + s_e²/n + s_e² ). The critical value t* depends on the specified confidence level and the Student's t-model with n − 2 degrees of freedom. The extra s_e² in SE(ŷ_ν) makes the interval wider than the corresponding confidence interval for the mean.

Residual standard deviation: The measure, denoted s_e, of the spread of the data around the regression line: s_e = √( Σ(y − ŷ)² / (n − 2) ) = √( Σe² / (n − 2) ).

t-test for the regression slope: When the assumptions and conditions are met, the hypothesis H₀: β₁ = 0 is tested with t = b₁/SE(b₁), which follows a Student's t-model with n − 2 degrees of freedom.
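To make these formulas concrete, here is a minimal Python sketch, using hypothetical data, of the slope t-test, the confidence interval for the mean response, and the prediction interval for a new observation:

```python
import numpy as np
from scipy import stats

# Hypothetical data (not from the text)
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0])
y = np.array([8.1, 9.9, 12.2, 13.8, 16.3, 17.9, 20.2])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))          # residual standard deviation
se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope

t_stat = b1 / se_b1                                  # t-test for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)

x_new = 10.0                                         # a new x-value of interest
y_hat_new = b0 + b1 * x_new
t_star = stats.t.ppf(0.975, n - 2)                   # 95% critical value

se_mean = np.sqrt(se_b1 ** 2 * (x_new - x.mean()) ** 2 + s_e ** 2 / n)
se_pred = np.sqrt(se_b1 ** 2 * (x_new - x.mean()) ** 2 + s_e ** 2 / n + s_e ** 2)

ci = (y_hat_new - t_star * se_mean, y_hat_new + t_star * se_mean)   # mean response
pi = (y_hat_new - t_star * se_pred, y_hat_new + t_star * se_pred)   # single new observation
print(t_stat, p_value, ci, pi)
```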
Skills

PLAN
• Know how to examine displays of the residuals from a regression to double-check that the conditions required for regression have been met. In particular, know how to judge linearity and constant variance from a scatterplot of residuals against predicted values. Know how to judge Normality from a histogram and Normal probability plot.
• Remember to be especially careful to check for failures of the Independence Assumption when working with data recorded over time. To search for patterns, examine scatterplots both of x against time and of the residuals against time.
• Know the danger of extrapolating beyond the range of the x-values used to find the linear model, especially when the extrapolation tries to predict into the future.
• Understand that points can be unusual by having a large residual or by having high leverage.
• Understand that an influential point can change the slope and intercept of the regression line.
• Know how to test the standard hypothesis that the true regression slope is zero. Be able to state the null and alternative hypotheses. Know where to find the relevant numbers in standard computer regression output.
• Be able to find a confidence interval for the slope of a regression based on the values reported in a standard regression output table.
• Know how to look for high-leverage and influential points by examining a scatterplot of the data. Know how to look for points with large residuals by examining a scatterplot of the residuals against the predicted values or against the x-variable. Understand how fitting a regression line with and without influential points can add to understanding of the regression model.
• Know how to look for high-leverage points by examining the distribution of the x-values or by recognizing them.
• Be able to summarize a regression in words. In particular, be able to state the meaning of the true regression slope, the standard error of the estimated slope, and the standard deviation of the errors.
REPORT
• Be able to interpret the P-value of the t-statistic for the slope to test the standard null hypothesis.
• Be able to interpret a confidence interval for the slope of a regression.
• Include diagnostic information such as plots of residuals and leverages as part of your report of a regression.
• Report any high-leverage points.
• Report any outliers. Consider reporting analyses with and without outliers included to assess their influence on the regression.
• Include appropriate cautions about extrapolation when reporting predictions from a linear model.
• Be able to test for zero correlation.
• Be able to interpret an ANOVA table and F-statistic in simple linear regression.
TECHNOLOGY HELP: REGRESSION ANALYSIS
All statistics packages make a table of results for a regression. These tables differ slightly from one package to another, but all are essentially the same. Regressions are almost always found with a computer or calculator. You should, of course, always look at the scatterplot of your two variables before computing a regression. All packages offer analyses of the residuals. With some, ...
[Excel screenshot: scatterplot with regression line, a TIME Line Fit Plot, and SUMMARY OUTPUT showing Multiple R = 0.9191, R Square = 0.8447, Adjusted R Square = 0.8420, Standard Error = 20.501]
Comments
The Y and X ranges do not need to be in the same rows of the spreadsheet, although they must cover the same number of cells. But it is a good idea to arrange your data in parallel columns as in a data table to reduce the chance of error. Although the dialogue offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don't use this option.
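Excel is only one option. As a point of comparison, here is a hedged sketch of producing the same kind of regression table with Python's statsmodels; the DataFrame, its column names, and its values are hypothetical stand-ins for your own data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; in practice, read your own spreadsheet, e.g. pd.read_excel(...)
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2.3, 2.8, 3.9, 4.4, 5.2, 5.8, 7.1, 7.6],
})

model = smf.ols("y ~ x", data=data).fit()
print(model.summary())      # coefficients, SE(Coeff), t-ratios, P-values, R-squared, F-statistic
print(model.resid.head())   # residuals, for the diagnostic plots discussed above
```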
Frozen Pizza
The product manager at a major food distributor is interested in learning how sensitive sales are to changes in the unit price of a frozen pizza in Winnipeg, Edmonton, Toronto, and Calgary. The product manager has been provided data on both Price and Sales volume every fourth week over a period of nearly four years for the four cities (ch14_MCSP_Frozen_Pizza.xlsx).
Examine the relationship between Price and Sales for each city. Be sure to discuss the nature and validity of this relationship. Is it linear? Is it negative? Is it significant? Are the conditions of regression met? Some individuals in the product manager's division suspect that frozen pizza sales are more sensitive to price in some cities than in others. Is there any evidence to suggest that? Write up a short report on what you find. Include 95% confidence intervals for the mean Sales if the Price is $2.50 and discuss how that interval changes if the Price is $3.50.
Global Warming?
Every spring, Nenana, Alaska, hosts a contest in which participants try to guess
the exact minute that a wooden tripod placed on the frozen Tanana River will
fall through the breaking ice. The contest started in 1917 as a diversion for rail-
road engineers, with a jackpot of $800 for the closest guess. It has grown into
an event in which hundreds of thousands of entrants enter their guesses on the
internet and vie for more than $300 000.
Because so much money and interest depends on the time of the ice breakup,
it has been recorded to the nearest minute with great accuracy ever since 1917
(ch14_MCSP_Global_Warming_2014.xlsx). And because a standard mea-
sure of breakup has been used throughout this time, the data are consistent. An
article in Science ("Climate Change in Nontraditional Data Sets," Science 294, October 2001) used the data to investigate global warming. Researchers are
interested in the following questions: What is the rate of change in the date of
breakup over time (if any)? If the ice is breaking up earlier, what is your con-
clusion? Does this necessarily suggest global warming? What could be other
reasons for this trend? What is the predicted breakup date for the year 2020?
(Be sure to include an appropriate prediction or confidence interval.) Write up
a short report with your answers.
MyLab Business Statistics
Data sets for exercises marked @ are available on MyLab Business Statistics.

EXERCISES
@ 1. Marriage age 2008. Weddings are one of the fastest
growing businesses; about $4 billion is spent on weddings
in Canada each year. But demographics may be changing,
and this could affect wedding retailers· marketing plans. Is
there evidence that the age at which women get married
has changed over the past 100 years? The graph shows
the trend in age at first marriage for Canadian women.
(www4.hrsdc.gc.ca) LOO
[Graph: Women's Age at First Marriage (years), roughly 20 to 30, plotted by year from 1920 to 2005]
a) Do you think there is a clear pattern? Describe the trend.
b) Is the association strong?
c) Is the correlation high? Explain.
d) Do you think a linear model is appropriate for these
data? Explain.
@ 2. Smoking 2011. The Canadian Tobacco Use Monitor-
ing Survey (CTUMS) was developed in 1999 by Health
Canada and Statistics Canada to provide timely, reliable,
and continual data on tobacco use and related issues.
Before 1999, smoking behaviour was assessed by Statistics
Canada·s General Social Survey and the National Popu-
lation Health Survey. CTUMS revealed that the overall
current smoking rate among Canadians aged 15 years and
older declined from 25% in 1999 to 17% in 2011. Among
20-24 year olds, the rate declined from 35% to 21%. How
has the percentage of 20-24 year olds who smoke changed
since the danger became clear during the last half of the
twentieth century? The following scatterplot shows per-
centages of smokers among all Canadians 20-24 years of
age, as estimated by surveys from 1965 to 2011 (http://www.hc-sc.gc.ca/hc-ps/tobac-tabac/research-recherche/stat/ctums-esutc_2011-eng.php). LO
b) If you fit a linear model to the data, what do you
think a scatterplot of residuals versus predicted HDI
will look like?
c) There is an outlier (Luxembourg) with a GDPPC of
around $70 000. Will setting this point aside improve the
model substantially? Explain.
@ 4. HDI, part 2. The United Nations Development Pro-
gramme (UNDP) uses the Human Development Index
(HDI) in an attempt to summarize in one number the
progress in health, education, and economics of a country.
The number of cellphone subscribers per 1000 people is
positively associated with economic progress in a country.
Can the number of cellphone subscribers be used to pre-
dict the HDI? Here is a scatterplot of HDI against cell-
phone subscribers: LO®
[Scatterplot: HDI (roughly 0.3 to 0.9) against cellphone subscribers per 1000 people]
@ 9. Movie budgets. How does the cost of a movie depend
on its length? Data on the cost (millions of dollars) and the
running time (minutes) for major release films of 2005 are
summarized in these plots and computer output: LO®
[Scatterplot: Budget ($M), roughly 0 to 160, against Run Time (minutes), roughly 90 to 180]
Dependent variable is: Budget($M)
R squared = 15.4%
s = 32.95 with 120 · 2 = 118 degrees of freedom
Variable
a) Explain in words and numbers what the regression says.
b) The intercept is negative. Discuss its value, taking note
of its P-value.
c) The output reports s = 53.79. Explain what that means
in this context.
d) What's the value of the standard error of the slope of the
regression line?
e) Explain what that means in this context.
@ 11. Movie budgets, part 2. Exercise 9 shows computer
output examining the association between the length of a
movie and its cost. LO®
a) Check the assumptions and conditions for inference.
b) Find a 95% confidence interval for the slope and
interpret it.
@ 12. House prices, part 2. Exercise 10 shows computer out-
put examining the association between the sizes of houses
and their sale prices. LO®
a) Check the assumptions and conditions for inference.
b) Find a 95% confidence interval for the slope and inter-
pret it.
@ 13. Water hardness. In an investigation of environmental
causes of disease, data were collected on the annual mor-
tality rate (deaths per 100 000) for males in 61 large towns
in England and Wales. In addition, the water hardness was
recorded as the calcium concentration (parts per million,
or ppm) in the drinking water. Here are the scatterplot and
regression analysis of the relationship between mortality
and calcium concentration, where the dependent variable
is Mortality. 0@, LO®
Variable | Coefficient | SE(Coeff)
Intercept | 1676.36 | 29.30
Calcium | −3.226 | 0.485
R squared = 42.9%
s = 143.0 with 61 − 2 = 59 degrees of freedom
[Scatterplot: Mortality (deaths per 100 000), roughly 1600 to 2000, against Calcium concentration (ppm)]
c) ·Test an appropriate hypothesis to determine if the asso-
ciation is statistically significant.
d) What percentage of the variability in the LFPR can be
accounted for by the regression model?
@ 16. Female labour force participation rate 2014. The International Labor Organization (ILO) reports the labour force participation rate (LFPR), the percentage of the relevant population who are either employed or actively seeking work, worldwide. The data file holds this data for the years 1990 to 2014 for men and women. LO
a) Find a regression model to describe any trend in the
LFPR for women over this time period. State in simple
language what the model says.
b) Test an appropriate hypothesis to determine if the asso-
ciation is statistically significant.
c) What percentage of the variability in the LFPR can be
accounted for by the regression model?
d) Examine the residuals to determine if a linear regres-
sion is appropriate. Make additional plots if necessary and
describe what you find.
17. Unusual points. Each of the four scatterplots a·d that fol-
low shows a cluster of points and one ·stray· point. For
each, answer questions 1-4: LOG
1) In what way is the point unusual? Does it have high
leverage, a large residual, or both?
2) Do you think that point is an influential point?
3) If that point were removed from the data, would the cor-
relation become stronger or weaker? Explain.
4) If that point were removed from the data, would the
slope of the regression line increase, decrease, or remain
the same? Explain.
18. More unusual points. Each of the following scatterplots
a-d shows a cluster of points and one "stray" point. For
each, answer questions 1-4: LO0@
1) In what way is the point unusual? Does it have high
leverage, a large residual, or both?
2) Do you think that point is an influential point?
3) If that point were removed from the data, would the cor-
relation become stronger or weaker? Explain.
Exercises
B27.
4) If that point were removed from the data, would the
slope of the regression line increase, decrease, or remain
the same? Explain.
19. The extra point. The scatterplot shows five blue data
points at the left. Not surprisingly, the correlation for
these points is r = 0. Suppose one additional data point
is added at one of the five positions suggested below in
green. Match each point (a·e) with the correct new cor-
relation from the list given. LO@
1) 0.90
22. What's the effect? Published reports about violence in
computer games have become a concern to developers
and distributors of these games. One firm commissioned
a study of violent behaviour in elementary school children.
The researcher asked the children·s parents how much
time each child spent playing computer games and had
their teachers rate each child·s level of aggressiveness when
playing with other children. The researcher found a mod-
erately strong positive correlation between computer game
time and aggressiveness score. But does this mean that
playing computer games increases aggression in children?
Describe three different possible cause-and-effect explana-
tions for this relationship. Lo@
@ 23. Used cars. Classified ads in a newspaper offered
several used Toyota Corollas for sale. Listed below are
the ages of the cars and the advertised prices. LO@, LO®
Age (yr) | Price Advertised ($)
1 | 13 990
1 | 13 495
3 | 12 999
4 | 9 500
4 | 10 495
5 | 8 995
5 | 9 495
6 | 6 999
7 | 6 950
7 | 7 850
8 | 6 999
8 | 2 999
10 | 4 950
10 | 4 495
13 | 2 850
a) Make a scatterplot for these data.
The regression equation is Assets = 1867.4 + 0.975 Sales

Predictor | Coefficient | SE(Coeff) | t-ratio | P-value
Constant | 1867.4 | 804.5 | 2.32 | 0.0230
Sales | 0.975 | 0.099 | 9.84 | ≤0.0001

s = 6132.59   R-Sq = 55.7%   R-Sq(adj) = 55.1%
Use the data provided to find a 95% confidence interval, if
appropriate, for the slope of the regression line and inter-
pret your interval in context.
@ 27. Fuel economy and weight. A consumer organization has
reported test data for 50 car models. We will examine the
association between the weight of the car (in thousands of
pounds) and the fuel efficiency (in miles per gallon). Use
the data provided to answer the following questions, where
the response variable is Fuel Efficiency (mpg). Lo
a) Create the scatterplot and obtain the regression equation.
b) Are the assumptions for regression satisfied?
c) Write the appropriate hypotheses for the slope.
d) Test the hypotheses and state your conclusion.
28. Auto batteries. In a recent issue, Consumer Reports listed
the price (in dollars) and power (in cold cranking amps)
of auto batteries. We want to know if more expensive bat-
teries are generally better in terms of starting power. Here
are the regression and residual output, where the response
variable is Power. LO@, L0©, LO@
Dependent variable is: Power
R squared = 25.2%
s = 116.0 with 33 · 2 = 31 degrees of freedom
Variable
Coefficient
SE(Coeff)
t-ratio
P-value
Intercept
384.594
[Histogram of the residuals (number of students on the vertical axis)]
a) Is there evidence of a linear association between Math
and Verbal scores? Write an appropriate hypothesis.
b) Discuss the assumptions for inference.
c) Test your hypothesis and state an appropriate conclusion.
@ 30. Productivity. How strong is the association between
labour productivity and labour costs? Statistics Canada
provides seasonally adjusted quarterly indexes of labour
productivity and related variables, using 2007 index values
as the base year (i.e., 2007 = 100) by industry based on the
North American Industry Classification System (NAICS).
Data for labour productivity and unit labour costs across
18 industries, from the fourth quarter of 2012, are used
to examine this relationship (CANSIM ·Table 383-0012).
Here are the results of a regression analysis where the
response variable is Labour Productivity. 0®
Predictor | Coefficient | SE(Coeff) | t-ratio | P-value
Intercept | 178.40 | 11.63 | 15.34 | <0.0001
Unit Labour Cost | −0.679 | 0.101 | −6.70 | <0.0001
s = 14.39   R-Sq = 73.7%
a) Is there evidence of a linear association between Labour
Productivity and Unit Labour Costs? Write appropriate
hypotheses.
b) Test your null hypothesis and state an appropriate con-
clusion (assume that assumptions and conditions are met).
@ 31. Football salaries 2013. Football team owners are con-
stantly in competition for good players. The more wins,
the more likely it is that the team will provide good busi-
© 33. Fuel economy and weight, part 2. Consider again the data in
Exercise 27 about the fuel economy and weights of cars. LO®
a) Create a 95% confidence interval for the slope of the
regression line.
b) Explain in this context what your confidence interval means.
@ 34. SAT scores, part 2. Consider the high school SAT
scores data from Exercise 29. Lo®
a) Find a 90% confidence interval for the slope of the true line
describing the association between Math and Verbal scores.
b) Explain in this context what your confidence interval
means.
@ 35. Sales and profits. A business analyst was interested in
the relationship between a company·s sales and its profits.
She collected data (in $ millions) from a random sample of
Fortune 500 companies and created the regression analy-
sis and summary statistics shown. The assumptions for
regression inference appeared to be satisfied. Lo®
 | Profits | Sales
Count | 75 | 75
Mean | 209.839 | 4178.29
Variance | 635,172 | 49,163,000
Std Dev | 796.977 | 7011.63

Dependent variable is Profits
R-squared = 66.2%   s = 466.2

Variable | Coefficient | SE(Coeff)
Intercept | −176.644 | 61.16
Sales | 0.092498 | 0.0075
a) Is there a statistically significant association between
sales and profits? Test an appropriate hypothesis and state
your conclusion in context.
b) Do you think that a company's sales serve as a useful predictor of its profits? Use the values of both R² and s in your explanation.
36. Marketing managers. Are wages for various marketing
managerial positions related? One way to determine this
is to examine the relationship between the mean hourly
46 for Karachi (the least expensive city in 2011) to Dhaka
and Manila at 62. Here is the resulting regression: LO®
Dependent variable: Index 2011
R-squared = 99.3%
s = 4.061 with 20 − 2 = 18 degrees of freedom

Predictor | Coeff | SE(Coeff) | t-Ratio | P-Value
Intercept | −2.6787 | 2.272 | −1.18 | 0.2537
Index | 1.1082 | 0.022 | 50.1 | <0.0001
a) Sketch what a scatterplot of Index2011 vs. Index2010 is likely to look like. You do not need to see the data.
b) Explain why the R² of this regression is higher than the R² of the regression in Exercise 37.
@ 40. NHL attendance 2011-2012. Traditionally, athletic teams
that perform better grow their fan base and generate greater
attendance at games or matches. This should hold true
regardless of the sport, whether it's soccer, football, baseball, or hockey. Data on the number of points and home
attendance for the 30 teams in the 2011-2012 National
Hockey League season are provided. Use Home Attendance
as the dependent variable and Points as the explanatory vari-
able to answer the following questions: L0@
a) Examine a scatterplot for the two variables and test the
conditions for regression.
b) Do you think there is a linear association between Home
Attendance and Points? Explain.
@ 41. Tablet computers. In July 2013, cnet.com listed the
battery life (in hours) and luminous intensity (i.e., screen brightness, in cd/m²) for a sample of tablet computers. We
want to know if screen brightness is associated with bat-
tery life. (reviews.cnet.com/8301-19736_7-20080768-25 1/
cnet-updates-tablet-test-results/?tag=contentBody;content
Highlights) Lo@
Dependent variable: Video battery life (in hours)
R-squared = 4.82%
s = 1.946 with 69 − 2 = 67 degrees of freedom
Variable
Number of Strikes (out of 50)
Before | After | Before | After
28 | 35 | 33 | 33
29 | 36 | 33 | 35
30 | 32 | 34 | 32
32 | 28 | 34 | 30
32 | 30 | 34 | 33
32 | 31 | 35 | 34
32 | 32 | 36 | 37
32 | 34 | 36 | 33
32 | 35 | 37 | 35
33 | 36 | 37 | 32
@ 43. Fuel economy and weight, part 3. Consider again the
data in Exercise 27 about the fuel economy and weights of
cars. LOO
a) Create a 95% confidence interval for the average fuel
efficiency among cars weighing 2500 pounds, and explain
what your interval means.
a) State what you want to know, identify the variables, and
give the appropriate hypotheses.
b) Check the assumptions and conditions.
c) If the conditions are met, complete the analysis.
@ 48. Energy use and recession. The Great Recession of 2008
changed spending and energy use habits worldwide. Based
on data collected from the United Nations Millennium
Indicators Database related to measuring the goal of ensur-
ing environmental sustainability, investigate the association
between energy use (kg oil equivalent per $1000 GDP)
before (2006) and after (2010) the crisis, for a sample of 33
countries (unstats.un.org/unsd/mi/mi_goals.asp; accessed
June 2013). Lo@, Lo®
a) Find a regression model showing the relationship
between 2010 Energy Use (response variable) and 2006
Energy Use (predictor variable).
b) Examine the residuals to determine if a linear regression
is appropriate.
c) Test an appropriate hypothesis to determine if the asso-
ciation is significant.
d) What percentage of the variability in 2010 Energy Use is
explained by 2006 Energy Use?
@ 49. Youth unemployment 2012. Here is a scatterplot show-
ing the regression line, 95% confidence interval, and 95%
prediction interval, using 2012 youth unemployment
data for a sample of 33 nations. The response variable
is the Male Rate, and the predictor variable is the Female
Rate. 108, Lo®
[Scatterplot: Male youth unemployment rate (roughly 65 to 80) against Female youth unemployment rate, with the regression line, 95% confidence interval, and 95% prediction interval]
Response variable: Energy Use 2010
R squared = 88.6%
s = 13.10 with 32 − 2 = 30 degrees of freedom

Variable | Coefficient | SE(Coeff) | t-Ratio | P-Value
Intercept | 5.74436 | 9.061 | 0.634 | 0.5309
Energy | 0.921540 | 0.0605 | 15.2 | <0.0001
a) Explain the meaning of the 95% prediction interval in
this context.
b) Explain the meaning of the 95% confidence interval in
this context.
c) How has setting aside Iceland changed the regression
model? How is it likely to affect the intervals discussed in
parts a and b?
@ 52. Global reach 2012. The internet has revolutionized
business and offers unprecedented opportunities for globalization. However, the ability to access the internet var-
ies greatly among different regions of the world. One of
the variables the United Nations collects data on each year
is Personal Computers per 100 Population (http://unstats.
un.org/unsd/cdb/cdb_help/cdb_quick_start.asp) for vari-
ous countries. Below is a scatterplot showing the regres-
sion line, 95% confidence interval, and 95% prediction
interval using 2000 and 2012 computer adoption (personal
computers per 100 population) for a sample of 85 coun-
tries. The response variable is PC/100 2012. LO®, LO®
[Scatterplot: PC per 100 population in 2012 against PC per 100 population in 2000 (roughly 0 to 25), with the regression line, 95% confidence interval, and 95% prediction interval]
a) Is the relationship between final exam grade and mid-
term exam grade significant?
Assuming the conditions for inference are satisfied, find
the t-value and P-value to test the appropriate hypothesis.
State your conclusion in context.
b) Here is part of the ANOVA table. Find the F-statistic
and P-value and state your conclusion in context. How
does it compare with part a?
ANOVA
Source | Sum of Squares | Degrees of Freedom | Mean Square
Model | 23875.86 | 1 | 23875.86
Error | 17823.82 | 198 | 90.02
Total | 41699.68 | 199
c) How useful a predictor of final exam grade is mid-term grade? Use the values of both R² and s in your explanation.
d) Give and interpret a 95% confidence interval for the
increase in final exam grade associated with each percent
increase in mid-term exam grade.
e) The mean mid-term grade was 75. Using an interval in
which you have 95% confidence, predict the final grade for
a student who had mid-term grade of 80.
f) Using an interval in which you have 95% confidence,
predict the average final grade for all students who had a
mid-term grade of 80.
@ 56. Exam grades, part 2. The spreadsheet of grades ana-
lyzed in Exercise 55 also has the total assignment grade.
Now consider the relationship between Assignment grade
and Final Exam grade for the same sample of 200 stu-
dents. LO@, LOO, LO®
a) Fill in the missing cells in the regression output template
below.
Variable
Coefficient
SE(Coeff)
CONNECTIONS: CHAPTER 15
Chapter 14 examined the relationship
between one quantitative predictor or
explanatory variable and one quantitative
response or outcome variable using
a simple linear model. In the current
chapter we extend the linear model to
multiple predictor variables. Multiple
regression is one of the most powerful
and widely used statistical tools.
LEARNING OBJECTIVES
• Check regression assumptions and R²
• Interpret coefficients, test statistic computations and hypothesis tests
• Check assumptions using residual plots
• Interpret complete regression output
• Carry out multiple regression analysis with software
• Use the fitted model for making predictions
• Communicate conclusions of regression analysis
• Compare models with F-test for change in R²
• Extend regression models to include polynomial terms and indicator variables
• Interpret logistic regression output (optional)
Multiple Regression
CREA (Canadian Real Estate Association)
For most people, the most expensive purchase they will ever make
will be a new home. In July 2016, average house prices in Canada
ranged from about $150 000-$250 000 in the Maritimes to over
$650 000 in British Columbia. The Canadian Real Estate Association
(CREA) is one of the largest single-industry trade associations in
Canada, with over 100 000 members. For years, information on
BuzzBuzzHome is another Toronto-based company whose website has
become a leading online hub for new home purchasers in Canada and
the United States. It connects users directly with developers and provides
social tools for collaboration among buyers, renters, and industry experts.
Since its inception in 2009, BuzzBuzzHome's monthly traffic has continued to increase exponentially.
An important aspect of real estate data is the ability to estimate a home's worth based on a variety of predictor variables, such as past history of the home's sales, location, and house characteristics (lot size, floor space area, number of bedrooms and bathrooms).
The tools for people to be able to analyze real estate trends have been
growing substantially. With advances in mobile technology, real time
mapping, and data acquisition, home buyers will be even better prepared
to make purchase decisions. But no matter how much data is available, a
home purchase will still be expensive!
By the way, the word Realtor® is capitalized because it is a trademark. To
be a Realtor® in Canada you must be a member of your local real estate
board and CREA.
Based on information from www.crea.ca/organization.
WHO: Houses
WHAT: Sale price and other facts about the houses
WHEN: A recent year
WHERE: "Somewhere in Canada"
WHY: To understand what influences housing prices and how to predict them
How would CREA or any of these other real estate sites figure out the worth of a house? The answer is, not surprisingly, to collect a huge amount of data and build a model. For example, the Zillow.com site computes a Zestimate (the proper pronunciation is unclear: in Canada it's "zed-estimate," in the United States "zee-estimate," or perhaps it's just "zest-imate"). According to the Zillow.com site, "We compute this figure by taking zillions of data points (much of this data is public) and entering them into a formula. This formula is built using what our statisticians call 'a proprietary algorithm,' big words for 'secret formula.' When our statisticians developed the model to determine home values, they explored
regression isn't a big step, but it's an important and worthwhile one. Multiple
regression is probably the most powerful and widely used statistical tool today.
As anyone who·s ever looked at house prices knows, house prices depend on
the local market. To control for that, we will restrict our attention to a single mar-
ket. We have a random sample of 1057 home sales from the public records of sales.
The first thing often mentioned in describing a house for sale is the number of
bedrooms. Let's start with just one predictor variable. Can we use Bedrooms to predict home Price?
The number of Bedrooms is a quantitative variable, but it holds only a few values (from 1 to 5 in this data set). So, a scatterplot may not be the best way to examine the relationship between Bedrooms and Price. In fact, at each value for Bedrooms there is a whole distribution of prices. Side-by-side boxplots of Price against Bedrooms show a general increase in price with more bedrooms, and an approximately
linear growth.
[Figure 15.1: Side-by-side boxplots of Price ($000) against Number of Bedrooms (1 to 5) show that price increases, on average, with more bedrooms.]
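A sketch of how such side-by-side boxplots could be drawn with pandas and matplotlib; the column names and the small data frame below are hypothetical stand-ins, not the 1057-sale data set used in the text:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for the home-sales records described in the text
homes = pd.DataFrame({
    "Bedrooms": [1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Price": [95_000, 120_000, 135_000, 160_000, 175_000, 190_000,
              210_000, 260_000, 320_000],
})

homes.boxplot(column="Price", by="Bedrooms")   # one boxplot of Price per bedroom count
plt.suptitle("")                               # drop the automatic grouping title
plt.xlabel("Number of Bedrooms")
plt.ylabel("Price ($)")
plt.show()
```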
The figure also shows a clearly increasing spread from left to right, violating the Equal Spread Condition, and that's a possible sign of trouble. For now, we'll proceed cautiously. We'll fit the regression model, but we will be cautious about using inference methods for the model. Later, we'll add more variables to increase the power and usefulness of the model.
The output from a linear regression model of Price on Bedrooms shows

Response variable: Price
R² = 21.4%
s = 68 432.21 with 1057 − 2 = 1055 degrees of freedom

Variable | Coeff | SE(Coeff) | t-ratio | P-value
Intercept
In simple regression, finding the least squares solution requires simple calculus (partial derivatives) and solving two equations in two unknowns; the algebra is very easy. But in multiple regression, the solution requires solving k + 1 equations in k + 1 unknowns. That's not easy, and requires matrix algebra, so we leave it to the computer.
Why can we not draw a scat-
terplot of the data, as we did
in simple regression? By add-
ing predictor variables we have
increased the number of dimen-
sions we would need to display.
With one predictor variable
(y vs. x), least squares means
finding the best-fitting line
through a two-dimensional
(i.e., oval) cloud of points. If
there are two predictor variables,
we look for the best-fitting plane
through a three-dimensional
(i.e., ovate or egg-shaped) cloud
of points. That's the furthest we
can go visually, because adding
a third predictor variable would
mean finding the best-fitting
solid through four-dimensional
space, and so on. It makes your
brain hurt to think about what
happens in 10 dimensions!
Even though the model does tell us something, notice that the R² for this regression is only 21.4%. The variation in the number of bedrooms accounts for only 21.4% of the variation in house prices. Perhaps some of the other facts about these houses can account for portions of the remaining variation.
The standard deviation of the residuals is s = 68 432, which tells us that the model only does a modestly good job of accounting for the price of a home. Using the 68-95-99.7 Rule as an approximation, we'd guess that only about 68% of home prices predicted by this model would be within $68 432 of the actual price. That's not likely to be close enough to be useful for a home buyer.
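A minimal sketch, again with hypothetical data standing in for the 1057 sales, of fitting the simple regression of Price on Bedrooms and checking how many residuals fall within one residual standard deviation of the fitted line (the rough 68% reading of s used above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the 1057 home sales described in the text
homes = pd.DataFrame({
    "Bedrooms": [2, 2, 3, 3, 3, 4, 4, 5, 5, 3],
    "Price": [120_000, 150_000, 160_000, 185_000, 200_000,
              230_000, 255_000, 310_000, 340_000, 175_000],
})

fit = smf.ols("Price ~ Bedrooms", data=homes).fit()
print(fit.params)                        # intercept and slope for Bedrooms
print(fit.rsquared)                      # share of the variation in Price explained

s = np.sqrt(fit.mse_resid)               # residual standard deviation
within_one_s = (np.abs(fit.resid) <= s).mean()
print(s, within_one_s)                   # roughly 68% of residuals fall within one s
```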
The Multiple Regression Model
For simple regression, we wrote the predicted values in terms of one predictor variable:

ŷ = b₀ + b₁x
explained by the model has gone up. It was for this reason, the hope of accounting for some of that leftover variability, that we tried a second predictor. We also shouldn't be surprised that the size of the house, as measured by Living Area, also contributes to a good prediction of house prices. Collecting the coefficients of the multiple regression of Price on Bedrooms and Living Area from Table 15.2, we can write the estimated regression as

Price = 20 986.09 − 7483.10 Bedrooms + 93.84 Living Area
As before, we define the residuals as

e = y − ŷ

The standard deviation of the residuals is still denoted as s (or sometimes as s_e, as in simple regression, for the same reason: to distinguish it from the standard deviation, s_y, of y). The degrees of freedom calculation comes right from our definition. The degrees of freedom is the number of observations (n = 1057) minus one for each coefficient estimated:

df = n − k − 1

where k is the number of predictor variables and n is the number of cases. For this model, we subtract three (the two coefficients and the intercept). To find the standard deviation of the residuals, we use that number of degrees of freedom in the denominator:

s_e = √( Σ(y − ŷ)² / (n − k − 1) )
For each predictor, the regression output shows a coefficient, its standard error, a t-ratio for the coefficient, and the corresponding P-value. As with simple regression, the t-ratio measures how many standard errors the coefficient is away from 0. Using a Student's t-model, we can use its P-value to test the null hypothesis that the true value of the coefficient is 0.
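A sketch of fitting the two-predictor model with statsmodels and confirming the degrees-of-freedom count n − k − 1; the column names and values are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; Living_Area is in square feet, Price in dollars
homes = pd.DataFrame({
    "Bedrooms": [2, 2, 3, 3, 3, 4, 4, 5, 5, 3],
    "Living_Area": [900, 1100, 1400, 1600, 1750, 2100, 2400, 2800, 3000, 1500],
    "Price": [120_000, 150_000, 160_000, 185_000, 200_000,
              230_000, 255_000, 310_000, 340_000, 175_000],
})

fit = smf.ols("Price ~ Bedrooms + Living_Area", data=homes).fit()
print(fit.summary())                # one coefficient, SE, t-ratio, and P-value per predictor

n, k = len(homes), 2                # k = number of predictor variables
print(fit.df_resid, n - k - 1)      # residual degrees of freedom: n - k - 1
```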
What's different? With so much of the multiple regression looking just like
simple regression, why devote an entire chapter to the subject?
There are several answers to this question. First, and most important, is that
the meaning of the coefficients in the regression model has changed in a subtle,
but important, way. Because that change is not obvious, multiple regression coeffi-
cients are often misinterpreted. And that can lead to dangerously wrong decisions.
We'll show some examples to explain this change in meaning.
Second, the analysis is much more complex. Analysis of the simple regression
model tests one key hypothesis; namely, does the single predictor explain varia-
tion in y more than just chance alone? Analysis of the multiple regression model
goes far beyond that. Once it is determined that the model as a whole is useful, the
question turns to which of the whole set of x-variables in the model are the con-
tributors to that usefulness. A team may be a winner, but that doesn't mean every
team member was a contributing member.
Third, multiple regression is an extraordinarily versatile model, underly-
ing many widely used statistics methods. A sound understanding of the multiple
regression model will help you understand these other applications as well.
Fourth, multiple regression offers you a first glimpse into statistical mod-
els that use more than two quantitative variables. The real world is complex.
Simple models of the kind we've shown so far are a great start, but they're not
15.2 Interpreting Multiple Regression Coefficients
It makes sense that both the number of bedrooms and the size of the living area would influence the price of a house. We'd expect both variables to have a positive effect on price: houses with more bedrooms typically sell for more money, as do larger houses. But look at the coefficient for Bedrooms in the multiple regression equation. It's negative: −7483.10. How can it be that the coefficient of Bedrooms in the multiple regression is negative? And not just slightly negative; its t-ratio is large enough for us to be quite confident that the true value is really negative. Yet from Table 15.1, we saw the coefficient was equally clearly positive when Bedrooms was the sole predictor in the model (see Figure 15.2).
[Figure 15.2: Scatterplot of Price ($000) against Bedrooms (#). The slope of Bedrooms is positive. For each additional bedroom, we would predict an additional $48 000 in the price of a house from the simple regression model of Table 15.1.]
The explanation of this apparent paradox is that in a multiple regression, coefficients have a more subtle meaning. Each coefficient takes into account the other predictor(s) in the model.
To see how variables can interact, let's look at a group of similarly sized homes and examine the relationship between Bedrooms and Price just for houses with 2500 to 3000 square feet of living area (Figure 15.3).
[Figure 15.3: Scatterplot of Price ($000), roughly 100 to 500, against Bedrooms (#) for the 96 houses with Living Area between 2500 and 3000 square feet. The slope of Price on Bedrooms is negative. For each additional bedroom, restricting data to homes of this size, we would predict that the house's Price was about $17 800 lower.]