Chapter 7 – Hypothesis Tests and Confidence Intervals in Multiple
Regression
Exercise 1:
Attendance at sports events depends on various factors. Teams typically do not change ticket
prices from game to game to attract more spectators to less attractive games. However, there
are other marketing tools used, such as fireworks, free hats, etc., for this purpose. You work
as a consultant for a sports team, the Los Angeles Dodgers, to help them forecast attendance,
so that they can potentially devise strategies for price discrimination. After collecting data
over two years for every one of the 162 home games of the 2000 and 2001 seasons, you run
the following regression:
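$$
\begin{aligned}
\widehat{\mathit{Attend}} = {} & \underset{(8{,}770)}{15{,}005} + \underset{(121)}{201} \times \mathit{Temperat} + \underset{(169)}{465} \times \mathit{DodgNetWin} + \underset{(26)}{82} \times \mathit{OppNetWin} \\
& + \underset{(1{,}505)}{9{,}647} \times \mathit{DFSaSu} + \underset{(3{,}355)}{1{,}328} \times \mathit{Drain} + \underset{(1{,}819)}{1{,}609} \times \mathit{D150m} + \underset{(1{,}184)}{271} \times \mathit{DDiv} - \underset{(1{,}143)}{978} \times \mathit{D2001}
\end{aligned}
$$
$R^2 = 0.416$, $SER = 6{,}983$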
where Attend is announced stadium attendance, Temperat is the average temperature on game
day, DodgNetWin are the net wins of the Dodgers before the game (wins-losses), OppNetWin
is the opposing team's net wins at the end of the previous season, and DFSaSu, Drain,
D150m, DDiv, and D2001 are binary variables, taking a value of 1 if the game was played on
a weekend, it rained during that day, the opposing team was within a 150 mile radius, the
opposing team plays in the same division as the Dodgers, and the game was played during
2001, respectively. Numbers in parentheses are heteroskedasticity-robust standard errors.
(a) Are the slope coefficients statistically significant?
(b) To test whether the effect of the last four binary variables is significant, you have your
regression program calculate the relevant F-statistic, which is 0.295. What is the critical
value? What is your decision about excluding these variables?
Answer:
(a) The coefficients on DodgNetWin, OppNetWin, and DFSaSu are all statistically
significant at the 5% level (|𝑡 𝑠𝑡𝑎𝑡| > 𝑡 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 1.96). None of the other coefficients is
statistically significant at the 5% level.
(You should calculate the t-statistics for all of the coefficients.
As an example, for 𝛽̂2:
𝐻0 : 𝛽2 = 0 vs 𝐻1 : 𝛽2 ≠ 0
$$t^{stat} = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{465}{169} = 2.751$$
Since 𝑡 𝑠𝑡𝑎𝑡 = 2.751 > 𝑡 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 1.96 → Reject 𝐻0 → 𝛽̂2 is statistically significant (≠ 0) →
DodgNetWin has a significant impact on attendance.)
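The same calculation can be repeated for every slope coefficient. The following is a minimal sketch in Python, assuming the large-sample normal approximation used above; the dictionary layout and variable names are illustrative choices, not part of the exercise.

```python
from statistics import NormalDist

# (coefficient, robust standard error) copied from the Exercise 1 regression.
estimates = {
    "Temperat": (201, 121),
    "DodgNetWin": (465, 169),
    "OppNetWin": (82, 26),
    "DFSaSu": (9647, 1505),
    "Drain": (1328, 3355),
    "D150m": (1609, 1819),
    "DDiv": (271, 1184),
    "D2001": (-978, 1143),
}

# Two-sided 5% critical value from the standard normal (about 1.96).
t_critical = NormalDist().inv_cdf(0.975)

# t-statistic for H0: beta_j = 0 is coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant = [name for name, t in t_stats.items() if abs(t) > t_critical]

for name, t in t_stats.items():
    verdict = "significant" if abs(t) > t_critical else "not significant"
    print(f"{name:<11} t = {t:6.3f}  ({verdict} at 5%)")
```

Comparing the absolute value of each t-statistic with the critical value handles negative coefficients such as the one on D2001.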
(b) 𝐻0 : 𝛽5 = 𝛽6 = 𝛽7 = 𝛽8 = 0 vs 𝐻1 : at least one of the coefficients is ≠ 0
From the 𝐹𝑚,∞ table: The critical value at the 5% level is 2.37 (𝐹4,∞ ).
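Thus, 𝐹 𝑠𝑡𝑎𝑡 = 0.295 < 𝐹 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 2.37 → we fail to reject the null hypothesis that the four
coefficients are simultaneously zero. There is therefore no statistical evidence against
excluding these four binary variables from the regression.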
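The table value can be checked numerically. The sketch below relies on the fact that an 𝐹𝑞,∞ variable is a chi-squared variable with q degrees of freedom divided by q, and it uses the closed-form chi-squared CDF for 4 degrees of freedom with a simple bisection search. The function names are hypothetical helpers written for this illustration.

```python
import math

def chi2_4_cdf(x):
    """CDF of a chi-squared variable with 4 degrees of freedom (closed form)."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)

def f_4_inf_critical(alpha, lo=0.0, hi=100.0):
    """Critical value of F(4, infinity) at level alpha, found by bisection.

    F(q, infinity) is distributed as chi-squared(q) divided by q.
    """
    target = 1.0 - alpha
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_4_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0 / 4.0

f_stat = 0.295  # F-statistic reported in part (b)
f_critical = f_4_inf_critical(0.05)
reject = f_stat > f_critical

print(f"F critical (5%) = {f_critical:.2f}, reject H0: {reject}")
```

A statistics library with an F or chi-squared quantile function would give the same number; the bisection only keeps the sketch free of external dependencies.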
Exercise 2:
You have collected data for 104 countries to address the difficult questions of the
determinants for differences in the standard of living among the countries of the world. You
recall from your macroeconomics lectures that the neoclassical growth model suggests that
output per worker (per capita income) levels are determined by, among others, the saving rate
and population growth rate. To test the predictions of this growth model, you run the
following regression:
$$\widehat{\mathit{RelPersInc}} = \underset{(0.068)}{0.339} - \underset{(3.177)}{12.894} \times n + \underset{(0.229)}{1.397} \times S_K$$
$R^2 = 0.621$, $SER = 0.177$
where RelPersInc is GDP per worker relative to the United States, n is the average population
growth rate, 1980-1990, and SK is the average investment share of GDP from 1960 to 1990
(remember investment equals saving). Numbers in parentheses are heteroskedasticity-robust
standard errors.
(a) Calculate the t-statistics and test whether or not each of the population parameters is
significantly different from zero.
(b) The overall F-statistic for the regression is 79.11. What is the critical value at the 5% and
1% level? What is your decision on the null hypothesis?
(c) You remember that human capital in addition to physical capital also plays a role in
determining the standard of living of a country. You therefore collect additional data on the
average educational attainment in years for 1985, and add this variable (Educ) to the above
regression. This results in the modified regression output:
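$$\widehat{\mathit{RelPersInc}} = \underset{(0.079)}{0.046} - \underset{(2.238)}{5.869} \times n + \underset{(0.294)}{0.738} \times S_K + \underset{(0.010)}{0.055} \times \mathit{Educ}$$
$R^2 = 0.775$, $SER = 0.1377$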
How has the inclusion of Educ affected your previous results?
(d) Upon checking the regression output, you realize that there are only 86 observations,
since data for Educ is not available for all 104 countries in your sample. Do you have to
modify some of your statements in (c)?
Answer:
(a) The t-statistics for population growth and the saving rate are –4.06 and 6.10, making both
coefficients significantly different from zero at conventional levels of significance.
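As a quick check of these two t-statistics, here is a short sketch in Python. It assumes two-sided tests with standard normal critical values; the variable names are illustrative.

```python
from statistics import NormalDist

# (coefficient, robust standard error) from the first Exercise 2 regression.
estimates = {"n": (-12.894, 3.177), "SK": (1.397, 0.229)}

# Two-sided critical values at the 5% and 1% levels (about 1.96 and 2.58).
crit_5 = NormalDist().inv_cdf(0.975)
crit_1 = NormalDist().inv_cdf(0.995)

t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}, "
          f"reject at 5%: {abs(t) > crit_5}, reject at 1%: {abs(t) > crit_1}")
```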
(b) 𝐻0 : 𝛽1 = 𝛽2 = 0 vs 𝐻1 : 𝛽1 ≠ 0 or 𝛽2 ≠ 0 (or both)
The F-statistic of 79.11 exceeds the 𝐹2,∞ critical values of 3.00 (5%) and 4.61 (1%), so you
reject the null hypothesis that all slope coefficients are zero → At least one of these variables
(average population growth rate and average investment share of GDP) has a significant
impact on GDP per worker.
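With two restrictions the large-sample critical values have a closed form, which makes the table lookup easy to verify. An 𝐹2,∞ variable is a chi-squared variable with 2 degrees of freedom divided by 2, and that chi-squared distribution has survival function exp(−x/2), so the critical value at level α is −ln(α). The function name below is a hypothetical helper for this sketch.

```python
import math

def f_2_inf_critical(alpha):
    """Critical value of F(2, infinity) at significance level alpha.

    P(chi2_2 > x) = exp(-x / 2); setting this equal to alpha and dividing
    the solution x by q = 2 gives -ln(alpha).
    """
    return -math.log(alpha)

f_stat = 79.11  # overall F-statistic reported in part (b)

for alpha in (0.05, 0.01):
    crit = f_2_inf_critical(alpha)
    print(f"alpha = {alpha}: critical value = {crit:.2f}, reject H0: {f_stat > crit}")
```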
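(c) The coefficients on the population growth rate and the saving rate are roughly half of what
they were originally, although both remain statistically significant at the 5% level (t-statistics
of –2.62 and 2.51). The coefficient on Educ is also significant (t-statistic of 5.50). The
regression R2 has increased substantially, from 0.621 to 0.775.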
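The size of the change can be made concrete with a few lines of arithmetic. This sketch uses only the estimates reported in the two regressions; the dictionary names are illustrative.

```python
# (coefficient, robust standard error) before and after adding Educ.
before = {"n": (-12.894, 3.177), "SK": (1.397, 0.229)}
after = {"n": (-5.869, 2.238), "SK": (0.738, 0.294), "Educ": (0.055, 0.010)}

# Ratio of the new coefficient to the original one for the shared regressors.
ratios = {name: after[name][0] / before[name][0] for name in before}

# t-statistics in the regression that includes Educ.
t_after = {name: coef / se for name, (coef, se) in after.items()}

for name, ratio in ratios.items():
    print(f"{name}: new / old coefficient = {ratio:.2f}")
for name, t in t_after.items():
    print(f"{name}: t = {t:.2f}")
```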
(d) When comparing results, you should ensure that the sample is identical, since
comparisons are not valid otherwise: the changes noted in (c) may partly reflect the smaller
sample rather than the inclusion of Educ. In addition, there are now fewer than 100
observations, making inference based on the standard normal distribution problematic.
Exercise 3:
Consider the following regression using the California School data set from your textbook.
$$\widehat{\mathit{TestScore}} = 681.44 - 0.61 \times \mathit{LchPct}$$
$n = 420$, $R^2 = 0.75$, $SER = 9.45$
where TestScore is the test score and LchPct is the percent of students eligible for subsidized
lunch.
Your textbook started with the following regression in Chapter 4:
$$\widehat{\mathit{TestScore}} = 698.9 - 2.28 \times \mathit{STR}$$
$n = 420$, $R^2 = 0.051$, $SER = 18.58$
where STR is the student teacher ratio.
Your textbook tells you that in the multiple regression framework considered, the percentage
of students eligible for subsidized lunch is a control variable, while the student teacher ratio is
the variable of interest. Given that the regression R2 is so much higher for the first equation
than for the second equation, shouldn't the role of the two variables be reversed? That is,
shouldn't the student teacher ratio be the control variable while the percent of students
eligible for subsidized lunch be the variable of interest?
Answer:
The choice of variable of interest versus control variable has nothing to do with which
variable has a higher explanatory power in the two models. Instead it depends on the question
you are analyzing. In Chapter 4, the question was raised whether or not the test scores of
students could be improved by hiring more teachers. Hence the variable of interest became
class size or its proxy, the student teacher ratio. However, there are other variables which
may have an effect on test scores, and not controlling for those will result in omitted variable
bias on the coefficient of the variable of interest. Of course, the role of a control variable and
the variable of interest can be switched if a different policy question is addressed. For
example, a politician might be interested in the effect on student performance of raising
income levels in certain school districts, or across the board.
Exercise 4:
Data were collected from a random sample of 200 home sales from a community in 2013. Let
Price denote the selling price (in $1000s), BDR denote the number of bedrooms, Bath denote
the number of bathrooms, Hsize denote the size of the house (in square feet), Lsize denote the
lot size (in square feet), Age denote the age of the house (in years), and Poor denote a binary
variable that is equal to 1 if the condition of the house is reported as “poor.”
An estimated regression yields:
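$$\widehat{\mathit{Price}} = \underset{(22.1)}{109.7} + \underset{(1.23)}{0.567}\,\mathit{BDR} + \underset{(9.76)}{26.9}\,\mathit{Bath} + \underset{(0.021)}{0.239}\,\mathit{Hsize} + \underset{(0.00072)}{0.005}\,\mathit{Lsize} + \underset{(0.23)}{0.1}\,\mathit{Age} - \underset{(12.23)}{56.9}\,\mathit{Poor}$$
$\bar{R}^2 = 0.849$, $SER = 45.8$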
a. Is the coefficient on 𝐵𝐷𝑅 statistically significant?
b. A homeowner purchases 2500 square feet from an adjacent lot. Construct a 95%
confidence interval for the change in the value of his house.
c. Lot size is measured in square feet. Do you think that another scale might be more
appropriate?
d. The F-statistic for omitting 𝐵𝐷𝑅 and 𝐴𝑔𝑒 from the regression is 𝐹 = 2.38. Are the
coefficients on 𝐵𝐷𝑅 and 𝐴𝑔𝑒 statistically different from zero at the 10% level? At the
5% level?
Answer:
a. 𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0
$$t^{stat} = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{0.567}{1.23} = 0.461$$
𝑡 𝑠𝑡𝑎𝑡 = 0.461 < 𝑡 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 1.96 → We fail to reject 𝐻0 → the coefficient on 𝐵𝐷𝑅 is
not statistically significant → the number of bedrooms does not have a significant
impact on the house’s price.
b. 95% confidence interval for the change in the value of the house:
2500 × {𝛽̂4 − 𝑡 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 × 𝑆𝐸(𝛽̂4 ); 𝛽̂4 + 𝑡 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 × 𝑆𝐸(𝛽̂4 )}
2500 × {0.005 − 1.96 × 0.00072; 0.005 + 1.96 × 0.00072}
{8.972; 16.028} (in thousands of dollars)
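The interval arithmetic can be reproduced as follows. This is a minimal sketch using the rounded critical value 1.96 from the answer; the variable names are illustrative.

```python
# Coefficient and standard error on Lsize (price in $1000s per square foot).
beta_lsize = 0.005
se_lsize = 0.00072

delta_lsize = 2500   # additional lot size in square feet
t_critical = 1.96    # two-sided 5% critical value used in the answer

# 95% confidence interval for the coefficient, scaled by the change in Lsize.
lower = delta_lsize * (beta_lsize - t_critical * se_lsize)
upper = delta_lsize * (beta_lsize + t_critical * se_lsize)
point = delta_lsize * beta_lsize

print(f"Predicted change: {point:.1f} thousand dollars")
print(f"95% CI: ({lower:.3f}, {upper:.3f}) thousand dollars")
```

Scaling both endpoints by 2500 is valid because the predicted change is a constant multiple of a single coefficient.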
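c. The scale of a variable should be chosen to make the regression results easy to read
and interpret. With lot size in square feet, the coefficient (0.005) and its standard
error (0.00072) are very small; measuring lot size in thousands of square feet would
multiply both by 1000, giving 5 and 0.72. Rescaling does not change the t-statistic,
the fit of the regression, or the interpretation of the results.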
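To illustrate, the sketch below re-expresses the Lsize estimate in thousands of square feet, a unit chosen here purely for illustration, and confirms that the t-statistic and the predicted effect of a 2500 square foot increase are unchanged.

```python
# Lsize estimate with lot size measured in square feet.
coef_sqft, se_sqft = 0.005, 0.00072

# Hypothetical rescaling: lot size measured in thousands of square feet.
scale = 1000
coef_k, se_k = coef_sqft * scale, se_sqft * scale

# The t-statistic is a ratio, so the scale factor cancels.
t_sqft = coef_sqft / se_sqft
t_k = coef_k / se_k

# Predicted price change (in $1000s) for 2500 extra square feet, both ways.
effect_sqft = coef_sqft * 2500
effect_k = coef_k * (2500 / scale)

print(f"coefficient: {coef_sqft} -> {coef_k:.2f}, SE: {se_sqft} -> {se_k:.2f}")
print(f"t-statistic: {t_sqft:.3f} vs {t_k:.3f}")
print(f"effect of 2500 sq ft: {effect_sqft:.1f} vs {effect_k:.1f}")
```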
d. 𝐻0 : 𝛽1 = 𝛽5 = 0 vs 𝐻1 : 𝛽1 ≠ 0 𝑎𝑛𝑑/𝑜𝑟 𝛽5 ≠ 0
𝐹 𝑠𝑡𝑎𝑡 = 2.38
𝐹 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 at 10% level → 𝑞 = 2, 𝑛 ≥ 100 → 𝐹2,∞ = 2.30
Since 𝐹 𝑠𝑡𝑎𝑡 = 2.38 > 𝐹 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 2.30 → We reject 𝐻0 at the 10% significance level
→ At least one of the two variables (𝐵𝐷𝑅 and 𝐴𝑔𝑒) has a significant impact on the
house’s price.
𝐹 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 at 5% level → 𝑞 = 2, 𝑛 ≥ 100 → 𝐹2,∞ = 3.00
Since 𝐹 𝑠𝑡𝑎𝑡 = 2.38 < 𝐹 𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 = 3.00 → We fail to reject 𝐻0 at the 5% significance
level → At the 5% level, we cannot reject the hypothesis that the coefficients on
𝐵𝐷𝑅 and 𝐴𝑔𝑒 are both zero.
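Both decisions can be summarized with a single p-value. Under the large-sample 𝐹2,∞ approximation used above, q·F is chi-squared with 2 degrees of freedom, whose survival function is exp(−x/2), so the p-value of the F-statistic is exp(−F). The sketch below assumes that approximation; the function name is a hypothetical helper.

```python
import math

def f_2_inf_p_value(f_stat):
    """Large-sample p-value for an F-statistic testing q = 2 restrictions.

    Uses P(chi2_2 > 2 * F) = exp(-F).
    """
    return math.exp(-f_stat)

f_stat = 2.38  # F-statistic for omitting BDR and Age
p_value = f_2_inf_p_value(f_stat)

print(f"p-value = {p_value:.4f}")
for alpha in (0.10, 0.05):
    print(f"reject H0 at {alpha:.0%}: {p_value < alpha}")
```

The p-value lies between 5% and 10%, which matches rejecting at the 10% level but not at the 5% level.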