Problem Set 3: General Guideline
General Guideline
What we are looking for in the assignments is a demonstration that you can understand the econometrics
and statistics questions and can solve them with R or conceptually. That means effective programming to
get correct results is needed, but clear explanations of economics/business concepts in well-presented
reports are equally important in the assessment of your work. In particular, you will be marked for
successful (correct) programming (not the style of coding), good understanding of related concepts, and
clear interpretations and explanations of results.
Please submit a pdf or html file converted from R markdown/notebook after you program in R.
Use the data in kielmc.RData (only those observations from 1981) to answer the following questions. The
data are for houses that sold during 1981 in North Andover, Massachusetts; 1981 was the year construction
began on a local garbage incinerator. We want to study the effects of the incinerator location on housing
price. Two key variables are price, the housing price in dollars, and dist, the distance from the house to
the incinerator measured in feet.
First, we check the data. There are 321 rows in the table, and there are no NA values.
load('kielmc.RData')
summary(data) # no NA values in the data
nrow(data)
## rooms area land baths
## Median : 7.000 Median :2056 Median : 43560 Median :2.00
## Mean : 6.586 Mean :2107 Mean : 39630 Mean :2.34
## 3rd Qu.: 7.000 3rd Qu.:2544 3rd Qu.: 46100 3rd Qu.:3.00
## Max. :10.000 Max. :5136 Max. :544500 Max. :4.00
## dist ldist wind lprice
## Min. : 5000 Min. : 8.517 Min. : 3.000 Min. :10.17
## 1st Qu.:13400 1st Qu.: 9.503 1st Qu.: 5.000 1st Qu.:11.08
## Median :19900 Median : 9.898 Median : 7.000 Median :11.36
## Mean :20716 Mean : 9.837 Mean : 6.978 Mean :11.38
## 3rd Qu.:27200 3rd Qu.:10.211 3rd Qu.:11.000 3rd Qu.:11.70
## Max. :40000 Max. :10.597 Max. :11.000 Max. :12.61
## y81 larea lland y81ldist
## Min. :0.0000 Min. :6.600 Min. : 7.444 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:7.352 1st Qu.: 9.737 1st Qu.: 0.000
## Median :0.0000 Median :7.629 Median :10.682 Median : 0.000
## Mean :0.4424 Mean :7.597 Mean :10.302 Mean : 4.343
## 3rd Qu.:1.0000 3rd Qu.:7.841 3rd Qu.:10.739 3rd Qu.: 9.820
## Max. :1.0000 Max. :8.544 Max. :13.208 Max. :10.569
## lintstsq nearinc y81nrinc rprice
## Min. : 47.72 Min. :0.0000 Min. :0.0000 Min. : 26000
## 1st Qu.: 82.90 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 59000
## Median : 93.71 Median :0.0000 Median :0.0000 Median : 82000
## Mean : 90.48 Mean :0.2991 Mean :0.1246 Mean : 83721
## 3rd Qu.:101.73 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:100230
## Max. :108.87 Max. :1.0000 Max. :1.0000 Max. :300000
## lrprice
## Min. :10.17
## 1st Qu.:10.99
## Median :11.31
## Mean :11.26
## 3rd Qu.:11.52
## Max. :12.61
## [1] 321
1. There are two alternative specifications of the simple regression model: (1) regress price on dist; (2)
regress log(price) on log(dist). Plot the two OLS lines on the corresponding scatter plots, and discuss
which one you would use for the study.
In the model regressing price on dist, the p-value of dist is 0.00034, so dist is statistically
significant. The R-squared of this model is 0.03949, meaning that about 3.95% of the variance in the
dependent variable is explained by the independent variable. In the model regressing lprice on ldist,
the p-value of ldist is 1.78e-10, so ldist is also statistically significant, and the R-squared is
0.1199, meaning that about 11.99% of the variance in the dependent variable is explained. Based on the
R-squared values, I would choose the model regressing lprice on ldist, since it explains more of the
variance in its dependent variable. In addition, judging from the OLS lines on the corresponding scatter
plots, the fitted line from regressing lprice on ldist tracks the data more closely, which suggests more
accurate predictions.
kielmc.m1 <- lm(price ~ dist, data = data)
summary(kielmc.m1)
##
## Call:
## lm(formula = price ~ dist, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68772 -31196 -12955 23511 209165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.519e+04 6.241e+03 12.046 < 2e-16 ***
## dist 1.010e+00 2.788e-01 3.622 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42430 on 319 degrees of freedom
## Multiple R-squared: 0.03949, Adjusted R-squared: 0.03648
## F-statistic: 13.12 on 1 and 319 DF, p-value: 0.0003404
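The plotting code does not appear in the extracted output; a minimal sketch that would produce the figure
below (colour and labels are illustrative):

plot(price ~ dist, data = data, xlab = "dist", ylab = "price")
abline(kielmc.m1, col = "red") # fitted OLS line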
[Figure: scatter plot of price versus dist with the fitted OLS regression line.]
kielmc.m2 <- lm(lprice ~ ldist, data = data)
summary(kielmc.m2)
##
## Call:
## lm(formula = lprice ~ ldist, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22356 -0.28076 -0.05527 0.27992 1.29332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.25750 0.47383 17.427 < 2e-16 ***
## ldist 0.31722 0.04811 6.594 1.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4117 on 319 degrees of freedom
## Multiple R-squared: 0.1199, Adjusted R-squared: 0.1172
## F-statistic: 43.48 on 1 and 319 DF, p-value: 1.779e-10
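Again, a minimal sketch of the plotting code for the log-log specification (details illustrative):

plot(lprice ~ ldist, data = data, xlab = "ldist", ylab = "lprice")
abline(kielmc.m2, col = "red") # fitted OLS line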
[Figure: scatter plot of lprice versus ldist with the fitted OLS regression line.]
2. Suppose you choose to regress price on dist. What is a 95% prediction interval for a house that is
20,000 feet away from the incinerator (Answer this question without using the predict() function)?
Add the bounds of the 95% prediction interval to a scatter plot of price versus dist. What percentage
of your observations are outside this interval?
Without using the predict() function, a simple way to find the prediction interval is to define a new
independent variable dist0 = dist - 20000 and regress price on dist0. The intercept of this regression is
then the predicted price at dist = 20000. From the result, we can compute the standard error of the
prediction error and the critical t value at the 95% level with 319 degrees of freedom, and hence the
lower and upper bounds of the 95% prediction interval, (12218.73, 178541.27). That is, the price of a
house that is 20,000 feet away from the incinerator is predicted to lie between $12,218.73 and $178,541.27.
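The code for this step does not appear in the extracted output; a minimal sketch of the calculation just
described, assuming the object names below (the summary output that follows is from the re-centred fit):

data$dist0 <- data$dist - 20000                  # re-centre at 20,000 feet
kielmc.m1b <- lm(price ~ dist0, data = data)     # intercept = prediction at dist = 20000
summary(kielmc.m1b)

pred  <- coef(kielmc.m1b)[1]                     # predicted price
se.b0 <- coef(summary(kielmc.m1b))[1, 2]         # se of the predicted mean
sigma <- summary(kielmc.m1b)$sigma               # residual standard error
se.pe <- sqrt(se.b0^2 + sigma^2)                 # se of the prediction error
tcrit <- qt(0.975, df = df.residual(kielmc.m1b)) # critical t with 319 df
c(pred - tcrit * se.pe, pred + tcrit * se.pe)    # 95% prediction interval

Note that this formula adds the sampling variance of the predicted mean to the residual variance; the
bounds quoted above are consistent with using the residual standard error alone with a critical value of
about 1.96, which gives a slightly narrower interval.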
##
## Call:
## lm(formula = price ~ dist0, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68772 -31196 -12955 23511 209165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.538e+04 2.376e+03 40.134 < 2e-16 ***
## dist0 1.010e+00 2.788e-01 3.622 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42430 on 319 degrees of freedom
## Multiple R-squared: 0.03949, Adjusted R-squared: 0.03648
## F-statistic: 13.12 on 1 and 319 DF, p-value: 0.0003404
We add the bounds of the 95% prediction interval to a scatter plot of price versus dist and find that 13
observations lie outside the interval. The proportion is 13/321 = 0.04049844, or about 4.05%.
# Predict the price and calculate prediction interval
kielmc.pred <- predict(kielmc.m1, interval = "prediction", level = 0.95)
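A sketch of how the outside observations can be counted and the bounds drawn (the column names lwr and
upr come from predict()):

outside <- data$price < kielmc.pred[, "lwr"] | data$price > kielmc.pred[, "upr"]
sum(outside)  # 13 observations outside the band
mean(outside) # about 0.0405, i.e. 4.05%

plot(price ~ dist, data = data)
o <- order(data$dist) # sort by dist before drawing the bound lines
lines(data$dist[o], kielmc.pred[o, "lwr"], lty = 2)
lines(data$dist[o], kielmc.pred[o, "upr"], lty = 2)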
[Figure: scatter plot of price versus dist with the bounds of the 95% prediction interval.]
3. Now suppose that you choose to regress log(price) on log(dist). Report the results. Would you choose
to add the square of log(dist) to the model? Explain.
Based on the model regressing lprice on ldist alone, there is a statistically significant relationship
between ldist and lprice, but the R-squared indicates that only about 11.99% of the variability in lprice
is explained by ldist. So although the model is statistically significant, it does not capture a large
portion of the variability in the data. In the model regressing lprice on ldist and the square of ldist,
the coefficients on both log(dist) and its square are highly statistically significant, with t statistics
greater than three in absolute value, and the R-squared rises from 11.99% to 15.06%. Adding the square of
log(dist) therefore has a real effect: housing prices are related to distance from the incinerator in a
nonlinear way. In this case, we would choose to add the square of log(dist) to the model.
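The code that produced the output below is not shown in the extracted document; a sketch, assuming ldist2
is the square of ldist and the model object is named kielmc.m3:

data$ldist2 <- data$ldist^2 # square of log(dist)
kielmc.m3 <- lm(lprice ~ ldist + ldist2, data = data)
summary(kielmc.m3)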
##
## Call:
## lm(formula = lprice ~ ldist + ldist2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28806 -0.24405 -0.05712 0.28100 1.21653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.72575 8.26686 -2.386 0.017612 *
## ldist 6.12103 1.71250 3.574 0.000406 ***
## ldist2 -0.30011 0.08852 -3.390 0.000786 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4051 on 318 degrees of freedom
## Multiple R-squared: 0.1506, Adjusted R-squared: 0.1453
## F-statistic: 28.2 on 2 and 318 DF, p-value: 5.311e-12
4. To the simple regression model in part 3 (without the square of log(dist)), add the variables log(intst),
log(area), log(land), rooms, baths, and age, where intst is distance from house to interstate (i.e., a
major system of highways running between US states) entrance ramp measured in feet, area is square
footage of the house, land is the lot size in square feet, rooms is total number of rooms, baths is number
of bathrooms, and age is age of the house in years. Now, what do you conclude about the effects of
the incinerator? Explain why parts 3 and 4 give conflicting results.
Comparing the results between parts 3 and 4: when the explanatory variables log(intst), log(area),
log(land), rooms, baths, and age are added to the regression model, the coefficient of log(dist) falls to
0.0282046, so the estimated effect of log(dist) is much smaller. It also becomes statistically
insignificant, with a p-value of 0.59647. The estimated effect of the incinerator largely disappears
because we have now controlled for several other housing and location characteristics that also affect
price and are correlated with distance to the incinerator.
kielmc.m4 <- lm(lprice ~ ldist + lintst + larea + lland + rooms + baths + age, data = data)
summary(kielmc.m4)
##
## Call:
## lm(formula = lprice ~ ldist + lintst + larea + lland + rooms +
## baths + age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35838 -0.18221 0.00117 0.20533 0.82180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2996221 0.5960536 10.569 < 2e-16 ***
## ldist 0.0282046 0.0532130 0.530 0.59647
## lintst -0.0438028 0.0424358 -1.032 0.30277
## larea 0.5124039 0.0698229 7.339 1.87e-12 ***
## lland 0.0782203 0.0337208 2.320 0.02100 *
## rooms 0.0503141 0.0235113 2.140 0.03313 *
## baths 0.1070541 0.0352304 3.039 0.00258 **
## age -0.0035631 0.0005774 -6.171 2.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2828 on 313 degrees of freedom
## Multiple R-squared: 0.5925, Adjusted R-squared: 0.5834
## F-statistic: 65.02 on 7 and 313 DF, p-value: < 2.2e-16
Use the data in gpa2.RData for this exercise. Consider the equation

colgpa = β0 + β1 hsperc + β2 sat + β3 female + β4 athlete + u,

where colgpa is cumulative college grade point average; hsperc is academic percentile in high school (from
the top); sat is combined SAT score (a standardized test widely used for college admission in the US);
female is a binary gender variable; and athlete is a binary variable, which is one for student athletes.
First, we check the data. There are 4137 rows in the table, and there are no NA values.
load('gpa2.RData')
summary(data) # no NA values in the data
nrow(data)
## verbmath hsize hsrank hsperc
## 3rd Qu.:0.9649 3rd Qu.:3.68 3rd Qu.: 70.00 3rd Qu.:27.7108
## Max. :1.6667 Max. :9.40 Max. :634.00 Max. :92.0000
## female white black hsizesq
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. : 0.0009
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.: 2.7225
## Median :0.0000 Median :1.0000 Median :0.00000 Median : 6.3001
## Mean :0.4496 Mean :0.9255 Mean :0.05535 Mean :10.8535
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:13.5424
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :88.3600
## [1] 4137
1. Estimate the equation and report the results. What is the estimated GPA differential between athletes
and nonathletes? Is it statistically significant?
The estimated GPA of athletes is 0.156282 points higher than that of non-athletes, holding other factors
constant. The p-value of the athlete coefficient is 0.00023, so the GPA differential between athletes and
non-athletes is statistically significant.
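The estimation code is not shown in the extracted output; a sketch consistent with the Call below (gpa.m1
is the name used later in summary(gpa.m1)):

gpa.m1 <- lm(colgpa ~ hsperc + sat + female + athlete, data = data)
gpa.m1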
##
## Call:
## lm(formula = colgpa ~ hsperc + sat + female + athlete, data = data)
##
## Coefficients:
## (Intercept) hsperc sat female athlete
## 1.147058 -0.012883 0.001627 0.155561 0.156282
summary(gpa.m1)
##
## Call:
## lm(formula = colgpa ~ hsperc + sat + female + athlete, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71546 -0.35324 0.03398 0.39125 1.84429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.147e+00 7.582e-02 15.128 < 2e-16 ***
## hsperc -1.288e-02 5.610e-04 -22.963 < 2e-16 ***
## sat 1.627e-03 6.689e-05 24.322 < 2e-16 ***
## female 1.556e-01 1.805e-02 8.619 < 2e-16 ***
## athlete 1.563e-01 4.239e-02 3.687 0.00023 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5562 on 4132 degrees of freedom
## Multiple R-squared: 0.2876, Adjusted R-squared: 0.2869
## F-statistic: 417.1 on 4 and 4132 DF, p-value: < 2.2e-16
2. Drop sat from the model and re-estimate the equation. Now, what is the estimated effect of being an
athlete? Discuss why the estimate is different than that obtained in part 1.
The estimated GPA of athletes is now 0.001758 points lower than that of non-athletes, holding other
factors constant, and the p-value of the athlete coefficient is 0.96869, so the differential is
statistically insignificant. The estimates in parts 1 and 2 differ because sat is correlated with both
athlete and colgpa: athletes tend to have lower SAT scores, so when sat is omitted, the athlete
coefficient absorbs this negative association. Once we control for sat, athletes do better than
non-athletes.
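As before, a sketch of the re-estimation consistent with the Call below (gpa.m2 is the name used later in
summary(gpa.m2)):

gpa.m2 <- lm(colgpa ~ hsperc + female + athlete, data = data)
gpa.m2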
##
## Call:
## lm(formula = colgpa ~ hsperc + female + athlete, data = data)
##
## Coefficients:
## (Intercept) hsperc female athlete
## 2.948326 -0.016766 0.059982 -0.001758
summary(gpa.m2)
##
## Call:
## lm(formula = colgpa ~ hsperc + female + athlete, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.52452 -0.38549 0.01841 0.42288 1.92750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.948326 0.017372 169.722 < 2e-16 ***
## hsperc -0.016766 0.000575 -29.159 < 2e-16 ***
## female 0.059982 0.018832 3.185 0.00146 **
## athlete -0.001758 0.044780 -0.039 0.96869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5946 on 4133 degrees of freedom
## Multiple R-squared: 0.1857, Adjusted R-squared: 0.1851
## F-statistic: 314.1 on 3 and 4133 DF, p-value: < 2.2e-16
3. In the model, allow the effect of being an athlete to differ by gender and test the null hypothesis that
there is no difference between women athletes and women non-athletes.
In this model the base group is female non-athletes, so the coefficient on female.athlete shows that
colgpa is predicted to be about 0.1806 points higher for a female athlete than for a female non-athlete,
holding the other variables in the equation fixed. The t value (2.144) is greater than 2, so we can reject
the null hypothesis that there is no difference in the population between women athletes and women
non-athletes. There is a significant difference between women athletes and women non-athletes,
ceteris paribus.
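The construction of the group dummies is not shown in the extracted output; a sketch consistent with the
Call below, with female non-athletes as the base group (the model-object name gpa.m3a is assumed):

data$female.athlete  <- data$female * data$athlete
data$male.athlete    <- (1 - data$female) * data$athlete
data$male.nonathlete <- (1 - data$female) * (1 - data$athlete)
gpa.m3a <- lm(colgpa ~ hsperc + sat + female.athlete + male.athlete +
              male.nonathlete, data = data)
summary(gpa.m3a)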
##
## Call:
## lm(formula = colgpa ~ hsperc + sat + female.athlete + male.athlete +
## male.nonathlete, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71474 -0.35337 0.03322 0.39385 1.84362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.303e+00 7.185e-02 18.131 <2e-16 ***
## hsperc -1.288e-02 5.612e-04 -22.951 <2e-16 ***
## sat 1.626e-03 6.693e-05 24.294 <2e-16 ***
## female.athlete 1.806e-01 8.425e-02 2.144 0.0321 *
## male.athlete -6.031e-03 4.876e-02 -0.124 0.9016
## male.nonathlete -1.544e-01 1.836e-02 -8.412 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5562 on 4131 degrees of freedom
## Multiple R-squared: 0.2877, Adjusted R-squared: 0.2868
## F-statistic: 333.6 on 5 and 4131 DF, p-value: < 2.2e-16
4. Does the effect of sat on colgpa differ by gender? Justify your answer.
The p-value of the coefficient on the interaction term sat:female is 0.677697, so there is no statistically
significant difference in the effect of sat on colgpa by gender.
gpa.m3 <- lm(colgpa ~ hsperc + sat + female + athlete + female:sat, data = data)
summary(gpa.m3)
##
## Call:
## lm(formula = colgpa ~ hsperc + sat + female + athlete + female:sat,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72246 -0.35323 0.03388 0.38942 1.84130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.171e+00 9.451e-02 12.384 < 2e-16 ***
## hsperc -1.290e-02 5.619e-04 -22.950 < 2e-16 ***
## sat 1.605e-03 8.528e-05 18.819 < 2e-16 ***
## female 1.003e-01 1.342e-01 0.747 0.454926
## athlete 1.547e-01 4.257e-02 3.633 0.000283 ***
## sat:female 5.384e-05 1.295e-04 0.416 0.677697
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5562 on 4131 degrees of freedom
## Multiple R-squared: 0.2877, Adjusted R-squared: 0.2868
## F-statistic: 333.7 on 5 and 4131 DF, p-value: < 2.2e-16
In class, we discussed using VIF to identify the issue of multicollinearity. Consider the following model:
y = β0 + β1 x1 + β2 x2 + u.
Now suppose that VIF_1 (the variance inflation factor for x_1) is equal to 10. What is the correlation
between x_1 and x_2 in the sample?
If VIF_1 is equal to 10, then R_1^2 = 1 - 1/VIF_1 = 0.9, where R_1^2 is the R-squared from regressing x_1
on x_2. In a simple regression the R-squared equals the squared sample correlation, so the correlation
between x_1 and x_2 is either sqrt(0.9) ≈ 0.949 or -sqrt(0.9) ≈ -0.949. If they tend to move together,
the correlation is positive; if they tend to move in opposite directions, the correlation is negative.
VIF_1 <- 10
R_squared <- 1 - (1/VIF_1) # R-squared from regressing x_1 on x_2
R_squared
## [1] 0.9
sqrt(R_squared) # absolute value of the sample correlation
## [1] 0.9486833
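As a quick check (a simulation sketch, not part of the assignment data), two regressors with correlation
sqrt(0.9) should produce a VIF of about 10 for x_1:

set.seed(42)
n  <- 1e5
x2 <- rnorm(n)
x1 <- sqrt(0.9) * x2 + sqrt(0.1) * rnorm(n)  # corr(x1, x2) is about sqrt(0.9)
r2 <- summary(lm(x1 ~ x2))$r.squared         # R-squared from regressing x1 on x2
1 / (1 - r2)                                 # approximately 10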