Multlin 4
Jo Hardin
Multiple Regression IV – R code
Model Building
E[Y] = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + β6 X6
Y = state average SAT score
X1 = % of eligible seniors who took the exam, takers
X2 = median income of families of test takers, income
X3 = average number of years of formal education, years
X4 = % of test takers who attend public school, public
X5 = total state expenditure on public secondary schools ($100/student), expend
X6 = median percentile rank of test takers within their secondary school class, rank
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   948.45      10.21   92.86   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1112.408     12.386   89.81   <2e-16 ***
ltakers      -59.175      4.167  -14.20   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These notes belong at the end... I’m putting them here to save paper and keep the formatting
reasonably clear.
• Notice that Cp and F-tests use a “full” model MSE. Typically, the MSE will only be
an unbiased estimator of σ² in backward variable selection.
• Using different selection criteria may lead to different models (there is no one best
model).
• The order in which variables are entered does not necessarily represent their
importance: a variable entered early on can be dropped at a later stage because it
is predicted well from the other explanatory variables that have been subsequently
added to the model.
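As a rough illustration of the first bullet, Mallows' Cp for a candidate model with p coefficients can be computed as SSE_p / MSE_full - (n - 2p). The sketch below plugs in RSS values printed in the R output in these notes. The sample size n = 49 is not stated anywhere in the notes; it is inferred from the residual degrees of freedom implied by the printed F statistics, so treat it as an assumption. The helper name mallows_cp is made up for this example.

```r
# Mallows' Cp from RSS values printed in the notes' R output.
n <- 49                       # ASSUMPTION: inferred from the output, not stated
p_full <- 7                   # intercept + 6 explanatory variables
rss_full <- 21397             # RSS of the full model (from the drop1 output)
mse_full <- rss_full / (n - p_full)

# Hypothetical helper: Cp = SSE_p / MSE_full - (n - 2p)
mallows_cp <- function(rss, p) rss / mse_full - (n - 2 * p)

cp_full <- mallows_cp(rss_full, p_full)  # equals p_full by construction
cp_four <- mallows_cp(21922.1, 5)        # sat ~ ltakers + expend + years + rank
print(round(c(full = cp_full, four_variable = cp_four), 2))
```

Because the denominator is always the full-model MSE, every candidate model is judged against the same estimate of σ², which is the point the bullet makes.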
Forward Variable Selection: F-tests
> add1(lm(sat~1), sat~ ltakers + income + years + public + expend +
rank, test="F")
Single term additions
Model:
sat ~ 1
        Df Sum of Sq    RSS AIC  F value     Pr(F)
<none>               245376 419
ltakers  1    199007  46369 340 201.7138 < 2.2e-16 ***
income   1    102026 143350 395  33.4513 5.711e-07 ***
years    1     26338 219038 416   5.6515   0.02156 *
public   1      1232 244144 421   0.2371   0.62856
expend   1       386 244991 421   0.0740   0.78683
rank     1    190297  55079 348 162.3828 < 2.2e-16 ***
Note: Sum of Sq is SSR(new variable | current model), the additional reduction in SSE
from adding that variable. RSS is the SSE for the model that contains the current
variables and the new variable.
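The F values in the add1 table can be recovered from the Sum of Sq and RSS columns: each is Sum of Sq divided by the MSE of the model that includes the new variable. The sketch below checks two rows. The sample size n = 49 is an assumption inferred from the output (it is not given in the notes), and add1_F is a made-up helper name.

```r
# Reproduce add1 F statistics from the printed Sum of Sq and RSS columns.
n <- 49   # ASSUMPTION: inferred from the output, not stated in the notes

# Hypothetical helper. p_new = number of coefficients after adding the variable.
add1_F <- function(sum_sq, rss_new, p_new) {
  sum_sq / (rss_new / (n - p_new))
}

F_ltakers <- add1_F(199007, 46369, 2)   # intercept + ltakers
F_rank    <- add1_F(190297, 55079, 2)   # intercept + rank
print(round(c(ltakers = F_ltakers, rank = F_rank), 2))
```

Small differences from the printed values in the third decimal place come from the rounding of Sum of Sq and RSS in the table.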
Backward Variable Selection: F-tests
> drop1(lm(sat ~ ltakers + income + years + public + expend + rank), test="F")
Single term deletions
Model:
sat ~ ltakers + income + years + public + expend + rank
        Df Sum of Sq   RSS AIC F value     Pr(F)
<none>               21397 312
ltakers  1      2150 23547 315  4.2203   0.04620 *
income   1       340 21737 311  0.6681   0.41834
years    1      2532 23928 315  4.9693   0.03121 *
public   1        20 21417 310  0.0393   0.84390
expend   1     10964 32361 330 21.5221 3.404e-05 ***
rank     1      2679 24076 316  5.2587   0.02691 *
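In the backward direction every F statistic shares one denominator, the MSE of the current (here, full) model, and the variable with the largest p-value is the first candidate for removal. The sketch below recomputes the F statistics and p-values from the Sum of Sq column. As before, n = 49 is an assumption inferred from the output, not a value given in the notes.

```r
# Recompute drop1 F tests for the full model from the printed table.
n <- 49                       # ASSUMPTION: inferred, not stated in the notes
df_full  <- n - 7             # intercept + 6 explanatory variables
rss_full <- 21397             # <none> row of the drop1 table

sum_sq <- c(ltakers = 2150, income = 340, years = 2532,
            public = 20, expend = 10964, rank = 2679)

F_drop <- sum_sq / (rss_full / df_full)                 # partial F, 1 and 42 df
p_drop <- pf(F_drop, 1, df_full, lower.tail = FALSE)    # upper-tail p-values

first_to_drop <- names(which.max(p_drop))               # least significant term
print(round(cbind(F = F_drop, p = p_drop), 4))
print(first_to_drop)
```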
If you ask to add1 here (that is, to see whether it makes sense to add either public or
income back into the model), neither is significant.
Step: AIC=339.78
sat ~ ltakers
         Df Sum of Sq   RSS AIC
+ expend  1     20523 25846 313
+ years   1      6364 40006 335
<none>                46369 340
+ rank    1       871 45498 341
+ income  1       785 45584 341
+ public  1       449 45920 341
Step: AIC=313.14
sat ~ ltakers + expend
         Df Sum of Sq     RSS   AIC
+ years   1    1248.2 24597.6 312.7
+ rank    1    1053.6 24792.2 313.1
<none>                25845.8 313.1
+ income  1      53.3 25792.5 315.0
+ public  1       1.3 25844.5 315.1
Step: AIC=312.71
sat ~ ltakers + expend + years
         Df Sum of Sq     RSS   AIC
+ rank    1    2675.5 21922.1 309.1
<none>                24597.6 312.7
+ public  1     287.8 24309.8 314.1
+ income  1      19.2 24578.4 314.7
Step: AIC=309.07
sat ~ ltakers + expend + years + rank
         Df Sum of Sq     RSS   AIC
<none>                21922.1 309.1
+ income  1     505.4 21416.7 309.9
+ public  1     185.0 21737.1 310.7
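The AIC values printed at each step above can be checked by hand. For a linear model, R's extractAIC computes n*log(RSS/n) + k*p, with k = 2 by default and p the number of coefficients including the intercept. The sketch below applies that formula to the RSS of each model on the path. The sample size n = 49 is an assumption inferred from the output, and aic_lm is a made-up helper name.

```r
# Check the "Step: AIC=" values from the RSS of each model on the path.
n <- 49   # ASSUMPTION: inferred from the output, not stated in the notes

# Hypothetical helper mirroring the formula used by extractAIC for lm fits.
aic_lm <- function(rss, p, k = 2) n * log(rss / n) + k * p

aic_path <- c(
  "sat ~ ltakers"                         = aic_lm(46369,   2),
  "sat ~ ltakers + expend"                = aic_lm(25845.8, 3),
  "sat ~ ltakers + expend + years"        = aic_lm(24597.6, 4),
  "sat ~ ltakers + expend + years + rank" = aic_lm(21922.1, 5)
)
print(round(aic_path, 2))
```

The selection stops at the four-variable model because neither remaining candidate (income, public) lowers the AIC below the <none> row.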
Step: AIC=321.28
sat ~ ltakers + income + years + expend + rank
          Df Sum of Sq   RSS AIC
- income   1       505 21922 319
<none>                 21417 321
- ltakers  1      2552 23968 323
- years    1      3011 24428 324
- rank     1      3162 24578 324
- expend   1     12465 33882 340
Step: AIC=318.53
sat ~ ltakers + years + expend + rank
          Df Sum of Sq   RSS AIC
<none>                 21922 319
- rank     1      2676 24598 320
- years    1      2870 24792 321
- ltakers  1      5094 27016 325
- expend   1     13620 35542 338
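The AIC column in the last two tables does not match a penalty of k = 2: the five-variable model has RSS 21416.7, which gives about 309.9 under k = 2, yet the step header reads 321.28. The printed values are consistent with a penalty of k = log(n), the BIC-style criterion that step() still labels "AIC" when it is called with k = log(n). Both that reading and n = 49 are inferences from the numbers, not something the notes state; aic_lm is a made-up helper name.

```r
# Check the backward-selection criterion values under a log(n) penalty.
n <- 49   # ASSUMPTION: inferred from the output, not stated in the notes

# Hypothetical helper: n*log(RSS/n) + k*p, the formula extractAIC uses for lm.
aic_lm <- function(rss, p, k = 2) n * log(rss / n) + k * p

# k = log(n) is an ASSUMPTION that happens to reproduce the printed values.
bic_five <- aic_lm(21416.7, 6, k = log(n))  # ltakers + income + years + expend + rank
bic_four <- aic_lm(21922.1, 5, k = log(n))  # ltakers + years + expend + rank
aic_five <- aic_lm(21416.7, 6)              # same model under the default k = 2

print(round(c(bic_five = bic_five, bic_four = bic_four, aic_five = aic_five), 2))
```

Under either penalty the procedure ends at the same four-variable model here, although the heavier log(n) penalty will in general favor smaller models.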