Introduction to Econometrics with R

Rajat Tayal

Fourth Quantitative Finance Workshop
December 21-24, 2012
Indian Institute of Technology, Kanpur
23 December 2012
Regression diagnostics
Leverage and standardized residuals
Deletion diagnostics
The function influence.measures()
Testing for heteroskedasticity
Testing for functional form
Testing for autocorrelation
Robust standard errors and tests
Part I
Linear regression
Introduction
The linear regression model, typically estimated by ordinary least squares
(OLS), is the workhorse of applied econometrics. The model is

    y_i = x_i' β + ε_i,   i = 1, 2, . . . , n,        (1)

or, in matrix notation,

    y = X β + ε.                                      (2)

The standard assumptions are

    E(ε | X) = 0,                                     (3)
    Var(ε | X) = σ² I,                                (4)

and, for ordered observations, the exogeneity condition

    E(ε_j | x_i) = 0,   i ≤ j.                        (5)
For cross-sections, exogeneity is typically assumed observation by
observation:

    E(ε_i | x_i) = 0.                                 (6)
> install.packages("AER", dependencies = TRUE)
> library("AER")
> data("Journals")
> names(Journals)
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> summary(journals)
      subs             price          citeprice
 Min.   :   2.0   Min.   :  20.0   Min.   : 0.005223
 1st Qu.:  52.0   1st Qu.: 134.5   1st Qu.: 0.464495
 Median : 122.5   Median : 282.0   Median : 1.320513
 Mean   : 196.9   Mean   : 417.7   Mean   : 2.548455
 3rd Qu.: 268.2   3rd Qu.: 540.8   3rd Qu.: 3.440171
 Max.   :1098.0   Max.   :2120.0   Max.   :24.459459
The model of interest is

    log(subs)_i = β1 + β2 log(citeprice)_i + ε_i.     (7)
Here, the formula of interest is log(subs) ~ log(citeprice). This can be used
both for plotting and for model fitting:
> plot(log(subs) ~ log(citeprice), data = journals)
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
> abline(jour_lm)
abline() extracts the coefficients of the fitted model and adds the
corresponding regression line to the plot.
The function lm() returns a fitted-model object, here stored as jour_lm.
It is an object of class "lm".
> class(jour_lm)
[1] "lm"
> names(jour_lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"
Function      Description
print()       simple printed display
summary()     standard regression output
coef()        extracting the regression coefficients
residuals()   extracting residuals
fitted()      extracting fitted values
anova()       comparison of nested models
predict()     predictions for new data
plot()        diagnostic plots
confint()     confidence intervals for the regression coefficients
deviance()    residual sum of squares
vcov()        (estimated) variance-covariance matrix
logLik()      log-likelihood (assuming normally distributed errors)
AIC()         information criteria including AIC, BIC/SBC (assuming
              normally distributed errors)
Analysis of variance
> anova(jour_lm)
Analysis of Variance Table

Response: log(subs)
                Df Sum Sq Mean Sq F value    Pr(>F)
log(citeprice)   1 125.93 125.934  224.04 < 2.2e-16 ***
Residuals      178 100.06   0.562
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table breaks the sum of squares about the mean (for the
dependent variable, here log(subs)) into two parts: a part that is
accounted for by a linear function of log(citeprice) and a part attributed to
residual variation.
Prediction
The standard errors of predictions for new data take into account
both the uncertainty in the regression line and the variation of the
individual points about the line.
Thus, the prediction interval for prediction of new data is larger than
that for prediction of points on the line. The function predict()
provides both types of standard errors.
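Both interval types are selected via the interval argument of predict(); a minimal sketch using the jour_lm model from above (the citeprice value of 2.11 is illustrative, not taken from the slides):

```r
# refit the model from the Journals example
library("AER")
data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price/Journals$citations
jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)

# confidence interval: uncertainty about the regression line only
predict(jour_lm, newdata = data.frame(citeprice = 2.11),
        interval = "confidence")

# prediction interval: adds the variation of individual points
# about the line, hence it is always wider
predict(jour_lm, newdata = data.frame(citeprice = 2.11),
        interval = "prediction")
```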
Plotting lm objects
The plot() method for class "lm" provides six types of diagnostic plots,
four of which are shown by default.
We set the graphical parameter mfrow to c(2, 2) using the par() function,
creating a 2 x 2 matrix of plotting areas to see all four plots simultaneously:
> par(mfrow = c(2, 2))
> plot(jour_lm)
> par(mfrow = c(1, 1))
The first provides a graph of residuals versus fitted values, the second is a
QQ plot for normality, plots three and four are a scale-location plot and a
plot of standardized residuals against leverages, respectively.
For the CPS1988 wage data, the model of interest is

    log(wage) = β1 + β2 experience + β3 experience²
                + β4 education + β5 ethnicity + ε.    (8)
   education       experience    ethnicity        smsa             region        parttime
 Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223   northeast:6441   no :25631
 1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932   midwest  :6863   yes: 2524
 Median :12.00   Median :16.0                            south    :8760
 Mean   :13.07   Mean   :18.2                            west     :6091
 3rd Qu.:15.00   3rd Qu.:27.0
 Max.   :18.00   Max.   :63.0
Comparison of models
With more than a single explanatory variable, it is interesting to test for
the relevance of subsets of regressors. For any two nested models, this can
be done using the function anova(). For example, to test for the relevance
of the variable ethnicity, we explicitly fit the model without ethnicity and
then compare both models.
> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) +
+   education, data = CPS1988)
> anova(cps_noeth, cps_lm)
Analysis of Variance Table

Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  28151 9719.6
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This reveals that the effect of ethnicity is significant at any reasonable level.
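The same nested comparison can also be run as a Wald test with waldtest() from the lmtest package (attached when AER is loaded); a minimal, self-contained sketch:

```r
library("AER")  # attaches lmtest, which provides waldtest()
data("CPS1988")

# full model and nested model without ethnicity
cps_lm <- lm(log(wage) ~ experience + I(experience^2) +
               education + ethnicity, data = CPS1988)
cps_noeth <- update(cps_lm, . ~ . - ethnicity)

# reproduces the F statistic of anova(cps_noeth, cps_lm)
waldtest(cps_noeth, cps_lm)
```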
Part II
Linear regression with panel data
Introduction
For illustrating the basic fixed- and random-effects methods, we use the
well-known Grunfeld data (Grunfeld 1958) comprising 20 annual
observations on the three variables real gross investment (invest), real
value of the firm (value), and real value of the capital stock (capital) for
11 large US firms for the years 1935-1954.
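A minimal sketch of the fixed- and random-effects estimators for these data, using the plm package (the specification invest ~ value + capital follows the variables described above):

```r
library("plm")                      # panel data methods
data("Grunfeld", package = "AER")

# declare the panel structure: firm and year index the observations
pgr <- pdata.frame(Grunfeld, index = c("firm", "year"))

# fixed-effects ("within") and random-effects estimators
gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
gr_re <- plm(invest ~ value + capital, data = pgr, model = "random")

# Hausman test: checks whether the two estimators differ significantly
phtest(gr_fe, gr_re)
```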
The results suggest that autoregressive dynamics are important for these
data.
Part III
Regression diagnostics
Review
> data("Journals")
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> journals$age <- 2000 - Journals$foundingyear
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
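The heteroskedasticity test listed in the outline can be applied directly to this model; a sketch using bptest() from the lmtest package (attached when AER is loaded):

```r
library("AER")
data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price/Journals$citations
jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)

# Breusch-Pagan test: the null hypothesis is homoskedastic errors
bptest(jour_lm)
```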
Further tests for autocorrelation are the Box-Pierce test and the
Ljung-Box test, both being implemented in the function Box.test() in
base R.
> Box.test(residuals(consump1), type = "Ljung-Box")

	Box-Ljung test

data:  residuals(consump1)
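A self-contained illustration on simulated data (base R only; the AR coefficient 0.7 and lag = 10 are illustrative choices):

```r
set.seed(42)

# an AR(1) series is autocorrelated, so the Ljung-Box test should reject
x <- arima.sim(model = list(ar = 0.7), n = 200)
Box.test(x, lag = 10, type = "Ljung-Box")

# white noise, by contrast, shows no autocorrelation to detect
Box.test(rnorm(200), lag = 10, type = "Ljung-Box")
```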
Thank You....