Regression in R
Generic functions (e.g. summary()) compute something different depending on the class of the object passed to them.
When one variable has a larger variance (standard deviation) than another, the estimated coefficient of that variable has a smaller standard error than the estimated coefficient of the other variable.
H0: none of the coefficients is different from 0, i.e. neither x1 nor x2 has a significant impact on Y.
If H0 were correct, we would obtain such a test result (such a data set) with probability equal to the p-value.
Here the p-value is close to 0: only with a probability of 0.2% would we obtain such a sample if x1 and x2 had no impact on Y.
Ordinary t-test:
If I only look at 1 estimated parameter at a time, the confidence region I see is only the purple dashed square => more uncertainty.
If I look at more than 1 parameter at a time, the joint confidence region is an ellipse (related to the F-test), whose area is much smaller than the square.
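The joint confidence region can be drawn, for example, with the car package (a sketch; which coefficients to plot is an assumption):
library(car)   # loaded together with AER
confidenceEllipse(reg, which.coef = c(1, 2), levels = 0.95)   # joint 95% confidence ellipse for two coefficients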
Confidence interval for fixed parameters (beta): accounts for estimation uncertainty only.
Beta is a true coefficient / the true intercept: it is fixed in the population, we don't know it, but it is not random.
Prediction interval (for random quantities, like y): accounts not just for estimation uncertainty but also for inherent randomness (the fact that the individual observations do not lie exactly on the regression line but scatter around it, on average by about one standard deviation of the error term).
The confidence interval (red dashes) is narrower than the prediction interval (blue).
set.seed(42)
confint(reg)
                2.5 %   97.5 %
(Intercept) 3.956109 11.05414
x           1.958848  2.56930
95% confidence intervals: the CI for b0 is (3.96; 11.05) and the CI for b1 is (1.96; 2.57).
model.matrix(reg)
y.p
y.p_CI
y.p_PI
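The objects y.p, y.p_CI and y.p_PI are not defined in this excerpt; a minimal sketch of how they could be computed (the new covariate value is hypothetical):
new.x  <- data.frame(x = 5)                                        # hypothetical new observation
y.p    <- predict(reg, newdata = new.x)                            # point prediction
y.p_CI <- predict(reg, newdata = new.x, interval = "confidence")   # estimation uncertainty only
y.p_PI <- predict(reg, newdata = new.x, interval = "prediction")   # plus inherent randomness of y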
Exercise:
library(AER)
data(CPS1988)
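cps_lm is not defined in this excerpt; a plausible specification (assumed) consistent with the comment below is:
cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988)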
summary(cps_lm)
#Years of experience have a positive effect on wage but this effect is flattening (exp^2 has a minus)
Whenever the horizontal zero line lies within the confidence interval => the coefficient is not significant.
Task 3. We want a surface that covers the range of possible values that can be inserted for covariates
colnames(dat)
range(dat$education)
### Prediction
prd <- predict(object = reg, newdata = dat.new, interval = "prediction", level = 0.95)
#predicted wage for these 3 new people (attention: the fitted values look odd because the response is log(wage)!)
1. Option 1: arbitrary cut – 80% of the observations for training, 20% for testing. Problem: no random assignment into the groups, biased estimation because not the whole data set is used.
2. Option 2: randomly assign some observations to the training set and the rest to the test set. Problem: still not the whole data set is used for estimation.
3. Option 3: apply approach 2 repeatedly.
Task 1
library(np)
data(wage1)
n <- nrow(dat)
ASEP.output <- matrix(data = NA, nrow = R, ncol = 2, dimnames = list(NULL, c("reg1", "reg3")))
for(r in 1:R){
### Computation
for(r in 1:R){
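The loop bodies are not kept in these notes; a self-contained sketch of the repeated random-split evaluation (the 80/20 split and the model specifications for reg1 and reg3 are assumptions):
dat <- wage1
n   <- nrow(dat)
R   <- 100
ASEP.output <- matrix(data = NA, nrow = R, ncol = 2, dimnames = list(NULL, c("reg1", "reg3")))
set.seed(42)
for(r in 1:R){
  est.id <- sample(x = 1:n, size = round(0.8 * n), replace = FALSE)   # random 80/20 split
  reg1 <- lm(wage ~ educ, data = dat[est.id, ])                       # hypothetical specification
  reg3 <- lm(wage ~ educ + exper + tenure, data = dat[est.id, ])      # hypothetical specification
  ASEP.output[r, "reg1"] <- mean((dat[-est.id, "wage"] - predict(reg1, newdata = dat[-est.id, ]))^2)
  ASEP.output[r, "reg3"] <- mean((dat[-est.id, "wage"] - predict(reg3, newdata = dat[-est.id, ]))^2)
}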
apply(X = ASEP.output, MARGIN = 2, FUN = mean) # reg3 is better (ASEP is smaller) but...
apply(X = ASEP.output, MARGIN = 2, FUN = sd) # ... reg3 has a higher standard deviation
apply(X = ASEP.output, MARGIN = 2, FUN = median) # similar results for the median
### -> see, whether one specification performs poorly for some of the replications
sum(ASEP.output[, "reg1"] < ASEP.output[, "reg3"]) # in 31 (of 100) replications, reg1 is better than reg3
plot(ASEP.output)
abline(c(0, 1))
# Points below the diagonal indicate that for the corresponding replications reg1 has a larger ASEP than
# reg3, i.e., reg3 is better, while points above the diagonal indicate superior performance of reg1 for the
# corresponding replications.
Code 5 MC Simulation
Nominal size (= alpha): the probability of rejecting a correct H_0 (alpha is the upper limit for this probability). The probability of a type-1 error is determined by alpha: alpha = P(type-1 error).
H_0: b1 = 0
Actual (empirical) size: percentage (share) of replications in which the correct H_0 is rejected.
Power (= 1 - beta): the probability of rejecting a false H_0, i.e. the probability of not committing a type-2 error.
Sample size N
Significance level (alpha)
“true” value of tested parameter (beta)
### This means we have to generate different settings. In order to judge the size of a test, H_0 has to be true.
### Important point: for both cases (size and power) we need the probability of rejecting H_0.
Type-1 error ("false positive"): we find a statistically significant difference when there is none.
alpha = P(type-1-error) = P(reject H_0 | H_0 correct)
A test at the 95% confidence level has a 5% chance of a type-1 error (determined by alpha).
Type-2 error “false negative”: H0 is false and you fail to reject it (determined by beta).
beta = P(type-2-error) = P(do not reject H_0 | H_0 false)
R <- 100
n <- n.seq[2]
reject.H_0 <- array(data = NA, dim = c(R, length(b1.seq)), dimnames = list(NULL, b1.seq))
set.seed(42)
for(b1 in b1.seq){
for(r in 1:R){
reject.H_0[r, as.character(b1)] <- t.hat < qnorm(p = alpha/2, mean = 0, sd = 1) | t.hat > qnorm(p = 1 - alpha/2, mean = 0, sd = 1) # H_0 rejected
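The DGP and the test statistic inside the loops are not kept in the notes; a self-contained sketch (sample size, grid of true slopes, and the DGP are assumptions):
R      <- 100
n      <- 50                                   # assumed sample size
alpha  <- 0.05
b1.seq <- seq(from = -1, to = 1, by = 0.1)     # hypothetical grid of true slope values
reject.H_0 <- array(data = NA, dim = c(R, length(b1.seq)), dimnames = list(NULL, b1.seq))
set.seed(42)
for(b1 in b1.seq){
  for(r in 1:R){
    x <- rnorm(n = n)
    y <- 1 + b1 * x + rnorm(n = n)             # H_0: b1 = 0 is true only for b1 = 0
    t.hat <- coef(summary(lm(y ~ x)))["x", "t value"]
    reject.H_0[r, as.character(b1)] <- t.hat < qnorm(p = alpha/2) | t.hat > qnorm(p = 1 - alpha/2)
  }
}
plot(b1.seq, colMeans(reject.H_0), type = "b", xlab = "true b1", ylab = "rejection rate")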
# The plot shows the power of the test for b1 != 0; at b1 = 0 it shows the empirical size of the test (nominal size 5%).
# Power is the ability of a test to detect a false H0 (the probability of rejecting H0 given that it is indeed false). Power increases with the number of observations n.
Exercise:
The standard deviation of the slope parameter is biased, hence the t-test is biased => misbehavior of the power function.
The farther away we are from b1 = 0, the higher the probability of rejection should be, but here there are regions where it decreases again.
Error assumptions are not violated – no heteroskedasticity: the errors are homoskedastic (the error variance is constant across all observations) and there is no autocorrelation.
Correct functional form.
Appropriate covariates that explain the relationship between X and Y.
If the error assumptions are violated => t-statistics and p-values are biased.
Mississippi, Alaska and Washington DC have the potential to leverage the regression line (to have a high impact) because, in terms of X, they are very far away from the mean income.
Residuals vs fitted (top left plot): judges homoskedasticity of the error term and model misspecification.
Residuals bounce randomly around the zero line => the linearity assumption is reasonable.
Residuals form a horizontal band around the zero line => the variances of the error terms are equal.
No residual stands out from the random pattern => no outliers.
Normal Q-Q (top right plot): points are close to the diagonal, but for very small or very large quantiles the empirical quantiles deviate from the theoretical ones. Checks the normality assumption.
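The panels described here are the standard diagnostic plots of a fitted lm object; a sketch using the model ps_lm from below:
par(mfrow = c(2, 2))   # 2x2 layout for the four diagnostic panels
plot(ps_lm)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))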
The red lines are drawn at 3 times and 1 times the mean hat-matrix diagonal value.
Alaska and Washington DC have a very large leverage and thus the potential to severely change the regression line.
Plotting the hat-matrix diagonal values (leverage values) against the covariate values gives a U-shaped relationship: the closer an observation is to the mean income, the lower its hat-matrix diagonal value.
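A sketch of the two leverage plots described above (a data frame ps with covariate Income is an assumption, following the PublicSchools example):
plot(hatvalues(ps_lm), ylab = "leverage")
abline(h = c(1, 3) * mean(hatvalues(ps_lm)), col = "red")   # red reference lines at 1x and 3x the mean leverage
plot(ps$Income, hatvalues(ps_lm))                           # U-shaped: leverage against the covariate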
cooks.distance(ps_lm)
plot(fitted(jour_lm), resid(jour_lm))
abline(h = 0)
Points with fitted values < 5 have a larger residual variance than those with fitted values > 5 => the error variance differs across fitted values => the homoskedasticity assumption is violated!
Breusch-Pagan test and White test -> both test the null hypothesis of homoskedasticity.
-> White-test equation: regress the squared residuals on all covariates and all covariate interactions.
All 3 tests: the p-value is close to 0 => reject H0 => we have heteroskedasticity => the usual standard errors and t-statistics are biased.
H0: the model is homoskedastic.
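A sketch of how these tests can be run with the lmtest package (the regressors of the White-type auxiliary regression are an assumption, based on the single covariate log(citeprice)):
library(lmtest)   # loaded together with AER
bptest(jour_lm)                                                             # Breusch-Pagan test, H0: homoskedasticity
bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2), data = journals)   # White-type test: covariate plus its square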
Possible solutions:
Rainbow and Harvey(-Collier) tests check whether the linear functional form is appropriate, i.e. whether omitted (e.g. nonlinear) terms have an impact on Y.
Serial correlation – the relationship between a given variable and a lagged version of itself over time (i.e. the relationship between the current value and its past values).
p-value close to 0 => reject H0 => the true autocorrelation is greater than 0 => serial correlation.
Test distributions
(The bigger the sample) The more degrees of freedom the t-distribution has, the closer it gets to the normal distribution.
N (0, 1) – normal density – black line
Different t-distributions: 10 df, 5 df, 1 df
1 df deviates the most from N (0, 1)
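This comparison can be drawn directly:
curve(dnorm(x), from = -4, to = 4, lwd = 2, ylab = "density")   # N(0, 1), black line
curve(dt(x, df = 10), add = TRUE, col = "blue")
curve(dt(x, df = 5),  add = TRUE, col = "green")
curve(dt(x, df = 1),  add = TRUE, col = "red")                  # deviates the most from N(0, 1)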
Exact test vs. asymptotic test: often the exact t-distribution is replaced by its asymptotic equivalent (the normal density).
Bootstrap – resampling with replacement from the given sample – used for estimating a distribution.
We create bootstrap samples of the same size as the original sample. If we need the distribution of some statistic (e.g. a t-statistic) but only have a single sample of n observations, we can compute the statistic once from that sample; with bootstrapping we instead create many samples artificially and compute the statistic in each of them.
Exact or bootstrap? It is always better to use the exact test if it exists and the corresponding assumptions are fulfilled.
If we are not sure about the exact distribution, it is better to use the bootstrap. The bootstrap is based on the model, so if the model is misspecified or the data are not representative => do not use the bootstrap.
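A minimal sketch of bootstrapping the distribution of a t-statistic (a generic data frame dat with response y and covariate x is assumed):
B      <- 999                          # number of bootstrap samples
t.boot <- numeric(B)
set.seed(42)
for(b in 1:B){
  boot.id  <- sample(x = 1:nrow(dat), size = nrow(dat), replace = TRUE)   # resampling with replacement
  reg.boot <- lm(y ~ x, data = dat[boot.id, ])
  t.boot[b] <- coef(summary(reg.boot))["x", "t value"]
}
hist(t.boot)   # estimated (bootstrap) distribution of the t-statistic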
H0: coef = 0
p-value = 0.77 => H0 is NOT rejected => income^2 is not significant
Heteroskedasticity- and autocorrelation consistent (HAC) covariance matrix estimation
vcovHAC(jour_lm)
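The HAC covariance matrix is typically plugged into coeftest (a sketch):
library(lmtest)     # coeftest (loaded together with AER)
library(sandwich)   # vcovHAC (loaded together with AER)
coeftest(jour_lm, vcov = vcovHAC(jour_lm))   # t-tests based on HAC standard errors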
Exercise:
n <- 200
set.seed(123)
x <- rnorm(n = n, mean = 10, sd = 3)
y <- 5 + 1*x + rnorm(n = n, mean = 0, sd = sqrt(abs(x))) # heteroskedastic DGP
plot(x, y)
reg <- lm(y ~ x)            # fit the regression (needed for the leverage values below)
plot(x, hatvalues(reg))     # leverage values against the covariate
Observations that are extreme w.r.t. their covariate characteristics possess the potential of having a large impact on the estimated regression function.
influence.measures(reg)
Influential observations have a large impact on the estimated regression function (they have leverage and induce unusual behavior of Y).
Cook’s distance:
(manual calculation):
y.hat.minus1 <- predict( # compute fitted values for all rows of dat, when
cook.numerator/cook.denominator
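The manual computation is only fragmentary here; a sketch for observation 1 (a model reg with response y and covariate x in a data frame dat is assumed):
reg          <- lm(y ~ x, data = dat)
reg.minus1   <- lm(y ~ x, data = dat[-1, ])                  # refit without observation 1
y.hat.minus1 <- predict(reg.minus1, newdata = dat)           # fitted values for all rows of dat
cook.numerator   <- sum((fitted(reg) - y.hat.minus1)^2)
cook.denominator <- length(coef(reg)) * summary(reg)$sigma^2 # p * estimated error variance
cook.numerator/cook.denominator                              # compare with cooks.distance(reg)[1]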
library(AER)
summary(reg) # we cannot trust the t-statistics because of the heteroskedasticity => use coeftest with a robust covariance matrix
coeftest(reg, vcov = vcovHC(reg))   # heteroskedasticity-consistent (HC) standard errors
waldtest(reg, vcov = vcovHC(reg))
If we omit a variable that is highly correlated with an included one => omitted variable bias (OVB), i.e. wrong coefficients.
Information criteria: AIC, BIC/SIC – the smaller the better! (or the more negative)
Information criteria depend on a trade-off between fit and penalty: the fit is better for a bigger model, but the penalty is also bigger for a bigger model.
AIC depends on the scale of Y; AIC penalizes additional variables less than SIC/BIC.
AIC(reg1)
BIC(reg1)
Test for nested models
Wald test (F test)
summary(reg.temp)
If H0 is true (x2 and x3 have no impact), we would obtain such a sample with probability 0.121 (12%). This is higher than the significance level => H0 is NOT rejected => indeed, x2 and x3 have no impact on Y.
Functions for tests of linear hypotheses:
1. anova(reg1, reg3) #ANOVA (analysis of variance table)
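Besides anova(), the lmtest and car packages offer equivalent tests (a sketch with the nested models from above; the restriction names are assumptions):
library(lmtest)
waldtest(reg1, reg3)                            # Wald/F-test of the nested restriction
library(car)
linearHypothesis(reg3, c("x2 = 0", "x3 = 0"))   # hypothetical coefficient names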
Exercise:
library(AER)
set.seed(42)
### Estimation
summary(reg3)
### Test
r <- c(300, -1) # bcuz : beta_0 = 300 and beta_1 = -1 + beta_2 + beta_3
#P-value = 0.014 => if the null hypothesis is true, we obtain such a sample with probability 1.4%. This is
# lower than the significance level => H0 is rejected => either beta_0 is not = 300 or beta_1 is not = -1 + beta_2 + beta_3.
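The notes only keep the right-hand-side vector r; a sketch of how it could be used with linearHypothesis (assuming reg3 has an intercept and three slope coefficients):
library(car)   # loaded together with AER
R.mat <- rbind(c(1, 0,  0,  0),    # beta_0 = 300
               c(0, 1, -1, -1))    # beta_1 - beta_2 - beta_3 = -1
linearHypothesis(reg3, hypothesis.matrix = R.mat, rhs = r)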
set.seed(42)
x2 <- runif(n = n, min = 5, max = 15) # Create covariate 2
for(r in 1:R){
  est.id[r, ] <- sample(       # draw the (training) observation indices for replication r
    x = 1:nrow(dat),
    size = ncol(est.id),
    replace = FALSE
  )
}
### Simulation
mse <- matrix(data = NA, nrow = R, ncol = 3, dimnames = list(NULL, paste("reg", 1:3, sep = "")))
for(r in 1:R){
  mse[r, "reg1"] <- mean((dat[-est.id[r, ], "y"] - predict(
    object = lm(formula = y ~ x1, data = dat[est.id[r, ], ]),   # reg1 specification assumed (not kept in the notes)
    newdata = dat[-est.id[r, ], ]
  ))^2)
  mse[r, "reg2"] <- mean((dat[-est.id[r, ], "y"] - predict(
    object = lm(formula = y ~ x1 + x2, data = dat[est.id[r, ], ]),
    newdata = dat[-est.id[r, ], ]
  ))^2)
### Evaluation
mse #MSE for each model in each simulation round #small values are better
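The replication-wise MSEs can then be summarized per specification, e.g.:
apply(X = mse, MARGIN = 2, FUN = mean)   # average out-of-sample MSE per specification
apply(X = mse, MARGIN = 2, FUN = sd)     # stability of each specification across replications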
Task 3. Give a nested model example, where logarithmized or squared covariates are used.
reg1 <- lm(wage ~ educ + exper + tenure + female + west, data = wage1)
summary(reg1)
reg2 <- lm(wage ~ educ + exper + expersq + tenure + tenursq + female + west, data = wage1)
summary(reg2)
#If experience increases by 1 year, the wage increases c. p. by 0.07 US-Dollar (7 US-Cent).
2. Log-log model
reg.log.log <- lm(log(wage) ~ log(exper) + educ, data = wage1)
#If experience increases by 1%, the wage increases ceteris paribus by 0.17 %
3. Log-level model
reg.log.lev <- lm(log(wage) ~ exper + educ, data = wage1)
or coef(reg1)[2]*100
Or just coef(reg1)[2]/12
The transformation only has an effect on the first 2 columns (coefficient and SE), because they are closely related to the scales of x and y, while t-values and p-values remove the scale and thus "standardize" the coefficients.
Standardized coefficients
x1.std <- scale(x1, center = TRUE, scale = TRUE)
The structure of the plot stays the same, but we shift the center of the variable so that its distribution is symmetric around zero.
Due to the standardization we do not need an intercept.
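A minimal sketch of a fully standardized regression (generic names x1, x2, y assumed):
x1.std <- scale(x1, center = TRUE, scale = TRUE)
x2.std <- scale(x2, center = TRUE, scale = TRUE)
y.std  <- scale(y,  center = TRUE, scale = TRUE)
reg.std <- lm(y.std ~ 0 + x1.std + x2.std)   # "0 +" drops the intercept, which is not needed after standardization
summary(reg.std)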
Interactions
library(AER)
table(CPS1988$ethnicity)
We want to know whether the effect of education is different for different ethnicity (cauc vs afam)
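cps_int is not defined in this excerpt; a plausible specification with an education-ethnicity interaction (assumed) is:
cps_int <- lm(log(wage) ~ experience + I(experience^2) + education * ethnicity, data = CPS1988)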
coeftest(cps_int)
Remove the intercept, as it is best replaced by 2 separate intercepts for the 2 levels of ethnicity.
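One way to do this is the nested coding below (a sketch; the names cps_sep and cps_sep_cf follow the notes, the specification is assumed):
cps_sep <- lm(log(wage) ~ ethnicity / (experience + I(experience^2) + education) - 1, data = CPS1988)
cps_sep_cf <- matrix(coef(cps_sep), nrow = 2)   # one row of coefficients per ethnicity level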
cps_sep_cf
The effects of education are similar for both groups but the coef are smaller for afam.
p-value is close to zero => the model with interaction of ethnicity with other variables fits
significantly better than the model without interaction
coef(cps_region)
Interpretation (log-log model): if the price per citation increases by 1%, the number of subscriptions decreases by 0.53%.
Fitted vs residuals plot: large residuals occur at low fitted values and small residuals at high fitted values => the residual variance depends on the fitted values => heteroskedasticity.
plot(log(subs) ~ log(citeprice), data = journals)           # scatter plot of the data
jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)  # fit the log-log model
abline(jour_lm)
y <- log(journals$subs)
X <- model.matrix(jour_lm)
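jour_wls1 is not defined in this excerpt; a WLS sketch with (assumed) weights proportional to 1/citeprice^2:
jour_wls1 <- lm(log(subs) ~ log(citeprice), data = journals, weights = 1/citeprice^2)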
jour_wls1
### Feasible generalized least squares (i.e., WLS with estimated weights)
plot(log(journals$citeprice), resid(jour_lm))
abline(h = 0)
plot(log(journals$citeprice), resid(jour_lm)^2)
# Problem: Negative weights (compare bottomleft part of plot) are not sensible.
# Solution: use log(residuals^2) as the response variable of the auxiliary regression and revert the log() by taking
# exp() afterwards. The exp() ensures that the estimated weights will all be positive.
aux_lm    <- lm(log(residuals(jour_lm)^2) ~ log(citeprice),      # auxiliary regression (object names assumed)
                data = journals)
jour_fgls <- lm(log(subs) ~ log(citeprice), weights = 1/exp(fitted(aux_lm)),
                data = journals)                                  # FGLS fit with the estimated weights
gamma2 <- 0
while(abs((gamma2i - gamma2)/gamma2) > 1e-7) { # convergence criterion (for stopping the while loop)
Exercise:
When the assumption of homoskedastic errors in the original regression model is unlikely to hold.
library(np)
data("wage1")
plot(fitted(reg1), resid(reg1))
# => smaller variation for small and large fitted values than for medium fitted values => heteroskedasticity
bptest(reg1)
#p-value is close to 0 => reject H0 homoskedasticity => we have heteroskedasticity
We can't use a weighted least squares regression directly because we don't know which weights to use => FGLS.
We only know that observations with a large error variance should be weighted down, whereas observations with a low error variance should be made more important (weighted up).
# the squared residuals indicate which observations have a large variance; these are down-weighted
# taking the log (and exp() later) allows for an exponential variance function
# the exp() ensures that the estimated variance (and hence the weights) is always positive
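reg.fgls is not defined in these notes; a minimal FGLS sketch, assuming reg1 regresses wage on educ, exper and tenure (the covariate set is an assumption):
aux.lm   <- lm(log(resid(reg1)^2) ~ educ + exper + tenure, data = wage1)                      # auxiliary variance regression
reg.fgls <- lm(wage ~ educ + exper + tenure, data = wage1, weights = 1/exp(fitted(aux.lm)))   # down-weight high-variance observations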
summary(reg.fgls)
These 2 graphs:
1. plot(wage1$exper, resid(reg1)^2)
abline(h=0)
2. plot(wage1$exper, log(resid(reg1)^2))
abline(h=0)
Code 9 GLM
library(AER)
data("SwissLabor")
SL <- SwissLabor
levels(SL$participation) # the first item in the list becomes the reference value (0 in the dummy variable coding).
### Probit-model
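reg.probit is not defined in this excerpt; a plausible specification (the covariate set, including age squared, is an assumption) is:
reg.probit <- glm(participation ~ . + I(age^2), data = SL, family = binomial(link = "probit"))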
summary(reg.probit)
#measure of fit
Interpretation: the people in the sample have an estimated probability of working that ranges from 1% to 94%; the mean is 46%.
boxplot(fitted(reg.probit) ~ SL$participation)
Confusion matrix:
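tab.probit is not shown in the excerpt; it can be built, for example, by classifying with the default threshold 0.5:
tab.probit <- table(true = SL$participation, predicted = round(fitted(reg.probit)))
tab.probit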
round(tab.probit/nrow(SL)*100, digits = 1)
        0     1
no   38.6  15.4
yes  16.7  29.2
Correctly classified events = 29.2%, correctly classified non-events = 38.6%.
### Logit-model
CCR.cali<- numeric(length(calibrator))
for(cali in 1:length(calibrator)){
calibrator[which.max(CCR.cali)] # we obtain the maximal correct classification rate of 69% at the threshold level 0.47
If the estimated probability that someone works is 47% or larger, then y.hat = 1; otherwise (prob < 47%) y.hat = 0.
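The calibration loop is only fragmentary in the notes; a self-contained sketch (the grid of thresholds is an assumption; a logit fit would work the same way as the probit fit used here):
calibrator <- seq(from = 0.01, to = 0.99, by = 0.01)                     # candidate thresholds
CCR.cali   <- numeric(length(calibrator))
for(cali in 1:length(calibrator)){
  y.hat <- as.numeric(fitted(reg.probit) >= calibrator[cali])            # classify with the current threshold
  CCR.cali[cali] <- mean(y.hat == (as.numeric(SL$participation) - 1))    # correct classification rate
}
calibrator[which.max(CCR.cali)]                                          # threshold with the highest CCR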
Exercise:
data(SwissLabor)
dat$participation
as.numeric(dat$participation)
#Problem: probabilities larger than 1 or smaller than 0 are nonsense and (from a statistical perspective) lead to biased estimates
summary(reg.logit)
#by changing threshold you gain in 1 indicator but lose in the other
### in general
CCR.cali<- numeric(length(calibrator))
for(cali in 1:length(calibrator)){
Y_t = b0 + b1*X_t + b2*Y_{t-1}
b2 captures persistent behavior (memory): if Y grew in the previous periods, it tends to grow now as well.
If residuals are large and positive, we severely underestimated the true values
FDL: Y_t = b0 + b1*X_t + b2*X_{t-1}   vs.   ADL: Y_t = b0 + b1*X_t + b2*Y_{t-1}
As the two models are not nested in each other, we create a model that comprises both => then we can use a nested-model (encompassing) test.
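cons_lm1 and cons_lm2 are not shown in the excerpt; plausible FDL and ADL specifications (assumed, consistent with cons_lmE below) are:
library(dynlm)
data("USMacroG", package = "AER")
cons_lm1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)           # FDL: current and lagged income
cons_lm2 <- dynlm(consumption ~ dpi + L(consumption), data = USMacroG)   # ADL: income and lagged consumption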
encomptest(cons_lm1, cons_lm2)
Best model is cons_lmE <- dynlm(consumption ~ dpi + L(dpi) + L(consumption), data = USMacroG)
Same:
window(UKNonDurables, end = c(1956, 4)) #For subsets in time series we use window instead of []
ts.weights <- c(1/2, rep(1, 11), 1/2)/12 # symmetric weighting of past/future time periods
#rollapply computes running/rolling functions, e.g. a rolling sd; compare the first value of rollapply with the sd of the first window computed by hand
set.seed(1234)
plot(x)
2. Decomposition
decompose the time series into its seasonal structure and the long-term development
dd_dec <- decompose(log(UKDriverDeaths))
dd_stl <- stl(log(UKDriverDeaths), s.window = 13)
3. Exponential Smoothing
Exponential decay in weight: the longer we go back, the lower the weight is (past obs matter less)
# use subset of data for estimation (and predict last two years)
# Estimate Holt Winters-Model (has separate equations for level, trend, and season)
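A sketch of the estimation/prediction split described here (the cut-off point is assumed so that the last two years are held out):
dd      <- log(UKDriverDeaths)
dd_past <- window(dd, end = c(1982, 12))    # estimation subset (leaves out 1983-1984)
dd_hw   <- HoltWinters(dd_past)             # separate equations for level, trend, and season
dd_pred <- predict(dd_hw, n.ahead = 24)     # predict the two held-out years
plot(dd_hw, dd_pred, ylim = range(dd))
lines(dd)                                   # compare prediction with the observed series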
Plot: observed values were lower than predicted ones (structural break, new law, etc)
library("plm")
# Compare:
summary(gr_fe)
# H_0 is rejected, hence: panel structure cannot be ignored as firm dummies are jointly
# significant => different companies behave differently, panel data should not be ignored!
gr_re <- plm(invest ~ value + capital, data = pgr, model = "random", random.method = "walhus")
summary(gr_re)
plmtest(gr_pool) #Lagrange Multiplier Test
# Meaning of hypotheses:
H_0: Both fixed effects and random effects are consistent => use random effects (more efficient)
H_1: Fixed effects is consistent, random effects is not => use fixed effects
Exercise:
encomptest(reg1, reg2)
Task 2
data(UKDriverDeaths)
#seasonality captured
Task 3. Use the Grunfeld data set of the AER-package to answer the following questions:
library(AER)
data(Grunfeld)
table(dat$year)
# b) Perform a pooled estimation, a fixed effects estimation, and a random effects estimation for the
#    regression of invest on value and capital and compare the estimated parameters of the three
#    specifications. Are there huge differences?
library(plm)
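A sketch for part b) (the names reg.fixed and reg.random match the phtest call below; dat is assumed to be the Grunfeld data):
pdat       <- pdata.frame(dat, index = c("firm", "year"))
reg.pool   <- plm(invest ~ value + capital, data = pdat, model = "pooling")
reg.fixed  <- plm(invest ~ value + capital, data = pdat, model = "within")
reg.random <- plm(invest ~ value + capital, data = pdat, model = "random")
cbind(pooled = coef(reg.pool)[-1], fixed = coef(reg.fixed), random = coef(reg.random)[-1])   # compare slope estimates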
# => Hausman-Test
phtest(reg.fixed, reg.random)
p-value = 0.1376
# => use Random Effects specification here, as it is more efficient than Fixed Effects.