Regression in R

Programming in R, statistics, regressions

Fitted values – predicted response

Generic functions compute something different depending on the class of the object passed to them
(e.g. summary())
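A minimal illustration of this dispatch (a sketch; the objects here are made up):

x <- rnorm(50)
reg <- lm(x ~ 1)
summary(x) # numeric vector -> quartiles and mean
summary(reg) # "lm" object -> coefficient table, R^2, F-statistic
class(x); class(reg) # the class determines which summary method is called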

When one covariate has a larger variance (standard deviation) than another, the estimated coefficient
for that covariate has a smaller standard error than the estimated coefficient for the other covariate (all else equal).
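A quick simulation check of this claim (a sketch; the DGP below is made up for illustration):

set.seed(1)
n <- 200
x1 <- rnorm(n, sd = 1) # smaller variance
x2 <- rnorm(n, sd = 5) # larger variance
y <- 1 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))$coefficients[, "Std. Error"] # SE for x2 is smaller than SE for x1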

Adjusted R^2 penalizes for the degrees of freedom used (number of estimated parameters)

Type 1 error: rejection of a true null hypothesis.
Type 2 error: non-rejection of a false null hypothesis.

If H0 is correct, then we would obtain such a test result (such a data set) with probability equal to the p-value

H0: none of the coefficients is different from 0, i.e. neither x1 nor x2 has a significant impact on Y

p-value close to 0 (e.g. 0.2%): if x1 and x2 truly had no impact on Y, we would obtain such a sample with only 0.2% probability

for alpha = 0.05 --> 95 % confidence interval

Code 3 CI and PI (confidence and prediction intervals)

Ordinary t-test:

 If I only look at 1 estimated parameter at a time, the confidence interval for the pair of parameters corresponds to the purple
dashed square => uncertainty
 If I look at more than 1 parameter at a time, the joint confidence region – the confidence
ellipse (like an F-test) – covers a much smaller area than the square

Confidence interval for fixed parameters (beta) accounts for estimation uncertainty only.

Beta is a true coef, a true intercept (it is fixed in the population, we don’t know it, it’s not random)

Y is always a random variable, as it depends on the error term u (not fixed).

Prediction interval (for random quantities, like y) – accounts not just for estimation uncertainty but
also for inherent randomness (the fact that, on average, each observation does not lie on
the regression line but is roughly one standard error of the regression above or below it).
Confidence interval (red dashes) is smaller than prediction interval (blue)

n <- 10 # Number of observations

set.seed(42)

x <- runif(n = n, min = 5, max = 15) # Create covariate

u <- rnorm(n = n, mean = 0, sd = 1) # Create error term

y <- 10 + 2*x + u # Create dependent variable

reg <- lm(y ~ x)

confint(reg)
2.5 % 97.5 %
(Intercept) 3.956109 11.05414
x 1.958848 2.56930

The 95% confidence intervals are (3.96; 11.05) for b0 and (1.96; 2.57) for b1.

x.p <- c(1, 11) # one potential row of the covariate matrix X

reg <- lm(y ~ x)

model.matrix(reg)

x.p.new <- data.frame(1, x = x.p[2]) # new data point with x = 11

(y.p <- predict(object = reg, newdata = x.p.new)) # point prediction

(y.p_CI <- predict(object = reg, newdata = x.p.new, interval = "confidence")) # confidence interval

(y.p_PI <- predict(object = reg, newdata = x.p.new, interval = "prediction")) # prediction interval
Exercise:

library(AER)

data(CPS1988)

cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

summary(cps_lm)

#Years of experience have a positive effect on wage, but this effect flattens out (the coefficient on experience^2 is negative)

CI_of_all <- confint(cps_lm, level=0.95) #Confidence interval for all covariates

edu <- CI_of_all[4,] #Confidence interval for edu

eth <- CI_of_all[5,] #Confidence interval for eth

Whenever the horizontal zero line lies within the interval => the coefficient is not significant

Task 3. We want a surface that covers the range of possible values that can be inserted for covariates

dat <- CPS1988[, 1:4]

reg <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = dat)

colnames(dat)

range(dat$education)

range(dat$experience) # experience = age - education - 6

table(dat$ethnicity) #ethnicity is a factor


### New data # choose 3 people

dat.new <- dat[1:3, ]

dat.new[1, ] <- list(NA, 12, 40, "cauc")

dat.new[2, ] <- list(NA, 15, 20, "afam")

dat.new[3, ] <- list(NA, 18, 0, "afam")

### Prediction

prd <- predict(object = reg, newdata = dat.new, interval = "prediction", level = 0.95)

#predicted wage for these 3 new people (attention: the fitted values look strange because the response is log(wage)!)

exp(prd[ ,1]) #to transform into dollars

Code 4 Prediction performance

1. Option 1: arbitrary cut – first 80% of observations for training, last 20% for testing. Problem: no random
assignment into groups and not the whole data set is used for estimation, so results may be biased.
2. Option 2: randomly assign some observations to the training set and the rest to the test set. Problem: still,
not the whole data set is used for estimation.
3. Option 3: repeat approach 2 many times.

Task 1

library(np)

data(wage1)

dat <- subset(wage1[wage1$west == 1, ], select = c(wage, educ, female))

n <- nrow(dat)

R <- 100 # number of replications

perc <- 80 # percentage used for estimation

n.est <- round(n*perc/100)

#empty result matrix

ASEP.output <- matrix(data = NA, nrow = R, ncol = 2, dimnames = list(NULL, c("reg1", "reg3")))

id.val <- array(NA, c(R, n - n.est))


set.seed(42)

for(r in 1:R){

sam <- sample(1:n, replace = FALSE) # random permutation of all observation indices

id.val[r, ] <- sam[(n.est+1):n] # the last 20% are held out for validation in replication r

}

#which of these models is better in terms of prediction performance?

### Computation

(start.time <- Sys.time())

for(r in 1:R){

reg1.est <- lm(formula = wage ~ educ + factor(female), data = dat[-id.val[r, ], ])

reg1.fit.val <- predict(reg1.est, newdata = dat[id.val[r, ], ])

ASEP.output[r, "reg1"] <- mean((reg1.fit.val - dat[id.val[r, ], "wage"])^2)

reg3.est <- lm(formula = wage ~ educ + factor(female) + educ:factor(female) + I(educ^2), data = dat[-id.val[r, ], ])

reg3.fit.val <- predict(reg3.est, newdata = dat[id.val[r, ], ])

ASEP.output[r, "reg3"] <- mean((reg3.fit.val - dat[id.val[r, ], "wage"])^2)

}

(end.time <- Sys.time())

difftime(end.time, start.time, units = "secs")

### Evaluation (some examples)

### General figures -> are always useful

apply(X = ASEP.output, MARGIN = 2, FUN = mean) # reg3 is better (ASEP is smaller) but...

apply(X = ASEP.output, MARGIN = 2, FUN = sd) # ... reg3 has a higher standard deviation

apply(X = ASEP.output, MARGIN = 2, FUN = median) # equivalent results for median and

apply(X = ASEP.output, MARGIN = 2, FUN = IQR) # interquartile range


### Comparison per replication

### -> see, whether one specification performs poorly for some of the replications

sum(ASEP.output[, "reg1"] < ASEP.output[, "reg3"]) # In 31 (of 100) replications, reg1 is better,

sum(ASEP.output[, "reg1"] > ASEP.output[, "reg3"]) # in 69 replications, reg3 is better

plot(ASEP.output)

abline(c(0, 1))

# Points below the diagonal indicate that, for the corresponding replications, reg1 has a larger ASEP than
# reg3 (i.e., reg3 is better), while points above the diagonal indicate superior performance of reg1.

Code 5 MC Simulation

Nominal size (= alpha): the probability of rejecting a correct H_0 (alpha is the upper limit for this
probability). The probability of a type-1 error is determined by alpha = P(type-1 error).

H_0: b1 = 0

Actual (empirical) size: percentage (share) of replications in which the correct H_0 is rejected.

Probability of type-2 error is determined by beta.

Power (= 1 - beta): the probability of rejecting a false H_0, i.e. the probability of not committing a type-2
error.

Power of the test is affected by:

 Sample size N
 Significance level (alpha)
 “true” value of tested parameter (beta)

### This means we have to generate different settings. In order to judge the size of a test, H_0 has to
be true.

### In order to judge the power of a test, H_0 has to be false.

### Important point: for both cases (size and power) we need to have the probability of rejecting
H_0.

 Type-1 error "false positive" – we find a statistically significant difference although there is none.
alpha = P(type-1 error) = P(reject H_0 | H_0 correct)
A test with a 95% confidence level means there is a 5% chance of a type-1 error
(determined by alpha).
 Type-2 error "false negative" – H_0 is false and we fail to reject it (determined by beta).
beta = P(type-2 error) = P(do not reject H_0 | H_0 false)

t-statistic = estimated coefficient / standard error

R <- 100

alpha.seq <- c(0.01, 0.05, 0.1)

alpha <- alpha.seq[2]

n.seq <- c(10, 20, 50)

n <- n.seq[2]

b1.seq <- seq(from = -1, to = 1, by = 0.1)

reject.H_0 <- array(data = NA, dim = c(R, length(b1.seq)), dimnames = list(NULL, b1.seq))

set.seed(42)

(start.time <- Sys.time())

for(b1 in b1.seq){

for(r in 1:R){

x <- rnorm(n = n, mean = 10, sd = 3) # For random x (fixed x e.g. in experiments).

y <- 5 + b1*x + rnorm(n = n, mean = 0, sd = 1) # define DGP

reg <- lm(y ~ x)

t.hat <- coef(reg)[2]/sqrt(vcov(reg)[2, 2])

reject.H_0[r, as.character(b1)] <- t.hat < qnorm(p = alpha/2, mean = 0, sd = 1) |
  t.hat > qnorm(p = 1 - alpha/2, mean = 0, sd = 1) # is H_0 rejected?

}

}

(end.time <- Sys.time())

difftime(end.time, start.time, units = "secs")

colSums(reject.H_0) # how often is H_0 rejected?


When b1 is actually 0 (H_0 is true), H_0 is rejected only 4 times out of 100 => the empirical size is 4%, close to the nominal size of 5%.

colSums(reject.H_0)/R #relative number

plot(b1.seq, colSums(reject.H_0)/R, ylim = c(0, 1), type = "l", lwd = 3)

Rejection frequency plot

# shows the power of the test for b1 != 0. At b1 = 0 it shows the empirical size of the test
(nominal size 5%)

How power depends on alpha and the sample size

#Power is the ability of a test to detect a false H0 (the probability of rejecting H0 given it is indeed false).
Power increases with the number of observations (larger n => higher power)

Trade-off between type-1-error and type-2-error

 Trade-off between type-1 error and type-2 error: increasing alpha lowers the type-2 error probability but raises the type-1 error probability


 A larger significance level (alpha) means a higher rejection probability
 If H_0 is not correct, then choosing a smaller alpha increases beta => lower power (1 - beta)
As long as we do not violate the standard linear model assumptions, the empirical sizes are close to the nominal sizes.

There should be no heteroskedasticity (e.g. the absolute error growing with the covariate).

Exercise:

#situation with heteroskedasticity:

The standard errors of the slope parameters are biased, so the t-test is biased and the power function
misbehaves: the farther b1 is from 0, the higher the rejection probability should be, yet here there
are regions where it decreases again.
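A minimal way to build such a heteroskedastic setting into the DGP of the simulation above (a sketch; only the error term inside the loop changes, n and b1 come from the simulation settings):

x <- rnorm(n = n, mean = 10, sd = 3)
y <- 5 + b1*x + rnorm(n = n, mean = 0, sd = sqrt(abs(x))) # error SD grows with the covariate instead of sd = 1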

Trade-off between type-1-error and type-2-error (with heteroskedasticity):


Power is lower (compared to the homoskedastic case). The test does not seem to be correctly sized
(more replications might help to see this more clearly).

Code 6 Regression diagnostics

Is a certain model correctly specified or not? Check the validity (fit) of the model against the observations.

Assumptions of Correctly specified linear model:

 Error assumptions are not violated: no heteroskedasticity (errors are homoskedastic, i.e. the error
variance is constant across all observations) and no autocorrelation
 Correct functional form
 Appropriate covariates that explain relationship between x and Y

If the error assumptions are violated => t-statistics and p-values are biased

Mississippi, Alaska and Washington DC have the potential to leverage the regression line (high
impact) because their income values (X) are very far away from the mean income.
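The plots discussed below come from a regression of state expenditure on income; a sketch of how ps_lm is presumably set up, assuming the PublicSchools data of the AER package:

library(AER)
data("PublicSchools")
ps <- na.omit(PublicSchools) # drop the one observation with a missing value
ps$Income <- ps$Income / 10000 # rescale income
ps_lm <- lm(Expenditure ~ Income, data = ps)
par(mfrow = c(2, 2)); plot(ps_lm); par(mfrow = c(1, 1)) # the diagnostic plots discussed below
plot(ps$Income, hatvalues(ps_lm)) # leverage against the covariate
abline(h = c(1, 3) * mean(hatvalues(ps_lm)), col = "red") # reference lines at 1x and 3x the mean hat value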
Residuals vs fitted (top left plot): judges homoskedasticity of error term and model
misspecification

Residuals = 0 line corresponds to the regression line (estimated)

Characteristics of a well-behaved plot:

 Residuals bounce randomly around the zero line => reasonable assumption of linearity
 Residuals form a horizontal band around the zero line => variances of error terms are equal
 There is no residual that stands out from random pattern => no outliers
Normal Q-Q (top right plot): checks the normality assumption. Points are close to the diagonal, but for very
small or very large quantiles the empirical quantiles deviate from the theoretical ones.

Main diagonal value of hat matrix - leverage

Red reference lines are drawn at 1x and at 3x the mean hat-matrix diagonal value

Alaska and Washington have a very large leverage, have potential to severely change regression
line

Hat matrix main diagonal values (leverage values) against covariate values

U-shaped relationship: the closer an observation's income is to the mean income, the lower its hat-matrix
main-diagonal value.

influence.measures(ps_lm) # Influence measures

Influence measures (columns of the output):

1. Impact on the estimated intercept (dfbeta of the intercept)
2. Impact on the slope coefficient b (dfbeta of the slope)
3. Impact on the regression fit (dffit)
4. Impact on the covariance matrix of the estimated parameters (covratio)
5. Cook's distance
6. Hat-matrix main-diagonal value (leverage)
7. Flag showing whether the observation is influential (*)

Whenever we remove these obs, the regression changes a lot

Cook's distance is a rescaled measure of influence:

 The larger the leverage potential (hat value h_ii), the larger Cook's distance
 The larger the residual u.hat_i is in absolute value, the larger Cook's distance
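Both effects can be read off the usual formula for Cook's distance (with k estimated parameters, residual variance estimate s^2, residual u.hat_i and hat value h_ii):

$$D_i = \frac{\hat{u}_i^{\,2}}{k\, s^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$$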

cooks.distance(ps_lm)

Testing for homoskedasticity of errors (H_0)

Plot Fitted vs residuals

plot(fitted(jour_lm), resid(jour_lm))

abline(h = 0)

Observations with fitted values < 5 have a larger residual variance than those with fitted values > 5 => the error
variance differs across fitted values => the homoskedasticity assumption is violated!

Breusch-Pagan Test and White Test -> both are tests of homoskedasticity (null hypothesis)

-> BP-Test-Equation: regress residuals on all covariates

-> White-Test-Equation: regress residuals on all covariates and all covariate interactions

 bptest(jour_lm) # Breusch-Pagan Test

 bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2), data = journals) #White Test


 gqtest(jour_lm, order.by = ~ citeprice, point = 0.5, data = journals) #Goldfeld-Quandt Test

H0 of all three tests: the errors are homoskedastic.
All 3 tests: p-value is close to 0 => reject H0 => we have heteroskedasticity => the standard errors (and hence the t-statistics) are biased.

Possible solutions:

 Weighted least squares


 Different cov matrix

Testing for misspecification of the functional form f()

H0: both coef = 0 => no severe misspecification

 resettest(jour_lm, power = 2:3) ### REgression Specification Error Test

p-value = 0.24 => H0 is NOT rejected => no severe misspecification

 raintest(jour_lm, order.by = ~ age, data = journals) ### Rainbow Test

p-value close to 0 => H0 is rejected => severe misspecification

 harvtest(jour_lm, order.by = ~ age, data = journals) ###Harvey-Collier Test

p-value close to 0 => H0 is rejected => severe misspecification

Misspecification: Age is omitted, age should be included in the regression

Rainbow and Harvey tests check whether omitted variables have impact on Y

Testing for (no) autocorrelation (in the errors)

Serial correlation – the relationship between a variable and a lagged version of itself over time
(the relationship between the current value and its past values)

Dynamic linear model
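The object consump1 is not defined in these notes; a plausible definition, consistent with the FDL model used in Code 10, would be:

library(dynlm)
data("USMacroG", package = "AER")
consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG) # assumption: the model the tests below refer to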

 dwtest(consump1) ### Durbin-Watson Test

H0: the residuals from an OLS regression are not autocorrelated

p-value close to 0 => reject H0 => true autocorrelation is greater than 0 => serial correlation

 Box.test(residuals(consump1), type = "Ljung-Box") ### Ljung-Box Test or Box-Pierce Test

H0: white noise error terms (no serial correlation)

p-value is close to 0 => reject H0 => autocorrelation

Test distributions

 The more degrees of freedom the t-distribution has (i.e. the bigger the sample), the closer it is to the normal
distribution
 N (0, 1) – normal density – black line
 Different t-distributions: 10 df, 5 df, 1 df
 1 df deviates the most from N (0, 1)

Exact test and asymptotic test. Often exact t-test distribution is replaced by asymptotic equivalent
(normal density)

Asymptotic, exact, and estimated test distribution (under H_0)

 dnorm: distribution of the t-statistic under H0 (asymptotic, standard normal)
 dt: distribution of the t-statistic under H0 when we conduct the exact test with the given number of observations
 boot: density of 999 bootstrap values – we take the slope estimate from each of the 999 bootstrap samples
and subtract the corresponding original estimate
 Asymptotic version and exact test nearly coincide
 If we have enough obs => can use st. normal distr

Bootstrap – resampling with replacement from the given sample – used for estimating a distribution.
We create bootstrap samples of the same size as the original sample. Example: with only one sample of
n observations we can compute a single t-statistic; if we need the distribution of that t-statistic,
bootstrapping creates many samples artificially and recomputes the statistic on each.

Exact or bootstrap? It is always better to use the exact test if it exists and the corresponding
assumptions are fulfilled.
If we are not sure about the exact distribution, it is better to use the bootstrap. The bootstrap is based on the model, so if
the model is misspecified or the data is not representative => do not use the bootstrap.
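A minimal pairs-bootstrap sketch for the slope of the journals regression (assuming the journals data and the log(subs) ~ log(citeprice) model used elsewhere in these notes; B and boot.slope are names chosen here):

B <- 999
boot.slope <- numeric(B)
set.seed(42)
for(b in 1:B){
id <- sample(1:nrow(journals), replace = TRUE) # resample rows with replacement
boot.slope[b] <- coef(lm(log(subs) ~ log(citeprice), data = journals[id, ]))[2]
}
hist(boot.slope, breaks = 30) # estimated sampling distribution of the slope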

Some solutions (if classical linear model assumptions are violated)

 vcovHC(jour_lm) ### HC covariance


 coeftest(jour_lm, vcov = vcovHC) ### t-Test (HC)
 ps_lm <- lm(Expenditure ~ Income, data = ps)
ps_lm2 <- lm(Expenditure ~ Income + I(Income^2), data = ps)
waldtest(ps_lm, ps_lm2, vcov = vcovHC(ps_lm2, type = "HC4")) # Wald-Test (HC)

H0: coef = 0
p-value = 0.77 => H0 is NOT rejected => income^2 is not significant
 Heteroskedasticity- and autocorrelation consistent (HAC) covariance matrix estimation
vcovHAC(jour_lm)

Exercise:
n <- 200
set.seed(123)
x <- rnorm(n = n, mean = 10, sd = 3)
y <- 5 + 1*x + rnorm(n = n, mean = 0, sd = sqrt(abs(x))) # heteroskedastic DGP

plot(x, y)

dat <- data.frame(y, x)


reg <- lm(y ~ x, data = dat)

Leverage observations – observations with large hat-matrix diagonal values

plot(x, hatvalues(reg))

w.r.t. their covariate-characteristics, they possess the potential of having a large impact on the
estimated regression function

influence.measures(reg)

Influential observations have a large impact on the estimated regression function (they have a
leverage + induce unusual behavior of Y)

Cook’s distance:

 influence.measures(reg)$infmat[1, ] # first row


 cooks.distance(ps_lm)

(manual calculation):

y.hat.minus1 <- predict( # compute fitted values for all rows of dat, when
  update(reg, data = dat[-1, ]), # the first row is omitted from estimation
  newdata = dat
)

cook.numerator <- sum((fitted(reg) - y.hat.minus1)^2)

cook.denominator <- length(coef(reg))*sum(resid(reg)^2)/(n - length(coef(reg)))

cook.numerator/cook.denominator

library(AER)

summary(reg) #we cannot trust the t-statistics because there is heteroskedasticity => use coeftest

coeftest(reg)

coeftest(reg, vcov = vcovHC) #care for heteroskedasticity

coeftest(reg, vcov = vcovHAC) #care for heteroskedasticity + autocorrelation

waldtest(reg)

waldtest(reg, vcov = vcovHC)

waldtest(reg, vcov = vcovHAC)

Code 7 Comparing model, AIC, BIC

Underspecification is worse than overspecification => omitted variable bias (OMVB)

If we omit a variable that is strongly correlated with the included covariates => OMVB (biased coefficients)

Overspecification => more uncertainty, standard errors are too large

R^2 should not be used for model comparison

R^2 adj penalizes to a small extent

Information criteria: AIC, BIC/SIC – the smaller (or more negative), the better!
Information criteria depend on a trade-off between fit and penalty: fit is better for a bigger model,
but the penalty is also bigger for a bigger model.
AIC depends on the scale of Y; AIC penalizes additional variables less than SIC/BIC.
AIC(reg1)
BIC(reg1)
Test for nested models
Wald test (F test)
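The manual computation below implements the standard F-statistic for the linear restrictions H_0: R*beta = r, with q restrictions and k estimated parameters:

$$\hat{F} = \frac{(R\hat{\beta} - r)'\,[R(X'X)^{-1}R']^{-1}\,(R\hat{\beta} - r)/q}{SSR_{H_1}/(n-k)}$$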

reg1 <- lm(y ~ x1) # underspecified

reg2 <- lm(y ~ x1 + x2) # correctly specified


reg3 <- lm(y ~ x1 + x2 + x3) # overspecified

reg.temp <- reg3

summary(reg.temp)

beta.hat <- coef(reg.temp)
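The restriction matrix R and vector r are used below but are not shown in these notes; one choice consistent with the H_0 discussed afterwards (x2 and x3 have no impact) would be:

R <- rbind(c(0, 0, 1, 0), # picks beta_x2
           c(0, 0, 0, 1)) # picks beta_x3
r <- c(0, 0) # H_0: beta_x2 = 0 and beta_x3 = 0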

ifelse( # check, whether ncol(R)==length(beta.hat)


test = ncol(R) != length(beta.hat),
yes = "Attention: nCol in R does not fit the number of elements in beta.hat",
no = "correct"
)
X.temp <- model.matrix(reg.temp)
XtX.inv <- solve(t(X.temp) %*% X.temp)
SSR.H1 <- sum(resid(reg.temp)^2)

df.H0H1 <- nrow(R)


df.H1 <- n - length(beta.hat)
F.hat <- ((t(R %*% beta.hat - r) %*% solve(R %*% XtX.inv %*% t(R)) %*% (R %*% beta.hat - r))/df.H0H1) /
  (SSR.H1 / df.H1)

(p.F.hat <- 1 - pf(q = F.hat, df1 = df.H0H1, df2 = df.H1))
# 0.1214998

If H0 is true (x2 and x3 have no impact), we would obtain such a sample with probability 0.121 (12%). This is
higher than the significance level => H0 is NOT rejected => x2 and x3 indeed have no significant impact on Y.
Functions for tests of linear hypotheses:
1. anova(reg1, reg3) #ANOVA (analysis of variance table)

2. linearHypothesis(model = reg3, test = "F", hypothesis.matrix = R, rhs = r)


Interpretation: p-value is greater than 0.05 => cannot reject H0 => the additional covariates are not significant

Exercise:

library(AER)

compound (joint) hypothesis test

H_0: beta_0 = 300 and beta_1 = -1 + beta_2 + beta_3

n <- 20 # number of observations

x1 <- 1:n # Create covariate 1

set.seed(42)

x2 <- runif(n = n, min = 5, max = 15) # Create covariate 2

u <- rnorm(n = n, mean = 0, sd = 1) # Create error term

y <- 300 - 0.5*x1 + 0.1*x2 + u # Create dependent variable

x3 <- rnorm(n = n, mean = 10, sd = 3) # Create covariate 3 which is not present in y

### Estimation

reg3 <- lm(y ~ x1 + x2 + x3)

summary(reg3)

### Test

R.row1 <- c(1, 0, 0, 0) # beta_0

R.row2 <- c(0, 1, -1, -1) # beta_1 - beta_2 - beta_3

R <- rbind(R.row1, R.row2)

r <- c(300, -1) # bcuz : beta_0 = 300 and beta_1 = -1 + beta_2 + beta_3

linearHypothesis(model = reg3, test = "F", hypothesis.matrix = R, rhs = r)

#p-value = 0.014 => if the null hypothesis were true, we would obtain such a sample with probability 1.4%. This is
lower than the significance level => H0 is rejected => either beta_0 is not 300 or beta_1 is not -1 +
beta_2 + beta_3.

Simulation to compare the prediction performance

n <- 20 # number of observations

est.perc <- 60 # percentage of observations in estimation subsample

R <- 10 # number of replications

### DGP and data set

x1 <- 1:n # Create covariate 1

set.seed(42)
x2 <- runif(n = n, min = 5, max = 15) # Create covariate 2

u <- rnorm(n = n, mean = 0, sd = 1) # Create error term

y <- 300 - 0.5*x1 + 0.1*x2 + u # Create dependent variable

x3 <- rnorm(n = n, mean = 10, sd = 3) # Create covariate 3 which is not present in y

dat <- data.frame(y, x1, x2, x3)

### Create estimation id subsamples

est.id <- matrix(data = NA, nrow = R, ncol = floor(est.perc/100*n))

for(r in 1:R){

est.id[r, ] <- sample(
  x = 1:nrow(dat),
  size = ncol(est.id),
  replace = FALSE
)

}

### Simulation

mse <- matrix(data = NA, nrow = R, ncol = 3, dimnames = list(NULL, paste("reg", 1:3, sep = "")))

for(r in 1:R){

mse[r, "reg1"] <- mean((dat[-est.id[r, ], "y"] - predict(
  object = lm(formula = y ~ x1, data = dat[est.id[r, ], ]),
  newdata = dat[-est.id[r, ], ]))^2)

mse[r, "reg2"] <- mean((dat[-est.id[r, ], "y"] - predict(object = lm(formula = y ~ x1 + x2, data = dat[est.id[r, ], ]), newdata = dat[-est.id[r, ], ]))^2)

mse[r, "reg3"] <- mean((dat[-est.id[r, ], "y"] - predict(object = lm(formula = y ~ x1 + x2 + x3, data = dat[est.id[r, ], ]), newdata = dat[-est.id[r, ], ]))^2)

}

### Evaluation

mse #MSE for each model in each simulation round #small values are better

apply(X = mse, MARGIN = 2, FUN = mean) #column mean

#correctly specified model is the best (lowest MSE)

#play around with the parameters (n, est.perc and R):

 If you increase the sample size, the results should improve
 The estimation percentage should not matter much, as long as you have enough observations in both the
estimation sample and the validation sample
 The number of replications only changes the sharpness of the results (the more replications, the more detail)

Task 3. Give a nested model example, where logarithmized or squared covariates are used.

reg1 <- lm(wage ~ educ + exper + tenure + female + west, data = wage1)

summary(reg1)

reg2 <- lm(wage ~ educ + exper + expersq + tenure + tenursq + female + west, data = wage1)

summary(reg2)

# => reg1 is nested in reg2:

# take the formula of reg2, set beta_expersq=0 and

# beta_tenursq=0 and you will obtain model reg1.

# Hence, you can for example perform an ordinary F-test:


anova(reg1, reg2)
p-value = 1.294e-07 => reject H0 => expersq and tenursq have an impact on wage

Code 8 Weighted least squares, FGLS


1. Level-level model
reg.lev.lev <- lm(wage ~ exper + educ, data = wage1)
summary(reg.lev.lev)

#If experience increases by 1 year, the wage increases c. p. by 7 US cents (0.07 US dollars).
2. Log-log model
reg.log.log <- lm(log(wage) ~ log(exper) + educ, data = wage1)

#If experience increases by 1%, the wage increases ceteris paribus by 0.17 %
3. Log-level model
reg.log.lev <- lm(log(wage) ~ exper + educ, data = wage1)

#If experience increases by 1 year, the wage increases c.p. by 1 % (=0.010347*100).


4. Level-log model
reg.lev.log <- lm(wage ~ log(exper) + educ, data = wage1)

#If experience increases by 1%, wage increases c.p. by 1 US-Cent (=1.07841US-$/100).

If Y is in logs, multiply the coefficient by 100 -> interpret the change in Y in %!

If X is in logs, divide the coefficient by 100 -> interpret the change in X in %!

A covariate with a non-linear impact

As soon as we include a non-linear relationship, we allow for a non-constant partial effect (not a straight line;
the slope changes).
A cubic relationship allows for one inflection point.
reg.quad <- lm(wage ~ exper + I(exper^2) + educ, data = wage1)

We cannot interpret b1 or b2 separately => interpret them jointly! See below.


Interpretation:
1. For the mean education value, how the wage changes with experience: the wage is highest at ca. 29
years of experience.
2. The first partial derivative of wage w.r.t. experience (a slope of about -0.005 at the sample mean):
the longer we are in the firm, the smaller the wage gain from each additional year (see the sketch below).
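A short sketch of this joint interpretation using the estimated coefficients of reg.quad (names as returned by coef()):

b.hat <- coef(reg.quad)
b.hat["exper"] + 2*b.hat["I(exper^2)"]*wage1$exper # marginal effect d wage / d exper at each observation
-b.hat["exper"] / (2*b.hat["I(exper^2)"]) # experience level at which the wage peaks (ca. 29 years)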
Transformation of response variable (let’s look at EUR and not $)
reg1 <- lm(wage ~ educ + exper, data = wage1)
If 1 Euro is 1.06 US-Dollar, we could use 2 methods:
 Replace reg1 by reg2
wage1$wage.euro <- wage1$wage/1.06
reg2 <- update(object = reg1, formula = wage.euro ~ .)
coef(reg2)
 Divide coef of reg1 by 1.06
coef(reg1)/1.06

reg3 <- lm(I(100*wage) ~ exper, data = wage1) # wage now in US cents

or multiply the coefficient of interest directly, e.g. coef(reg1)["exper"]*100

Linear Transformation of covariate (let’s look at days and not years)

 Replace reg1 by reg3


wage1$exper.days <- wage1$exper*365
reg3 <- update(object = reg1, formula = . ~ . - exper + exper.days)
coef(reg3)["exper.days"]
 Divide coef of reg1 by 365
coef(reg1)["exper"]/365

Do it inside of the formula:

reg1 <- lm(wage ~ exper, data = wage1)


exper 0.03072 # Interpretation: 1 additional year gives a $0.03 increase in wage
reg2 <- lm(wage ~ I(exper*12), data = wage1) # exper now in months
I(exper * 12) 0.0025602 # each additional MONTH gives about 0.26 US cents in wage

Or just coef(reg1)[2]/12
Such transformations only affect the first two columns of the summary (coefficient and standard error), because
these depend on the scales of x and y. The t-values and p-values are unaffected, since they remove the scale and
"standardize" the coefficients.

Standardized coefficients
x1.std <- scale(x1, center = TRUE, scale = TRUE)
The structure of the plot stays the same, but the variable is centred at zero and scaled to unit standard deviation.
Due to the standardization we don't need an intercept.
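The other standardized variables used below would be constructed the same way (a sketch; x2 and y are the second covariate and the response of this example):

x2.std <- scale(x2, center = TRUE, scale = TRUE)
y.std <- scale(y, center = TRUE, scale = TRUE)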

reg1 <- lm(y ~ x1 + x2)

reg.std2 <- lm(y.std ~ x1.std + x2.std - 1)

(beta.hat.reg1 <- coef(reg1)[2:3])

beta.hat.reg1 * c(sd(x1), sd(x2)) / sd(y)

(beta.hat.reg.std2 <- coef(reg.std2))

 If x1 increases by one SD, then y decreases on average c.p. by 0.3225 SD.


 If x2 increases by one SD, y increases on average c.p. by 0.7188 SD

Interactions

library(AER)

data(CPS1988) # another wage data set

table(CPS1988$ethnicity)

cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)

We want to know whether the effect of education is different for different ethnicity (cauc vs afam)

cps_int <- lm(log(wage) ~ experience + I(experience^2) + education * ethnicity, data = CPS1988)

*this means: education + ethnicity + education:ethnicity*

coeftest(cps_int)

Separate regressions for each level of ethnicity

Remove the common intercept, as it is best replaced by 2 separate intercepts for the 2 levels of ethnicity

cps_sep <- lm(log(wage) ~ ethnicity / (experience + I(experience^2) + education) - 1, data = CPS1988)

cps_sep_cf <- matrix(coef(cps_sep), nrow = 2)

rownames(cps_sep_cf) <- levels(CPS1988$ethnicity)


colnames(cps_sep_cf) <- names(coef(cps_lm))[1:4]

cps_sep_cf

The effects of education are similar for both groups but the coef are smaller for afam.

anova(cps_sep, cps_lm) #compare 2 models

p-value is close to zero => the model with interaction of ethnicity with other variables fits
significantly better than the model without interaction

Change of the reference category

CPS1988$region <- relevel(CPS1988$region, ref = "south") #make "south" a reference category

cps_region <- lm(log(wage) ~ ethnicity + education + experience + I(experience^2) + region, data = CPS1988)

coef(cps_region)

Interpretation: with "south" as the reference category, each region coefficient gives the approximate percentage wage difference of that region relative to the south (e.g. a coefficient of 0.04 means roughly 4% higher wages than in the south).

Weighted least squares

Homoskedasticity: var(u) = σ^2
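Under heteroskedasticity the error variance differs across observations, var(u_i) = sigma_i^2. Weighted least squares then minimizes a weighted residual sum of squares, ideally with weights proportional to 1/sigma_i^2:

$$\hat{\beta}_{WLS} = \arg\min_{\beta} \sum_i w_i\,(y_i - x_i'\beta)^2, \qquad w_i \propto 1/\sigma_i^2$$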

jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)

Interpretation: log-log model – if the price per citation increases by 1%, the number of subscriptions decreases by
0.53%.

Fitted vs. residuals plot: observations with low fitted values have large residuals (in absolute value), those with
high fitted values have small residuals => heteroskedasticity.

WLS deals with heteroskedasticity


If the slope of the line was 0 => homoskedasticity. Here we see that variance of residuals^2 grows
with log(x) => heteroskedasticity.

### comparing weighting of observations

jour_wls1 <- lm(
  log(subs) ~ log(citeprice),
  data = journals,
  weights = 1/citeprice^2 # is equivalent to dividing every row in y and X by citeprice
)

jour_wls2 <- lm(
  log(subs) ~ log(citeprice),
  data = journals,
  weights = 1/citeprice # is equivalent to dividing every row in y and X by sqrt(citeprice)
)

plot(log(subs) ~ log(citeprice), data = journals)

abline(jour_lm)

abline(jour_wls1, lwd = 2, lty = 2)

abline(jour_wls2, lwd = 2, lty = 3)

legend("bottomleft", c("OLS", "WLS1", "WLS2"), lty = 1:3, lwd = 2, bty = "n")


### Own computation of WLS

y <- log(journals$subs)

X <- model.matrix(jour_lm)

weight.obs <- 1/journals$citeprice # yields results of jour_wls1

# weight.obs <- 1/sqrt(journals$citeprice) # yields results of jour_wls2

y.wght <- y * weight.obs

X.wght <- X * weight.obs # Note: intercept is also transformed!

(beta.hat.wght <- solve(t(X.wght) %*% X.wght) %*% t(X.wght) %*% y.wght)

jour_wls1

### Feasible generalized least squares (i.e., WLS with estimated weights)

### Intuition of methodology

plot(log(journals$citeprice), resid(jour_lm))

abline(h = 0)

plot(log(journals$citeprice), resid(jour_lm)^2)

abline(lm(resid(jour_lm)^2 ~ log(journals$citeprice))) # estimate coefficients of a straight line to determine the weights of the observations

# Problem: Negative weights (compare bottomleft part of plot) are not sensible.

# To avoid negative weights, we use log(resid(jour_lm)^2) instead of resid(jour_lm)^2

# as response variable of the auxiliary regression and revert the log() by taking

# exp() afterwards. The exp() ensures that estimated weights will all be positive.

### Auxiliary regression (Estimate weights)

auxreg <- lm(
  log(residuals(jour_lm)^2) ~ log(citeprice),
  data = journals
)

### FGLS (i.e., estimated weights)

jour_fgls1 <- lm(
  log(subs) ~ log(citeprice),
  weights = 1/exp(fitted(auxreg)), # exp ensures positive variance estimates
  data = journals
)

### Iterated FGLS

gamma2i <- coef(auxreg)[2]

gamma2 <- 0

while(abs((gamma2i - gamma2)/gamma2) > 1e-7) { # convergence criterion (for stopping the while loop)

gamma2 <- gamma2i

fglsi <- lm(log(subs) ~ log(citeprice), data = journals, weights = 1/citeprice^gamma2)

gamma2i <- coef(lm(log(residuals(fglsi)^2) ~ log(citeprice), data = journals))[2]

}

jour_fgls2 <- lm(log(subs) ~ log(citeprice), data = journals, weights = 1/citeprice^gamma2)

coef(jour_fgls2) # Iterated FGLS coefficients are close to original FGLS coefficients.


(Intercept) log(citeprice)
4.7758234 -0.5007952

Exercise:

Explain a typical application case of FGLS.

When the assumption of homoskedastic errors in the original regression model is unlikely to hold.

library(np)

data("wage1")

reg1 <- lm(wage ~ exper, data = wage1)

plot(fitted(reg1), resid(reg1))

# => smaller variation for small and large fitted values than for intermediate fitted values =>
heteroskedasticity

bptest(reg1)
#p-value is close to 0 => reject H0 (homoskedasticity) => we have heteroskedasticity

We can't use weighted least squares directly, as we don't know which weights to use => FGLS

We only know that observations with large variance should be weighted down, whereas observations with low variance
are made more important (weighted up).

### FGLS – estimation of the weights

### Step 1: Estimate initial regression model

summary(reg1) #the standard errors (and t-statistics) are biased due to heteroskedasticity!

### Step 2: regress function of residuals of Step 1 on covariates.

reg.aux <- update(object = reg1, formula = log(resid(reg1)^2) ~ .)

 #squaring the residuals gives an estimate of each observation's error variance; observations with large variance are down-weighted later
 #the log allows for an exponential variance function
 #exp() later ensures that the estimated variance is always positive

### Step 3: estimate FGLS model

reg.fgls <- update(object = reg1, weights = 1/exp(fitted(reg.aux)))

summary(reg.fgls)

summary(reg1) # estimated coefficients have also changed

These 2 graphs:

1. plot(wage1$exper, resid(reg1)^2)
abline(h=0)
2. plot(wage1$exper, log(resid(reg1)^2))
abline(h=0)

Code 9 GLM

Generalized linear models – allow for a non-linear transformation (link function) between the linear predictor and the mean of the dependent variable


Models for binary dependent variables

library(AER)

data("SwissLabor")

SL <- SwissLabor

levels(SL$participation) # the first item in the list becomes the reference value (0 in the dummy
variable space).

### Probit-model

Our objective: to model the probability that Y = 1

reg.probit <- glm(participation ~ . + I(age^2), data = SL, family = binomial(link = "probit"))

summary(reg.probit)

age 2.07530 #with higher age, the probability of labor force participation is higher (at least initially; age also enters quadratically)

#measure of fit

summary(fitted(reg.probit)) # always nonnegative


Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01043 0.28060 0.47034 0.46066 0.61825 0.94161

Interpretation: people in the sample have estimated participation probabilities between 1% and 94%; the mean is 46%.

boxplot(fitted(reg.probit) ~ SL$participation)

#Box plot: fitted probability > 50% => classify as y = 1, fitted probability < 50% => classify as y = 0

Confusion matrix:

tab.probit <- table(SwissLabor$participation, as.numeric(fitted(reg.probit)>0.5))

round(tab.probit/nrow(SL)*100, digits = 1)
0 1
no 38.6 15.4
yes 16.7 29.2
Correctly classified events = 29.2%, correctly classified non-events = 38.6%.

(CCR.probit <- sum(diag(tab.probit))/nrow(SL)) # correct classification ratio

correct classification ratio = 29.2% + 38.6% = 67.8%

### Logit-model

reg.logit <- glm(participation ~ . + I(age^2), data = SL, family = binomial(link = "logit"))

Linear probability model vs probit/logit

reg.lpm <- lm(as.character(participation) == "yes" ~ . + I(age^2), data = SL)

summary(fitted(reg.lpm)) #nonsense: fitted probabilities can be higher than 1 or lower than 0

#slightly better correct classification ratio

Model calibration (i.e., analyse the threshold values)

If 2 levels of y have severely different number of obs

The calibrator is the threshold used for classifying an observation as y = 1 or y = 0.

reg.temp <- reg.probit

calibrator <- seq(from = 0.01, to = 0.99, by = 0.01)

CCR.cali<- numeric(length(calibrator))

for(cali in 1:length(calibrator)){

tab.temp <- table(SwissLabor$participation, as.numeric(fitted(reg.temp) > calibrator[cali]))

CCR.cali[cali] <- sum(diag(tab.temp))/nrow(SL)

}

plot(calibrator, CCR.cali, type = "l", lwd = 3, ylim = c(0, 1))

abline(v = 0.5, h = 0.5)

CCR.cali[which.max(CCR.cali)] #max correct classification is 69%

calibrator[which.max(CCR.cali)] #the maximum correct classification ratio of 69% is reached at the threshold 0.47

If we have estimated probability that someone works of 47% or larger, then y.hat = 1, otherwise if
prob < 47% => y.hat = 0.
Exercise:

data(SwissLabor)

dat <- SwissLabor

dat$participation

as.numeric(dat$participation)

dat$participation <- as.numeric(dat$participation) - 1 # recoding to avoid estimation issues

reg.lpm <- lm(participation ~ income, data = dat)

#Fitted values correspond to estimated probability for Y=1, i.e.: P.hat(Y=1 | X)

#Problem: probabilities larger than 1 or smaller than 0 are nonsense and (from a statistical
perspective) lead to biased estimates

sum(fitted(reg.lpm) < 0) #no estimated prob below 0

sum(fitted(reg.lpm) > 1) #only 1 estimated probability exceeds 1

# => not a severe problem for the given data set.

reg.logit <- glm(participation ~ . + I(age^2), data = dat, family = binomial(link = "logit"))

summary(reg.logit)

#education is not significant => remove edu variable

reg.logit <- glm(participation ~ . + I(age^2) - education, data = dat, family = binomial(link = "logit"))

summary(reg.logit)

#is this a good model? look at confusion matrix

thres <- 0.5 #model calibration

(tab.logit.50 <- table(dat$participation, as.numeric(fitted(reg.logit)>thres)))


#258 events and 335 non-events are correctly classified (the diagonal counts behind sensitivity and specificity)

#correct classification ratio = (258+335)/number of people

#by changing the threshold you gain in one of the two counts but lose in the other

### in general

reg.temp <- reg.logit

calibrator <- seq(from = 0.01, to = 0.99, by = 0.01)

CCR.cali<- numeric(length(calibrator))

for(cali in 1:length(calibrator)){

tab.temp <- table(dat$participation, as.numeric(fitted(reg.temp) > calibrator[cali]))

CCR.cali[cali] <- sum(diag(tab.temp))/nrow(dat)

}

plot(calibrator, CCR.cali, type = "l", lwd = 3, ylim = c(0, 1))

abline(v = 0.5, h = 0.5)

Code 10 Time series, panel data, encompassing test

library(dynlm) #dynamic linear models

# Finite distributed lag (FDL) model

cons_lm1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)


Y_t = b0 + b1*X_t + b2*X_(t-1)

L() takes the first lag of a variable (previous period)

# Autoregressive distributed lag (ADL) model

cons_lm2 <- dynlm(consumption ~ dpi + L(consumption), data = USMacroG)

Y_t = b0 + b1*X_t + b2*Y_(t-1)

 b2 captures persistent behavior
 memory – if consumption grew during the previous periods, it tends to grow now as well
If the residuals are large and positive, we severely underestimated the true values

Encompassing test – for cross-section and panel data

We want to test models against each other

FDL: Y_t = b0 + b1*X_t + b2*X_(t-1) vs. ADL: Y_t = b0 + b1*X_t + b2*Y_(t-1)

As neither model is nested in the other, we create a model that comprises both => then
we use a nested-model test

cons_lmE <- dynlm(consumption ~ dpi + L(dpi) + L(consumption), data = USMacroG)

encomptest(cons_lm1, cons_lm2)

 M1 vs ME p-value close to zero => H0 rejected => M1 is misspecified


 M2 vs ME p-value close to zero => H0 rejected => M2 is misspecified

Both models are misspecified, as all covariates in ME are relevant

 M1 vs ME : difference is Yt-1. H0 that corresponding coef = 0 is rejected => Yt-1 is relevant


 M2 vs ME : difference is Xt-1. H0 that corresponding coef = 0 is rejected => Xt-1 is relevant

Best model is cons_lmE <- dynlm(consumption ~ dpi + L(dpi) + L(consumption), data = USMacroG)

tsp(UKNonDurables) ### Time series properties
# 1955.00 1988.75 4.00
# (start, end, number of seasons per year)

Same:

 start(UKNonDurables) # first observation: year and season


 end(UKNonDurables) # last observation: year and season
 frequency(UKNonDurables) # number of seasons (quarterly data)

window(UKNonDurables, end = c(1956, 4)) #For subsets in time series we use window instead of []

3 examples of linear smoothing

1. Moving average smoother

We want to get rid of the monthly effect: the smoothed version shows the long-term development of the data
and neglects the seasonal peaks.

ts.weights <- c(1/2, rep(1, 11), 1/2)/12 # symmetric weighting of past/future time periods

ts.filter <- filter(UKDriverDeaths, ts.weights)

plot(UKDriverDeaths) # plot the raw series first, so that lines() can add the smoothed version

lines(ts.filter, col = 2, lwd = 2)

plot(rollapply(UKDriverDeaths, 12, sd))

#rollapply computes running/rolling functions, e.g. the SD; compare the first value of rollapply with

sd(window(UKDriverDeaths, end = 1969.917))

### Function filter() can also be used for data generation

set.seed(1234)

x <- filter(rnorm(100), 0.9, method = "recursive")

 #here AR = 0.9, high memory (persistence)


 #if AR = 1 => random walk, none of the past errors is forgotten

plot(x)

2. Decomposition
decompose TS into seasonal structure and long term development
dd_dec <- decompose(log(UKDriverDeaths))
dd_stl <- stl(log(UKDriverDeaths), s.window = 13)

# the size of the grey bars indicates the relative importance of each component on the Y scale

#trend – the series with the seasonal component removed

# remainder – the random part – anything that cannot be attributed to trend or season

3. Exponential Smoothing

Exponential decay in weight: the longer we go back, the lower the weight is (past obs matter less)
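For simple exponential smoothing the level equation makes this decay explicit (Holt-Winters adds analogous equations for trend and season); up to the initial value:

$$\hat{y}_{t+1} = \alpha\, y_t + (1-\alpha)\,\hat{y}_t = \sum_{j \ge 0} \alpha(1-\alpha)^j\, y_{t-j}$$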

dd_past <- window(UKDriverDeaths, end = c(1982, 12))

# use subset of data for estimation (and predict last two years)

dd_hw <- HoltWinters(dd_past)

# Estimate Holt Winters-Model (has separate equations for level, trend, and season)

dd_pred <- predict(dd_hw, n.ahead = 24) # Predict next two years

Plot: observed values were lower than predicted ones (structural break, new law, etc)

Panel data models – collection of TS


library("plm")

data("Grunfeld", package = "AER")


gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM"))

pgr <- pdata.frame(gr, index = c("firm", "year"))

### Pooled model, i.e. ignore panel information

Pooling ignores the firm-specific effects and treats them as part of the random error term

gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")

# Compare:

reg1 <- lm(invest ~ value + capital, data = pgr) #same coef

### Fixed effects estimation

gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")

summary(gr_fe)

reg2 <- lm(invest ~ value + capital + factor(firm), data = pgr)

pFtest(gr_fe, gr_pool) # fixed effects relevant?

# H_0: firm fixed effects are all equal to 0

# H_0 is rejected, hence: panel structure cannot be ignored as firm dummies are jointly
significant => different companies behave differently, panel data should not be ignored!

anova(reg1, reg2) # yields same result

### Random effects estimation

gr_re <- plm(invest ~ value + capital, data = pgr, model = "random", random.method = "walhus")

summary(gr_re)
plmtest(gr_pool) #Lagrange Multiplier Test

# Are the individual (random) effects relevant? p-value < 2.2e-16 => yes

### Fixed vs. Random Effects

 Fixed effects model is more robust (requires less rigid assumptions)


 Random effects model is more efficient (has typically lower standard errors, lower
variance) but less robust

phtest(gr_re, gr_fe) #Hausman test

# Meaning of hypotheses:

 H_0: Fixed and random effects are consistent => use random effects
 H_1: Fixed effects is consistent, random effects not => use fixed effects

# here: do not reject p-value = 0.98 => use Random effects


# if H0 is rejected => use Fixed effects

Exercise:

reg1 <- lm(wage ~ exper + expersq + tenure, data = wage1)

reg2 <- lm(wage ~ exper + tenure + tenursq, data = wage1)

encomptest(reg1, reg2)

# p-Value(M1 vs. ME) = 0.039

# => H_0: beta_tenursq=0 can be rejected on a 5% level

# => ME seems to be better than M1 (alias reg1)

# p-Value(M2 vs. ME) = 0 (at least very close to 0)

# => H_0: beta_expersq=0 can be rejected on any significance level

# => ME seems to be better than M2 (alias reg2)

# Overall message: do neither use reg1, nor reg2.

Task 2

data(UKDriverDeaths)

dat <- UKDriverDeaths


plot(window(dat, start = c(1979, 1))) #last 6 years of data

hw80 <- HoltWinters(window(dat, end = c(1980, 12)))

#exp smoothing - up to year 1980, month Dec

lines(predict(hw80, n.ahead = 48), col = "red") #predict next 48 months

#seasonality captured

hw80 #alpha - overall level #beta – trend #gamma - seasonal effect

hw81 <- HoltWinters(window(dat, end = c(1981, 12))) #add 1 more year

lines(predict(hw81, n.ahead = 36), col = "darkorange")

hw82 <- HoltWinters(window(dat, end = c(1982, 12))) #add 2 more years

lines(predict(hw82, n.ahead = 24), col = "gold")

hw83 <- HoltWinters(window(dat, end = c(1983, 12)))

lines(predict(hw83, n.ahead = 12), col = "purple")

#the negative trend captured

Task 3. Use the Grunfeld data set of the AER-package to answer the following questions:

# a) Is this a balanced data set? Why?

library(AER)

data(Grunfeld)

dat <- Grunfeld

table(dat$year)

# => yes, as we observe 11 firm-observations for each of 20 years.

# b) Perform a pooled estimation, a fixed effects estimation, and a random effects estimation for the
regression of invest on value and capital and compare the estimated parameters of the three
specifications. Are there huge differences?
library(plm)

pgr <- pdata.frame(dat, index = c("firm", "year"))

reg.pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")

reg.fixed <- plm(invest ~ value + capital, data = pgr, model = "within")

reg.random <- plm(invest ~ value + capital, data = pgr, model = "random")

# beta.hat(value) very similar across specifications

# beta.hat(capital) much smaller in the pooled regression (probably misspecified)

#use the FE or RE models, because pooling ignores the panel structure

# c) Which one to use? FE or RE?

# => Hausman-Test

phtest(reg.fixed, reg.random)
p-value = 0.1376

# H0 is not rejected => Random Effects specification is not rejected.

# => use Random Effects specification here, as it is more efficient than Fixed Effects.
