Statistical Learning in R

Sensible: if we have at least 4 obs for each parameter

Code 2

Standardization

Multivariate: moves the point cloud to the coordinate origin, scales the variance of each variable to 1, and removes the correlations between variables. Used for cluster analysis.

spec <- eigen(solve(cov(dat))) # spectral decomposition of inverted covariance matrix

Q <- spec$vectors # matrix of eigenvectors

D <- diag(spec$values) # diagonal matrix with eigenvalues

SIGMA.hat.inv.sqrt <- Q %*% sqrt(D) %*% t(Q) # square root of inverted SIGMA.hat

# for first row of dat

dat.row<- dat[1, ]

Z.row <- t(SIGMA.hat.inv.sqrt %*% (

t(dat.row) - matrix(apply(X = dat, MARGIN = 2, FUN = mean), nrow = ncol(dat), ncol = 1)

))

# for all rows of dat

dat.ms <- t(SIGMA.hat.inv.sqrt %*% (

t(dat) - matrix(apply(X = dat, MARGIN = 2, FUN = mean), nrow = ncol(dat), ncol = nrow(dat))

))

apply(X = dat.ms, MARGIN = 2, FUN = mean) # mean = 0 (apart from rounding)

apply(X = dat.ms, MARGIN = 2, FUN = var) # variance = 1

cor(dat.ms) # correlation = 0

Univariate: centers each variable at the origin (removes the mean) and scales its variance to 1; the correlations are left unchanged. Used for PCA.

dat.us <- scale(x = dat, center = TRUE, scale = TRUE) # center = TRUE subtracts the mean, scale = TRUE divides by the standard deviation

apply(X = dat.us, MARGIN = 2, FUN = mean) # mean = 0 (apart from rounding)

apply(X = dat.us, MARGIN = 2, FUN = var) # variance = 1

cor(dat.us) # correlation unchanged

cor(dat)

Box-Cox transformation

Aim: make the distribution more symmetric around the median (e.g. of the dependent variable), closer to a Normal distribution, i.e. reduce skewness. The Box-Cox transformation should only be used for continuous, strictly positive variables.
#Task: Check the skewness of these variables and try to find a Box-Cox-transformed version of either of these variables that is less skewed.

boxcox <- function(x, lambda){if(lambda != 0){(x^lambda - 1)/lambda} else {log(x)}}

skew <- function(x){

quart <- quantile(x, probs = 1:3/4)

skew.coef <- as.numeric(((quart[3] - quart[2]) - (quart[2] - quart[1]))/(quart[3] - quart[1]))

return(skew.coef)

}

co <- 2 # which column of dat (i.e., which variable to analyze/transform)?

skew(x = dat[, co]) # current skewness

lambda.seq <- seq(from = -2, to = 2, by = 0.1)

skew.seq <- rep(NA, times = length(lambda.seq))

for(pos in 1:length(lambda.seq)){

skew.seq[pos] <- skew(x = boxcox(x = dat[, co], lambda = lambda.seq[pos]))

}

skew.seq # skewness for the Box-Cox-transformed variable w.r.t. different lambdas

plot(lambda.seq, skew.seq)

abline(h = 0)

min(abs(skew.seq)) # skewness value that is closest to zero

which.min(abs(skew.seq)) # position of skewness value that is closest to zero

lambda.seq[which.min(abs(skew.seq))] # lambda that yields skewness value that is closest to zero

Skewness
> skew(x = wage1$wage) #not transformed
[1] 0.26

quantile(wage1$wage, probs = 1:3/4)

If the difference between the upper quartile and the median is bigger than the difference between the median and the lower quartile, the distribution is right-skewed.
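A quick numerical check of this quartile rule (a minimal sketch on a simulated right-skewed sample, not the wage data; it reuses the skew() function defined above):

set.seed(1)

x.sim <- rexp(n = 1000) # strongly right-skewed sample

quart <- quantile(x.sim, probs = 1:3/4) # lower quartile, median, upper quartile

(quart[3] - quart[2]) > (quart[2] - quart[1]) # TRUE: upper gap is larger => right-skewed

skew(x = x.sim) # the quartile skewness coefficient is positive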
Graphs for not-transformed and transformed:

par(mfrow = c(1, 3))

var.temp <- wage1$wage

hist(var.temp, freq = FALSE, breaks = 20, col = "lightgrey", main = paste("Untransformed => Skewness:", round(skew(var.temp), digits = 2)))

lines(density(var.temp), col = "royalblue", lwd = 3)

box()

lambda.temp <- lambda.seq[which.min(abs(skew.seq))]

var.temp <- boxcox(x = wage1$wage, lambda = lambda.temp)

hist(var.temp, freq = FALSE, breaks = 20, col = "lightgrey", main = paste("Lambda=", lambda.temp, " => Skewness: ", round(skew(var.temp), digits = 2), sep = ""))

lines(density(var.temp), col = "royalblue", lwd = 3)

box()

1. Peak is on the left => right-skewed (skewness 0.26 > 0)


2. Lambda = 0.5, gives skewness = 0.
3. Lambda = -1.4. Skewness = -0.17 < 0 => left-skewed (peak on the right)

Resampling

Goal: obtain an estimate for the out-of-sample test error.


Split the data into 2 parts: one for estimation, the other (hold-out) for prediction. Two sampling schemes (see the sketch below):
1. With replacement – bootstrap – artificially creates more data
2. Without replacement – cross-validation (CV)
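For intuition, a minimal sketch of the two sampling schemes on toy indices (hypothetical, not the course data):

set.seed(1)

sample(x = 1:10, size = 10, replace = TRUE) # bootstrap-style: some indices appear several times, others not at all

sample(x = 1:10, size = 10, replace = FALSE) # CV-style: a pure reshuffling, every index appears exactly once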

install.packages("ISLR")
install.packages("boot")

Validation set approach

set.seed(42)

train <- sample(x = 392, size = 196, replace = FALSE)

lm.fit <- lm(mpg ~ horsepower, data = dat, subset = train)

mean((dat$mpg - predict(object = lm.fit, newdata = dat))[-train]^2) # MSE for test set observations

#MSE – mean sq errors – measure of prediction performance – the lower the better

### Leave-one-out cross-validation (LOOCV)

glm.fit <- glm(mpg ~ horsepower, data = dat) # same model as above, refitted via glm() so that cv.glm() can be used

cv.err <- cv.glm(

data = dat,

glmfit = glm.fit,

K = nrow(dat)

)

cv.err$delta
24.23151 – raw LOOCV estimate of the MSE of prediction; 24.23114 – bias-adjusted estimate

Polynomial: e.g. i = 3 => mpg ~ HP + HP^2 + HP^3

cv.error <- rep(0, 5)

names(cv.error)<- paste("degree", 1:5, sep = "")

for(i in 1:5){

glm.fit <- glm(mpg ~ poly(horsepower, i), data = dat)

cv.error[i] <- cv.glm(data = dat, glmfit = glm.fit, K = nrow(dat))$delta[1]

}

cv.error
degree1 degree2 degree3 degree4 degree5
24.23151 19.24821 19.33498 19.42443 19.03321

Interpretation: the model where horsepower enters only linearly performs clearly worse than the models that also include higher-order terms (degree >= 2).

### K-fold cross-validation – split the sample into K equally sized folds; estimate on K-1 folds and predict the remaining fold (e.g. with K = 10: 9 folds for estimation, 1 fold for prediction), rotating through all folds. Protection against overfitting. Typically K = 5 or 10.

set.seed(42) #

cv.error.10 <- rep(x = 0, times = 13) #

names(cv.error.10) <- paste("degree", 1:13, sep = "") #


for(i in 1:13){ #

glm.fit <- glm(mpg ~ poly(horsepower, i), data = dat) #

cv.error.10[i] <- cv.glm(data = dat, glmfit = glm.fit, # 10-fold CV

K = 10)$delta[1] #

} #

cv.error.10

plot(x = 1:13, y = cv.error.10, type = "b")

Interpretation: there is little difference across the polynomial orders as long as i >= 2; the model with only the linear term (i = 1) has a much higher error than i = 2.

2-fold cross-validation

fold.set <- c(rep(x = 1, times = floor(nrow(wage1)/2)),

rep(x = 2, times = nrow(wage1) - floor(nrow(wage1)/2)))

set.seed(12345)

fold.id <- sample(x = fold.set, size = length(fold.set), replace = FALSE)

# fold.id contains the shuffled ids, i.e. which row of wage1 belongs to fold 1 or fold 2.

# Estimate reg1 with observations of fold1, ...

reg1.fold1 <- update(object = reg1, data = wage1[fold.id == 1, ])

#create predictions for the fold2-observations and compute the mean squared error (of prediction).

MSE.reg1.fold1 <- mean((wage1[fold.id == 2, "wage"] - predict(object = reg1.fold1, newdata = wage1[fold.id == 2, ]))^2)

reg1.fold2 <- update(object = reg1, data = wage1[fold.id == 2, ])

MSE.reg1.fold2 <- mean((wage1[fold.id == 1, "wage"] - predict(object = reg1.fold2, newdata = wage1[fold.id == 1, ]))^2)

Bootstrap – resampling with replacement - used for estimating a distribution (e.g. of a t statistic)

set.seed(12345)

Bootstrap.sample1 <- sample(x = 1:nrow(wage1), size = nrow(wage1), replace = TRUE)

Validation.sample1 <- (1:nrow(wage1))[-unique(Bootstrap.sample1)]

#Bootstrap.sample1 contains the row indices of the observations of wage1 that should be part of the first bootstrap sample. Due to the sampling with replacement, some of the numbers {1, 2, ..., 525, 526} appear more than once in Bootstrap.sample1, while some numbers do not appear at all. The latter (hold-out) ids (in Validation.sample1) identify the rows of wage1 that can be used to create predictions, after the models are estimated using the rows of wage1 identified by Bootstrap.sample1. This approach is repeated B times.

alpha.fn <- function(data, index){

X <- data$X[index] # asset 1


Y <- data$Y[index] # asset 2

return((var(Y) - cov(X,Y)) / (var(X) + var(Y) - # calculation (see example ISL, p.187)

2*cov(X,Y))) #

} #

alpha.fn(data = dat.p, index = 1:100) # estimated weight alpha of the minimum-variance two-asset portfolio

set.seed(42) #

alpha.fn(data = dat.p, index = sample( x = 100, size = 100, # use different subsets of the data

replace = TRUE)) # (generated by sampling with replacement)

boot(data = dat.p, statistic = alpha.fn, R = 1000)

boot() output: original – the estimate from the full sample; bias – mean of (bootstrap statistic – original estimate); std. error – uncertainty of the original estimator: the larger the SE, the more spread out the (bootstrap) distribution.
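The reported bias and standard error can be reproduced from the bootstrap replicates stored in the boot object (a sketch, assuming the boot() call above is stored as boot.out):

boot.out <- boot(data = dat.p, statistic = alpha.fn, R = 1000)

boot.out$t0 # original: alpha.fn evaluated on the full sample

mean(boot.out$t[, 1]) - boot.out$t0 # bias: mean of the bootstrap statistics minus the original estimate

sd(boot.out$t[, 1]) # bootstrap standard error of the estimate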

Code 3

Supervised learning – primary interest is Y

Unsupervised learning – every column is equally important

PCA – principal component analysis – focuses on the columns. Aim: reduce the dimension (work with a smaller number of columns) using a spectral decomposition. It should be performed on the smaller of the two dimensions.

spec <- eigen(cor(dat)) # spectral decomposition of the correlation matrix, i.e. a principal component analysis. Sum of eigenvalues always = number of variables.

Method 1 to choose number of PCs (% of data explained):

alt1 <- data.frame(

num.pc = 1:ncol(dat),

perc.expl = round(cumsum(spec$values)/sum(spec$values)*100, digits = 1)

)
# 18 PCs in total. The first PC explains 13.6% of the dependence structure of the 18 variables; 3 PCs explain only about one third of it; 9 PCs (half of the original dimension) would explain about 75%.

Method 2: scree plot – choose the most pronounced kink to determine the number of PCs (here 3 PCs) -> we decrease the number of columns from 18 to only 3.

plot(

x = 1:ncol(dat),

y = spec$values,

type = "b"

)

From the class:

dat.us <- scale(dat, center = TRUE, scale = TRUE) #univariate standardization

di <- dist(x = dat.us, method = "manhattan") #manhattan distance

clan <- hclust(d = di, method = "average") #average linkage

plot(clan)

#most similar obs

di.matrix <- as.matrix(dist(x = dat.us, method = "manhattan", diag = TRUE, upper = TRUE))

diag(di.matrix) <- NA

which(di.matrix == min(di.matrix, na.rm = TRUE), arr.ind = TRUE)

#decided for 3 clusters

y <- cutree(tree = clan, k = 3)

table(y) # show how many obs in each of the 3 clusters

dat[y == 3, ] #see obs in cluster 3

#to characterize different clusters w.r.t. different parameters

quantile(dat$exper, probs = 0:10/10)

quantile(dat$educ, probs = 0:10/10)

# using the correlation matrix ensures that the variables are comparable and prevents the variable with the largest variance from driving the PCA (default setting of prcomp: variables are centered)
pr.out <- prcomp(dat, scale = TRUE)

pr.out$center # mean of variables prior to centering

pr.out$scale # stdev of variables prior to standardization

pr.out$rotation

Interpretation:

 If UrbanPop is very high, PC 2 is very negative; if Murder is high, PC 2 is also high.
 PC 1 can be read as security: PC 1 is higher whenever Murder/Assault/Rape are small.

Plot:

biplot(pr.out, scale = 0, col = c("black", "royalblue"), main = "first two PCs of USArrests data set")

pr.out$sdev # stdev explained by each PC

(pr.var <- pr.out$sdev^2) # variance explained by each PC (divide by 4 to get the share)

(pve <- pr.var/(sum(pr.var))) # share of variance explained by each PC

Interpretation: PC 1 explains 62% of the dependence structure among all 4 variables

Plot of the variance explained by each PC and the cumulative variance explained, against the number of PCs

par(mfrow = c(1,2), mar = c(5,4,2,2) + .1, oma = c(0,0,4,0))

plot(pve, ylim = c(0,1), xlab = "principal component", ylab = "share of variance explained", type = "b",
pch = 20, xaxt = "n", yaxt = "n")

axis(side = 1, at = 1:4) #

axis(side = 2, at = seq(from = 0, to = 1, by = .2)) # left plot: screeplot

plot(cumsum(pve), ylim = c(0,1), type = "b", pch = 20, xlab = "number of principal components", xaxt
= "n", ylab = "cum. share of variance explained", yaxt = "n") #

axis(side = 1, at = 1:4) #

axis(side = 2, at = seq(from = 0, to = 1, by = .2)) #

title("variance explained vs. principal components", outer = "TRUE") #

par(mfrow = c(1,1)) #
Interpretation: 2 PCs are enough to explain 80% of the dependence structure among all 4 variables

summary(pr.out) # shows stdev, share of variance explained and cumulative share of variance
explained for all PCs (= min(obs., number of variables))

plot(pr.out)

Cluster analysis – focus on rows. Aim: check if there is dependence structure between obs in the
rows, find groups of obs. Obs are homogeneous within a cluster. Intended for metric variables only
(no binary)!

 Euclidean distance: air distance


 Manhattan: traffic distance – absolute distance

* Euclidean results = Manhattan results when there is only one covariate
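A quick illustration of this point (a sketch with a hypothetical one-column data matrix):

x1 <- matrix(c(1, 4, 9), ncol = 1) # a single covariate

dist(x1, method = "euclidean") # pairwise distances 3, 8, 5

dist(x1, method = "manhattan") # identical, since both reduce to |x_i - x_j| in one dimension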

3 methods of hierarchical clustering:

1. Complete linkage: use the maximum distance (too pessimistic)


2. Average linkage: distance = average of all pairwise distances (can also suffer from chain building)
3. Single linkage: use the minimum distance (too optimistic). Disadvantage: tends to suffer from chain building instead of compact clusters.
 dist(dat) #Distance - Judge similarity of obs
 dist(dat, method = "manhattan")
 dist(dat, method = "euclidean")

hc.complete <- hclust(d = dist(X), method = "complete") #Euclidean distance with complete linkage

plot(hc.complete, main = "complete linkage", xlab = "", sub = "", cex = .9) #plot

k <- 2

abline(h = (hc.complete$height[length(hc.complete$height) - (k-1)] + hc.complete$height[length(hc.complete$height) - (k-2)])/2, col = "royalblue") # add cut

hc.clusters <- cutree(tree = hc.complete, k = 2) #divide observations into 2 classes

table(hc.clusters, dat) #show frequency table of the 2 classes

Interpretation: breast cancer – 2 obs are in cluster 1, 3 obs in cluster 2 and 2 obs in cluster 4. That the same cancer type is assigned to different clusters is not a problem, given that the chosen data explain only …% of the dependence structure.
Code 4

install.packages("caret")

predictionErrors <- observed - predicted

squPredErrors <- predictionErrors^2

cor(predicted, observed)^2 # R^2

mean(squPredErrors) # MSE = bias^2 + variance

sqrt(mean(squPredErrors)) # RMSE

RMSE(predicted, observed) # RMSE (caret)

Interpretation of RMSE: On average, we have a prediction error of … units w.r.t. the scale of Y.

Measure of in-sample fit: R^2

Measure of out-of-sample fit and prediction performance: MSE
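To make the snippets above runnable, a minimal sketch with hypothetical observed/predicted vectors (RMSE() is caret's helper):

library(caret)

observed <- c(2.1, 3.4, 5.0, 4.2, 6.3) # hypothetical hold-out responses

predicted <- c(2.0, 3.9, 4.5, 4.6, 6.0) # hypothetical predictions

predictionErrors <- observed - predicted

squPredErrors <- predictionErrors^2

c(R2 = cor(predicted, observed)^2, MSE = mean(squPredErrors), RMSE = sqrt(mean(squPredErrors)))

RMSE(pred = predicted, obs = observed) # same RMSE via caret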

 error term: for y = f(x) + u or y = b0 + b1*x + u, u is everything that cannot be explained by x
 fitted value: if we estimate the model, we obtain y.hat = b0.hat + b1.hat*x
 residual: if we estimate the model, we obtain y = y.hat + u.hat, i.e. u.hat = y - y.hat
 prediction: if we use the estimated formula for new covariate values x.new, we obtain y.prd = b0.hat + b1.hat*x.new
 prediction error: if we have new observations (y.new, x.new), we can compare the observed and predicted values via u.prd = y.new - y.prd (see the sketch below)
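A small sketch tying these objects together (simulated data, not one of the course data sets):

set.seed(1)

x <- rnorm(100); y <- 1 + 2*x + rnorm(100) # y = b0 + b1*x + u

fit <- lm(y ~ x)

y.hat <- fitted(fit) # fitted values b0.hat + b1.hat*x

u.hat <- y - y.hat # residuals, identical to resid(fit)

x.new <- data.frame(x = c(-1, 0, 1)) # new covariate values

y.prd <- predict(fit, newdata = x.new) # predictions b0.hat + b1.hat*x.new

y.new <- 1 + 2*x.new$x + rnorm(3) # new observations

u.prd <- y.new - y.prd # prediction errors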

Variance-bias trade-off

 Small models tend to be Underfitted: high bias, low variance


 Big models suffer from Overfitting: low bias, high variance (sensitivity, less robustness)

#Bias-Variance-Trade-off for the k-Nearest Neighbor-Method

#With a larger number of neighbors K the fit becomes smoother and more robust (lower variance), but the bias increases.

#Underfitting means the model does not fit or predict the (training) data very well. Overfitting, on the other hand, means that the model predicts the (training) data too well; it is too good to be true. If a new data point comes in, the prediction may be wrong. Normally, underfitting implies high bias and low variance, and overfitting implies low bias but high variance.

#Bias is the difference between the true label and our prediction, and variance is the expectation of
the squared deviation of a random variable from its mean.
At 20 partitions we get the minimum MSE. At first, due to the lower bias, we achieve a better MSE, but then the growth of the variance overcompensates the reduction in bias.

Code 5

Validating the trained model uses data that were not used for estimation; this is done to better estimate how the model will perform on new data and to assess whether the model is overfitted.

Collinearity – leads to very large variances and standard errors of the coefficient estimates.

lmValues1 <- data.frame(obs = solTestY, pred = lmPred1)

defaultSummary(lmValues1)

R^2 – squared correlation between the hold-out observations and the predicted values; it is lower than the in-sample measure (0.94).

Avoid relying on in-sample measures when judging prediction performance.

c(summary(lmFitAllPredictors)$sigma, summary(lmFitAllPredictors)$r.squared) # in-sample estimates

SE 0.5524155 (square root of the estimated residual variance), R^2 0.9446316

Compare these to RMSE, MAE to see if we have overfitting

ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)

lmTune <- train(x = solTrainXtrans, y = solTrainY, method = "lm", trControl = ctrl)

lmTune

str(lmTune)

reg.check <- lm(Solubility ~ ., data = trainingData)

summary(fitted(reg.check) - fitted(lmTune))
summary(resid(reg.check) - resid(lmTune))

### Remove covariates that are highly correlated

corThresh <- .9

tooHigh <- findCorrelation(cor(solTrainXtrans), corThresh) #

corrPred <- names(solTrainXtrans)[tooHigh] #

trainXfiltered <- solTrainXtrans[, -tooHigh] #

testXfiltered <- solTestXtrans[, -tooHigh]

set.seed(100)

lmFiltered <- train(x = trainXfiltered, y = solTrainY, method = "lm", trControl = ctrl)

lmFiltered

PCR – principal component regression: dimension reduction of X

set.seed(100)

lmPCA <- train(solTrainXtrans, solTrainY,

method = "lm",

preProcess = "pca",

trControl = ctrl

)

#86 PCs are necessary to explain 95% of the original dependence structure

lmPCA

ncol(solTrainXtrans) # original number of predictors (i.e. covariates): 228

summary(lmPCA)

PLSR is often preferable to PCR because PLSR also takes into account how much of the variance of Y the components explain.

PLSR – dimension X reduction + focus on relation of Y to PC

plsrFit <- plsr(Solubility ~ ., data = trainingData)

summary(plsrFit)

Interpretation: 17 PCs explain 86% of relationship among covariates and 93% of variation in Y.

predict(plsrFit, solTestXtrans[1:5,], ncomp = 2)


Interpretation:

PLSR with CV training

set.seed(100)

plsrTune <- train(solTrainXtrans, solTrainY,

method = "pls",

tuneLength = 40,

trControl = ctrl,

preProc = c("center", "scale")

)

plsrTune

summary(plsrTune)

Variable importance

plsrImp <- varImp(plsrTune, scale = FALSE)

plot(plsrImp, top = 25, scales = list(y = list(cex = .95)))

#For a linear model (linear least squares), the absolute value of the t-statistic is used as the measure of variable importance: the larger the t-statistic, the more important the variable.

#Lasso, Ridge and ElasticNet are all part of the Linear Regression family where the x (input) and y
(output) are assumed to have a linear relationship. The main difference among them is whether
the model is penalized for its weights.

Ridge regression estimation and prediction

 Ridge penalizes the sum of squared estimated coefficients


 LASSO penalizes the sum of absolute values of the coefficients (in enet: s < 1, lambda = 0)
 Elastic net – linear combination of the LASSO penalty and the Ridge penalty

Lambda (penalty, weight decay): the larger lambda, the more the coefficients are shrunk.

Lambda represents the bias-variance trade-off: lambda↑ => bias↑, variance↓

If we raise lambda,

 in Ridge: the coefficients are shrunk towards 0 and become more similar, but none becomes exactly 0 (disadvantage of Ridge: no feature selection property)
 in LASSO: the higher the lambda, the more coefficients are shrunk exactly to 0 (the parameters with the lowest impact on Y are shrunk to zero first) – implicit variable selection property (see the penalty sketch below)
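A sketch of the penalty terms themselves (not a full estimation) with a hypothetical coefficient vector; the elastic-net line follows glmnet's parameterization:

beta <- c(1.5, -0.3, 0, 2.0) # hypothetical coefficients

lambda <- 0.5 # penalty weight

lambda * sum(beta^2) # Ridge penalty: sum of squared coefficients

lambda * sum(abs(beta)) # LASSO penalty: sum of absolute coefficient values

alpha <- 0.5 # elastic-net mixing parameter

lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta))) # elastic net: linear combination of both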
ridgeModel <- enet(x = as.matrix(solTrainXtrans), y = solTrainY, lambda = 0.001)

ridgePred <- predict(ridgeModel, newx = as.matrix(solTestXtrans), s = 1, mode = "fraction", type = "fit")

ridgeGrid <- data.frame(.lambda = seq(0, 0.1, length = 15))

set.seed(100)

ridgeTune <- train(solTrainXtrans, solTrainY, method = "ridge", tuneGrid = ridgeGrid, trControl = ctrl,
preProc = c("center", "scale"))

ridgeTune

LASSO: lambda = 0

The lowest RMSE obtained with lambda = 0.0286

Elastic net:

enetModel <- enet(x = as.matrix(solTrainXtrans), y = solTrainY, normalize = TRUE, lambda = 0.01)

enetGrid – Interpretation: using between 10% and 20% of the original covariates, we get the lowest RMSE.

From exercise:

PLSR - partial least squares regression

PLSR uses all 19 covariates and transforms them => the original covariates lose their direct interpretability.

library(pls)

reg.pls <- plsr(wage ~ ., data = dat)

summary(reg.pls)
LASSO regression (shrinkage method) => more beneficial than PLSR, because it keeps the direct interpretability of the covariates

library(glmnet)

lambda.grid <- 10^(seq(from = -2, to = 10, length = 100)) #100 steps

reg.lasso <- glmnet(x = as.matrix(subset(x = dat, select = -wage)), y = dat$wage,

alpha = 1, # 1 means LASSO, 0 means Ridge Regression, 0<alpha<1 means elastic net penalty.

lambda = lambda.grid)

#Df – number of estimated coefficients. With a high lambda (penalty) all the covariates are shrunk to zero, while with lambda close to zero, all 19 covariates are included. %Dev – the fit of the model gets worse as lambda increases. LASSO allows for a reduction of the dimension: 10 covariates can explain almost 40% of the variation of Y.

plot(reg.lasso) #Coefficient path plot – how the coef change when the penalty increases (lambda↑)?

reg.lasso$beta #order in which covariates shrink

#(bottom axis) L1 norm – sum of the absolute coefficient values (on the left we have a very large penalty => no covariates left). (top axis) The number of covariates that are still in the model.

#What is the optimal abscissa position? Look at the Cross validation (train function at Caret)

#The right part is close to OLS solution.


Cross-validation (train function)

1. PLSR CV

library(caret)

set.seed(100)

plsrTune <- train(dat[, -1], dat[, 1], method = "pls", tuneLength = 50, trControl =
trainControl(method = "cv", number = 5))

plsrTune

summary(plsrTune)

plot(plsrTune)

#The model improves with an increasing number of PCs. If we move to the left, we get a more robust model (but worse prediction performance). If we move to the right, the model is less robust but the prediction performance is only slightly better. Take between 5 and 7 PCs.

2. LASSO CV

set.seed(100)

lassoTune <- train(dat[, -1], dat[, 1], method = "glmnet", tuneGrid = expand.grid(.alpha = 1, .lambda =
lambda.grid), trControl = trainControl(method = "cv", number = 5)) #.alpha = 1 LASSO

lassoTune

#From the point where R^2 is NaN, all covariates are shrunk to 0 and we regress Y on the mean of Y.

#the best lambda is 0.0132 which achieves performance of 42% (LASSO, alpha = 1)

Code 6

k-nearest neighbor – predicts a new sample using the k closest samples from the training set.

non-linear, non-parametric regression, focus on the bias-variance trade-off


*Usually, a small K over-fits and a large K under-fits the data.

library(AppliedPredictiveModeling)

library(caret)

From exercise:

library(np)

data(wage1)

dat <- subset(x = wage1, select = -c(lwage, nonwhite, female, married))

#Pre-processing - Univariate standardization of all the variables in dat apart from wage

#because the KNN method depends on the distances between samples, the scale of the predictors can have a huge impact on the distances; predictors with the largest scales would contribute most.

dat.s <- data.frame(wage = dat$wage, scale(dat[, -1], center = TRUE, scale = TRUE))

# 10 fold-Cross Validation to determine the optimal number k for a k-NN estimation of wage on the
other variables in dat.s

set.seed(42)

knnTune <- train(

x = subset(x = dat.s, select = -wage), # x = subset(x = dat, select = -wage),

y = dat$wage,

method = "knn",

tuneGrid = data.frame(.k = 1:30),

trControl = trainControl(method = "cv", number = 10)

)

knnTune

#knnTune suggests using k = 12, but there is a bias-variance trade-off and we should be careful when choosing K. Compare the fitted values to the observed response values.

fitted(knnTune) #create the fitted values

plot(knnTune)
#With a higher K the model becomes more robust: the variance decreases but the bias increases (bias-variance trade-off). K between 5 and 15 is good. The result with the lowest RMSE may not be very stable, so be careful! Look at fitted vs. observed values.

#Take k=12 and do regression

knn <- knnreg(x = subset(x = dat.s, select = -wage), y = dat$wage, k = 12)

knn.fitted <- predict(knn, newdata = subset(x = dat.s, select = -wage))

knn.resid <- dat$wage - knn.fitted

plot(knn.fitted, knn.resid) # fitted-residual-scatterplot

abline(h = 0)

#Variance not constant (indicator of heteroskedasticity) – the variance of the residuals increases with increasing fitted y. The assumption is violated.

#Compare the in-sample performance of the k-NN approach of 1c) to the performance of a linear
least squares regression for data "dat.s".

id.holdout <- 401:nrow(dat) #for prediction

#training

knn.train <- knnreg(x = subset(x = dat.s[-id.holdout, ], select = -wage), y = dat$wage[-id.holdout], k = 12)

knn.fit <- predict(knn.train, newdata = subset(x = dat.s[-id.holdout, ], select = -wage))

reg.train <- lm(wage ~ ., data = dat.s[-id.holdout, ])

cor(dat.s[-id.holdout, "wage"], knn.fit)^2


0.469533 #R^2 for 12 nearest neighbors

cor(dat.s[-id.holdout, "wage"], fitted(reg.train))^2


0.4789283 #R^2 for Linear least squares (all variables)

Interpretation: fit is worse for k-nearest neighbors (R^2 is less)


#Compare the out-of-sample performance of both approaches.

#prediction, use hold-out data

knn.prd <- predict(knn.train, newdata = subset(x = dat.s[id.holdout, ], select = -wage))

reg.prd <- predict(reg.train, newdata = subset(x = dat.s[id.holdout, ], select = -wage))

mean((dat.s[id.holdout, "wage"] - knn.prd)^2)


8.467812 #MSE for 12 nearest neighbors
mean((dat.s[id.holdout, "wage"] - reg.prd)^2)
7.838936 #MSE for Linear least squares
Interpretation: prediction performance is worse for k-nearest neighbors
(MSE is higher)

#Are there observations, where the fit or prediction of one of the approaches is poor?

#Fit: large y values are under-estimated, while small y values are often over-estimated, because for larger y values both models have large positive residuals. Residual = y – y.hat.

#Prediction: very large prediction errors for large y values

Code 7

Tree – a set of binary decisions

Basic regression trees partition the data into smaller groups that are more homogenous with respect
to the response. To achieve outcome homogeneity, regression trees determine:

• The predictor to split on and value of the split


• The depth or complexity of the tree

• The prediction equation in the terminal nodes

For regression, the model begins with the entire data set, S, and searches every distinct value of
every predictor to find the predictor and split value that partitions the data into two groups (S1 and
S2) such that the overall sums of squares error (SSE) are minimized. Then within each of groups S1
and S2, this method searches for the predictor and split value that best reduces SSE. Because of the
recursive splitting nature of regression trees, this method is also known as recursive partitioning.
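A minimal sketch of this split search for a single predictor (simulated data; rpart performs this search over all predictors and then recursively within each resulting group):

set.seed(1)

x0 <- runif(100); y0 <- ifelse(x0 > 0.6, 3, 1) + rnorm(100, sd = 0.3) # jump in the mean at x0 = 0.6

splits <- head(sort(unique(x0)), -1) # candidate split values (drop the largest so both groups stay non-empty)

sse <- sapply(splits, function(s){
left <- y0[x0 <= s]; right <- y0[x0 > s]
sum((left - mean(left))^2) + sum((right - mean(right))^2) # overall SSE of the two resulting groups
})

splits[which.min(sse)] # split value with the smallest SSE, roughly at the true jump point 0.6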

Predictors that appear higher in the tree (i.e., earlier splits) or those that appear multiple times in the
tree will be more important than predictors that occur lower in the tree or not at all.

Disadvantages of trees:

1. single regression trees are more likely to have sub-optimal predictive performance compared
to other modeling approaches. This is partly due to the simplicity of the model.
2. an individual tree tends to be unstable. If the data are slightly altered, a completely different
set of splits might be found
3. selection bias: predictors with a higher number of distinct values are favored over more
granular predictors

# Single regression tree

rpartTree <- rpart(Solubility ~ ., data = trainingData)

summary(rpartTree)

rpart.plot(rpartTree)

mean(trainingData$Solubility) # -2.71: mean of the response variable (top of the tree)

tree$variable.importance #most important variable (the first split) – NumCarbon


#Interpretation: a measure of the improvement in prediction performance; how much worse would the fit be if we omitted this variable?

tree.fitted<- predict(tree, newdata = dat)

sort(unique(tree.fitted)) # which distinct categories of y.hat exist? The lowest level of the tree

cor(dat$Solubility, tree.fitted)^2 #measure of fit – how many percent of variation is explained

# Interpretation of the tree:

 coloring scheme is darker, whenever the mean response value is larger for the corresponding
combination
 if the NumCarbon >= 3.78 then Solubility = -4.49, else Solubility = -1.84 (means of y)
 we get the lowest Solubility if NumCarbon is large, the surface area is low, and NumNonHAtoms is large.
 The fitted values of a tree are just the mean response values of all observations in the
final/terminal node. For example: 5% of all observations belong to the left-most terminal
node, they have the following covariate configuration: NumCarbon>= 3.8 (more precisely
3.78, but plot sometimes rounds to 1 digit) AND SurfaceArea2<0.98 AND
NumNonHAtoms>=2.8. Now, the fitted value or prediction for all observations that have
such a covariate configuration is just the mean y-Value of those 5% observations that are
used for estimating the tree.

### Using train function - serves for cross-validation based determination of the trees.

set.seed(100)

rpartTune <- train(solTrainXtrans, solTrainY,

method = "rpart2",

tuneLength = 20, # number of candidate tree sizes (maximum depths) to evaluate

trControl = trainControl(method = "cv"))

summary(rpartTune) # gives the same result; see plot: min RMSE at 0.97 with 25 nodes

plot(rpartTune)

# RMSE decreases as the allowed number of splits (tuneLength) increases

# More flexibility, less robustness (bias lower, variance higher)

y.hat <- predict(rpartTree, newdata = dat)


Bagging – bootstrapping + aggregation

random sampling with replacement

Bootstrap aggregation: you perform bootstrapping and thus get B bootstrap samples; for each sample you estimate a tree, and the bagging result is just the average of the trees.

Advantages:

1. stability: bagging effectively reduces the variance of a prediction through its aggregation process
2. bagging can provide its own internal estimate of predictive performance that correlates well with either CV estimates or test set estimates, because certain samples are left out of each bootstrap sample. These left-out samples are called out-of-bag, and they can be used to assess the predictive performance of that specific model since they were not used to build it. Hence, every model in the ensemble generates a measure of predictive performance courtesy of the out-of-bag samples. The average of the out-of-bag performance metrics can then be used to gauge the predictive performance of the entire ensemble, and this value usually correlates well with the assessment of predictive performance we get from CV or from a test set (see the sketch below).
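A sketch of the out-of-bag idea for one bootstrap replication (hypothetical simulated data; the bagging code below applies the same resampling B times):

library(rpart)

set.seed(42)

dat.oob <- data.frame(x = runif(200)); dat.oob$y <- sin(2*pi*dat.oob$x) + rnorm(200, sd = 0.3)

ids.boot <- sample(x = 1:nrow(dat.oob), size = nrow(dat.oob), replace = TRUE) # bootstrap row indices

ids.out <- setdiff(1:nrow(dat.oob), unique(ids.boot)) # rows never drawn: the out-of-bag sample

tree.boot <- rpart(y ~ ., data = dat.oob[ids.boot, ]) # estimate a tree on the bootstrap sample

oob.pred <- predict(tree.boot, newdata = dat.oob[ids.out, ]) # predict the out-of-bag rows

mean((dat.oob$y[ids.out] - oob.pred)^2) # out-of-bag MSE for this single tree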

### Bagging computation

B <- 10 # number of bootstrap samples

y.hat.boot <- array(data = NA, dim = c(nrow(dat), B))

set.seed(42)

for(b in 1:B){

dat.boot <- dat[sample(x = 1:nrow(dat), size = nrow(dat), replace = TRUE), ]

rpartTree.boot<- rpart(y ~ ., data = dat.boot)

y.hat.boot[, b] <- predict(rpartTree.boot, newdata = dat)

}

### Bagging results

plot(x, y, col = "royalblue", pch = 20) # observations

curve(reg.fun, from = -pi, to = 3*pi, col = "royalblue", add = TRUE) # correct functional relationship

lines(x, y.hat, lwd = 2, col = "cyan") # fitted line for single tree

matplot(x, y.hat.boot, type = "l", col = "gray70", lty = 1, add = TRUE) # fitted lines for the bootstrap samples

lines(x, rowMeans(y.hat.boot), lwd = 3, col = "darkorange") # fitted line obtained from BAGGING
#Every grey line is a single tree estimated on one of the 10 bootstrap samples. The orange line is the average of the grey lines at every x. The orange line captures the structure better than the single-tree line (cyan).

### Computation via train

set.seed(42)

x.named <- matrix(x, dimnames = list(NULL, "x1"))

treebagTune <- train(x = x.named, y = y,

method = "treebag",

nbagg = 20, trControl = trainControl(method = "cv"))

treebagTune

y.hat.treebagTune <- predict(treebagTune, newdata = x.named)

#for every sample we do CV to determine tree fit, to protect from overfitting

Random forests – collection of trees, average across the trees

Column subsampling at each split, then determine the most important variables

(vs. the bootstrap/bagging, where we do row subsampling)

Similar to bagging, but at each split of a bootstrap replication only a random sample of the predictors/covariates is considered. Then, as before, the best separating covariate and the corresponding threshold/cut-off point are determined to conduct the split.

Advantage: the trees exhibit less dependence across the bootstrap replications, hence (forecasting) performance often improves. In bagging, by contrast, the same dominant covariates are used again and again -> more similar trees and weaker results.
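The number of covariates drawn at each split is controlled by the mtry argument of randomForest() (a sketch, assuming the solubility training data from above are loaded; for regression the default is roughly one third of the predictors):

library(randomForest)

set.seed(42)

rfSmall <- randomForest(x = solTrainXtrans, y = solTrainY, ntree = 100, mtry = floor(ncol(solTrainXtrans)/3)) # covariates sampled at each split

rfSmall$mtry # number of covariates actually tried at each split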

library(randomForest)

rfModel <- randomForest(solTrainXtrans, solTrainY, importance = TRUE, ntree = 100)

plot(rfModel)
# The error decreases as the number of trees increases

importance(rfModel)

# %IncMSE – increase in MSE when the corresponding covariate is randomly permuted. The most important variables are those with the highest %IncMSE.

varImp(rfModel) #overall importance SORTED in decreasing order of importance

varImpPlot(rfModel)

Boosting

Refitting in stages, so that each step focuses on the observations that are poorly fitted so far and tries to improve them in the next boosting step.

gbmModel <- gbm.fit(solTrainXtrans, solTrainY,

distribution = "gaussian")

# Shrinkage: if you keep improving the fit to the same observations, you run into overfitting. Even when a new fit reduces the error, you do not adjust the current fit by the whole amount; instead, you only add 1% or 10% (the learning rate) of the new model's contribution.
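A sketch of how the learning rate enters gbm.fit (assuming the solubility training data from above; shrinkage and n.trees are gbm's arguments for the learning rate and the number of boosting iterations):

library(gbm)

set.seed(100)

gbmTuned <- gbm.fit(x = solTrainXtrans, y = solTrainY, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 3, verbose = FALSE) # only 1% of each new fit is added per iteration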

Useful:

 function gbm.fit – interpretation of results: JS: Iterative approach (not relevant for the exam), where the deviance of the training set is optimized (deviance is an alternative measure of fit we did not talk about). In each iteration step, boosting focuses on those observations that are poorly fitted in the previous step. More details are not relevant here.
 train: JS: for cross-validated estimation
 min number of observations in each tree: JS: similar to 5% in final nodes of the tree above
 configurations (number of trees): JS: Equivalent to number of bootstrap samples in boosting
 xgbTree: JS: boosting can be applied to very many statistical methods, here, we use boosted
trees. The train-Function also covers a huge number of methods,
compare: http://topepo.github.io/caret/train-models-by-tag.html

Exercise code 7:

Code 8 - Classification performance measure

Linear probability model

Disadvantages of the LPM: biased estimates; predicted probabilities can be larger than 100% or lower than 0%.
Exercise:

data(wage1)

dat <- wage1

lpm <- lm(south ~ exper + expersq + female + west, data = dat)

summary(lpm)

#south - response variable is binary

#biased estimates, probabilities larger than 100% or lower than 0%.

cor(dat$south, fitted(lpm))^2 # R^2

thres <- 0.5 # for model calibration - changing the threshold - important for the ROC curve

# If threshold is changed, one can approach all points on the corresponding roc-curve

#Larger thres-values yield more fitted values 0, while smaller thres-values yield more fitted values 1.

#e.g. decrease the threshold if you know that fewer people have y = 1

# assign 1 if prob is larger than 0.5, and assign 0 if the prob is lower than 0.5

y.vs.yhat <- data.frame(

y = dat$south,

yhat.prb = fitted(lpm),

yhat = as.numeric(fitted(lpm) > thres)

)

head(y.vs.yhat)

table(y.vs.yhat[, c(1, 3)]) # Confusion matrix

#Confusion matrix interpretation: 18 people are correctly identified as coming from the South. If we use a lower threshold (e.g. 0.2), we correctly identify all 187 people coming from the South, but at the cost of falsely classifying 250 people who do not come from the South as coming from the South. Trade-off between sensitivity and specificity.

 Sensitivity – how many y=1 are correctly classified (TPR = 18/(18+169))


 Specificity – how many y=0 are correctly classified (TNR = 316/(316 + 23) = 1 – FPR); see the sketch below.
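These rates can be read directly off the confusion matrix above (a sketch using the cell counts quoted in these notes):

cm <- matrix(c(316, 169, 23, 18), nrow = 2, dimnames = list(y = c("0", "1"), yhat = c("0", "1"))) # rows = observed, columns = fitted

cm["1", "1"] / sum(cm["1", ]) # sensitivity (TPR) = 18/(18 + 169)

cm["0", "0"] / sum(cm["0", ]) # specificity (TNR) = 316/(316 + 23)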

library(pROC)

rocCurve <- roc(response = y.vs.yhat$y, predictor = y.vs.yhat$yhat.prb)


plot(rocCurve) # Roc Curve – trade-off between sensitivity and specificity

auc(rocCurve) # area under curve => model not that good

#Compare with the improved model in which gender is omitted because it is not significant

lpm2 <- lm(south ~ exper + expersq + smsa + northcen + west + ndurman, data = dat)

summary(lpm2)

y.vs.yhat[, "yhat.prb2"] <- fitted(lpm2)

y.vs.yhat[, "yhat2"] <- as.numeric(fitted(lpm2) > thres)

rocCurve2 <- roc(response = y.vs.yhat$y, predictor = y.vs.yhat$yhat.prb2)

plot(rocCurve)

plot(rocCurve2, col = "darkorange", add = TRUE)

auc(rocCurve) # AUC area under the curve - don't interpret as %

auc(rocCurve2)

ROC curve

 For given data we can only reach the points on the ROC curve. With threshold of 0.5 we
would reach TPR = 0.8 and FPR = 0.1.
 Uninformed diagonal – what we would obtain from a pure guess
 Perfect discrimination point (FPR = 0; TPR = 1) – all events and all non-events are correctly
identified.
 Bottom left point (FPR = 0, TPR = 0) – always estimate 0
 Top right point (FPR = 1, TPR = 1) – always estimate 1
 The further the ROC curve is from the diagonal, the better the corresponding estimator is.

AUC – area under the curve

 AUC < 1 => ROC is further from perfect discrimination point


 AUC = 1 => ROC would be able to reach the perfect discrimination point (TPR = 1, FPR = 0)

With 95% confidence, we obtain with the given data and model an AUC between 95.6% and 99.1%.

Code 9 – Classification models, model comparison

 library(caret)
 library(MASS)
 library(AppliedPredictiveModeling)
 library(randomForest)
 library(pROC)
 library(klaR)

# LDA – linear discriminant analysis – discriminates between the categories of Y. Assumptions: equal group variances, Var(x|y=0) = Var(x|y=1), and a Normal distribution of x within each group.

lda.obj <- lda(y ~ x)

lda.obj #prior probabilities of the groups

predict(lda.obj) #see on paper

plot(lda.obj)

#vs. quadratic discriminant analysis (QDA), which allows for different variances of X conditional on the group

#qda.obj <- qda(y ~ x)

#qda.obj

#predict(qda.obj)

(mean.0 <- mean(x[y == 0]))

(mean.1 <- mean(x[y == 1]))

(sig2.hat <- (1/(n - 2))*sum(c((x[y == 0] - mean.0)^2, (x[y == 1] - mean.1)^2)))

Decision rules (and indifference values):

1. Maximum likelihood – the group densities differ only in their means => to distinguish between the Normal distributions, we take the average of the two group means:
(x.star.ML <- (mean.0 + mean.1)/2)
2. Bayes – accounts for the fact that there are more obs in one group of Y than in the other:
(x.star.Bayes <- (mean.0 + mean.1)/2 - log(prior.0/prior.1)*sig2.hat/(mean.0 - mean.1))

#Regular densities (for Max Likelihood)

d.0 <- dnorm(x = x, mean = mean(x[y == 0]), sd = sqrt(sig2.hat))


d.1 <- dnorm(x = x, mean = mean(x[y == 1]), sd = sqrt(sig2.hat))

#Rescaled densities (for Bayes)

d.0.resc<- d.0*prior.0

d.1.resc<- d.1*prior.1

### Estimated posterior probability values for Max Likelihood decision rule (for Bayes they are
given in predict function of LDA)

(posterior.ML.0 <- d.0/(d.0 + d.1))

(posterior.ML.1 <- d.1/(d.0 + d.1))

### Intuition of Maximum Likelihood- and Bayes-rule

x.seq <- seq(from = min(x), to = max(x), length.out = 100)

### Group densities and weighted group densities

d.0.seq <- dnorm(x = x.seq, mean = mean(x[y == 0]), sd = sqrt(sig2.hat))

d.1.seq <- dnorm(x = x.seq, mean = mean(x[y == 1]), sd = sqrt(sig2.hat))

d.0.resc.seq <- d.0.seq*prior.0

d.1.resc.seq <- d.1.seq*prior.1

Blue: y=0, orange: y=1

Dashed densities have same height but only means are different.

Dashed vertical line – ML indifference point (intersection of the dashed densities). To the left of it, i.e. whenever X is lower than the ML indifference point, the probability of being in group 0 is higher, because the blue density is higher than the orange one (assign y.hat = 0). Whenever X is higher than the ML indifference point, the orange density is higher than the blue one => assign y.hat = 1.

If we multiply the dashed blue density by 0.58 (its prior probability), we get the solid blue density.

Solid vertical line – Bayes indifference (point of intersection of solid densities). Same rule as in ML.

The Bayes line is further to the right than the ML line because we know there are more obs with y=0 than y=1.
Exercise:

set.seed(1234)

fold.set <- 1:10

fold <- sample(x = fold.set, size = nrow(dat), replace = TRUE)

table(fold) # number of obs in each fold

res <- data.frame(y = dat$west, fold = fold)

head(res)

for(fo in fold.set){

#fo <- 1

# LPM

mod.lp <- lm(west ~ northcen + south + tenure + tenursq, data = dat[fold !=fo, ])

res[res$fold == fo, "lpm"] <- predict(mod.lp, newdata = dat[fold == fo, ])

# LDA

mod.lda <- lda(west ~ ., data = dat[fold !=fo, ])

res[res$fold == fo, "lda"] <- predict(mod.lda, newdata = dat[fold == fo, ])$posterior[, "1"]

# k-NN

for(k in 5:15){

mod.knn <- knn3(west ~ ., data = dat[fold !=fo, ], k = k)

res[res$fold == fo, paste("nn", k, sep = "")] <- predict(object = mod.knn, newdata = dat[fold == fo, ])[, "1"]

}

# RF

for(ntree in c(100, 200, 300)){

mod.rf <- randomForest(as.factor(west) ~ ., data = dat[fold !=fo, ], ntree = ntree)

res[res$fold == fo, paste("rf", ntree, sep = "")] <- predict(object = mod.rf, newdata = dat[fold == fo, ], type = "prob")[, "1"]

}

print(paste("Computation for Fold", fo, "done."))

}
#c) Compare the performance of the models in b) and try to improve the specification of every
model. Discuss your findings. Improvement also partially included.

thres <- 0.3

table(res$y, res$lpm > thres)

table(res$y, res$lda > thres)

table(res$y, res$nn5 > thres)

table(res$y, res$nn7 > thres)

table(res$y, res$rf100 > thres)

table(res$y, res$rf200 > thres)

table(res$y, res$rf300 > thres)

#Cross-validation with train for random forest. How can you extract/create the confusion matrix?

rf.fit <- train(

x = subset(dat, select = -south),

y = as.factor(dat$south),

method = "rf",

tuneLength = 1, # of course, more than 1 specification should be estimated

trControl = trainControl(method = "cv")

)

rf.fit$finalModel$confusion

 Sensitivity = 1 – 0.29 = 0.71 (about 70%)


 Specificity = 1 – 0.21 = 0.79 (about 80%)

ROC

set.seed(12345)

RF2 <- randomForest(

card ~ age, data = dat,

importance = TRUE,

ntree = 100
)

y <- dat$card

y.pred <- predict(RF2)

confusionMatrix(data = y.pred,

reference = y,

positive = "yes"

# Sensitivity : 82% of events are correctly specified

# Specificity : 25% of non-events are correctly specified

rocCurve <- pROC::roc(response = y,

predictor = as.numeric(y.pred)

)

auc(rocCurve) # area under the curve is 0.53. "Half way" between the diagonal and perfect
discrimination point

ci(rocCurve) # Confidence Interval: with 95%, we obtain with the given data and model, AUC
between 0.49 and 0.58

plot(rocCurve, legacy.axes = TRUE)
