Statistical Learning in R
Code 2
Standardization
Multivariate standardization: moves the data cloud to the origin (centers it), scales the variance of every variable to 1, and removes the correlation between variables. Used for cluster analysis.
SIGMA.hat <- cov(dat)                            # estimated covariance matrix of dat
spec <- eigen(solve(SIGMA.hat))                  # spectral decomposition of the inverted SIGMA.hat
Q <- spec$vectors
D <- diag(spec$values)
SIGMA.hat.inv.sqrt <- Q %*% sqrt(D) %*% t(Q)     # square root of inverted SIGMA.hat
dat.row <- dat[1, ]                              # example: first observation
dat.c <- t(t(dat) - apply(X = dat, MARGIN = 2, FUN = mean))   # center every column
dat.ms <- dat.c %*% SIGMA.hat.inv.sqrt           # multivariate standardized data
cor(dat.ms)                                      # correlations are (numerically) zero
Univariate standardization: subtracts the mean of every variable (moves the cloud to the origin) and scales each variance to 1, but leaves the correlations unchanged. Used for PCA.
dat.us <- scale(x = dat, center = TRUE, scale = TRUE) # center = TRUE subtracts the mean, scale = TRUE divides by the standard deviation
cor(dat) # correlations are not affected by univariate standardization: cor(dat.us) gives the same matrix
Box-Cox transformation
Aim: make the distribution (e.g. of the dependent variable) more symmetric around the median and closer to a Normal distribution, i.e. reduce skewness. The Box-Cox transformation should only be used for continuous, strictly positive variables.
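For reference, the standard definition of the Box-Cox transformation of a positive variable y (standard formula, not taken from the course code):
$y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, and $y^{(\lambda)} = \log(y)$ for $\lambda = 0$.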
# Task: check the skewness of these variables and try to find a Box-Cox-transformed version of either of these variables that is less skewed.
skew <- function(x){                             # assumed definition: standard skewness coefficient
  skew.coef <- mean((x - mean(x))^3) / sd(x)^3
  return(skew.coef)
}
boxcox <- function(x, lambda){                   # assumed definition: usual Box-Cox transform
  if(lambda == 0) log(x) else (x^lambda - 1) / lambda
}
lambda.seq <- seq(-2, 2, by = 0.1)               # assumed grid of candidate lambda values
skew.seq <- numeric(length(lambda.seq))
for(pos in 1:length(lambda.seq)){
  skew.seq[pos] <- skew(x = boxcox(x = dat[, co], lambda = lambda.seq[pos]))
}
plot(lambda.seq, skew.seq)                       # skewness of the transformed column co against lambda
abline(h = 0)                                    # pick the lambda where the skewness is closest to 0
Skewness
> skew(x = wage1$wage) #not transformed
[1] 0.26
If the difference between the upper quartile and the median is larger than the difference between the median and the lower quartile, the distribution is right-skewed.
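A minimal check of this quartile rule (sketch, using the wage variable from above):
q <- quantile(wage1$wage, probs = c(0.25, 0.5, 0.75))
(q[3] - q[2]) > (q[2] - q[1])   # TRUE => right-skewed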
Histograms for the untransformed and the transformed variable:
hist(var.temp, freq = FALSE, breaks = 20, col = "lightgrey",
     main = paste("Untransformed => Skewness:", round(skew(var.temp), digits = 2)))
box()
# (var.temp has been re-assigned to its Box-Cox transformed version before the second histogram)
hist(var.temp, freq = FALSE, breaks = 20, col = "lightgrey",
     main = paste("Lambda=", lambda.temp, " => Skewness: ", round(skew(var.temp), digits = 2), sep = ""))
box()
Resampling
install.packages("ISLR")
install.packages("boot")
set.seed(42)
mean((dat$mpg - predict(object = lm.fit, newdata = dat))[-train]^2) # MSE for test set observations
#MSE – mean sq errors – measure of prediction performance – the lower the better
glm.fit <- glm(mpg ~ horsepower, data = dat)   # assumed: same model, refit with glm() for cv.glm
cv.err <- cv.glm(
  data = dat,
  glmfit = glm.fit,
  K = nrow(dat)        # K = n => leave-one-out cross-validation (LOOCV)
)
cv.err$delta
# 24.23151 – CV estimate of the prediction MSE; 24.23114 – bias-corrected version
cv.error <- rep(0, 5)
for(i in 1:5){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)   # assumed model: polynomial in horsepower
  cv.error[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = nrow(Auto))$delta[1]
}
cv.error
# degree1  degree2  degree3  degree4  degree5
# 24.23151 19.24821 19.33498 19.42443 19.03321
Interpretation: the model where horsepower enters linearly is worse than the model where
horsepower is squared.
### K-fold cross-validation – split the sample into K equally sized folds (e.g. with K = 10: estimate on 9 folds and predict the remaining fold, rotating through all folds). Protection against overfitting. Typical choices: K = 5 or 10.
set.seed(42)
cv.error.10 <- rep(0, 10)
for(i in 1:10){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)   # assumed model, as above
  cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
}
cv.error.10
Interpretation: there is hardly any difference between the polynomial orders as long as i >= 2. The model where horsepower enters only linearly has a much higher error than the quadratic model.
2-fold cross-validation
set.seed(12345)
# fold.id contains the shuffled ids that determine whether a row of dat belongs to fold 1 or fold 2.
# Create predictions for the fold-2 observations and compute the mean squared error (of prediction).
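A minimal sketch of these two steps (assuming dat and the mpg ~ horsepower model from above; object names besides fold.id are hypothetical):
fold.id <- sample(rep(1:2, length.out = nrow(dat)))                     # shuffled fold membership
fit.f1  <- lm(mpg ~ horsepower, data = dat[fold.id == 1, ])             # estimate on fold 1
mse.f2  <- mean((dat$mpg[fold.id == 2] -
                 predict(fit.f1, newdata = dat[fold.id == 2, ]))^2)     # predict fold 2 => MSE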
Bootstrap – resampling with replacement – used for estimating a distribution (e.g. of a t-statistic).
set.seed(12345)
# Bootstrap.sample1 contains the row indices of the observations of wage1 that should be part of the
# first bootstrap sample. Due to the sampling with replacement, some of the numbers {1, 2, ..., 525,
# 526} appear more than once in Bootstrap.sample1, and some numbers do not appear at all. The latter
# (hold-out) ids (in Validation.sample1) identify the rows of wage1 that can be used to create
# predictions after the models are estimated using the rows of wage1 identified by Bootstrap.sample1.
# This approach is repeated B times.
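A minimal sketch of how such a split can be generated (the object names follow the text above):
Bootstrap.sample1  <- sample(x = nrow(wage1), size = nrow(wage1), replace = TRUE)  # sampling with replacement
Validation.sample1 <- setdiff(1:nrow(wage1), Bootstrap.sample1)                    # hold-out (out-of-bag) ids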
alpha.fn <- function(data, index){               # ISLR portfolio example: optimal allocation alpha
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2*cov(X, Y)))
}
set.seed(42)
alpha.fn(data = dat.p, index = sample(x = 100, size = 100,
                                      replace = TRUE))      # use different subsets of the data
boot(data = dat.p, statistic = alpha.fn, R = 1000)           # assumed call: B = 1000 bootstrap replications
# boot() output: original – the statistic on the original data; bias – mean of (bootstrap statistic – original);
# std. error – uncertainty of the original estimator: the larger the SE, the more widespread the distribution is.
Code 3
PCA – principal component analysis – focus on the columns. Aim: reduce the dimension (take a smaller number of columns) using a spectral decomposition. PCA should be performed on the smaller of the two dimensions (rows vs. columns).
# assumed reconstruction: table of the cumulative share of variance explained per PC,
# based on a spectral decomposition spec <- eigen(cor(dat))
data.frame(
  num.pc = 1:ncol(dat),
  cum.share = cumsum(spec$values) / sum(spec$values)
)
# 18 PCs in total. The first PC explains 13.6% of the dependence structure of the 18 variables; 3 PCs
# explain only about one third of it; 9 PCs (half of the original dimension) would explain about 75%.
Method 2: scree plot – choose the most pronounced kink to determine the number of PCs (here 3 PCs); this means that we reduce the number of columns from 18 to only 3.
plot(
  x = 1:ncol(dat),
  y = spec$values,   # eigenvalues => scree plot
  type = "b"
)
plot(clan)
di.matrix <- as.matrix(dist(x = dat.us, method = "manhattan", diag = TRUE, upper = TRUE))
diag(di.matrix) <- NA
# Using the correlation matrix ensures that the variables are comparable and prevents the variable with the
# largest variance from driving the PCA (default setting of prcomp: variables are centered).
pr.out <- prcomp(dat, scale = TRUE)
pr.out$rotation
Interpretation:
If UrbanPop is very high, PC 2 is very negative; if Murder is high, PC 2 is also high.
PC 1 can be read as "security": PC 1 is higher whenever Murder/Assault/Rape are small.
Plot:
biplot(pr.out, scale = 0, col = c("black", "royalblue"), main = "first two PCs of USArrests data set")
(pr.var <- pr.out$sdev^2) # variance explained by each PC (divide by 4 to get the share)
Plots of the share of variance explained by each PC and of the cumulative share, against the number of PCs:
pve <- pr.var / sum(pr.var)   # proportion of variance explained per PC
par(mfrow = c(1, 2))          # assumed: two panels side by side (reset below)
plot(pve, ylim = c(0, 1), xlab = "principal component", ylab = "share of variance explained",
     type = "b", pch = 20, xaxt = "n", yaxt = "n")
axis(side = 1, at = 1:4)
plot(cumsum(pve), ylim = c(0, 1), type = "b", pch = 20, xlab = "number of principal components",
     xaxt = "n", ylab = "cum. share of variance explained", yaxt = "n")
axis(side = 1, at = 1:4)
par(mfrow = c(1, 1))
Interpretation: 2 PCs are enough to explain 80% of the dependence structure among all 4 variables
summary(pr.out) # shows the st. dev., the share of variance explained and the cumulative share of variance
# explained for all PCs (number of PCs = min(number of obs., number of variables))
plot(pr.out)
Cluster analysis – focus on the rows. Aim: check whether there is a dependence structure between the observations (rows) and find groups of observations. Observations are homogeneous within a cluster. Intended for metric variables only (no binary variables)!
hc.complete <- hclust(d = dist(X), method = "complete") #Euclidean distance with complete linkage
plot(hc.complete, main = "complete linkage", xlab = "", sub = "", cex = .9) #plot
k <- 2
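To obtain the cluster assignment that the interpretation below refers to, the dendrogram can be cut (sketch; the interpretation apparently used k = 4):
cutree(hc.complete, k = k)   # cluster label for every observation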
Interpretation: breast cancer – 2 observations are in cluster 1, 3 observations in cluster 2 and 2 observations in cluster 4. The fact that the same cancer type is assigned to different clusters is not a problem, given that the chosen data only explain …% of the dependence structure.
Code 4
install.packages("caret")
sqrt(mean(squPredErrors)) RMSE
Interpretation of RMSE: On average, we have a prediction error of … units w.r.t. the scale of Y.
Variance-bias trade-off
# Prediction performance improves with a larger number of neighbors K: the fit becomes more robust and the
# variance decreases; however, the bias increases.
# Underfitting means the model does not fit or predict the (training) data very well. Overfitting means that
# the model fits the (training) data too well – it is "too good to be true": when a new data point comes in,
# the prediction may be poor. Typically, underfitting implies high bias and low variance, while overfitting
# implies low bias but high variance.
# Bias is the (average) difference between the true value and our prediction; variance is the expectation of
# the squared deviation of the prediction from its mean.
At 20 partitions we get the minimum MSE. At first, due to lower bias, we achieve a better MSE, but then the
growth of the variance overcompensates the reduction in bias (see the decomposition below).
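For reference, the standard decomposition behind this trade-off (for $y = f(x) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma^2$):
$E\big[(y - \hat f(x))^2\big] = \mathrm{Bias}\big(\hat f(x)\big)^2 + \mathrm{Var}\big(\hat f(x)\big) + \sigma^2$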
Code 5
Evaluating the model on data that was not used for estimation (hold-out data) gives a better estimate of how the model will perform on new data and shows whether the model is overfitted.
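lmValues1 is not constructed in the notes; a plausible sketch (the object names lmFitAllPredictors, solTestXtrans and solTestY are hypothetical hold-out objects, not taken from the notes):
lmValues1 <- data.frame(obs = solTestY,
                        pred = predict(lmFitAllPredictors, newdata = solTestXtrans))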
defaultSummary(lmValues1)
R^2 – squared correlation between the hold-out observations and the predicted values; it is lower than the in-sample measure (0.94).
set.seed(100)
lmTune
str(lmTune)
summary(fitted(reg.check) - fitted(lmTune))
summary(resid(reg.check) - resid(lmTune))
corThresh <- .9
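A sketch of how corThresh is typically used before refitting (caret::findCorrelation; the name of the filtered object is an assumption):
tooHigh <- findCorrelation(cor(solTrainXtrans), cutoff = corThresh)  # predictors with pairwise correlation > 0.9
trainXfiltered <- solTrainXtrans[, -tooHigh]                         # drop them, then refit (-> lmFiltered)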
set.seed(100)
lmFiltered
set.seed(100)
lmPCA <- train(solTrainXtrans, solTrainY,   # assumed x/y arguments (object name from the lines below)
               method = "lm",
               preProcess = "pca",
               trControl = ctrl)
#86 PCs are necessary to explain 95% of the original dependence structure
lmPCA
summary(lmPCA)
PLSR is preferable to PCR because PLSR also accounts for the share of variance explained with respect to Y (PCR only looks at the covariates).
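plsrFit itself is not constructed in the notes; a plausible call (pls package; the data arguments are assumptions):
library(pls)
plsrFit <- plsr(Solubility ~ ., data = data.frame(Solubility = solTrainY, solTrainXtrans))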
summary(plsrFit)
Interpretation: 17 components explain 86% of the dependence structure among the covariates and 93% of the variation in Y.
set.seed(100)
plsrTune <- train(solTrainXtrans, solTrainY,   # assumed x/y arguments (object name from the lines below)
                  method = "pls",
                  tuneLength = 40,
                  trControl = ctrl,
                  preProc = c("center", "scale"))   # assumed final argument (the notes end with a comma)
plsrTune
summary(plsrTune)
Variable importance
# For a linear model (LLS), the summary function reports the absolute value of the t-statistic, which can be
# used as a measure of variable importance: the larger the t-statistic, the more important the variable.
# LASSO, Ridge and ElasticNet are all part of the linear regression family, where x (input) and y (output)
# are assumed to have a linear relationship. The main difference among them is how the model is penalized
# for its weights (coefficients).
If we raise lambda (see the penalized objectives below):
in Ridge, the coefficients are shrunk towards 0 and become more similar to each other, but none is set exactly to 0 (disadvantage of Ridge: no feature-selection property);
in LASSO, the higher the lambda, the more coefficients are shrunk exactly to 0 (the parameters with the lowest impact on Y are shrunk to zero first) – implicit variable-selection property.
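For reference, the standard penalized least-squares objectives (textbook form, not from the course code):
Ridge: $\hat\beta = \arg\min_\beta \sum_i (y_i - x_i'\beta)^2 + \lambda \sum_j \beta_j^2$
LASSO: $\hat\beta = \arg\min_\beta \sum_i (y_i - x_i'\beta)^2 + \lambda \sum_j |\beta_j|$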
library(elasticnet)   # enet()
ridgeModel <- enet(x = as.matrix(solTrainXtrans), y = solTrainY, lambda = 0.001)
set.seed(100)
ridgeTune <- train(solTrainXtrans, solTrainY, method = "ridge", tuneGrid = ridgeGrid,
                   trControl = ctrl, preProc = c("center", "scale"))
ridgeTune
LASSO: in enet(), setting lambda = 0 (no quadratic penalty) corresponds to the LASSO.
Elastic net:
enetGrid   # tuning grid over the quadratic penalty (lambda) and the fraction of the full LASSO solution
Interpretation: using between 10% and 20% of the original covariates, we get the lowest RMSE.
From exercise:
PLSR uses all 19 covariates and transforms them => the original covariates lose their direct interpretability.
library(pls)
reg.pls <- plsr(wage ~ ., data = dat)   # assumed call (object name taken from the summary below)
summary(reg.pls)
LASSO regression (a shrinkage method) => more beneficial than PLSR, because it keeps the direct interpretability of the covariates.
library(glmnet)
reg.lasso <- glmnet(x = as.matrix(dat[, -1]), y = dat[, 1],   # assumed x/y arguments (wage in column 1)
                    alpha = 1,    # 1 means LASSO, 0 means ridge regression, 0 < alpha < 1 means elastic-net penalty
                    lambda = lambda.grid)
# Df – number of estimated (non-zero) coefficients. With a high lambda (penalty), all covariates are shrunk
# to zero, while with lambda close to zero all 19 covariates are included. %Dev – the fit of the model gets
# worse as lambda increases. LASSO allows a reduction of the dimension: 10 covariates can already explain
# almost 40% of the variation of Y.
plot(reg.lasso) # coefficient path plot – how do the coefficients change when the penalty increases (lambda up)?
# (bottom axis) L1 norm – sum of the absolute coefficient values (on the left we have a very large penalty =>
# zero covariates left). (top axis) the number of covariates that are still in the model.
# What is the optimal abscissa position? Look at the cross-validation (train function in caret).
1. PLSR CV
library(caret)
set.seed(100)
plsrTune <- train(dat[, -1], dat[, 1], method = "pls", tuneLength = 50,
                  trControl = trainControl(method = "cv", number = 5))
plsrTune
summary(plsrTune)
plot(plsrTune)
# The model improves with an increasing number of components. Moving to the left gives a more robust model
# (but worse prediction performance); moving to the right gives a less robust model with slightly better
# prediction performance, but not considerably so. Take between 5 and 7 components.
2. LASSO CV
set.seed(100)
lassoTune <- train(dat[, -1], dat[, 1], method = "glmnet",
                   tuneGrid = expand.grid(.alpha = 1, .lambda = lambda.grid),
                   trControl = trainControl(method = "cv", number = 5)) # .alpha = 1 => LASSO
lassoTune
# From the point where R^2 is NaN onwards, all covariates are shrunk to 0 and the prediction is just the mean of Y.
# The best lambda is 0.0132, which achieves a performance (R^2) of about 42% (LASSO, alpha = 1).
Code 6
k-nearest neighbor – predicts new sample using the k-closest samples from the training set.
library(AppliedPredictiveModeling)
library(caret)
From exercise:
library(np)
data(wage1)
# Pre-processing: univariate standardization of all variables in dat apart from wage,
# because the k-NN method depends on the distances between samples, so the scale of the predictors can have a
# huge impact on these distances: predictors with the largest scales would contribute the most.
dat.s <- data.frame(wage = dat$wage, scale(dat[, -1], center = TRUE, scale = TRUE))
# 10 fold-Cross Validation to determine the optimal number k for a k-NN estimation of wage on the
other variables in dat.s
set.seed(42)
knnTune <- train(x = dat.s[, -1],                 # assumed: the standardized covariates (wage excluded)
                 y = dat$wage,
                 method = "knn",
                 tuneGrid = data.frame(k = 1:30), # assumed tuning grid
                 trControl = trainControl(method = "cv", number = 10))
knnTune
# knnTune suggests using k = 12, but there is a variance-bias trade-off and we should be careful when
# choosing k. Compare the fitted values to the observed response values.
plot(knnTune)
# Prediction performance improves with a larger number of neighbors K (more robust): the variance decreases
# but the bias increases – variance-bias trade-off. A K between 5 and 15 is good. The result with the lowest
# RMSE may not be very stable, so be careful! Look at fitted vs. observed values.
abline(h = 0)
#Compare the in-sample performance of the k-NN approach of 1c) to the performance of a linear
least squares regression for data "dat.s".
#training
#Are there observations, where the fit or prediction of one of the approaches is poor?
# Fit: large y values are under-estimated, while small y values are often over-estimated; for large y values
# both models have large positive residuals (residual = y – y.hat). A sketch of this comparison is shown below.
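A sketch of the in-sample comparison (the object names for the linear fit are assumptions):
lls.fit  <- lm(wage ~ ., data = dat.s)                               # linear least squares on the same data
yhat.knn <- predict(knnTune, newdata = dat.s[, -1])                  # in-sample k-NN fit
plot(dat.s$wage, dat.s$wage - yhat.knn, pch = 20,
     xlab = "observed wage", ylab = "residual")                      # k-NN residuals
points(dat.s$wage, residuals(lls.fit), pch = 20, col = "royalblue")  # LLS residuals
abline(h = 0)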
Code 7
Basic regression trees partition the data into smaller groups that are more homogeneous with respect to the response. To achieve outcome homogeneity, regression trees determine the predictor to split on and the value of the split, the depth (complexity) of the tree, and the prediction in the terminal nodes.
For regression, the model begins with the entire data set, S, and searches every distinct value of every
predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such
that the overall sum of squared errors (SSE) is minimized. Then, within each of the groups S1 and S2, the
method again searches for the predictor and split value that best reduces the SSE. Because of this recursive
splitting nature, the method is also known as recursive partitioning.
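In formulas (standard notation, for reference): each split minimizes
$SSE = \sum_{i \in S_1} (y_i - \bar y_1)^2 + \sum_{i \in S_2} (y_i - \bar y_2)^2$,
where $\bar y_1$ and $\bar y_2$ are the mean responses within $S_1$ and $S_2$.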
Predictors that appear higher in the tree (i.e., earlier splits) or those that appear multiple times in the
tree will be more important than predictors that occur lower in the tree or not at all.
Disadvantages of trees:
1. single regression trees are more likely to have sub-optimal predictive performance compared
to other modeling approaches. This is partly due to the simplicity of the model.
2. an individual tree tends to be unstable. If the data are slightly altered, a completely different
set of splits might be found
3. selection bias: predictors with a higher number of distinct values are favored over more
granular predictors
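The fit of rpartTree is not shown in the notes; a plausible sketch on the solubility training data (the container trainData is an assumption):
library(rpart); library(rpart.plot)
trainData <- data.frame(Solubility = solTrainY, solTrainXtrans)
rpartTree <- rpart(Solubility ~ ., data = trainData)
tree.fitted <- predict(rpartTree)   # fitted values = mean response of each terminal node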
summary(rpartTree)
rpart.plot(rpartTree)
sort(unique(tree.fitted)) # which distinct values of y.hat exist? (one per terminal node)
# In the plot, the colouring is darker whenever the mean response value is larger for the corresponding
# covariate combination.
If NumCarbon >= 3.78, then the predicted Solubility is -4.49, otherwise -1.84 (the means of y in the two groups).
We get the lowest Solubility for a large NumCarbon, a low surface area and a large NumNonHAtoms.
The fitted values of a tree are just the mean response values of all observations in the
final/terminal node. For example: 5% of all observations belong to the left-most terminal
node, they have the following covariate configuration: NumCarbon>= 3.8 (more precisely
3.78, but plot sometimes rounds to 1 digit) AND SurfaceArea2<0.98 AND
NumNonHAtoms>=2.8. Now, the fitted value or prediction for all observations that have
such a covariate configuration is just the mean y-Value of those 5% observations that are
used for estimating the tree.
### Using the train function – cross-validation based determination of the tree size.
set.seed(100)
rpartTune <- train(solTrainXtrans, solTrainY,          # assumed x/y arguments and object name
                   method = "rpart2",                  # tunes over the maximum tree depth
                   tuneLength = 10, trControl = ctrl)  # assumed tuning setup
summary(rpartTune) # gives the same result; see the plot: minimum RMSE (about 0.97) at 25 nodes
plot(rpartTune)
Bootstrap aggregation: you perform bootstrapping and thus get B bootstrap samples, you estimate a tree on each sample, and the bagging result is just the average of the trees.
Advantages:
1. stability, effectively reduces the variance of a prediction through its aggregation process
2. bagged models can provide their own internal estimate of predictive performance that correlates well with
either CV estimates or test-set estimates, because certain samples are left out of each bootstrap sample.
These samples are called out-of-bag, and they can be used to assess the predictive performance of the
specific model, since they were not used to build it. Hence, every model in the ensemble generates a measure
of predictive performance courtesy of the out-of-bag samples. The average of the out-of-bag performance
metrics can then be used to gauge the predictive performance of the entire ensemble, and this value usually
correlates well with the assessment of predictive performance we get from CV or from a test set.
set.seed(42)
for(b in 1:B){
curve(reg.fun, from = -pi, to = 3*pi, col = "royalblue", add = TRUE) # correct functional relationship
lines(x, y.hat, lwd = 2, col = "cyan") # fitted line for single tree
matplot(x, y.hat.boot, type = "l", col = "gray70", lty = 1, add = TRUE) # fitted lines for the bootstrap samples
lines(x, rowMeans(y.hat.boot), lwd = 3, col = "darkorange") # fitted line obtained from BAGGING
# Every grey line is a single tree estimated on one of the 10 bootstrap samples. The orange line is the
# average of the grey lines at every x (the bagging estimate). The orange line captures the structure better
# than the single tree (cyan line).
set.seed(42)
treebagTune <- train(solTrainXtrans, solTrainY,   # assumed x/y arguments and object name
                     method = "treebag",
                     trControl = ctrl)            # assumed resampling setup
treebagTune
Random forest
Similar to bagging, but at each split within a bootstrap replication, only a random sample of the
predictors/covariates is considered. Then, as before, the best separating covariate and the corresponding
threshold/cut-off point are determined to conduct the split.
Advantage: the trees exhibit less dependence across the bootstrap replications, hence the (forecasting)
performance often improves. In bagging, by contrast, the same (strong) covariates are used again and again,
which leads to weaker results.
library(randomForest)
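The call that produced rfModel is not shown; a plausible sketch (data and parameter values are assumptions):
rfModel <- randomForest(x = solTrainXtrans, y = solTrainY,
                        importance = TRUE,   # needed for %IncMSE below
                        ntree = 500)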
plot(rfModel)
# The error decreases as the number of trees increases.
importance(rfModel)
# %IncMSE – increase in MSE when the corresponding covariate is randomly permuted. The most important
# variables are those with the highest %IncMSE.
varImpPlot(rfModel)
Boosting
Re-weighting/resampling so that the next boosting step focuses on the observations that are poorly fitted and tries to improve on them.
distribution = "gaussian")
# Shrinkage: if you keep improving the fit to the observations, you run into overfitting. When a new step
# improves the fit (reduces the error), you do not adjust the current fit by the whole amount; instead you
# add only e.g. 1% or 10% (the learning rate) of the knowledge of the better model.
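A sketch of how the learning rate enters a boosted-tree fit (gbm package; the parameter values here are assumptions, not the course settings):
library(gbm)
gbmTuned <- gbm.fit(x = solTrainXtrans, y = solTrainY, distribution = "gaussian",
                    n.trees = 1000, interaction.depth = 3,
                    shrinkage = 0.01)   # learning rate: add only 1% of each new tree's contribution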
Useful:
function gbm.fit – interpretation of results: JS: iterative approach (not relevant for the exam), where the
deviance of the training set is optimized (deviance is an alternative measure of fit we did not talk about).
In each iteration step, boosting focuses on those observations that are poorly fitted in the previous step.
More details are not relevant here.
train: JS: for cross-validated estimation
min number of observations in each terminal node: JS: similar to the 5% in the final nodes of the tree above
configurations (number of trees): JS: equivalent to the number of bootstrap samples in bagging
xgbTree: JS: boosting can be applied to very many statistical methods, here, we use boosted
trees. The train-Function also covers a huge number of methods,
compare: http://topepo.github.io/caret/train-models-by-tag.html
Exercise code 7:
Disadvantages of the LPM (linear probability model): biased estimates, and fitted probabilities can be larger than 100% or lower than 0%.
Exercise:
data(wage1)
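The call that produced lpm is not shown in the notes; a plausible specification (hypothetical, inferred from lpm2 further below, which drops the gender dummy):
dat <- wage1
lpm <- lm(south ~ female + exper + expersq + smsa + northcen + west + ndurman, data = dat)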
summary(lpm)
thres <- 0.5 # for model calibration – changing the threshold is important for the ROC curve
# If threshold is changed, one can approach all points on the corresponding roc-curve
#Larger thres-values yield more fitted values 0, while smaller thres-values yield more fitted values 1.
# assign 1 if the fitted probability is larger than the threshold, and 0 otherwise
y.vs.yhat <- data.frame(
  y = dat$south,
  yhat.prb = fitted(lpm),
  yhat = as.numeric(fitted(lpm) > thres)   # fitted class according to the threshold
)
head(y.vs.yhat)
# Confusion matrix interpretation: 18 people are correctly identified as stemming from the South. If we use a
# lower threshold (e.g. 0.2), we correctly identify all 187 people coming from the South, but at the cost of
# falsely classifying 250 people who do not come from the South as coming from the South. Trade-off between
# sensitivity and specificity. (A minimal confusion-matrix call is sketched below.)
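A minimal way to tabulate such a confusion matrix from the data frame above (sketch):
table(observed = y.vs.yhat$y, predicted = y.vs.yhat$yhat)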
library(pROC)
# Compare with an improved model where gender is omitted, because it is not significant:
lpm2 <- lm(south ~ exper + expersq + smsa + northcen + west + ndurman, data = dat)
summary(lpm2)
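The construction of the ROC objects used below is not shown in the notes; a plausible sketch with pROC::roc (argument choices assumed):
rocCurve  <- roc(response = dat$south, predictor = fitted(lpm))
rocCurve2 <- roc(response = dat$south, predictor = fitted(lpm2))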
plot(rocCurve)
auc(rocCurve2)
ROC curve
For the given data we can only reach the points on the ROC curve. With a threshold of 0.5 we would reach
TPR = 0.8 and FPR = 0.1.
Uninformed diagonal – what we would obtain from pure guessing.
Perfect discrimination point (FPR = 0; TPR = 1) – all events and all non-events are correctly
identified.
Bottom left point (FPR = 0, TPR = 0) – always estimate 0
Top right point (FPR = 1, TPR = 1) – always estimate 1
The further the ROC curve is from the diagonal, the better the corresponding estimator is.
AUC – area under the curve
With 95% probability, given the data and the model, the AUC lies between 95.6% and 99.1%.
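This interval is the pROC confidence interval for the AUC (sketch, assuming the rocCurve object constructed above):
ci(rocCurve)   # 95% confidence interval for the AUC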
library(caret)
library(MASS)
library(AppliedPredictiveModeling)
library(randomForest)
library(pROC)
library(klaR)
plot(lda.obj)
#qda.obj
#predict(qda.obj)
1. Maximum likelihood – the densities differ only in their means => to distinguish between the two Normal distributions, the cut-off is the average of the two means.
(x.star.ML <- (mean.0 + mean.1)/2)
2. Bayes – additionally accounts for the fact that there are more observations in one group of Y than in the other.
(x.star.Bayes <- (mean.0 + mean.1)/2 - log(prior.0/prior.1)*sig2.hat/(mean.0 - mean.1))
d.0.resc<- d.0*prior.0
d.1.resc<- d.1*prior.1
### Estimated posterior probability values for the maximum-likelihood decision rule (for Bayes they are
given by the predict function of LDA)
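A sketch of these posterior probabilities, built from the density values d.0, d.1 and their rescaled versions from above:
post.1.ML    <- d.1 / (d.0 + d.1)                  # ML rule: priors ignored (treated as equal)
post.1.Bayes <- d.1.resc / (d.0.resc + d.1.resc)   # Bayes rule: priors taken into account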
The dashed densities have the same height; only their means differ.
Dashed vertical line – ML indifference point (intersection of the dashed densities). To the left of it,
whenever X is lower than the ML indifference point, the probability of being in group 0 is higher – the blue
density is higher than the orange one – so assign y.hat = 0. Whenever X is higher than the ML indifference
point, the orange density is higher than the blue one => assign y.hat = 1.
If we multiply every y-value of the dashed blue density by 0.58 (the prior of group 0), we get the solid blue density.
Solid vertical line – Bayes indifference point (intersection of the solid, rescaled densities). Same decision
rule as for ML. The Bayes line lies further to the right than the ML line because we know there are more
observations with y = 0 than with y = 1.
Exercise:
set.seed(1234)
head(res)
for(fo in fold.set){
#fo <- 1
# LPM
mod.lp <- lm(west ~ northcen + south + tenure + tenursq, data = dat[fold !=fo, ])
# LDA
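# assumed (the fit is not shown in the notes): LDA with the same covariates as the LPM
mod.lda <- lda(as.factor(west) ~ northcen + south + tenure + tenursq, data = dat[fold != fo, ])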
res[res$fold == fo, "lda"] <- predict(mod.lda, newdata = dat[fold == fo, ])$posterior[, "1"]
# k-NN
for(k in 5:15){
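  # assumed (the fit is not shown): k-NN via caret::knn3, whose predict() returns class probabilities
  mod.knn <- knn3(as.factor(west) ~ northcen + south + tenure + tenursq,
                  data = dat[fold != fo, ], k = k)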
res[res$fold == fo, paste("nn", k, sep = "")] <- predict(object = mod.knn, newdata = dat[fold == fo, ])
[, "1"]
# RF
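# assumed (the fit is not shown): random forest; ntree is taken from a surrounding loop in the original code
mod.rf <- randomForest(as.factor(west) ~ northcen + south + tenure + tenursq,
                       data = dat[fold != fo, ], ntree = ntree)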
res[res$fold == fo, paste("rf", ntree, sep = "")] <- predict(object = mod.rf, newdata = dat[fold ==
fo, ], type = "prob")[, "1"]
}
#c) Compare the performance of the models in b) and try to improve the specification of every
model. Discuss your findings. Improvement also partially included.
# Cross-validation with train for the random forest. How can you extract/create the confusion matrix?
rf.fit <- train(x = dat[, setdiff(names(dat), "south")],             # assumed: all remaining columns as predictors
                y = as.factor(dat$south),
                method = "rf",
                trControl = trainControl(method = "cv", number = 5)) # assumed resampling setup
rf.fit$finalModel$confusion
ROC
set.seed(12345)
rf.card <- randomForest(                       # assumed object name and data arguments
  x = dat[, setdiff(names(dat), "card")],
  y = as.factor(dat$card),
  importance = TRUE,
  ntree = 100
)
y.pred <- predict(rf.card)                     # assumed: predicted classes, used below
y <- dat$card
confusionMatrix(data = y.pred,
                reference = y,
                positive = "yes")
rocCurve <- roc(response = y,                  # assumed: ROC object built from the class predictions
                predictor = as.numeric(y.pred))
auc(rocCurve) # area under the curve is 0.53, "half way" between the diagonal and the perfect discrimination point
ci(rocCurve)  # confidence interval: with 95% probability, given the data and the model, the AUC lies between 0.49 and 0.58