
Multivariate statistics in R

Tutorial 4

26/7 2024

Andreas Tilevik
Institutionen för Biovetenskap
Högskolan i Skövde

Table of Contents
Classification
    Exercise 4.1
ROC curve
    Exercise 4.2
Validation
    The hold-out method
    Cross-validation
    Exercise 4.3

Classification
In this section, we will learn how to compute different metrics that describe how well a diagnostic test, or a biomarker, can predict a certain disease.

Sensitivity and specificity
https://play.his.se/media/Sensitivity/0_0b86sos4

How to calculate the likelihood ratio
https://play.his.se/media/Likelihood+Ratio/0_q110ghfm

The positive and negative predictive values (PPV and NPV)
https://play.his.se/media/PPV/0_lupqqklm

As an example, we will here calculate the sensitivity and specificity based on a small data set. One
biomarker to distinguish between healthy subjects and subjects with prostate cancer is the prostate-
specific antigen (PSA). Let’s say that we have measured the PSA level of 14 subjects, where we know
that seven of these subjects are healthy and that the other seven subjects have prostate cancer. A
prostate biopsy has been performed to determine whether the subjects have prostate cancer or not.
The PSA levels in prostate cancer subjects and healthy subjects are shown below:

Prostate cancer   NO   NO   NO   NO   NO   NO   NO   YES  YES  YES  YES  YES  YES  YES
PSA level (µg/L)  2.5  2.0  1.7  1.4  1.2  0.9  0.8  3.8  3.4  2.9  2.8  2.7  2.1  1.6

Let’s enter the data in R:

rm(list=ls()) # Clear memory


C=c(3.8, 3.4, 2.9, 2.8, 2.7, 2.1, 1.6) # Cancer
H=c(2.5, 2.0, 1.7, 1.4, 1.2, 0.9, 0.8) # Healthy
PSA=c(C,H)
Group=factor(rep(c("Cancer","Healthy"),c(7,7)),levels=c("Healthy","Cancer"))
df=data.frame(Group,PSA)

Note that, when we create the grouping factor that defines whether a person is healthy or has the disease, we should always set the healthy group as our baseline, which means that the category “Healthy” should always come first. This is why we use the argument “levels”, where we tell R that “Healthy” should be our first category. By default, R puts the categories in alphabetical order, so we need to change the order in this case. If we print the variable “Group”:

Group
Cancer Cancer Cancer Cancer Cancer Cancer Cancer Healthy Healthy …..
Levels: Healthy Cancer

we see that the row “Levels” now tells us that the category “Healthy” comes first, which is exactly what we want. Try changing the order of the categories in the code above and make sure you understand the meaning of levels. Before you move on, make sure that “Healthy” is your first category. Let’s plot the data:
stripchart(PSA~Group,data=df,ylim=c(0,4),ylab=c("PSA (μg/L)"),vertical = T,xlim=c(0.7,2.3) ,las=1)
abline(v=1.5) # Add a vertical line in the plot

We would now like to find a threshold that can be used to determine if a patient has prostate cancer
or not based on a simple PSA measurement. Looking at the figure above, it seems like an appropriate
threshold value is 2.3 µg/L. Let’s place a horizontal line in the figure for this cutoff value:

abline(h=2.3,lty=2,col="red") # Place a line in the figure that represents a cutoff value (2.3 µg/L)

We could use this cutoff value to predict whether a new patient has prostate cancer. However, how accurate is such a threshold value? From the figure above, we see that if we apply this threshold to the subjects whose status is already known (healthy or diagnosed with prostate cancer, e.g. determined by biopsy of the prostate), we misclassify three subjects. One healthy subject is incorrectly classified as having cancer, whereas two subjects with prostate cancer are incorrectly classified as being healthy. Thus, we have four possible outcomes:

TP – true positives: number of prostate cancer patients predicted as having prostate cancer
TN – true negatives: number of healthy subjects predicted as being healthy
FP – false positives: number of healthy subjects incorrectly predicted as having prostate cancer
FN – false negatives: number of prostate cancer patients incorrectly predicted as being healthy
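
Schematically, these four outcomes form a 2x2 confusion matrix. With “Healthy” as the first factor level, the “table” function used below will arrange them like this (predicted class in rows, actual class in columns):

               Healthy  Cancer
Pred Healthy      TN      FN
Pred Cancer       FP      TP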

Confusion matrix in R
We will now compute a confusion matrix in R. To create such a table, we need a vector that defines how each individual is classified according to the cutoff value. We can use the function “ifelse” to assign the predicted class based on whether the PSA level is above or below the cutoff of 2.3:

pred=factor(ifelse(df$PSA<2.3,"Healthy","Cancer"),levels=c("Healthy","Cancer"))
df=data.frame(df,pred)
df
Group PSA pred
1 Cancer 3.8 Cancer
2 Cancer 3.4 Cancer
3 Cancer 2.9 Cancer
4 Cancer 2.8 Cancer
5 Cancer 2.7 Cancer
6 Cancer 2.1 Healthy
7 Cancer 1.6 Healthy
8 Healthy 2.5 Cancer
9 Healthy 2.0 Healthy
10 Healthy 1.7 Healthy
11 Healthy 1.4 Healthy
12 Healthy 1.2 Healthy
13 Healthy 0.9 Healthy
14 Healthy 0.8 Healthy

We can generate a table of the two vectors by using the function “table”:

tab = table(pred,Group)
tab
Group
pred Healthy Cancer
Healthy 6 2
Cancer 1 5

where the rows represent the predicted class and the columns the actual class. We can now define the
TP, TN, FN and FP:

TP=tab[2,2] # True positive
FP=tab[2,1] # False positive
TN=tab[1,1] # True negative
FN=tab[1,2] # False negative
print(c(TP=TP,FP=FP,TN=TN,FN=FN))
TP FP TN FN
5 1 6 2

We can now calculate the metrics for the predictive performance of our cutoff value:

Sensitivity=TP/(TP+FN)
Sensitivity
0.7142857

Specificity=TN/(TN+FP)
Specificity
0.8571429

Accuracy=(TP+TN)/(TP+TN+FP+FN) # or sum(diag(tab))/sum(tab)
Accuracy
0.7857143

pLR=Sensitivity/(1-Specificity) # Positive likelihood ratio
pLR
5

nLR=(1-Sensitivity)/Specificity # Negative likelihood ratio
nLR
0.3333333

Prevalence=(TP+FN)/(TP+FN+FP+TN) # Prevalence in sample
Prevalence
0.5
# Positive predictive value
PPV=(Sensitivity*Prevalence)/((Sensitivity*Prevalence)+(1-Specificity)*(1-Prevalence))
PPV
0.8333333
# Negative predictive value
NPV=(Specificity*(1-Prevalence))/(((1-Sensitivity)*Prevalence)+Specificity*(1-Prevalence))
NPV
0.75
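
As a compact recap, the calculations above can be wrapped in a small helper function (a sketch written for this tutorial; the name “class_metrics” is our own and not from any package). Note that when the prevalence is estimated from the sample itself, the Bayes formulas for PPV and NPV reduce to the direct table estimates TP/(TP+FP) and TN/(TN+FN):

# Sketch: compute the main metrics from the four cells of a confusion matrix.
# The argument "prevalence" defaults to the sample prevalence (TP+FN)/N.
class_metrics=function(TP,FP,TN,FN,prevalence=(TP+FN)/(TP+FP+TN+FN)){
  sens=TP/(TP+FN) # Sensitivity
  spec=TN/(TN+FP) # Specificity
  PPV=(sens*prevalence)/((sens*prevalence)+(1-spec)*(1-prevalence)) # Bayes' theorem
  NPV=(spec*(1-prevalence))/(((1-sens)*prevalence)+spec*(1-prevalence))
  c(Sensitivity=sens,Specificity=spec,Accuracy=(TP+TN)/(TP+FP+TN+FN),PPV=PPV,NPV=NPV)
}
class_metrics(TP=5,FP=1,TN=6,FN=2) # Reproduces the values computed above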

We can compute the same metrics, and many more, with the function “confusionMatrix” from the “caret” package. Note that we need to state which category is defined as a “positive” result. In most cases, this is the disease category, which is why we set the argument “positive” to “Cancer”:

library(caret)
confusionMatrix(data=pred,reference=Group, positive="Cancer")
Reference
Prediction Healthy Cancer
Healthy 6 2
Cancer 1 5

Accuracy : 0.7857
95% CI : (0.492, 0.9534)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.02869

Kappa : 0.5714
Mcnemar's Test P-Value : 1.00000

Sensitivity : 0.7143
Specificity : 0.8571
Pos Pred Value : 0.8333
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.3571
Detection Prevalence : 0.4286
Balanced Accuracy : 0.7857

'Positive' Class : Cancer

Prior probability and PPV

Using a cutoff level of 2.3 results in a sensitivity of 71% and a specificity of 86%. For our study, we selected the same sample size for the two groups. However, let’s say that we know from previous studies that, of the people being tested for prostate cancer, 90% are healthy and 10% have prostate cancer. Our prior probabilities are then 0.9 and 0.1. We can calculate the PPV and NPV with the same code as before, but where we now set the prevalence to 0.1:

Prevalence=0.1
PPV=(Sensitivity*Prevalence)/((Sensitivity*Prevalence)+(1-Specificity)*(1-Prevalence))
PPV
0.3571429
NPV=(Specificity*(1-Prevalence))/(((1-Sensitivity)*Prevalence)+Specificity*(1-Prevalence))
NPV
0.9642857

We can compute the same values by setting the argument “prevalence” to 0.1 in the function
“confusionMatrix”:

library(caret)
confusionMatrix(data=pred,reference=Group, positive="Cancer",prevalence=0.1)

Sensitivity : 0.7143
Specificity : 0.8571
Pos Pred Value : 0.3571
Neg Pred Value : 0.9643

Using another cutoff value

All our previous calculations were based on the cutoff value of 2.3 µg/L. However, all the metrics will change if we change the cutoff value. Let’s say that we want to increase the sensitivity because we think it is really important to identify all individuals with prostate cancer with our PSA test. We can then simply reduce the cutoff value to, for example, 1.5 µg/L:

stripchart(PSA~Group,data=df,ylim=c(0,4),ylab=c("PSA (μg/L)"),vertical = T,xlim=c(0.7,2.3) ,las=1)
abline(v=1.5) # Add a vertical line to the plot
abline(h=1.5,lty=2,col="red") # Cutoff value

With this low cutoff value, we get 100% sensitivity, since all prostate cancer patients are classified as having prostate cancer (no prostate cancer patient is classified as being healthy). Note that the 100% sensitivity only applies to our sample. If we used this cutoff value on a new data set including hundreds of individuals, then a few prostate cancer patients would almost certainly have a PSA level below 1.5 µg/L. We will later see how to validate our classifier to obtain a better estimate of the true accuracy. Let’s compute the confusion matrix:

cutoff=1.5
pred=factor(ifelse(PSA<cutoff,"Healthy","Cancer"),levels=c("Healthy","Cancer"))
confusionMatrix(data=pred,reference=Group,positive="Cancer")
Reference
Prediction Healthy Cancer
Healthy 4 0
Cancer 3 7
….

Accuracy : 0.7857

Sensitivity : 1.0000
Specificity : 0.5714

We see that the accuracy is still the same, since we still misclassify three individuals. The sensitivity is now 100%, whereas the specificity has been reduced to 57%.

Exercise 4.1

1. Make a strip chart of the variable body depth (BD) from the Crab data, where you separate the two color forms (Blue (B) and Orange (O)). Place a cutoff line at 11 and set the argument “method” to “jitter” so that you see all the data points.

2. Based on the plot you have generated, with the cutoff value of 11, how many false negatives do you have, given that you set the color blue as the baseline category (orange is therefore considered the positive class)?

3. Based on the same data and cutoff value as in the previous question, compute sensitivity, specificity, PPV, and NPV.

ROC curve
We will here have a look at the so-called receiver operating characteristic (ROC) curve, which can be
used to find an appropriate cutoff value, given a certain sensitivity and specificity. ROC curves are also
commonly used to compare how well different biomarkers perform.

ROC curve
https://play.his.se/media/ROC/0_kjbx9tq6

The ROC curve represents the calculated specificity and sensitivity for different cutoff values. Let’s first
enter the same data that we have used previously:

rm(list=ls()) # Clear memory


C=c(3.8, 3.4, 2.9, 2.8, 2.7, 2.1, 1.6) # Cancer
H=c(2.5, 2.0, 1.7, 1.4, 1.2, 0.9, 0.8) # Healthy
PSA=c(C,H)
Group=factor(rep(c("Cancer","Healthy"),c(7,7)) ,levels=c("Healthy","Cancer"))
df=data.frame(Group,PSA)

We will here use the pROC package to generate an ROC curve with the function “roc”:

library(pROC)
roc1=roc(Group, PSA)
Setting levels: control = Healthy, case = Cancer
Setting direction: controls < cases

Note that the output tells us that the “Healthy” category has been set as the “control” and the “Cancer” category as the “case”. This is exactly what we want, because the healthy category should be our baseline. To generate a nice printout of the different sensitivities and specificities for given cutoff values, we can print the output in a data frame format:
data.frame(roc1$thresholds,roc1$sensitivities,roc1$specificities)
roc1.thresholds roc1.sensitivities roc1.specificities
1 -Inf 1.0000000 0.0000000
2 0.85 1.0000000 0.1428571
3 1.05 1.0000000 0.2857143
4 1.30 1.0000000 0.4285714
5 1.50 1.0000000 0.5714286
6 1.65 0.8571429 0.5714286
7 1.85 0.8571429 0.7142857
8 2.05 0.8571429 0.8571429
9 2.30 0.7142857 0.8571429
10 2.60 0.7142857 1.0000000
11 2.75 0.5714286 1.0000000
12 2.85 0.4285714 1.0000000
13 3.15 0.2857143 1.0000000
14 3.60 0.1428571 1.0000000
15 Inf 0.0000000 1.0000000

From this output, we can see how different cutoff values affect sensitivity and specificity. If we consider sensitivity and specificity to be equally important, an appropriate threshold value seems to be a PSA level of 2.05 µg/L. If we want 100% sensitivity for the test, a threshold value of 1.5 µg/L could be used, although this cutoff reduces the specificity to 57%. However, note that these results only apply to our sample and that we need to validate the sensitivity and specificity for a given threshold with test data or by using cross-validation, which we will do later on. Let’s generate an ROC curve, which is simply a curve representing the values in the two columns above for sensitivity and specificity:

plot(roc1)

The thin diagonal line is our reference line. The thick curve is our receiver operating characteristic (ROC) curve. What we want is an ROC curve that approaches the top-left corner, so that our test has as high sensitivity and specificity as possible. If your two groups show weak separation (the values from the two groups are completely mixed), the ROC curve will stay close to the reference line.
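
To see what a weak biomarker looks like, you can add the ROC curve of a variable that is unrelated to the groups (a quick illustration, assuming the objects from the code above are still in memory; with only 14 observations, the AUC of pure noise can still deviate from 0.5 by chance):

set.seed(1)
noise=rnorm(14) # A "biomarker" with no relation to the groups
roc_noise=roc(Group,noise)
roc_noise$auc # Should lie near 0.5
lines(roc_noise,col="blue") # Add the noise curve to the existing ROC plot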

Optimal cutoff value

The package pROC includes a function that can help us select an optimal cutoff value. One such measure is the threshold that maximizes the vertical distance to the reference line, i.e. the Youden index (J = sensitivity + specificity - 1):

coords(roc1,"best", best.method="youden",transpose = FALSE)
threshold specificity sensitivity
2.6000000 1.0000000 0.7142857

Another measure is the threshold point that is closest to the top-left corner of the ROC curve plot:

coords(roc1, "best", best.method="closest.topleft",transpose = FALSE)
threshold specificity sensitivity
2.0500000 0.8571429 0.8571429
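
The Youden index can also be computed by hand from the sensitivities and specificities stored in the roc object (a small sketch added for illustration). Listing J for every threshold shows that the thresholds 2.05 and 2.60 actually tie at J = 5/7 ≈ 0.71 in this small sample, so the “youden” method could equally well have reported 2.05:

youden=roc1$sensitivities+roc1$specificities-1 # Youden index J = sensitivity + specificity - 1
data.frame(threshold=roc1$thresholds,youden)
max(youden) # 0.7142857, reached at both 2.05 and 2.60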

PPV and NPV

If we know the prevalence of the disease, we can also calculate the PPV and NPV for different cutoff values. Let’s say that we know, from earlier studies, that the prevalence of prostate cancer among the people taking the test is 1%. The PPV and NPV for the different cutoffs can then be calculated by:

Prevalence=0.01
Sensitivity=roc1$sensitivities
Specificity=roc1$specificities
PPV=(Sensitivity*Prevalence)/((Sensitivity*Prevalence)+(1-Specificity)*(1-Prevalence))
NPV=(Specificity*(1-Prevalence))/(((1-Sensitivity)*Prevalence)+Specificity*(1-Prevalence))
data.frame(roc1$thresholds,PPV,NPV,Sensitivity,Specificity) # Print
roc1.thresholds PPV NPV Sensitivity Specificity
1 -Inf 0.01000000 NaN 1.0000000 0.0000000
2 0.85 0.01164725 1.0000000 1.0000000 0.1428571
3 1.05 0.01394422 1.0000000 1.0000000 0.2857143
4 1.30 0.01736973 1.0000000 1.0000000 0.4285714
5 1.50 0.02302632 1.0000000 1.0000000 0.5714286
6 1.65 0.01980198 0.9974811 0.8571429 0.5714286
7 1.85 0.02941176 0.9979839 0.8571429 0.7142857
8 2.05 0.05714286 0.9983193 0.8571429 0.8571429
9 2.30 0.04807692 0.9966443 0.7142857 0.8571429
10 2.60 1.00000000 0.9971223 0.7142857 1.0000000
11 2.75 1.00000000 0.9956897 0.5714286 1.0000000
12 2.85 1.00000000 0.9942611 0.4285714 1.0000000
13 3.15 1.00000000 0.9928367 0.2857143 1.0000000
14 3.60 1.00000000 0.9914163 0.1428571 1.0000000
15 Inf NaN 0.9900000 0.0000000 1.0000000

We see that the PPV is only 5.7% when the sensitivity and specificity are both 86%: because only 1% of those tested actually have the disease, most positive test results are false positives.

The area under the ROC curve

The area under the ROC curve (AUC) is an important measure of how well a test performs. The AUC can vary between zero and one, and the area under the reference line is exactly 0.5. What we want is an area that is close to one and, at the very least, significantly different from 0.5.
We can extract the area under the ROC curve from the output by:

roc1$auc
Area under the curve: 0.9184

We can generate a 95% CI for the AUC by:

ci.auc(roc1)
95% CI: 0.7723-1 (DeLong)

We see that the 95% confidence interval does not include the value 0.5. Hence, we can conclude that the AUC value is significantly greater than 0.5. It is also possible to compute a p-value based on a Mann-Whitney U test. Let’s compute a two-sided Mann-Whitney U test to check whether our AUC is significantly different from 0.5:

wilcox.test(PSA~Group, exact=FALSE, correct=FALSE)

data: PSA by Group
W = 4, p-value = 0.008809

We see that the p-value is less than 0.05, so we can conclude that the area is significantly different from (and here, greater than) 0.5.
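
The AUC and the Mann-Whitney statistic are directly related: the AUC equals the proportion of all (healthy, cancer) pairs in which the cancer subject has the higher PSA value, which can be recovered from the W statistic above (a quick check added here; n = 7 in each group, so there are 49 pairs):

W=unname(wilcox.test(PSA~Group, exact=FALSE, correct=FALSE)$statistic)
1-W/(7*7) # 0.9183673, which matches roc1$auc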

Comparing ROC curves

If your data set includes several variables, you might want to plot an ROC curve for each variable and compare the curves. As an example, we will here use the Iris data set, where we only focus on the classes Versicolor and Virginica. We will plot one ROC curve for the petal width and one for the sepal width. From the figure below, we can see that Versicolor is better separated from Virginica by the petal width than by the sepal width:

Let’s first prepare the data:

rm(list=ls()) # Clear memory


Pet_width=iris[51:150,4] # Petal width for Versicolor and Virginica
Sep_width=iris[51:150,2] # Sepal width for Versicolor and Virginica
Species1=factor(iris$Species[51:150]) # Species class (Versicolor and Virginica)

Note that we here set the species Versicolor as our baseline (=0), but the baseline in this example is
arbitrary. We then plot one ROC curve for the petal width and one curve for the sepal width.

(roc1=roc(Species1, Pet_width))
Area under the curve: 0.9804
plot(roc1)
(roc2=roc(Species1, Sep_width))
Area under the curve: 0.6636
lines(roc2,col="red")
legend("bottomright",c("Petal width","Sepal Width"),col=1:2,lty=1,lwd=2)

We see that we can get about 90% sensitivity and specificity using the variable petal width, whereas the curve for the sepal width is close to the reference line. The AUC value for the petal width is 0.98, whereas the AUC value for the sepal width is only 0.66. By using the function “roc.test”, we can test whether there is a significant difference in AUC between the two curves. Since the two ROC curves are built on variables from the same sample (repeated measures on the same flowers), we set the argument “paired” to TRUE:

roc.test(roc1,roc2, paired=TRUE)
DeLong's test for two correlated ROC curves
data: roc1 and roc2
Z = 6.2869, p-value = 3.238e-10
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9804 0.6636

Based on the low p-value, we can conclude that the AUC is significantly greater for the petal width
compared to the sepal width.

Exercise 4.2

1. Use the variable body depth (BD) from the Crab data set to classify the two color forms (Blue (B) and Orange (O)). Generate an ROC curve of the data.

2. Calculate the 95% confidence interval of your AUC value. Is your AUC value significantly
greater than 0.5?

3. Use the “youden” method to find an appropriate cutoff value. Report the cutoff value and
the corresponding sensitivity and specificity based on this cutoff.

4. Create two ROC curves that represent the two variables body depth (BD) and rear width
(RW) in the same plot. Use the “roc.test” to test if there is a significant difference in AUC
between the two curves.

Validation
In this section, we will see how to validate the performance of a test or a biomarker by using either
the hold-out method or cross-validation.

Validation techniques - hold-out, cross-validation, LOOCV
https://play.his.se/media/Validation/0_0vvuh1bp

We will here learn how to perform validation based on the Iris data set. To simplify, we will reduce this data set so that we only have the petal length and the two species Versicolor and Virginica:
PL=iris[51:150,3]
Species=factor(iris[51:150,5])
df=data.frame(PL,Species)
stripchart(PL~Species,data=df,ylim=c(2.8,7),method="jitter",vertical = T,xlim=c(0.7,2.3) ,las=1)

To see how well a single cutoff line can separate the groups, we can use, for example, a cutoff line
based on the mean of the two group means:
cutoff=mean(tapply(PL,Species,mean))
abline(h=cutoff,lty=2,col="red")

If we use this cutoff to calculate the confusion matrix and the accuracy:
pred=factor(ifelse(PL<cutoff,"versicolor","virginica"))
table(pred,Species)
Species
pred versicolor virginica
versicolor 48 6
virginica 2 44
sum(diag(table(pred,Species)))/nrow(df)
0.92

we see that 92 of the 100 flowers are correctly predicted. However, by determining the cutoff value on all the data, we usually overestimate the accuracy: a cutoff value that is optimal for this particular data set might not be optimal for a new data set. To get a better estimate of how well a single cutoff value can separate the groups, we need to use a validation method.

The hold-out method

The so-called hold-out method can be used to separate the data into two groups: a training group and a test group. Let’s randomly split the 100 flowers into a training group and a test group. In this example, we use 60% of the data for the training group and 40% for the test group. We can use the function “sample” to generate a random sequence of numbers in a certain range. To illustrate how this function works, we will here generate a random permutation of the numbers 1 to 10:
set.seed(10)
sample(1:10)
6 3 4 5 1 2 7 10 8 9

Before we run the function “sample”, we use the function “set.seed” with some arbitrary number. If you run these two lines several times, you will get the exact same random sequence every time. This is convenient because you will then get the same random sequence of numbers as shown in this tutorial. NOTE: if you get different numbers compared to the ones above, you might be using a different R version than the one used to generate these numbers (version 4.0.5). In that case, you will not get exactly the same results as shown in this tutorial.

To select 60% of the Versicolors for the training data, we can create a sequence of random numbers between 1 and 50, because the Versicolors are located in rows 1 to 50 of the data frame. We also generate a sequence between 51 and 100, because all the Virginicas are located in these rows.

set.seed(150)
ind_Versicolor=sample(1:50)
ind_Virginica=sample(51:100)

Then we take the first 30 numbers of these two random sequences so that we select 30 out of the 50
flowers of each species (60%):

ind_train=c(ind_Versicolor[1:30],ind_Virginica[1:30])
ind_train
3 22 23 40 12 18 17 15 44 36 48 41 31 16 25 34 20 8 33 10 43 4 1 2 13 19 6 45 24 26
92 100 94 84 56 72 99 54 66 86 85 87 51 69 73 89 81 63 64 75 68 79 76 82 52 71 80 83 53
97

The random numbers you see above will be used to select the rows of the data frame to be used in
our training data:
train=df[ind_train,]
nrow(train) # Check how many flowers we have in the training data
60

We see that we have selected 60 out of the 100 flowers for our training data. For the test data, we will
select all the flowers that were not included in the training data. This can be done by simply using a
minus sign in front of “ind_train”:
test=df[-ind_train,]
nrow(test)
40

A simpler way to do the same thing is to use the function “createDataPartition” from the caret package:
library(caret)
set.seed(150)
ind = createDataPartition(df$Species, p = 0.6, list=FALSE)
training = df[ind,]
table(training$Species)
versicolor virginica
30 30
testing = df[-ind,]
table(testing$Species)
versicolor virginica
20 20

Note that when we supply a categorical variable (df$Species) to the function, it draws the random indices so that each group is represented in the same proportion (a stratified split).

We will now compute a cutoff based on the training data:

cutoff=mean(tapply(train$PL,train$Species,mean))
cutoff
4.96

Next, we will use this cutoff value to see how well it can separate the two species in the test data set:
pred=factor(ifelse(test$PL<cutoff,"versicolor","virginica"))
table(pred,test$Species)
pred versicolor virginica
versicolor 19 4
virginica 1 16
sum(diag(table(pred, test$Species)))/nrow(test)
0.875

We see that the accuracy is 87.5% when we use the hold-out method, compared with 92% when we used all the data. An accuracy that is based on the test data is a better estimate of the true accuracy, because those data were not used to identify the optimal cutoff value.

Cross-validation
Another validation approach, useful when the sample size is small, is to approximate the accuracy by using K-fold cross-validation, where the sample is divided into K equal subsamples. One subsample is retained as the test set and the remaining subsamples are used as training data. The cross-validation is then repeated K times, so that each subsample is used exactly once as test data. The accuracy is then averaged over the K folds. K should be adjusted to the sample size: a small sample size requires a large K. When K equals the sample size, the cross-validation is called leave-one-out cross-validation (LOOCV). To understand the LOOCV algorithm, we will here implement it with a simple for-loop. In every iteration, we remove one data point, calculate a new cutoff value based on the mean of the remaining data, and test whether the left-out data point is correctly predicted:

n=nrow(df) # Number of data points
pred=NULL # Create a vector to save predictions
for (i in 1:n){
  rm_data=df[i,] # Data point in row i will be removed
  new=df[-i,] # Create a new data frame without data point i
  cutoff=mean(tapply(new$PL,new$Species,mean)) # Compute cutoff without data point i
  pred[i]=ifelse(rm_data$PL<cutoff,"versicolor","virginica") # Predict the left-out point
}
table(pred,Species)
Species
pred versicolor virginica
versicolor 46 6
virginica 4 44

sum(diag(table(pred, Species)))/nrow(df)
0.9

By using the LOOCV we get an accuracy of 90%.
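
The LOOCV loop above is the special case K = n. A general K-fold cross-validation can be sketched in the same way, for example with the function “createFolds” from the caret package (a sketch, assuming the data frame “df” with the columns PL and Species is still in memory; the exact accuracy depends on the random seed):

library(caret)
set.seed(1)
folds=createFolds(df$Species,k=10) # 10 stratified subsamples (lists of test-set row indices)
acc=numeric(length(folds)) # Vector to save the accuracy of each fold
for (i in seq_along(folds)){
  test=df[folds[[i]],] # Fold i is held out as test data
  train=df[-folds[[i]],] # The remaining folds form the training data
  cutoff=mean(tapply(train$PL,train$Species,mean)) # Cutoff from the training data only
  pred=ifelse(test$PL<cutoff,"versicolor","virginica")
  acc[i]=mean(pred==test$Species) # Accuracy on the held-out fold
}
mean(acc) # Average accuracy over the 10 folds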

Exercise 4.3

1. Use the hold-out method as a validation method to compute the accuracy of how well the variable body depth (BD) can classify the two color forms (Blue (B) and Orange (O)). Use the training data set (60% of the total data) to compute an appropriate cutoff value based on the ROC curve, using the Youden method. Then use the test data to see how well this cutoff value can discriminate between the two groups. Calculate the accuracy.

2. Use LOOCV to compute the accuracy on all the data for the variable body depth (BD), using the Youden method to determine the cutoff value.
