Multivariate statistics, Tutorial 4: Sensitivity, Specificity, ROC and Validation
Tutorial 4
26/7 2024
Andreas Tilevik
School of Bioscience
University of Skövde
Table of Contents
Classification ............................................................................................................................................ 3
Exercise 4.1 .......................................................................................................................................... 9
ROC curve ................................................................................................................................................ 9
Exercise 4.2 ........................................................................................................................................ 14
Validation .............................................................................................................................................. 15
The hold-out method ........................................................................................................................ 16
Cross-validation ................................................................................................................................. 18
Exercise 4.3 ........................................................................................................................................ 19
Classification
In this section, we will learn how to compute different metrics that can describe how well a
diagnostic test, or a biomarker, can be used to predict a certain disease.
https://play.his.se/media/PPV/0_lupqqklm
As an example, we will here calculate the sensitivity and specificity based on a small data set. One
biomarker to distinguish between healthy subjects and subjects with prostate cancer is the prostate-
specific antigen (PSA). Let’s say that we have measured the PSA level of 14 subjects, where we know
that seven of these subjects are healthy and that the other seven subjects have prostate cancer. A
prostate biopsy has been performed to determine whether the subjects have prostate cancer or not.
The PSA levels in prostate cancer subjects and healthy subjects are shown below:
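One way to enter these data in R is sketched below; the values are the same as in the data frame printed later in this section. Note the "levels" argument, which is discussed next:
Group=factor(rep(c("Cancer","Healthy"),each=7),levels=c("Healthy","Cancer"))
PSA=c(3.8,3.4,2.9,2.8,2.7,2.1,1.6,   # PSA levels of the seven cancer patients
      2.5,2.0,1.7,1.4,1.2,0.9,0.8)   # PSA levels of the seven healthy subjects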
df=data.frame(Group,PSA)
Note that, when we create the grouping factor that defines if a person is healthy or has the disease,
we should always set the healthy group as our baseline, which means that the category “Healthy”
should always come first. This is why we use the argument “levels” where we tell R that “Healthy”
should be our first category. By default, R puts the categories in alphabetical order, so we need to
change the order in this case. If we print the variable “Group”:
Group
Cancer Cancer Cancer Cancer Cancer Cancer Cancer Healthy Healthy …..
Levels: Healthy Cancer
we see that the row “Levels” now tells us that the category “Healthy” is first, which is exactly what we
want. Try and change the order of the categories in the code above and make sure you understand the
meaning of levels. Before you move on, make sure that “Healthy” is your first category. Let’s plot the
data:
stripchart(PSA~Group,data=df,ylim=c(0,4),ylab=c("PSA (μg/L)"),vertical = T,xlim=c(0.7,2.3) ,las=1)
abline(v=1.5) # Add a vertical line in the plot
We would now like to find a threshold that can be used to determine if a patient has prostate cancer
or not based on a simple PSA measurement. Looking at the figure above, it seems like an appropriate
threshold value is 2.3 µg/L. Let’s place a horizontal line in the figure for this cutoff value:
abline(h=2.3,lty=2,col="red") # Place a line in the figure that represents a cutoff value (2.3 µg/L)
We could use this cutoff value to predict whether a new patient has prostate cancer or not. However, how accurate is such a threshold value? From the figure above, we see that when we apply this threshold to subjects whose status is already known (healthy, or diagnosed with prostate cancer, e.g. by biopsy of the prostate), we misclassify three subjects. One healthy subject is incorrectly classified as having cancer, whereas two subjects with prostate cancer are incorrectly classified as being healthy. Thus, we have four possible outcomes:
TP – true positives: number of prostate cancer patients predicted as having prostate cancer
TN – true negatives: number of healthy subjects predicted as being healthy
FP – false positives: number of healthy subjects incorrectly predicted as having prostate cancer
FN – false negatives: number of prostate cancer patients incorrectly predicted as being healthy
Confusion matrix in R
We will now compute a confusion matrix in R. To create such a table we need a vector that defines
how each individual is classified according to the cutoff value. We can use the function “ifelse” to assign
the predicted class based on whether their PSA levels are greater or less than 2.3 (above or below the
cutoff line):
pred=factor(ifelse(df$PSA<2.3,"Healthy","Cancer"),levels=c("Healthy","Cancer"))
df=data.frame(df,pred)
df
Group PSA pred
1 Cancer 3.8 Cancer
2 Cancer 3.4 Cancer
3 Cancer 2.9 Cancer
4 Cancer 2.8 Cancer
5 Cancer 2.7 Cancer
6 Cancer 2.1 Healthy
7 Cancer 1.6 Healthy
8 Healthy 2.5 Cancer
9 Healthy 2.0 Healthy
10 Healthy 1.7 Healthy
11 Healthy 1.4 Healthy
12 Healthy 1.2 Healthy
13 Healthy 0.9 Healthy
14 Healthy 0.8 Healthy
We can generate a table of the two vectors by using the function “table”:
tab = table(pred,Group)
tab
Group
pred Healthy Cancer
Healthy 6 2
Cancer 1 5
where the rows represent the predicted class and the columns the actual class. We can now define the
TP, TN, FN and FP:
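One way to pick these counts out of the table "tab" (rows are the predicted classes, columns the actual classes) is:
TP=tab["Cancer","Cancer"]    # cancer patients predicted as having cancer
TN=tab["Healthy","Healthy"]  # healthy subjects predicted as healthy
FP=tab["Cancer","Healthy"]   # healthy subjects predicted as having cancer
FN=tab["Healthy","Cancer"]   # cancer patients predicted as healthy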
We can now calculate metrics that describe the predictive performance of our cutoff value:
Sensitivity=TP/(TP+FN)
Sensitivity
0.7142857
Specificity=TN/(TN+FP)
Specificity
0.8571429
Accuracy=(TP+TN)/(TP+TN+FP+FN) # or sum(diag(tab))/sum(tab)
Accuracy
0.7857143
PPV=TP/(TP+FP)
PPV
0.8333333
NPV=TN/(TN+FN)
NPV
0.75
We can compute the same metrics, and much more, with the function "confusionMatrix" from the "caret" package. Note that we need to state which of the categories is defined as the "positive" result. In most cases, this is the disease category, which is why we set the argument "positive" to "Cancer":
library(caret)
confusionMatrix(data=pred,reference=Group, positive="Cancer")
Reference
Prediction Healthy Cancer
Healthy 6 2
Cancer 1 5
Accuracy : 0.7857
95% CI : (0.492, 0.9534)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.02869
Kappa : 0.5714
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.7143
Specificity : 0.8571
Pos Pred Value : 0.8333
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.3571
Detection Prevalence : 0.4286
Balanced Accuracy : 0.7857
Note that the PPV and NPV reported above assume that the prevalence of the disease equals its proportion in our sample (50%). In a screening setting the true prevalence is usually much lower, which strongly affects the PPV and NPV. Let's recompute them with the formulas that take prevalence into account, assuming a prevalence of 10%:
Prevalence=0.1
PPV=(Sensitivity*Prevalence)/((Sensitivity*Prevalence)+(1-Specificity)*(1-Prevalence))
PPV
0.3571429
NPV=(Specificity*(1-Prevalence))/(((1-Sensitivity)*Prevalence)+Specificity*(1-Prevalence))
NPV
0.9642857
We can compute the same values by setting the argument “prevalence” to 0.1 in the function
“confusionMatrix”:
library(caret)
confusionMatrix(data=pred,reference=Group, positive="Cancer",prevalence=0.1)
…
Sensitivity : 0.7143
Specificity : 0.8571
Pos Pred Value : 0.3571
Neg Pred Value : 0.9643
…
Let's now instead use a lower cutoff value of 1.5 µg/L. With this low cutoff value, we get 100% sensitivity, since all prostate cancer patients are classified as having prostate cancer (no prostate cancer patient is classified as being healthy). Note that the 100% sensitivity only applies to our sample. If we used this cutoff value on a new data set including hundreds of individuals, a few prostate cancer patients would almost certainly have a PSA level below 1.5 µg/L. We will later see how to validate our classifier so that we get a better estimate of the true accuracy. Let's compute the confusion matrix:
cutoff=1.5
pred=factor(ifelse(PSA<cutoff,"Healthy","Cancer"),levels=c("Healthy","Cancer"))
confusionMatrix(data=pred,reference=Group,positive="Cancer")
Reference
Prediction Healthy Cancer
Healthy 4 0
Cancer 3 7
….
Accuracy : 0.7857
…
Sensitivity : 1.0000
Specificity : 0.5714
…
We see that the accuracy is still the same since we still misclassify three individuals. Sensitivity is now
100% whereas the specificity has been reduced to 57%.
Exercise 4.1
1. Make a strip chart of the variable body depth (BD) from the Crab data where you separate
the two colors (Blue (B) and Orange (O)). Place a cutoff line at 11 and set the argument
“method” to “jitter” so that you see all the data points.
2. Based on the plot you have generated, with the cutoff value of 11, how many false negatives
do you have, given that you set the color blue as the baseline category (orange is therefore
considered as a positive case)?
3. Based on the same data and the cutoff value as in the previous question, compute sensitivity,
specificity, PPV, and NPV.
ROC curve
We will here have a look at the so-called receiver operating characteristic (ROC) curve, which can be
used to find an appropriate cutoff value, given a certain sensitivity and specificity. ROC curves are also
commonly used to compare how well different biomarkers perform.
ROC curve
https://play.his.se/media/ROC/0_kjbx9tq6
The ROC curve represents the calculated specificity and sensitivity for different cutoff values. Let’s first
enter the same data that we have used previously:
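If the vectors "Group" and "PSA" from the Classification section are no longer in your workspace, re-create them as before:
Group=factor(rep(c("Cancer","Healthy"),each=7),levels=c("Healthy","Cancer"))
PSA=c(3.8,3.4,2.9,2.8,2.7,2.1,1.6,2.5,2.0,1.7,1.4,1.2,0.9,0.8)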
We will here use the pROC package to generate an ROC curve with the function “roc”:
library(pROC)
roc1=roc(Group, PSA)
Setting levels: control = Healthy, case = Cancer
Setting direction: controls < cases
Note that the output tells us that the “Healthy” category has been set as the “control” and the category
“cancer” has been set as the “case”. This is exactly what we want because the category healthy should
be our baseline. To generate a nice printout of the different sensitivities and specificities for given
cutoff values, we can print the output as a data frame format:
data.frame(roc1$thresholds,roc1$sensitivities,roc1$specificities)
roc1.thresholds roc1.sensitivities roc1.specificities
1 -Inf 1.0000000 0.0000000
2 0.85 1.0000000 0.1428571
3 1.05 1.0000000 0.2857143
4 1.30 1.0000000 0.4285714
5 1.50 1.0000000 0.5714286
6 1.65 0.8571429 0.5714286
7 1.85 0.8571429 0.7142857
8 2.05 0.8571429 0.8571429
9 2.30 0.7142857 0.8571429
10 2.60 0.7142857 1.0000000
11 2.75 0.5714286 1.0000000
12 2.85 0.4285714 1.0000000
13 3.15 0.2857143 1.0000000
14 3.60 0.1428571 1.0000000
15 Inf 0.0000000 1.0000000
From this output, we can see how different cutoff values affect sensitivity and specificity. If we think
that sensitivity and specificity are equally important, an appropriate threshold value seems to be at a
PSA level of 2.05 µg/L. If we want the test to have 100% sensitivity, a threshold value of 1.5 µg/L
could be used, although this cutoff reduces the specificity to 57%. However, note that these results
only account for our sample and that we need to validate the sensitivity and specificity for a given
threshold with test data or by using cross-validation which we will do later on. Let’s generate an ROC
curve, which is simply a curve representing the values in the two columns above for sensitivity and
specificity:
plot(roc1)
The thin diagonal line is our reference line. The thick curve represents our receiver operating
characteristic (ROC) curve. What we want is an ROC curve that is approaching the top left corner so
that our test has as high sensitivity and specificity as possible. If your two groups show weak separation
(the values from the two groups are completely mixed), then the ROC curve will be close to the
reference line.
How do we choose the "best" cutoff value from the ROC curve? A common choice is the threshold that maximizes Youden's index (sensitivity + specificity - 1); another measure is the threshold point that is closest to the top-left corner of the ROC curve plot.
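Both criteria can be extracted with the pROC function "coords" (a minimal sketch, assuming the object "roc1" from above; the argument "best.method" selects the criterion):
coords(roc1, "best", best.method="youden")           # threshold maximizing sensitivity + specificity - 1
coords(roc1, "best", best.method="closest.topleft")  # threshold closest to the top-left corner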
Remember also that the PPV and NPV depend on the prevalence of the disease in the population being tested. Let's compute the PPV and NPV for every threshold, this time assuming a prevalence of only 1%:
Prevalence=0.01
Sensitivity=roc1$sensitivities
Specificity=roc1$specificities
PPV=(Sensitivity*Prevalence)/((Sensitivity*Prevalence)+(1-Specificity)*(1-Prevalence))
NPV=(Specificity*(1-Prevalence))/(((1-Sensitivity)*Prevalence)+Specificity*(1-Prevalence))
data.frame(roc1$thresholds,PPV,NPV,Sensitivity,Specificity) # Print
roc1.thresholds PPV NPV Sensitivity Specificity
1 -Inf 0.01000000 NaN 1.0000000 0.0000000
2 0.85 0.01164725 1.0000000 1.0000000 0.1428571
3 1.05 0.01394422 1.0000000 1.0000000 0.2857143
4 1.30 0.01736973 1.0000000 1.0000000 0.4285714
5 1.50 0.02302632 1.0000000 1.0000000 0.5714286
6 1.65 0.01980198 0.9974811 0.8571429 0.5714286
7 1.85 0.02941176 0.9979839 0.8571429 0.7142857
8 2.05 0.05714286 0.9983193 0.8571429 0.8571429
9 2.30 0.04807692 0.9966443 0.7142857 0.8571429
10 2.60 1.00000000 0.9971223 0.7142857 1.0000000
11 2.75 1.00000000 0.9956897 0.5714286 1.0000000
12 2.85 1.00000000 0.9942611 0.4285714 1.0000000
13 3.15 1.00000000 0.9928367 0.2857143 1.0000000
14 3.60 1.00000000 0.9914163 0.1428571 1.0000000
15 Inf NaN 0.9900000 0.0000000 1.0000000
We see that the PPV is only 5.7% at the threshold of 2.05 µg/L, where both the sensitivity and the specificity are about 86%. A common summary measure of a ROC curve is the area under the curve (AUC), where 1 corresponds to perfect separation and 0.5 corresponds to the diagonal reference line. Let's extract the AUC and its 95% confidence interval:
roc1$auc
Area under the curve: 0.9184
ci.auc(roc1)
95% CI: 0.7723-1 (DeLong)
We see that the 95% confidence interval does not include the value of 0.5. Hence, we can conclude
that the AUC value is significantly greater than 0.5. It is also possible to compute a p-value based on a
Mann-Whitney U test. Let’s compute a two-sided Mann-Whitney U test to check if our AUC is
significantly different from 0.5:
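One way to run this test, assuming the vectors "PSA" and "Group" from above, is with the base-R function "wilcox.test":
wilcox.test(PSA ~ Group)   # two-sided Mann-Whitney U (Wilcoxon rank-sum) test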
We see that the p-value is less than 0.05, which means that we can conclude that the area is
significantly greater than 0.5.
ROC curves are also useful for comparing how well different biomarkers separate the same two groups. As an example, we will use the iris data set and compare how well petal width and sepal width discriminate between the species Versicolor and Virginica. Let's first prepare the data:
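One way to create the variables used below (petal width and sepal width for the Versicolor and Virginica flowers, which are on rows 51 to 150 of the iris data frame) is:
Pet_width=iris[51:150,4]         # petal width for Versicolor and Virginica
Sep_width=iris[51:150,2]         # sepal width for the same flowers
Species1=factor(iris[51:150,5])  # species factor; "versicolor" is the first level (baseline)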
Note that we here set the species Versicolor as our baseline (=0), but the baseline in this example is
arbitrary. We then plot one ROC curve for the petal width and one curve for the sepal width.
(roc1=roc(Species1, Pet_width))
…
0.9804
plot(roc1)
(roc2=roc(Species1, Sep_width))
…
0.6636
lines(roc2,col="red")
legend("bottomright",c("Petal width","Sepal Width"),col=1:2,lty=1,lwd=2)
We see that we can get about 90% sensitivity and specificity using the variable petal width whereas
the variable sepal width is close to the reference line. The AUC value for the petal width is 0.98,
whereas the AUC value for the sepal width is only 0.66. By using the function “roc.test” we can test if
there is a significant difference in AUC between the two curves. Since the two ROC curves are built on
variables from the same sample (repeated measures on the same flower), we set the argument
“paired” to true:
roc.test(roc1,roc2, paired=TRUE)
DeLong's test for two correlated ROC curves
data: roc1 and roc2
Z = 6.2869, p-value = 3.238e-10
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9804 0.6636
Based on the low p-value, we can conclude that the AUC is significantly greater for the petal width
compared to the sepal width.
Exercise 4.2
1. Use the variable body depth (BD) from the Crab data set to classify the two-color forms (Blue
(B) and Orange (O)). Generate an ROC curve of the data.
2. Calculate the 95% confidence interval of your AUC value. Is your AUC value significantly
greater than 0.5?
3. Use the “youden” method to find an appropriate cutoff value. Report the cutoff value and
the corresponding sensitivity and specificity based on this cutoff.
4. Create two ROC curves that represent the two variables body depth (BD) and rear width
(RW) in the same plot. Use the “roc.test” to test if there is a significant difference in AUC
between the two curves.
Validation
In this section, we will see how to validate the performance of a test or a biomarker by using either
the hold-out method or cross-validation.
We will here learn how to perform validation based on the Iris data set. To simplify, we will reduce this data set so that we only have the petal length and the two species Versicolor and Virginica:
PL=iris[51:150,3]
Species=factor(iris[51:150,5])
df=data.frame(PL,Species)
stripchart(PL~Species,data=df,ylim=c(2.8,7),method="jitter",vertical = T,xlim=c(0.7,2.3) ,las=1)
To see how well a single cutoff line can separate the groups, we can use, for example, a cutoff line
based on the mean of the two group means:
cutoff=mean(tapply(PL,Species,mean))
abline(h=cutoff,lty=2,col="red")
If we use this cutoff to calculate the confusion matrix and the accuracy:
pred=factor(ifelse(PL<cutoff,"versicolor","virginica"))
table(pred,Species)
Species
pred versicolor virginica
versicolor 48 6
virginica 2 44
sum(diag(table(pred,Species)))/nrow(df)
0.92
we see that 92 flowers are correctly predicted out of the 100 flowers. However, by determining the
cutoff value on all data, we usually overestimate the accuracy because if we use a new data set, our
given cutoff value might not be optimal for that data set. To get a better estimation of how well a
single cutoff value can separate the groups, we need to use some validation method.
The hold-out method
With the hold-out method, we randomly split the data into a training set, which is used to determine the cutoff value, and a test set, which is used to estimate how well that cutoff performs on data it has not seen. Here we will put 60% of the flowers in the training data and the remaining 40% in the test data.
Before we run the function "sample" (below), we use the function "set.seed" with some arbitrary number. If you run the set.seed and sample lines several times, you will then get exactly the same "random" sequence every time, which is nice because you will get the same sequence of numbers as shown in this tutorial. NOTE: if you get different numbers compared to the ones shown below, you might be using a different R version than the one used to generate them (version 4.0.5 was used here), and your results will then differ from those in this tutorial.
To select 60% of the Versicolors for the training data, we can create a random permutation of the numbers 1 to 50, because the Versicolors are located on rows 1 to 50 of the data frame. We also generate a random permutation of the numbers 51 to 100, because all the Virginicas are located on those rows.
set.seed(150)
ind_Versicolor=sample(1:50)
ind_Virginica=sample(51:100)
Then we take the first 30 numbers of these two random sequences so that we select 30 out of the 50
flowers of each species (60%):
ind_train=c(ind_Versicolor[1:30],ind_Virginica[1:30])
ind_train
3 22 23 40 12 18 17 15 44 36 48 41 31 16 25 34 20 8 33 10 43 4 1 2 13 19 6 45 24 26
92 100 94 84 56 72 99 54 66 86 85 87 51 69 73 89 81 63 64 75 68 79 76 82 52 71 80 83 53
97
The random numbers you see above will be used to select the rows of the data frame to be used in
our training data:
train=df[ind_train,]
nrow(train) # Check how many flowers we have in the training data
60
We see that we have selected 60 out of the 100 flowers for our training data. For the test data, we will
select all the flowers that were not included in the training data. This can be done by simply using a
minus sign in front of “ind_train”:
test=df[-ind_train,]
nrow(test)
40
A simpler way to do the same thing is to use the function “createDataPartition” from the caret package:
library(caret)
set.seed(150)
ind = createDataPartition(df$Species, p = 0.6, list=FALSE)
training = df[ind,]
table(training$Species)
versicolor virginica
30 30
testing = df[-ind,]
table(testing$Species)
versicolor virginica
20 20
Note that when we supply a categorical variable (df$Species) to the function, it samples row numbers within each group, so that the training data contain the same proportion of flowers from each group (a stratified split).
Next, we compute a new cutoff value based only on the training data (sketched below) and then use it to see how well it can separate the two species in the test data set:
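One way to compute this training-based cutoff, analogous to the earlier cutoff definition (assuming the "train" data frame from the manual split above), is:
cutoff=mean(tapply(train$PL,train$Species,mean))  # cutoff based on the two group means in the training data only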
pred=factor(ifelse(test$PL<cutoff,"versicolor","virginica"))
table(pred,test$Species)
pred versicolor virginica
versicolor 19 4
virginica 1 16
sum(diag(table(pred, test$Species)))/nrow(test)
0.875
We see that the accuracy is 87.5% with the hold-out method, compared to 92% when we used all of the data. An accuracy that is based on the test data is a better estimate of the true accuracy, because those data were not used to identify the optimal cutoff value.
Cross-validation
Another type of validation approach, when our sample size is small, is to approximate the accuracy by
using K-fold cross-validation where the sample is divided into K equal subsamples. A subsample is
retained as the test set and the remaining subsamples are used as training data. The cross-validation
is then repeated K times, where each subsample is used only once as test data. The accuracy will then
be averaged over the K cross-validations. K should be adjusted based on the sample size. A small
sample size requires large Ks. When K equals the sample size, the cross-validation is called leave-one-
out cross-validation (LOOCV). To understand the LOOCV algorithm, we will here implement the
algorithm with a simple for-loop. In every loop, we remove one data point, calculate a new cutoff value
based on the mean of the remaining data, and test if the data point that was left out is correctly
predicted or not:
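A minimal sketch of such a loop, using the data frame "df" with the variables "PL" and "Species" from above:
pred=rep(NA,nrow(df))                      # will hold the predicted class for each left-out flower
for(i in 1:nrow(df)){
  cutoff_i=mean(tapply(df$PL[-i],df$Species[-i],mean))  # cutoff from all data except row i
  pred[i]=ifelse(df$PL[i]<cutoff_i,"versicolor","virginica")
}
pred=factor(pred,levels=levels(df$Species))             # same factor levels as the true species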
sum(diag(table(pred, Species)))/nrow(df)
0.9
Exercise 4.3
1. Use the hold-out method as a validation method to compute the accuracy of how well the
variable body depth (BD) can classify the two-color forms (Blue (B) and Orange (O)). Use the
training data set (60% of the total data) to compute an appropriate cutoff value based on the
ROC curve, using the Youden method. Then use the test data to see how well this cutoff
value can discriminate the two groups. Calculate the accuracy.
2. Use the LOOCV to compute the accuracy of all data for the variable body depth (BD), using
the Youden method to determine the cutoff value.