Evaluating Pointwise Reliability of Machine Learning Prediction
Keywords: Machine learning trustworthiness; Predictive reliability; Uncertainty

Abstract

Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore important to improve the quality of machine learning output and to add safeguards to support its adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the predictions of the machine learning model for these instances may often be wrong, given that the model is applied outside its "reliable" space of work, leading to decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it is important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed.

Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlaps among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. The local fit principle verifies that the trained model performs well on training subsets that are most similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning, especially in medicine.
1. Introduction

Machine Learning (ML) for making predictions and extracting information from data is increasingly applied in many different fields, from medicine [53,63] to finance [32]. Thanks to these algorithms we are able to analyze the increasing mass of biological data produced by next-generation sequencing technologies, to make inferences about genomic mutation pathogenicity [69,5,67] or cell type [33], to identify therapy targets, and to design new therapeutic compounds [23]. ML can also be used to extract information from EHR data to predict patients' diagnoses [59], post-hospitalization risks [49] or heart failure [73], and it can also support the analysis of emerging diseases, such as outcome severity during COVID-19 infection [4].

Moreover, there is a fast increase in the number of FDA-approved products that are now available on the market; most of them were developed for the field of Radiology, but several other areas are represented, such as cardiology and internal medicine [7]. The transition from research to bedside poses several challenges in order to provide systems with safeguards that can allow widespread clinical use [50,97]. Those challenges include logistical and regulatory issues, with a focus on trustworthiness and fairness [21,97]. Recently, the European Union regulations, in compliance with the General Data Protection Regulation (GDPR), have defined minimal standards for implementing ML systems in public health, requiring, among others, the explainability of the model. Explainable AI (XAI) is becoming a promising field of research within the AI community. Its purpose is to investigate methods for analyzing or complementing AI black box models to make the internal logic and output of algorithms transparent and/or interpretable.
Table 1
Definition of different concepts to evaluate a classifier on a test set, along with possible measures and relevant probability distributions.

| Concept | Definition | Measures | Relevant distributions |
|---|---|---|---|
| Calibration | The predictive probability distribution of a classification model honestly expresses its observed probability distribution [94] | Calibration (reliability) diagram | $P(Y \in \cdot \mid f(X))$, $f(X)$ |
| Robustness | How accurate a ML algorithm is on new independent data even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances | Accuracy on adversarial samples or on small perturbations of inputs | $p(Y = C \mid X; \theta)$, $p(X)$ |
| Sharpness | Variability of predictions as described by distributions of predictions [64] | $\frac{1}{N}\sum_{i=1}^{N} P(y_i \mid x_i)\,(1 - P(y_i \mid x_i))$ | $p(Y = C \mid X; \theta)$ |

Table 2
Definition of different concepts to evaluate a classifier on a single instance, along with possible measures and relevant probability distributions.

| Concept | Definition | Measures | Relevant distributions |
|---|---|---|---|
| Uncertainty (of a single ML prediction) | Variability in a model's prediction due to different types of uncertainty (epistemic/aleatoric uncertainty) | Variance, entropy, confidence intervals | $p(Y = C \mid X; \theta)$ |
| Reliability (of a single ML prediction) | Degree of trust that the prediction is correct, $p(\hat{y}_i = y_i)$ | Local fit principle, density principle | $p(Y = C \mid X; \theta)$, $p(X)p(Y)$ |

Explainability can thus be perceived as a property through which ML can gain trust from the user [74]. However, when dealing with a model's prediction, we trust the classifier not only if we understand its logic but, most importantly, if its output classification is likely to be correct. XAI may help in identifying incorrectly classified examples by identifying artifacts or undesirable correlations from training [74], but it does not in itself enable trust [42]. For instance, XAI may be less useful to identify critical situations in the data when they are not known a priori by users. In other words, XAI can be used to help understand the classification process made by the model, but it cannot assess the reliability of single ML predictions per se. Understanding whether we can trust the prediction of ML systems is crucial to identify failures, since ML inherently suffers from dataset shifts and poor generalization ability across different populations [50,41]. As a matter of fact, examples of fooling deep learning (DL) networks with adversarial instances are widely reported in the literature [96,24]. Generalization ability under data perturbation (i.e. the robustness of the algorithm) can be promoted by using stable ML methods and tested before deployment [86], and strategies for model updating need to be considered for deployed models [22]. In addition to that, we need methods and metrics to identify potential failures, due to perturbations or poor generalization ability, a posteriori, as part of a monitoring strategy of our model.

Assessing the reliability of ML predictions may increase the trust of potential users when using these tools in many high-stakes applications [42]. In a ML pipeline, model performance is evaluated on a set of instances (test or validation set) through performance metrics such as accuracy, sensitivity and specificity. These metrics represent an aggregate prediction capability of the model on that test set, since they are computed (in the case of a binary classification task) from the number of true/false positive examples and true/false negative examples in the test set, or from the complete set of predicted posterior probabilities (to compute AUC and PRC). When the model is deployed, we would expect that it will perform (on average) as well as it did on the test/validation set. However, this may not be the case [76]. Moreover, the overall performance metrics mentioned above do not provide a way to measure the performance on single new cases whose class is predicted by the model during deployment [54]. Additionally, we may not know the true class of these new future examples. For these reasons, it is crucial to develop new types of metrics and methodologies that can help us understand whether the classification may be wrong case by case. For instance, suppose that we have trained a ML model to predict whether a patient has a tumor based on the results of laboratory testing and MRI images. Suppose that the classifier predicts, for a certain patient, that his/her probability of having a tumor is 0.89. The probability threshold for classification is 0.5; therefore the classifier assigns the patient to the positive class. Through reliability assessment, we would like to assign a degree of reliability to such a prediction. The pipeline may report that the predicted class for our patient is "positive" with a certain reliability, for instance 0.7, or a binary value (0 or 1, indicating an unreliable or reliable classification). In this article, we aim at discussing potential approaches to evaluate the reliability of ML model predictions that can be integrated in clinical decision support systems. We use the term "reliability" to indicate the degree of trust that we have in the prediction made by the ML model on a single example, as used in Saria et al. [76] and in [54]. The application of reliability can support the identification of correct and incorrect cases, mentioned in the definition of "trust in model correctness" [42]. Naively, the probability of the predicted class could be seen as the classifier's trust in that prediction, but this approach can be misleading since it does not account for the algorithm's biases [54]. Other strategies can be pursued, but we observe a lack of standardization in the usage of terms such as reliability, uncertainty or robustness in the literature. For this reason, we will first start with a background section where we list the relevant concepts pertaining to model evaluation, both in terms of performance ability and reliability, and we then systematically review relevant papers that, to some extent, cover our principles of interest. Afterwards, we will present our approach to integrate different concepts to perform reliability assessment on simulated and real clinical data, and we will conclude our work with a discussion and some closing remarks.

2. Background

2.1. Terminology and problem definition

Let us consider the supervised classification problem, where a ML model $f: \mathbb{R}^m \to \mathbb{R}^c$ has been trained on a training set $D = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^m$ is the vector containing the attribute values for the i-th example and $y_i \in \{c_1, \ldots, c_j, \ldots, c_C\}$ is the class of the training example. In particular, for a discriminative ML algorithm,

$$f := p(Y = c_j \mid X, D)$$

is the predicted posterior probability of the class being $c_j$ given the training data $D$. In the development phase of our model, we need to evaluate the performance of the ML classifier on a new set of data, called the test set $T = \{(x_j, y_j)\}$ (alternatively, we can use bootstrap, cross-validation or other resampling approaches to estimate errors).
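As a minimal illustration of this notation in code (a sketch assuming scikit-learn and a synthetic dataset; the data and model choices here are illustrative, not the ones used in our experiments):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Training set D = {(x_i, y_i)}, with x_i in R^m and y_i in {c_1, ..., c_C}
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A discriminative model f approximating p(Y = c_j | X, D)
f = RandomForestClassifier(n_estimators=100, random_state=0)
f.fit(X_train, y_train)

# Predicted posterior probabilities on the test set T, one column per class c_j
posterior = f.predict_proba(X_test)

# Predictive accuracy on T: fraction of correctly classified examples
accuracy = (f.predict(X_test) == y_test).mean()
print(f"accuracy = {accuracy:.3f}")
```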
On the test set, the model is commonly evaluated in terms of its predictive accuracy, i.e. the fraction of correctly classified examples in the test set. In many applications, such as in medicine, accuracy alone should not be the only evaluation metric, especially when the classes are imbalanced and/or the cost of misclassification for one class is higher with respect to the other(s) [58]. For this reason, other metrics, such as precision (i.e. positive predictive value), recall (i.e. sensitivity), specificity, F1 score and Matthews Correlation Coefficient should be reported and analyzed as well [43]. These metrics are applied to the discretized output of the classifier, which is the posterior predicted probability of the class given the data. The computation of such measurements on a test set represents the most common approach to evaluate the performance of a classifier. Yet, other measurements, reflecting different concepts, can be used during validation and deployment. Here, we provide an overview of these concepts, reported in Table 1 and Table 2. Table 1 highlights metrics that are computed on the entire test or validation set, while in Table 2 we report metrics applied to a single machine learning prediction.

Calibration measures whether the distribution of the target class conditional on any given prediction of our model is equal to that prediction [94]. It is worth noting that the concept of calibration is called "reliability" in the weather forecast community [64], but in this paper we will use "calibration", since with "reliability" we are referring to a comprehensive concept that aims at assessing whether the predicted class is equal to the true class of a single example. Through the variability of the predicted probability distributions we can evaluate the sharpness [64], for instance by computing the variance of the predicted posterior probabilities [90]. These "overall" metrics measure the ability of the classifier on the complete test set. If the test set is representative of the entire underlying population of interest, then we expect that the classifier will perform on future data coming from the population as well as it did in the test. Another property that can be evaluated on a set of examples is the robustness of the classifier, i.e. how accurate a ML algorithm is on new independent data even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances. Out-of-distribution (OOD) samples and adversarial samples are often generated to fool DL [12]. Robustness against adversarial attacks should be guaranteed in high-stakes applications, such as medical ML [25].

These metrics are especially useful during the validation phase, to understand how the model performs in general. In deployment, we may want to understand how the classifier is performing case by case. In this context, uncertainty is an important concept related to AI trust. In statistics, uncertainty is a statement about a single prediction, which can be, for instance, a confidence interval. Ideally, especially in the medical field, a predictive model should have the ability to abstain from prediction when the uncertainty is high [52]. Uncertainty can be related to the model's "self-confidence" in the prediction: for instance, given a binary discriminative classifier outputting a probability, we can label the model as "certain" when the predicted probability is near 0 or 1. In this sense, uncertainty is a measure of sharpness on a single example. From now on, we will refer to this type of uncertainty as "unsharpness", to distinguish it from the predictive uncertainty, which is due to the epistemic and aleatoric uncertainty [65]. Aleatoric uncertainty refers to uncertainty in the data, such as noise, while epistemic uncertainty refers to the uncertainty in the model, such as the uncertainty in the model's parameters due to lack of knowledge. Epistemic uncertainty should increase when the prediction is made on an instance which is "far" from the training set (i.e. out-of-distribution (OOD) samples). The total uncertainty is the sum of aleatoric and epistemic uncertainty. Aleatoric uncertainty represents the irreducible part of the uncertainty, while epistemic uncertainty is the reducible part of (the total) uncertainty. In ML, the sources of uncertainty (aleatoric or epistemic) are not usually distinguished [40]. Different methodologies allow for total uncertainty quantification, either in terms of building a confidence interval around a prediction, or by computing a single score. Gaussian processes can also estimate uncertainty by computing the variance of predictions around a single point, and they have been used to select small molecules based on predicted binding affinity with kinases [39]. A widely used method to measure the uncertainty is by using multiple classifiers. Intuitively, the more the classifiers disagree, the higher the uncertainty. Ensembles of classifiers can be used to estimate both aleatoric and epistemic uncertainty in a formal way, by computing the entropy or the relative likelihood of the posterior probabilities predicted by each weak classifier [82]. In Deep Networks, uncertainty is estimated using drop-out methods or through ensembles of networks. In [56], the authors build a DL network that includes drop-out Bayesian uncertainty estimation, which can be used to diagnose diabetic retinopathy, showing that decisions informed by uncertainty estimation can improve diagnostic performance. Aleatoric and epistemic uncertainty of Bayesian deep networks have been estimated for computer vision tasks in [51]. The reader can refer to [2] for a review of uncertainty quantification for DL and to [40] for formal definitions of uncertainty and methodological approaches to estimate total, epistemic and aleatoric uncertainty. Another assumption that has been made is that the predicted probability is the measure of (un)certainty, provided that the model is calibrated [55]. However, calibration itself can suffer from dataset shift, thus making this type of uncertainty estimation less trustworthy in case of OOD examples [71].

In 2009, [10] defined reliability in ML as "any qualitative property or ability of the system which is related to a critical performance indicator (positive or negative) of that system, such as accuracy, inaccuracy, availability, downtime rate, responsiveness, etc". In this paper, we are interested in the assessment of individual (pointwise) reliability, thus we will use the term reliability to indicate a property of a single ML prediction made on a given sample. In this context, [76] put together principles from reliability engineering to assure that a ML system performs as intended in deployment, i.e. without failure and within specified performance limits. These principles encompass (1) prevention of failures, i.e. adoption of techniques, during development, to avoid possible incorrect cases, (2) identification of failures during deployment, and (3) maintenance of the model, i.e. fixing or preventing the failures when they occur. To identify failures, the concept of point-wise reliability should be applied, by evaluating the model's prediction on a single instance and thus possibly rejecting the classification when such prediction is deemed "unreliable". To compute pointwise reliability, two principles should be considered according to [76]:

1. density principle: it asks whether the test sample is close to the training set (similar to outlier detection or out-of-distribution sample detection);
2. local fit principle: it asks whether the model was accurate on the training samples closest to the test case.

By evaluating these two principles we should be able to identify the "context" in which the model is expected to be performant, thus promoting the "trust in model correctness". In other words, the model should perform well on samples that (1) are similar to the training set and (2) are similar to training samples on which the model performed well. The density principle only refers to the distribution on the attribute space of the training set and the new instance to be classified, while the local fit principle also refers to the averaged performance of the classifier. The first principle acknowledges that the model may be unreliable when the distribution of the new data is changing for any reason. The second principle acknowledges that the model was not "perfectly accurate" in the first place (during model development and evaluation) on specific subsets of the training set.
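As a concrete illustration of the ensemble-based estimates discussed above, the following sketch decomposes total uncertainty into its aleatoric and epistemic parts using the entropy formulation (one possible implementation consistent with the definitions reviewed in [40,82]; obtaining the member predictions from the trees of a Random Forest is an illustrative choice):

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """Decompose total uncertainty into aleatoric and epistemic parts.

    member_probs: array of shape (n_members, n_samples, n_classes) with
    the posterior probabilities predicted by each ensemble member."""
    eps = 1e-12  # avoid log(0)
    # Total uncertainty: entropy of the averaged predictive distribution
    mean_probs = member_probs.mean(axis=0)
    total = -(mean_probs * np.log2(mean_probs + eps)).sum(axis=-1)
    # Aleatoric part: expected entropy of the individual members
    member_entropy = -(member_probs * np.log2(member_probs + eps)).sum(axis=-1)
    aleatoric = member_entropy.mean(axis=0)
    # Epistemic part: the reducible remainder, i.e. member disagreement
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# e.g., for a fitted sklearn RandomForestClassifier `rf` and inputs X:
# member_probs = np.stack([t.predict_proba(X) for t in rf.estimators_])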
A formal definition of pointwise reliability was proposed by [54] as the probability that the predicted class is equal to the true class value of an instance. To calculate this probability, the authors proposed a transductive approach that involves the re-training of the algorithm: each time a new instance needs to be classified, the posterior predicted probability of the original model is stored, the new instance is included in the training set, and the model is re-trained. The posterior probability is predicted by this second model, and it is compared with the stored predicted probability. Intuitively, if the posterior probability changes to some degree after the inclusion of the instance in the training set, such an instance is adding knowledge that was previously unavailable. Therefore, the first predicted probability should be deemed unreliable. However, this approach implies the re-training of the algorithm, which is sometimes unfeasible due to computational time and/or unavailability of the training set. Uncertainty may be used to evaluate the reliability of a classifier, following the principle that we will be more willing to trust a predictive model that is "certain" about a prediction. For instance, through uncertainty estimation, in [56] the authors were able to detect difficult cases for further inspection, resulting in substantial improvements in detection performance on the remaining data. Moreover, as the epistemic uncertainty can be caused by OOD data, it may seem that evaluating epistemic uncertainty naturally includes the evaluation of the density principle. Yet, as studies have shown, uncertainty evaluation through the predicted posterior probability is not trustworthy under dataset shift [71], which is exactly the case in which we would like to have a reliable approach to detect possible failures. Also the robustness of the model is related to its reliability, since it promotes the development of accurate models on adversarial samples. Yet, robustness is limited to the evaluation of adversarial samples, while a pointwise reliability estimation deals with any kind of future input samples.

Fig. 1 shows a workflow for ML model development and deployment, highlighting validation and continuous monitoring through reliability assessment.

Fig. 1. Supervised Machine Learning model life cycle. First, a model is trained on an available set of training samples (1). Then, the model is evaluated on a test set in terms of different metrics (2). Eventually, the model is deployed, and it can be used for prediction on new cases (e.g. patients). In such single cases, the true class is usually unknown, but it is necessary to understand whether the ML prediction should be trusted. Reliability assessment can support the identification of unreliable predictions (3).
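A minimal sketch of the transductive check of [54] described above (a hypothetical implementation: in particular, we assume here that the new instance is added to the training set with its own predicted label, a detail the description above leaves open):

```python
import numpy as np
from sklearn.base import clone

def transductive_reliability_score(model, X_train, y_train, x_new):
    """Compare the posterior for x_new before and after re-training with
    x_new included, as in the transductive approach of [54]."""
    x_new = np.asarray(x_new).reshape(1, -1)
    p_before = model.predict_proba(x_new)[0]
    y_guess = model.predict(x_new)[0]  # assumed label for the new instance
    retrained = clone(model).fit(np.vstack([X_train, x_new]),
                                 np.append(y_train, y_guess))
    p_after = retrained.predict_proba(x_new)[0]
    # A large change suggests the instance added previously unavailable
    # knowledge, so the original prediction is deemed less reliable.
    return np.abs(p_after - p_before).max()
```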
2.2. Related work

Several works incorporate the notions of reliability, uncertainty, and robustness to provide classifiers able to detect potential failures. Some classifiers naturally provide a way to compute a form of reliability measure, while other work exploits or adapts well-known classifiers to compute a form of reliability measure.

In particular, a branch of pattern recognition research, named Learning with Reject Option (LRO), focuses on the development of classifiers with reject ability. In LRO, the aim is to find a subset of the test set for which the model has higher performance, thus identifying non-reliable examples for which the classifier may withhold the classification [6]. The concept of LRO is equivalent to the concept of "selective prediction", in which a model can choose to abstain from the classification when the uncertainty is high [31]. For instance, Bayes classifiers were exploited to detect reliable regions in gene expression data [34], while posterior probability and contextual information are used to classify images from teratoma tissues and reject non-reliable portions [15]. By identifying samples for which the classification may be wrong, classification reliability and classification with reject option can be seen as synonyms [3,27,61]. Learning with rejection often implies the definition of a reject threshold. In pattern recognition, an optimal reject threshold based on the rejection cost was formulated in 1970 [14], and a reject threshold was computed based on the known costs of misclassification and rejection in text categorization [27]. An optimal reject rule for a binary classification problem was also formulated by [89], in which error costs peculiar to the particular problem are used to compute optimal reliability thresholds based on ROC curves. [75] demonstrated that the approaches from [14] and from [89] are actually equivalent. Another important aspect is the performance evaluation of LRO classifiers. [66] introduced the accuracy-rejection curve (ARC), which represents how the accuracy varies as a function of the rejection rate. To account for misclassification and rejection costs in applications such as medical diagnosis, [1] recently extended the ARC curves.

A binary classifier with reject option (or reliability estimation) can be developed in different ways: 1) by choosing a classifier whose characteristics can be exploited to evaluate reliability; 2) by thresholding the predicted posterior probability or an uncertainty estimation; 3) by designing independent and complementary classifiers; and 4) by designing classifiers that intrinsically learn to classify with rejection (also known as embedded methods) [61,85]. Also, 5) unsupervised methods, such as generative models, can be used to capture statistical properties of the dataset to be used for reliability estimation. In the following subsections we will discuss these different approaches (as reported in Fig. 2). Note that we are interested in LRO methods that identify unreliable regions of the feature space. Another application of LRO methods is the detection of samples in outlier classes, in situations where new classes can appear [47,88]. Moreover, we focus on reliability assessment for classification problems, while regression problems may also benefit from reliability estimation [9].

2.2.1. Reliability based on classifier's properties and characteristics

Transductive algorithms, such as k-Nearest Neighbor and Support Vector Machines, rather than outputting a general classification rule or function, classify instances with respect to the available training set. Therefore, reliability can be determined based on how much the new case is "close" to the training set, an assumption that is included in the "density principle" as well as in other previous work [30]. For instance, using k-Nearest Neighbour, one can rely on the "purity" of the k nearest neighbours of an example: if fewer than k' neighbors are in the predicted class, then the classification is rejected (i.e. unreliable) [37]. In [78] the confidence of a SVM classifier is computed from the values of the Lagrange multipliers, since they reflect how "odd" a particular example is. The term confidence is sometimes used to refer to the same concept as reliability [54,26,48,44]. Distance from the separating hyperplane of the SVM was used by [93] to detect arrhythmia from complex electrocardiogram (ECG) signals, and a SVM with l1 penalty and reject option was developed by [13] to perform feature selection and classify cancer samples based on gene expression profiles. Among neural network approaches, LVQ (Learning Vector Quantisation) networks consist of two layers for input and output and an array of codebook vectors. The output is defined as the distance between the input vector and the codebook vectors for every class. The classification for an input sample x would be the class whose representative codebook vector has the minimum distance to the sample x. The distances between the sample and the codebook vectors of each class are used by [3] to estimate the probability densities for codebook vectors, which in turn are used to select the threshold for rejection. LVQ networks for LRO were first proposed by [17,18] and were also used for reliable footstep identification [87]. It has also been argued that LVQ combines model interpretability with the reject option [11]. Along with transductive algorithms, conformal prediction can also be used to compute reliability. Conformal prediction is a general framework that uses past experience to predict the confidence intervals for a new prediction in an on-line setting, in which the prediction of a new case is made based on the previously classified examples. Given a predicted label ŷ and the true label y, conformal prediction can compute a 95% confidence region that contains the y label with at least 95% probability. Usually, such regions also include the predicted label ŷ. In case of regression, ŷ and y are continuous values, while for classification they are discrete values [81]. Conformal prediction is often applied for reliable drug discovery (see [20] for a review) and can be extended to DL as well [36,62]. More recently, a new method computes the reliability of a trained model provided that the gradient of its loss function is accessible [80].

2.2.2. Reliability based on uncertainty or predicted posterior probability

The most straightforward way to develop LRO classifiers is adding a rejection rule to any type of classifier outputting a posterior probability or an uncertainty estimation. The reject threshold is set according to such a value, based on some trade-off between misclassification and rejection [27,34,8,95]. This approach is also called plug-in [14,35]. When using the predicted posterior probability, we assume that such probability is calibrated. In this case, when the predicted posterior probability for the predicted class is lower than a threshold, the classification is deemed unreliable. In [72] the authors define rejection intervals, based on misclassification rates, for the predicted posterior probability of a classifier that predicts the effect of amino acid substitutions on protein stability. In order to use a measure of uncertainty, we can use Gaussian Process classifiers, or we can compute epistemic, aleatoric and total uncertainty as the relative likelihood, or the entropy, of the posterior probabilities predicted by each weak classifier in an ensemble [40,82]. Alternatively, [48] exploited different "fusion methods" for posterior probabilities of ensembles, for instance by summing or taking the maximum value of the probabilities, to develop an anti-diabetic drug failure classifier with a reject option.

Recently, a variety of work has been published dealing with uncertainty and reject options for DL networks. In [19], deep network ensembles are used to predict drug efficacy. It has been observed that DL tends to provide high predicted probabilities also for misclassified examples. This is due to the fact that the softmax function, which is usually used by DL to output a predicted probability, is fast-growing exponentially [38]. Regarding the usage of the DL posterior probability as an indicator of reliability, published results are contradictory: [38] observed that the predicted probability of misclassified examples can usually be lower with respect to the correctly identified instances, and therefore DL probabilities may be evaluated to reject a classification. Lakshminarayanan et al. [55] successfully used an ensemble of networks to estimate the uncertainty from the predicted posterior probability. However, [71] performed a large-scale evaluation of different methods for quantifying predictive uncertainty under dataset shift and showed that, along with accuracy, the calibration of the classifier also deteriorates under dataset shift. Consequently, using the classifier's prediction to detect unreliable classifications may not be trustworthy. On tabular medical data, a recent study found that, while ensemble methods for prediction and uncertainty quantification are the best performing in detecting correlation between uncertainty and performance, they are less suited for the identification of out-of-distribution examples. In this latter case, novelty detection methods, for instance based on Variational Autoencoders (VAE), perform better [60]. The limitation of uncertainty estimation to detect OOD examples is also reported by [92], where the authors investigated different methodologies for uncertainty estimation to detect OOD examples in medical data.

2.2.3. Complementary classifier

Another possible solution to design a pipeline that learns with rejection in a binary classification problem is the development of two independent classifiers. The first classifier is trained to output C1 only when the posterior probability for C1 classification is high, the second is
Fig. 4. a) Simulated binary dataset. b) Simulated binary dataset with dataset shift.
Fig. 5. Performances of the RF (100 trees) on the validation (IID) and test set (OOD and IID), in terms of accuracy, F1 score, Matthews Correlation Coefficient, Brier Score, Area Under the Receiver Operating Characteristic curve (AUC) and Area Under the Precision Recall Curve (PRC).

Fig. 6. a) Boxplots of the RF total uncertainty on correctly classified and incorrectly classified examples in the validation set. b) Boxplots of the RF total uncertainty on correctly classified and incorrectly classified examples in the test set.

3. Reliability estimation following the density and local fit principles

Previous work has shown that the usage of "classifier-related" metrics (such as the posterior predicted probability) to assess pointwise reliability may be biased and misleading, given the intrinsic data-driven nature of the classifiers. Moreover, "a classifier may simply not be the best judge of its own trustworthiness" [45]. Motivated by this, we are interested in investigating the use of approaches for reliability estimation that are independent of the classifier chosen, and that do not rely on the classifier itself to compute its own reliability. To do so, we believe that the road to follow for reliability estimation should be based on the local fit and density principles defined by [76]. In the following, we present a possible strategy based on the implementation of these two principles, and we demonstrate the utility of this approach on a simulated and a real-case medical dataset.

Given any type of classifier and the training set used to train the classifier, we first apply the instance selection-based approach proposed in [68]. Briefly, this approach first detects the "border" instances of the feature space, that is, for each attribute, we identify those training examples that are either inner or outer border for their membership class.
When a new example is to be classified, the "density principle" will be evaluated by comparing the example to the training border, and its "density-based reliability" will be calculated by the simple formula:

$$d_{rel}(x) = 1 - \frac{m_x}{m}$$

where $m_x$ is the number of attributes for which the value of the new example falls outside the training border and $m$ is the total number of attributes. The example will be considered reliable according to the density principle if the computed $d_{rel}$ is greater than a given threshold. An example on a simple dataset with 2 attributes is shown in Fig. 3. More details can be found in [68].

Fig. 7. a) Test samples colored by the epistemic uncertainty value. b) Test samples colored by the aleatoric uncertainty value.

After evaluating the density principle, we will evaluate the local fit principle, i.e. we will check whether the algorithm performs well on training samples close to the example. To do so, we will select the k nearest examples in the training set (or in a validation set coming from the same exact distribution of the training set), and we will calculate average performance metrics, such as accuracy or F1 score, on these neighbors. If the performance metrics are above a predefined threshold, we will evaluate the classifier as "trustful" on that example following the local fit principle. The threshold should be selected by the user, according to the desired and realistic performance that can be achieved within the specific problem. Eventually, the example will be considered reliable if it is reliable according to both the density principle and the local fit principle.

We then compare this approach to an uncertainty-based estimation, using a Random Forest (RF). RFs are widely used in clinical ML applications, since they have shown high performance in different prediction tasks [79]. In addition, it has recently been shown how it is possible to compute prediction uncertainty as the entropy of the predicted posterior probabilities, as reported in [82]. For a continuous measure of reliability (e.g. uncertainty and density-based reliability), the optimal threshold to label an example as "reliable" or "unreliable" can be selected as follows: for different thresholds, we compute the entropy of the reliable and unreliable sets with respect to the labels "correctly classified" and "incorrectly classified" by the classifier. The threshold with the lowest entropy is selected. By doing so, we select the threshold that is best able to distinguish correctly classified and incorrectly classified examples. In particular, we train a RF with 100 trees and at least 10 samples per leaf. We apply the two approaches for pointwise reliability estimation to detect incorrectly classified examples on a simulated dataset and on MIMIC-III, a widely used medical dataset which collects de-identified health data associated with thousands of intensive care unit (ICU) admissions [46]. Code is available as Google Colaboratory notebooks (https://github.com/GiovannaNicora/reliability).

3.1. Simulated dataset

We simulate a binary classification problem under dataset shift. First, we generate normally distributed samples (std = 1) with 2 features and class overlap. Each class is represented by a cluster in the 2d space, as reported in Fig. 4a. The dataset is made of 6000 samples, 80% of which are in class 0 and the remaining in class 1. This dataset represents the true underlying population. We then want to simulate 1) that the available class distribution in the training set is different from the true population and 2) that dataset shift occurs. Therefore, we selected about 800 samples for training and about 300 for validation. In both training and validation, samples are equally distributed between the two classes. The remaining samples are kept for final testing. We then simulate dataset shift of the red class by shifting the "red samples" along the x axis and by adding random noise to the y axis (Fig. 4b). The shifted samples will be part of the final test set. In this context, the validation set represents an IID population, while the test set gathers both IID and OOD instances. The validation and the test set will only be used to evaluate the performance of the classifier, as well as reliability estimation, and they will not be used for hyperparameter tuning or model selection. The validation set will be used to select thresholds for reliability based on uncertainty and based on the density principle, while the threshold for the local fit principle was empirically set to a local accuracy equal to or higher than 0.75. For additional information about the experiments, code is available (https://github.com/GiovannaNicora/reliability). After developing our classifier, errors can occur for different reasons: the classifier may be wrong when there is overlap between the feature values of the two different classes (aleatoric uncertainty) and when the dataset shift occurs (epistemic uncertainty). To detect reliable and unreliable examples, we will apply and compare the approaches detailed in the previous sections: reliability will be evaluated (1) in terms of local fit and density principles, and (2) in terms of the predictive uncertainty.

3.2. MIMIC-III dataset

The MIMIC-III dataset is a freely available resource that contains clinical data and vital signs of thousands of patients in ICU. In particular, we used the preprocessed dataset made available by the PhysioNet 2012 challenge [83]. We are interested in developing a model to predict in-hospital death from clinical data. Moreover, we will examine whether bias in the training set affects test performance and whether reliability assessment can support the user in identifying such unreliable predictions. After removing features with at least 90% of missing values and patients with at least one missing value, we obtain a dataset of 4480 patients that survived ("In-Hospital death" = 0) and 768 that died in the hospital ("In-Hospital death" = 1). More details about the feature set can be found in [83] and at https://physionet.org/content/challenge-2012/.
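To make the pipeline of this section concrete, here is a compact sketch of the two checks (simplified with respect to our experiments: the training border of each attribute is approximated by its observed minimum and maximum, rather than by the class-wise border detection of [68], and the thresholds are illustrative):

```python
import numpy as np

def density_reliability(x_new, X_train):
    """Density principle: d_rel(x) = 1 - m_x / m, where m_x counts the
    attributes whose value falls outside the training border."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    m_x = np.sum((x_new < lo) | (x_new > hi))
    return 1.0 - m_x / X_train.shape[1]

def local_fit(model, x_new, X_ref, y_ref, k=20):
    """Local fit principle: accuracy of the trained model on the k examples
    (from the training or an i.i.d. validation set) nearest to x_new,
    using Euclidean distance."""
    idx = np.argsort(np.linalg.norm(X_ref - x_new, axis=1))[:k]
    return (model.predict(X_ref[idx]) == y_ref[idx]).mean()

def is_reliable(model, x_new, X_ref, y_ref, d_thr=0.99, fit_thr=0.75):
    # An example is reliable only if BOTH principles are satisfied
    return (density_reliability(x_new, X_ref) > d_thr and
            local_fit(model, x_new, X_ref, y_ref) >= fit_thr)
```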
For training and validation, we mostly selected male patients. Therefore, in the test set, around 72% of patients are female. In the training set, 14% of patients are in class 1, while in the test set the percentage of patients in class 1 is higher (23%). We select as classifier a Random Forest, which allows us to compute uncertainty, as explained in the previous section. The best thresholds for uncertainty and density-based reliability will be computed on the results of the predictions made on the validation (i.i.d.) set. Also in this case, as in the simulated dataset, the validation set is similar to the training set. We carried out an additional experiment on the MIMIC dataset, reported in the Supplementary Material, where the data shift is simulated using age groups. In this case, we used as classifier a Lasso logistic regression, and we used the predicted posterior probability as uncertainty estimation (see Supplementary Material).
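The entropy-based threshold selection described in Section 3 can be sketched as follows (one possible implementation: we assume here that higher scores mean more reliable, and that the entropies of the two subsets are combined through a size-weighted average, details the text above does not fix):

```python
import numpy as np

def binary_entropy(correct):
    """Entropy of a boolean 'correctly classified' vector."""
    if correct.size == 0:
        return 0.0
    p = correct.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_threshold(scores, correct, candidates):
    """Pick the threshold whose reliable/unreliable split of the validation
    set has the lowest (size-weighted) entropy with respect to the
    correctly/incorrectly classified labels."""
    best_t, best_h = None, np.inf
    for t in candidates:
        reliable = scores >= t
        h = (reliable.mean() * binary_entropy(correct[reliable]) +
             (~reliable).mean() * binary_entropy(correct[~reliable]))
        if h < best_h:
            best_t, best_h = t, h
    return best_t
```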
4. Results

4.1. Simulated dataset

Fig. 9. a) Performance of the classifier on the entire test set, on the reliable set and on the unreliable set identified through the uncertainty estimation. b) Performance of the classifier on the entire test set, on the reliable set and on the unreliable set identified through the application of the local fit and density principles.

The Matthews correlation coefficient for the unreliable set according to the uncertainty method is 42%, while it drops to 34% for the unreliable set detected with the local fit and density principles. The AUC values of the reliable and unreliable sets according to the uncertainty are comparable (75% vs 72%), while their difference is higher with the second method (86% vs 64%). Therefore, in this simulated case a reliability estimation based on the local fit and density principles seems to perform better than the uncertainty-based estimation in the identification of reliable and unreliable instances.

4.2. MIMIC dataset

Fig. 10 reports the results of a RF trained on male patients. The performance on the test set, both on the IID and the OOD sets, is lower than that reported on the training set through a 5-fold cross-validation. Yet, on the IID set, the results are slightly higher in comparison to the OOD set (female patients). We then calculated the uncertainty and the reliability to evaluate whether we would be able to identify a subset of patients with a higher chance of being correctly classified. By analyzing the performance metrics computed on the reliable and unreliable sets according to the uncertainty (Fig. 11a and Table 4), for this particular dataset, the uncertainty-based method did not label any true positive example as reliable. On the contrary, the reliability based on the local fit and density principles labels all the false negative and false positive samples as unreliable, while the
Fig. 10. Performance of the classifier on the training set (5-fold cross validation mean and standard deviation), on the validation set, on the IID test set and on the OOD test set.

reliable set is entirely made of correctly classified examples (Fig. 11b and Table 3). Some correctly classified samples are labeled as unreliable. Reliability estimation in a region informs us about the distance from the training set, independently of the labels. This implies that reliability estimation provides a warning sign about the machine learning prediction by computing how "distant" a particular sample is from the training set and/or whether it falls in a region where a high number of errors occur. Still, the model can correctly classify unreliable samples, even if this should occur at a lower rate.

According to the reliability computed with the local fit and density principles, 45% of the test set is considered reliable (70% females) and 55% unreliable (73% females). The percentage of reliable predictions on female patients is 44%, while 47% of predictions on male patients was considered reliable. If we consider the reliable and unreliable sets identified with uncertainty, we see that 67% of the test set is considered reliable (71% females) and 33% unreliable (73% females). The percentage of predictions supposed to be reliable on female patients is 66%, while 70% of predictions on male patients was considered reliable. From these results, we can observe that the strategies behave as expected and that, while both are able to highlight that predictions on females are less reliable than those on males, the local fit and density principles are able to label a higher percentage of predictions on o.o.d. samples as unreliable (odds ratio 1.144) than uncertainty (odds ratio 1.135).

5. Discussion

ML applications to medical-related problems need not only to show high performance during validation, but they also need to demonstrate their trustworthiness throughout the entire deployment phase. Intrinsic trust can be achieved through explainable methods that pinpoint the reasoning process performed by black box predictive models. Reliability monitoring is as crucial as explainability. Our claim is that the implementation of reliability monitoring mechanisms should be a requirement of any ML application to achieve trust. Reliability is also a requirement of the recent guidelines from the European Commission for the development of trustworthy AI systems (https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai). In such guidelines, it is stated that "a reliable AI system is one that works properly with a range of inputs and in a range of situations". We need therefore to define (1) when a system is considered to work properly and (2) what is the acceptable range of inputs and situations. In this paper we concentrate our attention on point-wise prediction reliability, which is the degree of trust that the predicted class is equal to the true class of a single instance. In other words, it is the degree of trust that the classification is correct. Reliability can be judged by analyzing whether the instance to be classified falls into a "trustworthy" region of the feature space, where the classifier should be confident and sure about the prediction. This region contains samples coming from the same distribution as the training data (i.i.d. samples) and for which the classifier showed a "good fit", for instance high accuracy. Other approaches to estimate reliability rely on uncertainty estimation, or on using particular types of classifiers, such as conformal methods.

Yet, incorporating reliability estimation within a ML pipeline also adds computational complexity. For instance, when the reliability is computed by comparing the training set to each new instance to be classified, the user needs to have access to the training data. Access to training data can be unfeasible or even impossible, due to the dimensionality of the training set or to privacy issues.

A well-known paradigm to deal with these issues is represented by instance selection methods, which select only the most informative training examples to be stored. We suggest using such methods for reliability assessment. In this case, instance selection methods choose the most informative training samples that will be compared with new instances for reliability assessment. The results may depend on the specific method used to evaluate whether the example of the test set is OOD.

In our paper, we have presented an approach that is even more parsimonious, since it performs OOD reliability assessment by defining boundaries attribute-by-attribute, either in the original or in a transformed feature space. In this case, the additional information to be retained can be compactly represented by O(m) variables.
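For example, with min/max borders the retained information reduces to 2m numbers computed once at development time, after which the training data itself is no longer needed for the density check (a sketch under the same simplifying border definition used earlier; the data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 40))  # illustrative training data, m = 40

# Development time: retain only O(m) variables (here 2m border values),
# then the training set can be discarded or kept private.
borders = {"lo": X_train.min(axis=0), "hi": X_train.max(axis=0)}

# Deployment time: the density check needs only the stored borders.
def density_reliability(x_new, borders):
    outside = (x_new < borders["lo"]) | (x_new > borders["hi"])
    return 1.0 - outside.mean()  # equals 1 - m_x / m

x_new = rng.normal(size=40)
print(density_reliability(x_new, borders))
```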
Fig. 11. a) Performance of the classifier on the entire test set, the reliable set and the unreliable set according to the value of uncertainty. b) Performance of the classifier on the entire test set, the reliable set and the unreliable set according to the local fit and density principles.

Table 3
Confusion matrix of the test set, the reliable set and the unreliable set detected with the local fit and density-based method.

| | TN | TP | FP | FN |
|---|---|---|---|---|
| All test set | 2658 | 31 | 30 | 447 |
| Reliable set | 1416 | 8 | 0 | 0 |
| Unreliable set | 1242 | 23 | 30 | 437 |

Table 4
Confusion matrix of the test set, the reliable set and the unreliable set detected with the uncertainty method.

| | TN | TP | FP | FN |
|---|---|---|---|---|
| All test set | 2658 | 31 | 30 | 447 |
| Reliable set | 2010 | 0 | 0 | 129 |
| Unreliable set | 648 | 31 | 30 | 318 |

Despite its simplicity, our approach has some limitations. First of all, by detecting border examples attribute-by-attribute, we do not take into account possible correlations between features. Secondly, the results of reliability assessment may depend on parameters that are tuned by the user, such as the threshold for reliable/unreliable labelling, and the number of neighbors and the distance function chosen to select those neighbors within the application of the local fit principle. When applying the density principle, the values of the features in the training set are used to select the borders. Before testing, users can select a threshold for reliability, either by using another dataset (which we called the "validation set" in our experiments) or, in the absence of a third dataset, by choosing a predefined threshold. We showed how to perform reliability assessment on a simple simulated dataset and on the well-known medical dataset MIMIC. For illustrative purposes, in the first simulated binary dataset of two dimensions, we added a shift that affects only one of the two classes, i.e. the positive class (red samples). In medical applications, this may happen, for instance, when interventions are taken to treat a disease. Consider for example an application where the aim is to detect cancer cells in a patient based on single-cell gene expression profiles. Suppose that the classifier is trained on data coming from patients before treatment.
G. Nicora et al. Journal of Biomedical Informatics 127 (2022) 103996
treatment. If a new treatment, targeting cancer cells only, is available, CRediT authorship contribution statement
such treatment can modify the gene expression profiles of the targeted
cancer cells before they die. Therefore, we may observe a shift in gene Giovanna Nicora: Methodology, Formal analysis, Conceptualiza
expression features that affect only the positive class. Yet, in the ma tion, Writing – original draft. Miguel Rios: Conceptualization, Writing –
jority of applications the shift will affect the entire population, regard review & editing. Ameen Abu-Hanna: Conceptualization, Writing –
less of the class. The investigation of different types and combinations of review & editing, Supervision. Riccardo Bellazzi: Methodology,
covariate shifts is an important topic. Another possible approach can be Conceptualization, Writing – review & editing, Supervision.
to simulate a dataset shift that corresponds to a change of the very na
ture of the distribution itself. An extensive benchmarking of this method
Declaration of Competing Interest
applied to different Machine Learning classifiers, along with the previ
ously published approaches mentioned in the Background sections, will
The authors declare that they have no known competing financial
be performed as future work. Additionally, the qualitative evaluation of
interests or personal relationships that could have appeared to influence
the different methodologies that can compute a form of a reliability
the work reported in this paper.
measure is complex, and standard procedures to obtain these measures
are missing to the best of our knowledge. In our results, we computed
Acknowledgements
different metrics on the reliable and unreliable sets to qualitatively
evaluate whether reliable predictions were associated with better per
We thank the anonymous reviewers for their careful reading of our
formance metrics. This method implies that the ground truth of test data
manuscript and their many insightful comments and suggestions. This
is available. In the scenario in which reliability assessment is deployed
work was partially supported by the Department of Electrical, Computer
within a machine learning pipeline, and it is applied to new prediction
and Biomedical Engineering of University of Pavia and by the European
(s), the ground truth may not be available. Therefore, it may be
Commission as part of the PERISCOPE project (Grant Agreement
impossible to monitor the reliability performance as we reported in the
101016233), coordinated by the University of Pavia.
paper, and new metrics should be developed. Yet, it is when the ground
truth is not available (during deployment), that an approach for reli
ability estimation can guide and support the user, by “raising an alarm” References
when predictions are labeled as unreliable. Reliability may also support
[1] M.R. Abbas, M.S.A. Nadeem, A. Shaheen, A.A. Alshdadi, R. Alharbey, S.-O. Shim,
monitoring, by highlighting possible shifts and bias in the data. Coupled W. Aziz, Accuracy Rejection Normalized-Cost Curves (ARNCCs): A Novel 3-
with XAI, reliability can promote trustworthiness in a model’s predic Dimensional Framework for Robust Classification, IEEE Access 7 (2019)
tion. While local XAI can show the relationship between feature values 160125–160143, https://doi.org/10.1109/ACCESS.2019.2950244.
[2] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P.
and the prediction on a single instance in an interpretable way, reli Fieguth, et al., A Review of Uncertainty Quantification in Deep Learning:
ability assures that the model is being used as intended, i.e. on data Techniques, Applications and Challenges, 2021. ArXiv:2011.06225 [Cs], January.
coming from a region of the features space where the classifier should be http://arxiv.org/abs/2011.06225.
[3] A. Ahmadi, S. Omatu, T. Fujinaka, T. Kosaka, Improvement of Reliability in
confident. Banknote Classification Using Reject Option and Local PCA, Inf. Sci. 168 (1) (2004)
An interesting, complementary approach is to resort to generative models. Such models can be flexibly used to perform data distribution assessment: by capturing the statistical properties of the training data, they can be used either to compute goodness-of-fit measures or to generate synthetic data [91] that can substitute for real data during reliability assessment.
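As a rough sketch of the goodness-of-fit idea, assuming a numeric feature matrix X_train: a Gaussian mixture stands in here for more expressive generative models, and the 5th-percentile cutoff is an arbitrary illustrative choice rather than a recommended setting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training matrix; in practice, use the real training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))

# Fit a simple generative model of the training distribution.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_train)

# Calibrate a goodness-of-fit cutoff on the training data itself,
# e.g. the 5th percentile of the per-instance log-likelihood.
cutoff = np.percentile(gmm.score_samples(X_train), 5)

def fits_training_distribution(x_new):
    # Instances scoring below the cutoff are flagged as poorly
    # supported by the training distribution.
    return gmm.score_samples(x_new.reshape(1, -1))[0] >= cutoff
```

In the synthetic-data variant, the fitted model would instead be sampled to produce surrogate instances for the reliability assessment itself.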
6. Conclusion

We formally defined different metrics for reliability estimation and comprehensively reported the different methodologies that have been used to evaluate reliability, not only in healthcare applications.

We then showed a simple approach to add a reliability estimation to any type of classifier. Such reliability is computed following the local fit and density principles defined by Saria et al. [76]. We compared this method to reliability based on the level of uncertainty, and we investigated which of the two methodologies is better suited to identify incorrectly classified examples, both on a simulated dataset and on a clinical dataset. In these two experiments, reliability based on the local fit and density principles appears to perform better at identifying samples for which the classifier is correct at a higher rate. We also compared the local fit and density principles with reliability based on the predicted posterior probability. Our results confirm that the predicted posterior probability, as a measure of reliability of the classification, can be less useful under dataset shift (see Supplementary Material). We conclude that the use of “classifier-related” metrics (such as the predicted posterior probability, uncertainty or conformal prediction) to assess pointwise reliability may be biased and misleading, given the intrinsic data-driven nature of the classifier. Reliability estimation should evaluate the classifier’s output independently of the classifier itself. However, additional experiments should be carried out on different types of both simulated and medical datasets for further evaluation.
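To make the two principles concrete, the following is a minimal sketch of one possible implementation, assuming a fitted scikit-learn-style classifier clf and a labeled training set; the parameters k, density_q and min_local_acc are illustrative choices, not the settings used in our experiments. The density principle is approximated by the mean distance to the k nearest training neighbors, and the local fit principle by the trained model’s accuracy on those same neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reliability_flags(clf, X_train, y_train, X_new, k=20,
                      density_q=95, min_local_acc=0.8):
    # Index the training set once for neighbor queries.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)

    # Density principle: calibrate a distance cutoff on the training
    # data itself (each training point counts itself among its own
    # neighbors, so this baseline is slightly optimistic).
    train_dist, _ = nn.kneighbors(X_train)
    density_cut = np.percentile(train_dist.mean(axis=1), density_q)

    dist, idx = nn.kneighbors(X_new)
    flags = []
    for mean_d, neighbors in zip(dist.mean(axis=1), idx):
        in_density = mean_d <= density_cut
        # Local fit principle: the trained model should perform well on
        # the training instances most similar to the new one.
        local_acc = (clf.predict(X_train[neighbors])
                     == y_train[neighbors]).mean()
        flags.append(bool(in_density and local_acc >= min_local_acc))
    return np.array(flags)
```

A prediction is flagged as reliable only when both checks pass; an instance failing either check would trigger the “alarm” discussed above.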
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. This work was partially supported by the Department of Electrical, Computer and Biomedical Engineering of the University of Pavia and by the European Commission as part of the PERISCOPE project (Grant Agreement 101016233), coordinated by the University of Pavia.

References
[1] M.R. Abbas, M.S.A. Nadeem, A. Shaheen, A.A. Alshdadi, R. Alharbey, S.-O. Shim, W. Aziz, Accuracy Rejection Normalized-Cost Curves (ARNCCs): A Novel 3-Dimensional Framework for Robust Classification, IEEE Access 7 (2019) 160125–160143, https://doi.org/10.1109/ACCESS.2019.2950244.
[2] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, et al., A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges, ArXiv:2011.06225 [Cs], 2021 January, http://arxiv.org/abs/2011.06225.
[3] A. Ahmadi, S. Omatu, T. Fujinaka, T. Kosaka, Improvement of Reliability in Banknote Classification Using Reject Option and Local PCA, Inf. Sci. 168 (1) (2004) 277–293, https://doi.org/10.1016/j.ins.2004.02.018.
[4] A. Alimadadi, S. Aryal, I. Manandhar, P.B. Munroe, B. Joe, X. Cheng, Artificial Intelligence and Machine Learning to Fight COVID-19, Physiol. Genomics 52 (4) (2020) 200–202, https://doi.org/10.1152/physiolgenomics.00029.2020.
[5] N. Alirezaie, K.D. Kernohan, T. Hartley, J. Majewski, T.D. Hocking, ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants, Am. J. Human Genet. 103 (4) (2018) 474–483, https://doi.org/10.1016/j.ajhg.2018.08.005.
[6] P.L. Bartlett, M.H. Wegkamp, Classification with a Reject Option Using a Hinge Loss, J. Machine Learn. Res. 9 (59) (2008) 1823–1840.
[7] S. Benjamens, P. Dhunnoo, B. Meskó, The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database, NPJ Digit. Med. 3 (2020) 118, https://doi.org/10.1038/s41746-020-00324-0.
[8] A. Benso, S. Di Carlo, G. Politano, A. Savino, H. Hafeezurrehman, Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R, BMC Bioinf. 12 (Suppl 13) (2011) S3, https://doi.org/10.1186/1471-2105-12-S13-S3.
[9] Z. Bosnić, I. Kononenko, Estimation of Individual Prediction Reliability Using the Local Sensitivity Analysis, Appl. Intell. 29 (3) (2008) 187–203, https://doi.org/10.1007/s10489-007-0084-9.
[10] Z. Bosnić, I. Kononenko, An Overview of Advances in Reliability Estimation of Individual Predictions in Machine Learning, Intell. Data Anal. 13 (2) (2009) 385–401.
[11] J. Brinkrolf, B. Hammer, Interpretable machine learning with reject option, Automatisierungstechnik 66 (4) (2018) 283–290, https://doi.org/10.1515/auto-2017-0123.
[12] I. Buzhinsky, A. Nerinovsky, S. Tripakis, Metrics and Methods for Robustness Evaluation of Neural Networks with Generative Models, ArXiv:2003.01993 [Cs, Stat], 2020 March, http://arxiv.org/abs/2003.01993.
[13] H. Choi, D. Yeo, S. Kwon, Y. Kim, Gene Selection and Prediction for Cancer Classification Using Support Vector Machines with a Reject Option, Comput. Stat. Data Anal. 55 (5) (2011) 1897–1908, https://doi.org/10.1016/j.csda.2010.12.001.
[14] C. Chow, On Optimum Recognition Error and Reject Tradeoff, IEEE Trans. Inf. Theory 16 (1) (1970) 41–46, https://doi.org/10.1109/TIT.1970.1054406.
[15] F. Condessa, J. Bioucas-Dias, C.A. Castro, J.A. Ozolek, J. Kovačević, Classification with Reject Option Using Contextual Information, in: 2013 IEEE 10th International Symposium on Biomedical Imaging, 2013, pp. 1340–1343, https://doi.org/10.1109/ISBI.2013.6556780.
[16] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, P. Pérez, Addressing Failure Prediction by Learning Model Confidence, ArXiv:1910.04851 [Cs, Stat], 2019 October, http://arxiv.org/abs/1910.04851.
[17] L.P. Cordella, C. De Stefano, C. Sansone, M. Vento, An Adaptive Reject Option for LVQ Classifiers, in: C. Braccini, L. DeFloriani, G. Vernazza (Eds.), Image Analysis and Processing. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1995, pp. 68–73, https://doi.org/10.1007/3-540-60298-4_238.
[18] L.P. Cordella, C. De Stefano, F. Tortorella, M. Vento, A Method for Improving Classification Reliability of Multilayer Perceptrons, IEEE Trans. Neural Networks 6 (5) (1995) 1140–1147, https://doi.org/10.1109/72.410358.
[19] I. Cortés-Ciriano, A. Bender, Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks, J. Chem. Informat. Model. 59 (3) (2019) 1269–1281, https://doi.org/10.1021/acs.jcim.8b00542.
[20] I. Cortés-Ciriano, A. Bender, Concepts and Applications of Conformal Prediction in Computational Drug Discovery, ArXiv:1908.03569 [Cs, q-Bio], 2019 August, http://arxiv.org/abs/1908.03569.
[21] C.M. Cutillo, K.R. Sharma, L. Foschini, S. Kundu, M. Mackintosh, K.D. Mandl, Machine Intelligence in Healthcare—Perspectives on Trustworthiness, Explainability, Usability, and Transparency, Npj Digital Medicine 3 (1) (2020) 1–5, https://doi.org/10.1038/s41746-020-0254-2.
[22] S.E. Davis, Stabilizing Calibration of Clinical Prediction Models in Non-Stationary Environments: Methods Supporting Data-Driven Model Updating, 2019 October, https://ir.vanderbilt.edu/handle/1803/14327.
[23] Z. Dlamini, F.Z. Francies, R. Hull, R. Marima, Artificial Intelligence (AI) and Big Data in Cancer and Precision Oncology, Comput. Struct. Biotechnol. J. (2020), https://doi.org/10.1016/j.csbj.2020.08.019.
[24] G.F. Elsayed, I. Goodfellow, J. Sohl-Dickstein, Adversarial Reprogramming of Neural Networks, ArXiv:1806.11146 [Cs, Stat], 2018 November, http://arxiv.org/abs/1806.11146.
[25] S.G. Finlayson, J.D. Bowers, J. Ito, J.L. Zittrain, A.L. Beam, I.S. Kohane, Adversarial Attacks on Medical Machine Learning, Science 363 (6433) (2019) 1287–1289, https://doi.org/10.1126/science.aaw4399.
[26] L. Fischer, L. Ehrlinger, V. Geist, R. Ramler, F. Sobieczky, W. Zellinger, B. Moser, Applying AI in Practice: Key Challenges and Lessons Learned, in: A. Holzinger, P. Kieseberg, A. Min Tjoa, E. Weippl (Eds.), Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 451–471, https://doi.org/10.1007/978-3-030-57321-8_25.
[27] G. Fumera, I. Pillai, F. Roli, Classification with Reject Option in Text Categorisation Systems, in: 12th International Conference on Image Analysis and Processing, 2003 Proceedings, 2003, pp. 582–587, https://doi.org/10.1109/ICIAP.2003.1234113.
[28] G. Fumera, F. Roli, Support Vector Machines with Embedded Reject Option, in: S.-W. Lee, A. Verri (Eds.), Pattern Recognition with Support Vector Machines. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2002, pp. 68–82, https://doi.org/10.1007/3-540-45665-1_6.
[29] R.G. Sousa, A.R. Rocha Neto, J.S. Cardoso, G.A. Barreto, Robust Classification with Reject Option Using the Self-Organizing Map, Neural Comput. Appl. 26 (7) (2015) 1603–1619, https://doi.org/10.1007/s00521-015-1822-2.
[30] J. Gao, J. Yao, Y. Shao, Towards Reliable Learning for High Stakes Applications, Proc. AAAI Conf. Artif. Intell. 33 (01) (2019) 3614–3621, https://doi.org/10.1609/aaai.v33i01.33013614.
[31] Y. Geifman, R. El-Yaniv, SelectiveNet: A Deep Neural Network with an Integrated Reject Option, ArXiv:1901.09192 [Cs, Stat], 2019 June, http://arxiv.org/abs/1901.09192.
[32] H. Ghoddusi, G.G. Creamer, N. Rafizadeh, Machine Learning in Energy Economics and Finance: A Review, Energy Econ. 81 (June) (2019) 709–727, https://doi.org/10.1016/j.eneco.2019.05.006.
[33] F.K. Hamey, B. Göttgens, Machine Learning Predicts Putative Hematopoietic Stem Cells within Large Single-Cell Transcriptomics Data Sets, Exp. Hematol. 78 (2019) 11–20, https://doi.org/10.1016/j.exphem.2019.08.009.
[34] B. Hanczar, E.R. Dougherty, Classification with Reject Option in Gene Expression Data, Bioinformatics 24 (17) (2008) 1889–1895, https://doi.org/10.1093/bioinformatics/btn349.
[35] B. Hanczar, M. Sebag, Combination of One-Class Support Vector Machines for Classification with Reject Option, in: T. Calders, F. Esposito, E. Hüllermeier, R. Meo (Eds.), Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2014, pp. 547–562, https://doi.org/10.1007/978-3-662-44848-9_35.
[36] Y. Hechtlinger, B. Póczos, L. Wasserman, Cautious Deep Learning, ArXiv:1805.09460 [Cs, Stat], 2019 February, http://arxiv.org/abs/1805.09460.
[37] M.E. Hellman, The Nearest Neighbor Classification Rule with a Reject Option, IEEE Trans. Syst. Sci. Cybernet. 6 (3) (1970) 179–185, https://doi.org/10.1109/TSSC.1970.300339.
[38] D. Hendrycks, K. Gimpel, A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, ArXiv:1610.02136 [Cs], 2018 October, http://arxiv.org/abs/1610.02136.
[39] B. Hie, B.D. Bryson, B. Berger, Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design, Cell Syst. 11 (5) (2020) 461–477.e9, https://doi.org/10.1016/j.cels.2020.09.007.
[40] E. Hüllermeier, W. Waegeman, Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods, Machine Learn. 110 (3) (2021) 457–506, https://doi.org/10.1007/s10994-021-05946-3.
[41] E.J. Hwang, S. Park, K.-N. Jin, J.I. Kim, S.Y. Choi, J.H. Lee, J.M. Goo, et al., Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs, JAMA Network Open 2 (3) (2019) e191095, https://doi.org/10.1001/jamanetworkopen.2019.1095.
[42] A. Jacovi, A. Marasović, T. Miller, Y. Goldberg, Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI, ArXiv:2010.07487 [Cs], 2021 January, http://arxiv.org/abs/2010.07487.
[43] L.A. Jeni, J.F. Cohn, F. De La Torre, Facing Imbalanced Data Recommendations for the Use of Performance Metrics, in: International Conference on Affective Computing and Intelligent Interaction and Workshops: [Proceedings]. ACII (Conference) 2013, 2013, pp. 245–251, https://doi.org/10.1109/ACII.2013.47.
[44] F.C. Jiang, Study on a Confidence Machine Learning Method Based on Ensemble Learning, Cluster Comput. 20 (4) (2017) 3357–3368, https://doi.org/10.1007/s10586-017-1085-z.
[45] H. Jiang, B. Kim, M.Y. Guan, M. Gupta, To Trust Or Not To Trust A Classifier, ArXiv:1805.11783 [Cs, Stat], 2018 October, http://arxiv.org/abs/1805.11783.
[46] A.E.W. Johnson, T.J. Pollard, L. Shen, L.-W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, R.G. Mark, MIMIC-III, a Freely Accessible Critical Care Database, Sci. Data 3 (1) (2016) 160035, https://doi.org/10.1038/sdata.2016.35.
[47] J. Kang, C.D. Yoo, Learning of a Multi-Class Classifier with Rejection Option Using Sparse Representation, in: The 18th IEEE International Symposium on Consumer Electronics (ISCE 2014), 2014, pp. 1–2, https://doi.org/10.1109/ISCE.2014.6884541.
[48] S. Kang, S. Cho, S.-J. Rhee, K.-S. Yu, Reliable Prediction of Anti-Diabetic Drug Failure Using a Reject Option, Pattern Anal. Appl. 20 (3) (2017) 883–891, https://doi.org/10.1007/s10044-016-0585-4.
[49] E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, M. Craven, Learning to Predict Post-Hospitalization VTE Risk from EHR Data, AMIA Annual Symp. Proc. 2012 (November) (2012) 436–445.
[50] C.J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, D. King, Key Challenges for Delivering Clinical Impact with Artificial Intelligence, BMC Med. 17 (1) (2019) 195, https://doi.org/10.1186/s12916-019-1426-2.
[51] A. Kendall, Y. Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, ArXiv:1703.04977 [Cs], 2017 October, http://arxiv.org/abs/1703.04977.
[52] B. Kompa, J. Snoek, A.L. Beam, Second Opinion Needed: Communicating Uncertainty in Medical Machine Learning, Npj Digital Med. 4 (1) (2021) 1–6, https://doi.org/10.1038/s41746-020-00367-3.
[53] I. Kononenko, Machine Learning for Medical Diagnosis: History, State of the Art and Perspective, Artif. Intell. Med. 23 (1) (2001) 89–109, https://doi.org/10.1016/S0933-3657(01)00077-X.
[54] M. Kukar, I. Kononenko, Reliable Classifications with Machine Learning, in: Proceedings of the 13th European Conference on Machine Learning, ECML ’02, Springer-Verlag, Berlin, Heidelberg, 2002, pp. 219–231.
[55] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles, ArXiv:1612.01474 [Cs, Stat], 2017 November, http://arxiv.org/abs/1612.01474.
[56] C. Leibig, V. Allken, M.S. Ayhan, P. Berens, S. Wahl, Leveraging Uncertainty Information from Deep Neural Networks for Disease Detection, Sci. Rep. 7 (1) (2017) 17816, https://doi.org/10.1038/s41598-017-17876-z.
[57] J.A. Leonard, M.A. Kramer, L.H. Ungar, A Neural Network Architecture That Computes Its Own Reliability, Comput. Chem. Eng. 16 (9) (1992) 819–835, https://doi.org/10.1016/0098-1354(92)80035-8.
[58] C.X. Ling, V.S. Sheng, Cost-Sensitive Learning, in: C. Sammut, G.I. Webb (Eds.), Encyclopedia of Machine Learning, Springer US, Boston, MA, 2010, pp. 231–235, https://doi.org/10.1007/978-0-387-30164-8_181.
[59] S. Malakouti, M. Hauskrecht, Predicting Patient’s Diagnoses and Diagnostic Categories from Clinical-Events in EHR Data, in: D. Riaño, S. Wilk, A. ten Teije (Eds.), Artificial Intelligence in Medicine. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 125–130, https://doi.org/10.1007/978-3-030-21642-9_17.
[60] L. Meijerink, G. Cinà, M. Tonutti, Uncertainty Estimation for Classification and Risk Prediction on Medical Tabular Data, ArXiv:2004.05824 [Cs, Stat], 2020 May, http://arxiv.org/abs/2004.05824.
[61] D.P.P. Mesquita, L.S. Rocha, J.P.P. Gomes, A.R. Rocha Neto, Classification with Reject Option for Software Defect Prediction, Appl. Soft Comput. 49 (December) (2016) 1085–1093, https://doi.org/10.1016/j.asoc.2016.06.023.
[62] S. Messoudi, S. Rousseau, S. Destercke, Deep Conformal Prediction for Robust Models, Informat. Process. Manage. Uncertainty Knowledge-Based Syst. 1237 (May) (2020) 528–540, https://doi.org/10.1007/978-3-030-50146-4_39.
[63] S.J. Mooney, V. Pejaver, Big Data in Public Health: Terminology, Machine Learning, and Privacy, Annu. Rev. Public Health 39 (1) (2018) 95–112, https://doi.org/10.1146/annurev-publhealth-040617-014208.
[64] A.H. Murphy, What Is a Good Forecast? An Essay on the Nature of Goodness in Weather Forecasting, Weather Forecasting 8 (2) (1993) 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
[65] K. Murphy, Probabilistic Machine Learning: An Introduction, n.d. (accessed 8 April 2021), https://probml.github.io/pml-book/book1.html.
[66] M.S.A. Nadeem, J.-D. Zucker, B. Hanczar, Accuracy-Rejection Curves (ARCs) for Comparing Classification Methods with a Reject Option, in: Machine Learning in Systems Biology, PMLR, 2009, pp. 65–81, http://proceedings.mlr.press/v8/nadeem10a.html.
[67] P.M. do Nascimento, I.G. Medeiros, R.M. Falcão, B. Stransky, J.E.S. de Souza, A Decision Tree to Improve Identification of Pathogenic Mutations in Clinical Practice, BMC Medical Informat. Decision Making 20 (1) (2020) 52, https://doi.org/10.1186/s12911-020-1060-0.
[68] G. Nicora, R. Bellazzi, A Reliable Machine Learning Approach Applied to Single-Cell Classification in Acute Myeloid Leukemia, AMIA Annual Symp. Proc. 2020 (January) (2021) 925–932.
[69] G. Nicora, S. Marini, I. Limongelli, E. Rizzo, S. Montoli, F.F. Tricomi, R. Bellazzi, A Semi-Supervised Learning Approach for Pan-Cancer Somatic Genomic Variant Classification, in: D. Riaño, S. Wilk, A. ten Teije (Eds.), Artificial Intelligence in Medicine. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 42–46, https://doi.org/10.1007/978-3-030-21642-9_7.
[70] J.A. Olvera-López, J.A. Carrasco-Ochoa, J.F. Martínez-Trinidad, J. Kittler, A Review of Instance Selection Methods, Artif. Intell. Rev. 34 (2) (2010) 133–143, https://doi.org/10.1007/s10462-010-9165-y.
[71] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J.V. Dillon, B. Lakshminarayanan, J. Snoek, Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift, ArXiv:1906.02530, 2019, http://arxiv.org/abs/1906.02530.
[72] A. Ozen, M. Gönen, E. Alpaydan, T. Haliloğlu, Machine Learning Integration for Predicting the Effect of Single Amino Acid Substitutions on Protein Stability, BMC Struct. Biol. 9 (October) (2009) 66, https://doi.org/10.1186/1472-6807-9-66.
[73] M. Panahiazar, V. Taslimitehrani, N. Pereira, J. Pathak, Using EHRs and Machine Learning for Heart Failure Survival Analysis, Stud. Health Technol. Informat. 216 (2015) 40–44.
[74] M.T. Ribeiro, S. Singh, C. Guestrin, “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, ArXiv:1602.04938 [Cs, Stat], 2016 August, http://arxiv.org/abs/1602.04938.
[75] C.M. Santos-Pereira, A.M. Pires, On Optimal Reject Rules and ROC Curves, Pattern Recogn. Lett. 26 (7) (2005) 943–952, https://doi.org/10.1016/j.patrec.2004.09.042.
[76] S. Saria, A. Subbaswamy, Tutorial: Safe and Reliable Machine Learning, ArXiv:1904.07204 [Cs], 2019 April, http://arxiv.org/abs/1904.07204.
[77] A. Sarica, A. Cerasa, A. Quattrone, Random Forest Algorithm for the Classification of Neuroimaging Data in Alzheimer’s Disease: A Systematic Review, Front. Aging Neurosci. 9 (2017) 329, https://doi.org/10.3389/fnagi.2017.00329.
[78] C. Saunders, A. Gammerman, V. Vovk, Transduction with Confidence and Credibility, in: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 722–726.
[79] M. Schinkel, K. Paranjape, R.S. Nannan Panday, N. Skyttberg, P.W.B. Nanayakkara, Clinical applications of artificial intelligence in sepsis: A narrative review, Comput. Biol. Med. 115 (2019) 103488, https://doi.org/10.1016/j.compbiomed.2019.103488.
[80] P. Schulam, S. Saria, Can You Trust This Prediction? Auditing Pointwise Reliability After Learning, ArXiv:1901.00403 [Cs, Stat], 2019 February, http://arxiv.org/abs/1901.00403.
[81] G. Shafer, V. Vovk, A Tutorial on Conformal Prediction, J. Machine Learn. Res. 9 (12) (2008) 371–421.
[82] M.H. Shaker, E. Hüllermeier, Aleatoric and Epistemic Uncertainty with Random Forests, in: M.R. Berthold, A. Feelders, G. Krempl (Eds.), Advances in Intelligent Data Analysis XVIII. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 444–456, https://doi.org/10.1007/978-3-030-44584-3_35.
[83] I. Silva, G. Moody, D.J. Scott, L.A. Celi, R.G. Mark, Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012, Comput. Cardiol. 39 (2012) 245–248.
[84] R. Sousa, B. Mora, J.S. Cardoso, An Ordinal Data Method for the Classification with Reject Option, in: 2009 International Conference on Machine Learning and Applications, 2009, pp. 746–750, https://doi.org/10.1109/ICMLA.2009.11.
[85] R. Sousa, A.R. Neto, G. Barreto, J.S. Cardoso, M. Coimbra, Reject Option Paradigm for the Reduction of Support Vectors, in: ESANN, 2014.
[86] A. Subbaswamy, S. Saria, From Development to Deployment: Dataset Shift, Causality, and Shift-Stable Models in Health AI, Biostatistics 21 (2) (2020) 345–352, https://doi.org/10.1093/biostatistics/kxz041.
[87] J. Suutala, S. Pirttikangas, J. Riekki, J. Röning, Reject-Optional LVQ-Based Two-Level Classifier to Improve Reliability in Footstep Identification, in: A. Ferscha, F. Mattern (Eds.), Pervasive Computing. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 182–187, https://doi.org/10.1007/978-3-540-24646-6_12.
[88] D.M.J. Tax, R.P.W. Duin, Growing a Multi-Class Classifier with a Reject Option, Pattern Recogn. Lett. 29 (10) (2008) 1565–1570, https://doi.org/10.1016/j.patrec.2008.03.010.
[89] F. Tortorella, An Optimal Reject Rule for Binary Classifiers, in: F.J. Ferri, J.M. Iñesta, A. Amin, P. Pudil (Eds.), Advances in Pattern Recognition. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2000, pp. 611–620, https://doi.org/10.1007/3-540-44522-6_63.
[90] K. Tran, W. Neiswanger, J. Yoon, Q. Zhang, E. Xing, Z.W. Ulissi, Methods for Comparing Uncertainty Quantifications for Material Property Predictions, ArXiv:1912.10066 [Cond-Mat, Physics:Physics], 2020 February, http://arxiv.org/abs/1912.10066.
[91] A. Tucker, Z. Wang, Y. Rotalinti, et al., Generating high-fidelity synthetic patient data for assessing machine learning healthcare software, npj Digit. Med. 3 (2020) 147, https://doi.org/10.1038/s41746-020-00353-9.
[92] D. Ulmer, L. Meijerink, G. Cinà, Trust Issues: Uncertainty Estimation Does Not Enable Reliable OOD Detection On Medical Tabular Data, ArXiv:2011.03274 [Cs, Stat], 2020 November, http://arxiv.org/abs/2011.03274.
[93] A. Uyar, F. Gurgen, Arrhythmia Classification Using Serial Fusion of Support Vector Machines and Logistic Regression, in: 2007 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2007, pp. 560–565, https://doi.org/10.1109/IDAACS.2007.4488483.
[94] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, T.B. Schön, Evaluating Model Calibration in Classification, ArXiv:1902.06977 [Cs, Stat], 2019 February, http://arxiv.org/abs/1902.06977.
[95] M.H. Waseem, M.S.A. Nadeem, A. Abbas, A. Shaheen, W. Aziz, A. Anjum, U. Manzoor, M.A. Balubaid, S.-O. Shim, On the Feature Selection Methods and Reject Option Classifiers for Robust Cancer Prediction, IEEE Access 7 (2019) 141072–141082, https://doi.org/10.1109/ACCESS.2019.2944295.
[96] T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, L. Daniel, Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach, ArXiv:1801.10578 [Cs, Stat], 2018 January, http://arxiv.org/abs/1801.10578.
[97] J. Wiens, S. Saria, M. Sendak, M. Ghassemi, V.X. Liu, F. Doshi-Velez, K. Jung, et al., Do No Harm: A Roadmap for Responsible Machine Learning for Health Care, Nat. Med. 25 (9) (2019) 1337–1340, https://doi.org/10.1038/s41591-019-0548-6.