Evaluating Pointwise Reliability of Machine Learning Prediction
Keywords: Machine learning trustworthiness; Predictive reliability; Uncertainty

Abstract

Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore important to improve the quality of machine learning output and to add safeguards to support its adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the predictions of the machine learning model for these instances may often be wrong, given that the model is applied outside its "reliable" space of work, leading to decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it is important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed.

Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlaps among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. The local fit principle verifies that the trained model performs well on training subsets that are most similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning, especially in medicine.
1. Introduction

Machine Learning (ML) for making predictions and extracting information from data is increasingly applied in many different fields, from medicine [53,63] to finance [32]. Thanks to these algorithms we are able to analyze the increasing mass of biological data produced by next-generation sequencing technologies, to make inferences about genomic mutation pathogenicity [69,5,67] or cell type [33], to identify therapy targets, and to design new therapeutic compounds [23]. ML can also be used to extract information from EHR data to predict patients' diagnoses [59], post-hospitalization risks [49] or heart failure [73], and it can also support the analysis of emerging diseases, such as outcome severity during COVID-19 infection [4].

Moreover, there is a fast increase in the number of FDA-approved products that are now available on the market; most of them were developed for the field of Radiology, but several other areas are represented, such as cardiology and internal medicine [7]. The transition from research to bedside poses several challenges in order to provide systems with safeguards that can allow widespread clinical use [50,97]. Those challenges include logistical and regulatory issues, with a focus on trustworthiness and fairness [21,97]. Recently, the European Union regulations, in compliance with the General Data Protection Regulation (GDPR), have defined minimal standards for implementing ML systems in public health, requiring, among others, the explainability of the model. Explainable AI (XAI) is becoming a promising field of research within the AI community. Its purpose is to investigate methods for analyzing or complementing AI black box models to make the internal logic and output of algorithms transparent and/or interpretable.
Table 1
Definition of different concepts to evaluate a classifier on a test set, along with possible measures and relevant probability distributions.

| Concept | Definition | Measures | Relevant distributions |
|---|---|---|---|
| Calibration | The predictive probability distribution of a classification model honestly expresses its observed probability distribution [94] | Calibration (reliability) diagram | $P(Y \in \cdot \mid f(X))$, $f(X)$ |
| Robustness | How accurate a ML algorithm is on new independent data even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances | Accuracy on adversarial samples or on small perturbations of inputs | $p(Y = C \mid X; \theta)$, $p(X)$ |
| Sharpness | Variability of predictions as described by distributions of predictions [64] | $\frac{1}{N}\sum_{i=1}^{N} P(y_i \mid x_i)\,(1 - P(y_i \mid x_i))$ | $p(Y = C \mid X; \theta)$ |

Table 2
Definition of different concepts to evaluate a classifier on a single instance, along with possible measures and relevant probability distributions.

| Concept | Definition | Measures | Relevant distributions |
|---|---|---|---|
| Uncertainty (of a single ML prediction) | Variability in a model's prediction due to different types of uncertainty (epistemic/aleatoric uncertainty) | Variance, entropy, confidence intervals | $p(Y = C \mid X; \theta)$ |
| Reliability (of a single ML prediction) | Degree of trust that the prediction is correct, $p(\hat{y}_i = y_i)$ | Local fit principle, density principle | $p(Y = C \mid X; \theta)$, $p(X)p(Y)$ |

Explainability can thus be perceived as a property through which ML can gain trust from the user [74]. However, when dealing with a model's prediction, we trust the classifier not only if we understand its logic but, most importantly, if its output classification is likely to be correct. XAI may help in identifying incorrectly classified examples by identifying artifacts or undesirable correlations from training [74], but it does not in itself enable trust [42]. For instance, XAI may be less useful to identify critical situations in the data when they are not known a priori by users. In other words, XAI can be used to help understand the classification process made by the model, but it cannot assess the reliability of single ML predictions per se. Understanding whether we can trust the prediction of ML systems is crucial to identify failures, since ML inherently suffers from dataset shifts and poor generalization ability across different populations [50,41]. As a matter of fact, examples of fooling deep learning (DL) networks with adversarial instances are widely reported in the literature [96,24]. Generalization ability under data perturbation (i.e. the robustness of the algorithm) can be promoted by using stable ML methods and tested before deployment [86], and strategies for model updating need to be considered for deployed models [22]. In addition to that, we need methods and metrics to identify potential failures, due to perturbations or poor generalization ability, a posteriori, as part of a monitoring strategy of our model.

Assessing the reliability of ML predictions may increase the trust of potential users when using these tools in many high-stakes applications [42]. In a ML pipeline, model performance is evaluated on a set of instances (test or validation set) through performance metrics such as accuracy, sensitivity and specificity. These metrics represent an aggregate prediction capability of the model on that test set, since they are computed (in the case of a binary classification task) from the number of true/false positive examples and true/false negative examples in the test set, or from the complete set of predicted posterior probabilities (to compute AUC and PRC). When the model is deployed, we would expect that it will perform (on average) as well as it did on the test/validation set. However, this may not be the case [76]. Moreover, the overall performance metrics mentioned above do not provide a way to measure the performance on single new cases whose class is predicted by the model during deployment [54]. Additionally, we may not know the true class of these new future examples. For these reasons, it is crucial to develop new types of metrics and methodologies that can help us understand whether the classification may be wrong case by case. For instance, suppose that we have trained a ML model to predict whether a patient has a tumor based on the results of laboratory testing and MRI images. Suppose that the classifier predicts, for a certain patient, that his/her probability of having a tumor is 0.89. The probability threshold for classification is 0.5; therefore the classifier assigns the patient to the positive class. Through reliability assessment, we would like to assign a degree of reliability to such a prediction. The pipeline may report that the predicted class for our patient is "positive" with a certain reliability, for instance 0.7, or a binary value (0 or 1, indicating an unreliable or reliable classification). In this article, we aim at discussing potential approaches to evaluate the reliability of ML model predictions that can be integrated in clinical decision support systems. We use the term "reliability" to indicate the degree of trust that we have in the prediction made by the ML model on a single example, as used in Saria et al. [76] and in [54]. The application of reliability can support the identification of correct and incorrect cases, mentioned in the definition of "trust in model correctness" [42]. Naively, the probability of the predicted class could be seen as the classifier's trust in that prediction, but this approach can be misleading since it does not account for the algorithm's biases [54]. Other strategies can be pursued, but we observe a lack of standardization in the usage of terms such as reliability, uncertainty or robustness in the literature. For this reason, we will first start with a background section where we list the relevant concepts pertaining to model evaluation, both in terms of performance ability and reliability, and we then systematically review relevant papers that, to some extent, cover our principles of interest. Afterwards, we will present our approach to integrate different concepts to perform reliability assessment on simulated and real clinical data, and we will conclude our work with a discussion and some closing remarks.

2. Background

2.1. Terminology and problem definition

Let us consider the supervised classification problem, where a ML model $f: \mathbb{R}^m \to \mathbb{R}^c$ has been trained on a training set $D = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^m$ is the vector containing the attribute values for the i-th example and $y_i \in \{c_1, \ldots, c_j, \ldots, c_C\}$ is the class of the training example. In particular, for a discriminative ML algorithm,

$$f := p(Y = c_j \mid X, D)$$

is the predicted posterior probability of the class being $c_j$ given the training data $D$. In the development phase of our model, we need to evaluate the performance of the ML classifier on a new set of data, called the test set $T = \{(x_j, y_j)\}$ (alternatively, we can use bootstrap, cross-validation or other resampling approaches to estimate errors).
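As a minimal illustration of this notation in code (a sketch assuming scikit-learn and a synthetic dataset; the data and model choices here are illustrative, not the ones used in our experiments):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Training set D = {(x_i, y_i)}, with x_i in R^m and y_i in {c_1, ..., c_C}
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A discriminative model f approximating p(Y = c_j | X, D)
f = RandomForestClassifier(n_estimators=100, random_state=0)
f.fit(X_train, y_train)

# Predicted posterior probabilities on the test set T, one column per class c_j
posterior = f.predict_proba(X_test)

# Predictive accuracy on T: fraction of correctly classified examples
accuracy = (f.predict(X_test) == y_test).mean()
print(f"accuracy = {accuracy:.3f}")
```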
On the test set, the model is commonly evaluated in terms of its predictive accuracy, i.e. the fraction of correctly classified examples in the test set. In many applications, such as in medicine, accuracy alone should not be the only evaluation metric, especially when the classes are imbalanced and/or the cost of misclassification for one class is higher with respect to the other(s) [58]. For this reason, other metrics, such as precision (i.e. positive predictive value), recall (i.e. sensitivity), specificity, F1 score and Matthews Correlation Coefficient should be reported and analyzed as well [43]. These metrics are applied to the discretized output of the classifier, which is the posterior predicted probability of the class given the data. The computation of such measurements on a test set represents the most common approach to evaluate the performance of a classifier. Yet, other measurements, reflecting different concepts, can be used during validation and deployment. Here, we provide an overview of these concepts, reported in Table 1 and Table 2. Table 1 highlights metrics that are computed on the entire test or validation set, while in Table 2 we report metrics applied to a single machine learning prediction.

Calibration measures whether the distribution of the target class conditional on any given prediction of our model is equal to that prediction [94]. It is worth noting that the concept of calibration is called "reliability" in the weather forecast community [64], but in this paper we will use "calibration", since with "reliability" we are referring to a comprehensive concept that aims at assessing whether the predicted class is equal to the true class of a single example. Through the variability of the predicted probability distributions we can evaluate the sharpness [64], for instance by computing the variance of the predicted posterior probabilities [90]. These "overall" metrics measure the ability of the classifier on the complete test set. If the test set is representative of the entire underlying population of interest, then we expect that the classifier will perform on future data coming from the population as well as it did in the test. Another property that can be evaluated on a set of examples is the robustness of the classifier, i.e. how accurate a ML algorithm is on new independent data even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances. Out-of-distribution (OOD) samples and adversarial samples are often generated to fool DL [12]. Robustness against adversarial attacks should be guaranteed in high-stakes applications, such as medical ML [25].

These metrics are especially useful during the validation phase, to understand how the model performs in general. In deployment, we may want to understand how the classifier is performing case by case. In this context, uncertainty is an important concept related to AI trust. In statistics, uncertainty is a statement about a single prediction, which can be, for instance, a confidence interval. Ideally, especially in the medical field, a predictive model should have the ability to abstain from prediction when the uncertainty is high [52]. Uncertainty can be related to the model's "self-confidence" in the prediction: for instance, given a binary discriminative classifier outputting a probability, we can label the model as "certain" when the predicted probability is near 0 or 1. In this sense, uncertainty is a measure of sharpness on a single example. From now on, we will refer to this type of uncertainty as "unsharpness", to distinguish it from the predictive uncertainty, which is due to the epistemic and aleatoric uncertainty [65]. Aleatoric uncertainty refers to uncertainty in the data, such as noise, while epistemic uncertainty refers to the uncertainty in the model, such as the uncertainty in the model's parameters due to lack of knowledge. Epistemic uncertainty should increase when the prediction is made on an instance which is "far" from the training set (i.e. out-of-distribution (OOD) samples). The total uncertainty is the sum of aleatoric and epistemic uncertainty. Aleatoric uncertainty represents the irreducible part of the uncertainty, while epistemic uncertainty is the reducible part of (the total) uncertainty. In ML, the sources of uncertainty (aleatoric or epistemic) are not usually distinguished [40]. Different methodologies allow for total uncertainty quantification, either in terms of building a confidence interval around a prediction, or by computing a single score. Gaussian processes can also estimate uncertainty by computing the variance of predictions around a single point, and they have been used to select small molecules based on predicted binding affinity with kinases [39]. A widely used method to measure the uncertainty is by using multiple classifiers. Intuitively, the more the classifiers disagree, the higher the uncertainty. Ensembles of classifiers can be used to estimate both aleatoric and epistemic uncertainty in a formal way, by computing the entropy or the relative likelihood of the posterior probabilities predicted by each weak classifier [82]. In Deep Networks, uncertainty is estimated using drop-out methods or through ensembles of networks. In [56], the authors build a DL network that includes drop-out Bayesian uncertainty estimation, which can be used to diagnose diabetic retinopathy, showing that decisions informed by uncertainty estimation can improve diagnostic performance. Aleatoric and epistemic uncertainty of Bayesian deep networks have been estimated for computer vision tasks in [51]. The reader can refer to [2] for a review of uncertainty quantification for DL and to [40] for formal definitions of uncertainty and methodological approaches to estimate total, epistemic and aleatoric uncertainty. Another assumption that has been made is that the predicted probability is the measure of (un)certainty, provided that the model is calibrated [55]. However, calibration itself can suffer from dataset shift, thus making this type of uncertainty estimation less trustworthy in case of OOD examples [71].

In 2009, [10] defined reliability in ML as "any qualitative property or ability of the system which is related to a critical performance indicator (positive or negative) of that system, such as accuracy, inaccuracy, availability, downtime rate, responsiveness, etc". In this paper, we are interested in the assessment of individual (pointwise) reliability, thus we will use the term reliability to indicate a property of a single ML prediction made on a given sample. In this context, [76] put together principles from reliability engineering to assure that a ML system performs as intended in deployment, i.e. without failure and within specified performance limits. These principles encompass (1) prevention of failures, i.e. adoption of techniques, during development, to avoid possible incorrect cases, (2) identification of failures during deployment, and (3) maintenance of the model, i.e. fixing or preventing the failures when they occur. To identify failures, the concept of point-wise reliability should be applied, by evaluating the model's prediction on a single instance and thus possibly rejecting the classification when such prediction is deemed "unreliable". To compute pointwise reliability, two principles should be considered according to [76]:

1. density principle: it asks whether the test sample is close to the training set (similar to outlier detection or out-of-distribution sample detection);
2. local fit principle: it asks whether the model was accurate on the training samples closest to the test case.

By evaluating these two principles we should be able to identify the "context" in which the model is expected to be performant, thus promoting the "trust in model correctness". In other words, the model should perform well on samples that (1) are similar to the training set and (2) are similar to training samples on which the model performed well. The density principle only refers to the distribution on the attribute space of the training set and the new instance to be classified, while the local fit principle also refers to the averaged performance of the classifier. The first principle acknowledges that the model may be unreliable when the distribution of the new data is changing for any reason. The second principle acknowledges that the model was not "perfectly accurate" in the first place (during model development and evaluation) on specific subsets of the training set.
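As a concrete illustration of the ensemble-based estimates discussed above, the following sketch decomposes total uncertainty into its aleatoric and epistemic parts using the entropy formulation (one possible implementation consistent with the definitions reviewed in [40,82]; obtaining the member predictions from the trees of a Random Forest is an illustrative choice):

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """Decompose total uncertainty into aleatoric and epistemic parts.

    member_probs: array of shape (n_members, n_samples, n_classes) with
    the posterior probabilities predicted by each ensemble member."""
    eps = 1e-12  # avoid log(0)
    # Total uncertainty: entropy of the averaged predictive distribution
    mean_probs = member_probs.mean(axis=0)
    total = -(mean_probs * np.log2(mean_probs + eps)).sum(axis=-1)
    # Aleatoric part: expected entropy of the individual members
    member_entropy = -(member_probs * np.log2(member_probs + eps)).sum(axis=-1)
    aleatoric = member_entropy.mean(axis=0)
    # Epistemic part: the reducible remainder, i.e. member disagreement
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# e.g., for a fitted sklearn RandomForestClassifier `rf` and inputs X:
# member_probs = np.stack([t.predict_proba(X) for t in rf.estimators_])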
A formal definition of pointwise reliability was proposed by [54] as the probability that the predicted class is equal to the true class value of an instance. To calculate this probability, the authors proposed a transductive approach that involves the re-training of the algorithm: each time a new instance needs to be classified, the posterior predicted probability of the original model is stored, the new instance is included in the training set, and the model is re-trained. The posterior probability is predicted by this second model, and it is compared with the stored predicted probability. Intuitively, if the posterior probability changes to some degree after the inclusion of the instance in the training set, such an instance is adding knowledge that was previously unavailable. Therefore, the first predicted probability should be deemed unreliable. However, this approach implies the re-training of the algorithm, which is sometimes unfeasible due to computational time and/or unavailability of the training set. Uncertainty may be used to evaluate the reliability of a classifier, following the principle that we will be more willing to trust a predictive model that is "certain" about a prediction. For instance, through uncertainty estimation, in [56] the authors were able to detect difficult cases for further inspection, resulting in substantial improvements in detection performance on the remaining data. Moreover, as the epistemic uncertainty can be caused by OOD data, it may seem that evaluating epistemic uncertainty naturally includes the evaluation of the density principle. Yet, as studies have shown, uncertainty evaluation through the predicted posterior probability is not trustworthy under dataset shift [71], which is exactly the case in which we would like to have a reliable approach to detect possible failures. Also the robustness of the model is related to its reliability, since it promotes the development of accurate models on adversarial samples. Yet, robustness is limited to the evaluation of adversarial samples, while a pointwise reliability estimation deals with any kind of future input samples.

Fig. 1 shows a workflow for ML model development and deployment, highlighting validation and continuous monitoring through reliability assessment.

Fig. 1. Supervised Machine Learning model life cycle. First, a model is trained on an available set of training samples (1). Then, the model is evaluated on a test set in terms of different metrics (2). Eventually, the model is deployed, and it can be used for prediction on new cases (e.g. patients). In such single cases, the true class is usually unknown, but it is necessary to understand whether the ML prediction should be trusted. Reliability assessment can support the identification of unreliable predictions (3).
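A minimal sketch of the transductive check of [54] described above (a hypothetical implementation: in particular, we assume here that the new instance is added to the training set with its own predicted label, a detail the description above leaves open):

```python
import numpy as np
from sklearn.base import clone

def transductive_reliability_score(model, X_train, y_train, x_new):
    """Compare the posterior for x_new before and after re-training with
    x_new included, as in the transductive approach of [54]."""
    x_new = np.asarray(x_new).reshape(1, -1)
    p_before = model.predict_proba(x_new)[0]
    y_guess = model.predict(x_new)[0]  # assumed label for the new instance
    retrained = clone(model).fit(np.vstack([X_train, x_new]),
                                 np.append(y_train, y_guess))
    p_after = retrained.predict_proba(x_new)[0]
    # A large change suggests the instance added previously unavailable
    # knowledge, so the original prediction is deemed less reliable.
    return np.abs(p_after - p_before).max()
```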
2.2. Related work

Several works incorporate the notions of reliability, uncertainty, and robustness to provide classifiers able to detect potential failures. Some classifiers naturally provide a way to compute a form of reliability measure, while other work exploits or adapts well-known classifiers to compute a form of reliability measure.

In particular, a branch of pattern recognition research, named Learning with Reject Option (LRO), focuses on the development of classifiers with reject ability. In LRO, the aim is to find a subset of the test set for which the model has higher performance, thus identifying non-reliable examples for which the classifier may withhold the classification [6]. The concept of LRO is equivalent to the concept of "selective prediction", in which a model can choose to abstain from the classification when the uncertainty is high [31]. For instance, Bayes classifiers were exploited to detect reliable regions in gene expression data [34], while posterior probability and contextual information are used to classify images from teratoma tissues and reject non-reliable portions [15]. By identifying samples for which the classification may be wrong, classification reliability and classification with reject option can be seen as synonyms [3,27,61]. Learning with rejection often implies the definition of a reject threshold. In pattern recognition, an optimal reject threshold based on the rejection cost was formulated in 1970 [14], and a reject threshold was computed based on the known costs of misclassification and rejection in text categorization [27]. An optimal reject rule for a binary classification problem was also formulated by [89], in which error costs peculiar to the particular problem are used to compute optimal reliability thresholds based on ROC curves. [75] demonstrated that the approaches from [14] and from [89] are actually equivalent. Another important aspect is the performance evaluation of LRO classifiers. [66] introduced the accuracy-rejection curve (ARC), which represents how the accuracy varies as a function of the rejection rate. To account for misclassification and rejection costs in applications such as medical diagnosis, [1] recently extended the ARC curves.

A binary classifier with reject option (or reliability estimation) can be developed in different ways: 1) by choosing a classifier whose characteristics can be exploited to evaluate reliability; 2) by thresholding the predicted posterior probability or an uncertainty estimation; 3) by designing independent and complementary classifiers; and 4) by designing classifiers that intrinsically learn to classify with rejection (also known as embedded methods) [61,85]. Also, 5) unsupervised methods, such as generative models, can be used to capture statistical properties of the dataset to be used for reliability estimation. In the following subsections we will discuss these different approaches (as reported in Fig. 2). Note that we are interested in LRO methods that identify unreliable regions of the feature space. Another application of LRO methods is the detection of samples in outlier classes, in situations where new classes can appear [47,88]. Moreover, we focus on reliability assessment for classification problems, while regression problems may also benefit from reliability estimation [9].

2.2.1. Reliability based on classifier's properties and characteristics

Transductive algorithms, such as k-Nearest Neighbor and Support Vector Machines, rather than outputting a general classification rule or function, classify instances with respect to the available training set. Therefore, reliability can be determined based on how much the new case is "close" to the training set, an assumption that is included in the "density principle" as well as in other previous work [30]. For instance, using k-Nearest Neighbour, one can rely on the "purity" of the k nearest neighbours of an example: if fewer than k' neighbors are in the predicted class, then the classification is rejected (i.e. unreliable) [37]. In [78] the confidence of a SVM classifier is computed from the values of the Lagrange multipliers, since they reflect how "odd" a particular example is. The term confidence is sometimes used to refer to the same concept as reliability [54,26,48,44]. Distance from the separating hyperplane of the SVM was used by [93] to detect arrhythmia from complex electrocardiogram (ECG) signals, and a SVM with l1 penalty and reject option was developed by [13] to perform feature selection and classify cancer samples based on gene expression profiles. Among neural network approaches, LVQ (Learning Vector Quantisation) networks consist of two layers for input and output and an array of codebook vectors. The output is defined as the distance between the input vector and the codebook vectors for every class. The classification for an input sample x would be the class whose representative codebook vector has the minimum distance to the sample x. The distances between the sample and the codebook vectors of each class are used by [3] to estimate the probability densities for codebook vectors, which in turn are used to select the threshold for rejection. LVQ networks for LRO were first proposed by [17,18] and were also used for reliable footstep identification [87]. It has also been argued that LVQ combines model interpretability with the reject option [11]. Along with transductive algorithms, conformal prediction can also be used to compute reliability. Conformal prediction is a general framework that uses past experience to predict the confidence intervals for a new prediction in an on-line setting, in which the prediction of a new case is made based on the previously classified examples. Given a predicted label ŷ and the true label y, conformal prediction can compute a 95% confidence region that contains the y label with at least 95% probability. Usually, such regions also include the predicted label ŷ. In case of regression, ŷ and y are continuous values, while for classification they are discrete values [81]. Conformal prediction is often applied for reliable drug discovery (see [20] for a review) and can be extended to DL as well [36,62]. More recently, a new method computes the reliability of a trained model provided that the gradient of its loss function is accessible [80].

2.2.2. Reliability based on uncertainty or predicted posterior probability

The most straightforward way to develop LRO classifiers is adding a rejection rule to any type of classifier outputting a posterior probability or an uncertainty estimation. The reject threshold is set according to such a value, based on some trade-off between misclassification and rejection [27,34,8,95]. This approach is also called plug-in [14,35]. When using the predicted posterior probability, we assume that such probability is calibrated. In this case, when the predicted posterior probability for the predicted class is lower than a threshold, the classification is deemed unreliable. In [72] the authors define rejection intervals, based on misclassification rates, for the predicted posterior probability of a classifier that predicts the effect of amino acid substitutions on protein stability. In order to use a measure of uncertainty, we can use Gaussian Process classifiers, or we can compute epistemic, aleatoric and total uncertainty as the relative likelihood, or the entropy, of the posterior probabilities predicted by each weak classifier in an ensemble [40,82]. Alternatively, [48] exploited different "fusion methods" for posterior probabilities of ensembles, for instance by summing or taking the maximum value of the probabilities, to develop an anti-diabetic drug failure classifier with a reject option.

Recently, a variety of work has been published dealing with uncertainty and reject options for DL networks. In [19], deep network ensembles are used to predict drug efficacy. It has been observed that DL tends to provide high predicted probabilities also for misclassified examples. This is due to the fact that the softmax function, which is usually used by DL to output a predicted probability, is fast-growing exponentially [38]. Regarding the usage of the DL posterior probability as an indicator of reliability, published results are contradictory: [38] observed that the predicted probability of misclassified examples can usually be lower with respect to the correctly identified instances, and therefore DL probabilities may be evaluated to reject a classification. Lakshminarayanan et al. [55] successfully used an ensemble of networks to estimate the uncertainty from the predicted posterior probability. However, [71] performed a large-scale evaluation of different methods for quantifying predictive uncertainty under dataset shift and showed that, along with accuracy, the calibration of the classifier also deteriorates under dataset shift. Consequently, using the classifier's prediction to detect unreliable classifications may not be trustworthy. On tabular medical data, a recent study found that, while ensemble methods for prediction and uncertainty quantification are the best performing in detecting correlation between uncertainty and performance, they are less suited for the identification of out-of-distribution examples. In this latter case, novelty detection methods, for instance based on Variational Autoencoders (VAE), perform better [60]. The limitation of uncertainty estimation to detect OOD examples is also reported by [92], where the authors investigated different methodologies for uncertainty estimation to detect OOD examples in medical data.

2.2.3. Complementary classifier

Another possible solution to design a pipeline that learns with rejection in a binary classification problem is the development of two independent classifiers. The first classifier is trained to output C1 only when the posterior probability for C1 classification is high, the second is
Fig. 4. a) Simulated binary dataset. b) Simulated binary dataset with dataset shift.
Fig. 5. Performances of the RF (100 trees) on the validation (IID) and test set (OOD and IID), in terms of accuracy, F1 score, Matthews Correlation Coefficient, Brier Score, Area Under the Receiver Operating Characteristic curve (AUC) and Area Under the Precision Recall Curve (PRC).

Fig. 6. a) Boxplots of the RF total uncertainty on correctly classified and incorrectly classified examples in the validation set. b) Boxplots of the RF total uncertainty on correctly classified and incorrectly classified examples in the test set.

3. Reliability estimation following the density and local fit principles

Previous work has shown that the usage of "classifier-related" metrics (such as the posterior predicted probability) to assess pointwise reliability may be biased and misleading, given the intrinsic data-driven nature of the classifiers. Moreover, "a classifier may simply not be the best judge of its own trustworthiness" [45]. Motivated by this, we are interested in investigating the use of approaches for reliability estimation that are independent of the classifier chosen, and that do not rely on the classifier itself to compute its own reliability. To do so, we believe that the road to follow for reliability estimation should be based on the local fit and density principles defined by [76]. In the following, we present a possible strategy based on the implementation of these two principles, and we demonstrate the utility of this approach on a simulated and a real-case medical dataset.

Given any type of classifier and the training set used to train the classifier, we first apply the instance selection-based approach proposed in [68]. Briefly, this approach first detects the "border" instances of the feature space, that is, for each attribute, we identify those training examples that are either inner or outer border for their membership class.
When a new example is to be classified, the "density principle" will be evaluated by comparing the example to the training border, and its "density-based reliability" will be calculated by the simple formula:

$$d_{rel}(x) = 1 - \frac{m_x}{m}$$

where $m_x$ is the number of attributes for which the value of the new example falls outside the training border and $m$ is the total number of attributes. The example will be considered reliable according to the density principle if the computed $d_{rel}$ is greater than a given threshold. An example on a simple dataset with 2 attributes is shown in Fig. 3. More details can be found in [68].

Fig. 7. a) Test samples colored by the epistemic uncertainty value. b) Test samples colored by the aleatoric uncertainty value.

After evaluating the density principle, we will evaluate the local fit principle, i.e. we will check whether the algorithm performs well on training samples close to the example. To do so, we will select the k nearest examples in the training set (or in a validation set coming from the same exact distribution of the training set), and we will calculate average performance metrics, such as accuracy or F1 score, on these neighbors. If the performance metrics are above a predefined threshold, we will evaluate the classifier as "trustful" on that example following the local fit principle. The threshold should be selected by the user, according to the desired and realistic performance that can be achieved within the specific problem. Eventually, the example will be considered reliable if it is reliable according to both the density principle and the local fit principle.

We then compare this approach to an uncertainty-based estimation, using a Random Forest (RF). RFs are widely used in clinical ML applications, since they have shown high performance in different prediction tasks [79]. In addition, it has recently been shown how it is possible to compute prediction uncertainty as the entropy of the predicted posterior probabilities, as reported in [82]. For a continuous measure of reliability (e.g. uncertainty and density-based reliability), the optimal threshold to label an example as "reliable" or "unreliable" can be selected as follows: for different thresholds, we compute the entropy of the reliable and unreliable sets with respect to the labels "correctly classified" and "incorrectly classified" by the classifier. The threshold with the lowest entropy is selected. By doing so, we select the threshold that is best able to distinguish correctly classified and incorrectly classified examples. In particular, we train a RF with 100 trees and at least 10 samples per leaf. We apply the two approaches for pointwise reliability estimation to detect incorrectly classified examples on a simulated dataset and on MIMIC-III, a widely used medical dataset which collects de-identified health data associated with thousands of intensive care unit (ICU) admissions [46]. Code is available as Google Colaboratory notebooks (https://github.com/GiovannaNicora/reliability).

3.1. Simulated dataset

We simulate a binary classification problem under dataset shift. First, we generate normally distributed samples (std = 1) with 2 features and class overlap. Each class is represented by a cluster in the 2d space, as reported in Fig. 4a. The dataset is made of 6000 samples, 80% of which are in class 0 and the remaining in class 1. This dataset represents the true underlying population. We then want to simulate 1) that the available class distribution in the training set is different from the true population and 2) that dataset shift occurs. Therefore, we selected about 800 samples for training and about 300 for validation. In both training and validation, samples are equally distributed between the two classes. The remaining samples are kept for final testing. We then simulate dataset shift of the red class by shifting the "red samples" along the x axis and by adding random noise to the y axis (Fig. 4b). The shifted samples will be part of the final test set. In this context, the validation set represents an IID population, while the test set gathers both IID and OOD instances. The validation and the test set will only be used to evaluate the performance of the classifier, as well as reliability estimation, and they will not be used for hyperparameter tuning or model selection. The validation set will be used to select thresholds for reliability based on uncertainty and based on the density principle, while the threshold for the local fit principle was empirically set to a local accuracy equal to or higher than 0.75. For additional information about the experiments, code is available (https://github.com/GiovannaNicora/reliability). After developing our classifier, errors can occur for different reasons: the classifier may be wrong when there is overlap between the feature values of the two different classes (aleatoric uncertainty) and when the dataset shift occurs (epistemic uncertainty). To detect reliable and unreliable examples, we will apply and compare the approaches detailed in the previous sections: reliability will be evaluated (1) in terms of local fit and density principles, and (2) in terms of the predictive uncertainty.

3.2. MIMIC-III dataset

The MIMIC-III dataset is a freely available resource that contains clinical data and vital signs of thousands of patients in ICU. In particular, we used the preprocessed dataset made available by the PhysioNet 2012 challenge [83]. We are interested in developing a model to predict in-hospital death from clinical data. Moreover, we will examine whether bias in the training set affects test performance and whether reliability assessment can support the user in identifying such unreliable predictions. After removing features with at least 90% of missing values and patients with at least one missing value, we obtain a dataset of 4480 patients that survived ("In-Hospital death" = 0) and 768 that died in the hospital ("In-Hospital death" = 1). More details about the feature set can be found in [83] and at https://physionet.org/content/challenge-2012/.
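To make the pipeline of this section concrete, here is a compact sketch of the two checks (simplified with respect to our experiments: the training border of each attribute is approximated by its observed minimum and maximum, rather than by the class-wise border detection of [68], and the thresholds are illustrative):

```python
import numpy as np

def density_reliability(x_new, X_train):
    """Density principle: d_rel(x) = 1 - m_x / m, where m_x counts the
    attributes whose value falls outside the training border."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    m_x = np.sum((x_new < lo) | (x_new > hi))
    return 1.0 - m_x / X_train.shape[1]

def local_fit(model, x_new, X_ref, y_ref, k=20):
    """Local fit principle: accuracy of the trained model on the k examples
    (from the training or an i.i.d. validation set) nearest to x_new,
    using Euclidean distance."""
    idx = np.argsort(np.linalg.norm(X_ref - x_new, axis=1))[:k]
    return (model.predict(X_ref[idx]) == y_ref[idx]).mean()

def is_reliable(model, x_new, X_ref, y_ref, d_thr=0.99, fit_thr=0.75):
    # An example is reliable only if BOTH principles are satisfied
    return (density_reliability(x_new, X_ref) > d_thr and
            local_fit(model, x_new, X_ref, y_ref) >= fit_thr)
```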
For training and validation, we mostly selected male patients. Therefore, in the test set, around 72% of patients are female. In the training set, 14% of patients are in class 1, while in the test set the percentage of patients in class 1 is higher (23%). We select as classifier a Random Forest, which allows us to compute uncertainty, as explained in the previous section. The best thresholds for uncertainty and density-based reliability will be computed on the results of the predictions made on the validation (i.i.d.) set. Also in this case, as in the simulated dataset, the validation set is similar to the training set. We carried out an additional experiment on the MIMIC dataset, reported in the Supplementary Material, where the data shift is simulated using age groups. In this case, we used as classifier a Lasso logistic regression, and we used the predicted posterior probability as uncertainty estimation (see Supplementary Material).
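The entropy-based threshold selection described in Section 3 can be sketched as follows (one possible implementation: we assume here that higher scores mean more reliable, and that the entropies of the two subsets are combined through a size-weighted average, details the text above does not fix):

```python
import numpy as np

def binary_entropy(correct):
    """Entropy of a boolean 'correctly classified' vector."""
    if correct.size == 0:
        return 0.0
    p = correct.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_threshold(scores, correct, candidates):
    """Pick the threshold whose reliable/unreliable split of the validation
    set has the lowest (size-weighted) entropy with respect to the
    correctly/incorrectly classified labels."""
    best_t, best_h = None, np.inf
    for t in candidates:
        reliable = scores >= t
        h = (reliable.mean() * binary_entropy(correct[reliable]) +
             (~reliable).mean() * binary_entropy(correct[~reliable]))
        if h < best_h:
            best_t, best_h = t, h
    return best_t
```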
4. Results

4.1. Simulated dataset

Fig. 9. a) Performance of the classifier on the entire test set, on the reliable set and on the unreliable set identified through the uncertainty estimation. b) Performance of the classifier on the entire test set, on the reliable set and on the unreliable set identified through the application of the local fit and density principles.

The Matthews correlation coefficient for the unreliable set according to the uncertainty method is 42%, while it drops to 34% for the unreliable set detected with the local fit and density principles. The AUC values of the reliable and unreliable sets according to the uncertainty are comparable (75% vs 72%), while their difference is higher with the second method (86% vs 64%). Therefore, in this simulated case a reliability estimation based on the local fit and density principles seems to perform better than the uncertainty-based estimation in the identification of reliable and unreliable instances.

4.2. MIMIC dataset

Fig. 10 reports the results of a RF trained on male patients. The performance on the test set, both on the IID and the OOD sets, is lower than that reported on the training set through a 5-fold cross-validation. Yet, on the IID set, the results are slightly higher in comparison to the OOD set (female patients). We then calculated the uncertainty and the reliability to evaluate whether we would be able to identify a subset of patients with a higher chance of being correctly classified. By analyzing the performance metrics computed on the reliable and unreliable sets according to the uncertainty (Fig. 11a and Table 4), for this particular dataset, the uncertainty-based method did not label any true positive example as reliable. On the contrary, the reliability based on the local fit and density principles labels all the false negative and false positive samples as unreliable, while the
Fig. 10. Performance of the classifier on the training set (5-fold cross validation mean and standard deviation), on the validation set, on the IID test set and on the OOD test set.

reliable set is entirely made of correctly classified examples (Fig. 11b and Table 3). Some correctly classified samples are labeled as unreliable. Reliability estimation in a region informs us about the distance from the training set, independently of the labels. This implies that reliability estimation provides a warning sign about the machine learning prediction by computing how "distant" a particular sample is from the training set and/or whether it falls in a region where a high number of errors occur. Still, the model can correctly classify unreliable samples, even if this should occur at a lower rate.

According to the reliability computed with the local fit and density principles, 45% of the test set is considered reliable (70% females) and 55% unreliable (73% females). The percentage of reliable predictions on female patients is 44%, while 47% of predictions on male patients was considered reliable. If we consider the reliable and unreliable sets identified with uncertainty, we see that 67% of the test set is considered reliable (71% females) and 33% unreliable (73% females). The percentage of predictions supposed to be reliable on female patients is 66%, while 70% of predictions on male patients was considered reliable. From these results, we can observe that the strategies behave as expected and that, while both are able to highlight that predictions on females are less reliable than those on males, the local fit and density principles are able to label a higher percentage of predictions on o.o.d. samples as unreliable (odds ratio 1.144) than uncertainty (odds ratio 1.135).

5. Discussion

ML applications to medical-related problems need not only to show high performance during validation, but they also need to demonstrate their trustworthiness throughout the entire deployment phase. Intrinsic trust can be achieved through explainable methods that pinpoint the reasoning process performed by black box predictive models. Reliability monitoring is as crucial as explainability. Our claim is that the implementation of reliability monitoring mechanisms should be a requirement of any ML application to achieve trust. Reliability is also a requirement of the recent guidelines from the European Commission for the development of trustworthy AI systems (https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai). In such guidelines, it is stated that "a reliable AI system is one that works properly with a range of inputs and in a range of situations". We need therefore to define (1) when a system is considered to work properly and (2) what is the acceptable range of inputs and situations. In this paper we concentrate our attention on point-wise prediction reliability, which is the degree of trust that the predicted class is equal to the true class of a single instance. In other words, it is the degree of trust that the classification is correct. Reliability can be judged by analyzing whether the instance to be classified falls into a "trustworthy" region of the feature space, where the classifier should be confident and sure about the prediction. This region contains samples coming from the same distribution as the training data (i.i.d. samples) and for which the classifier showed a "good fit", for instance high accuracy. Other approaches to estimate reliability rely on uncertainty estimation, or on using particular types of classifiers, such as conformal methods.

Yet, incorporating reliability estimation within a ML pipeline also adds computational complexity. For instance, when the reliability is computed by comparing the training set to each new instance to be classified, the user needs to have access to the training data. Access to training data can be unfeasible or even impossible, due to the dimensionality of the training set or to privacy issues.

A well-known paradigm to deal with these issues is represented by instance selection methods, which select only the most informative training examples to be stored. We suggest using such methods for reliability assessment. In this case, instance selection methods choose the most informative training samples that will be compared with new instances for reliability assessment. The results may depend on the specific method used to evaluate whether the example of the test set is OOD.

In our paper, we have presented an approach that is even more parsimonious, since it performs OOD reliability assessment by defining boundaries attribute-by-attribute, either in the original or in a transformed feature space. In this case, the additional information to be retained can be compactly represented by O(m) variables.
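For example, with min/max borders the retained information reduces to 2m numbers computed once at development time, after which the training data itself is no longer needed for the density check (a sketch under the same simplifying border definition used earlier; the data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 40))  # illustrative training data, m = 40

# Development time: retain only O(m) variables (here 2m border values),
# then the training set can be discarded or kept private.
borders = {"lo": X_train.min(axis=0), "hi": X_train.max(axis=0)}

# Deployment time: the density check needs only the stored borders.
def density_reliability(x_new, borders):
    outside = (x_new < borders["lo"]) | (x_new > borders["hi"])
    return 1.0 - outside.mean()  # equals 1 - m_x / m

x_new = rng.normal(size=40)
print(density_reliability(x_new, borders))
```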
Fig. 11. a) Performance of the classifier on the entire test set, the reliable set and the unreliable set according to the value of uncertainty. b) Performance of the classifier on the entire test set, the reliable set and the unreliable set according to the local fit and density principles.

Table 3
Confusion matrix of the test set, the reliable set and the unreliable set detected with the local fit and density-based method.

| | TN | TP | FP | FN |
|---|---|---|---|---|
| All test set | 2658 | 31 | 30 | 447 |
| Reliable set | 1416 | 8 | 0 | 0 |
| Unreliable set | 1242 | 23 | 30 | 437 |

Table 4
Confusion matrix of the test set, the reliable set and the unreliable set detected with the uncertainty method.

| | TN | TP | FP | FN |
|---|---|---|---|---|
| All test set | 2658 | 31 | 30 | 447 |
| Reliable set | 2010 | 0 | 0 | 129 |
| Unreliable set | 648 | 31 | 30 | 318 |

Despite its simplicity, our approach has some limitations. First of all, by detecting border examples attribute-by-attribute, we do not take into account possible correlations between features. Secondly, the results of reliability assessment may depend on parameters that are tuned by the user, such as the threshold for reliable/unreliable labelling, and the number of neighbors and the distance function chosen to select those neighbors within the application of the local fit principle. When applying the density principle, the values of the features in the training set are used to select the borders. Before testing, users can select a threshold for reliability, either by using another dataset (which we called the "validation set" in our experiments) or, in the absence of a third dataset, by choosing a predefined threshold. We showed how to perform reliability assessment on a simple simulated dataset and on the well-known medical dataset MIMIC. For illustrative purposes, in the first simulated binary dataset of two dimensions, we added a shift that affects only one of the two classes, i.e. the positive class (red samples). In medical applications, this may happen, for instance, when interventions are taken to treat a disease. Consider for example an application where the aim is to detect cancer cells in a patient based on single-cell gene expression profiles. Suppose that the classifier is trained on data coming from patients before treatment.
G. Nicora et al. Journal of Biomedical Informatics 127 (2022) 103996
treatment. If a new treatment, targeting cancer cells only, is available, CRediT authorship contribution statement
such treatment can modify the gene expression profiles of the targeted
cancer cells before they die. Therefore, we may observe a shift in gene Giovanna Nicora: Methodology, Formal analysis, Conceptualiza
expression features that affect only the positive class. Yet, in the ma tion, Writing – original draft. Miguel Rios: Conceptualization, Writing –
jority of applications the shift will affect the entire population, regard review & editing. Ameen Abu-Hanna: Conceptualization, Writing –
less of the class. The investigation of different types and combinations of review & editing, Supervision. Riccardo Bellazzi: Methodology,
covariate shifts is an important topic. Another possible approach can be Conceptualization, Writing – review & editing, Supervision.
to simulate a dataset shift that corresponds to a change of the very na
ture of the distribution itself. An extensive benchmarking of this method
Declaration of Competing Interest
applied to different Machine Learning classifiers, along with the previ
ously published approaches mentioned in the Background sections, will
The authors declare that they have no known competing financial
be performed as future work. Additionally, the qualitative evaluation of
interests or personal relationships that could have appeared to influence
the different methodologies that can compute a form of a reliability
the work reported in this paper.
measure is complex, and standard procedures to obtain these measures
are missing to the best of our knowledge. In our results, we computed
Acknowledgements
different metrics on the reliable and unreliable sets to qualitatively
evaluate whether reliable predictions were associated with better per
We thank the anonymous reviewers for their careful reading of our
formance metrics. This method implies that the ground truth of test data
manuscript and their many insightful comments and suggestions. This
is available. In the scenario in which reliability assessment is deployed
work was partially supported by the Department of Electrical, Computer
within a machine learning pipeline, and it is applied to new prediction
and Biomedical Engineering of University of Pavia and by the European
(s), the ground truth may not be available. Therefore, it may be
Commission as part of the PERISCOPE project (Grant Agreement
impossible to monitor the reliability performance as we reported in the
101016233), coordinated by the University of Pavia.
paper, and new metrics should be developed. Yet, it is when the ground
truth is not available (during deployment), that an approach for reli
ability estimation can guide and support the user, by “raising an alarm” References
when predictions are labeled as unreliable. Reliability may also support
[1] M.R. Abbas, M.S.A. Nadeem, A. Shaheen, A.A. Alshdadi, R. Alharbey, S.-O. Shim,
monitoring, by highlighting possible shifts and bias in the data. Coupled W. Aziz, Accuracy Rejection Normalized-Cost Curves (ARNCCs): A Novel 3-
with XAI, reliability can promote trustworthiness in a model’s predic Dimensional Framework for Robust Classification, IEEE Access 7 (2019)
tion. While local XAI can show the relationship between feature values 160125–160143, https://doi.org/10.1109/ACCESS.2019.2950244.
[2] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P.
and the prediction on a single instance in an interpretable way, reli Fieguth, et al., A Review of Uncertainty Quantification in Deep Learning:
ability assures that the model is being used as intended, i.e. on data Techniques, Applications and Challenges, 2021. ArXiv:2011.06225 [Cs], January.
coming from a region of the features space where the classifier should be http://arxiv.org/abs/2011.06225.
[3] A. Ahmadi, S. Omatu, T. Fujinaka, T. Kosaka, Improvement of Reliability in
confident. Banknote Classification Using Reject Option and Local PCA, Inf. Sci. 168 (1) (2004)
An interesting, complementary approach is to resort to generative models. Such models can be flexibly used to perform data distribution assessment: by capturing the statistical properties of the training data, they can be used either to compute goodness-of-fit measures or to generate synthetic data [91] that can substitute for real data during reliability assessment.
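As a rough sketch of the goodness-of-fit idea, assuming a numeric feature matrix X_train: a Gaussian mixture stands in here for more expressive generative models, and the 5th-percentile cutoff is an arbitrary illustrative choice rather than a recommended setting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training matrix; in practice, use the real training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))

# Fit a simple generative model of the training distribution.
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_train)

# Calibrate a goodness-of-fit cutoff on the training data itself,
# e.g. the 5th percentile of the per-instance log-likelihood.
cutoff = np.percentile(gmm.score_samples(X_train), 5)

def fits_training_distribution(x_new):
    # Instances scoring below the cutoff are flagged as poorly
    # supported by the training distribution.
    return gmm.score_samples(x_new.reshape(1, -1))[0] >= cutoff
```

In the synthetic-data variant, the fitted model would instead be sampled to produce surrogate instances for the reliability assessment itself.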
6. Conclusion

We formally defined different metrics for reliability estimation and comprehensively reported the different methodologies that have been used to evaluate reliability, not only in healthcare applications.

We then showed a simple approach to add a reliability estimation to any type of classifier. Such reliability is computed following the local fit and density principles defined by Saria et al. [76]. We compared this method to reliability based on the level of uncertainty, and we investigated which of the two methodologies is better suited to identify incorrectly classified examples, both on a simulated dataset and on a clinical dataset. In these two experiments, reliability based on the local fit and density principles appears to perform better at identifying samples for which the classifier is correct at a higher rate. We also compared the local fit and density principles with reliability based on the predicted posterior probability. Our results confirm that the predicted posterior probability, as a measure of reliability of the classification, can be less useful under dataset shift (see Supplementary Material). We conclude that the use of “classifier-related” metrics (such as the predicted posterior probability, uncertainty or conformal prediction) to assess pointwise reliability may be biased and misleading, given the intrinsic data-driven nature of the classifier. Reliability estimation should evaluate the classifier’s output independently of the classifier itself. However, additional experiments should be carried out on different types of both simulated and medical datasets for further evaluation.
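To make the two principles concrete, the following is a minimal sketch of one possible implementation, assuming a fitted scikit-learn-style classifier clf and a labeled training set; the parameters k, density_q and min_local_acc are illustrative choices, not the settings used in our experiments. The density principle is approximated by the mean distance to the k nearest training neighbors, and the local fit principle by the trained model’s accuracy on those same neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reliability_flags(clf, X_train, y_train, X_new, k=20,
                      density_q=95, min_local_acc=0.8):
    # Index the training set once for neighbor queries.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)

    # Density principle: calibrate a distance cutoff on the training
    # data itself (each training point counts itself among its own
    # neighbors, so this baseline is slightly optimistic).
    train_dist, _ = nn.kneighbors(X_train)
    density_cut = np.percentile(train_dist.mean(axis=1), density_q)

    dist, idx = nn.kneighbors(X_new)
    flags = []
    for mean_d, neighbors in zip(dist.mean(axis=1), idx):
        in_density = mean_d <= density_cut
        # Local fit principle: the trained model should perform well on
        # the training instances most similar to the new one.
        local_acc = (clf.predict(X_train[neighbors])
                     == y_train[neighbors]).mean()
        flags.append(bool(in_density and local_acc >= min_local_acc))
    return np.array(flags)
```

A prediction is flagged as reliable only when both checks pass; an instance failing either check would trigger the “alarm” discussed above.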
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. This work was partially supported by the Department of Electrical, Computer and Biomedical Engineering of the University of Pavia and by the European Commission as part of the PERISCOPE project (Grant Agreement 101016233), coordinated by the University of Pavia.

References
[1] M.R. Abbas, M.S.A. Nadeem, A. Shaheen, A.A. Alshdadi, R. Alharbey, S.-O. Shim, W. Aziz, Accuracy Rejection Normalized-Cost Curves (ARNCCs): A Novel 3-Dimensional Framework for Robust Classification, IEEE Access 7 (2019) 160125–160143, https://doi.org/10.1109/ACCESS.2019.2950244.
[2] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, et al., A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges, ArXiv:2011.06225 [Cs], 2021 January, http://arxiv.org/abs/2011.06225.
[3] A. Ahmadi, S. Omatu, T. Fujinaka, T. Kosaka, Improvement of Reliability in Banknote Classification Using Reject Option and Local PCA, Inf. Sci. 168 (1) (2004) 277–293, https://doi.org/10.1016/j.ins.2004.02.018.
[4] A. Alimadadi, S. Aryal, I. Manandhar, P.B. Munroe, B. Joe, X. Cheng, Artificial Intelligence and Machine Learning to Fight COVID-19, Physiol. Genomics 52 (4) (2020) 200–202, https://doi.org/10.1152/physiolgenomics.00029.2020.
[5] N. Alirezaie, K.D. Kernohan, T. Hartley, J. Majewski, T.D. Hocking, ClinPred: Prediction Tool to Identify Disease-Relevant Nonsynonymous Single-Nucleotide Variants, Am. J. Human Genet. 103 (4) (2018) 474–483, https://doi.org/10.1016/j.ajhg.2018.08.005.
[6] P.L. Bartlett, M.H. Wegkamp, Classification with a Reject Option Using a Hinge Loss, J. Machine Learn. Res. 9 (59) (2008) 1823–1840.
[7] S. Benjamens, P. Dhunnoo, B. Meskó, The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database, NPJ Digit. Med. 3 (2020) 118, https://doi.org/10.1038/s41746-020-00324-0.
[8] A. Benso, S. Di Carlo, G. Politano, A. Savino, H. Hafeezurrehman, Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R, BMC Bioinf. 12 (Suppl 13) (2011) S3, https://doi.org/10.1186/1471-2105-12-S13-S3.
[9] Z. Bosnić, I. Kononenko, Estimation of Individual Prediction Reliability Using the Local Sensitivity Analysis, Appl. Intell. 29 (3) (2008) 187–203, https://doi.org/10.1007/s10489-007-0084-9.
[10] Z. Bosnić, I. Kononenko, An Overview of Advances in Reliability Estimation of Individual Predictions in Machine Learning, Intell. Data Anal. 13 (2) (2009) 385–401.
[11] J. Brinkrolf, B. Hammer, Interpretable machine learning with reject option, Automatisierungstechnik 66 (4) (2018) 283–290, https://doi.org/10.1515/auto-2017-0123.
[12] I. Buzhinsky, A. Nerinovsky, S. Tripakis, Metrics and Methods for Robustness Evaluation of Neural Networks with Generative Models, ArXiv:2003.01993 [Cs, Stat], 2020 March, http://arxiv.org/abs/2003.01993.
[13] H. Choi, D. Yeo, S. Kwon, Y. Kim, Gene Selection and Prediction for Cancer Classification Using Support Vector Machines with a Reject Option, Comput. Stat. Data Anal. 55 (5) (2011) 1897–1908, https://doi.org/10.1016/j.csda.2010.12.001.
[14] C. Chow, On Optimum Recognition Error and Reject Tradeoff, IEEE Trans. Inf. Theory 16 (1) (1970) 41–46, https://doi.org/10.1109/TIT.1970.1054406.
[15] F. Condessa, J. Bioucas-Dias, C.A. Castro, J.A. Ozolek, J. Kovačević, Classification with Reject Option Using Contextual Information, in: 2013 IEEE 10th International Symposium on Biomedical Imaging, 2013, pp. 1340–1343, https://doi.org/10.1109/ISBI.2013.6556780.
[16] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, P. Pérez, Addressing Failure Prediction by Learning Model Confidence, ArXiv:1910.04851 [Cs, Stat], 2019 October, http://arxiv.org/abs/1910.04851.
[17] L.P. Cordella, C. De Stefano, C. Sansone, M. Vento, An Adaptive Reject Option for LVQ Classifiers, in: C. Braccini, L. DeFloriani, G. Vernazza (Eds.), Image Analysis and Processing. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1995, pp. 68–73, https://doi.org/10.1007/3-540-60298-4_238.
[18] L.P. Cordella, C. De Stefano, F. Tortorella, M. Vento, A Method for Improving Classification Reliability of Multilayer Perceptrons, IEEE Trans. Neural Networks 6 (5) (1995) 1140–1147, https://doi.org/10.1109/72.410358.
[19] I. Cortés-Ciriano, A. Bender, Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks, J. Chem. Informat. Model. 59 (3) (2019) 1269–1281, https://doi.org/10.1021/acs.jcim.8b00542.
[20] I. Cortés-Ciriano, A. Bender, Concepts and Applications of Conformal Prediction in Computational Drug Discovery, ArXiv:1908.03569 [Cs, q-Bio], 2019 August, http://arxiv.org/abs/1908.03569.
[21] C.M. Cutillo, K.R. Sharma, L. Foschini, S. Kundu, M. Mackintosh, K.D. Mandl, Machine Intelligence in Healthcare—Perspectives on Trustworthiness, Explainability, Usability, and Transparency, Npj Digital Medicine 3 (1) (2020) 1–5, https://doi.org/10.1038/s41746-020-0254-2.
[22] S.E. Davis, Stabilizing Calibration of Clinical Prediction Models in Non-Stationary Environments: Methods Supporting Data-Driven Model Updating, 2019 October, https://ir.vanderbilt.edu/handle/1803/14327.
[23] Z. Dlamini, F.Z. Francies, R. Hull, R. Marima, Artificial Intelligence (AI) and Big Data in Cancer and Precision Oncology, Comput. Struct. Biotechnol. J. (2020), https://doi.org/10.1016/j.csbj.2020.08.019.
[24] G.F. Elsayed, I. Goodfellow, J. Sohl-Dickstein, Adversarial Reprogramming of Neural Networks, ArXiv:1806.11146 [Cs, Stat], 2018 November, http://arxiv.org/abs/1806.11146.
[25] S.G. Finlayson, J.D. Bowers, J. Ito, J.L. Zittrain, A.L. Beam, I.S. Kohane, Adversarial Attacks on Medical Machine Learning, Science 363 (6433) (2019) 1287–1289, https://doi.org/10.1126/science.aaw4399.
[26] L. Fischer, L. Ehrlinger, V. Geist, R. Ramler, F. Sobieczky, W. Zellinger, B. Moser, Applying AI in Practice: Key Challenges and Lessons Learned, in: A. Holzinger, P. Kieseberg, A. Min Tjoa, E. Weippl (Eds.), Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 451–471, https://doi.org/10.1007/978-3-030-57321-8_25.
[27] G. Fumera, I. Pillai, F. Roli, Classification with Reject Option in Text Categorisation Systems, in: 12th International Conference on Image Analysis and Processing, 2003 Proceedings, 2003, pp. 582–587, https://doi.org/10.1109/ICIAP.2003.1234113.
[28] G. Fumera, F. Roli, Support Vector Machines with Embedded Reject Option, in: S.-W. Lee, A. Verri (Eds.), Pattern Recognition with Support Vector Machines. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2002, pp. 68–82, https://doi.org/10.1007/3-540-45665-1_6.
[29] R.G. Sousa, A.R. Rocha Neto, J.S. Cardoso, G.A. Barreto, Robust Classification with Reject Option Using the Self-Organizing Map, Neural Comput. Appl. 26 (7) (2015) 1603–1619, https://doi.org/10.1007/s00521-015-1822-2.
[30] J. Gao, J. Yao, Y. Shao, Towards Reliable Learning for High Stakes Applications, Proc. AAAI Conf. Artif. Intell. 33 (01) (2019) 3614–3621, https://doi.org/10.1609/aaai.v33i01.33013614.
[31] Y. Geifman, R. El-Yaniv, SelectiveNet: A Deep Neural Network with an Integrated Reject Option, ArXiv:1901.09192 [Cs, Stat], 2019 June, http://arxiv.org/abs/1901.09192.
[32] H. Ghoddusi, G.G. Creamer, N. Rafizadeh, Machine Learning in Energy Economics and Finance: A Review, Energy Econ. 81 (June) (2019) 709–727, https://doi.org/10.1016/j.eneco.2019.05.006.
[33] F.K. Hamey, B. Göttgens, Machine Learning Predicts Putative Hematopoietic Stem Cells within Large Single-Cell Transcriptomics Data Sets, Exp. Hematol. 78 (2019) 11–20, https://doi.org/10.1016/j.exphem.2019.08.009.
[34] B. Hanczar, E.R. Dougherty, Classification with Reject Option in Gene Expression Data, Bioinformatics 24 (17) (2008) 1889–1895, https://doi.org/10.1093/bioinformatics/btn349.
[35] B. Hanczar, M. Sebag, Combination of One-Class Support Vector Machines for Classification with Reject Option, in: T. Calders, F. Esposito, E. Hüllermeier, R. Meo (Eds.), Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2014, pp. 547–562, https://doi.org/10.1007/978-3-662-44848-9_35.
[36] Y. Hechtlinger, B. Póczos, L. Wasserman, Cautious Deep Learning, ArXiv:1805.09460 [Cs, Stat], 2019 February, http://arxiv.org/abs/1805.09460.
[37] M.E. Hellman, The Nearest Neighbor Classification Rule with a Reject Option, IEEE Trans. Syst. Sci. Cybernet. 6 (3) (1970) 179–185, https://doi.org/10.1109/TSSC.1970.300339.
[38] D. Hendrycks, K. Gimpel, A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, ArXiv:1610.02136 [Cs], 2018 October, http://arxiv.org/abs/1610.02136.
[39] B. Hie, B.D. Bryson, B. Berger, Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design, Cell Syst. 11 (5) (2020) 461–477.e9, https://doi.org/10.1016/j.cels.2020.09.007.
[40] E. Hüllermeier, W. Waegeman, Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods, Machine Learn. 110 (3) (2021) 457–506, https://doi.org/10.1007/s10994-021-05946-3.
[41] E.J. Hwang, S. Park, K.-N. Jin, J.I. Kim, S.Y. Choi, J.H. Lee, J.M. Goo, et al., Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs, JAMA Network Open 2 (3) (2019) e191095, https://doi.org/10.1001/jamanetworkopen.2019.1095.
[42] A. Jacovi, A. Marasović, T. Miller, Y. Goldberg, Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI, ArXiv:2010.07487 [Cs], 2021 January, http://arxiv.org/abs/2010.07487.
[43] L.A. Jeni, J.F. Cohn, F. De La Torre, Facing Imbalanced Data Recommendations for the Use of Performance Metrics, in: International Conference on Affective Computing and Intelligent Interaction and Workshops: [Proceedings]. ACII (Conference) 2013, 2013, pp. 245–251, https://doi.org/10.1109/ACII.2013.47.
[44] F.C. Jiang, Study on a Confidence Machine Learning Method Based on Ensemble Learning, Cluster Comput. 20 (4) (2017) 3357–3368, https://doi.org/10.1007/s10586-017-1085-z.
[45] H. Jiang, B. Kim, M.Y. Guan, M. Gupta, To Trust Or Not To Trust A Classifier, ArXiv:1805.11783 [Cs, Stat], 2018 October, http://arxiv.org/abs/1805.11783.
[46] A.E.W. Johnson, T.J. Pollard, L. Shen, L.-W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, R.G. Mark, MIMIC-III, a Freely Accessible Critical Care Database, Sci. Data 3 (1) (2016) 160035, https://doi.org/10.1038/sdata.2016.35.
[47] J. Kang, C.D. Yoo, Learning of a Multi-Class Classifier with Rejection Option Using Sparse Representation, in: The 18th IEEE International Symposium on Consumer Electronics (ISCE 2014), 2014, pp. 1–2, https://doi.org/10.1109/ISCE.2014.6884541.
[48] S. Kang, S. Cho, S.-J. Rhee, K.-S. Yu, Reliable Prediction of Anti-Diabetic Drug Failure Using a Reject Option, Pattern Anal. Appl. 20 (3) (2017) 883–891, https://doi.org/10.1007/s10044-016-0585-4.
[49] E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, M. Craven, Learning to Predict Post-Hospitalization VTE Risk from EHR Data, AMIA Annual Symp. Proc. 2012 (November) (2012) 436–445.
[50] C.J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, D. King, Key Challenges for Delivering Clinical Impact with Artificial Intelligence, BMC Med. 17 (1) (2019) 195, https://doi.org/10.1186/s12916-019-1426-2.
[51] A. Kendall, Y. Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, ArXiv:1703.04977 [Cs], 2017 October, http://arxiv.org/abs/1703.04977.
[52] B. Kompa, J. Snoek, A.L. Beam, Second Opinion Needed: Communicating Uncertainty in Medical Machine Learning, Npj Digital Med. 4 (1) (2021) 1–6, https://doi.org/10.1038/s41746-020-00367-3.
[53] I. Kononenko, Machine Learning for Medical Diagnosis: History, State of the Art and Perspective, Artif. Intell. Med. 23 (1) (2001) 89–109, https://doi.org/10.1016/S0933-3657(01)00077-X.
[54] M. Kukar, I. Kononenko, Reliable Classifications with Machine Learning, in: Proceedings of the 13th European Conference on Machine Learning, ECML ’02, Springer-Verlag, Berlin, Heidelberg, 2002, pp. 219–231.
[55] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles, ArXiv:1612.01474 [Cs, Stat], 2017 November, http://arxiv.org/abs/1612.01474.
[56] C. Leibig, V. Allken, M.S. Ayhan, P. Berens, S. Wahl, Leveraging Uncertainty Information from Deep Neural Networks for Disease Detection, Sci. Rep. 7 (1) (2017) 17816, https://doi.org/10.1038/s41598-017-17876-z.
[57] J.A. Leonard, M.A. Kramer, L.H. Ungar, A Neural Network Architecture That Computes Its Own Reliability, Comput. Chem. Eng. 16 (9) (1992) 819–835, https://doi.org/10.1016/0098-1354(92)80035-8.
[58] C.X. Ling, V.S. Sheng, Cost-Sensitive Learning, in: C. Sammut, G.I. Webb (Eds.), Encyclopedia of Machine Learning, Springer US, Boston, MA, 2010, pp. 231–235, https://doi.org/10.1007/978-0-387-30164-8_181.
[59] S. Malakouti, M. Hauskrecht, Predicting Patient’s Diagnoses and Diagnostic Categories from Clinical-Events in EHR Data, in: D. Riaño, S. Wilk, A. ten Teije (Eds.), Artificial Intelligence in Medicine. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 125–130, https://doi.org/10.1007/978-3-030-21642-9_17.
[60] L. Meijerink, G. Cinà, M. Tonutti, Uncertainty Estimation for Classification and Risk Prediction on Medical Tabular Data, ArXiv:2004.05824 [Cs, Stat], 2020 May, http://arxiv.org/abs/2004.05824.
[61] D.P.P. Mesquita, L.S. Rocha, J.P.P. Gomes, A.R. Rocha Neto, Classification with Reject Option for Software Defect Prediction, Appl. Soft Comput. 49 (December) (2016) 1085–1093, https://doi.org/10.1016/j.asoc.2016.06.023.
[62] S. Messoudi, S. Rousseau, S. Destercke, Deep Conformal Prediction for Robust Models, Informat. Process. Manage. Uncertainty Knowledge-Based Syst. 1237 (May) (2020) 528–540, https://doi.org/10.1007/978-3-030-50146-4_39.
[63] S.J. Mooney, V. Pejaver, Big Data in Public Health: Terminology, Machine Learning, and Privacy, Annu. Rev. Public Health 39 (1) (2018) 95–112, https://doi.org/10.1146/annurev-publhealth-040617-014208.
[64] A.H. Murphy, What Is a Good Forecast? An Essay on the Nature of Goodness in Weather Forecasting, Weather Forecasting 8 (2) (1993) 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
[65] K. Murphy, Probabilistic Machine Learning: An Introduction, n.d. (accessed 8 April 2021), https://probml.github.io/pml-book/book1.html.
[66] M.S.A. Nadeem, J.-D. Zucker, B. Hanczar, Accuracy-Rejection Curves (ARCs) for Comparing Classification Methods with a Reject Option, in: Machine Learning in Systems Biology, PMLR, 2009, pp. 65–81, http://proceedings.mlr.press/v8/nadeem10a.html.
[67] P.M. do Nascimento, I.G. Medeiros, R.M. Falcão, B. Stransky, J.E.S. de Souza, A Decision Tree to Improve Identification of Pathogenic Mutations in Clinical Practice, BMC Medical Informat. Decision Making 20 (1) (2020) 52, https://doi.org/10.1186/s12911-020-1060-0.
[68] G. Nicora, R. Bellazzi, A Reliable Machine Learning Approach Applied to Single-Cell Classification in Acute Myeloid Leukemia, AMIA Annual Symp. Proc. 2020 (January) (2021) 925–932.
[69] G. Nicora, S. Marini, I. Limongelli, E. Rizzo, S. Montoli, F.F. Tricomi, R. Bellazzi, A Semi-Supervised Learning Approach for Pan-Cancer Somatic Genomic Variant Classification, in: D. Riaño, S. Wilk, A. ten Teije (Eds.), Artificial Intelligence in Medicine. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 42–46, https://doi.org/10.1007/978-3-030-21642-9_7.
[70] J.A. Olvera-López, J.A. Carrasco-Ochoa, J.F. Martínez-Trinidad, J. Kittler, A Review of Instance Selection Methods, Artif. Intell. Rev. 34 (2) (2010) 133–143, https://doi.org/10.1007/s10462-010-9165-y.
[71] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J.V. Dillon, B. Lakshminarayanan, J. Snoek, Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift, ArXiv:1906.02530, 2019, http://arxiv.org/abs/1906.02530.
[72] A. Ozen, M. Gönen, E. Alpaydan, T. Haliloğlu, Machine Learning Integration for Predicting the Effect of Single Amino Acid Substitutions on Protein Stability, BMC Struct. Biol. 9 (October) (2009) 66, https://doi.org/10.1186/1472-6807-9-66.
[73] M. Panahiazar, V. Taslimitehrani, N. Pereira, J. Pathak, Using EHRs and Machine Learning for Heart Failure Survival Analysis, Stud. Health Technol. Informat. 216 (2015) 40–44.
[74] M.T. Ribeiro, S. Singh, C. Guestrin, “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, ArXiv:1602.04938 [Cs, Stat], 2016 August, http://arxiv.org/abs/1602.04938.
[75] C.M. Santos-Pereira, A.M. Pires, On Optimal Reject Rules and ROC Curves, Pattern Recogn. Lett. 26 (7) (2005) 943–952, https://doi.org/10.1016/j.patrec.2004.09.042.
[76] S. Saria, A. Subbaswamy, Tutorial: Safe and Reliable Machine Learning, ArXiv:1904.07204 [Cs], 2019 April, http://arxiv.org/abs/1904.07204.
[77] A. Sarica, A. Cerasa, A. Quattrone, Random Forest Algorithm for the Classification of Neuroimaging Data in Alzheimer’s Disease: A Systematic Review, Front. Aging Neurosci. 9 (2017) 329, https://doi.org/10.3389/fnagi.2017.00329.
[78] C. Saunders, A. Gammerman, V. Vovk, Transduction with Confidence and Credibility, in: Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 722–726.
[79] M. Schinkel, K. Paranjape, R.S. Nannan Panday, N. Skyttberg, P.W.B. Nanayakkara, Clinical applications of artificial intelligence in sepsis: A narrative review, Comput. Biol. Med. 115 (2019) 103488, https://doi.org/10.1016/j.compbiomed.2019.103488.
[80] P. Schulam, S. Saria, Can You Trust This Prediction? Auditing Pointwise Reliability After Learning, ArXiv:1901.00403 [Cs, Stat], 2019 February, http://arxiv.org/abs/1901.00403.
[81] G. Shafer, V. Vovk, A Tutorial on Conformal Prediction, J. Machine Learn. Res. 9 (12) (2008) 371–421.
[82] M.H. Shaker, E. Hüllermeier, Aleatoric and Epistemic Uncertainty with Random Forests, in: M.R. Berthold, A. Feelders, G. Krempl (Eds.), Advances in Intelligent Data Analysis XVIII. Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 444–456, https://doi.org/10.1007/978-3-030-44584-3_35.
[83] I. Silva, G. Moody, D.J. Scott, L.A. Celi, R.G. Mark, Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012, Comput. Cardiol. 39 (2012) 245–248.
[84] R. Sousa, B. Mora, J.S. Cardoso, An Ordinal Data Method for the Classification with Reject Option, in: 2009 International Conference on Machine Learning and Applications, 2009, pp. 746–750, https://doi.org/10.1109/ICMLA.2009.11.
[85] R. Sousa, A.R. Neto, G. Barreto, J.S. Cardoso, M. Coimbra, Reject Option Paradigm for the Reduction of Support Vectors, in: ESANN, 2014.
[86] A. Subbaswamy, S. Saria, From Development to Deployment: Dataset Shift, Causality, and Shift-Stable Models in Health AI, Biostatistics 21 (2) (2020) 345–352, https://doi.org/10.1093/biostatistics/kxz041.
[87] J. Suutala, S. Pirttikangas, J. Riekki, J. Röning, Reject-Optional LVQ-Based Two-Level Classifier to Improve Reliability in Footstep Identification, in: A. Ferscha, F. Mattern (Eds.), Pervasive Computing. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 182–187, https://doi.org/10.1007/978-3-540-24646-6_12.
[88] D.M.J. Tax, R.P.W. Duin, Growing a Multi-Class Classifier with a Reject Option, Pattern Recogn. Lett. 29 (10) (2008) 1565–1570, https://doi.org/10.1016/j.patrec.2008.03.010.
[89] F. Tortorella, An Optimal Reject Rule for Binary Classifiers, in: F.J. Ferri, J.M. Iñesta, A. Amin, P. Pudil (Eds.), Advances in Pattern Recognition. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2000, pp. 611–620, https://doi.org/10.1007/3-540-44522-6_63.
[90] K. Tran, W. Neiswanger, J. Yoon, Q. Zhang, E. Xing, Z.W. Ulissi, Methods for Comparing Uncertainty Quantifications for Material Property Predictions, ArXiv:1912.10066 [Cond-Mat, Physics:Physics], 2020 February, http://arxiv.org/abs/1912.10066.
[91] A. Tucker, Z. Wang, Y. Rotalinti, et al., Generating high-fidelity synthetic patient data for assessing machine learning healthcare software, npj Digit. Med. 3 (2020) 147, https://doi.org/10.1038/s41746-020-00353-9.
[92] D. Ulmer, L. Meijerink, G. Cinà, Trust Issues: Uncertainty Estimation Does Not Enable Reliable OOD Detection On Medical Tabular Data, ArXiv:2011.03274 [Cs, Stat], 2020 November, http://arxiv.org/abs/2011.03274.
[93] A. Uyar, F. Gurgen, Arrhythmia Classification Using Serial Fusion of Support Vector Machines and Logistic Regression, in: 2007 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2007, pp. 560–565, https://doi.org/10.1109/IDAACS.2007.4488483.
[94] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, T.B. Schön, Evaluating Model Calibration in Classification, ArXiv:1902.06977 [Cs, Stat], 2019 February, http://arxiv.org/abs/1902.06977.
[95] M.H. Waseem, M.S.A. Nadeem, A. Abbas, A. Shaheen, W. Aziz, A. Anjum, U. Manzoor, M.A. Balubaid, S.-O. Shim, On the Feature Selection Methods and Reject Option Classifiers for Robust Cancer Prediction, IEEE Access 7 (2019) 141072–141082, https://doi.org/10.1109/ACCESS.2019.2944295.
[96] T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, L. Daniel, Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach, ArXiv:1801.10578 [Cs, Stat], 2018 January, http://arxiv.org/abs/1801.10578.
[97] J. Wiens, S. Saria, M. Sendak, M. Ghassemi, V.X. Liu, F. Doshi-Velez, K. Jung, et al., Do No Harm: A Roadmap for Responsible Machine Learning for Health Care, Nat. Med. 25 (9) (2019) 1337–1340, https://doi.org/10.1038/s41591-019-0548-6.