A Survey of Uncertainty in Deep Neural Networks
A Survey of Uncertainty in Deep Neural Networks
Abstract—Over the last decade, neural networks have reached and safety-critical real world applications remains limited. The
almost every field of science and became a crucial part of main factors responsible for this limitation are
various real world applications. Due to the increasing spread,
arXiv:2107.03342v3 [cs.LG] 18 Jan 2022
confidence in neural network predictions became more and • the lack of expressiveness and transparency of a deep neu-
more important. However, basic neural networks do not deliver ral network’s inference model, which makes it difficult to
certainty estimates or suffer from over or under confidence, trust their outcomes [2],
i.e. are badly calibrated. To overcome this, many researchers
have been working on understanding and quantifying uncertainty • the inability to distinguish between in-domain and out-of-
in a neural network’s prediction. As a result, different types domain samples [11], [12] and the sensitivity to domain
and sources of uncertainty have been identified and a variety shifts [13],
of approaches to measure and quantify uncertainty in neural
networks have been proposed. This work gives a comprehensive • the inability to provide reliable uncertainty estimates for
overview of uncertainty estimation in neural networks, reviews a deep neural network’s decision [14] and frequently
recent advances in the field, highlights current challenges, and occurring overconfident predictions [15], [16],
identifies potential research opportunities. It is intended to give
anyone interested in uncertainty estimation in neural networks • the sensitivity to adversarial attacks that make deep neural
a broad overview and introduction, without presupposing prior networks vulnerable for sabotage [17], [18], [19].
knowledge in this field. For that, a comprehensive introduction
to the most crucial sources of uncertainty is given and their These factors are mainly based on an uncertainty already
separation into reducible model uncertainty and not reducible included in the data (data uncertainty) or a lack of knowledge
data uncertainty is presented. The modeling of these uncertain- of the neural network (model uncertainty). To overcome these
ties based on deterministic neural networks, Bayesian neural
networks, ensemble of neural networks, and test-time data
limitations, it is essential to provide uncertainty estimates,
augmentation approaches is introduced and different branches of such that uncertain predictions can be ignored or passed to
these fields as well as the latest developments are discussed. For human experts [20]. Providing uncertainty estimates is not
a practical application, we discuss different measures of uncer- only important for a safe decision-making in high-risks fields,
tainty, approaches for the calibration of neural networks and give but also crucial in fields where the data sources are highly
an overview of existing baselines and available implementations.
Different examples from the wide spectrum of challenges in the
inhomogeneous and labeled data is rare, such as in remote
fields of medical image analysis, robotic and earth observation sensing [21], [22]. Also for fields where uncertainties form
give an idea of the needs and challenges regarding uncertainties a crucial part of the learning techniques, such as for active
in practical applications of neural networks. Additionally, the learning [23], [24], [25], [26] or reinforcement learning [20],
practical limitations of uncertainty quantification methods in [27], [28], [29], uncertainty estimates are highly important.
neural networks for mission- and safety-critical real world
applications are discussed and an outlook on the next steps In recent years, researchers have shown an increased interest
towards a broader usage of such methods is given. in estimating uncertainty in DNNs [30], [20], [31], [32],
[33], [34], [35], [36]. The most common way to estimate the
Index Terms—Bayesian deep neural networks, Ensembles,
Test-time augmentation, Calibration, Uncertainty uncertainty on a prediction (the predictive uncertainty) is based
on separately modelling the uncertainty caused by the model
(epistemic or model uncertainty) and the uncertainty caused
I. I NTRODUCTION by the data (aleatoric or data uncertainty). While the first one
Within the last decade enormous advances on deep neural is reducible by improving the model which is learned by the
networks (DNNs) have been realized, encouraging their adap- DNN, the latter one is not reducible. The most important ap-
tation in a variety of research fields, where complex systems proaches for modeling this separation are Bayesian inference
have to be modeled or understood, as for example earth [30], [20], [37], [9], [38], ensemble approaches [31], [39],
observation, medical image analysis or robotics. Although [40], test time data augmentation approaches [41], [42], or
DNNs have become attractive in high risk fields such as single deterministic networks containing explicit components
medical image analysis [1], [2], [3], [4], [5], [6] or autonomous to represent the model and the data uncertainty [43], [44],
vehicle control [7], [8], [9], [10], their deployment in mission- [45], [32], [46]. Estimating the predictive uncertainty is not
sufficient for safe decision-making. Furthermore, it is crucial
This work is in part supported by the German Federal Ministry of to assure that the uncertainty estimates are reliable. To this end,
Education and Research (BMBF) in the framework of the international future
AI lab ”AI4EO – Artificial Intelligence for Earth Observation: Reasoning, the calibration property (the degree of reliability) of DNNs
Uncertainties, Ethics and Beyond” (Grant number: 01DD20001). has been investigated and re-calibration methods have been
2
proposed [15], [47], [48] to obtain reliable (well-calibrated) • Uncertainty measures and methods for assessing the
uncertainty estimates. quality and impact of uncertainty estimates (Section IV),
There are several works that give an introduction and • Recent studies and approaches for calibrating DNNs
overview of uncertainty in statistical modelling. Ghanem et (Section V),
al. [49] published a handbook about uncertainty quantification,
which includes a detailed and broad description of different • An overview over frequently used evaluation data sets,
concepts of uncertainty quantification, but without explicitly available benchmarks and implementations1 (Section VI),
focusing on the application of neural networks. The theses • An overview over real-world applications using uncer-
of Gal [50] and Kendall [51] contain a good overview about tainty estimates (Section VII),
Bayesian neural networks, especially with focus on the
• A discussion on current challenges and further directions
Monte Carlo (MC) Dropout approach and its application
of research in the future (Section VIII).
in computer vision tasks. The thesis of Malinin [52] also
contains a very good introduction and additional insights In general, the principles and methods for estimating uncer-
into Prior networks. Wang et al. contributed two surveys on tainty and calibrating DNNs can be applied to all regression,
Bayesian deep learning [53], [54]. They introduced a general classification, and segmentation problems, if not stated differ-
framework and the conceptual description of the Bayesian ently. In order to get a deeper dive into explicit applications
neural networks (BNNs), followed by an updated presentation of the methods, we refer to the section on applications and to
of Bayesian approaches for uncertainty quantification in further readings in the referenced literature.
neural networks with special focus on recommender systems,
topic models, and control. In [55] an evaluation of uncertainty II. U NCERTAINTY IN D EEP N EURAL N ETWORKS
quantification in deep learning is given by presenting and A neural network is a non-linear function fθ parameterized
comparing the uncertainty quantification based on the softmax by model parameters θ (i.e. the network weights) that maps
output, the ensemble of networks, Bayesian neural networks, from a measurable input set X to a measurable output set Y,
and autoencoders on the MNIST data set. Regarding the i.e.
practicability of uncertainty quantification approaches for real fθ : X → Y fθ (x) = y . (1)
life mission- and safety-critical applications, Gustafsson et al.
[56] introduced a framework to test the robustness required For a supervised setting, we further have a finite set of training
in real-world computer vision applications and delivered a data D ⊆ D = X × Y containing N data samples and
comparison of two popular approaches, namely MC Dropout corresponding targets, i.e.
and Ensemble methods. Hüllermeier et al. [57] presented
D = (X , Y) = {xn , yn }N
n=1 ⊆ D . (2)
the concepts of aleatoric and epistemic uncertainty in neural
networks and discussed different concepts to model and For a new data sample x∗ ∈ X, a neural network trained on
quantify them. In contrast to this, Abdar et al. [58] presented D can be used to predict a corresponding target fθ (x∗ ) =
an overview on uncertainty quantification methodologies in y ∗ . We consider four different steps from the raw information
neural networks and provide an extensive list of references for in the environment to a prediction by a neural network with
different application fields and a discussion of open challenges. quantified uncertainties, namely
1) the data acquisition process:
In this work, we present an extensive overview over all The occurrence of some information in the environment
concepts that have to be taken into account when working with (e.g. a bird’s singing) and a measured observation of this
uncertainty in neural networks while keeping the applicability information (e.g. an audio record).
on real world applications in mind. Our goal is to provide the
reader with a clear thread from the sources of uncertainty to 2) the DNN building process:
applications, where uncertainty estimations are needed. Fur- The design and training of a neural network.
thermore, we point out the limitations of current approaches 3) the applied inference model: The model which is applied
and discuss further challenges to be tackled in the future. for inference (e.g. a Bayesian neural network or an
For that, we provide a broad introduction and comparison of ensemble of neural networks).
different approaches and fundamental concepts. The survey is
4) the prediction’s uncertainty model: The modelling of the
mainly designed for people already familiar with deep learning
uncertainties caused by the neural network and by the
concepts and who are planning to incorporate uncertainty
data.
estimation into their predictions. But also for people already
familiar with the topic, this review provides a useful overview In practice, these four steps contain several potential sources of
of the whole concept of uncertainty in neural networks and uncertainty and errors, which again affect the final prediction
their applications in different fields. of a neural network. The five factors that we think are the
In summary, we comprehensively discuss most vital for the cause of uncertainty in a DNN’s predictions
are
• Sources and types of uncertainty (Section II),
1 The list of available implementations can be found in Sec-
• Recent studies and approaches for estimating uncertainty tion VI as well as within an additional GitHub repository under
in DNNs (Section III), github.com/JakobCode/UncertaintyInNeuralNetworks Resources.
3
• the variability in real world situations, Factor I: Variability in Real World Situations
• the errors inherent to the measurement systems, Most real world environments are highly variable and
almost constantly affected by changes. These changes af-
• the errors in the architecture specification of the DNN, fect parameters as for example temperature, illumination,
• the errors in the training procedure of the DNN, clutter, and physical objects’ size and shape. Changes in
the environment can also affect the expression of objects,
• the errors caused by unknown data. as for example plants after rain look very different to
plants after a drought. When real world situations change
In the following, the four steps leading from raw infor- compared to the training set, this is called a distribution
mation to uncertainty quantification on a DNN’s prediction shift. Neural networks are sensitive to distribution shifts,
are described in more detail. Within this, we highlight the which can lead to significant changes in the performance
sources of uncertainty that are related to the single steps of a neural network.
and explain how the uncertainties are propagated through the
process. Finally, we introduce a model for the uncertainty on The second case is based on the measurement system, which
a neural network’s prediction and introduce the main types of has a direct effect on the correlation between the samples and
uncertainty considered in neural networks. the corresponding targets. The measurement system generates
The goal of this section is to give an accountable idea of the information xi and yi that describe ωi but might not contain
uncertainties in neural networks. Hence, for the sake of sim- enough information to learn a direct mapping from xi to
plicity we only describe and discuss the mathematical proper- yi . This means that there might be highly different real
ties, which are relevant for understanding the approaches and world information ωi and ωj (e.g. city and forest) resulting
applying the methodology in different fields. in very similar corresponding measurements xi and xj (e.g.
temperature) or similar corresponding targets yi and yj (e.g.
A. Data Acquisition label noise that labels both samples as forest). This directly
In the context of supervised learning, the data acquisition leads to our second factor of uncertainty:
describes the process where measurements x and target vari-
ables y are generated in order to represent a (real world) Factor II: Error and Noise in Measurement Systems
situation ω from some space Ω. In the real world, a realization The measurements themselves can be a source of uncer-
of ω could for example be a bird, x a picture of this bird, tainty on the neural network’s prediction. This can be
and y a label stating ’bird’. During the measurement, random caused by limited information in the measurements, as
noise can occur and information may get lost. We model this for example the image resolution, or by not measuring
randomness in x by false or insufficiently available information modalities.
Moreover, it can be caused by noise, for example sensor
x|ω ∼ px|ω . (3)
noise, by motion, or mechanical stress leading to impre-
Equivalently, the corresponding target variable y is derived, cise measures. Furthermore, false labeling is also a source
where the description is either based on another measurement of uncertainty that can be seen as error and noise in the
or is the result of a labeling process2 . For both cases the measurement system. It is referenced as label noise and
description can be affected by noise and errors and we state affects the model by reducing the confidence on the true
it as class prediction during training.
y|ω ∼ py|ω . (4)
A neural network is trained on a finite data set of realizations B. Deep Neural Network Design and Training
of x|ωi and y|ωi based on N real world situations ω1 , ..., ωN , The design of a DNN covers the explicit modeling of
D= {xi , yi }N . (5) the neural network and its stochastic training process. The
i=1
assumptions on the problem structure induced by the de-
When collecting the training data, two factors can cause sign and training of the neural network are called inductive
uncertainty in a neural network trained on this data. First, the bias [59]. We summarize all decisions of the modeler on
sample space Ω should be sufficiently covered by the training the network’s structure (e.g. the number of parameters, the
data x1 , ..., xN for ω1 , ..., ωN . For that, one has to take into layers, the activation functions, etc.) and training process (e.g.
account that for a new sample x∗ it in general holds that optimization algorithm, regularization, augmentation, etc.) in
x∗ 6= xi for all training situations xi . Following, the target a structure configuration s. The defined network structure
has to be estimated based on the trained neural network model, gives the third factor of uncertainty in a neural network’s
which directly leads to the first factor of uncertainty: predictions:
2 In many cases one can model the labeling process as a mapping from X Factor III: Errors in the Model Structure
to Y, e.g. for speech recognition or various computer vision tasks. For other The structure of a neural network has a direct effect on its
tasks, as for example earth observation, this is not always the case. Data is performance and therefore also on the uncertainty of its
often labeled based on high resolutional data while low resolutional data is
utilized for the prediction task.
4
prediction. For instance, the number of parameters affects Factor V: Errors Caused by Unknown Data
the memorization capacity, which can lead to under- or Especially in classification tasks, a neural network that is
over-fitting on the training data. Regarding uncertainty trained on samples derived from a world W1 can also be
in neural networks, it is known that deeper networks capable of processing samples derived from a completely
tend to be overconfident in their softmax output, meaning different world W2 . This is for example the case, when
that they predict too much probability on the class with a network trained on images of cats and dogs receives a
highest probability score [15]. sample showing a bird. Here, the source of uncertainty
does not lie in the data acquisition process, since we
For a given network structure s and a training data set D, assume a world to contain only feasible inputs for a
the training of a neural network is a stochastic process and prediction task. Even though the practical result might be
therefore the resulting neural network fθ is based on a random equal to too much noise on a sensor or complete failure
variable, of a sensor, the data considered here represents a valid
sample, but for a different task or domain.
θ|D, s ∼ pθ|D,s . (6)
The process is stochastic due to random decisions as the order D. Predictive Uncertainty Model
of the data, random initialization or random regularization
as augmentation or dropout. The loss landscape of a neu- As a modeller, one is mainly interested in the uncertainty
ral network is highly non-linear and the randomness in the that is propagated onto a prediction y ∗ , the so-called predictive
training process in general leads to different local optima θ∗ uncertainty. Within the data acquisition model, the probability
resulting in different models [31]. Also, parameters as batch distribution for a prediction y ∗ based on some sample x∗ is
size, learning rate, and the number of training epochs affect given by Z
the training and result in different models. Depending on the p(y ∗ |x∗ ) = p(y ∗ |ω)p(ω|x∗ )dω (7)
underlying task these models can significantly differ in their Ω
predictions for single samples, even leading to a difference in and a maximum a posteriori (MAP) estimation is given by
the overall model performance. This sensitivity to the training
process directly leads to the fourth factor for uncertainties in y ∗ = arg max p(y|x∗ ) . (8)
y
neural network predictions:
Since the modeling is based on the unavailable latent variable
ω, one takes an approximative representation based on a
Factor IV: Errors in the Training Procedure sampled training data set D = {xi , yi }N i=1 containing N
The training process of a neural network includes many samples and corresponding targets. The distribution and MAP
parameters that have to be defined (batch size, opti- estimator in (7) and (8) for a new sample x∗ are then predicted
mizer, learning rate, stopping criteria, regularization, etc.) based on the known examples by
and also stochastic decisions within the training process Z
(batch generation and weight initialization) take place. p(y ∗ |x∗ ) = p(y ∗ |D, x∗ ) (9)
All these decisions affect the local optima and it is D
therefore very unlikely that two training processes deliver and
the same model parameterization. A training data set that y ∗ = arg max p(y|D, x∗ ) . (10)
suffers from imbalance or low coverage of single regions y
in the data distribution also introduces uncertainties on In general, the distribution given in (9) is unknown and can
the network’s learned parameters, as already described in only be estimated based on the given data in D. For this
the data acquisition. This might be softened by applying estimation, neural networks form a very powerful tool for
augmentation to increase the variety or by balancing the many tasks and applications.
impact of single classes or regions on the loss function. The prediction of a neural network is subject to both model-
Since the training process is based on the given training data dependent and input data-dependent errors, and therefore
set D, errors in the data acquisition process (e.g. label noise) the predictive uncertainty associated with y ∗ is in general
can result in errors in the training process. separated into data uncertainty (also statistical or aleatoric
uncertainty [57]) and model uncertainty (also systemic or
epistemic uncertainty [57]). Depending on the underlying
C. Inference approach, an additional explicit modeling of distributional
The inference describes the prediction of an output y ∗ for uncertainty [32] is used to model the uncertainty, which is
a new data sample x∗ by the neural network. At this time, the caused by examples from a region not covered by the training
network is trained for a specific task. Thus, samples which are data.
not inputs for this task cause errors and are therefore also a 1) Model- and Data Uncertainty: The model uncertainty
source of uncertainty: covers the uncertainty that is caused by shortcomings in
the model, either by errors in the training procedure, an
insufficient model structure, or lack of knowledge due to
unknown samples or a bad coverage of the training data set.
5
In contrast to this, data uncertainty is related to uncertainty 2) Distributional Uncertainty: Depending on the ap-
that directly stems from the data. Data uncertainty is caused proaches that are used to quantify the uncertainty in y ∗ , the
by information loss when representing the real world within formulation of the predictive distribution might be further
a data sample and represents the distribution stated in (7). separated into data, distributional, and model parts [32]:
For example, in regression tasks noise in the input and target Z Z
measurements causes data uncertainty that the network cannot p(y ∗ |x∗ , D) = p(y|µ) p(µ|x∗ , θ) p(θ|D) dµdθ . (13)
learn to correct. In classification tasks, samples which do not
| {z } | {z } | {z }
Data Distributional Model
contain enough information in order to identify one class with
100% certainty cause data uncertainty on the prediction. The The distributional part in (13) represents the uncertainty on the
information loss is a result of the measurement system, e.g. actual network output, e.g. for classification tasks this might
by representing real world information by image pixels with be a Dirichlet distribution, which is a distribution over the
a specific resolution, or by errors in the labelling process. categorical distribution given by the softmax output. Modeled
Considering the five presented factors for uncertainties on a this way, distributional uncertainty refers to uncertainty that
neural network’s prediction, model uncertainty covers Factors is caused by a change in the input-data distribution, while
I, III, IV, and V and data uncertainty is related to Factor II. model uncertainty refers to uncertainty that is caused by
While model uncertainty can be (theoretically) reduced by the process of building and training the DNN. As modeled
improving the architecture, the learning process, or the training in (13), the model uncertainty affects the estimation of the
data set, the data uncertainties cannot be explained away [60]. distributional uncertainty, which affects the estimation of the
Therefore, DNNs that are capable of handling uncertain inputs data uncertainty.
and that are able to remove or quantify the model uncertainty While most methods presented in this paper only distinguish
and give a correct prediction of the data uncertainty are of between model and data uncertainty, approaches specialized
paramount importance for a variety of real world mission- on out-of-distribution detection often explicitly aim at rep-
and safety-critical applications. resenting the distributional uncertainty [32], [64]. A more
The Bayesian framework offers a practical tool to reason detailed presentation of different approaches for quantifying
about uncertainty in deep learning [61]. In Bayesian modeling, uncertainties in neural networks is given in Section III. In
the model uncertainty is formalized as a probability distribu- Section IV, different measures for measuring the different
tion over the model parameters θ, while the data uncertainty is types of uncertainty are presented.
formalized as a probability distribution over the model outputs
y ∗ , given a parameterized model fθ . The distribution over a E. Uncertainty Classification
prediction y ∗ , the predictive distribution, is then given by On the basis of the input data domain, the predictive
Z
p(y |x , D) = p(y ∗ |x∗ , θ) p(θ|D) dθ .
∗ ∗
(11) uncertainty can also be classified into three main classes:
| {z } | {z } • In-domain uncertainty [65]
Data Model
In-domain uncertainty represents the uncertainty related
The term p(θ|D) is referenced as posterior distribution on the to an input drawn from a data distribution assumed to
model parameters and describes the uncertainty on the model be equal to the training data distribution. The in-domain
parameters given a training data set D. The posterior distri- uncertainty stems from the inability of the deep neural
bution is in general not tractable. While ensemble approaches network to explain an in-domain sample due to lack of
seek to approximate it by learning several different parameter in-domain knowledge. From a modeler point of view, in-
settings and averaging over the resulting models [31], Bayesian domain uncertainty is caused by design errors (model
inference reformulates it using Bayes Theorem [62] uncertainty) and the complexity of the problem at hand
p(D|θ)p(θ) (data uncertainty). Depending on the source of the in-
p(θ|D) = . (12)
p(D) domain uncertainty, it might be reduced by increasing the
quality of the training data (set) or the training process
The term p(θ) is called the prior distribution on the model
[57].
parameters, since it does not take any information but the
general knowledge on θ into account. The term p(D|θ) rep- • Domain-shift uncertainty [13]
resents the likelihood that the data in D is a realization of Domain-shift uncertainty denotes the uncertainty related
the distribution predicted by a model parameterized with θ. to an input drawn from a shifted version of the training
Many loss functions are motivated by or can be related to the distribution. The distribution shift results from insufficient
likelihood function. Loss functions that seek to maximize the coverage by the training data and the variability inherent
log-likelihood (for an assumed distribution) are for example to real world situations. A domain-shift might increase
the cross-entropy or the mean squared error [63]. the uncertainty due to the inability of the DNN to explain
Even with the reformulation given in (12), the predictive the domain shift sample on the basis of the seen samples
distribution given in (11) is still intractable. To overcome at training time. Some errors causing domain shift un-
this, several different ways to approximate the predictive certainty can be modeled and can therefore be reduced.
distribution were proposed. A broad overview on the different For example, occluded samples can be learned by the
concepts and some specific approaches is presented in Section deep neural network to reduce domain shift uncertainty
III. caused by occlusions [66]. However, it is difficult if
6
Out of Out of
distribution distribution
Model 1
Low model
uncertainty
Fig. 1: Visualization of the data, the model, and the distributional uncertainty for classification and regression models.
not impossible to model all errors causing domain shift domain uncertainty that is caused by the nature of the data
uncertainty, e.g., motion noise [60]. From a modeler point the network is trained on, as for example overlapping samples
of view, domain-shift uncertainty is caused by external or and systematic label noise.
environmental factors but can be reduced by covering the
shifted domain in the training data set. III. U NCERTAINTY E STIMATION
• Out-of-domain uncertainty [67], [68], [69], [70] As described in Section II, several factors may cause model
Out-of-domain uncertainty represents the uncertainty re- and data uncertainty and affect a DNN’s prediction. This
lated to an input drawn from the subspace of unknown variety of sources of uncertainty makes the complete exclusion
data. The distribution of unknown data is different and of uncertainties in a neural network impossible for almost all
far from the training distribution. While a DNN can ex- applications. Especially in practical applications employing
tract in-domain knowledge from domain-shift samples, it real world data, the training data is only a subset of all
cannot extract in-domain knowledge from out-of-domain possible input data, which means that a miss-match between
samples. For example, when domain-shift uncertainty the DNN domain and the unknown actual data domain is
describes phenomena like a blurred picture of a dog, often unavoidable. However, an exact representation of the
out-of-domain uncertainty describes the case when a uncertainty of a DNN prediction is also not possible to
network that learned to classify cats and dogs is asked to compute, since the different uncertainties can in general not
predict a bird. The out-of-domain uncertainty stems from be modeled accurately and are most often even unknown.
the inability of the DNN to explain an out-of-domain Therefore, methods for estimating uncertainty in a DNN
sample due to its lack of out-of-domain knowledge. From prediction is a popular and vital field of research. The data
a modeler point of view, out-of-domain uncertainty is uncertainty part is normally represented in the prediction,
caused by input samples, where the network is not meant e.g. in the softmax output of a classification network or in
to give a prediction for or by insufficient training data. the explicit prediction of a standard deviation in a regression
network [60]. In contrast to this, several different approaches
Since the model uncertainty captures what the DNN does not which model the model uncertainty and seek to separate it
know due to lack of in-domain or out-of-domain knowledge, from the data uncertainty in order to receive an accurate
it captures all, in-domain, domain-shift, and out-of-domain representation of the data uncertainty were introduced [60],
uncertainties. In contrast, the data uncertainty captures in- [32], [31].
7
• Type
• Depth
• …
Settlement
Forest
Factor IV (epistemic)
Errors in the Training Procedure
Training process
Settlement
Inference
Factor V
Unknown Data
Raw information
Uncertainty
Measured quantification on
data Other data
source given prediction
Fig. 2: The illustration shows the different steps of a neural network pipeline, based on the earth observation example of
land cover classification (here settlement and forest) based on optical images. The different factors that affect the predictive
uncertainty are highlighted in the boxes. Factor I is shown as changing environments by cloud covered trees, different types and
colors of trees. Factor II is shown by insufficient measurements, that can not directly be used to separate between settlement
and forest and by label noise. In practice, the resolution of such images can be low and which would also be part of Factor
II. Factor III and Factor IV represent the uncertainties caused by the network structure and the stochastic training process,
respectively. Factor V in contrast is represented by feeding the trained network with unknown types of images, namely cows
and pigs.
In general, the methods for estimating the uncertainty can be • Test-time augmentation methods give the prediction
split in four different types based on the number (single or based on one single deterministic network but augment
multiple) and the nature (deterministic or stochastic) of the the input data at test-time in order to generate several
used DNNs. predictions that are used to evaluate the certainty of the
prediction.
• Single deterministic methods give the prediction based
on one single forward pass within a deterministic net-
work. The uncertainty quantification is either derived In the following, the main ideas and further extensions of
by using additional (external) methods or is directly the four types are presented and their main properties are
predicted by the network. discussed. In Figure 3, an overview of the different types
and methods is given. In Figure 4, the different underlying
• Bayesian methods cover all kinds of stochastic DNNs, principles that are used to differentiate between the different
i.e. DNNs where two forward passes of the same sample
types of methods are presented. Table I summarizes the
generally lead to different results.
main properties of the methods presented in this work, such
• Ensemble methods combine the predictions of several as complexity, computational effort, memory consumption,
different deterministic networks at inference. flexibility, and others.
8
Fig. 3: Visualization of the four different types of uncertainty quantification methods presented in this paper.
• Gradient Metrices1
• Additional Network
for Uncertainty2
External • Distance
3
to Training
Methods Data
• Augmentation
Neural Network Policies25
Single Network Uncertainty Test-Time
Deterministic Augmentation
Methods Quantification Methods
Internal Methods
Methods
• Prior Networks4
• Evidential Neural
Networks5 Weight
• Gradient penalties7 Sharing
• Sub-Ensembles24
• Batch-Ensembles23
1
[71], [72] 2 [46] 3 [36], [35] 4 [32] 5 [44] 7 [35] 8 [73], [74], 9
[75], [30] 10 [76], [77] 11 [20]
12
[78], [79], [80] 13 [81] 14 [82] 15 [83] 16 [63] 17 [84] 18 [31] 19
[85] 20 [86] 21 [87], [88], [89] 22
A. Single Deterministic Methods latter type is in general applied on already trained networks.
Since trained networks are not modified by such methods, they
For deterministic neural networks the parameters are de- have no effect on the network’s predictions. In the following,
terministic and each repetition of a forward pass delivers we call these two types internal and external uncertainty
the same result. With single deterministic network methods quantification approaches.
for uncertainty quantification, we summarize all approaches
where the uncertainty on a prediction y ∗ is computed based 1) Internal Uncertainty Quantification Approaches: Many
on one single forward pass within a deterministic network. of the internal uncertainty quantification approaches followed
In the literature, several such approaches can be found. They the idea of predicting the parameters of a distribution over
can be roughly categorized into approaches where one single the predictions instead of a direct pointwise maximum-a-
network is explicitly modeled and trained in order to quantify posteriori estimation. Often, the loss function of such networks
uncertainties [44], [32], [92], [64], [93] and approaches that takes the expected divergence between the true distribution
use additional components in order to give an uncertainty and the predicted distribution into account as e.g., in [32],
estimate on the prediction of a network [46], [36], [71], [72]. [94]. The distribution over the outputs can be interpreted as a
While for the first type, the uncertainty quantification affects quantification of the model uncertainty (see Section II), trying
the training procedure and the predictions of the network, the to emulate the behavior of a Bayesian modeling of the network
9
TABLE I: An overview about the four general methods presented in this paper, namely Bayesian Neural Networks, Ensembles,
Single Deterministic Neural Networks, and Test-Time Data Augmentation. The labels high and low are given relative to the
other approaches and based on the general idea behind them.
Single Deterministic Bayesian Methods Ensemble Methods Test-Time Data
Networks Augmentation
Description Approaches that receive an Model parameters are ex- The predictions of several The prediction and uncer-
uncertainty quantification plicitly modeled as random models are combined into tainty quantification at in-
on a prediction of a variables. For a single for- one prediction. A variety ference is based on several
deterministic neural ward pass the parameters among the single models is predictions resulting from
network. are sampled from this distri- crucial. different augmentations of
bution. Therefore, the pre- the original input sample.
diction is stochastic and
each prediction is based on
different model weights.
parameters [64]. The prediction is then given as the expected uncertainty. However, it is widely discussed that neural net-
value of the predicted distribution. works are often over-confident and the softmax output is often
For classification tasks, the output in general represents class poorly calibrated, leading to inaccurate uncertainty estimates
probabilities. These probabilities are a result of applying the [95], [67], [44], [92]. Furthermore, the softmax output cannot
softmax function be associated with model uncertainty. But without explicitly
(
XK
) taking the model uncertainty into account, out-of-distribution
softmax : RK → z ∈ RK |zi ≥ 0, zk = 1 samples could lead to outputs that certify a false confidence.
k=1 (14) For example, a network trained on cats and dogs will very
exp(zj ) likely not result in 50% dog and 50% cat when it is fed with
softmax(z)j = PK the image of a bird. This is, because the network extracts
k=1 exp(zk )
features from the image and even though the features do
for multiclass settings and the sigmoid function not fit to the cat class, they might fit even less to the dog
sigmoid : R → [0, 1] class. As a result, the network puts more probability on cat.
1 (15) Furthermore, it was shown that the combination of rectified
sigmoid(z) = linear unit (ReLu) networks and the softmax output leads to
1 + exp(−z)
settings where the network becomes more and more confident
for binary classification tasks on the logits z. These proba- as the distance between an out-of-distribution sample and the
bilities can be already interpreted as a prediction of the data
10
x* x* x* x*
Augmentation
Deterministic Network Forward Pass Forward Pass Forward Pass Deterministic Deterministic Deterministic
Parameters θ1 Parameters θ 2 Parameters θ M Member 1 Member 2 Member M
x *1 x *2 ... x *M
... ...
Deterministic
Network
θ* y *1 y *2 y *M y *1 y *2 y *M
y *1 y *2 ... y *M
* *
E [Ξ ( θ )] Var [Ξ( θ )] Mean [ y *1 , y *2 , ... , y *M ] Var [ y *1 , y *2 , ... , y *M ] Mean [ y *1 , y *2 , ... , y *M ] Var [ y *1 , y *2 , ... , y *M ]
Mean [ y *1 , y *2 , ... , y *M ] Var [ y *1 , y *2 , ... , y *M ]
y* σ* y* σ* y* σ* y* σ*
Fig. 4: A visualization of the basic principles of uncertainty modeling of the four presented general types of uncertainty
prediction in neural networks. For a given input sample x∗ each approach delivers a prediction y ∗ , a representation of model
uncertainty σmodel and a value of data uncertainty σdata . A) single deterministic model, B) Bayesian neural network, B)
ensemble approach, and D) test-time data augmentation. The mean and the standard deviation are only used to keep the
visualization simple. In practice other methods, could be utilized. For the deterministic approaches the idea of predicting the
parameters of an probability distribution Ξ is visualized, other approaches which base on tools additional to the prediction
network are not visualized here.
TABLE II: Overview over the properties of internal and external deterministic single network methods. For a comparison of
single deterministic network approaches with Bayesian, ensemble, and test-time augmentation methods, see Table I.
Internal Methods External Methods
Implementation effort Relatively low, but depends on ex- Relatively low, but depends on ex-
plicit approach, often only loss and plicit approach.
network output has to be fixed.
learned training set becomes larger [96]. Figure 5 shows an over categorical distributions. The density of the Dirichlet
example where the rotation of a digit from MNIST leads to distribution is defined by
false predictions with high softmax values. This phenomenon K K
is described and further investigated by Hein et al. [96] who Γ(α0 ) Y
c −1
X
Dir (µ|α) = QK µα
c , αc > 0, α0 = αc ,
proposed a method to avoid this behaviour, based on enforcing c=1 Γ(α )
c c=1 c=1
a uniform predictive distribution far away from the training
data. where Γ is the gamma function, α1 , ..., αK are called the
concentration parameters, and the scalar α0 is the precision
Several other classification approaches [44], [32], [94], of the distribution. In practice, the concentrations α1 , ..., αK
[64] followed a similar idea of taking the logit magnitude are derived by applying a strictly positive transformation, as
into account, but make use of the Dirichlet distribution. The for example the exponential function, to the logit values. As
Dirichlet distribution is the conjugate prior of the categorical visualized in Figure 6, a higher concentration value leads to a
distribution and hence can be interpreted as a distribution sharper Dirichlet distribution.
11
regularization term. Amini et al. [101] transferred the idea als represent a promising tool in order to avoid overconfident
of evidential neural networks from classification tasks to network predictions.
regression tasks by learning the parameters of an evidential 2) External Uncertainty Quantification Approaches:
normal inverse gamma distribution over an underlying Normal External uncertainty quantification approaches do not affect
distribution. Charpentier et al. [102] avoided the need of OOD the models’ predictions, since the evaluation of the uncertainty
data for the training process by using normalizing flows to is separated from the underlying prediction task. Furthermore,
learn a distribution over a latent space for each class. A new several external approaches can be applied to already trained
input sample is projected onto this latent space and a Dirichlet networks at the same time without affecting each other. Raghu
distribution is parameterized based on the class wise densities et al. [46] argued that when both tasks, the prediction and the
of the received latent point. uncertainty quantification, are done by one single method, the
Beside the Dirichlet distribution based approaches described uncertainty estimation is biased by the actual prediction task.
above, several other internal approaches exist. In [68], a Therefore, they recommended a ”direct uncertainty prediction”
relatively simple approach based on small pertubations on the and suggested to train two neural networks, one for the
training input data and the temperature scaling calibration is actual prediction task and a second one for the prediction of
presented leading to an efficient differentiation of in- and out- the uncertainty on the first network’s predictions. Similarly,
of-distritbuion samples. Możejko et al. [92] made use of the Ramalho and Miranda [36] introduced an additional neural
inhibited softmax function. It contains an artificial and constant network for uncertainty estimation. But in contrast to [46], the
logit that makes the absolute magnitude of the single logits representation space of the training data is considered and the
more determining in the softmax output. Van Amersfoort et al. density around a given test sample is evaluated. The additional
[35] showed that Radial Basis Function (RBF) networks can be neural network uses this training data density in order to
used to achieve competitive results in accuracy and very good predict whether the main network’s estimate is expected to be
results regarding uncertainty estimation. RBF networks learn correct or false. Hsu et al. [105] detected out-of-distribution
a linear transformation on the logits and classify inputs based examples in classification tasks at test-time by predicting
on the distance between the transformed logits and the learned total probabilities for each class, additional to the categorical
class centroids. In [35], a scaled exponentiated L2 distance was distribution given by the softmax output. The class wise total
used. The data uncertainty can be directly derived from the probability is predicted by applying the sigmoid function to
distances between the centroids. By including penalties on the the network’s logits. Based on these total probabilities, OOD
Jacobian matrix in the loss function, the network was trained examples can be identified as those with low class probabilities
to be more sensitive to changes in the input space. As a result, for all classes.
the method reached good performance on out-of-distribution In contrast to this, Oberdiek et al. [71] took the sensitivity of
detection. In several tests, the approach was compared to a the model, i.e. the model’s slope, into account by using gradi-
five members deep ensemble [31] and it was shown that this ent metrics for the uncertainty quantification in classification
single network approach performs at least equivalently well on tasks. Lee et al. [72] applied a similar idea but made use of
detecting out-of-distribution samples and improves the true- back propagated gradients. In their work they presented state
positive rate. of the art results on out-of-distribution and corrupted input
For regression tasks, Oala et al. [93] introduced an uncertainty detection.
score based on the lower and an upper bound output of an 3) Summing Up Single Deterministic Methods: Compared
interval neural network. The interval neural network has the to many other principles, single deterministic methods are
same structure as the underlying deterministic neural network computational efficient in training and evaluation. For training,
and is initialized with the deterministic network’s weights. In only one network has to be trained and often the approaches
contrast to Gaussian representations of uncertainty given by can even be applied on pre-trained networks. Depending on the
a standard deviation, this approach can give non-symmetric actual approach, only a single or at most two forward passes
values of uncertainty. Furthermore, the approach is found to be have to be fulfilled for evaluation. The underlying networks
more robust in the presence of noise. Tagasovska and Lopez- could contain more complex loss functions, which slows down
Paz [103] presented an approach to estimate data and model the training process [44] or external components that have to
uncertainty. A simultaneous quantile regression loss function be trained and evaluated additionally [46]. But in general, this
was introduced in order to generate well-calibrated prediction is still more efficient than the number of predictions needed for
intervals for the data uncertainty. The model uncertainty is ensembles based methods (Section III-C), Bayesian methods
quantified based on a mapping from the training data to zero, (Section III-B), and test-time data augmentation methods (Sec-
based on so called Orthonormal Certificates. The aim was tion III-D). A drawback of single deterministic neural network
that out-of-distribution samples, where the model is uncertain, approaches is the fact that they rely on a single opinion and
are mapped to a non-zero value and thus can be recognized. can therefore become very sensitive to the underlying network
Kawashima et al. [104] introduced a method which computes architecture, training procedure, and the training data.
virtual residuals in the training samples of a regression task
based on a cross-validation like pre-training step. With original
training data expanded by the information of these residuals, B. Bayesian Neural Networks
the actual predictor is trained to give a prediction and a value Bayesian Neural Networks (BNNs) [106], [107], [108] have
of certainty. The experiments indicated that the virtual residu- the ability to combine the scalability, expressiveness, and
13
TABLE III: Overview over the properties different types of Bayesian neural networks approaches. The properties are stated
relatively among the approaches. The properties can not used as comparison to other uncertainty methods as ensembles, single
deterministic models, and test-time augmentation methods. For a comparison of Bayesian methods to these methods see Table
I
Variational Inference Sampling Approaches Laplace Approximation
Description Variational inference approaches Representation of the target ran- Simplifies the target distribution
approximate the (in general dom variable from which realiza- by approximating the log-posterior
intractable) posterior distribution tions can be sampled. Such meth- distribution and then, based on
by optimizing over a family of ods are based on Markov Chain this approximation, deriving a nor-
tractable distributions. Achieved Monte Carlo and further exten- mal distribution on the network
by minimizing the KL divergence. sions. weights.
Unbiased? No Yes No
Convergence Easy to assess convergence Hard to assess convergence Easy to assess convergence
Computational effort at Medium - Convergence may be High - M forward passes based Low - Training of one deterministic
training time slowed down by regularization. on sampled parameters, otherwise model and Laplace approximation
Additional parameters for uncer- intractable
tainty representation.
Computational effort at High - M forward passes based High - M forward passes based High - M forward passes based
inference on sampled parameters, otherwise on sampled parameters, otherwise on sampled parameters, otherwise
intractable intractable intractable
predictive performance of neural networks with the Bayesian the expected value by the mean of N deterministic networks,
learning as opposed to learning via the maximum likelihood fθ1 , ..., fθN , parameterized by N samples, θ1 , θ2 , ..., θN , from
principles. This is achieved by inferring the probability dis- the posterior distribution of the weights, i.e.
tribution over the network parameters θ = (w1 , ..., wK ). N N
More specifically, given a training input-target pair (x, y) the 1 X ∗ 1 X
y∗ ≈ y = fθ (x∗ ). (19)
posterior distribution over the space of parameters p(θ|x, y) is N i=1 i N i=1 i
modelled by assuming a prior distribution over the parameters
p(θ) and applying Bayes theorem: Wilson and Izmailov [16] argue that a key advantage of
BNNs lie in this marginalization step, which particularly can
p(y|x, θ)p(θ) improve both the accuracy and calibration of modern deep
p(θ|x, y) = ∝ p(y|x, θ)p(θ). (16)
p(y|x) neural networks. We note that the use-cases of BNNs are not
Here, the normalization constant in (16) is called the model limited for uncertainty estimation but open up the possibility to
evidence p(y|x) which is defined as bridge the powerful Bayesian tool-boxes within deep learning.
Z Notable examples include Bayesian model selection [109],
p(y|x) = p(y|x, θ)p(θ)dθ. (17) [110], [111], [112], model compression [76], [113], [114],
active learning [115], [23], [116], continual learning [117],
Once the posterior distribution over the weights have been [118], [119], [120], theoretic advances in Bayesian learning
estimated, the prediction of an output y ∗ for a new input data [121] and beyond.
x∗ can be obtained by Bayesian Model Averaging or Full While the formulation is rather simple, there exist several
Bayesian Analysis that involves marginalizing the likelihood challenges. For example, no closed form solution exists for
p(y|x, θ) with the posterior distribution: the posterior inference as conjugate priors do not typically
Z exist for complex models such as neural networks [62]. Hence,
p(y ∗ |x∗ , x, y) = p(y ∗ |x∗ , θ)p(θ|x, y)dθ. (18) approximate Bayesian inference techniques are often needed
to compute the posterior probabilities. Yet, directly using
This Bayesian way of prediction is a direct application of the approximate Bayesian inference techniques have been proven
law of total probability and endows the ability to compute to be difficult as the size of the data and number of parameters
the principled predictive uncertainty. The integral of (18) is are too large for the use-cases of deep neural networks. In other
intractable for the most common prior posterior pairs and words, the integrals of above equations are not computationally
approximation techniques are therefore typically applied. The tractable as the size of the data and number of parameters
most widespread approximation, the Monte Carlo Approxi- grows. Moreover, specifying a meaningful prior for deep
mation, follows the law of large numbers and approximates neural networks is another challenge that is less understood.
14
In this survey, we classify the BNNs into three different Variational methods for BNNs have been pioneered by
types based on how the posterior distribution is inferred to Hinton and Van Camp [73] where the authors derived a
approximate Bayesian inference: diagonal Gaussian approximation to the posterior distribution
• Variational inference [73], [74] of neural networks (couched in information theory - a
Variational inference approaches approximate the (in gen- minimum description length). Another notable extension in
eral intractable) posterior distribution by optimizing over 1990s has been proposed by Barber and Bishop [74], in
a family of tractable distributions. which the full covariance matrix was chosen as the variational
family, and the authors demonstrated how the ELBO can be
• Sampling approaches [78] optimized for neural networks. Several modern approaches
Sampling approaches deliver a representation of the target can be viewed as extensions of these early works [73], [74]
random variable from which realizations can be sampled. with a focus on how to scale the variational inference to
Such methods are based on Markov Chain Monte Carlo modern neural networks.
and further extensions. An evident direction with the current methods are the use of
• Laplace approximation [122], [123] stochastic variational inference (or Monte-Carlo variational
Laplace approximation simplifies the target distribution inference), where the optimization of ELBO is performed
by approximating the log-posterior distribution and then, using mini-batch of data. One of the first connections to
based on this approximation, deriving a normal distribu- stochastic variational inference has been proposed by Graves
tion over the network weights. et al. [75] with Gaussian priors. In 2015, Blundell et al.
While limiting our scope to these three categories, we also [30] introduced Bayes By Backprop, a further extension
acknowledge several advances in related domains of BNN of stochastic variational inference [75] to non-Gaussian
research. Some examples are (i) approximate inference priors and demonstrated how the stochastic gradients can
techniques such as alpha divergence [124], [125], [126], be made unbiased. Notable, Kingma et al. [142] introduced
expectation propagation [127], [128], assumed density the local reparameterization trick to reduce the variance
filtering [129] etc, (ii) probabilistic programming to exploit of the stochastic gradients. One of the key concepts is
modern Graphical Processing Units (GPUs) [130], [131], to reformulate the loss function of neural network as the
[132], [133], (iii) different types of priors [134], [135], (iv) ELBO. As a result the intractable posterior distribution is
advancements in theoretical understandings of BNNs [136], indirectly optimized and variational inference is compatible
[121], [137], (iv) uncertainty propagation techniques to speed to back-propagation with certain modifications to the training
up the marginalization procedures [138] and (v) computations procedure. These extensions widely focus on the fragility of
of aleatoric uncertainty [139], [140], [141]. stochastic variational inference that arises due to sensitivity
to initialization, prior definition and variance of the gradients.
1) Variational Inference: The goal of variational inference These limitations have been addressed recently by Wu et al.
is to infer the posterior probabilities p(θ|x, y) using a pre- [143], where a hierarchical prior was used and the moments of
specified family of distributions q(θ). Here, these so-called the variational distribution are approximated deterministically.
variational family q(θ) is defined as a parametric distribution. Above works commonly assumed mean-field approximations
An example is the Multivariate Normal distribution where as the variational family, neglecting the correlations between
its parameters are the mean and the covariance matrix. The the parameters. In order to make more expressive variational
main idea of variational inference is to find the settings of distributions feasible for deep neural networks, several
these parameters that make q(θ) to be close to the posterior works proposed to infer using the matrix normal distribution
of interest p(θ|x, y). This measure of closeness between the [144], [145], [146] or more expressive variants [147],
probability distributions are given by the Kullback-Leibler [148] where the covariance matrix is decomposed into the
(KL) divergence Kronecker products of smaller matrices or in a low rank
form plus a positive diagonal matrix. A notable contribution
q(θ) towards expressive posterior distributions has been the use
KL(q||p) = Eq log . (20)
p(θ|x, y) of normalizing flows [77], [149] - a hierarchical probability
distribution where a sequence of invertible transformations are
Due to the posterior p(θ|x, y) the KL-divergence in (20) can
applied so that a simple initial density function is transformed
not be minimized directly. Instead, the evidence lower bound
into a more complex distribution. Interestingly, Farquhar et al.
(ELBO), a function that is equal to the KL divergence up to
[137] argue that mean-field approximation is not a restrictive
a constant, is optimized. For a given prior distribution on the
assumption, and the layer-wise weight correlations may not
parameters p(θ), the ELBO is given by
be as important as capturing the depth-wise correlations.
p(y|x, θ) While the claim of Farquhar et al. [137] may remain to be
L = Eq log (21) an open question, the mean-field approximations have an
q(θ)
advantage on smaller computational complexities [137]. For
and for the KL divergence example, Osawa et al. [150] demonstrated that variational
KL(q||p) = −L + log p(y|x) (22) inference can be scaled up to ImageNet size data-sets and
architectures using multiple GPUs and proposed practical
holds. tricks such as data augmentation, momentum initialization
15
and learning rate scheduling. i.e. the next state only depends on the current state and not
One of the successes in variational methods have been on any other former state. In order to draw samples from
accomplished by casting existing stochastic elements of the true posterior, MCMC sampling methods first generate
deep learning as variational inference. A widely known samples in an iterative and the Markov Chain fashion. Then,
example is Monte Carlo Dropout (MC Dropout) where the at each iteration, the algorithm decides to either accept or
dropout layers are formulated as Bernoulli distributed random reject the samples where the probability of acceptance is
variables, and training a neural network with dropout layers determined by certain rules. In this way, as more and more
can be approximated as performing variational inference samples are produced, their values can approximate the desired
[61], [20], [151]. A main advantage of MC dropout is that distribution.
the predictive uncertainty can be computed by activating Hamiltonian Monte Carlo or Hybrid Monte Carlo (HMC)
dropout not only during training, but also at test time. In this [158] is an important variant of MCMC sampling method (pi-
way, once the neural network is trained with dropout layers, oneered by Neals [78], [79], [80], [159] for neural networks),
the implementation efforts can be kept minimum and the and is often known to be the gold standards of Bayesian
practitioners do not need expert knowledge to reason about inference [159], [160], [125]. The algorithm works as follows:
uncertainty - certain criteria that the authors are attributing (i) start by initializing a set of parameters θ (either randomly
to its success [20]. The practical values of this method has or in a user-specific manner). Then, for a given number of total
been demonstrated also in several works [152], [10], [21] iterations, (ii) instead of a random walk, a momentum vector
and resulted in different extensions (evaluating the usage - an auxiliary variable ρ is sampled, and the current value of
of different dropout masks for example for convolutional parameters θ is updated via the Hamiltonian dynamics:
layers [153] or by changing the representations of the
predictive uncertainty into model and data uncertainties [60]). H(ρ, θ) = −logp(ρ, θ) = −logp(ρ|θ) − logp(θ). (24)
Approaches that build upon the similar idea but randomly Defining the potential energy (V (θ) = −logp(θ) and the
drop incoming activations of a node, instead of dropping an kinetic energy T (ρ|θ) = −logp(ρ|θ), the update steps via
activation for all following nodes, were also proposed within Hamilton’s equations are governed by,
the literature [37] and called drop connect. This was found
dθ ∂H ∂T
to be more robust on the uncertainty representation, even = = and (25)
though it was shown that a combination of both can lead to dt ∂ρ ∂ρ
higher accuracy and robustness in the test predictions [154]. dρ ∂H ∂T ∂V
=− =− − . (26)
Lastly, connections of variation inference to Adam [155], dt ∂θ ∂θ ∂θ
RMS Prop [156] and batch normalization [157] have been The so-called leapfrog integrator is used as a solver [161]. (iii)
further suggested in the literature. For each step, a Metropolis acceptance criterion is applied
to either reject or accept the samples (similar to MCMC).
Unfortunately, HMC requires the processing of the entire
2) Sampling Methods: Sampling methods, or also often
data-set per iteration, which is computationally too expensive
called Monte Carlo methods, are another family of Bayesian
when the data-set size grows to million to even billions.
inference algorithms that represent uncertainty without a para-
Hence, many modern algorithms focus on how to perform the
metric model. Specifically, sampling methods use a set of
computations in a mini-batch fashion stochastically. In this
hypotheses (or samples) drawn from the distribution and offer
context, for the first time, Welling and Teh [81] proposed to
an advantage that the representation itself is not restricted
combine Stochastic Gradient Descent (SGD) with Langevin
by the type of distribution (e.g. can be multi-modal or non-
dynamics (a form of MCMC [162], [163], [159]) in order to
Gaussian) - hence probability distributions are obtained non-
obtain a scalable approximation to MCMC algorithm based
parametrically. Popular algorithms within this domain are
on mini-batch SGD [164], [165]. The work demonstrated that
particle filtering, rejection sampling, importance sampling and
performing Bayesian inference on Deep Neural Networks
Markov Chain Monte Carlo sampling (MCMC) [62].
can be as simple as running a noisy SGD. This method does
In case of neural networks, MCMC is often used since alterna-
not include the momentum term of HMC via using the first
tives such as rejection and importance sampling are known to
order Langevin dynamics and opened up a new research
be inefficient for such high dimensional problems. The main
area on Stochastic Gradient Markov Chain Monte Carlo
idea of MCMC is to sample from arbitrary distributions by
(SG-MCMC).
transition in state space where this transition is governed by
Consequently, several extensions are available which include
a record of the current state and the proposal distribution
the use of 2nd order information such as preconditioning
that aims to estimate the target distribution (e.g. the true
and optimizing with the Fisher Information Matrix (FIM)
posterior). To explain this, let us start defining the Markov
[166], [167], [168], the Hessian [169], [170], [171], adapting
Chain: a Markov Chain is a distribution over random variables
preconditioning diagonal matrix [172], generating samples
x1 , · · · , xT which follows the state transition rule:
from non-isotropic target densities using Fisher scoring [173],
and samplers in the Riemannian manifold [174] using the
T
Y first order Langevin dynamics and Levy diffusion noise
p(x1 , · · · , xT ) = p(x1 ) p(xt |xt−1 ), (23) and momentum [175]. Within these methods, the so-called
t=2 parameter dependent diffusion matrices are incorporated
16
with an intention to offset the stochastic perturbation of Hessian H resulting in a Multivariate Normal distribution:
the gradient. To do so, the ”thermostat” ideas [176], [177],
[178] are proposed so that a prescribed constant temperature p(θ | x, y) ∼ N θ̂, (H + τ I)−1 . (27)
distribution is maintained with the parameter dependent noise.
Ahn et al. [179] devised a distributed computing system for In contrast to the two other methods described, the Laplace
SG-MCMC to exploit the modern computing routines, while approximation can be applied on already trained networks,
Wang et al. [180] showed that Generative Adversarial Models and is generally applicable when using standard loss functions
(GANs) can be used to distill the samples for improved such as MSE or cross entropy and piece-wise linear activations
memory efficiency, instead of distillation for enhancing the (e.g RELU). Mackay [123] and Denker et al. [122] have
run-time capabilities of computing predictive uncertainty pioneered the Laplace approximation for neural networks in
[181]. Lastly, other recent trends are techniques that reduce 1990s, and several modern methods provide an extension to
the variance [160], [182] and bias [183], [184] arising from deep neural networks [192], [193], [63], [84].
stochastic gradients. The core of the Laplace Approximation is the estimation of
Concurrently, there have been solid advances in theory of the Hessian. Unfortunately, due to the enormous number of
SG-MCMC methods and their applications in practice. Sato parameters in modern neural networks, the Hessian matrices
and Nakagawa [185], for the first time, showed that the cannot be computed in a feasible way as opposed to relative
SGLD algorithm with constant step size weakly converges; smaller networks in Mackay [123] and Denker et al. [122].
Chen et al. [186] showed that faster convergence rates Consequently, several different ways for approximating H
and more accurate invariant measures can be observed for have been proposed in the literature. A brief review is as
SG-MCMCs with higher order integrators rather than a follows. Instead of diagonal approximations (e.g. [194],
1st order Euler integrator while Teh et al. [187] studied [83]), several researchers have been focusing on including
the consistency and fluctuation properties of the SGLD. the off-diagonal elements (e.g. [195], [196] and [197]).
As a result, verifiable conditions obeying a central limit Amongst them, layer-wise Kronecker Factor approximation
theorem for which the algorithm is consistent, and how its of [198], [193], [192] and [199] have demonstrated a notable
asymptotic bias-variance decomposition depends on step-size scalability [200]. A recent extension can be found in [201]
sequences have been discovered. A more detailed review of where the authors propose to re-scale the eigen-values of the
the SG-MCMC with a focus on supporting theoretical results Kronecker factored matrices so that the diagonal variance in
can be found in Nemeth and Fearnhead [82]. Practically, SG- its eigenbasis is accurate. The work presents an interesting
MCMC techniques have been applied to shape classification idea as one can prove that in terms of a Frobenius norm, the
and uncertainty quantification [188], empirically study and proposed approximation is more accurate than that of [193].
validate the effects of tempered posteriors (or called cold- However, as this approximation is harmed by inaccurate
posteriors) [189] and train a deep neural network in order to estimates of eigenvectors, Lee et al. [84] proposed to further
generalize and avoid over-fitting [190], [191]. correct the diagonal elements in the parameter space.
Existing works obtain Laplace Approximation using various
3) Laplace Approximation: The goal of the Laplace Ap- approximation of the Hessian in the line of fidelity-complexity
proximation is to estimate the posterior distribution over the trade-offs. For several works, an approximation using the
parameters of neural networks p(θ | x, y) around a local mode diagonal of the Fisher information matrix or Gauss Newton
of the loss surface with a Multivariate Normal distribution. matrix, leading to independently distributed model weights,
The Laplace Approximation to the posterior can be obtained have been utilized in order to prune weights [202] or perform
by taking the second-order Taylor series expansion of the log continual learning in order to avoid catastrophic forgetting
posterior over the weights around the MAP estimate θ̂ given [203]. In Ritter et al. [63], the Kronecker factorization of
some data (x, y). If we assume a Gaussian prior with a scalar the approximate block-diagonal Hessian [193], [192] have
precision value τ > 0, then this corresponds to the commonly been applied to obtain scalable Laplace Approximation for
used L2 -regularization, and the Taylor series expansion results neural networks. With this, the weights among different
in layers are still assumed to be independently distributed,
but not the correlations within the same layer. Recently,
building upon the current understandings of neural network’s
log p(θ | x, y) ≈ log p(θ̂ | x, y) loss landscape that many eigenvalues of the Hessian tend
1 to be zero, [84] developed a low rank approximation that
+ (θ − θ̂)T (H + τ I)(θ − θ̂),
2 leads to sparse representations of the layers’ co-variance
matrices. Furthermore, Lee et al. [84] demonstrated that
where the first-order term vanishes because the gradient of the the Laplace Approximation can be scaled to ImageNet size
log posterior δθ = ∇ log p(θ | x, y) is zero at the maximum θ̂. data-sets and architectures, and further showed that with the
Taking the exponential on both sides and approximating inte- proposed sparsification technique, the memory complexity of
grals by reverse engineering densities, the weight posterior is modelling correlations can be made similar to the diagonal
approximately a Gaussian with the mean θ̂ and the covariance approximation. Lastly, Kristiadi et al. [204] proposed a simple
matrix (H + τ I)−1 where H is the Hessian of log p(θ | x, y). procedure to compute the last-layer Gaussian approximation
This means that the model uncertainty is represented by the (neglecting the model uncertainty in all other layers of neural
17
networks), and showed that even such a minimalist solution models, arguing that a group of decision makers tend to make
can mitigate overconfidence predictions of ReLU networks. better decisions than a single decision maker [217], [218]. For
Recent efforts have extended the Laplace Approximation an ensemble f : X → Y with members fi : X → Y for
beyond the Hessian approximation. To tackle the widely i ∈ 1, 2, ..., M , this could be for example implemented by
known assumption that the Laplace Approximation is for the simply averaging over the members’ predictions,
bell shaped true posterior and thus resulting in under-fitting
M
behavior [63], Humt et al. [205] proposed to use Bayesian 1 X
f (x) := fi (x) .
Optimization and showed that hyperparameters of the Laplace M i=1
Approximation can be efficiently optimized with increaed
calibration performance. Another work in this domain is by Based on this intuitive idea, several works applying ensemble
Kristiadi et al. [206], who proposed uncertainty units - a methods to different kinds of practical tasks and approaches,
new type of hidden units that changes the geometry of the as for example bioinformatics [219], [220], [221], remote
loss landscape so that more accurate inference is possible. sensing [222], [223], [224], or reinforcement learning
While Shinde et al. [207] demonstrated practical effectiveness [225], [226] can be found in the literature. Besides the
of the Laplace Approximation to the autonomous driving improvement in the accuracy, ensembles give an intuitive
applications, Feng et al. [208] showed the possibility to (i) way of representing the model uncertainty on a prediction by
incorporate contextual information and (ii) domain adaptation evaluating the variety among the member’s predictions.
in a semi-supervised manner within the context of image Compared to Bayesian and single deterministic network
classification. This is achieved by designing unary potentials approaches, ensemble methods have two major differences.
within a Conditional Random Field. Several real-time First, the general idea behind ensembles is relatively clear
methods also exist that do not require multiple forward passes and there are not many groundbreaking differences in the
to compute the predictive uncertainty. So-called linearized application of different types of ensemble methods and
Laplace Approximation has been proposed in [209], [210] their application in different fields. Hence, this section
using the ideas of Mackay [115] and have been extended with focuses on different strategies to train an ensemble and some
Laplace bridge for classification [211]. Within this framework, variations that target on making ensemble methods more
Daxberger et al. [212] proposed inferring the sub-networks efficient. Second, ensemble methods were originally not
to increase the expressivity of covariance propagation while introduced to explicitly handle and quantify uncertainties
remaining computationally tractable. in neural networks. Although the derivation of uncertainty
from ensemble predictions is obvious, since they actually
4) Sum Up Bayesian Methods: Bayesian methods for deep aim at reducing the model uncertainty, ensembles were first
learning have emerged as a strong research domain by com- introduced and discussed in order to improve the accuracy
bining principled Bayesian learning for deep neural networks. on a prediction [218]. Therefore, many works on ensemble
A review of current BNNs has been provided with a focus methods do not explicitly take the uncertainty into account.
on mostly, how the posterior p(θ|x, y) is inferred. As an Notwithstanding this, ensembles have been found to be well
observation, many of the recent breakthroughs have been suited for uncertainty estimations in neural networks [31].
achieved by performing approximate Bayesian inference in a
mini-batch fashion (stochastically) or investigating relatively 2) Single- and Multi-Mode Evaluation: One main point
simple but scalable techniques such as MC-dropout or Laplace where ensemble methods differ from the other methods
Approximation. As a result, several works demonstrated that presented in this paper is the number of local optima that
the posterior inference in large scale settings are now pos- are considered, i.e. the differentiation into single-mode and
sible [213], [150], [84], and the field has several practical multi-mode evaluation.
approximation tools to compute more expressive and accurate In order to create synergies and marginalise false predictions
posteriors since the revival of BNNs beyond the pioneers [73], of single members, the members of an ensemble have to
[74], [78], [122], [123]. There are also emerging challenges behave differently in case of an uncertain outcome. The
on new frontiers beyond accurate inference techniques. Some mapping defined by a neural network is highly non-linear
examples are: (i) how to specify meaningful priors? [134], and hence the optimized loss function contains many local
[135], (ii) how to efficiently marginalize over the parame- optima to which a training algorithm could converge to.
ters for fast predictive uncertainty? [181], [138], [211] (iii) Deterministic neural networks converge to one single local
infrastructures such as new benchmarks, evaluation protocols optimum in the solution space [227]. Other approaches, e.g.
and software tools [214], [131], [132], [215], and (iv) towards BNNs, still converge to one single optimum, but additionally
better understandings on the current methodologies and their take the uncertainty on this local optimum into account
potential applications [137], [189], [216], [208]. [227]. This means, that neighbouring points within a certain
region around the solution also affect the loss and also
C. Ensemble Methods influence the prediction of a test sample. Since these methods
1) Principles of Ensemble Methods: Ensembles derive a focus on single regions, the evaluation is called single-mode
prediction based on the predictions received from multiple so- evaluation. In contrast to this, ensemble methods consist of
called ensemble members. They target at a better generaliza- several networks, which should converge to different local
tion by making use of synergy effects among the different optima. This leads to a so called multi-mode evaluation [227].
18
is evaluated with respect to accuracy, calibration, and out- sub-problems focusing on single blocks of classes that are
of-distribution detection for classification and regression difficult to differentiate. Several smaller trainer networks
tasks. In all tests, the method performs at least equally are trained on the sub-problems and then teach one student
well as the BNN approaches used for comparison, namely network to separate all classes at the same time. In contrast to
Monte Carlo Dropout and Probabilistic Backpropagation. this, Ensemble distillation approaches capture the behaviour
Lakshminarayanan et al. [31] also showed that shuffling the of an ensemble by one single network. First works on
training data and a random initialization of the training process ensemble distillation used the average of the softmax outputs
induces a sufficient variety in the models in order to predict of the ensemble members in order to teach a student network
the uncertainty for the given architectures and data sets. the derived predictive uncertainty [245]. Englesson and
Furthermore, bagging is even found to worsen the predictive Azizpour [246] justify the resulting predictive distributions
uncertainty estimation, extending the findings of Lee et al. of this approach and additionally cover the handling of out-
[230], who found bagging to worsen the predictive accuracy of-distribution samples. When averaging over the members’
of ensemble methods on the investigated tasks. Gustafsson outputs, the model uncertainty, which is represented in the
et al. [56] introduced a framework for the comparison of variety of ensemble outputs, gets lost. To overcome this
uncertainty quantification methods with a specific focus on drawback, researchers applied the idea of learning higher
real life applications. Based on this framework, they compared order distributions, i.e. distributions over a distribution,
ensembles and Monte Carlo dropout and found ensembles instead of directly predicting the output [90], [45]. The
to be more reliable and applicable to real life applications. members are then distillated based on the divergence from
These findings endorse the results reported by Beluch et al. the average distribution. The idea is closely related to the
[240] who found ensemble methods to deliver more accurate prior networks [32] and the evidential neural networks
and better calibrated predictions on active learning tasks [44], which are described in Section III-A. [45] modelled
than Monte Carlo Dropout. Ovadia et al. [13] evaluated ensemble members and the distilled network as prior networks
different uncertainty quantification methods based on test predicting the parameters of a Dirichlet distribution. The
sets affected by distribution shifts. The excessive evaluation distillation then seeks to minimize the KL divergence between
contains a variety of model types and data modalities. As the averaged Dirichlet distributions of the ensemble members
a take away, the authors stated that already for a relatively and the output of the distilled network. Lindqvist et al.
small ensemble size of five, deep ensembles seem to perform [90] generalized this idea to any other parameterizable
best and are more robust to data set shifts than the compared distribution. With that, the method is also applicable to
methods. Vyas et al. [241] presented an ensemble method regression problems, for example by predicting a mean and
for the improved detection of out-of-distribution samples. For standard deviation to describe a normal distribution. Within
each member, a subset of the training data is considered as several tests, the distillation models generated by these
out-of-distribution. For the training process, a loss, seeking a approaches are able to distinguish between data uncertainty
minimum margin greater zero between the average entropy and model uncertainty. Although distillation methods cannot
of the in-domain and the out-of-distribution subsets is completely capture the behaviour of an underlying ensemble,
introduced and leads to a significant improvement in the it has been shown that they are capable of delivering good
out-of-distribution detection. and for some experiments even comparable results [90], [45],
[247].
5) Making Ensemble Methods more Efficient: Compared to Other approaches, as sub-ensembles [39] and batch-ensembles
single model methods, ensemble methods come along with [40] seek to reduce the computation effort and memory
a significantly increased computational effort and memory consumption by sharing parts among the single members.
consumption [217], [45]. When deploying an ensemble for a It is important to note that the possibility of using different
real life application the available memory and computational model architectures for the ensemble members could get lost
power are often limited. Such limitations could easily become when parts of the ensembles are shared. Also, the training
a bottleneck [242] and could become critical for applications of the models cannot be run in a completely independent
with limited reaction time. Reducing the number of models manner. Therefore, the actual time needed for training does
leads to less memory and computational power consumption. not necessarily decrease in the same way as the computational
Pruning approaches reduce the complexity of ensembles by effort does.
pruning over the members and reducing the redundancy among Sub-ensembles [39] divide a neural network architecture
them. For that, several approaches based on different diversity into two sub-networks. The trunk network for the extraction
measures are developed to remove single members without of general information from the input data, and the task
strongly affecting the performance [88], [87], [243]. network that uses these information to fulfill the actual task.
Distillation is another approach where the number of In order to train a sub-ensemble, first, the weights of each
networks is reduced to one single model. It is the procedure member’s trunk network are fixed based on the resulting
of teaching a single network to represent the knowledge parameters of one single model’s training process. Following,
of a group of neural networks [244]. First works on the the parameters of each ensemble members’ task network are
distillation of neural networks were motivated by restrictions trained independently from the other members. As a result,
when deploying large scale classification problems [244]. the members are built with a common trunk and an individual
The original classification problem is separated into several task sub-network. Since the training and the evaluation of
20
the trunk network have to be done only once, the number of D. Test Time Augmentation
computations needed for training and testing decreases by the
factor M ·NMtask +Ntrunk
·N , where Ntask , Ntrunk , and N stand for the Inspired by ensemble methods and adversarial examples
number of variables in the task networks, the trunk network, [14], the test time data augmentation is one of the simpler
and the complete network. Valdenegro-Toro [39] further predictive uncertainty estimation techniques. The basic method
underlined the usage of a shared trunk network by arguing is to create multiple test samples from each test sample by
that the trunk network is in general computational more costly applying data augmentation techniques on it and then test all
than the task network. In contrast to this, batch-ensembles those samples to compute a predictive distribution in order
[40] connect the member networks with each other at every to measure uncertainty. The idea behind this method is that
layer. The ensemble members’ weights are described as a the augmented test samples allow the exploration of different
Hadamard product of one shared weight matrix W ∈ Rn×m views and is therefore capable of capturing the uncertainty.
and M individual rank one matrices Fi ∈ Rn×m , each linked Mostly, this technique of test time data augmentations has been
with one of the M ensemble members. The rank one matrices used in medical image processing [248], [249], [14], [250].
can be written as a multiplication Fi = ri sTi of two vectors One of the reasons for this is that the field of medical image
s ∈ Rn and r ∈ Rm and hence the matrix Fi can be described processing already makes heavy use of data augmentations
by n + m parameters. With this approach, each additional while using deep learning [251], so it is quite easy to just
ensemble member increases the number of parameters only apply those same augmentations during test time to calculate
n+m
by the factor M ·(n+m)+n·m + 1 instead of MM+1 = 1 + M 1
. the uncertainties. Another reason is that collecting medical
On the one hand, with this approach, the members are not images is costly, thus forcing practitioners to rely on data
independent anymore such that all the members have to augmentation techniques. Moshkov et al. [250] used the test
be trained in parallel. On the other hand, the authors also time augmentation technique for cell segmentation tasks. For
showed that the parallelization can be realized similar to the that, they created multiple variations of the test data before
optimization on mini-batches and on a single unit. feeding it to a trained UNet or Mask R-CNN architecture.
Following, they used a majority voting to create the final
6) Sum Up Ensemble Methods: Ensemble methods are output segmentation mask and discuss the policies of applying
very easy to apply, since no complex implementation or different augmentation techniques and how they affect the final
major modification of the standard deterministic model have predictive results of the deep networks.
to be realized. Furthermore, ensemble members are trained Overall, test time augmentation is an easy method for estimat-
independently from each other, which makes the training ing uncertainties because it keeps the underlying model un-
easily parallelizable. Also, trained ensembles can be extended changed, requires no additional data, and is simple to put into
easily, but the needed memory and the computational effort practice with off-the-shelf libraries. Nonetheless, it needs to be
increases linearly with the number of members for training and kept in mind that during applying this technique, one should
evaluation. The main challenge when working with ensemble only apply valid augmentations to the data, meaning that the
methods is the need of introducing diversity among the ensem- augmentations should not generate data from outside the target
ble members. For accuracy, uncertainty quantification, and out- distribution. According to [252], test time augmentation can
of-distribution detection, random initialization, data shuffling, change many correct predictions into incorrect predictions (and
and augmentations have been found to be sufficient for many vice versa) due to many factors such as the nature of the
applications and tasks [31], [232]. Since these methods may be problem at hand, the size of training data, the deep neural
applied anyway, they do not need much additional effort. The network architecture, and the type of augmentation. To limit
independence of the single ensemble members leads to a linear the impact of these factors, Shanmugam et al. [252] proposed
increase in the required memory and computation power with a learning-based method for test time augmentation that takes
each additional member. This holds for the training as well as these factors into consideration. In particular, the proposed
for testing. This limits the deployment of ensemble methods method learns a function that aggregates the predictions
in many practical applications where the computation power from each augmentation of a test sample. Similar to [252],
or memory is limited, the application is time-critical, or very Molchanov et al. [91] proposed a method, named “greedy
large networks with high inference time are included [45]. Policy Search”, for constructing a test-time augmentation
Many aspects of ensemble approaches are only investigated policy by choosing augmentations to be include in a fixed-
with respect to the performance on the predictive accuracy length policy. Similarly, Kim et al. [253] proposed a method
but do not take predictive uncertainty into account. This also for learning a loss predictor from the training data for instance-
holds for the comparison of different training strategies for aware test-time augmentation selection. The predictor selects
a broad range of problems and data sets. Especially since test-time augmentations with the lowest predicted loss for a
the overconfidence from single members can be transferred given sample.
to the whole ensemble, strategies that encourage the members Although learnable test time augmentation techniques [252],
to deliver different false predictions instead of all delivering [91], [253] help to select valid augmentations, one of the
the same false prediction should be further investigated. For a major open question is to find out the effect on uncertainty
better understanding of ensemble behavior, further evaluations due to different kinds of augmentations. It can for example
of the loss landscape, as done by Fort et al. [227], could offer happen that a simple augmentation like reflection is not
interesting insights. able to capture much of the uncertainty while some domain
21
specialized stretching and shearing captures more uncertainty. evaluation given by ensembles is a powerful property for
It is also important to find out how many augmentations are real life applications. Nevertheless Bayesian approaches have
needed to correctly quantify uncertainties in a given task. This delivered strong results as well and furthermore come along
is particularly important in applications like earth observation, with a strong theoretical foundation [84], [211], [6], [23]. As
where inference might be needed on global scale with limited a way to go, the combination of efficient ensemble strategies
resources. and Bayesian approaches could combine the variability in
the model parameters while still considering several modes
for a prediction. Also, single deterministic approaches as
E. Neural Network Uncertainty Quantification Approaches for the prior networks [32], [64], [44], [33] deliver comparable
Real Life Applications results while consuming significantly less computation power.
In order to use the presented methods on real life tasks, However, this efficiency often comes along with the problem
several different considerations have to be taken into account. that separated sets of in- and out-of-distribution samples have
The memory and computational power is often restricted to be available for the training process [33], [64]. In general,
while many real world tasks my be time-critical [242]. An the development of new problem and loss formulations as
overview over the main properties is given in Table I. for example given in [64] leads to a better understanding and
The presented applications all come along with advantages description of the underlying problem and forms an important
and disadvantages, depending on the properties a user field of research.
is interested in. While ensemble methods and test-time
augmentation methods are relatively easy to apply, Bayesian
IV. U NCERTAINTY M EASURES AND Q UALITY
approaches deliver a clear description of the uncertainty on
the models parameters and also deliver a deeper theoretical In Section III, we presented different methods for modeling
basis. The computational effort and memory consumption and predicting different types of uncertainty in neural net-
is a common restriction on real life applications, where works. In order to evaluate these approaches, measures have
single deterministic network approaches perform best, but to be applied on the derived uncertainties. In the following,
distillation of ensembles or efficient Bayesian methods can we present different measures for quantifying the different
also be taken into consideration. Within the different types predicted types of uncertainty. In general, the correctness
of Bayesian approaches, the performance, the computational and trustworthiness of these uncertainties is not automatically
effort, and the implementation effort still vary strongly. given. In fact, there are several reasons why evaluating the
Laplace approximations are relatively easy to apply and quality of the uncertainty estimates is a challenging task.
compared to sampling approaches much less computational • First, the quality of the uncertainty estimation depends on
effort is needed. Furthermore, there often already exist the underlying method for estimating uncertainty. This is
pretrained networks for an application. In this case, Laplace exemplified in the work undertaken by Yao et al. [256],
Approximation and external deterministic single network which shows that different approximates of Bayesian
approaches can in general be applied to already trained inference (e.g. Gaussian and Laplace approximates) result
networks. in different qualities of uncertainty estimates.
• Second, there is a lack of ground truth uncertainty
Another important aspect that has to be taken into account
estimates [31] and defining ground truth uncertainty
for uncertainty quantification in real life applications is the
estimates is challenging. For instance, if we define the
source and type of uncertainty. For real life applications,
ground truth uncertainty as the uncertainty across human
out-of-distribution detection forms the maybe most important
subjects, we still have to answer questions as ”How many
challenge in order to avoid unexpected decisions of the
subjects do we need?” or ”How to choose the subjects?”.
network and to be aware of adversarial attacks. Especially
since many motivations of uncertainty quantification are • Third, there is a lack of a unified quantitative evaluation
given by risk minimization, methods that deliver risk metric [257]. To be more specific, the uncertainty is
averse predictions are an important field to evaluate. Many defined differently in different machine learning tasks
works already demonstrated the capability of detecting such as classification, segmentation, and regression. For
out-of-distribution samples on several tasks and built a instance, prediction intervals or standard deviations are
strong fundamental tool set for the deployment in real life used to represent uncertainty in regression tasks, while
applications [254], [241], [255], [56]. However, in real entropy (and other related measures) are used to capture
life, the tasks are much more difficult than finding out-of- uncertainty in classification and segmentation tasks.
distribution samples among data sets (e.g., MNIST or CIFAR
data sets etc.) and the main challenge lies in comparing
such approaches on several real-world data sets against A. Evaluating Uncertainty in Classification Tasks
each other. The work of Gustafsson et al. [56] forms a first For classification tasks, the network’s softmax output al-
important step towards an evaluation of methods that better ready represents a measure of confidence. But since the
suits the demands in real life applications. Interestingly, they raw softmax output is neither very reliable [67] nor can it
found for their tests ensembles to outperform the considered represent all sources of uncertainty [19], further approaches
Bayesian approaches. This indicates, that the multi-mode and corresponding measures were developed.
22
1) Measuring Data Uncertainty in Classification Tasks: between two given probability distributions. The EKL can be
Consider a classification task with K different classes and used to measure the (expected) divergence among the possible
a probability vector network output p(x) for some input softmax outputs,
sample x. In the following p is used for simplification and "K #
pk stands for the k-th entry in the vector. In general, the
X p̂i
Eθ∼p(θ|D) [KL(p̂ || p)] = Eθ∼p(θ|D) p̂i log ,
given prediction p represents a categorical distribution, i.e. it i=1
pi
assigns a probability to each class to be the correct prediction. (32)
Since the prediction is not given as an explicit class but
as a probability distribution, (un)certainty estimates can be which can also be interpreted as a measure of uncertainty
directly derived from the prediction. In general this pointwise on the model’s output and therefore represents the model
prediction can be seen as estimated data uncertainty [60]. uncertainty.
However, as discussed in Section II, the model’s estimation of The predictive variance evaluates the variance on the (random)
the data uncertainty is affected by model uncertainty, which softmax outpus, i.e.
has to be taken into account separately. In order to evaluate
h i
2
σ(p) = Eθ∼p(θ|D) (p − p̂) . (33)
the amount of predicted data uncertainty, one can for example
apply the maximal class probability or the entropy measures: As described in Section III, an analytically described pos-
Maximal probability: pmax =
K
max {pk }k=1 (28) terior distribution p(θ|D) is only given for a subset of the
Bayesian methods. And even for an analytically described
K distribution, the propagation of the parameter uncertainty into
X
Entropy: H(p) = − pk log2 (pk ) (29) the prediction is in almost all cases intractable and has to
k=1 be approximated for example with Monte Carlo approxima-
tion. Similarly, ensemble methods collect predictions from M
The maximal probability represents a direct representation
neural networks, and test-time data augmentation approaches
of certainty, while entropy describes the average level of
receive M predictions from M different augmentations applied
information in a random variable. Even though a softmax
to the original input
sample. For all these cases, we receive a
output should represent the data uncertainty, one cannot tell M
from a single prediction how large the amount of model set of M samples, pi i=1 , which can be used to approximate
uncertainty is that affects this specific prediction as well. the intractable or even undefined underlying distribution. With
2) Measuring Model Uncertainty in Classification Tasks: these approximations, the measures defined in (31), (32), and
As already discussed in Section III, a single softmax prediction (33) can be applied straight forward and only the expectation
is not a very reliable way for uncertainty quantification since has to be replaced by average sums. For example, the expected
it is often badly calibrated [19] and does not have any infor- softmax output becomes
mation about the certainty the model itself has on this specific M
1 X i
output [19]. An (approximated) posterior distribution p(θ|D) p̂ ≈ p .
M i=1
on the learned model parameters can help to receive better
uncertainty estimates. With such a posterior distribution, the For the expectations given in (31), (32), and (33), the expec-
softmax output itself becomes a random variable and one can tation is approximated similarly.
evaluate its variation, i.e. uncertainty. For simplicity, we denote 3) Measuring Distributional Uncertainty in Classification
p(y|θ, x) also as p and it will be clear from context whether p Tasks:
depends on θ or not. The most common measures for this are Although these uncertainty measures are widely used to cap-
the mutual information (MI), the expected Kullback-Leibler ture the variability among several predictions derived from
Divergence (EKL), and the predictive variance. Basically, all Bayesian neural networks [60], ensemble methods [31], or
these measures compute the expected divergence between the test-time data augmentation methods [14], they cannot capture
(stochastic) softmax output and the expected softmax output distributional shifts in the input data or out-of-distribution
p̂ = Eθ∼p(θ|D) [p(y|x, θ] . (30) examples, which could lead to a biased inference process and
a falsely stated confidence. If all predictors attribute a high
The MI uses the entropy to measure the mutual dependence probability mass to the same (false) class label, this induces a
between two variables. In the described case, the difference low variability among the estimates. Hence, the network seams
between the information given in the expected softmax output to be certain about its prediction, while the uncertainty in the
and the expected information in the softmax output is com- prediction itself (given by the softmax probabilities) is also
pared, i.e. evaluated to be low. To tackle this issue, several approaches
described in Section III take the magnitude of the logits into
MI (θ, y|x, D) = H [p̂] − Eθ∼p(θ|D) H [p(y|x, θ)] . (31)
account, since a larger logit indicates larger evidence for the
Smith and Gal [19] pointed out that the MI is minimal when corresponding class [44]. Thus, the methods either interpret the
the knowledge about model parameters does not increase the total sum of the (exponentials of) the logits as precision value
information in the final prediction. Therefore, the MI can be of a Dirichlet distribution (see description of Dirichlet Priors
interpreted as a measure of model uncertainty. in Section III-A) [32], [94], [64], or as a collection of evidence
The Kullback-Leibler divergence measures the divergence that is compared to a defined constant [44], [92]. One can also
23
derive a total class probability for each class individually by III, a common approach to overcome this is to let the network
applying the sigmoid function to each logit [105]. Based on predict the parameters of a probability distribution, for exam-
the class-wise total probabilities, OOD samples might easier ple a mean vector and a standard deviation for a normally
be detected, since all classes can have low probability at the distributed uncertainty [31], [60]. Doing so, a measure of data
same time. Other methods deliver an explicit measure how uncertainty is directly given. The prediction of the standard
well new data samples suit into the training data distribution. deviation allows an analytical description that the (unknown)
Based on this, they also give a measure that a sample will be true value is within a specific region. The interval that covers
predicted correctly [36]. the true value with a probability of α (under the assumption
4) Performance Measure on Complete Data Set: that the predicted distribution is correct) is given by
While the measures described above measure the performance
1 1
of individual predictions, others evaluate the usage of these ŷ − Φ−1 (α) · σ; ŷ + Φ−1 (α) · σ (34)
2 2
measures on a set of samples. Measures of uncertainty can
be used to separate between correctly and falsely classified where Φ−1 is the quantile function, the inverse of the cumu-
samples or between in-domain and out-of-distribution samples lative probability function. For a given probability value α the
[67]. For that, the samples are split into two sets, for example quantile function gives a boundary, such that 100 · α% of a
in-domain and out-of-distribution or correctly classified and standard normal distribution’s probability mass is on values
falsely classified. The two most common approaches are smaller than Φ−1 (α). Quantiles assume some probability
the Receiver Operating Characteristic (ROC) curve and the distribution and interpret the given prediction as the expected
Precision-Recall (PR) curve. Both methods generate curves value of the distribution.
based on different thresholds of the underlying measure. For In contrast to this, other approaches [259], [260] directly
each considered threshold, the ROC curve plots the true predict a so called prediction interval (PI)
positive rate against the false positive rate3 , and the PR curve
P I(x) = [Bl , Bu ] (35)
plots the precision against the recall4 . While the ROC and
PR curves give a visual idea of how well the underlying in which the prediction is assumed to lay. Such intervals
measures are suited to separate the two considered test cases, induce an uncertainty as a uniform distribution without giving
they do not give a qualitative measure. To reach this, the area a concrete prediction. The certainty of such approaches can,
under the curve (AUC) can be evaluated. Roughly speaking, as the name indicates, be directly measured by the size of
the AUC gives a probability value that a randomly chosen the predicted interval. The Mean Prediction Interval Width
positive sample leads to a higher measure than a randomly (MPIW) can be used to evaluate the average certainty of
chosen negative example. For example, the maximum softmax the model [259], [260]. In order to evaluate the correctness
values measure ranks of correctly classified examples higher of the predicted intervals the Prediction Interval Coverage
than falsely classified examples. Hendrycks and Gimpel [67] Probability (PICP) can be applied [259], [260]. The PCIP
showed for several application fields that correct predictions represents the percentage of test predictions that fall into a
have in general a higher predicted certainty in the softmax prediction interval and is defined as
value than false predictions. Especially for the evaluation of c
in-domain and out-of-distribution examples, the Area Under PICP = , (36)
n
Receiver Operating Curve (AUROC) and the Area Under where n is the total number of predictions and c the number of
Precision Recall Curce (AUPRC) are commonly used [64], ground truth values that are actually captured by the predicted
[32], [94]. The clear weakness of these evaluations is the fact intervals.
that the performance is evaluated and the optimal threshold is 2) Measuring Model Uncertainty in Regression Predictions:
computed based on a given test data set. A distribution shift
from the test set distribution can ruin the whole performance In Section II, it is described, that model uncertainty is mainly
and make the derived thresholds impractical. caused by the model’s architecture, the training process, and
underrepresented areas in the training data. Hence, there is no
B. Evaluating Uncertainty in Regression Tasks real difference in the causes and effects of model uncertainty
between regression and classification tasks such that model
1) Measuring Data Uncertainty in Regression Predictions:
uncertainty in regression tasks can be measured equivalently
In contrast to classification tasks, where the network typically
as already described for classification tasks, i.e. in most cases
outputs a probability distritution over the possible classes,
by approximating an average prediction and measuring the
regression tasks only predict a pointwise estimation without
divergence among the single predictions [60].
any hint of data uncertainty. As already described in Section
3 The true positive rate is the number of samples, which are correctly
C. Evaluating Uncertainty in Segmentation Tasks
predicted as positive divided by the total number of true samples. The false
positive rate is the number of samples falsely predicted as positive divided The evaluation of uncertainties in segmentation tasks is
by the total number of negative samples (see also [258]) very similar to the evaluation for classification problems. The
4 The precision is equal to the number of samples that are correctly classified
uncertainty is estimated in segmentation tasks using approxi-
as positive, divided by the total number of positive samples. The recall is equal
to the number of samples correctly predicted as positive divided by the total mates of Bayesian inference [1], [2], [4], [152], [261], [262],
number of positive samples (see also [258]) [263], [264] or test-time data augmentation techniques [249].
24
In the context of segmentation, the uncertainty in pixel wise Several methods for uncertainty estimation presented in Sec-
segmentation is measured using confidence intervals [4], [152], tion III also improve the networks calibration [31], [20]. This
the predictive variance [262], [264], the predictive entropy is clear, since these methods quantify model and data uncer-
[2], [249], [261], [263] or the mutual information [1]. The tainty separately and aim at reducing the model uncertainty
uncertainty in structure (volume) estimation is obtained by av- on the predictions. Besides the methods that improve the
eraging over all pixel-wise uncertainty estimates [264], [261]. calibration by reducing the model uncertainty, a large and
The quality of volume uncertainties is assessed by evaluating growing body of literature has investigated methods for explic-
the coefficient of variation, the average Dice score or the itly reducing calibration errors. These methods are presented in
intersection over union [2], [249]. These metrics measure the the following, followed by measures to quantify the calibration
agreement in area overlap between multiple estimates in a pair- error. It is important to note that these methods do not reduce
wise fashion. Ideally, a false segmentation should result in an the model uncertainty, but propagate the model uncertainty
increase in pixel-wise and structure uncertainty. To evaluate onto the representation of the data uncertainty. For example,
whether this is the case, Nair et al. [1] evaluated the pixel- if a binary classifier is overfitted and predicts all samples of
level true positive rate and false detection rate as well as the a test set as class A with probability 1, while half of the test
ROC curves for the retained pixels at different uncertainty samples are actually class B, the recalibration methods might
thresholds. Similar to [1], McClure et al. [261] also analyzed map the network output to 0.5 in order to have a reliable
the area under the ROC curve. confidence. This probability of 0.5 is not equivalent to the data
uncertainty but represents the model uncertainty propagated
V. C ALIBRATION onto the predicted data uncertainty.
A predictor is called well-calibrated if the derived predictive
confidence represents a good approximation of the actual A. Calibration Methods
probability of correctness [15]. Therefore, in order to make
Calibration methods can be classified into three main groups
use of uncertainty quantification methods, one has to be sure
according to the step when they are applied:
that the network is well calibrated. Formally, for classification
tasks a neural network fθ is calibrated [265] if it holds that • Regularization methods applied during the training phase
[268], [269], [11], [270], [271]
N X K
X yi,k · I{fθ (xi )k = p} N →∞ These methods modify the objective, optimization and/or
∀p ∈ [0, 1] : −−−−→ p .
I{fθ (xi )k = p} regularization procedure in order to build DNNs that are
i=1 k=1
(37) inherently calibrated.
Here, I{·} is the indicator function that is either 1 if the • Post-processing methods applied after the training pro-
condition is true or 0 if it is false and yi,k is the k-th entry in cess of the DNN [15], [47]
the one-hot encoded groundtruth vector of a training sample These methods require a held-out calibration data set to
(xi , yi ). This formulation means that for example 30% of adjust the prediction scores for recalibration. They only
all predictions with a predictive confidence of 70% should work under the assumption that the distribution of the
actually be false. For regression tasks the calibration can be left-out validation set is equivalent to the distribution,
defined such that predicted confidence intervals should match on which inference is done. Hence, also the size of the
the confidence intervals empirically computed from the data validation data set can influence the calibration result.
set [265], i.e. • Neural network uncertainty estimation methods
N Approaches, as presented in Section III, that reduce
X I {yi ∈ confp (fθ (xi ))} N →∞
∀p ∈ [0, 1] : −−−−→ p, (38) the amount of model uncertainty on a neural network’s
i=1
N confidence prediction, also lead to a better calibrated
where confp is the confidence interval that covers p percent predictor. This is because the remaining predicted data
of a distribution. uncertainty better represents the actual uncertainty on
A DNN is called under-confident if the left hand side of (37) the prediction. Such methods are based for example on
and (38) are larger than p. Equivalently, it is under-confident Bayesian methods [272], [209], [273], [274], [16] or deep
if the terms are smaller than p. The calibration property of a ensembles [31], [275].
DNN can be visualized using a reliability diagram, as shown In the following, we present the three types of calibration
in Figure 8. methods in more detail.
In general, calibration errors are caused by factors related 1) Regularization Methods: Regularization methods for
to model uncertainty [15]. This is intuitively clear, since as calibrating confidences manipulate the training of DNNs by
discussed in Section II, data uncertainty represents the under- modifying the objective function or by augmenting the training
lying uncertainty that an input x and a target y represent the data set. The goal and idea of regularization methods is very
same real world information. Following, correctly predicted similar to the methods presented in Section III-A where the
data uncertainty would lead to a perfectly calibrated neural methods mainly quantify model and data uncertainty sepa-
network. In practice, several works pointed out that deeper rately within a single forward pass. However, the methods
networks tend to be more overconfident than shallower ones in Section III-A quantify the model and data uncertainty,
[15], [266], [267]. while these calibration methods are regularized in order to
25
Fig. 8: (a) Reliability diagram showing an overconfident classifier: The bin-wise accuracy is smaller than the corresponding
confidence. (b) Reliability diagram of an underconfident classifier: The bin-wise accuracy is larger than the corresponding
confidence. (c) Reliability diagram of a well calibrated classifier: The confidence fits the actual accuracy for the single bins.
minimize the model uncertainty. Following, at inference, the combination of two contradictive loss functions,
model uncertainty cannot be obtained anymore. This is the N
(i) (i)
X
main motivation for us to separate the approaches presented LVWCI (θ) = − (1 − αi )LGT (θ) + αi LU (θ) , (41)
below from the approaches presented in Section III-A. i=1
One popular regularization based calibration method is label (i)
where LGT is the mean cross-entropy computed for the training
smoothing [268]. For label smoothing, the labels of the train-
sample xi with given ground-truth yi . LU represents the
ing examples are modified by taking a small portion α of the
mean KL-divergence between a uniform target probability
true class’ probability mass and assign it uniformly to the false
vector and the computed prediction. The adaptive smoothing
classes. For hard, non-smoothed labels, the optimum cannot be
parameter αi pushes predictions of training samples with high
reached in practice, as the gradient of the output with respect
model uncertainty (given by high variances) towards a uniform
to the logit vector z,
distribution while increasing the prediction scores of samples
∇z CE(y, ŷ(z)) = softmax(z) − y with low model uncertainty. As a result, variances in the
predictions of a single sample are reduced and the network
exp(z) (39)
= PK −y , can then be applied with a single forward pass at inference.
i=1 exp(zi ) Pereyra et al. [269] combated the overconfidence issue by
adding the negative entropy to the standard loss function and
can only converge to zero with increasing distance between
therefore a penalty that increases with the network’s predicted
the true and false classes’ logits. As a result, the logits of the
confidence. This results in the entropy-based objective function
correct class are much larger than the logits for the incorrect
LH , which is defined as
classes and the logits of the incorrect classes can be very
N
different to each other. Label-smoothing avoids this and while 1 X
it generally leads to a higher training loss, the calibration error LH (θ) = − yi log ŷi − αi H(ŷi ) , (42)
N i=1
decreases and the accuracy often increases as well [270].
Seo et al. [266] extended the idea of label smoothing and where H(ŷi ) is the entropy of the output and αi a parameter
directly aimed at reducing the model uncertainty. For this, that controls the strength of the entropy-based confidence
they sampled T forward passes of a stochastic neural network penalty. The parameter αi is computed equivalently as for the
already at training time. Based on the T forward passes of VWCI loss.
a training sample (xi , yi ), a normalized model variance αi Instead of regularizing the training process by modifying
is derived as the mean of the Bhattacharyya coefficients [281] the objective function, Thulasidasan et al. [278] regularized it
between the T individual by using a data-agnostic data augmentation technique named
PT predictions ŷ1 , ..., ŷT and the average mixup [282]. In mixup training, the network is not only trained
prediction ȳ = T1 t=1 ŷt ,
on the training data, but also on virtual training samples (x̃, ỹ)
T generated by a convex combination of two random training
1X
αi = BC(ȳi , ŷi,t ) pairs (xi , yi ) and (xj , yj ), i.e.
T t=1
T K
(40) x̃ = λxi + (1 − λ)xj (43)
1 XXp
= ȳi,k · ŷi,t,k . ỹ = λyi + (1 − λ)yj . (44)
T t=1
k=1
According to [278], the label smoothing resulting from mixup
Based on this αi , Seo et al. [266] introduced the variance- training can be viewed as a form of entropy-based regular-
weighted confidence-integrated loss function that is a convex ization resulting in inherent calibration of networks trained
26
Ensemble
Objective
of Post-
Function
Processing
Modification11
Models1
Methods
Post-Processing Regularization
Methods Methods
Histogram Label
Binning3 Smoothing9
Exposure
Temperature
to OOD
Scaling4 Uncertainty examples8
Estimation
Approaches
Ensemble Bayesian
of Neural Neural
Networks5 Networks6
1
[48] 2 [47] 3 [276] 4 [15], [68], [274] 5 [31], [275] 6 [272], [209], [273], [274], [16] 8
[277] 9
[268], [270]
10
[278], [279], [234], [280] 11 [269], [266], [11], [271], [279]
Fig. 9: Visualization of the different types of uncertainty calibration methods presented in this paper.
with mixup. Maroñas et al. [279] see mixup training among only on the unmixed samples:
the most popular data augmentation regularization techniques
due to its ability to improve the calibration as well as the 1 X b
LECE (θ) = L (θ) + βECEb , (45)
accuracy. However, they argued that in mixup training the B
b∈B
data uncertainty in mixed inputs affects the calibration and
therefore mixup does not necessarily improve the calibra- where Lb (θ) is the original unregularized loss using training
tion. They also underlined this claim empirically. Similarly, and mixed samples included in batch b and β is a hyper-
Rahaman and Thiery [234] experimentally showed that the parameter controlling the relative importance given to the
distributional-shift induced by data augmentation techniques batchwise expected calibration error ECEb . By adding the
such as mixup training can negatively affect the confidence batchwise calibration error for each batch b ∈ B to the
calibration. Based on this observation, Maroñas et al. [279] standard loss function, the miscalibration induced by mixup
proposed a new objective function that explicitly takes the training is regularized.
calibration performance on the unmixed input samples into In the context of data augmentation, Patel et al. [280]
account. Inspired by the expected calibration error (ECE, see improved the calibration of uncertainty estimates by using on-
Section V-B) Naeini et al. [283] measured the calibration manifold data augmentation. While mixup training combines
performance on the unmixed samples for each batch b by the training samples, on-manifold adversarial training generate
differentiable squared differences between the batch accuracy out-of-domain samples using adversarial attack. They experi-
and the mean confidence on the batch samples. The total loss is mentally showed that on-manifold adversarial training outper-
given as a weighted combination of the original loss on mixed forms mixup training in improving the calibration. Similar to
and unmixed samples and the calibration measure evaluated [280], Hendrycks et al. [277] showed that exposing classifiers
to out-of-distribution examples at training can help to improve
the calibration.
27
2) Post-Processing Methods: Post-processing (or post-hoc) Wenger et al. [47] proposed a Gaussian process (GP)
methods are applied after the training process and aim at based method, which can be used to calibrate any multi-class
learning a re-calibration function. For this, a subset of the classifier that outputs confidence values and presented their
training data is held-out during the training process and used methodology by calibrating neural networks. The main idea
as a calibration set. The re-calibration function is applied to the behind their work is to learn the calibration map by a Gaussian
network’s outputs (e.g. the logit vector) and yields an improved process that is trained on the networks confidence predictions
calibration learned on the left-out calibration set. Zhang et al. and the corresponding ground-truths in the left out calibration
[48] discussed three requirements that should be satisfied by set. For this approach, the preservation of the predictions is
post-hoc calibration methods. They should also not assured.
1) preserve the accuracy, i.e. should not affect the predic- 3) Calibration with Uncertainty Estimation Approaches:
tors performance. As already discussed above, removing the model uncertainty
2) be data efficient, i.e. only a small fraction of the training and receiving an accurate estimation of the data uncertainty
data set should be left out for the calibration. leads to a well calibrated predictor. Following several works
3) be able to approximate the correct re-calibration map as based on deep ensembles [31], [275] and BNNs, [272], [209],
long as there is enough data available for calibration. [204] also compared their performance to other methods based
Furthermore, they pointed out that none of the existing ap- on the resulting calibration. Lakshminarayanan et al. [31]
proaches fulfills all three requirements. and Mehrtash et al. [275] reported an improved calibration
For classification tasks, the most basic but still very efficient by applying deep ensembles compared to single networks.
way of post-hoc calibration is temperature scaling [15]. For However, Rahaman and Thiery[234] showed that for specific
temperature scaling, the temperature T > 0 of the softmax configurations as the usage of mixup-regularization, deep
function ensembles can even increase the calibration error. On the
expzi /T other side they showed that applying temperature scaling on
softmax(zi ) = PK , (46) the averaged predictions can give a significant improvement
zj /T
j=1 exp
on the calibration.
is optimized. For T = 1 the function remains the regular For the Bayesian approaches, [204] showed that restricting
softmax function. For T > 1 the output changes such that its the Bayesian approximation to the weights of the last fully
entropy increases, i.e. the predicted confidence decreases. For connected layer of a DNN is already enough to improve the
T ∈ (0, 1) the entropy decreases and following, the predicted calibration significantly. Zhang et al. [273] and Laves et al.
confidence increases. As already mentioned above, a perfect [274] showed that confidence estimates computed with MC
calibrated neural network outputs MAP estimates. Since the dropout can be poorly calibrated. To overcome this, Zhang
learned transformation can only affect the uncertainty, the et al. [273] proposed structured dropout, which consists
log-likelihood based losses as cross-entropy do not have to of dropping channel, blocks or layers, to promote model
be replaced by a special calibration loss. While the data diversity and reduce calibration errors.
efficiency and the preservation of the accuracy is given, the
expressiveness of basic temperature scaling is limited [48]. To
overcome this, Zhang et al. [48] investigated an ensemble of B. Evaluating Calibration Quality
several temperature scaling models. Doing so, they achieved
better calibrated predictions, while preserving the classification Evaluating calibration consists of measuring the statistical
accuracy and improving the data efficiency and the expressive consistency between the predictive distributions and the ob-
power. Kull et al. [284] were motivated by non-neural network servations [286]. For classification tasks, several calibration
calibration methods, where the calibration is performed class- measures are based on binning. For that, the predictions are
wise as a one-vs-all binary calibration. They showed that this ordered by the predicted confidence p̂i and grouped into M
approach can be interpreted as learning a linear transforma- bins b1 , ..., bM . Following, the calibration of the single bins is
tion of the predicted log-likelihoods followed by a softmax evaluated by setting the average bin confidence
function. This again is equivalent to train a dense layer on the 1 X
conf(bm ) = p̂s (47)
log-probabilities and hence the method is also very easy to |bm |
s∈bm
implement and apply. Obviously, the original predictions are
not guaranteed to be preserved. in relation to the average bin accuracy
Analogous to temperature scaling for classification net- 1 X
works, Levi et al. [285] introduced standard deviation scaling acc(bm ) = 1(ŷs = ys ) , (48)
|bm |
s∈bm
(std-scaling) for regression networks. As the name indicates,
the method is trained to rescale the predicted standard de- where ŷs , ys and p̂s refer to the predicted and true class
viations of a given network. Equivalently to the motivation label of a sample s. As noted in [15], confidences are well-
of optimizing temperature scaling with the cross-entropy loss, calibrated when for each bin acc(bm ) = conf(bm ). For a visual
std-scaling can be trained using the Gaussian log-likelihood evaluation of a model’s calibration, the reliability diagram
function as loss, which is in general also used for the training introduced by [287] is widely used. For a reliability diagram,
of regression networks, which also give a prediction for the the conf(bm ) is plotted against acc(bm ). For a well-calibrated
data uncertainty. model, the plot should be close to the diagonal, as visualized
28
in Figure 8. The basic reliability diagram visualization does VI. DATA S ETS AND BASELINES
not distinguish between different classes. In order to do so and
hence to improve the interpretability of the calibration error, In this section, we collect commonly used tasks and data
Vaicenavicius et al. [286] used an alternative visualization sets for evaluating uncertainty estimation among existing
named multidimensional reliability diagram. works. Besides, a variety of baseline approaches commonly
For a quantitative evaluation of a model’s calibration, dif- used as comparison against the methods proposed by the
ferent calibration measures can be considered. researchers are also presented. By providing a review on the
The Expected Calibration Error (ECE) is a widely used relevant information of these experiments, we hope that both
binning based calibration measure [283], [15], [274], [275], researchers and practitioners can benefit from it. While the
[278], [47]. For the ECE, M equally-spaced bins b1 , ..., bM former can gain a basic understanding of recent benchmarks
are considered, where bm denotes the set of indices of samples tasks, data sets and baselines so that they can design ap-
whose confidences fall into the interval Im =] m−1 m propriate experiments to validate their ideas more efficiently,
M , M ]. The
ECE is then computed as the weighted average of the bin-wise the latter might use the provided information to select more
calibration errors, i.e. relevant approaches to start based on a concise overview on the
tasks and data sets on which the approach has been validated.
M
X |bm | In the following, we will introduce the data sets and
ECE = |acc(bm ) − conf(bm )| . (49)
N baselines summarized in table IV according to the taxonomy
m=1
used throughout this review.
For the ECE, only the predicted confidence score (top-label) The structure of the table is designed to organize the main
is considered. In contrast to this, the Static Calibration Error contents of this section concisely, hoping to provide a clear
(SCE) [288], [289] considers the predictions of all classes (all- overview of the relevant works. We group the approaches of
labels). For each class, the SCE computes the calibration error each category into one of four blocks and extract the most
within the bins and then averages across all the bins, i.e. commonly used tasks, data sets and provided baselines for
K M
1 X X |bmk | each column respectively. The corresponding literature is listed
SCE = |conf(bmk ) − acc(bmk )| . (50) at the bottom of each block to facilitate lookup. Note that we
K N
m=1
k=1 focus on methodological comparison here, but not the choice
Here conf (bmk ) and acc(bmk ) are the confidence and accu- of architecture for different methods which has an impact on
racy of bin bm for class label k, respectively. Nixon et al. performance as well. Due to the space limitation and visual
[288] empirically showed that all-labels calibration measures density, we only show the most important elements (task, data
such as the SCE are more effective in assessing the calibration set, baselines) ranked according to the frequency of use in the
error than the top-label calibration measures as the ECE. literature we have researched.
In contrast to the ECE and SCE, which group predictions The main results are as follows. One of the most frequent
into M equally-spaced bins (what in general leads to different tasks for evaluating uncertainty estimation methods are re-
numbers of evaluation samples per bin), the adaptive calibra- gression tasks, where samples close and far away from the
tion error [288], [289] adaptively groups predictions into R training distribution are studied. Furthermore, the calibration
bins with different width but equal number of predictions. With of uncertainty estimates in the case of classification problems
this adaptive bin size, the adaptive Expected Calibration Error is very often investigated. Further noteworthy tasks are out-
(aECE) of-distribution (OOD) detection and robustness against adver-
R
sarial attacks. In the medical domain, calibration of semantic
1 X segmentation results is the predominant use case.
aECE = |conf(br ) − acc(br )| , (51)
R r=1 The choice of data sets is mostly consistent among all
reviewed works. For regression, toy data sets are employed for
and the adaptive Static Calibration Error (aSCE)
visualization of uncertainty intervals while the UCI data sets
K R are studied in light of (negative) log-likelihood comparison.
1 XX
aSCE = |conf(brk ) − acc(brk )| (52) The most common data sets for calibration and OOD detection
KR r=1
k=1 are MNIST, CIFAR10 and 100 as well as SVHN while
are defined as extensions of the ECE and the SCE. ImageNet and its tiny variant are also studied frequently. These
As has been empirically shown in [280] and [288], the adaptive form distinct pairs when OOD detection is studied where
binning calibration measures aECE and aSCE are more robust models trained on CIFAR variants are evaluated on SVHN
to the number of bins than the corresponding equal-width and visa versa while MNIST is paired with variants of itself
binning calibration measures ECE and SCE. like notMNIST and FashionMNIST. Classification data sets
It is important to make clear that in a multi-class setting, are also commonly distorted and corrupted to study the effects
the calibration measures can suffer from imbalance in the on calibration, blurring the line between OOD detection and
test data. Even when then calibration is computed classwise, adversarial attacks.
the computed errors are weighted by the number of samples Finally, the most commonly used baselines by far are Monte
in the classes. Following, larger classes can shadow the bad Carlo (MC) Dropout and deep ensembles while the softmax
calibration on small classes, comparable to accuracy values in output of deterministic models is almost always employed
classification tasks [290]. as a kind of surrogate baseline. It is interesting to note that
29
TABLE IV: Overview of frequently compared benchmark approaches, tasks and their data sets among existing works organized
according to the taxonomy of this paper.
Tasks Task index: Data sets Baselines
Single Deterministic 1. Regression21-23 1. Toy7 , UCI, NYU Depth Softmax44 , GAN27 , Dirichlet48 , BBB3 ,
Models 2. Calibration21-23,39,40,41,43 2, 3: (E/Fashion/not)MNIST, MCdropout1 , DeepEnsemble2,57 ,
3. OOD Detection21,22,38,39,41-53 Toy, CIFAR10/100, SVHN, Mahalanobis56 , TemperatureScaling38,54
4. Adversarial Attacks21,41,48 LSUN, (Tiny)ImageNet, IMDB, NormalizingFlow4
Diabetic Retinopathy, Om-
niglot
4: MNIST, CIFAR10, NYU
Depth, Omniglot
21 [101] 22 [103] 23 [104] 38 [68] 39 [143] 40 [98] 41 [43] 42 [96] 43 [95] 44 [67] 45 [64] 46 [92] 47 [72] 48 [44] 49 [94] 50 [35] 51 [36] 52 [46] 53 [71]
inside each approach–BNNs, Ensembles, Single Deterministic 4), NormalizingFlow (TensorFlow, PyTorch), PBP, SWAG (1* ,
Models and Input Augmentation–some baselines are preferred 2), KFAC (PyTorch: 1, 2, 3; TensorFlow), DVI (TensorFlow* ,
over others. BNNs are most frequently compared against PyTorch), HMC, VOGN* , INF* , MFVI, SGLD, Tempera-
variational inference methods like Bayes’ by Backprop (BBB) tureScaling (1* , 2, 3), GAN* , Dirichlet and Mahalanobis* .
or Probabilistic Backpropagation (PBP) while for Single De-
terministic Models it is more common to compare them against VII. A PPLICATIONS OF U NCERTAINTY E STIMATES
distance-based methods in the case of OOD detection. Overall, From a practical point of view, the main motivation for
BNN methods show a more diverse set of tasks considered quantifying uncertainties in DNNs is to be able to classify the
while being less frequently evaluated on large data sets like received predictions and to make more confident decisions.
ImageNet. This section gives a brief overview and examples of the
To further facilitate access for practitioners, we provide aforementioned motivations. In the first part, we discuss how
web-links to the authors’ official implementations (marked by uncertainty is used within active learning and reinforcement
a star) of all common baselines as identified in the baselines learning. Subsequently, we discuss the interest of the commu-
column. Where no official implementation is provided, we nities working on domain fields like medical image analysis,
instead link to the highest ranked implementations found on robotics, and earth observation. These application fields are
GitHub at the time of this survey. The list can be also found used representatively for the large number of domains where
within our GitHub repository on available implementations5 . the uncertainty quantification plays an important role. The
The relevant baselines are Softmax* (TensorFlow, PyTorch), challenges and concepts could (and should) be transferred to
MCdropout (TensorFlow* ; PyTorch: 1, 2), DeepEnsembles any application domain of interest.
(TensorFlow: 1, 2, 3; PyTorch: 1, 2), BBB (PyTorch: 1, 2, 3, 1) Active Learning: The process of collecting labeled data
for supervised training of a DNN can be laborious, time-
5 https://github.com/JakobCode/UncertaintyInNeuralNetworks Resources consuming, and costly. To reduce the annotation effort, the
30
Predictor Agent
Pool with Test neural
network
unlabeled data S
Observed t
state a Distribution
t of actions
e
Retrain neural Update
network parameters
Labeled
training data
Environment Receive
reward
Uncertain test
Chose and label data predictions Apply
(aquisition function) action
Fig. 10: The active learning framework: The acquisition Fig. 11: The reinforcement learning framework: The agent
function evaluates the uncertainties on the network’s test interacts with the environment by executing a specific action
predictions in order to select unlabelled data. The selected data influencing the next state of the agent. The agent observes a re-
are labelled and added to the pool of labelled data, which is ward representing the cost associated with the executed action.
used to train and improve the performance of the predictor. The agent chooses actions based on a policy learned by a deep
neural network. However, the predicted uncertainty associated
with the action predicted by the deep neural network can help
the agent to decide weather to execute the predicted action or
active learning framework shown in Figure 10 trains the DNN not.
sequentially on different labelled data sets increasing in size
over time [292]. In particular, given a small labelled data set
and a large unlabeled data set, a deep neural network trained A. Uncertainty in Real-World Applications
in the setting of active learning learns from the small labeled
data set and decides based on the acquisition function, which With increasing usage of deep learning approaches within
samples to select from the pool of unlabeled data. The selected many different fields, quantifying and handling uncertainties
data are added to the training data set and a new DNN is has become more and more important. On one hand, uncer-
trained on the updated training data set. This process is then tainty quantification plays an important role in risk minimiza-
repeated with the training set increasing in size over time. tion, which is needed in many application fields. On the other
Uncertainty sampling is one most popular criteria used in hand, many fields offer only challenging data sources, which
acquisition functions [293] where predictive uncertainty de- are hard to control and verify. This makes the generation
termines which training samples have the highest uncertainty of trust-worthy ground truth a very challenging task. In the
and should be labelled next. Uncertainty based active learning following, three different fields where uncertainty plays an
strategies for deep learning applications have been successfully important role are presented, namely Autonomous Driving,
used in several works [23], [24], [294], [25], [26]. medical image analysis, and earth observation.
1) Medical Analysis: Since the size, shape, and location
2) Reinforcement Learning: The general framework of of many diseases vary largely across patients, the estimation
deep reinforcement learning is shown in Figure 11. In the of the predictive uncertainty is crucial in analyzing medical
context of reinforcement learning, uncertainty estimates can be images in applications such as lesion detection [1], [3], lung
used to solve the exploration-exploitation dilemma. It says that node segmentation [296], brain tumor segmentation [152],
uncertainty estimates can be used to effectively balance the [248], [249], [2], [261], parasite segmentation in images of
exploration of unknown environments with the exploitation of liver stage malaria [262], recognition of abnormalities on chest
existing knowledge extracted from known environments. For radiographs [297], and age estimation [6]. Here, uncertainty
example, if a robot interacts with an unknown environment, the estimates in particular improve the interpretability of decisions
robot can safely avoid catastrophic failures by reasoning about of DNNs [298]. They are essential to understand the reliability
its uncertainty. To estimate the uncertainty in this framework, of segmentation results, to detect false segmented areas and
Huang et al. [27] used an ensemble of bootstrapped models to guide human experts in the task of refinement [249]. Well-
(models trained on different data sets sampled with replace- calibrated and reliable uncertainty estimates allow clinical
ment from the original data set), while Gal and Ghahramani experts to properly judge whether an automated diagnosis can
[20] approximated Bayesian inference via dropout sampling. be trusted [298]. Uncertainty was estimated in medical image
Inspired by [20] and [27], Kahn et al. [28] and Lötjens et segmentation based on Monte Carlo dropout [152], [296], [1],
al. [29] used a mixture of deep Bayesian networks performing [2], [3], [263], [4], [5], [6], spike-and-slab dropout [261], and
dropout sampling on an ensemble of bootstrapped models. For spatial dropout [262]. Wang et al. [248], [249] used test time
further reading, Ghavamzadeh et al. [295] presented a survey data augmentation to estimate the data-dependent uncertainty
of Bayesian reinforcement learning. in medical image segmentation.
31
2) Robotics: Robots are active agents that perceive, decide, deep learning based localization systems along with uncer-
plan, and act in the real-world – all based on their incomplete tainty estimates. Other approaches also learn on the robots’
knowledge about the world. As a result, mistakes of the robots past experiences of failures or detect inconsistencies of the
not only cause failures of their own mission, but can endanger predictors [332], [333]. In summary, the robotics community
human lives, e.g. in case of surgical robotics, self-driving cars, has been both, the users and the developers of the uncertainty
space robotics, etc. Hence, the robotics application of deep estimation frameworks targeted to a specific problems.
learning poses unique research challenges that significantly Yet, robotics pose several unique challenges to uncertainty
differ from those often addressed in computer vision and estimation methods for DNNs. These are for example, (i) how
other off-line settings [299]. For example, the assumption to limit the computational burden and build real-time capable
that the testing condition come from the same distribution as methods that can be executed on the robots with limited
training is often invalid in many settings of robotics, resulting computational capacities (e.g. aerial, space robots, etc); (ii)
in deterioration of the performance of DNNs in uncontrolled how to leverage spatial and temporal information, as robots
and detrimental conditions. This raises the questions how we sense sequentially instead of having a batch of training data
can quantify the uncertainty in a DNN’s predictions in order for uncertainty estimates; (iii) whether robots can select the
to avoid catastrophic failures. Answering such questions are most uncertainty samples and update its learner online; (iv)
important in robotics, as it might be a lofty goal to expect data- Whether robots can purposefully manipulate the scene when
driven approaches (in many aspects from control to perception) uncertain. Most of these challenges arise due to the properties
to always be accurate. Instead, reasoning about uncertainty can of robots that they are physically situated systems.
help in leveraging the recent advances in deep learning for 3) Earth Observation(EO): Earth Observation (EO) sys-
robotics. tems are increasingly used to make critical decisions related
Reasoning about uncertainties and the use of probabilistic to urban planning [334], resource management [335], disaster
representations, as oppose to relying on a single, most-likely response [336], and many more. Right now, there are hundreds
estimate, have been central to many domains of robotics of EO satellites in space, owned by different space agencies
research, even before the advent of deep learning [300]. and private companies. Figure 12 shows the satellites owned
In robot perception, several uncertainty-aware methods have by the European Space Agency (ESA). Like in many other do-
been proposed in the past, starting from localization methods mains, deep learning has shown great initial success in the field
[301], [302], [303] to simultaneous localization and mapping of EO over the past few years [337]. These early successes
(SLAM) frameworks [304], [305], [306], [307]. As a result, consisted of taking the latest developments of deep learning
many probabilistic methods such as factor graphs [308], [309] in computer vision and applying them to small curated earth
are now the work-horse of advanced consumer products such observation data sets [337]. At the same time, the underlying
as robotic vacuum cleaners and unmanned aerial vehicles. data is very challenging. Even though the amount of data is
In case of planning and control, estimation problems are huge, so is the variability in the data. This variability is caused
widely treated as Bayesian sequential learning problems, and by different sensor types, spatial changes (e.g. different regions
sequential decision making frameworks such as POMDPs and resolutions), and temporal changes (e.g. changing light
[310], [311] assume a probabilistic treatment of the underlying conditions, weather conditions, seasons). Besides the challenge
planning problems. With probabilistic representations, many of efficient uncertainty quantification methods for such large
reinforcement learning algorithms are backed up by stability amounts of data, several other challenges that can be tackled
guarantees for safe interactions in the real-world [312], [313], with uncertainty quantification exist in the field of EO. All
[314]. Lastly, there have been also several advances starting in all, the sensitivity of many EO applications together with
from reasoning (semantics [315] to joint reasoning with ge- the nature of EO systems and the challenging EO data make
ometry), embodiment (e.g. active perception [316]) to learn- the quantification of uncertainties very important in this field.
ing (e.g. active learning [317], [318], [319] and identifying Despite hundreds of publications in the last years on DL for
unknown objects [320], [321], [322]). EO, the range of literature on measuring uncertainties of these
Similarly, with the advent of deep learning, many re- systems is relatively small.
searchers proposed new methods to quantify the uncertainty Furthermore, due to the large variation in the data, a data
in deep learning as well as on how to further exploit such sample received at test time is often not covered by the training
information. As oppose to many generic approaches, we sum- data distribution. For example while preparing training data for
marize task-specific methods and their application in practice a local climate zone classification, the human experts might
as followings. Notably, [323] proposed to perform novelty de- be presented only with images where there is no obstruction
tection using auto-encoders, where the reconstructed outputs of and structures are clearly visible. When a model which is
auto-encoders was used to decide how much one can trust the trained on this data set is deployed in real world, it might
network’s predictions. Peretroukhin et al. [324] developed a see the images with clouds obstructing the structures or snow
SO(3) representation and uncertainty estimation framework for giving them a completely different look. Also, the classes in
the problem of rotational learning problems with uncertainty. EO data can have a very wide distribution. For example, there
[325], [28], [326], [327] demonstrated uncertainty-aware, real are millions of types of houses in the world and no training
world application of a reinforcement learning algorithm for data can contain the examples for all of them. The question
robotics, while [328], [329] proposed to leverage spatial infor- is where the OOD detector will draw the line and declare the
mation, on top of MC-dropout. [207], [330], [331] developed following houses as OOD. Hence, OOD detection is important
32
in earth observation and uncertainty measurements play an • Lack of Standardized Evaluation Protocol
important part in this [22]. Existing methods for evaluating the estimated uncertainty
Another common task in EO, where uncertainties can play are better suited to compare uncertainty quantification
an important role, is the data fusion. Optical images normally methods based on measurable quantities such as the cal-
contain only a few channels like RGB. In contrast to this, ibration [340] or the performance on out-of-distribution
EO data can contain optical images with up to hundreds of detection [32]. As described in Section VI, these tests
channels, and a variety of different sensors with different are performed on standardized sets within the machine
spatial, temporal, and semantic properties. Fusing the infor- learning community. Furthermore, the details of these
mation from these different sources and channels propagates experiments might differ in the experimental setting from
the uncertainties from different sources onto the prediction. paper to paper [214]. However, a clear standardized proto-
The challenge lies in developing methods that do not only col of tests that should be performed on uncertainty quan-
quantify uncertainties but also the amount of contribution from tification methods is still not available. For researchers
different channels individually and which learn to focus on the from other domains it is difficult to directly find state
trustworthy data source for a given sample [338]. of the art methods for the field they are interested in,
Unlike normal computer vision scenarios where the image not to speak of the hard decision on which sub-field of
acquisition equipment is quite near to the subject, the EO uncertainty quantification to focus. This makes the direct
satellites are hundreds of kilometers away from the sub- comparison of the latest approaches difficult and also
ject. The sensitivity of sensors, the atmospheric absorption limits the acceptance and adoption of current existing
properties, and surface reflectance properties all contribute to methods for uncertainty quantification.
uncertainties in the acquired data. Integrating the knowledge • Inability to Evaluate Uncertainty Associated to a
of physical EO systems, which also contain information about Single Decision
uncertainty models in those systems, is another major open Existing measures for evaluating the estimated uncer-
issue. However, for several applications in EO, measuring tainty (e.g., the expected calibration error) are based
uncertainties is not only something good to have but rather on the whole testing data set. This means, that equiv-
an important requirement of the field. E.g., the geo-variables alent to classification tasks on unbalanced data sets,
derived from EO data may be assimilated into process models the uncertainty associated with single samples or small
(ocean, hydrological, weather, climate, etc) and the assimi- groups of samples may potentially get biased towards
lation requires the probability distribution of the estimated the performance on the rest of the data set. But for
variables. practical applications, assessing the reliability of a pre-
dicted confidence would give much more possibilities
VIII. C ONCLUSION AND O UTLOOK than an aggregated reliability based on some testing data,
A. Conclusion - How well do the current uncertainty quan- which are independent from the current situation [341].
tification methods work for real world applications? Especially for mission- and safety-critical applications,
Even though many advances on uncertainty quantification pointwise evaluation measures could be of paramount
in neural networks have been made over the last years, their importance and hence such evaluation approaches are
adoption in practical mission- and safety-critical applications very desirable.
is still limited. There are several reasons for this, which are • Lack of Ground Truth Uncertainties
discussed one-by-one as follows: Current methods are empirically evaluated and the per-
• Missing Validation of Existing Methods over Real- formance is underlined by reasonable and explainable
World Problems values of uncertainty. A ground truth uncertainty that
Although DNNs have become the defacto standard in could be used for validation is in general not available.
solving numerous computer vision and medical image Additionally, even though existing methods are calibrated
processing tasks, the majority of existing models are not on given data sets, one cannot simply transfer these
able to appropriately quantify uncertainty that is inherent results to any other data set since one has to be aware
to their inferences particularly in real world applications. of shifts in the data distribution and that many fields can
This is primarily because the baseline models are mostly only cover a tiny portion of the actual data environment.
developed using standard data sets such as Cifar10/100, In application fields as EO, the preparation of a huge
ImageNet, or well known regression data sets that are amount of training data is hard and expensive and hence
specific to a particular use case and are therefore not synthetic data can be used to train a model. For this artifi-
readily applicable to complex real-world environments, cial data, artificial uncertainties in labels and data should
as for example low resolutional satellite data or other data be taken into account to receive a better understanding
sources affected by noise. Although many researchers of the uncertainty quantification performance. The gap
from other fields apply uncertainty quantification in their between the real and synthetic data, or estimated and
field [21], [10], [8], a broad and structured evaluation of real uncertainty further limits the adoption of currently
existing methods based on different real world applica- existing methods for uncertainty quantification.
tions is not available yet. Works like [56] already built • Explainability Issue:
first steps towards a real life evaluation.
33
Fig. 12: European Space Agency (ESA) Developed Earth Observation Missions [339].
Existing methods of neural network uncertainty quantifi- regard to risk-averse and worst case scenarios should be
cation deliver predictions of certainty without any clue considered there. This means, that uncertainty predictions
about what causes possible uncertainties. Even though with a very high predicted uncertainty should never fail,
those certainty values often look reasonable to a human as for example for a prediction of a red or green traffic
observer, one does not know whether the uncertainties light. Such a general protocol would enable researchers
are actually predicted based on the same observations the to easily compare different types of methods against an
human observer made. But without being sure about the established benchmark as well as on real world data
reasons and motivations of single uncertainty estimations, sets. The adoption of such a standard evaluation protocol
a proper transfer from one data set to another, and even should be encouraged by conferences and journals.
only a domain shift, are much harder to realize with a • Expert & Systematic Comparison of Baselines
guaranteed performance. Regarding safety critical real A broad and structured comparison of existing methods
life applications, the lack of explainability makes the for uncertainty estimation on real world applications is
application of the available methods significantly harder. not available yet. An evaluation on real world data is even
Besides the explainability of neural networks decisions, not standard in current machine learning research papers.
existing methods for uncertainty quantification are not As a result, given a specific application, it remains unclear
well understood on a higher level. For instance, ex- which method for uncertainty estimation performs best
plaining the behavior of single deterministic approaches, and whether the latest methods outperform older methods
ensembles or Bayesian methods is a current direction of also on real world examples. This is also partly caused
research and remains difficult to grasp in every detail by the fact, that researchers from other domains that
[227]. It is, however, crucial to understand how those use uncertainty quantification methods, in general present
methods operate and capture uncertainty to identify path- successful applications of single approaches on a specific
ways for refinement, detect and characterize uncertainty, problem or a data set by hand. Considering this, there
failures and important shortcomings [227]. are several points that could be adopted for a better
comparison within the different research domains. For
B. Outlook instance, domain experts should also compare different
approaches against each other and present the weaknesses
• Generic Evaluation Framework
of single approaches in this domain. Similarly, for a better
As already discussed above, there are still problems
comparison among several domains, a collection of all
regarding the evaluation of uncertainty methods, as the
the works in the different real world domains could be
lack of ’ground truth’ uncertainties, the inability to test
collected and exchanged on a central platform. Such a
on single instances, and standardized benchmarking pro-
platform might also help machine learning researchers
tocols, etc. To cope with such issues, the provision of an
in providing an additional source of challenges in the
evaluation protocol containing various concrete baseline
real world and would pave way to broadly highlight
data sets and evaluation metrics that cover all types of
weaknesses in the current state of the art approaches.
uncertainty would undoubtedly help to boost research
Google’s repository on baselines in uncertainties in neural
in uncertainty quantification. Also, the evaluation with
34
networks [340]6 could be such a platform and a step [8] J. Choi, D. Chun, H. Kim, and H.-J. Lee, “Gaussian yolov3: An
towards achieving this goal. accurate and fast object detector using localization uncertainty for
autonomous driving,” in Proceedings of the IEEE International Con-
• Uncertainty Ground Truths ference on Computer Vision, 2019, pp. 502–511.
It remains difficult to validate existing methods due to the [9] A. Amini, A. Soleimany, S. Karaman, and D. Rus, “Spatial uncertainty
lack of uncertainty ground truths. An actual uncertainty sampling for end-to-end control,” arXiv preprint arXiv:1805.04829,
2018.
ground truth on which methods can be compared in an [10] A. Loquercio, M. Segu, and D. Scaramuzza, “A general framework for
ImageNet like manner would make the evaluation of uncertainty estimation in deep learning,” IEEE Robotics and Automa-
predictions on single samples possible. To reach this, the tion Letters, vol. 5, no. 2, pp. 3153–3160, 2020.
[11] K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated
evaluation of the data generation process and occurring classifiers for detecting out-of-distribution samples,” in International
sources of uncertainty, as for example the labeling pro- Conference on Learning Representations, 2018.
cess, might be investigated in more detail. [12] J. Mitros and B. Mac Namee, “On the validity of bayesian neural
networks for uncertainty estimation,” arXiv preprint arXiv:1912.01530,
• Explainability and Physical Models 2019.
Knowing the actual reasons for a false high certainty [13] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon,
or a low certainty makes it much easier to engineer B. Lakshminarayanan, and J. Snoek, “Can you trust your model’s
the methods for real life applications, which again in- uncertainty? evaluating predictive uncertainty under dataset shift,” in
Advances in Neural Information Processing Systems, 2019, pp. 13 991–
creases the trust of people into such methods. Recently, 14 002.
Antorán et al. [342] claimed to have published the first [14] M. S. Ayhan and P. Berens, “Test-time data augmentation for estimation
work on explainable uncertainty estimation. Uncertainty of heteroscedastic aleatoric uncertainty in deep neural networks,” in
Medical Imaging with Deep Learning Conference, 2018.
estimations, in general, form an important step towards [15] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration
explainable artificial intelligence. Explainable uncertainty of modern neural networks,” in International Conference on Machine
estimations would give an even deeper understanding of Learning. PMLR, 2017, pp. 1321–1330.
[16] A. G. Wilson and P. Izmailov, “Bayesian deep learning and a probabilis-
the decision process of a neural network, which, in prac- tic perspective of generalization,” in Advances in Neural Information
tical deployment of DNNs, shall incorporate the desired Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F.
ability to be risk averse while staying applicable in real Balcan, and H. Lin, Eds., vol. 33, 2020, pp. 4697–4708.
[17] M. Rawat, M. Wistuba, and M.-I. Nicolae, “Harnessing model un-
world (especially safety critical applications). Also, the certainty for detecting adversarial examples,” in NIPS Workshop on
possibility of improving explainability with physically Bayesian Deep Learning, 2017.
based arguments offers great potential. While DNNs are [18] A. C. Serban, E. Poll, and J. Visser, “Adversarial examples-a complete
characterisation of the phenomenon,” arXiv preprint arXiv:1810.01185,
very flexible and efficient, they do not directly embed the 2018.
domain specific expert knowledge that is mostly available [19] L. Smith and Y. Gal, “Understanding measures of uncertainty for
and can often be described by mathematical or physical adversarial example detection,” in Proceesings of the Conference on
Uncertainty in Artificial Intelligence, 2018, pp. 560–569.
models, as for example earth system science problems [20] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation:
[343]. Such physic guided models offer a variety of Representing model uncertainty in deep learning,” in international
possibilities to include explicit knowledge as well as conference on machine learning, 2016, pp. 1050–1059.
practical uncertainty representations into a deep learning [21] M. Rußwurm, S. M. Ali, X. X. Zhu, Y. Gal, and M. Körner, “Model and
data uncertainty for satellite time series forecasting with deep recurrent
framework [344], [345]. models,” in 2020 IEEE International Geoscience and Remote Sensing
Symposium (IGARSS), 2020.
R EFERENCES [22] J. Gawlikowski, S. Saha, A. Kruspe, and X. X. Zhu, “Out-of-
[1] T. Nair, D. Precup, D. L. Arnold, and T. Arbel, “Exploring uncertainty distribution detection in satellite image classification,” in RobustML
measures in deep networks for multiple sclerosis lesion detection and workshop at ICLR 2021. ICRL, 2021, pp. 1–5.
segmentation,” Medical image analysis, vol. 59, p. 101557, 2020. [23] Y. Gal, R. Islam, and Z. Ghahramani, “Deep bayesian active learning
[2] A. G. Roy, S. Conjeti, N. Navab, C. Wachinger, A. D. N. Initiative with image data,” in International Conference on Machine Learning.
et al., “Bayesian quicknat: Model uncertainty in deep whole-brain PMLR, 2017, pp. 1183–1192.
segmentation for structure-wise quality control,” NeuroImage, vol. 195, [24] K. Chitta, J. M. Alvarez, and A. Lesnikowski, “Large-scale visual
pp. 11–22, 2019. active learning with deep probabilistic ensembles,” arXiv preprint
[3] P. Seeböck, J. I. Orlando, T. Schlegl, S. M. Waldstein, H. Bogunović, arXiv:1811.03575, 2018.
S. Klimscha, G. Langs, and U. Schmidt-Erfurth, “Exploiting epistemic [25] J. Zeng, A. Lesnikowski, and J. M. Alvarez, “The relevance of
uncertainty of anatomy segmentation for anomaly detection in retinal bayesian layer positioning to model uncertainty in deep bayesian active
oct,” IEEE transactions on medical imaging, vol. 39, no. 1, pp. 87–98, learning,” arXiv preprint arXiv:1811.12535, 2018.
2019. [26] V.-L. Nguyen, S. Destercke, and E. Hüllermeier, “Epistemic uncer-
[4] T. LaBonte, C. Martinez, and S. A. Roberts, “We know where we tainty sampling,” in International Conference on Discovery Science.
don’t know: 3d bayesian cnns for credible geometric uncertainty,” arXiv Springer, 2019, pp. 72–86.
preprint arXiv:1910.10793, 2019. [27] W. Huang, J. Zhang, and K. Huang, “Bootstrap estimated uncertainty
[5] J. C. Reinhold, Y. He, S. Han, Y. Chen, D. Gao, J. Lee, J. L. Prince, of the environment model for model-based reinforcement learning,” in
and A. Carass, “Validating uncertainty in medical image translation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
in 2020 IEEE 17th International Symposium on Biomedical Imaging 2019, pp. 3870–3877.
(ISBI). IEEE, 2020, pp. 95–98. [28] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, “Uncertainty-
[6] S. Eggenreich, C. Payer, M. Urschler, and D. Štern, “Variational in- aware reinforcement learning for collision avoidance,” arXiv preprint
ference and bayesian cnns for uncertainty estimation in multi-factorial arXiv:1702.01182, 2017.
bone age prediction,” arXiv preprint arXiv:2002.10819, 2020. [29] B. Lötjens, M. Everett, and J. P. How, “Safe reinforcement learning
[7] D. Feng, L. Rosenbaum, and K. Dietmayer, “Towards safe autonomous with model uncertainty estimates,” in 2019 International Conference
driving: Capture uncertainty in the deep neural network for lidar 3d on Robotics and Automation (ICRA). IEEE, 2019, pp. 8662–8668.
vehicle detection,” in 2018 21st International Conference on Intelligent [30] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight
Transportation Systems (ITSC). IEEE, 2018, pp. 3266–3273. uncertainty in neural networks,” in Proceedings of the 32nd Interna-
tional Conference on International Conference on Machine Learning-
6 https://github.com/google/uncertainty-baselines Volume 37, 2015, pp. 1613–1622.
35
[31] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable [56] F. K. Gustafsson, M. Danelljan, and T. B. Schon, “Evaluating scalable
predictive uncertainty estimation using deep ensembles,” in Advances bayesian deep learning methods for robust computer vision,” in Pro-
in neural information processing systems, 2017, pp. 6402–6413. ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
[32] A. Malinin and M. Gales, “Predictive uncertainty estimation via prior Recognition Workshops, 2020, pp. 318–319.
networks,” in Advances in Neural Information Processing Systems, [57] E. Hüllermeier and W. Waegeman, “Aleatoric and epistemic uncer-
2018, pp. 7047–7058. tainty in machine learning: An introduction to concepts and methods,”
[33] X. Zhao, Y. Ou, L. Kaplan, F. Chen, and J.-H. Cho, “Quantifying Machine Learning, vol. 110, no. 3, pp. 457–506, 2021.
classification uncertainty using regularized evidential neural networks,” [58] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu,
arXiv preprint arXiv:1910.06864, 2019. M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya
[34] Q. Wu, H. Li, W. Su, L. Li, and Z. Yu, “Quantifying intrinsic et al., “A review of uncertainty quantification in deep learning: Tech-
uncertainty in classification via deep dirichlet mixture networks,” arXiv niques, applications and challenges,” Information Fusion, 2021.
preprint arXiv:1906.04450, 2019. [59] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez,
[35] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal, “Uncertainty V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro,
estimation using a single deep deterministic neural network,” in Pro- R. Faulkner et al., “Relational inductive biases, deep learning, and
ceedings of the 37th International Conference on Machine Learning. graph networks,” arXiv preprint arXiv:1806.01261, 2018.
PMLR, 2020, pp. 9690–9700. [60] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian
[36] T. Ramalho and M. Miranda, “Density estimation in representation deep learning for computer vision?” in Advances in neural information
space to predict model uncertainty,” in Engineering Dependable and processing systems, 2017, pp. 5574–5584.
Secure Machine Learning Systems: Third International Workshop, [61] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural net-
EDSMLS 2020, New York City, NY, USA, February 7, 2020, Revised works with bernoulli approximate variational inference,” arXiv preprint
Selected Papers, vol. 1272. Springer Nature, 2020, p. 84. arXiv:1506.02158, 2015.
[37] A. Mobiny, H. V. Nguyen, S. Moulik, N. Garg, and C. C. Wu, [62] C. Bishop, Pattern Recognition and Machine Learning. Springer-
“Dropconnect is effective in modeling uncertainty of bayesian deep Verlag New York, 2006.
networks,” arXiv preprint arXiv:1906.04569, 2019. [63] H. Ritter, A. Botev, and D. Barber, “A scalable laplace approximation
[38] D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, for neural networks,” in 6th International Conference on Learning
and A. Courville, “Bayesian hypernetworks,” arXiv preprint Representations, vol. 6. International Conference on Representation
arXiv:1710.04759, 2017. Learning, 2018.
[39] M. Valdenegro-Toro, “Deep sub-ensembles for fast uncertainty estima- [64] J. Nandy, W. Hsu, and M. L. Lee, “Towards maximizing the represen-
tion in image classification,” in Bayesian Deep Learning Workshop at tation gap between in-domain & out-of-distribution examples,” in
Neural Information Processing Systems 2019, 2019. Advances in Neural Information Processing Systems, H. Larochelle,
[40] Y. Wen, D. Tran, and J. Ba, “Batchensemble: an alternative approach M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., 2020, pp.
to efficient ensemble and lifelong learning,” in 8th International 9239–9250.
Conference on Learning Representations, 2020. [65] A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov, “Pitfalls of
[41] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmen- in-domain uncertainty estimation and ensembling in deep learning,” in
tation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, International Conference on Learning Representations, 2020.
2019. [66] T. DeVries and G. W. Taylor, “Improved regularization of convolutional
[42] Q. Wen, L. Sun, X. Song, J. Gao, X. Wang, and H. Xu, “Time neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
series data augmentation for deep learning: A survey,” arXiv preprint [67] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified
arXiv:2002.12478, 2020. and out-of-distribution examples in neural networks,” in 5th Interna-
[43] T. Tsiligkaridis, “Information robust dirichlet networks for predictive tional Conference on Learning Representations, 2017.
uncertainty estimation,” arXiv preprint arXiv:1910.04819, 2019. [68] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-
[44] M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential deep learning to distribution image detection in neural networks,” in 6th International
quantify classification uncertainty,” in Advances in Neural Information Conference on Learning Representations, 2018.
Processing Systems, 2018, pp. 3179–3189. [69] A. Shafaei, M. Schmidt, and J. J. Little, “A less biased evaluation
[45] A. Malinin, B. Mlodozeniec, and M. Gales, “Ensemble distribution dis- of out-of-distribution sample detectors,” in British Machine Vision
tillation,” in 8th International Conference on Learning Representations, Conference 2019, 2019.
2020. [70] M. Mundt, I. Pliushch, S. Majumder, and V. Ramesh, “Open set
[46] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mul- recognition through deep neural network uncertainty: Does out-of-
lainathan, and J. Kleinberg, “Direct uncertainty prediction for medical distribution detection require generative classifiers?” in Proceedings
second opinions,” in International Conference on Machine Learning. of the IEEE International Conference on Computer Vision Workshops,
PMLR, 2019, pp. 5281–5290. 2019.
[47] J. Wenger, H. Kjellström, and R. Triebel, “Non-parametric calibration [71] P. Oberdiek, M. Rottmann, and H. Gottschalk, “Classification uncer-
for classification,” in International Conference on Artificial Intelligence tainty of deep neural networks based on gradient information,” in
and Statistics, 2020, pp. 178–190. IAPR Workshop on Artificial Neural Networks in Pattern Recognition.
[48] J. Zhang, B. Kailkhura, and T. Y.-J. Han, “Mix-n-match: Ensemble and Springer, 2018, pp. 113–125.
compositional methods for uncertainty calibration in deep learning,” in [72] J. Lee and G. AlRegib, “Gradients as a measure of uncertainty in
International Conference on Machine Learning. PMLR, 2020, pp. neural networks,” in 2020 IEEE International Conference on Image
11 117–11 128. Processing. IEEE, 2020, pp. 2416–2420.
[49] R. Ghanem, D. Higdon, and H. Owhadi, Handbook of uncertainty [73] G. E. Hinton and D. Van Camp, “Keeping the neural networks simple
quantification. Springer, 2017, vol. 6. by minimizing the description length of the weights,” in Proceedings of
[50] Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation, University the sixth annual conference on Computational learning theory, 1993,
of Cambridge, 2016. pp. 5–13.
[51] A. G. Kendall, “Geometry and uncertainty in deep learning for com- [74] D. Barber and C. M. Bishop, “Ensemble learning in bayesian neural
puter vision,” Ph.D. dissertation, University of Cambridge, 2019. networks,” Nato ASI Series F Computer and Systems Sciences, vol.
[52] A. Malinin, “Uncertainty estimation in deep learning with application 168, pp. 215–238, 1998.
to spoken language assessment,” Ph.D. dissertation, University of [75] A. Graves, “Practical variational inference for neural networks,” in
Cambridge, 2019. Advances in neural information processing systems, 2011, pp. 2348–
[53] H. Wang and D.-Y. Yeung, “Towards bayesian deep learning: A frame- 2356.
work and some existing methods,” IEEE Transactions on Knowledge [76] C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for
and Data Engineering, vol. 28, no. 12, pp. 3395–3408, 2016. deep learning,” in Advances in neural information processing systems,
[54] ——, “A survey on bayesian deep learning,” ACM Computing Surveys 2017, pp. 3288–3298.
(CSUR), vol. 53, no. 5, pp. 1–37, 2020. [77] D. Rezende and S. Mohamed, “Variational inference with normalizing
[55] N. Ståhl, G. Falkman, A. Karlsson, and G. Mathiason, “Evaluation flows,” in International Conference on Machine Learning, 2015, pp.
of uncertainty quantification in deep learning,” in Information Pro- 1530–1538.
cessing and Management of Uncertainty in Knowledge-Based Systems. [78] R. M. Neal, “Bayesian training of backpropagation networks by the
Springer International Publishing, 2020, pp. 556–568. hybrid monte carlo method,” Citeseer, Tech. Rep., 1992.
36
[79] ——, “An improved acceptance procedure for the hybrid monte carlo H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin,
algorithm,” Journal of Computational Physics, vol. 111, no. 1, pp. 194– Eds., vol. 33, 2020, pp. 1356–1367.
203, 1994. [103] N. Tagasovska and D. Lopez-Paz, “Single-model uncertainties for deep
[80] ——, “Bayesian learning for neural networks,” Ph.D. dissertation, learning,” in Advances in Neural Information Processing Systems,
University of Toronto, 1995. 2019, pp. 6417–6428.
[81] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient [104] T. Kawashima, Q. Yu, A. Asai, D. Ikami, and K. Aizawa, “The
langevin dynamics,” in Proceedings of the 28th international confer- aleatoric uncertainty estimation using a separate formulation with
ence on machine learning, 2011, pp. 681–688. virtual residuals,” in 2020 25th International Conference on Pattern
[82] C. Nemeth and P. Fearnhead, “Stochastic gradient markov chain monte Recognition (ICPR). IEEE, 2021, pp. 1438–1445.
carlo,” Journal of the American Statistical Association, pp. 1–18, 2020. [105] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, “Generalized odin: Detect-
[83] T. Salimans and D. P. Kingma, “Weight Normalization: A Simple ing out-of-distribution image without learning from out-of-distribution
Reparameterization to Accelerate Training of Deep Neural Networks,” data,” in Proceedings of the IEEE/CVF Conference on Computer Vision
in Advances in Neural Information Processing Systems 29. Curran and Pattern Recognition, 2020, pp. 10 951–10 960.
Associates, Inc., 2016, pp. 901–909. [106] J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel,
[84] J. Lee, M. Humt, J. Feng, and R. Triebel, “Estimating model uncer- and J. Hopfield, “Large automatic learning, rule extraction, and gener-
tainty of neural networks in sparse information form,” in International alization,” Complex systems, vol. 1, no. 5, pp. 877–922, 1987.
Conference on Machine Learning. PMLR, 2020, pp. 5702–5713. [107] N. Tishby, E. Levin, and S. A. Solla, “Consistent inference of
[85] O. Achrack, O. Barzilay, and R. Kellerman, “Multi-loss sub-ensembles probabilities in layered networks: Predictions and generalization,” in
for accurate classification with uncertainty estimation,” arXiv preprint International Joint Conference on Neural Networks, vol. 2, 1989, pp.
arXiv:2010.01917, 2020. 403–409.
[86] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Wein- [108] W. L. Buntine and A. S. Weigend, “Bayesian back-propagation,”
berger, “Snapshot ensembles: Train 1, get m for free,” in International Complex systems, vol. 5, no. 6, pp. 603–643, 1991.
conference on learning representations, 2017. [109] D. J. C. MacKay, “Bayesian model comparison and backprop nets,” in
[87] G. D. Cavalcanti, L. S. Oliveira, T. J. Moura, and G. V. Carvalho, Advances in neural information processing systems, 1992, pp. 839–846.
“Combining diversity measures for ensemble pruning,” Pattern Recog- [110] M.-A. Sato, “Online model selection based on the variational bayes,”
nition Letters, vol. 74, pp. 38–45, 2016. Neural computation, vol. 13, no. 7, pp. 1649–1681, 2001.
[88] H. Guo, H. Liu, R. Li, C. Wu, Y. Guo, and M. Xu, “Margin & diversity [111] A. Corduneanu and C. M. Bishop, “Variational bayesian model selec-
based ordering ensemble pruning,” Neurocomputing, vol. 275, pp. 237– tion for mixture distributions,” in Artificial intelligence and Statistics,
246, 2018. vol. 2001. Morgan Kaufmann Waltham, MA, 2001, pp. 27–34.
[89] W. G. Martinez, “Ensemble pruning via quadratic margin maximiza- [112] S. Ghosh, J. Yao, and F. Doshi-Velez, “Model selection in bayesian
tion,” IEEE Access, vol. 9, pp. 48 931–48 951, 2021. neural networks via horseshoe priors,” Journal of Machine Learning
[90] J. Lindqvist, A. Olmin, F. Lindsten, and L. Svensson, “A general Research, vol. 20, no. 182, pp. 1–46, 2019.
framework for ensemble distribution distillation,” in 2020 IEEE 30th [113] M. Federici, K. Ullrich, and M. Welling, “Improved bayesian compres-
International Workshop on Machine Learning for Signal Processing sion,” arXiv preprint arXiv:1711.06494, 2017.
(MLSP). IEEE, 2020, pp. 1–6. [114] J. Achterhold, J. M. Koehler, A. Schmeink, and T. Genewein, “Varia-
[91] D. Molchanov, A. Lyzhov, Y. Molchanova, A. Ashukha, and D. Vetrov, tional network quantization,” in International Conference on Learning
“Greedy policy search: A simple baseline for learnable test-time Representations, 2018.
augmentation,” arXiv preprint arXiv:2002.09103, vol. 2, no. 7, 2020. [115] D. J. MacKay, “Information-based objective functions for active data
[92] M. Możejko, M. Susik, and R. Karczewski, “Inhibited softmax selection,” Neural computation, vol. 4, no. 4, pp. 590–604, 1992.
for uncertainty estimation in neural networks,” arXiv preprint [116] A. Kirsch, J. van Amersfoort, and Y. Gal, “Batchbald: Efficient
arXiv:1810.01861, 2018. and diverse batch acquisition for deep bayesian active learning,” in
[93] L. Oala, C. Heiß, J. Macdonald, M. März, W. Samek, and G. Ku- Advances in Neural Information Processing Systems, 2019, pp. 7026–
tyniok, “Interval neural networks: Uncertainty scores,” arXiv preprint 7037.
arXiv:2003.11566, 2020. [117] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, “Variational continual
[94] A. Malinin and M. Gales, “Reverse kl-divergence training of prior net- learning,” in International Conference on Learning Representations,
works: Improved uncertainty and adversarial robustness,” in Advances 2018.
in Neural Information Processing Systems, H. Wallach, H. Larochelle, [118] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach, “Uncertainty-
A. Beygelzimer, F. dÁlché-Buc, E. Fox, and R. Garnett, Eds., 2019, guided continual learning with bayesian neural networks,” in Interna-
pp. 14 547–14 558. tional Conference on Learning Representations, 2020.
[95] V. T. Vasudevan, A. Sethy, and A. R. Ghias, “Towards better con- [119] S. Farquhar and Y. Gal, “A unifying bayesian view of continual
fidence estimation for neural models,” in ICASSP 2019-2019 IEEE learning,” arXiv preprint arXiv:1902.06494, 2019.
International Conference on Acoustics, Speech and Signal Processing [120] H. Li, P. Barnaghi, S. Enshaeifar, and F. Ganz, “Continual learning us-
(ICASSP). IEEE, 2019, pp. 7335–7339. ing bayesian neural networks,” IEEE Transactions on Neural Networks
[96] M. Hein, M. Andriushchenko, and J. Bitterwolf, “Why relu networks and Learning Systems, 2020.
yield high-confidence predictions far away from the training data and [121] M. E. E. Khan, A. Immer, E. Abedi, and M. Korzepa, “Approximate
how to mitigate the problem,” in Proceedings of the IEEE Conference inference turns deep networks into gaussian processes,” in Advances
on Computer Vision and Pattern Recognition, 2019, pp. 41–50. in neural information processing systems, 2019, pp. 3094–3104.
[97] T. Joo, U. Chung, and M.-G. Seo, “Being bayesian about categor- [122] J. S. Denker and Y. LeCun, “Transforming neural-net output levels to
ical probability,” in International Conference on Machine Learning. probability distributions,” in Advances in neural information processing
PMLR, 2020, pp. 4950–4961. systems, 1991, pp. 853–859.
[98] T. Tsiligkaridis, “Failure prediction by confidence estimation of [123] D. J. MacKay, “A practical bayesian framework for backpropagation
uncertainty-aware dirichlet networks,” in 2021 IEEE International networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992.
Conference on Acoustics, Speech and Signal Processing. IEEE, 2021, [124] J. Hernandez-Lobato, Y. Li, M. Rowland, T. Bui, D. Hernández-
pp. 3525–3529. Lobato, and R. Turner, “Black-box alpha divergence minimization,” in
[99] ——, “Information robust dirichlet networks for predictive uncertainty International Conference on Machine Learning, 2016, pp. 1511–1520.
estimation,” arXiv preprint arXiv:1910.04819, 2019. [125] Y. Li and Y. Gal, “Dropout inference in bayesian neural networks with
[100] A. P. Dempster, “A generalization of bayesian inference,” Journal of alpha-divergences,” in International Conference on Machine Learning,
the Royal Statistical Society: Series B (Methodological), vol. 30, no. 2, 2017, pp. 2052–2061.
pp. 205–232, 1968. [126] T. Minka et al., “Divergence measures and message passing,” Technical
[101] A. Amini, W. Schwarting, A. Soleimany, and D. Rus, “Deep evidential report, Microsoft Research, Tech. Rep., 2005.
regression,” in Advances in Neural Information Processing Systems, [127] T. P. Minka, “Expectation propagation for approximate bayesian in-
H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, ference,” in Proceedings of the 17th Conference in Uncertainty in
Eds., vol. 33. Curran Associates, Inc., 2020, pp. 14 927–14 937. Artificial Intelligence, 2001, pp. 362–369.
[102] B. Charpentier, D. Zügner, and S. Günnemann, “Posterior network: [128] J. Zhao, X. Liu, S. He, and S. Sun, “Probabilistic inference of
Uncertainty estimation without ood samples via density-based pseudo- bayesian neural networks with generalized expectation propagation,”
counts,” in Advances in Neural Information Processing Systems, Neurocomputing, vol. 412, pp. 392–398, 2020.
37
[129] J. M. Hernández-Lobato and R. Adams, “Probabilistic backpropagation [152] Z. Eaton-Rosen, F. Bragman, S. Bisdas, S. Ourselin, and M. J. Car-
for scalable learning of bayesian neural networks,” in International doso, “Towards safe deep learning: accurately quantifying biomarker
Conference on Machine Learning, 2015, pp. 1861–1869. uncertainty in neural network predictions,” in International Conference
[130] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. on Medical Image Computing and Computer-Assisted Intervention.
Blei, “Edward: A library for probabilistic modeling, inference, and Springer, 2018, pp. 691–699.
criticism,” arXiv preprint arXiv:1610.09787, 2016. [153] C. R. N. Tassi, “Bayesian convolutional neural network: Robustly
[131] D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, quantify uncertainty for misclassifications detection,” in Mediter-
and D. M. Blei, “Deep probabilistic programming,” in International ranean Conference on Pattern Recognition and Artificial Intelligence.
Conference on Machine Learning, 2016. Springer, 2019, pp. 118–132.
[132] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, [154] P. McClure and N. Kriegeskorte, “Robustly representing uncer-
T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman, tainty through sampling in deep neural networks,” arXiv preprint
“Pyro: Deep universal probabilistic programming,” The Journal of arXiv:1611.01639, 2016.
Machine Learning Research, vol. 20, no. 1, pp. 973–978, 2019. [155] M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava,
[133] R. Cabañas, A. Salmerón, and A. R. Masegosa, “Inferpy: Probabilistic “Fast and scalable bayesian deep learning by weight-perturbation in
modeling with tensorflow made easy,” Knowledge-Based Systems, vol. adam,” in International Conference on Machine Learning. PMLR,
168, pp. 25–27, 2019. 2018, pp. 2611–2620.
[134] Y. Ito, C. Srinivasan, and H. Izumi, “Bayesian learning of neural [156] M. E. Khan, Z. Liu, V. Tangkaratt, and Y. Gal, “Vprop: Variational
networks adapted to changes of prior probabilities,” in International inference using rmsprop,” in Advances in neural information processing
Conference on Artificial Neural Networks. Springer, 2005, pp. 253– systems, 2017, pp. 3288–3298.
259. [157] A. Atanov, A. Ashukha, D. Molchanov, K. Neklyudov, and D. Vetrov,
[135] S. Sun, G. Zhang, J. Shi, and R. Grosse, “Functional variational “Uncertainty estimation via stochastic batch normalization,” in Interna-
bayesian neural networks,” in International Conference on Learning tional Symposium on Neural Networks. Springer, 2019, pp. 261–269.
Representations, 2018. [158] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid
[136] S. Depeweg, J. M. Hernández-Lobato, S. Udluft, and T. Runkler, monte carlo,” Physics letters B, vol. 195, no. 2, pp. 216–222, 1987.
“Sensitivity analysis for predictive uncertainty in bayesian neural [159] R. M. Neal et al., “Mcmc using hamiltonian dynamics,” Handbook of
networks,” arXiv preprint arXiv:1712.03605, 2017. markov chain monte carlo, vol. 2, no. 11, p. 2, 2011.
[137] S. Farquhar, L. Smith, and Y. Gal, “Try depth instead of weight [160] K. A. Dubey, S. J. Reddi, S. A. Williamson, B. Poczos, A. J. Smola,
correlations: Mean-field is a less restrictive assumption for deeper and E. P. Xing, “Variance reduction in stochastic gradient langevin
networks,” arXiv preprint arXiv:2002.03704, 2020. dynamics,” in Advances in neural information processing systems,
[138] J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari, “Sampling- 2016, pp. 1154–1162.
free epistemic uncertainty estimation using approximated variance [161] B. Leimkuhler and S. Reich, Simulating hamiltonian dynamics. Cam-
propagation,” in Proceedings of the IEEE International Conference on bridge university press, 2004, vol. 14.
Computer Vision, 2019, pp. 2931–2940. [162] P. J. Rossky, J. Doll, and H. Friedman, “Brownian dynamics as smart
monte carlo simulation,” The Journal of Chemical Physics, vol. 69,
[139] J. Gast and S. Roth, “Lightweight probabilistic deep networks,” in
no. 10, pp. 4628–4633, 1978.
Proceedings of the IEEE Conference on Computer Vision and Pattern
[163] G. O. Roberts and O. Stramer, “Langevin diffusions and metropolis-
Recognition, 2018, pp. 3369–3378.
hastings algorithms,” Methodology and computing in applied probabil-
[140] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft,
ity, vol. 4, no. 4, pp. 337–357, 2002.
“Decomposition of uncertainty in bayesian deep learning for efficient
[164] H. Kushner and G. G. Yin, Stochastic approximation and recursive
and risk-sensitive learning,” in International Conference on Machine
algorithms and applications. Springer Science & Business Media,
Learning. PMLR, 2018, pp. 1184–1193.
2003, vol. 35.
[141] ——, “Decomposition of uncertainty in Bayesian deep learning for
[165] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.
efficient and risk-sensitive learning,” in Proceedings of the 35th In-
MIT press Cambridge, 2016, vol. 1, no. 2.
ternational Conference on Machine Learning, ser. Proceedings of
[166] Y.-A. Ma, T. Chen, and E. Fox, “A complete recipe for stochastic gra-
Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.
dient mcmc,” in Advances in Neural Information Processing Systems,
PMLR, 10–15 Jul 2018, pp. 1184–1193.
2015, pp. 2917–2925.
[142] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and [167] G. Marceau-Caron and Y. Ollivier, “Natural langevin dynamics for
the local reparameterization trick,” in Advances in neural information neural networks,” in International Conference on Geometric Science
processing systems, 2015, pp. 2575–2583. of Information. Springer, 2017, pp. 451–459.
[143] A. Wu, S. Nowozin, E. Meeds, R. Turner, J. Hernández-Lobato, and [168] Z. Nado, J. Snoek, R. B. Grosse, D. Duvenaud, B. Xu, and J. Martens,
A. Gaunt, “Deterministic variational inference for robust bayesian “Stochastic gradient langevin dynamics that exploit neural network
neural networks,” in 7th International Conference on Learning Rep- structure,” in International Conference on Learning Representations
resentations, ICLR 2019, 2019. (Workshop), 2018.
[144] C. Louizos and M. Welling, “Structured and efficient variational deep [169] U. Simsekli, R. Badeau, A. T. Cemgil, and G. Richard, “Stochas-
learning with matrix gaussian posteriors,” in International Conference tic quasi-newton langevin monte carlo,” in Proceedings of the 33rd
on Machine Learning, 2016, pp. 1708–1716. International Conference on International Conference on Machine
[145] G. Zhang, S. Sun, D. Duvenaud, and R. Grosse, “Noisy natural gradient Learning-Volume 48, 2016, pp. 642–651.
as variational inference,” in International Conference on Machine [170] Y. Zhang and C. A. Sutton, “Quasi-newton methods for markov chain
Learning, 2018, pp. 5852–5861. monte carlo,” in Advances in Neural Information Processing Systems,
[146] S. Sun, C. Chen, and L. Carin, “Learning structured weight uncertainty 2011, pp. 2393–2401.
in bayesian neural networks,” in Artificial Intelligence and Statistics, [171] T. Fu, L. Luo, and Z. Zhang, “Quasi-newton hamiltonian monte carlo.”
2017, pp. 1283–1292. in Conference on Uncertainty in Artificial Intelligence, 2016.
[147] J. Bae, G. Zhang, and R. Grosse, “Eigenvalue corrected noisy natural [172] C. Li, C. Chen, D. Carlson, and L. Carin, “Preconditioned stochastic
gradient,” arXiv preprint arXiv:1811.12565, 2018. gradient langevin dynamics for deep neural networks,” in Proceedings
[148] A. Mishkin, F. Kunstner, D. Nielsen, M. Schmidt, and M. E. Khan, of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI
“Slang: Fast structured covariance approximations for bayesian deep Press, 2016, pp. 1788–1794.
learning with natural gradient,” in Advances in Neural Information [173] S. Ahn, A. K. Balan, and M. Welling, “Bayesian posterior sampling
Processing Systems, 2018, pp. 6245–6255. via stochastic gradient fisher scoring,” in International Conference on
[149] C. Louizos and M. Welling, “Multiplicative normalizing flows for Learning Representations, 2012.
variational bayesian neural networks,” in International Conference on [174] S. Patterson and Y. W. Teh, “Stochastic gradient riemannian langevin
Machine Learning, 2017, pp. 2218–2227. dynamics on the probability simplex,” in Advances in neural informa-
[150] K. Osawa, S. Swaroop, M. E. E. Khan, A. Jain, R. Eschenhagen, tion processing systems, 2013, pp. 3102–3110.
R. E. Turner, and R. Yokota, “Practical deep learning with bayesian [175] N. Ye and Z. Zhu, “Stochastic fractional hamiltonian monte carlo,” in
principles,” in Advances in neural information processing systems, IJCAI, 2018, pp. 3019–3025.
2019, pp. 4287–4299. [176] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven,
[151] Y. Gal, J. Hron, and A. Kendall, “Concrete dropout,” in Advances in “Bayesian sampling using stochastic gradient thermostats,” in Advances
neural information processing systems, 2017, pp. 3581–3590. in neural information processing systems, 2014, pp. 3203–3211.
38
[177] X. Shang, Z. Zhu, B. Leimkuhler, and A. J. Storkey, “Covariance- [200] J. Ba, R. Grosse, and J. Martens, “Distributed second-order opti-
controlled adaptive langevin thermostat for large-scale bayesian sam- mization using kronecker-factored approximations,” in International
pling,” in Advances in Neural Information Processing Systems, 2015, Conference on Learning Representations, 2017.
pp. 37–45. [201] T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent,
[178] B. Leimkuhler and X. Shang, “Adaptive thermostats for noisy gradient “Fast approximate natural gradient descent in a kronecker factored
systems,” SIAM Journal on Scientific Computing, vol. 38, no. 2, pp. eigenbasis,” in Advances in Neural Information Processing Systems,
A712–A736, 2016. 2018, pp. 9573–9583.
[179] S. Ahn, B. Shahbaba, and M. Welling, “Distributed stochastic gradient [202] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,”
mcmc,” in International conference on machine learning, 2014, pp. in Advances in Neural Information Processing Systems 2, 1990, pp.
1044–1052. 598–605.
[180] K.-C. Wang, P. Vicol, J. Lucas, L. Gu, R. Grosse, and R. Zemel, [203] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,
“Adversarial distillation of bayesian neural network posteriors,” in A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska
International Conference on Machine Learning, 2018, pp. 5190–5199. et al., “Overcoming catastrophic forgetting in neural networks,” Pro-
[181] A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling, “Bayesian dark ceedings of the national academy of sciences, vol. 114, no. 13, pp.
knowledge,” in Advances in Neural Information Processing Systems, 3521–3526, 2017.
2015, pp. 3438–3446. [204] A. Kristiadi, M. Hein, and P. Hennig, “Being bayesian, even just a bit,
[182] D. Zou, P. Xu, and Q. Gu, “Stochastic variance-reduced hamilton monte fixes overconfidence in relu networks,” in International Conference on
carlo methods,” in International Conference on Machine Learning, Machine Learning. PMLR, 2020, pp. 5436–5446.
2018, pp. 6028–6037. [205] M. Humt, J. Lee, and R. Triebel, “Bayesian optimization meets
[183] A. Durmus, U. Simsekli, E. Moulines, R. Badeau, and G. Richard, laplace approximation for robotic introspection,” arXiv preprint
“Stochastic gradient richardson-romberg markov chain monte carlo,” in arXiv:2010.16141, 2020.
Advances in Neural Information Processing Systems, 2016, pp. 2047– [206] A. Kristiadi, M. Hein, and P. Hennig, “Learnable uncertainty under
2055. laplace approximations,” arXiv preprint arXiv:2010.02720, 2020.
[184] A. Durmus, E. Moulines et al., “High-dimensional bayesian inference [207] K. Shinde, J. Lee, M. Humt, A. Sezgin, and R. Triebel, “Learning
via the unadjusted langevin algorithm,” Bernoulli, vol. 25, no. 4A, pp. multiplicative interactions with bayesian neural networks for visual-
2854–2882, 2019. inertial odometry,” in Workshop on AI for Autonomous Driving at the
[185] I. Sato and H. Nakagawa, “Approximation analysis of stochastic 37th International Conference on Machine Learning, 2020.
gradient langevin dynamics by using fokker-planck equation and ito [208] J. Feng, M. Durner, Z.-C. Marton, F. Balint-Benczedi, and R. Triebel,
process,” in International Conference on Machine Learning, 2014, pp. “Introspective robot perception using smoothed predictions from
982–990. bayesian neural networks,” in International Symposium on Robotics
[186] C. Chen, N. Ding, and L. Carin, “On the convergence of stochastic Research, 2019.
gradient mcmc algorithms with high-order integrators,” in Advances in [209] A. Y. Foong, Y. Li, J. M. Hernández-Lobato, and R. E. Turner, “’in-
Neural Information Processing Systems, 2015, pp. 2278–2286. between’ uncertainty in bayesian neural networks,” arXiv preprint
[187] Y. W. Teh, A. H. Thiery, and S. J. Vollmer, “Consistency and fluc- arXiv:1906.11537, 2019.
tuations for stochastic gradient langevin dynamics,” The Journal of [210] A. Immer, M. Korzepa, and M. Bauer, “Improving predictions of
Machine Learning Research, vol. 17, no. 1, pp. 193–225, 2016. bayesian neural nets via local linearization,” in Proceedings of The
[188] C. Li, A. Stevens, C. Chen, Y. Pu, Z. Gan, and L. Carin, “Learning 24th International Conference on Artificial Intelligence and Statistics.
weight uncertainty with stochastic gradient mcmc for shape classifica- PMLR, 2021, pp. 703–711.
tion,” in Proceedings of the IEEE Conference on Computer Vision and
[211] M. Hobbhahn, A. Kristiadi, and P. Hennig, “Fast predictive uncer-
Pattern Recognition, 2016, pp. 5666–5675.
tainty for classification with bayesian deep networks,” arXiv preprint
[189] F. Wenzel, K. Roth, B. Veeling, J. Swiatkowski, L. Tran, S. Mandt,
arXiv:2003.01227, 2020.
J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin, “How good is
[212] E. Daxberger, E. Nalisnick, J. U. Allingham, J. Antorán, and J. M.
the bayes posterior in deep neural networks really?” in International
Hernández-Lobato, “Expressive yet tractable bayesian deep learning
Conference on Machine Learning. PMLR, 2020, pp. 10 248–10 259.
via subnetwork inference,” arXiv preprint arXiv:2010.14689, 2020.
[190] N. Ye, Z. Zhu, and R. K. Mantiuk, “Langevin dynamics with continuous
tempering for training deep neural networks,” in Proceedings of the 31st [213] W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson,
International Conference on Neural Information Processing Systems, “A simple baseline for bayesian uncertainty in deep learning,” in
2017, pp. 618–626. Advances in Neural Information Processing Systems, 2019, pp. 13 153–
[191] R. Chandra, K. Jain, R. V. Deo, and S. Cripps, “Langevin-gradient 13 164.
parallel tempering for bayesian neural learning,” Neurocomputing, vol. [214] J. Mukhoti, P. Stenetorp, and Y. Gal, “On the importance of strong
359, pp. 315–326, 2019. baselines in bayesian deep learning,” arXiv preprint arXiv:1811.09385,
[192] A. Botev, H. Ritter, and D. Barber, “Practical gauss-newton optimi- 2018.
sation for deep learning,” in Proceedings of the 34th International [215] A. Filos, S. Farquhar, A. N. Gomez, T. G. Rudner, Z. Kenton, L. Smith,
Conference on Machine Learning-Volume 70, 2017, pp. 557–565. M. Alizadeh, A. de Kroon, and Y. Gal, “A systematic comparison of
[193] J. Martens and R. Grosse, “Optimizing neural networks with kronecker- bayesian deep learning robustness in diabetic retinopathy tasks,” arXiv
factored approximate curvature,” in Proceeding of the 32nd Interna- preprint arXiv:1912.10481, 2019.
tional conference on machine learning, 2015, pp. 2408–2417. [216] J. Mukhoti and Y. Gal, “Evaluating bayesian deep learning methods
[194] S. Becker and Y. LeCun, “Improving the convergence of back- for semantic segmentation,” arXiv preprint arXiv:1811.12709, 2018.
propagation learning with second-order methods,” in Proceedings of the [217] O. Sagi and L. Rokach, “Ensemble learning: A survey,” WIREs Data
1988 Connectionist Models Summer School, San Mateo, D. Touretzky, Mining and Knowledge Discovery, vol. 8, no. 4, p. e1249, 2018.
G. Hinton, and T. Sejnowski, Eds. Morgan Kaufmann, 1989, pp. [218] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE
29–37. transactions on pattern analysis and machine intelligence, vol. 12,
[195] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for no. 10, pp. 993–1001, 1990.
large scale optimization,” Mathematical Programming, vol. 45, pp. [219] Y. Cao, T. A. Geddes, J. Y. H. Yang, and P. Yang, “Ensemble deep
503–528, 08 1989. learning in bioinformatics,” Nature Machine Intelligence, pp. 1–9,
[196] P. Hennig, “Fast probabilistic optimization from noisy gradients,” 2020.
in Proceedings of the 30th International Conference on Machine [220] L. Nannia, S. Ghidoni, and S. Brahnam, “Ensemble of convolutional
Learning, vol. 28-1. PMLR, 2013, pp. 62–70. neural networks for bioimage classification,” Applied Computing and
[197] N. L. Roux and A. W. Fitzgibbon, “A fast natural newton method,” Informatics, 2020.
in Proceedings of the International Conference on Machine Learning, [221] L. Wei, S. Wan, J. Guo, and K. K. Wong, “A novel hierarchical
2010. selective ensemble classifier with bioinformatics application,” Artificial
[198] R. B. Grosse and J. Martens, “A kronecker-factored approximate fisher intelligence in medicine, vol. 83, pp. 82–90, 2017.
matrix for convolution layers,” in Proceedings of the 33nd International [222] F. Lv, M. Han, and T. Qiu, “Remote sensing image classification based
Conference on Machine Learning, 2016, pp. 573–582. on ensemble extreme learning machine with stacked autoencoder,”
[199] S.-W. Chen, C.-N. Chou, and E. Chang, “Bda-pch: Block-diagonal ap- IEEE Access, vol. 5, pp. 9021–9031, 2017.
proximation of positive-curvature hessian for training neural networks,” [223] X. Dai, X. Wu, B. Wang, and L. Zhang, “Semisupervised scene classi-
CoRR, abs/1802.06502, 2018. fication for remote sensing images: A method based on convolutional
39
neural networks and ensemble learning,” IEEE Geoscience and Remote [248] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic brain
Sensing Letters, vol. 16, no. 6, pp. 869–873, 2019. tumor segmentation using convolutional neural networks with test-
[224] E. Marushko and A. Doudkin, “Methods of using ensembles of time augmentation,” in International MICCAI Brainlesion Workshop.
heterogeneous models to identify remote sensing objects,” Pattern Springer, 2018, pp. 61–72.
Recognition and Image Analysis, vol. 30, no. 2, pp. 211–216, 2020. [249] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, and T. Ver-
[225] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model- cauteren, “Aleatoric uncertainty estimation with test-time augmentation
ensemble trust-region policy optimization,” in International Conference for medical image segmentation with convolutional neural networks,”
on Learning Representations, 2018. Neurocomputing, vol. 338, pp. 34–45, 2019.
[226] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, “Epopt: [250] N. Moshkov, B. Mathe, A. Kertesz-Farkas, R. Hollandi, and P. Horvath,
Learning robust neural network policies using model ensembles,” in “Test-time augmentation for deep learning-based cell segmentation on
International Conference on Learning Representations, 2017. microscopy images,” Scientific reports, vol. 10, no. 1, pp. 1–7, 2020.
[227] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss [251] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net-
landscape perspective,” arXiv preprint arXiv:1912.02757, 2019. works for biomedical image segmentation,” in International Confer-
[228] A. Renda, M. Barsacchi, A. Bechini, and F. Marcelloni, “Comparing ence on Medical image computing and computer-assisted intervention.
ensemble strategies for deep learning: An application to facial expres- Springer, 2015, pp. 234–241.
sion recognition,” Expert Systems with Applications, vol. 136, pp. 1–11, [252] D. Shanmugam, D. Blalock, G. Balakrishnan, and J. Guttag, “When and
2019. why test-time augmentation works,” arXiv preprint arXiv:2011.11156,
[229] E. J. Herron, S. R. Young, and T. E. Potok, “Ensembles of networks 2020.
produced from neural architecture search,” in International Conference [253] I. Kim, Y. Kim, and S. Kim, “Learning loss for test-time augmentation,”
on High Performance Computing. Springer, 2020, pp. 223–234. in Advances in Neural Information Processing Systems, 2020, pp.
[230] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra, 4163–4174.
“Why m heads are better than one: Training a diverse ensemble of [254] Q. Yu and K. Aizawa, “Unsupervised out-of-distribution detection
deep networks,” arXiv preprint arXiv:1511.06314, 2015. by maximum classifier discrepancy,” in Proceedings of the IEEE
[231] I. E. Livieris, L. Iliadis, and P. Pintelas, “On ensemble techniques International Conference on Computer Vision, 2019, pp. 9518–9526.
of weight-constrained neural networks,” Evolving Systems, pp. 1–13, [255] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon,
2020. and B. Lakshminarayanan, “Likelihood ratios for out-of-distribution
[232] L. Nanni, S. Brahnam, and G. Maguolo, “Data augmentation for detection,” in Advances in Neural Information Processing Systems,
building an ensemble of convolutional neural networks,” in Innovation 2019, pp. 14 707–14 718.
in Medicine and Healthcare Systems, and Multimedia. Singapore: [256] J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez, “Quality of uncertainty
Springer Singapore, 2019, pp. 61–69. quantification for bayesian neural network inference,” arXiv preprint
[233] J. Guo and S. Gould, “Deep cnn ensemble with data augmentation for arXiv:1906.09686, 2019.
object detection,” arXiv preprint arXiv:1506.07224, 2015. [257] X. Huang, J. Yang, L. Li, H. Deng, B. Ni, and Y. Xu, “Evaluating
[234] R. Rahaman and A. H. Thiery, “Uncertainty quantification and deep and boosting uncertainty quantification in classification,” arXiv preprint
ensembles,” stat, vol. 1050, p. 20, 2020. arXiv:1909.06030, 2019.
[235] Y. Wen, G. Jerfel, R. Muller, M. W. Dusenberry, J. Snoek, B. Lakshmi- [258] J. Davis and M. Goadrich, “The relationship between precision-recall
narayanan, and D. Tran, “Combining ensembles and data augmentation and roc curves,” in Proceedings of the 23rd international conference
can harm your calibration,” in International Conference on Learning on Machine learning, 2006, pp. 233–240.
Representations, 2021. [259] T. Pearce, A. Brintrup, M. Zaki, and A. Neely, “High-quality prediction
[236] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon, “Attention-based intervals for deep learning: A distribution-free, ensembled approach,”
ensemble for deep metric learning,” in Proceedings of the European in International Conference on Machine Learning. PMLR, 2018, pp.
Conference on Computer Vision, 2018. 4075–4084.
[237] J. Yang and F. Wang, “Auto-ensemble: An adaptive learning rate [260] D. Su, Y. Y. Ting, and J. Ansel, “Tight prediction intervals using
scheduling based deep learning model ensembling,” IEEE Access, expanded interval minimization,” arXiv preprint arXiv:1806.11222,
vol. 8, pp. 217 499–217 509, 2020. 2018.
[238] M. Leutbecher and T. N. Palmer, “Ensemble forecasting,” Journal of [261] P. McClure, N. Rho, J. A. Lee, J. R. Kaczmarzyk, C. Y. Zheng, S. S.
computational physics, vol. 227, no. 7, pp. 3515–3539, 2008. Ghosh, D. M. Nielson, A. G. Thomas, P. Bandettini, and F. Pereira,
[239] W. S. Parker, “Ensemble modeling, uncertainty and robust predictions,” “Knowing what you know in brain segmentation using bayesian deep
Wiley Interdisciplinary Reviews: Climate Change, vol. 4, no. 3, pp. neural networks,” Frontiers in neuroinformatics, vol. 13, p. 67, 2019.
213–223, 2013. [262] A. P. Soleimany, H. Suresh, J. J. G. Ortiz, D. Shanmugam, N. Gural,
[240] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler, “The J. Guttag, and S. N. Bhatia, “Image segmentation of liver stage
power of ensembles for active learning in image classification,” in malaria infection with spatial uncertainty sampling,” arXiv preprint
Proceedings of the IEEE Conference on Computer Vision and Pattern arXiv:1912.00262, 2019.
Recognition, 2018, pp. 9368–9377. [263] R. D. Soberanis-Mukul, N. Navab, and S. Albarqouni, “Uncertainty-
[241] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. based graph convolutional networks for organ segmentation refine-
Willke, “Out-of-distribution detection using an ensemble of self super- ment,” in Medical Imaging with Deep Learning. PMLR, 2020, pp.
vised leave-out classifiers,” in Proceedings of the European Conference 755–769.
on Computer Vision, 2018, pp. 550–564. [264] P. Seebock, J. I. Orlando, T. Schlegl, S. M. Waldstein, H. Bogunovic,
[242] J. Kocić, N. Jovičić, and V. Drndarević, “An end-to-end deep neural S. Klimscha, G. Langs, and U. Schmidt-Erfurth, “Exploiting epistemic
network for autonomous driving designed for embedded automotive uncertainty of anatomy segmentation for anomaly detection in retinal
platforms,” Sensors, vol. 19, no. 9, p. 2064, 2019. oct,” IEEE Transactions on Medical Imaging, vol. 39, p. 87–98, 2020.
[243] G. Martı́nez-Muñoz, D. Hernández-Lobato, and A. Suárez, “An anal- [265] V. Kuleshov, N. Fenner, and S. Ermon, “Accurate uncertainties for deep
ysis of ensemble pruning techniques based on ordered aggregation,” learning using calibrated regression,” in International Conference on
IEEE Transactions on Pattern Analysis and Machine Intelligence, Machine Learning. PMLR, 2018, pp. 2796–2804.
vol. 31, no. 2, pp. 245–259, 2008. [266] S. Seo, P. H. Seo, and B. Han, “Learning for single-shot confidence
[244] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” calibration in deep neural networks through stochastic inferences,” in
in Proceedings of the 12th ACM SIGKDD international conference on Proceedings of the IEEE Conference on Computer Vision and Pattern
Knowledge discovery and data mining, 2006, pp. 535–541. Recognition, 2019, pp. 9030–9038.
[245] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a [267] Z. Li and D. Hoiem, “Improving confidence estimates for unfamiliar
neural network,” stat, vol. 1050, p. 9, 2015. examples,” in Proceedings of the IEEE Conference on Computer Vision
[246] E. Englesson and H. Azizpour, “Efficient evaluation-time uncertainty and Pattern Recognition, 2020, pp. 2686–2695.
estimation by improved distillation,” in Workshop on Uncertainty and [268] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink-
Robustness in Deep Learning at International Conference on Machine ing the inception architecture for computer vision,” in Proceedings of
Learning, 2019. the IEEE conference on computer vision and pattern recognition, 2016,
[247] S. Reich, D. Mueller, and N. Andrews, “Ensemble distillation for pp. 2818–2826.
structured prediction: Calibrated, accurate, fast—choose three,” in [269] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regu-
Proceedings of the 2020 Conference on Empirical Methods in Natural larizing neural networks by penalizing confident output distributions,”
Language Processing (EMNLP), 2020, pp. 5583–5595. arXiv preprint arXiv:1701.06548, 2017.
40
[270] R. Müller, S. Kornblith, and G. E. Hinton, “When does label smoothing [292] M. L. Iuzzolino, T. Umada, N. R. Ahmed, and D. A. Szafir, “In
help?” in Advances in Neural Information Processing Systems, 2019, automation we trust: Investigating the role of uncertainty in active
pp. 4694–4703. learning systems,” arXiv preprint arXiv:2004.00762, 2020.
[271] B. Venkatesh and J. J. Thiagarajan, “Heteroscedastic calibra- [293] B. Settles, “Active learning literature survey,” University of Wisconsin-
tion of uncertainty estimators in deep learning,” arXiv preprint Madison Department of Computer Sciences, Tech. Rep., 2009.
arXiv:1910.14179, 2019. [294] R. Pop and P. Fulop, “Deep ensemble bayesian active learning:
[272] P. Izmailov, W. J. Maddox, P. Kirichenko, T. Garipov, D. Vetrov, and Addressing the mode collapse issue in monte carlo dropout via en-
A. G. Wilson, “Subspace inference for bayesian deep learning,” in sembles,” arXiv preprint arXiv:1811.03897, 2018.
Uncertainty in Artificial Intelligence. PMLR, 2020, pp. 1169–1179. [295] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, “Bayesian re-
[273] Z. Zhang, A. V. Dalca, and M. R. Sabuncu, “Confidence calibration for inforcement learning: A survey,” Foundations and Trends® in Machine
convolutional neural networks using structured dropout,” arXiv preprint Learning, vol. 8, no. 5-6, pp. 359–483, 2015.
arXiv:1906.09551, 2019. [296] S. Hu, D. Worrall, S. Knegt, B. Veeling, H. Huisman, and M. Welling,
[274] M.-H. Laves, S. Ihler, K.-P. Kortmann, and T. Ortmaier, “Well- “Supervised uncertainty quantification for segmentation with multiple
calibrated model uncertainty with temperature scaling for dropout annotations,” in International Conference on Medical Image Computing
variational inference,” arXiv preprint arXiv:1909.13550, 2019. and Computer-Assisted Intervention. Springer, 2019, pp. 137–145.
[275] A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and [297] F. C. Ghesu, B. Georgescu, E. Gibson, S. Guendel, M. K. Kalra,
T. Kapur, “Confidence calibration and predictive uncertainty estimation R. Singh, S. R. Digumarthy, S. Grbic, and D. Comaniciu, “Quantifying
for deep medical image segmentation,” IEEE Transactions on Medical and leveraging classification uncertainty for chest radiograph assess-
Imaging, 2020. ment,” in International Conference on Medical Image Computing and
[276] B. Zadrozny and C. Elkan, “Obtaining calibrated probability estimates Computer-Assisted Intervention. Springer, 2019, pp. 676–684.
from decision trees and naive bayesian classifiers,” in International
[298] M. S. Ayhan, L. Kuehlewein, G. Aliyeva, W. Inhoffen, F. Ziemssen,
Conference on Machine Learning, vol. 1. Citeseer, 2001, pp. 609–
and P. Berens, “Expert-validated estimation of diagnostic uncertainty
616.
for deep neural networks in diabetic retinopathy detection,” Medical
[277] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detec-
Image Analysis, p. 101724, 2020.
tion with outlier exposure,” in International Conference on Learning
Representations, 2019. [299] N. Sunderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner,
[278] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and B. Upcroft, P. Abbeel, W. Burgard, M. Milford et al., “The limits and
S. Michalak, “On mixup training: Improved calibration and predictive potentials of deep learning for robotics,” The International Journal of
uncertainty for deep neural networks,” in Advances in Neural Informa- Robotics Research, vol. 37, no. 4-5, pp. 405–420, 2018.
tion Processing Systems, 2019, pp. 13 888–13 899. [300] S. Thrun, “Probabilistic robotics,” Communications of the ACM,
[279] J. Maroñas, D. Ramos, and R. Paredes, “Improving calibration in vol. 45, no. 3, pp. 52–57, 2002.
mixup-trained deep neural networks through confidence-based loss [301] D. Fox, “Markov localization-a probabilistic framework for mobile
functions,” arXiv preprint arXiv:2003.09946, 2020. robot localization and navigation.” Ph.D. dissertation, Citeseer, 1998.
[280] K. Patel, W. Beluch, D. Zhang, M. Pfeiffer, and B. Yang, “On-manifold [302] D. Fox, W. Burgard, H. Kruppa, and S. Thrun, “A probabilistic
adversarial data augmentation improves uncertainty calibration,” in approach to collaborative multi-robot localization,” Autonomous robots,
2020 25th International Conference on Pattern Recognition (ICPR). vol. 8, no. 3, pp. 325–344, 2000.
IEEE, 2021, pp. 8029–8036. [303] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlo
[281] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non- localization for mobile robots,” Artificial intelligence, vol. 128, no. 1-2,
rigid objects using mean shift,” in Proceedings IEEE Conference on pp. 99–141, 2001.
Computer Vision and Pattern Recognition, vol. 2. IEEE, 2000, pp. [304] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and map-
142–149. ping: part i,” IEEE robotics & automation magazine, vol. 13, no. 2,
[282] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond pp. 99–110, 2006.
empirical risk minimization,” in International Conference on Learning [305] T. Bailey and H. Durrant-Whyte, “Simultaneous localization and map-
Representations, 2018. ping (slam): Part ii,” IEEE robotics & automation magazine, vol. 13,
[283] M. P. Naeini, G. F. Cooper, and M. Hauskrecht, “Obtaining well no. 3, pp. 108–117, 2006.
calibrated probabilities using bayesian binning,” in Proceedings of the [306] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, “Fastslam:
AAAI Conference on Artificial Intelligence, vol. 2015, 2015, p. 2901. A factored solution to the simultaneous localization and mapping
[284] M. Kull, M. Perelló-Nieto, M. Kängsepp, T. de Menezes e Silva Filho, problem,” Aaai/iaai, vol. 593598, 2002.
H. Song, and P. A. Flach, “Beyond temperature scaling: Obtaining [307] M. Kaess, V. Ila, R. Roberts, and F. Dellaert, “The bayes tree: An
well-calibrated multi-class probabilities with dirichlet calibration,” in algorithmic foundation for probabilistic robot mapping,” in Algorithmic
Advances in Neural Information Processing Systems, 2019, pp. 12 295– Foundations of Robotics IX. Springer, 2010, pp. 157–173.
12 305. [308] F. Dellaert and M. Kaess, “Factor graphs for robot perception,”
[285] D. Levi, L. Gispan, N. Giladi, and E. Fetaya, “Evaluating and cal- Foundations and Trends in Robotics, vol. 6, no. 1-2, pp. 1–139, 2017.
ibrating uncertainty prediction in regression tasks,” arXiv preprint [309] H. A. Loeliger, “An introduction to factor graphs,” IEEE Signal
arXiv:1905.11659, 2019. Processing Magazine, vol. 21, no. 1, pp. 28–41, 2004.
[286] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll,
[310] D. Silver and J. Veness, “Monte-carlo planning in large pomdps,” in
and T. Schön, “Evaluating model calibration in classification,” in The
Advances in Neural Information Processing Systems, 2010.
22nd International Conference on Artificial Intelligence and Statistics.
PMLR, 2019, pp. 3459–3467. [311] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa, “Online planning
[287] M. H. DeGroot and S. E. Fienberg, “The comparison and evaluation algorithms for pomdps,” Journal of Artificial Intelligence Research,
of forecasters,” Journal of the Royal Statistical Society: Series D (The vol. 32, pp. 663–704, 2008.
Statistician), vol. 32, no. 1-2, pp. 12–22, 1983. [312] S. M. Richards, F. Berkenkamp, and A. Krause, “The lyapunov neural
[288] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran, network: Adaptive stability certification for safe learning of dynamical
“Measuring calibration in deep learning,” in CVPR Workshops, 2019, systems,” in Conference on Robot Learning. PMLR, 2018, pp. 466–
pp. 38–41. 476.
[289] A. Ghandeharioun, B. Eoff, B. Jou, and R. Picard, “Characterizing [313] F. Berkenkamp, A. P. Schoellig, and A. Krause, “Safe controller
sources of uncertainty to proxy calibration and disambiguate annota- optimization for quadrotors with gaussian processes,” in 2016 IEEE
tor and data bias,” in 2019 IEEE/CVF International Conference on International Conference on Robotics and Automation (ICRA). IEEE,
Computer Vision Workshop. IEEE, 2019, pp. 4202–4206. 2016, pp. 491–496.
[290] F. J. Pulgar, A. J. Rivera, F. Charte, and M. J. del Jesus, “On the impact [314] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe
of imbalanced data in convolutional neural networks performance,” model-based reinforcement learning with stability guarantees,” in Ad-
in International Conference on Hybrid Artificial Intelligence Systems. vances in Neural Information Processing Systems, 2017.
Springer, 2017, pp. 220–232. [315] H. Grimmett, R. Triebel, R. Paul, and I. Posner, “Introspective clas-
[291] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework sification for robot perception,” The International Journal of Robotics
for detecting out-of-distribution samples and adversarial attacks,” in Research, vol. 35, no. 7, pp. 743–762, 2016.
Advances in Neural Information Processing Systems, 2018, pp. 7167– [316] R. Bajcsy, “Active perception,” Proceedings of the IEEE, vol. 76, no. 8,
7177. pp. 966–1005, 1988.
41
[317] R. Triebel, H. Grimmett, R. Paul, and I. Posner, “Driven learning for [339] ESA, “European space agency (esa) devel-
driving: How introspection improves semantic mapping,” in Robotics oped earth observation satellites,” 2019. [Online].
Research. Springer, 2016, pp. 449–465. Available: http://www.esa.int:8080/ESA Multimedia/Images/2019/05/
[318] A. Narr, R. Triebel, and D. Cremers, “Stream-based active learning ESA-developed Earth observation missions
for efficient and adaptive classification of 3d objects,” in 2016 IEEE [340] Z. Nado, N. Band, M. Collier, J. Djolonga, M. W. Dusenberry, S. Far-
International Conference on Robotics and Automation. IEEE, 2016, quhar, A. Filos, M. Havasi, R. Jenatton, G. Jerfel et al., “Uncertainty
pp. 227–233. baselines: Benchmarks for uncertainty & robustness in deep learning,”
[319] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with arXiv preprint arXiv:2106.04015, 2021.
statistical models,” Journal of artificial intelligence research, vol. 4, [341] M. Kull and P. A. Flach, “Reliability maps: a tool to enhance
pp. 129–145, 1996. probability estimates and improve classification accuracy,” in Joint
[320] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily European Conference on Machine Learning and Knowledge Discovery
fooled: High confidence predictions for unrecognizable images,” in in Databases. Springer, 2014, pp. 18–33.
Proceedings of the IEEE conference on computer vision and pattern [342] J. Antorán, U. Bhatt, T. Adel, A. Weller, and J. M. Hernández-Lobato,
recognition, 2015, pp. 427–436. “Getting a clue: A method for explaining uncertainty estimates,” in
[321] K. Wong, S. Wang, M. Ren, M. Liang, and R. Urtasun, “Identifying International Conference on Learning Representations, 2021.
unknown instances for autonomous driving,” in Conference on Robot [343] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, and
Learning. PMLR, 2020, pp. 384–393. N. Carvalhais, “Deep learning and process understanding for data-
[322] W. Boerdijk, M. Sundermeyer, M. Durner, and R. Triebel, “”What’s driven earth system science,” Nature, vol. 566, no. 7743, pp. 195–204,
this?”–learning to segment unknown objects from manipulation se- 2019.
quences,” in Intern. Conf. on Robotics and Automation, 2021. [344] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, “Integrating
[323] C. Richter and N. Roy, “Safe visual navigation via deep learning and physics-based modeling with machine learning: A survey,” arXiv
novelty detection,” Robotics: Science and Systems Foundation, 2017. preprint arXiv:2003.04919, 2020.
[324] V. Peretroukhin, M. Giamou, D. M. Rosen, W. N. Greene, N. Roy, [345] E. De Bézenac, A. Pajot, and P. Gallinari, “Deep learning for physical
and J. Kelly, “A smooth representation of belief over so (3) for deep processes: Incorporating prior scientific knowledge,” Journal of Statis-
rotation learning with uncertainty,” arXiv preprint arXiv:2006.01031, tical Mechanics: Theory and Experiment, vol. 2019, no. 12, p. 124009,
2020. 2019.
[325] B. Lütjens, M. Everett, and J. P. How, “Safe reinforcement learning
with model uncertainty estimates,” in 2019 International Conference
on Robotics and Automation. IEEE, 2019, pp. 8662–8668.
[326] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, “Self-
supervised deep reinforcement learning with generalized computation
graphs for robot navigation,” in 2018 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, 2018, pp. 5129–5136.
[327] F. Stulp, E. Theodorou, J. Buchli, and S. Schaal, “Learning to grasp
under uncertainty,” in 2011 IEEE International Conference on Robotics
and Automation. IEEE, 2011, pp. 5703–5708.
[328] V. Tchuiev and V. Indelman, “Inference over distribution of posterior
class probabilities for reliable bayesian classification and object-level
perception,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp.
4329–4336, 2018.
[329] Y. Feldman and V. Indelman, “Bayesian viewpoint-dependent robust
classification under model and localization uncertainty,” in 2018 IEEE
International Conference on Robotics and Automation. IEEE, 2018,
pp. 3221–3228.
[330] N. Yang, L. von Stumberg, R. Wang, and D. Cremers, “D3vo: Deep
depth, deep pose and deep uncertainty for monocular visual odometry,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2020, pp. 1281–1292.
[331] S. Wang, R. Clark, H. Wen, and N. Trigoni, “Deepvo: Towards
end-to-end visual odometry with deep recurrent convolutional neural
networks,” in 2017 IEEE International Conference on Robotics and
Automation (ICRA). IEEE, 2017, pp. 2043–2050.
[332] C. Gurău, C. H. Tong, and I. Posner, “Fit for purpose? predicting
perception performance based on past experience,” in International
Symposium on Experimental Robotics. Springer, 2016, pp. 454–464.
[333] S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert, “Introspective
perception: Learning to predict failures in vision systems,” in 2016
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). IEEE, 2016, pp. 1743–1750.
[334] M. Netzband, W. L. Stefanov, and C. Redman, Applied remote sensing
for urban planning, governance and sustainability. Springer Science
& Business Media, 2007.
[335] C. Giardino, M. Bresciani, P. Villa, and A. Martinelli, “Application of
remote sensing in water resource management: the case study of lake
trasimeno, italy,” Water resources management, vol. 24, no. 14, pp.
3885–3899, 2010.
[336] C. J. Van Westen, “Remote sensing for natural disaster management,”
International archives of photogrammetry and remote sensing, vol. 33,
no. B7/4; PART 7, pp. 1609–1617, 2000.
[337] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and
F. Fraundorfer, “Deep learning in remote sensing: A comprehensive
review and list of resources,” IEEE Geoscience and Remote Sensing
Magazine, vol. 5, no. 4, pp. 8–36, 2017.
[338] J. Gawlikowski, M. Schmitt, A. Kruspe, and X. X. Zhu, “On the
fusion strategies of sentinel-1 and sentinel-2 data for local climate zone
classification,” in IEEE International Geoscience and Remote Sensing
Symposium, 2020, pp. 2081–2084.