3 Bayesian Deep Learning
In previous chapters we reviewed Bayesian neural networks (BNNs) and historical techniques for approximate inference in these, as well as more recent approaches. We discussed the advantages and disadvantages of different techniques, examining their practicality. This, perhaps, is the most important aspect of modern techniques for approximate inference in BNNs. The field of deep learning is pushed forward by practitioners working on real-world problems. Techniques which cannot scale to complex models with potentially millions of parameters, cannot scale well with large amounts of data, require well-studied models to be radically changed, or are not accessible to engineers, will simply perish.
In this chapter we will build on the strand of work of [Graves, 2011; Hinton and Van Camp, 1993], but will do so from the Bayesian perspective rather than the information theory one. Developing Bayesian approaches to deep learning, we will tie approximate BNN inference together with deep learning stochastic regularisation techniques (SRTs) such as dropout. These regularisation techniques are used in many modern deep learning tools, allowing us to offer a practical inference technique.
We will start by reviewing in detail the tools used by Graves [2011], and extend these with recent research. In the process we will comment on and analyse the variance of several stochastic estimators used in variational inference (VI). Following that we
will tie these derivations to SRTs, and propose practical techniques to obtain model
uncertainty, even from existing models. We finish the chapter by developing specific
examples for image based models (CNNs) and sequence based models (RNNs). These
will be demonstrated in chapter 5, where we will survey recent research making use of
the suggested tools in real-world problems.
Our starting point is the VI minimisation objective:
$$\mathcal{L}_{\text{VI}}(\theta) := -\sum_{i=1}^{N} \int q_\theta(\omega) \log p(y_i | f^\omega(x_i))\, d\omega + \text{KL}(q_\theta(\omega)\,||\,p(\omega)). \qquad (3.1)$$
Evaluating this objective poses several difficulties. First, the summed-over terms $\int q_\theta(\omega) \log p(y_i | f^\omega(x_i))\, d\omega$ are not tractable for BNNs with more than a single hidden layer. Second, this objective requires us to perform computations over the entire dataset, which can be too costly for large $N$.
To solve the latter problem, we may use data sub-sampling (also referred to as
mini-batch optimisation). We approximate eq. (3.1) with
$$\hat{\mathcal{L}}_{\text{VI}}(\theta) := -\frac{N}{M}\sum_{i\in S} \int q_\theta(\omega) \log p(y_i | f^\omega(x_i))\, d\omega + \text{KL}(q_\theta(\omega)\,||\,p(\omega)) \qquad (3.2)$$
with $S$ a random index set of size $M$. Each intractable integral can then be estimated with Monte Carlo (MC) integration w.r.t. the variational parameters $\theta$. This allows us to optimise the objective and find the optimal parameters $\theta^*$.
There exist three main techniques for MC estimation in the VI literature (a brief survey
of the literature was collected by [Schulman et al., 2015]). These have very different
characteristics and variances for the estimation of the expected log likelihood and its
derivative. Here we will contrast all three techniques in the context of VI and analyse
them both empirically and theoretically.
To present the various techniques we will consider the general case of estimating the integral derivative:
$$I(\theta) = \frac{\partial}{\partial\theta} \int f(x)\, p_\theta(x)\, dx \qquad (3.3)$$
which arises when we optimise eq. (3.2) (we will also refer to a stochastic estimator of
this quantity as a stochastic derivative estimator). Here f (x) is a function defined on
the reals, differentiable almost everywhere (a.e., differentiable on R apart from a zero
measure set), and pθ (x) is a probability density function (pdf) parametrised by θ from
which we can easily generate samples. We assume that the integral exists and is finite,
and that $f(x)$ does not depend on $\theta$. Note that we shall write $f'(x)$ when we differentiate $f(x)$ w.r.t. its input (i.e. $\frac{\partial}{\partial x} f(x)$, in contrast to the differentiation of $f(x)$ w.r.t. other variables).
We further use $p_\theta(x) = N(x; \mu, \sigma^2)$ as a concrete example, with $\theta = \{\mu, \sigma\}$. In this case, we will refer to an estimator of (3.3) as the mean derivative estimator (for some function $f(x)$) when we differentiate w.r.t. $\theta = \mu$, and the standard deviation derivative estimator when we differentiate w.r.t. $\theta = \sigma$.
Three MC estimators for eq. (3.3) are used in the VI literature:
1. The score function estimator (also known as a likelihood ratio estimator and
Reinforce, [Fu, 2006; Glynn, 1990; Paisley et al., 2012; Williams, 1992]) relies on
the identity $\frac{\partial}{\partial\theta} p_\theta(x) = p_\theta(x) \frac{\partial}{\partial\theta} \log p_\theta(x)$ and follows the parametrisation:
$$\frac{\partial}{\partial\theta}\int f(x)\, p_\theta(x)\, dx = \int f(x)\, \frac{\partial}{\partial\theta} p_\theta(x)\, dx = \int f(x)\, \frac{\partial \log p_\theta(x)}{\partial\theta}\, p_\theta(x)\, dx \qquad (3.4)$$
leading to the unbiased stochastic estimator $\hat{I}_1(\theta) = f(x)\frac{\partial \log p_\theta(x)}{\partial\theta}$ with $x \sim p_\theta(x)$, hence $\mathbb{E}_{p_\theta(x)}[\hat{I}_1(\theta)] = I(\theta)$. Note that the first transition is possible since $x$ and $f(x)$ do not depend on $\theta$, and only $p_\theta(x)$ depends on it. This estimator is simple and applicable with discrete distributions, but as Paisley et al. [2012] identify, it has rather high variance. When used in practice it is often coupled with a variance reduction technique.
2. The pathwise derivative estimator (also referred to in the literature as the re-parametrisation trick) re-parametrises the random variable as $x = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ for some fixed distribution $p(\epsilon)$, giving the unbiased estimator
$$\hat{I}_2(\theta) = f'(g(\theta, \epsilon))\, \frac{\partial}{\partial\theta} g(\theta, \epsilon). \qquad (3.5)$$
In the Gaussian case we can write $g(\theta, \epsilon) = \mu + \sigma\epsilon$ with $p(\epsilon) = N(0, 1)$.
3. The third estimator applies to a Gaussian $p_\theta(x) = N(\mu, \sigma^2)$, for which we have the identities
$$\frac{\partial}{\partial\mu}\int f(x)\, p_\theta(x)\, dx = \int f'(x)\, p_\theta(x)\, dx,$$
$$\frac{\partial}{\partial\sigma}\int f(x)\, p_\theta(x)\, dx = \int f'(x)\, \frac{(x - \mu)}{\sigma}\, p_\theta(x)\, dx,$$
$$\frac{\partial}{\partial\sigma}\int f(x)\, p_\theta(x)\, dx = 2\sigma \cdot \frac{1}{2}\int f''(x)\, p_\theta(x)\, dx, \qquad (3.6)$$
i.e. $\hat{I}_3(\sigma) = \sigma f''(x)$ with $\mathbb{E}_{p_\theta(x)}[\hat{I}_3(\sigma)] = I(\sigma)$. This is in comparison to the estimators above that use $f(x)$ or its first derivative. We refer to this estimator as a characteristic function estimator.
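As a quick illustration (a sketch of my own, not from the original text), the following snippet estimates the mean derivative for the example function $f(x) = x + x^2$ under $p_\theta(x) = N(0, 1)$ and prints the empirical mean and variance of each estimator; the sample size and function are illustrative choices.

```python
# Empirical variance of the three mean-derivative estimators for a Gaussian
# p_theta(x) = N(mu, sigma^2), with the illustrative choice f(x) = x + x^2, mu=0, sigma=1.
import numpy as np

rng = np.random.default_rng(0)
T, mu, sigma = 10**6, 0.0, 1.0

f = lambda x: x + x**2              # example function
f_prime = lambda x: 1.0 + 2.0 * x   # f'(x)

x = mu + sigma * rng.standard_normal(T)   # x ~ p_theta(x)

# (1) score function estimator: f(x) * d log p_theta(x) / d mu = f(x) (x - mu) / sigma^2
score = f(x) * (x - mu) / sigma**2
# (2) pathwise derivative estimator: f'(g(theta, eps)) * dg/dmu = f'(x)
pathwise = f_prime(x)
# (3) characteristic function estimator: the mean-derivative identity also gives f'(x)
characteristic = f_prime(x)

for name, est in [("score", score), ("pathwise", pathwise), ("characteristic", characteristic)]:
    print(f"{name:15s} mean={est.mean():.3f} variance={est.var():.3f}")
```

All three estimators agree in expectation, but their empirical variances differ by an order of magnitude for this function, matching the discussion that follows.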
These three MC estimators are believed to have decreasing estimator variances (1) >
(2) > (3). Before we analyse this variance, we offer an alternative derivation to Kingma
and Welling [2013]’s derivation of the pathwise derivative estimator.
Now, since $\delta(x - g(\theta, \epsilon))$ is zero for all $x$ apart from $x = g(\theta, \epsilon)$,
$$\frac{\partial}{\partial\theta}\int \left(\int f(x)\, \delta\big(x - g(\theta, \epsilon)\big)\, dx\right) p(\epsilon)\, d\epsilon = \frac{\partial}{\partial\theta}\int f(g(\theta, \epsilon))\, p(\epsilon)\, d\epsilon$$
$$= \int \frac{\partial}{\partial\theta} f(g(\theta, \epsilon))\, p(\epsilon)\, d\epsilon$$
$$= \int f'(g(\theta, \epsilon))\, \frac{\partial}{\partial\theta} g(\theta, \epsilon)\, p(\epsilon)\, d\epsilon.$$
This derivation raises an interesting question: why is it that here the function f (x)
depends on θ, whereas in the above (the score function estimator) it does not? A clue
into explaining this can be found through a measure theoretic view of the score function
estimator:
$$\frac{\partial}{\partial\theta}\int f(x)\, p_\theta(x)\, dx = \frac{\partial}{\partial\theta}\int f(x)\, p_\theta(x)\, d\lambda(x)$$
where $\lambda(x)$ is the Lebesgue measure. Here the measure does not depend on $\theta$, hence $x$ does not depend on $\theta$. Only the integrand $f(x)p_\theta(x)$ depends on $\theta$, therefore the estimator depends on $\frac{\partial}{\partial\theta} p_\theta(x)$ and $f(x)$. This is in comparison to the pathwise derivative estimator:
$$\frac{\partial}{\partial\theta}\int f(x)\, p_\theta(x)\, dx = \frac{\partial}{\partial\theta}\int f(x)\, dp_\theta(x).$$
Here the integration is w.r.t. the measure pθ (x), which depends on θ. As a result the
random variable x as a measurable function depends on θ, leading to f (x) being a
measurable function depending on $\theta$. Intuitively, the above can be seen as "stretching" and "contracting" the space following the density $p_\theta(x)$, leading to a function defined on this space that depends on $\theta$.
Table 3.3: Estimator variance for various functions $f(x)$ for the score function estimator, the pathwise derivative estimator, and the characteristic function estimator ((1), (2), and (3) above). On the left is mean derivative estimator variance, and on the right is standard deviation derivative estimator variance, both w.r.t. $p_\theta = N(\mu, \sigma^2)$ and evaluated at $\mu = 0, \sigma = 1$. The lowest estimator variance is in bold.
We assess empirical sample variance of T = 106 samples for the mean and standard
deviation derivative estimators. Tables 3.1 and 3.2 show estimator sample variance for
the integral derivative w.r.t. µ and σ respectively, for three different functions f (x). Even
though all estimators result in roughly the same means, their variances differ considerably.
For functions with slowly varying derivatives in a neighbourhood of zero (such as the
smooth function f (x) = x + x2 ) we have that the estimator variance obeys (1) > (2) >
(3). Also note that mean variance for (2) and (3) are identical, and that σ derivative
variance under the characteristic function estimator in this case is zero (since the second
derivative for this function is zero). Lastly, note that even though $f(x) = \sin(x)$ is smooth with bounded derivatives, the similar function $f(x) = \sin(10x)$ has variance (3) > (2) > (1). This is because the derivative of the function has high magnitude and varies rapidly near zero.
I will provide a simple property that a function has to satisfy for it to have lower variance under the pathwise derivative estimator and the characteristic function estimator than under the score function estimator. This will be shown for the mean derivative estimator.
Proposition 1. Let $f(x), f'(x), f''(x)$ be real-valued functions s.t. $f(x)$ is an indefinite integral of $f'(x)$, and $f'(x)$ is an indefinite integral of $f''(x)$. Assume that $\mathrm{Var}_{p_\theta(x)}\big((x-\mu)f(x)\big) < \infty$ and $\mathrm{Var}_{p_\theta(x)}\big(f'(x)\big) < \infty$, as well as $\mathbb{E}_{p_\theta(x)}\big(|(x-\mu)f'(x) + f(x)|\big) < \infty$ and $\mathbb{E}_{p_\theta(x)}\big(|f''(x)|\big) < \infty$, with $p_\theta(x) = N(\mu, \sigma^2)$.
If it holds that
$$\Big(\mathbb{E}_{p_\theta(x)}\big[(x-\mu)f'(x) + f(x)\big]\Big)^2 - \sigma^4\, \mathbb{E}_{p_\theta(x)}\big[f''(x)^2\big] \ge 0,$$
then the pathwise derivative and the characteristic function mean derivative estimators w.r.t. the function $f(x)$ will have lower variance than the score function estimator.
Before proving the proposition, I will give some intuitive insights into what the condition means. For this, assume for simplicity that $\mu = 0$ and $\sigma = 1$. This gives the simplified condition
$$\Big(\mathbb{E}_{p_\theta(x)}\big[x f'(x) + f(x)\big]\Big)^2 \ge \mathbb{E}_{p_\theta(x)}\big[f''(x)^2\big].$$
First, observe that all expectations are taken over $p_\theta(x)$, meaning that we only care about the functions' average behaviour near zero. Second, a function $f(x)$ with a large derivative absolute value will change considerably, hence will have high variance. Since the pathwise derivative estimator boils down to $f'(x)$ in the Gaussian case, and the score function estimator boils down to $xf(x)$ (shown in the proof), we wish the expected change in $\frac{\partial}{\partial x}\big(xf(x)\big) = xf'(x) + f(x)$ to be higher than the expected change in $\frac{\partial}{\partial x} f'(x) = f''(x)$, hence the condition.
Proof. We start with the observation that the score function mean estimator w.r.t. a Gaussian $p_\theta(x)$ is given by
$$\frac{\partial}{\partial\mu}\int f(x)\, p_\theta(x)\, dx = \int f(x)\, \frac{x - \mu}{\sigma^2}\, p_\theta(x)\, dx,$$
i.e. the score function estimator is $f(x)(x-\mu)/\sigma^2$ with variance $\mathrm{Var}_{p_\theta(x)}\big((x-\mu)f(x)\big)/\sigma^4$, whereas both the pathwise derivative and the characteristic function mean estimators reduce to $f'(x)$ with variance $\mathrm{Var}_{p_\theta(x)}\big(f'(x)\big)$.
Proposition 3.2 in [Cacoullos, 1982] states that for $g(x), g'(x)$ real-valued functions s.t. $g(x)$ is an indefinite integral of $g'(x)$, and $\mathrm{Var}_{p_\theta(x)}(g(x)) < \infty$ and $\mathbb{E}_{p_\theta(x)}(|g'(x)|) < \infty$, it holds that
$$\sigma^2\big(\mathbb{E}_{p_\theta(x)}[g'(x)]\big)^2 \le \mathrm{Var}_{p_\theta(x)}\big(g(x)\big) \le \sigma^2\, \mathbb{E}_{p_\theta(x)}\big[g'(x)^2\big].$$
Applying the upper bound to $f'(x)$, the proposition's condition, and the lower bound to $(x-\mu)f(x)$, we conclude
$$\sigma^4\, \mathrm{Var}_{p_\theta(x)}\big(f'(x)\big) \le \sigma^6\, \mathbb{E}_{p_\theta(x)}\big[f''(x)^2\big] \le \sigma^2\Big(\mathbb{E}_{p_\theta(x)}\big[(x-\mu)f'(x) + f(x)\big]\Big)^2 \le \mathrm{Var}_{p_\theta(x)}\big((x-\mu)f(x)\big),$$
as we wanted to show.

For the function $f(x) = x + x^2$ above, for example, the condition holds, and indeed the score function estimator variance is higher than that of the pathwise derivative estimator and that of the characteristic function estimator. A similar result can be derived for the standard deviation derivative estimator.
From empirical observation, the functions f (x) often encountered in VI seem to
satisfy the variance relation (1) > (2). For this reason, and since we will make use of
distributions other than Gaussian, we continue our work using the pathwise derivative
estimator.
In his work, Graves [2011] used both delta approximating distributions, as well as fully
factorised Gaussian approximating distributions. As such, Graves [2011] relied on Opper
and Archambeau [2009]’s characteristic function estimator in his approximation of eq.
(3.2). Further, Graves [2011] factorised the approximating distribution for each weight
scalar, losing weight correlations. This approach has led to the limitations discussed in
§2.2.2, hurting the method’s performance and practicality.
Using the tools above, and relying on the pathwise derivative estimator instead of
the characteristic function estimator in particular, we can make use of more interesting
non-Gaussian approximating distributions. Further, to avoid losing weight correlations,
we factorise the distribution for each weight row wl,i in each weight matrix Wl , instead
of factorising over each weight scalar. The reason for this will be given below. Using these two key changes, we will see below how our approximate inference can be closely tied to SRTs, suggesting a practical, well-performing implementation.
To use the pathwise derivative estimator we need to re-parametrise each $q_{\theta_{l,i}}(w_{l,i})$ as $w_{l,i} = g(\theta_{l,i}, \epsilon_{l,i})$ and specify some $p(\epsilon_{l,i})$ (this will be done at a later time). For simplicity of notation we will write $p(\epsilon) = \prod_{l,i} p(\epsilon_{l,i})$, and $\omega = g(\theta, \epsilon)$ collecting all model random variables. Starting from the data sub-sampling objective (eq. (3.2)), we re-parametrise each integral to integrate w.r.t. $p(\epsilon)$:
$$\hat{\mathcal{L}}_{\text{VI}}(\theta) = -\frac{N}{M}\sum_{i\in S}\int q_\theta(\omega)\log p(y_i|f^\omega(x_i))\, d\omega + \text{KL}(q_\theta(\omega)\,||\,p(\omega))$$
$$\qquad\quad\;\, = -\frac{N}{M}\sum_{i\in S}\int p(\epsilon)\log p(y_i|f^{g(\theta,\epsilon)}(x_i))\, d\epsilon + \text{KL}(q_\theta(\omega)\,||\,p(\omega))$$
and then replace each expected log likelihood term with its stochastic estimator (eq.
(3.5)), resulting in a new MC estimator:
$$\hat{\mathcal{L}}_{\text{MC}}(\theta) = -\frac{N}{M}\sum_{i\in S}\log p(y_i|f^{g(\theta,\epsilon)}(x_i)) + \text{KL}(q_\theta(\omega)\,||\,p(\omega)) \qquad (3.7)$$
Algorithm 1 (minimise the divergence between $q_\theta(\omega)$ and $p(\omega|X, Y)$): repeat the following until $\theta$ has converged. Sample a mini-batch $S$ and $\epsilon \sim p(\epsilon)$; compute a stochastic update direction $\widehat{\Delta\theta}$ from eq. (3.7); update $\theta \leftarrow \theta + \eta\,\widehat{\Delta\theta}$.
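The following sketch (a toy example of my own, not code from the original text) illustrates this loop under the assumption of a 1D linear model $y = wx + \text{noise}$ with a Gaussian likelihood, $q(w) = N(m, s^2)$ and prior $N(0, 1)$: at each step a mini-batch and an $\epsilon$ are sampled, the derivative of the estimator in eq. (3.7) is computed with the pathwise derivative estimator, and $\theta$ is updated. A single $\epsilon$ is drawn per step for simplicity, and the learning rate and step count are arbitrary.

```python
# Sketch of the optimisation loop above for a toy model y = w*x + noise.
import numpy as np

rng = np.random.default_rng(0)
N, M, tau = 100, 16, 1.0                    # dataset size, mini-batch size, noise precision
X = rng.uniform(-1.0, 1.0, N)
Y = 2.5 * X + rng.normal(0.0, tau ** -0.5, N)

m, log_s = 0.0, 0.0                         # variational parameters theta = {m, log s}
lr = 1e-3
for step in range(20000):
    idx = rng.choice(N, M, replace=False)   # random index set S
    eps = rng.standard_normal()             # eps ~ p(eps)
    s = np.exp(log_s)
    w = m + s * eps                         # re-parametrisation w = g(theta, eps)
    err = Y[idx] - w * X[idx]
    # derivative of -(N/M) sum_i log p(y_i | f^w(x_i)) w.r.t. w, for a Gaussian likelihood
    dw = -(N / M) * tau * np.sum(err * X[idx])
    # derivatives of KL(N(m, s^2) || N(0, 1)) w.r.t. m and log s
    dKL_dm, dKL_dlogs = m, s**2 - 1.0
    grad_m = dw + dKL_dm                    # chain rule: dw/dm = 1
    grad_logs = dw * eps * s + dKL_dlogs    # chain rule: dw/dlog(s) = eps * s
    m, log_s = m - lr * grad_m, log_s - lr * grad_logs

print("variational mean ~", round(m, 3), " std ~", round(np.exp(log_s), 3))
```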
Predictions with the optimised approximate posterior can then be made by approximating the predictive distribution:
$$\tilde{q}_\theta(y^*|x^*) := \frac{1}{T}\sum_{t=1}^{T} p(y^*|x^*, \hat{\omega}_t) \xrightarrow[T\to\infty]{} \int p(y^*|x^*, \omega)\, q_\theta(\omega)\, d\omega \qquad (3.8)$$
$$\qquad\qquad\quad\;\, \approx \int p(y^*|x^*, \omega)\, p(\omega|X, Y)\, d\omega = p(y^*|x^*, X, Y)$$
with $\hat{\omega}_t \sim q_\theta(\omega)$.
We next present distributions qθ (ω) corresponding to several SRTs, s.t. standard
techniques in the deep learning literature could be seen as identical to executing algorithm
1 for approximate inference with qθ (ω). This means that existing models that use such
SRTs can be interpreted as performing approximate inference. As a result, uncertainty
information can be extracted from these, as we will see in the following sections.
We will concentrate on dropout for the moment, and discuss alternative SRTs below.
Notation remark. In this section and the next, in order to avoid confusion between matrices (used as weights in a NN) and stochastic random matrices (which are random variables inducing a distribution over BNN weights), we change our notation slightly from §1.1. Here we use $M$ to denote a deterministic matrix over the reals, $W$ to denote a random variable defined over the set of real matrices², and use $\widehat{W}$ to denote a realisation of $W$.
$$\hat{y} = \hat{h}M_2 = (h \odot \hat{\epsilon}_2)M_2 = (h \cdot \mathrm{diag}(\hat{\epsilon}_2))M_2 = h(\mathrm{diag}(\hat{\epsilon}_2)M_2)$$
$$\quad\; = \sigma(\hat{x}M_1 + b)(\mathrm{diag}(\hat{\epsilon}_2)M_2) = \sigma\big((x \odot \hat{\epsilon}_1)M_1 + b\big)(\mathrm{diag}(\hat{\epsilon}_2)M_2) = \sigma\big(x(\mathrm{diag}(\hat{\epsilon}_1)M_1) + b\big)(\mathrm{diag}(\hat{\epsilon}_2)M_2);$$
writing $\widehat{W}_1 := \mathrm{diag}(\hat{\epsilon}_1)M_1$ and $\widehat{W}_2 := \mathrm{diag}(\hat{\epsilon}_2)M_2$ we end up with
$$\hat{y} = \sigma\big(x\widehat{W}_1 + b\big)\widehat{W}_2 =: f^{\widehat{W}_1, \widehat{W}_2, b}(x).$$
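The identity used above (dropping units is the same as dropping rows of the following weight matrix) can be checked numerically; a small sketch of my own, with arbitrary layer sizes:

```python
# Numerical check: (h * eps) @ M2 equals h @ (diag(eps) @ M2).
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((1, 5))             # output of the first layer (1 x K)
M2 = rng.standard_normal((5, 3))            # second weight matrix (K x D)
eps2 = rng.binomial(1, 0.5, size=5)         # Bernoulli dropout mask

unit_dropout = (h * eps2) @ M2              # dropout applied to the units
row_dropout = h @ (np.diag(eps2) @ M2)      # rows of M2 zeroed instead
assert np.allclose(unit_dropout, row_dropout)
```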
Recall that the dropout objective (eq. (3.9)) sums a Euclidean loss $E^{M_1, M_2, b}(x, y)$ over a mini-batch and adds weight-decay terms. The Euclidean loss can be rewritten as
$$E^{M_1, M_2, b}(x, y) = \frac{1}{2}||y - f^{M_1, M_2, b}(x)||^2 = -\frac{1}{\tau}\log p(y|f^{M_1, M_2, b}(x)) + \text{const} \qquad (3.10)$$
where $p(y|f^{M_1, M_2, b}(x)) = N(y; f^{M_1, M_2, b}(x), \tau^{-1}I)$ with $\tau^{-1}$ observation noise. It is simple to see that this holds for classification as well (in which case we should set $\tau = 1$).
Recall that $\hat{\omega} = \{\widehat{W}_1, \widehat{W}_2, b\}$ and write
$$\hat{\omega}_i = \{\widehat{W}_1^i, \widehat{W}_2^i, b\} = \{\mathrm{diag}(\hat{\epsilon}_1^i)M_1, \mathrm{diag}(\hat{\epsilon}_2^i)M_2, b\} =: g(\theta, \hat{\epsilon}_i)$$
with $\theta = \{M_1, M_2, b\}$, $\hat{\epsilon}_1^i \sim p(\epsilon_1)$, and $\hat{\epsilon}_2^i \sim p(\epsilon_2)$ for $1 \le i \le N$. Here $p(\epsilon_l)$ ($l = 1, 2$) is a product of Bernoulli distributions with probabilities $1 - p_l$, from which a realisation would be a vector of zeros and ones.
We can plug identity (3.10) into objective (3.9) and get
$$\hat{\mathcal{L}}_{\text{dropout}}(M_1, M_2, b) = -\frac{1}{M\tau}\sum_{i\in S}\log p(y_i|f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \lambda_1||M_1||^2 + \lambda_2||M_2||^2 + \lambda_3||b||^2 \qquad (3.11)$$
Taking the derivative w.r.t. $\theta$ we obtain
$$\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\text{dropout}}(\theta) = -\frac{1}{M\tau}\sum_{i\in S}\frac{\partial}{\partial\theta}\log p(y_i|f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \frac{\partial}{\partial\theta}\Big(\lambda_1||M_1||^2 + \lambda_2||M_2||^2 + \lambda_3||b||^2\Big).$$
Comparing this derivative to the derivative of $\hat{\mathcal{L}}_{\text{MC}}(\theta)$ optimised in algorithm 1, the two optimisation procedures differ only in
1. the regularisation term derivatives ($\text{KL}(q_\theta(\omega)\,||\,p(\omega))$ in algo. 1 and $\lambda_1||M_1||^2 + \lambda_2||M_2||^2 + \lambda_3||b||^2$ in algo. 2),
2. and the scale of $\widehat{\Delta\theta}$ (multiplied by a constant $\frac{1}{N\tau}$ in algo. 2).
Algorithm 2 (optimisation of a dropout NN): repeat the following until $\theta$ has converged. Sample a mini-batch $S$ and dropout realisations $\hat{\epsilon}_i$; compute a stochastic update direction $\widehat{\Delta\theta}$ from eq. (3.11); update $\theta \leftarrow \theta + \eta\,\widehat{\Delta\theta}$.
More specifically, if we define the prior p(ω) s.t. the following holds:
$$\frac{\partial}{\partial\theta}\text{KL}(q_\theta(\omega)\,||\,p(\omega)) = \frac{\partial}{\partial\theta} N\tau\Big(\lambda_1||M_1||^2 + \lambda_2||M_2||^2 + \lambda_3||b||^2\Big) \qquad (3.12)$$
(referred to as the KL condition), we would have the following relation between the derivatives of objective (3.11) and objective (3.7):
$$\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\text{dropout}}(\theta) = \frac{1}{N\tau}\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\text{MC}}(\theta)$$
with identical optimisation procedures!
We found that for a specific choice of the approximating distribution $q_\theta(\omega)$, VI results in an optimisation procedure identical to that of a dropout NN. I would stress that this means that optimising any neural network with dropout is equivalent to a form of approximate inference in a probabilistic interpretation of the model⁵. This means that the optimal weights found through the optimisation of a dropout NN (using algo. 2) are the same as the optimal variational parameters in a Bayesian NN with the same structure. Further, this means that a network already trained with dropout is a Bayesian NN, and thus possesses all the properties a Bayesian NN possesses.

⁵ Note that to get well-calibrated uncertainty estimates we have to optimise the dropout probability $p$ as well as $\theta$, for example through grid-search over validation log probability. This is discussed further in §4.3.
We have so far concentrated mostly on the dropout SRT. As for alternative SRTs, remember that an approximating distribution $q_\theta(\omega)$ is defined through its re-parametrisation $\omega = g(\theta, \epsilon)$, and various SRTs can be recovered through different re-parametrisations. For example, multiplicative Gaussian noise [Srivastava et al., 2014] can be recovered by setting $g(\theta, \epsilon) = \{\mathrm{diag}(\epsilon_1)M_1, \mathrm{diag}(\epsilon_2)M_2, b\}$ with $p(\epsilon_l)$ (for $l = 1, 2$) a product of $N(1, \alpha)$ distributions with positive-valued $\alpha$⁶. This can be efficiently implemented by multiplying a network's units by i.i.d. draws from $N(1, \alpha)$. On the other hand, setting $g(\theta, \epsilon) = \{M_1 \odot \epsilon_1, M_2 \odot \epsilon_2, b\}$ with $p(\epsilon_l)$ a product of Bernoulli random variables for each weight scalar recovers dropConnect [Wan et al., 2013]. This can be efficiently implemented by multiplying a network's weight scalars by i.i.d. draws from a Bernoulli distribution. It is interesting to note that Graves [2011]'s fully factorised approximation can be recovered by setting $g(\theta, \epsilon) = \{M_1 + \epsilon_1, M_2 + \epsilon_2, b\}$ with $p(\epsilon_l)$ a product of $N(0, \alpha)$ for each weight scalar. This SRT is often referred to as additive Gaussian noise. A sketch of these re-parametrisations is given below.

⁶ For multiplicative Gaussian noise this result was also presented in [Kingma et al., 2015], which was done in parallel to this work.
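A small sketch of these re-parametrisations (my own illustration; the layer sizes, dropout probability and $\alpha$ are arbitrary, and $\alpha$ is treated as a variance):

```python
# Re-parametrisations omega = g(theta, eps) for a single weight matrix M (K x D).
import numpy as np

rng = np.random.default_rng(0)
K, D, p, alpha = 5, 3, 0.5, 0.25
M = rng.standard_normal((K, D))

W_dropout = np.diag(rng.binomial(1, 1 - p, K)) @ M                 # dropout: zeroed rows
W_mult_gauss = np.diag(rng.normal(1.0, np.sqrt(alpha), K)) @ M     # multiplicative Gaussian noise
W_dropconnect = M * rng.binomial(1, 1 - p, size=M.shape)           # dropConnect: zeroed scalars
W_additive = M + rng.normal(0.0, np.sqrt(alpha), size=M.shape)     # additive Gaussian noise
```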
3.2.3 KL condition
For VI to result in an identical optimisation procedure to that of a dropout NN, the KL
condition (eq. (3.12)) has to be satisfied. Under what constraints does the KL condition
hold? This depends on the model specification (selection of prior p(ω)) as well as choice
of approximating distribution $q_\theta(\omega)$. For example, it can be shown that setting the model prior to $p(\omega) = \prod_{i=1}^{L} p(W_i) = \prod_{i=1}^{L} N(0, I/l_i^2)$, in other words independent normal priors over each weight, with prior length-scale⁷
$$l_i^2 = \frac{2N\tau\lambda_i}{1 - p_i} \qquad (3.13)$$
we have
$$\frac{\partial}{\partial\theta}\text{KL}(q_\theta(\omega)\,||\,p(\omega)) \approx \frac{\partial}{\partial\theta} N\tau\big(\lambda_1||M_1||^2 + \lambda_2||M_2||^2 + \lambda_3||b||^2\big)$$
for a large enough number of hidden units and a Bernoulli variational distribution. This is discussed further in appendix A. Alternatively, a discrete prior distribution
$$p(w) \propto e^{-\frac{l^2}{2}w^Tw}$$
defined over a finite space $w \in X$ satisfies the KL condition (eq. (3.12)) exactly. This is discussed in more detail in §6.5. For multiplicative Gaussian noise, Kingma et al. [2015] have shown that an improper log-uniform prior distribution satisfies our KL condition.

⁷ A note on mean-squared-error losses: the mean-squared-error loss can be seen as a scaling of the Euclidean loss (eq. (1.1)) by a factor of 2, which implies that the factor of 2 in the length-scale should be removed. The mean-squared-error loss is used in many modern deep learning packages instead of (or as) the Euclidean loss.
Notation remark. In the rest of this work we will use M and W interchangeably,
with the understanding that in deterministic NNs the random variable W will follow a
delta distribution with mean parameter M.
Note that placing a prior distribution $N(0, I/l^2)$ over $W_1$ can be replaced by scaling the inputs by $1/l$ with a $N(0, I)$ prior instead. For example, multiplying the inputs by 100 (making the function smoother) and placing a prior length-scale $l = 100$ would give identical model output to placing a prior length-scale $l = 1$ with the original inputs. This means that the length-scale's unit of measure is identical to the inputs' one.
What does the prior length-scale mean? To see this, consider a real valued function $f(x)$, periodic with period $P$, and consider its Fourier expansion with $K$ terms:
$$f_K(x) := \frac{A_0}{2} + \sum_{k=1}^{K} A_k \cdot \sin\Big(\frac{2\pi k}{P}x + \phi_k\Big).$$
This can be seen as a single hidden layer neural network with a non-linearity $\sigma(\cdot) := \sin(\cdot)$, input weights given by the Fourier frequencies $W_1 := [\frac{2\pi k}{P}]_{k=1}^{K}$ (which are fixed and not learnt), bias term of the hidden layer $b := [\phi_k]_{k=1}^{K}$, output weights given by the Fourier coefficients $W_2 := [A_k]_{k=1}^{K}$, and output bias $\frac{A_0}{2}$ (which can be omitted for centred data). For simplicity we assume that $W_1$ is composed of only the Fourier frequencies for which the Fourier coefficients are not zero. For example, $W_1$ might be composed of high frequencies, low frequencies, or a combination of the two.
This view of single hidden layer neural networks gives us some insights into the role
of the different quantities used in a neural network. For example, erratic functions
have high frequencies, i.e. high magnitude input weights W1 . On the other hand,
smooth slow-varying functions are composed of low frequencies, and as a result the
magnitude of W1 is small. The magnitude of W2 determines how much different
frequencies will be used to compose the output function fK (x). High magnitude W2
results in a large magnitude for the function’s outputs, whereas low W2 magnitude
gives function outputs scaled down and closer to zero.
When we place a prior distribution over the input weights of a BNN, we can capture
this characteristic. Having W1 ∼ N (0, I/l2 ) a priori with long length-scale l results
in weights with low magnitude, and as a result slow-varying induced functions. On
the other hand, placing a prior distribution with a short length-scale gives high
magnitude weights, and as a result erratic functions with high frequencies. This
will be demonstrated empirically in §4.1.
Given the intuition about weight magnitude above, equation (3.13) can be re-written to cast some light on the structure of the weight-decay in a neural networkᵃ:
$$\lambda_i = \frac{l_i^2(1 - p_i)}{2N\tau}. \qquad (3.14)$$
A short length-scale li (corresponding to high frequency data) with high pre-
cision τ (equivalently, small observation noise) results in a small weight-decay
λi —encouraging the model to fit the data well but potentially generalising badly.
A long length-scale with low precision results in a large weight-decay—and stronger
regularisation over the weights. This trade-off between the length-scale and model
precision results in different weight-decay values.
Lastly, I would comment on the choice of placing a distribution over the rows of a
weight matrix rather than factorising it over each row’s elements. Gal and Turner
[2015] offered a derivation related to the Fourier expansion above, where a function
drawn from a Gaussian process (GP) was approximated through a finite Fourier
decomposition of the GP’s covariance function. This derivation has many properties
in common with the view above. Interestingly, in the multivariate f (x) case the
Fourier frequencies are given in the columns of the equivalent weight matrix W1
of size Q (input dimension) by K (number of expansion terms). This generalises
the univariate case above where $W_1$ is of dimensions $Q = 1$ by $K$ and each entry (column) is a single frequency. Factorising the weight matrix approximating distribution $q_\theta(W_1)$ over its rows rather than columns captures correlations over the function's frequencies.
ᵃ Note that with a mean-squared-error loss the factor of 2 should be removed.
Our approximate predictive distribution is then given by
$$q_{\theta^*}(y^*|x^*) := \int p(y^*|f^\omega(x^*))\, q_{\theta^*}(\omega)\, d\omega \qquad (3.15)$$
where $\omega = \{W_i\}_{i=1}^{L}$ is our set of random variables for a model with $L$ layers, $f^\omega(x^*)$ is our model's stochastic output, and $q_{\theta^*}(\omega)$ is an optimum of eq. (3.7).
We will perform moment-matching and estimate the first two moments of the predictive
distribution empirically. The first moment can be estimated as follows:
Proposition 2. Given $p(y^*|f^\omega(x^*)) = N(y^*; f^\omega(x^*), \tau^{-1}I)$ for some $\tau > 0$, $\mathbb{E}_{q_{\theta^*}(y^*|x^*)}[y^*]$ can be estimated with the unbiased estimator
$$\widetilde{\mathbb{E}}[y^*] := \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*) \xrightarrow[T\to\infty]{} \mathbb{E}_{q_{\theta^*}(y^*|x^*)}[y^*] \qquad (3.16)$$
with $\hat{\omega}_t \sim q_{\theta^*}(\omega)$.
Proof.
$$\mathbb{E}_{q_{\theta^*}(y^*|x^*)}[y^*] = \int y^*\, q_{\theta^*}(y^*|x^*)\, dy^*$$
$$= \int\int y^*\, N(y^*; f^\omega(x^*), \tau^{-1}I)\, q_{\theta^*}(\omega)\, d\omega\, dy^*$$
$$= \int \left(\int y^*\, N(y^*; f^\omega(x^*), \tau^{-1}I)\, dy^*\right) q_{\theta^*}(\omega)\, d\omega$$
$$= \int f^\omega(x^*)\, q_{\theta^*}(\omega)\, d\omega,$$
giving the unbiased estimator $\widetilde{\mathbb{E}}[y^*] := \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)$ following MC integration with $T$ samples.
When used with dropout, we refer to this Monte Carlo estimate (3.16) as MC dropout.
In practice MC dropout is equivalent to performing T stochastic forward passes through
the network and averaging the results. For dropout, this result has been presented
in the literature before as model averaging [Srivastava et al., 2014]. We have given
a new derivation for this result which allows us to derive mathematically grounded
uncertainty estimates as well, and generalises to all SRTs (including SRTs such as
multiplicative Gaussian noise where the model averaging interpretation would result in
infinitely many models). Srivastava et al. [2014, section 7.5] have reasoned based on
empirical experimentation that the model averaging can be approximated by multiplying
each network unit hi by 1/(1 − pi ) at test time, referred to as standard dropout. This
can be seen as propagating the mean of each layer to the next. Below (in section §4.4)
we give results showing that there exist models in which standard dropout gives a bad
approximation to the model averaging.
We estimate the second raw moment (for regression) using the following proposition:
Proposition 3. Given $p(y^*|f^\omega(x^*)) = N(y^*; f^\omega(x^*), \tau^{-1}I)$ for some $\tau > 0$, $\mathbb{E}_{q_{\theta^*}(y^*|x^*)}\big[(y^*)^T(y^*)\big]$ can be estimated with the unbiased estimator
$$\widetilde{\mathbb{E}}\big[(y^*)^T(y^*)\big] := \tau^{-1}I + \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^T f^{\hat{\omega}_t}(x^*) \xrightarrow[T\to\infty]{} \mathbb{E}_{q_{\theta^*}(y^*|x^*)}\big[(y^*)^T(y^*)\big]$$
with $\hat{\omega}_t \sim q_{\theta^*}(\omega)$.
Proof.
$$\mathbb{E}_{q_{\theta^*}(y^*|x^*)}\big[(y^*)^T(y^*)\big] = \int \left(\int (y^*)^T(y^*)\, p(y^*|x^*, \omega)\, dy^*\right) q_{\theta^*}(\omega)\, d\omega$$
$$= \int \Big(\mathrm{Cov}_{p(y^*|x^*, \omega)}[y^*] + \mathbb{E}_{p(y^*|x^*, \omega)}[y^*]^T\, \mathbb{E}_{p(y^*|x^*, \omega)}[y^*]\Big)\, q_{\theta^*}(\omega)\, d\omega$$
$$= \int \Big(\tau^{-1}I + f^\omega(x^*)^T f^\omega(x^*)\Big)\, q_{\theta^*}(\omega)\, d\omega,$$
giving the unbiased estimator $\widetilde{\mathbb{E}}\big[(y^*)^T(y^*)\big] := \tau^{-1}I + \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^T f^{\hat{\omega}_t}(x^*)$ following MC integration with $T$ samples.
The model's predictive variance can then be estimated as
$$\widetilde{\mathrm{Var}}[y^*] := \tau^{-1}I + \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^T f^{\hat{\omega}_t}(x^*) - \widetilde{\mathbb{E}}[y^*]^T\, \widetilde{\mathbb{E}}[y^*]$$
which equals the sample variance of T stochastic forward passes through the NN plus
the inverse model precision.
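As an illustration, a minimal PyTorch sketch of these estimators (my own; `model`, `tau` and `T` are assumptions for the example, and `model.train()` is used only to keep the stochastic regularisation active at test time):

```python
# MC dropout predictive mean (eq. (3.16)) and predictive variance per output dimension.
import torch

def mc_dropout_predict(model, x, T=100, tau=1.0):
    model.train()                    # keep dropout (or another SRT) switched on at test time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])   # shape: (T, batch, output_dim)
    mean = samples.mean(dim=0)
    var = samples.var(dim=0) + 1.0 / tau    # sample variance + inverse model precision
    return mean, var
```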
How can we find the model precision? In practice in the deep learning literature we
often grid-search over the weight-decay λ to minimise validation error. Then, given a
weight-decay λi (and prior length-scale li ), eq. (3.13) can be re-written to find the model
precision8 :
(1 − p)li2
τ= . (3.17)
2N λi
⁸ Prior length-scale $l_i$ can be fixed based on the density of the input data $X$ and our prior belief as to the function's wiggliness, or optimised over as well (w.r.t. predictive log-likelihood over a validation set). The dropout probability is optimised using grid search similarly.
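A small helper implementing eq. (3.17) (a sketch of my own; the numerical values are purely illustrative):

```python
# Recover the model precision tau from the weight-decay found by grid-search,
# a chosen prior length-scale, the dropout probability and the dataset size (eq. (3.17)).
def model_precision(weight_decay, length_scale, dropout_p, n_points):
    return (1.0 - dropout_p) * length_scale**2 / (2.0 * n_points * weight_decay)

tau = model_precision(weight_decay=1e-4, length_scale=0.1, dropout_p=0.05, n_points=50_000)
```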
deep layers with delta approximating distributions since that would sacrifice our
ability to capture model uncertainty.
Given a dataset $X, Y$ and a new data point $x^*$ we can calculate the probability of possible output values $y^*$ using the predictive probability $p(y^*|x^*, X, Y)$. The log of the predictive likelihood captures how well the model fits the data, with larger values indicating better model fit. Our predictive log-likelihood (also referred to as test log-likelihood) can be approximated by MC integration of eq. (3.15) with $T$ terms:
$$\widetilde{\log p}(y^*|x^*, X, Y) := \log\left(\frac{1}{T}\sum_{t=1}^{T} p(y^*|x^*, \hat{\omega}_t)\right) \xrightarrow[T\to\infty]{} \log\int p(y^*|x^*, \omega)\, q_{\theta^*}(\omega)\, d\omega$$
$$\qquad\qquad\qquad\qquad\qquad\;\; \approx \log\int p(y^*|x^*, \omega)\, p(\omega|X, Y)\, d\omega$$
with $\hat{\omega}_t \sim q_{\theta^*}(\omega)$ and since $q_{\theta^*}(\omega)$ is the minimiser of eq. (2.3). Note that this is a biased estimator since the expected quantity is transformed with the non-linear logarithm function, but the bias decreases as $T$ increases.
For regression we can rewrite this last equation in a more numerically stable way9 :
$$\widetilde{\log p}(y^*|x^*, X, Y) = \mathrm{logsumexp}\Big(-\frac{1}{2}\tau\,||y^* - f^{\hat{\omega}_t}(x^*)||^2\Big) - \log T - \frac{1}{2}\log 2\pi + \frac{1}{2}\log\tau \qquad (3.18)$$
with the logsumexp taken over the $T$ terms indexed by $t$.
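A sketch of eq. (3.18) (my own; it assumes the $T$ stochastic forward passes are collected in an array, and the $2\pi$ and $\tau$ constants are scaled by the output dimensionality for multi-dimensional outputs):

```python
# Per-point predictive log-likelihood for regression, computed in a numerically stable way.
import numpy as np
from scipy.special import logsumexp

def predictive_log_likelihood(y_true, y_samples, tau):
    # y_samples: shape (T, D), one row per stochastic forward pass; y_true: shape (D,)
    T, D = y_samples.shape
    sq_err = np.sum((y_true - y_samples) ** 2, axis=1)       # ||y - f(x*)||^2 per pass
    return (logsumexp(-0.5 * tau * sq_err) - np.log(T)
            - 0.5 * D * np.log(2 * np.pi) + 0.5 * D * np.log(tau))
```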
In classification tasks, several measures can be used to summarise the uncertainty in the $T$ stochastic forward passes¹⁰. One such measure is the variation ratio: sampling a class label $y_t$ from each stochastic forward pass and denoting by $f_x$ the number of times the most-sampled class was sampled, we define
$$\text{variation-ratio}[x] := 1 - \frac{f_x}{T}. \qquad (3.19)$$
The variation ratio is a measure of dispersion: how "spread" the distribution is around the mode. In the binary case, the variation ratio attains its maximum of 0.5 when the two classes are sampled equally likely, and its minimum of 0 when only a single class is sampled.
It can be seen that the variation ratio approximates $1 - p(y = c^*|x, \mathcal{D}_{\text{train}})$ with $c^* = \underset{c=1,...,C}{\arg\max}\; p(y = c|x, \mathcal{D}_{\text{train}})$. This is because for $y_t$ the $t$'th class sampled for input $x$ we have¹¹
$$\frac{f_x}{T} = \frac{1}{T}\sum_t \mathbb{1}[y_t = c^*] \xrightarrow[T\to\infty]{} \mathbb{E}_{q_{\theta^*}(y|x)}\big[\mathbb{1}[y = c^*]\big] = q_{\theta^*}(y = c^*|x) \approx p(y = c^*|x, \mathcal{D}_{\text{train}})$$
and,
$$c^* = \underset{c=1,...,C}{\arg\max}\sum_t \mathbb{1}[y_t = c] = \underset{c=1,...,C}{\arg\max}\frac{1}{T}\sum_t \mathbb{1}[y_t = c] \xrightarrow[T\to\infty]{} \underset{c=1,...,C}{\arg\max}\;\mathbb{E}_{q_{\theta^*}(y|x)}\big[\mathbb{1}[y = c]\big].$$

¹⁰ These approaches are necessary since the probability vector resulting from a deterministic forward pass through the model does not capture confidence, as explained in figure 1.3.
¹¹ Here $\mathbb{1}[\cdot]$ is the indicator function.
Unlike variation ratios, predictive entropy has its foundations in information theory. This quantity captures the average amount of information contained in the predictive distribution:
$$\mathbb{H}[y|x, \mathcal{D}_{\text{train}}] := -\sum_c p(y = c|x, \mathcal{D}_{\text{train}})\log p(y = c|x, \mathcal{D}_{\text{train}}) \qquad (3.20)$$
summing over all possible classes c that y can take. Given a test point x, the predictive
entropy attains its maximum value when all classes are predicted to have equal uniform
probability, and its minimum value of zero when one class has probability 1 and all others
probability 0 (i.e. the prediction is certain).
In our setting, the predictive entropy can be approximated by collecting the probability vectors from $T$ stochastic forward passes through the network, and for each class $c$ averaging the probabilities of the class from each of the $T$ probability vectors, replacing $p(y = c|x, \mathcal{D}_{\text{train}})$ in eq. (3.20). In other words, we replace $p(y = c|x, \mathcal{D}_{\text{train}})$ with $\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t)$, where $p(y = c|x, \hat{\omega}_t)$ is the probability of input $x$ to take class $c$ with model parameters $\hat{\omega}_t \sim q_{\theta^*}(\omega)$:
$$\big[p(y = 1|x, \hat{\omega}_t), ..., p(y = C|x, \hat{\omega}_t)\big] := \mathrm{Softmax}(f^{\hat{\omega}_t}(x)).$$
Then,
$$\tilde{\mathbb{H}}[y|x, \mathcal{D}_{\text{train}}] := -\sum_c \left(\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t)\right)\log\left(\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t)\right)$$
$$\xrightarrow[T\to\infty]{} -\sum_c \left(\int p(y = c|x, \omega)\, q_{\theta^*}(\omega)\, d\omega\right)\log\left(\int p(y = c|x, \omega)\, q_{\theta^*}(\omega)\, d\omega\right)$$
$$\approx -\sum_c \left(\int p(y = c|x, \omega)\, p(\omega|\mathcal{D}_{\text{train}})\, d\omega\right)\log\left(\int p(y = c|x, \omega)\, p(\omega|\mathcal{D}_{\text{train}})\, d\omega\right)$$
$$= -\sum_c p(y = c|x, \mathcal{D}_{\text{train}})\log p(y = c|x, \mathcal{D}_{\text{train}}) = \mathbb{H}[y|x, \mathcal{D}_{\text{train}}]$$
with $\hat{\omega}_t \sim q_{\theta^*}(\omega)$ and since $q_{\theta^*}(\omega)$ is the optimum of eq. (3.7). Note that this is a biased estimator since the unbiased estimator $\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t) \xrightarrow[T\to\infty]{} \int p(y = c|x, \omega)\, q_{\theta^*}(\omega)\, d\omega$ is transformed through the non-linear function $\mathbb{H}[\cdot]$. The bias of this estimator will decrease as $T$ increases.
As an alternative to the predictive entropy, the mutual information between the
prediction y and the posterior over the model parameters ω offers a different measure of
uncertainty:
$$\mathbb{I}[y, \omega|x, \mathcal{D}_{\text{train}}] := \mathbb{H}[y|x, \mathcal{D}_{\text{train}}] - \mathbb{E}_{p(\omega|\mathcal{D}_{\text{train}})}\Big[\mathbb{H}[y|x, \omega]\Big]$$
$$= -\sum_c p(y = c|x, \mathcal{D}_{\text{train}})\log p(y = c|x, \mathcal{D}_{\text{train}}) + \mathbb{E}_{p(\omega|\mathcal{D}_{\text{train}})}\Big[\sum_c p(y = c|x, \omega)\log p(y = c|x, \omega)\Big]$$
with c the possible classes y can take. This tractable view of the mutual information
was suggested in [Houlsby et al., 2011] in the context of active learning. Test points x
that maximise the mutual information are points on which the model is uncertain on
average, yet there exist model parameters that erroneously produce predictions with high
confidence.
The mutual information can be approximated in our setting in a similar way to the
predictive entropy approximation:
$$\tilde{\mathbb{I}}[y, \omega|x, \mathcal{D}_{\text{train}}] := -\sum_c \left(\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t)\right)\log\left(\frac{1}{T}\sum_t p(y = c|x, \hat{\omega}_t)\right) + \frac{1}{T}\sum_{c,t} p(y = c|x, \hat{\omega}_t)\log p(y = c|x, \hat{\omega}_t)$$
$$\xrightarrow[T\to\infty]{} \mathbb{H}[y|x, \mathcal{D}_{\text{train}}] - \mathbb{E}_{q_{\theta^*}(\omega)}\Big[\mathbb{H}[y|x, \omega]\Big]$$
with $\hat{\omega}_t \sim q_{\theta^*}(\omega)$.
Some intuition. To understand the different measures for uncertainty, we shall look
at three concrete examples in binary classification of dogs and cats given an input image.
More specifically, we will look at the sets of probability vectors obtained from multiple
stochastic forward passes, and the uncertainty measures resulting from these sets. The
three examples are where the probabilities for the class "dog" in all vectors are
1. all equal to 1 (i.e. the probability vectors collected are $\{(1, 0), ..., (1, 0)\}$),
2. all equal to 0.5 (i.e. the probability vectors collected are $\{(0.5, 0.5), ..., (0.5, 0.5)\}$), and
3. half of the probabilities sampled equal to 0 and half of the probabilities equal to 1 (i.e. the probability vectors collected are $\{(1, 0), (0, 1), (0, 1), ..., (1, 0)\}$ for example).
In example (1) the prediction has high confidence, whereas in examples (2) and (3) the
prediction has low confidence. These are examples of predictive uncertainty. Compared
to this notion of confidence, in examples (1) and (2) the model is confident about its
output since it gives identical probabilities in multiple forward passes. On the other
hand, in the last example (3) the model is uncertain about its output, corresponding to
the case in figure 1.3 (where the layer before the softmax has high uncertainty). This is
an example of model uncertainty.
In example (1), variation ratios, predictive entropy, and the mutual information would all return value 0, all measures indicating high confidence. In example (3) variation ratios would return 0.5, and predictive entropy and the mutual information would return their maximum value (the entropy of a uniform distribution over the two classes), all measures indicating high uncertainty. All three measures of uncertainty agree on these two examples.
However, in example (2) variation ratios and predictive entropy would again return their maximum values, whereas the mutual information would return value 0. In this case variation ratios and predictive entropy capture the uncertainty in the prediction, whereas the mutual information captures the model's confidence in its output. This information can be used
for example in active learning, and will be demonstrated in section §5.2.
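The following sketch (my own) computes the three measures for these examples from the collected probability vectors; predictive entropy and mutual information are reported in nats here.

```python
# Variation ratio (eq. (3.19)), predictive entropy (eq. (3.20)) and mutual information
# from T probability vectors collected over stochastic forward passes.
import numpy as np

def uncertainty_measures(probs, rng=np.random.default_rng(0)):
    T, C = probs.shape
    sampled = np.array([rng.choice(C, p=p) for p in probs])   # y_t sampled from each pass
    variation_ratio = 1.0 - np.bincount(sampled, minlength=C).max() / T
    mean_p = probs.mean(axis=0)
    eps = 1e-12                                               # avoids log(0)
    entropy = -np.sum(mean_p * np.log(mean_p + eps))
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return variation_ratio, entropy, entropy - expected_entropy

T = 1000
ex1 = np.tile([1.0, 0.0], (T, 1))                      # all (1, 0)
ex2 = np.tile([0.5, 0.5], (T, 1))                      # all (0.5, 0.5)
ex3 = np.tile([[1.0, 0.0], [0.0, 1.0]], (T // 2, 1))   # half (1, 0), half (0, 1)
for name, ex in [("(1)", ex1), ("(2)", ex2), ("(3)", ex3)]:
    print(name, uncertainty_measures(ex))
```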
First, even though the training time of our model is identical to that of existing models
in the field, the test time is scaled by T —the number of averaged forward passes through
the network. This may not be of real concern in some real world applications, as NNs are
often implemented on distributed hardware. Distributed hardware allows us to obtain
MC estimates in constant time almost trivially, by transferring an input to a GPU and
setting a mini-batch composed of the same input multiple times. In dropout for example
we sample different Bernoulli realisations for each output unit and each mini-batch input,
which results in a matrix of probabilities. Each row in the matrix is the output of the
dropout network on the same input generated with different random variable realisations
(dropout masks). Averaging over the rows results in the MC dropout estimate.
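A sketch of this batching trick (my own illustration; it relies on the fact that standard dropout implementations sample an independent mask per row of a mini-batch, and assumes `x` is a single input vector):

```python
# Obtain T MC dropout samples from a single forward pass by repeating the input in a batch.
import torch

def mc_dropout_batched(model, x, T=50):
    model.train()                               # dropout stays active
    with torch.no_grad():
        batch = x.unsqueeze(0).repeat(T, 1)     # (T, input_dim): same input, T times
        out = model(batch)                      # each row sees a different dropout mask
    return out.mean(dim=0), out.var(dim=0)
```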
Another concern is that the model’s uncertainty is not calibrated. A calibrated
model is one in which the predictive probabilities match the empirical frequency of the
data. The lack of calibration can be seen through the derivation’s relation to Gaussian
processes [Gal and Ghahramani, 2015b]. Gaussian processes’ uncertainty is known
to not be calibrated—the Gaussian process’s uncertainty depends on the covariance
function chosen, which is shown in [Gal and Ghahramani, 2015b] to be equivalent to
the non-linearities and prior over the weights. The choice of a GP’s covariance function
follows from our assumptions about the data. If we believe, for example, that the model’s
uncertainty should increase far from the data we might choose the squared exponential
covariance function.
For many practical applications the lack of calibration means that model uncertainty
can increase for large magnitude data points or be of different scale for different datasets.
To calibrate model uncertainty in regression tasks we can scale the uncertainty linearly
to remove data magnitude effects, and manipulate uncertainty percentiles to compare
among different datasets. This can be done by finding the number of validation set
points having larger uncertainty than that of a test point. For example, if a test point
has predictive standard deviation 5, whereas almost all validation points have standard
deviation ranging from 0.2 to 2, then the test point’s uncertainty value will be in the top
percentile of the validation set uncertainty measures, and the model will be considered as
very uncertain about the test point compared to the validation data. However, another
model might give the same test point predictive standard deviation of 5 with most of
the validation data given predictive standard deviation ranging from 10 to 15. In this
model the test point’s uncertainty measure will be in the lowest percentile of validation
set uncertainty measures, and the model will be considered as fairly confident about the
test point with respect to the validation data.
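A sketch of the percentile heuristic just described (my own):

```python
# Rank a test point's predictive standard deviation against the validation set's.
import numpy as np

def uncertainty_percentile(test_std, val_stds):
    """Fraction of validation points with predictive std below the test point's."""
    return float(np.mean(np.asarray(val_stds) < test_std))

# e.g. a test std of 5 against validation stds in [0.2, 2] gives ~1.0 (highly uncertain
# relative to the validation data), while against stds in [10, 15] it gives 0.0.
```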
We now develop the tools above for image based models (CNNs). Convolving the kernels with the input (with a given stride $s$) results in an output layer of dimensions $y \in \mathbb{R}^{H'_{i-1}\times W'_{i-1}\times K_i}$, with $H'_{i-1}$ and $W'_{i-1}$ being the new height and width, and $K_i$ channels, the number of kernels. Each element $y_{i,j,k}$ is the sum of the element-wise product of kernel $k_k$ with a corresponding patch in the input image $x$: $[[x_{i-h/2, j-w/2, 1}, ..., x_{i+h/2, j+w/2, 1}], ..., [x_{i-h/2, j-w/2, K_{i-1}}, ..., x_{i+h/2, j+w/2, K_{i-1}}]]$.
To integrate over the kernels, we reformulate the convolution as a linear operation. Let $k_k \in \mathbb{R}^{h\times w\times K_{i-1}}$ for $k = 1, ..., K_i$ be the CNN's kernels with height $h$, width $w$, and $K_{i-1}$ channels in the $i$'th layer. The input to the layer is represented as a 3 dimensional tensor $x \in \mathbb{R}^{H_{i-1}\times W_{i-1}\times K_{i-1}}$ with height $H_{i-1}$, width $W_{i-1}$, and $K_{i-1}$ channels. Convolving the kernels with the input with a given stride $s$ is equivalent to extracting patches from the input and performing a matrix product: we extract $h \times w \times K_{i-1}$ dimensional patches from the input with stride $s$ and vectorise these. Collecting the vectors in the rows of a matrix we obtain a new representation for our input $x \in \mathbb{R}^{n\times hwK_{i-1}}$ with $n$ patches. The vectorised kernels form the columns of the weight matrix $W_i \in \mathbb{R}^{hwK_{i-1}\times K_i}$. The convolution operation is then equivalent to the matrix product $xW_i \in \mathbb{R}^{n\times K_i}$. The columns of the output can be re-arranged into a 3 dimensional tensor $y \in \mathbb{R}^{H_i\times W_i\times K_i}$ (since $n = H_i \times W_i$). Pooling can then be seen as a non-linear operation on the matrix $y$. Note that the pooling operation is a non-linearity applied after the linear convolution, analogous to ReLU or Tanh non-linearities.
To make the CNN into a probabilistic model we place a prior distribution over each kernel and approximately integrate each kernels-patch pair with Bernoulli variational distributions. We sample Bernoulli random variables $\epsilon_{i,j,n}$ and multiply patch $n$ by the weight matrix $W_i \cdot \mathrm{diag}([\epsilon_{i,j,n}]_{j=1}^{K_i})$. This product is equivalent to an approximating distribution modelling each kernel-patch pair with a distinct random variable, tying the means of the random variables over the patches. The distribution randomly sets kernels to zero for different patches. This approximating distribution is also equivalent to applying dropout for each element in the tensor $y$ before pooling. Implementing our Bayesian CNN is therefore as simple as using dropout after every convolution layer before pooling.
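A minimal PyTorch sketch of this recipe (my own illustration, assuming 28x28 single-channel inputs; the layer sizes and dropout probability are arbitrary). At test time the model is used with MC dropout as in eq. (3.8).

```python
# Bayesian CNN sketch: element-wise dropout after every convolution, before pooling.
import torch
import torch.nn as nn

class BayesianCNN(nn.Module):
    def __init__(self, n_classes=10, p=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Dropout(p),                 # dropout on the convolution output, before pooling
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Dropout(p),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(p), nn.Linear(64 * 7 * 7, n_classes))

    def forward(self, x):
        return self.classifier(self.features(x))
```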
The standard dropout test time approximation (scaling hidden units by 1/(1 − pi ))
does not perform well when dropout is applied after convolutions—this is a negative result
we identified empirically. We solve this by approximating the predictive distribution
following eq. (3.8), averaging stochastic forward passes through the model at test time
(using MC dropout). We assess the model above with an extensive set of experiments
studying its properties in §4.4.
We next consider sequence based models (RNNs). Consider a simple RNN with the recurrence $h_t = f_h^\omega(x_t, h_{t-1}) = \sigma(x_tW_h + h_{t-1}U_h + b_h)$ for some non-linearity $\sigma$. The model output can be defined, for example, as $f_y(h_T) = h_TW_y + b_y$.
This estimator is plugged into equation (3.1) to obtain our minimisation objective
$$\hat{\mathcal{L}}_{\text{MC}} = -\sum_{i=1}^{N}\log p\Big(y_i\,\Big|\,f_y^{\hat{\omega}_i}\big(f_h^{\hat{\omega}_i}(x_{i,T}, f_h^{\hat{\omega}_i}(...f_h^{\hat{\omega}_i}(x_{i,1}, h_0)...))\big)\Big) + \text{KL}(q(\omega)\,||\,p(\omega)). \qquad (3.21)$$
Certain RNN models such as LSTMs [Graves et al., 2013; Hochreiter and Schmidhuber, 1997] and GRUs [Cho et al., 2014] use different gates within the RNN units. For example, an LSTM is defined by setting four gates: "input", "forget", "output", and an "input modulation gate",
$$\underline{i} = \text{sigm}(h_{t-1}U_i + x_tW_i) \qquad\quad \underline{f} = \text{sigm}(h_{t-1}U_f + x_tW_f)$$
$$\underline{o} = \text{sigm}(h_{t-1}U_o + x_tW_o) \qquad\quad \underline{g} = \tanh(h_{t-1}U_g + x_tW_g)$$
$$c_t = \underline{f}\odot c_{t-1} + \underline{i}\odot\underline{g} \qquad\quad h_t = \underline{o}\odot\tanh(c_t) \qquad (3.22)$$
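A sketch (my own simplification, for the simple RNN unit rather than the LSTM above) of the kind of stochastic forward pass used here: dropout masks are sampled once per sequence and reused at every time step. Applying the masks to both the input and the recurrent connections is my own illustrative choice in this sketch.

```python
# Simple RNN with per-sequence dropout masks repeated at every time step.
import torch
import torch.nn as nn

class VariationalRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, p=0.2):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim)
        self.U = nn.Linear(hidden_dim, hidden_dim)
        self.hidden_dim, self.p = hidden_dim, p

    def forward(self, x_seq):                        # x_seq: (T, batch, input_dim)
        T, B, D = x_seq.shape
        h = x_seq.new_zeros(B, self.hidden_dim)
        keep = 1.0 - self.p
        mask_x = torch.bernoulli(x_seq.new_full((B, D), keep))               # one mask per sequence
        mask_h = torch.bernoulli(x_seq.new_full((B, self.hidden_dim), keep))
        outputs = []
        for t in range(T):                           # the same masks are reused at every step
            h = torch.tanh(self.W(x_seq[t] * mask_x) + self.U(h * mask_h))
            outputs.append(h)
        return torch.stack(outputs), h
```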
An embedding matrix is used in most language applications, yet it is often not regularised. Since the embedding matrix is optimised it can lead to overfitting, and it is therefore desirable to apply dropout to the one-hot encoded vectors. This in effect is identical to dropping words at random throughout the input sentence, and can also be interpreted as encouraging the model to not "depend" on single words for its output.
Note that as before, we randomly set rows of the matrix $W_E \in \mathbb{R}^{V\times D}$ to zero. Since we repeat the same mask at each time step, we drop the same words throughout the sequence, i.e. we drop word types at random rather than word tokens (as an example, the sentence "the dog and the cat" might become "— dog and — cat" or "the — and the cat", but never "— dog and the cat"). A possible inefficiency in implementing this is the requirement to sample $V$ Bernoulli random variables, where $V$ might be large. This can be solved by the observation that for sequences of length $T$, at most $T$ embeddings could be dropped (other dropped embeddings have no effect on the model output). For $T \ll V$ it is therefore more efficient to first map the words to the word embeddings, and only then to zero-out word embeddings based on their word type.
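A sketch of this trick (my own illustration; `nn.Embedding` and the token tensor are assumptions for the example):

```python
# Drop word types (not tokens): embed first, then zero embeddings per word type.
import torch
import torch.nn as nn

def embed_with_word_dropout(embedding: nn.Embedding, tokens: torch.Tensor, p: float = 0.1):
    # tokens: (T,) word indices of one sequence
    embedded = embedding(tokens)                              # (T, D)
    types = tokens.unique()
    keep = torch.bernoulli(torch.full((len(types),), 1.0 - p))
    dropped = types[keep == 0]
    mask = ~torch.isin(tokens, dropped)                       # same decision for every occurrence
    return embedded * mask.unsqueeze(-1).float()
```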
In the next chapter we will study the techniques above empirically and analyse
them quantitatively. This is followed by a survey of recent literature making use of
the techniques in real-world problems concerning AI safety, image processing, sequence
processing, active learning, and other examples.