Variational Inference Ref Paper
Abstract
Approximating complex probability densities is a core problem in modern statistics. In this paper,
we introduce the concept of Variational Inference (VI), a popular method in machine learning that
uses optimization techniques to estimate complex probability densities. Casting inference as an optimization problem typically allows VI to converge faster than classical methods such as Markov chain Monte Carlo sampling. Conceptually,
VI works by choosing a family of probability density functions and then finding the one closest to the
actual probability density—often using the Kullback-Leibler (KL) divergence as the optimization
metric. We introduce the Evidence Lower Bound (ELBO) as a tractable objective for this optimization and we review the ideas behind mean-field variational inference. Finally, we discuss the
applications of VI to variational auto-encoders (VAE) and VAE-Generative Adversarial Network
(VAE-GAN). With this paper, we aim to explain the concept of VI and assist in future research with
this approach.
taking expectations with respect to the variational distribution. Hence, the optimization problem in Equation 6 can be re-formulated as,

q*(Z) = argmin_{q(Z)∈Q} DKL(Q(Z) ‖ P(Z|X)).   (7)

Since q(Z) is selected from a tractable family of probability densities, computing expectations with respect to q is also tractable.

Forward vs. Reverse KL

Let P and Q be two distributions with probability density functions p and q, where q is an approximation of p. As stated earlier, the KL-divergence is non-symmetric (Shlens, 2014), i.e.,

DKL(P ‖ Q) ≠ DKL(Q ‖ P),

as such, minimizing the forward KL-divergence, DKL(P ‖ Q), yields different results than minimizing the reverse KL-divergence, DKL(Q ‖ P).

The forward KL-divergence is also known as the M-projection or moment projection (Murphy, 2013) and is defined as,

DKL(P ‖ Q) = E_{x∼P(X)}[log (p(x)/q(x))].

This will be large wherever the approximation fails to cover the actual probability distribution (Murphy, 2013); i.e.,

lim_{q(x)→0} p(x)/q(x) → ∞, where p(x) > 0.

So, wherever p(x) > 0, we must choose a probability density that ensures q(x) > 0 (Murphy, 2013). This behaviour is called zero avoiding and can intuitively be interpreted as q over-estimating p.

The reverse KL-divergence is also known as the I-projection or information projection (Murphy, 2013) and is defined as,

DKL(Q ‖ P) = E_{x∼Q(X)}[log (q(x)/p(x))],

where,

lim_{p(x)→0} q(x)/p(x) → ∞, where q(x) > 0.

The limit indicates the need to force q(x) = 0 wherever p(x) = 0, otherwise the KL-divergence would be very large. This behaviour is called zero forcing (Murphy, 2013) and can be interpreted as q under-estimating p. The difference between the two objectives is illustrated in Figure 2, based on Figure 21.1 from Murphy (2013).

3 ELBO: Evidence Lower Bound

As mentioned in Section 2, we select the probability density from a tractable family which has the lowest KL-divergence from the actual posterior density. Therefore, inference amounts to solving the optimization problem defined in Equation 7. However, optimizing Equation 7 is still not tractable because we are required to compute the evidence function. The KL-divergence objective function from Equation 7 can be written as,

D = E[log q(z)] − E[log p(z|x)],   (8)

where,

D = DKL(Q(Z) ‖ P(Z|X)),   (9)

and all expectations are taken by sampling z from Q(Z). We now expand the conditional probability density p(z|x) using Equation 1, giving us,

D = E[log q(z)] − E[log p(z, x)] + E[log p(x)].   (10)

Since all expectations are under Q(Z), E[log p(x)] is the constant log p(x). Therefore, we can re-write Equation 10 as,

D = E[log q(z)] − E[log p(z, x)] + log p(x).   (11)

The KL-divergence cannot be computed directly as it depends on the evidence. Therefore, we must optimize an alternative objective function that is equivalent to DKL up to an added constant,

−D + log p(x) = E[log p(z, x)] − E[log q(z)],
ELBO(Q) = E[log p(z, x)] − E[log q(z)],   (12)

where the term ELBO is an abbreviation for evidence lower bound. The ELBO is the sum of the negative KL-divergence and the constant term log p(x). Maximizing the ELBO is therefore equivalent to minimizing the KL-divergence (Blei et al., 2017).

An intuitive explanation of the ELBO can be derived by re-arranging the terms of Equation 12, as

ELBO(Q) = E[log p(z, x)] − E[log q(z)]
        = E[log p(x|z)] + E[log p(z) − log q(z)]
        = E[log p(x|z)] − DKL(Q(Z) ‖ P(Z)).   (13)

Thus, the ELBO is the expected log likelihood of the data minus the KL-divergence between the approximated posterior and the prior probability density. The expected log likelihood describes how well the chosen statistical model fits the data. The KL-divergence term encourages the variational probability density to stay close to the prior. Thus, the ELBO can be seen as a regularised fit to the data.

The ELBO lower-bounds the (log) evidence, log p(x). This property was explored by Jordan et al. (1999), where the authors used Jensen's inequality (Klaričić Bakula et al., 2008) to derive the relationship between the ELBO and the evidence function. The derivation is as follows:

log p(x) = log ∫_{z∈Z} p(x, z) dz
         = log ∫_{z∈Z} p(x, z) (q(z)/q(z)) dz
         = log E_{z∼Q(z)}[p(x, z)/q(z)]
         ≥ E_{z∼Q(z)}[log p(x, z)] − E_{z∼Q(z)}[log q(z)]
         ≥ ELBO(Q).   (14)
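To make Equations 12-14 concrete, the following sketch (ours, not taken from any reference above) estimates the ELBO by Monte Carlo for a toy conjugate model in which log p(x) is available in closed form, and checks that the estimate never exceeds the exact log evidence. The model, the variational family and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an illustrative assumption, not the paper's example):
#   prior      p(z)   = N(0, 1)
#   likelihood p(x|z) = N(z, 1)
# For this conjugate pair the evidence p(x) = N(x; 0, 2) is known exactly,
# so we can check that the Monte Carlo ELBO of Equation 12 stays below log p(x).

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo(x, m, s2, n_samples=100_000):
    """ELBO(Q) = E_q[log p(z, x)] - E_q[log q(z)], estimated by sampling z ~ q."""
    z = m + np.sqrt(s2) * rng.standard_normal(n_samples)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
    log_q = log_normal(z, m, s2)
    return np.mean(log_joint - log_q)

x_obs = 1.3
log_evidence = log_normal(x_obs, 0.0, 2.0)                        # exact log p(x)
print("log p(x)               :", log_evidence)
print("ELBO at a rough q      :", elbo(x_obs, m=0.0, s2=1.0))
print("ELBO at the true post. :", elbo(x_obs, m=x_obs / 2, s2=0.5))
```

The gap between log p(x) and the ELBO is exactly the KL-divergence of Equation 11, so the bound becomes tight when q matches the true posterior.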
Figure 2: Forward vs. reverse KL-divergence on a bimodal distribution. The blue and the red contours represent the actual probability density, p, and the unimodal approximation, q, respectively. The left panel shows forward KL-divergence minimization, where q tends to cover p. The centre and right panels show reverse KL-divergence minimization, where q locks on to one of the two modes.
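The zero-avoiding versus zero-forcing behaviour can also be seen numerically. The sketch below (our own; the bimodal target, the two candidate approximations and the grid are assumptions, not the exact setup of Figure 2) evaluates both divergences for a covering approximation and a mode-seeking approximation.

```python
import numpy as np

# Bimodal target p: an equal mixture of N(-3, 1) and N(3, 1) (illustrative choice).
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

p = 0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1)

def kl(a, b):
    """Grid approximation of D_KL(a || b) = E_a[log a/b]."""
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Two unimodal candidates: one broad density covering both modes,
# one narrow density locked onto a single mode.
q_cover = normal_pdf(x, 0, 3.2)   # covers p
q_mode  = normal_pdf(x, 3, 1)     # sits on one mode only

for name, q in [("covering q", q_cover), ("mode-seeking q", q_mode)]:
    print(f"{name:>15}:  forward KL(P||Q) = {kl(p, q):6.3f}   reverse KL(Q||P) = {kl(q, p):6.3f}")
```

The forward KL strongly penalises the mode-seeking candidate because p places mass where that q is nearly zero, while the reverse KL prefers it, which is the behaviour sketched in Figure 2.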
This relationship between the ELBO and log p(x) has motivated researchers to use the variational lower bound as the criterion for model selection. This bound serves as a good approximation to the marginal likelihood, providing a basis for model selection. Applications of VI for model selection have been explored in a wide variety of tasks, such as for mixture models (McGrory and Titterington, 2007), cross-validation based model selection (Nott et al., 2012) and, in a more general setting, by Bernardo et al. (2003).

4 Mean field variational family

We briefly introduce the mean field variational family for VI, where the latent variables are assumed to be mutually independent, each governed by a distinct factor in the variational probability density (see Bishop, 2006, for a more detailed explanation). This assumption greatly simplifies the complexity of the optimization process. A generic member of the mean field variational family is

q(Z|X) = ∏_{j=1}^{m} qj(Zj),   (15)

where m is the number of latent variables. The observed data X does not appear in Equation 15, therefore any probability density from this variational family is not a model of the data. Instead, it is the ELBO, and the corresponding KL-divergence minimization problem, which connects the fitted variational probability density to the data and model (Blei et al., 2017; Murphy, 2013).

5 A toy problem

In this section, we explore, in detail, how VI can be used to approximate a mixture of Gaussians. Consider N real-valued data-points x = x1, x2, ..., xN sampled from a mixture of K univariate Gaussians with means µ1, µ2, ..., µK. We assume that the variance of the means' prior is a fixed hyperparameter σ², while the observation variance is one. For this problem, a single data-point is defined as,

xi ∼ N(ciᵀµ, 1) for i = 1, 2, ..., N,   (16)

where each mixture mean is drawn from a distribution,

µj ∼ N(0, σ²) for j = 1, 2, ..., K,   (17)

and each data-point is assigned to a cluster using,

ci ∼ U(K) for i = 1, 2, ..., N,   (18)

where ci is a K-dimensional one-hot vector; the latent variables of the model are µ and c. In our case, the one-hot vector is a K-dimensional binary vector where each dimension represents a cluster. A data-point belonging to the l-th cluster is represented by a value of one in the l-th dimension of the one-hot vector, while the (K − 1) remaining dimensions have a value of zero.

We assume the approximate posterior probability density to be from the mean-field variational family (Section 4). Thus, the variational parameterization is given by,

q(µ, c) = ∏_{j=1}^{K} q(µj; mj, sj²) ∏_{i=1}^{N} q(ci; φi).   (19)

Each latent variable is governed by its own variational factor (Blei et al., 2017). Here, the mixture components are Gaussian with variational parameters (mean mk and variance sk²) specific to the k-th cluster. The cluster assignments are categorical with variational parameters (a K-dimensional vector of cluster probabilities φi) specific to the i-th data point.

The definition of the ELBO from Equation 12 applied to this specific case is,

ELBO_{m,s²,φ} = E[log p(x, µ, c)] − E[log q(µ, c)],   (20)

where ELBO_{m,s²,φ} is shorthand for ELBO(m, s², φ).

We maximize the ELBO to derive the optimal values of the variational parameters (see Appendix A). The optimal
values for m, s and φ are given by,

mj* = (Σi φij xi) / (1/σ² + Σi φij),   (21)

(sj²)* = 1 / (1/σ² + Σi φij),   (22)

φij* ∝ exp( −(mj² + sj²)/2 + xi mj ).   (23)

We employ the Coordinate Ascent VI (CAVI) algorithm to optimize the ELBO. Algorithm 1 (Blei et al., 2017) describes the steps to optimize the ELBO using CAVI.

Algorithm 1: CAVI for a Gaussian mixture model
  Data: Data x1:n, K mixture components and prior variance of component means σ²
  Result: Variational densities q(µj; mj, sj²) and q(ci; φi)
  m = m1:K, s² = s²1:K, φ = φ1:N ← initialize variational parameters
  while the ELBO has not converged do
    for i ∈ 1, ..., N do
      Set φij ∝ exp( −(mj² + sj²)/2 + xi mj )
    end
    for j ∈ 1, ..., K do
      Set mj ← (Σi φij xi) / (1/σ² + Σi φij)
      Set sj² ← 1 / (1/σ² + Σi φij)
    end
    Compute ELBO(m, s², φ)
  end
  return q(m, s², φ)

For our experimental setup, we select K = 3, i.e., a mixture of three univariate Gaussians. We generate the data by randomly sampling 1000 data-points from each of the three Gaussians. The left panel in Figure 3 illustrates how the algorithm approximates the parameters of each individual Gaussian by maximizing the ELBO. We allow a maximum of 1000 iterations; however, the ELBO converges by iteration 60, as illustrated in the right panel of Figure 3.

Figure 3: Left: a histogram showing the distribution of data sampled from the three univariate Gaussians. The curved lines indicate the fit obtained by maximising the ELBO. Right: an illustration of the convergence of the ELBO using CAVI.

6 Applications

We have established the optimization process for VI; next we look at some applications in generative modelling.

6.1 VAE: Variational Auto-Encoder

An auto-encoder is a neural network that aims to learn or encode a low-dimensional representation for high-dimensional data, e.g. images. Different variants of auto-encoders exist that aim to learn meaningful representations of high-dimensional data. One such variant is the Variational Auto-Encoder (VAE). Introduced by Kingma and Welling (2013), the VAE is a statistical model which is essentially a stochastic variational inference algorithm. The VAE uses the concept of variational inference to compress the high-dimensional data into a latent vector while assuming a multi-variate distribution as a prior for the same latent vector. The statistical model uses gradient backpropagation to approximate the posterior distribution for the latent vector. For large data sets, we update the VAE's parameters using small mini-batches or even single data points.

Since their inception, VAEs have been widely used for generative modelling. They are easy to implement, converge faster than MCMC methods, and scale efficiently to large data sets; this makes them ideal for generative modelling of image data. However, the images they generate tend to have reduced quality compared to the input images. This is an effect of minimizing the reverse KL-divergence, which results in the approximate distribution being locked to one of the modes, as explained in Section 3.

Kingma and Welling (2013) introduce the stochastic variational inference algorithm (VAE) that reparameterizes the variational lower bound, yielding a lower bound estimator that can be optimized using standard stochastic gradient methods (Kingma and Welling, 2013).

Figure 4: Illustration of the type of directed graphical model under consideration with N observed data-points. Solid lines denote the generative model, dashed lines denote the variational approximation to the intractable posterior density. The variational parameters φ are learned jointly with the generative model parameters θ (Kingma and Welling, 2013).

The recognition model, qφ(Z|X), can be interpreted as a probabilistic encoder, since given a data point x the encoder produces a latent vector z. This latent vector is then used to generate a sample from the likelihood density, pθ(X|Z), and, therefore, the generative model can be interpreted as a probabilistic decoder (Kingma and Welling, 2013). The probability density qφ(Z|X) serves as an approximation of the actual posterior probability density pθ(Z|X). Using a mini-batch of data points sampled from X, the encoder transforms these data points into the latent space, Z, which the decoder uses to generate the samples in X.

As established in Equation 13, maximizing the ELBO is equivalent to minimizing the KL-divergence (Blei et al., 2017). In order to jointly optimize the recognition model and the generative model on mini-batches of data, we differentiate and optimize the lower bound with respect to both the variational and the generative parameters, φ and
θ. In this case, Equation 13 can be rewritten as,

DZ,X,φ,θ = DKL(Qφ(Z|X) ‖ Pθ(Z)),
L(φ, θ; x) = −DZ,X,φ,θ + E_{z∼Qφ(Z|X)}[log pθ(x|z)].   (24)

If we look closely at the terms on the right-hand side of Equation 24, we can see the connection to auto-encoders: the second term is the negative of the expected reconstruction error, and the KL-divergence term can be interpreted as a regularization term. In order to optimize Equation 24 using standard gradient-based techniques, Kingma and Welling (2013) introduce the Auto-Encoding Variational Bayes (AEVB) algorithm to efficiently compute the gradient of the ELBO in Equation 24. For a chosen approximate posterior qφ(z|x), we re-parameterize the random variable z ∼ qφ(z|x) with a differentiable transformation gφ(ε, x) of a noise variable ε, such that,

z = gφ(ε, x),   ε ∼ P(ε).

This re-parameterization allows us to form Monte Carlo estimates of expectations of the function log pθ(x|z). Forming Monte Carlo estimates enables us to numerically evaluate the expectation,

E_{z∼Qφ(Z|X)}[log pθ(x|z)].

We draw independent samples, z^(i,l), from the variational distribution, Qφ(Z|X), and then compute the average of the function evaluated at these samples (Mohamed et al., 2020). Therefore, the Monte Carlo estimate of the expectation of the function log pθ(x^(i)|z) when z ∼ qφ(z|x^(i)) is as follows,

E_{z∼Qφ(Z|x^(i))}[log pθ(x^(i)|z)] ≈ (1/L) Σ_{l=1}^{L} log pθ(x^(i)|z^(i,l)),

where ε^(l) ∼ P(ε), z^(i,l) = gφ(ε^(l), x^(i)) and L is the number of samples per data point.

Applying the above re-parameterization to the variational lower-bound of Equation 24, we can re-formulate the ELBO as a Stochastic Gradient Variational Bayes (SGVB) estimator L(φ, θ; x^(i)), as

L(φ, θ; x^(i)) ≈ −DZ,X,φ,θ + (1/L) Σ_{l=1}^{L} log pθ(x^(i)|z^(i,l)).   (25)

We re-parameterize the variational lower bound in terms of a deterministic transformation of an auxiliary noise variable, which enables us to use gradient-based optimizers on mini-batches of data. This further enables the optimization of the parameters of the distribution while still maintaining the ability to randomly sample from that distribution (Doersch, 2016).

Figure 5: The learning process in a typical VAE using gradient back-propagation.

In order to simplify the calculations, we assume the variational approximate posterior to be a multi-variate Gaussian with a diagonal co-variance structure (Kingma and Welling, 2013). For the prior, we assume a multivariate Gaussian N(z; 0, I), where,

log qφ(z|x^(i)) = log N(z; µ^(i), (σ^(i))² I),
pθ(z) = N(z; 0, I),
z^(i) = µ^(i) + σ^(i) ⊙ ε,
ε ∼ N(0, I).

Figure 6: A diagram showing a variational autoencoder model.

Using the above parameterization, the KL-divergence term in Equation 25 can be derived as Equation 27 (as shown in
Box 1). Subsequently, Equation 25 can be used to define the loss function for the VAE framework at x^(i), as

L(φ, θ; x^(i)) ≈ (1/2) Σ_{j=1}^{J} [ 1 + log((σj^(i))²) − (µj^(i))² − (σj^(i))² ] + (1/L) Σ_{l=1}^{L} log pθ(x^(i)|z^(i,l)),   (26)

where J is the dimensionality of z.

Box 1: derivation of the KL-divergence term of Equation 25 for a diagonal Gaussian posterior and a standard Gaussian prior:

−DKL(Qφ(Z|x^(i)) ‖ Pθ(Z)) = E_{z∼Qφ}[log pθ(z)] − E_{z∼Qφ}[log qφ(z|x^(i))]
  = −(1/2) [ J log(2π) + Σ_{j=1}^{J} ((µj^(i))² + (σj^(i))²) ] + (1/2) [ J log(2π) + Σ_{j=1}^{J} (1 + log((σj^(i))²)) ]
  = (1/2) Σ_{j=1}^{J} [ 1 + log((σj^(i))²) − (µj^(i))² − (σj^(i))² ].   (27)

A variant of the VAE framework, β-VAE, adds an extra hyperparameter to the VAE objective which constricts the effective encoding capacity of the latent space. The β-VAE training objective is,

L(φ, θ; x) = −β DZ,X,φ,θ + E_{z∼Qφ(Z|X)}[log pθ(x|z)],

where β = 1 corresponds to the original VAE formulation of Kingma and Welling (2013). This constriction encourages the latent representation to be more factorised. However, this can lead to even worse reconstruction quality as compared to the standard VAE framework. This is caused by a trade-off introduced by the modified training objective, which punishes reconstruction quality in order to encourage disentanglement between the latent representations (Burgess et al., 2018). Varying β during training encourages the model to learn different latent representations of the data. A high value of β encourages disentanglement in the latent space, but at the cost of reduced reconstruction quality. To mitigate this reconstruction problem, the authors introduce a capacity control parameter C. Increasing C from zero to a sufficiently large value produces good quality reconstructions during training (Burgess et al., 2018). The modified β-VAE training objective is given by,

L(φ, θ; x) = −β |DZ,X,φ,θ − C| + E_{z∼Qφ(Z|X)}[log pθ(x|z)].
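To illustrate Equations 25-27, the sketch below computes the per-example VAE loss from an encoder's outputs: the closed-form Gaussian KL term of Equation 27 plus a single-sample (L = 1) reparameterized reconstruction term. The tiny linear encoder and decoder, the Bernoulli likelihood and all variable names are our own assumptions, not the architecture of Kingma and Welling (2013).

```python
import numpy as np

rng = np.random.default_rng(1)
D, J = 20, 4                      # data and latent dimensionality (illustrative)

# Hypothetical linear encoder/decoder parameters, standing in for neural networks.
W_mu, W_logvar = rng.normal(size=(J, D)) * 0.1, rng.normal(size=(J, D)) * 0.1
W_dec = rng.normal(size=(D, J)) * 0.1

def neg_elbo(x):
    """Negative SGVB estimate (Eq. 25/26) for one data point, with L = 1."""
    # Encoder q_phi(z|x) = N(mu, diag(sigma^2))
    mu, logvar = W_mu @ x, W_logvar @ x
    sigma = np.exp(0.5 * logvar)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(J)
    z = mu + sigma * eps

    # Analytic KL(q || N(0, I)) from Equation 27
    kl = -0.5 * np.sum(1 + logvar - mu**2 - sigma**2)

    # Bernoulli decoder p_theta(x|z): reconstruction log-likelihood
    logits = W_dec @ z
    log_px_given_z = np.sum(x * logits - np.log1p(np.exp(logits)))

    return kl - log_px_given_z   # minimize: KL regularizer minus reconstruction term

x = (rng.random(D) < 0.5).astype(float)   # one binary "image"
print("negative ELBO estimate:", neg_elbo(x))
```

For the β-VAE variants above, the same kl term would simply be scaled by β, or replaced by β|kl − C| for the capacity-controlled objective.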
6.2 GAN: Generative Adversarial Network

Generative adversarial networks (GANs), introduced by Goodfellow et al. (2014), are deep-learning based generative models that are extensively used to create realistic data samples across a range of problems, most notably in computer vision.

A typical GAN framework involves simultaneously training two models: a generator and a discriminator. The
generator tries to capture the data distribution by mapping a latent vector to a data-point, thereby generating new samples with similar statistical properties to the training data. The discriminator aims to estimate the likelihood of a sample being drawn from the training data or created by the generator (Goodfellow et al., 2014). We can train a GAN by optimizing the minimax objective,

min_Gen max_Dis LGAN = E_{x∼Pdata(X)}[log(Dis(x))] + E_{z∼Pz(Z)}[log(1 − Dis(Gen(z)))],   (28)

where the generator, Gen(z), takes a sample from the latent distribution, Pz(Z), and creates a new data-point, while the discriminator, Dis(x), takes a data-point from both the real distribution, Pdata(X), and the new data-point Gen(z), and assigns probabilities to both. To optimize the objective, the discriminator will try to assign probabilities close to one and zero for data sampled from the real distribution and for anything created by the generator, respectively. The GAN's objective is to train the discriminator to efficiently discriminate between real and generated data while encouraging the generator to reproduce the true data distribution (Goodfellow et al., 2014; Larsen et al., 2016). There is a unique solution where the generator successfully recovers the training data distribution. At the same time, the discriminator ends up assigning equal probability to samples from the training data and the generator.

6.3 VAE-GAN

Different variants of the original GAN framework have evolved since its inception, such as the VAE-GAN introduced by Larsen et al. (2016). This approach uses the learned feature representations in the GAN discriminator as a basis for the VAE reconstruction objective.

Larsen et al. (2016) show that unsupervised training, like that of a GAN, can result in a latent image representation with disentangled factors of variation (Bengio et al., 2013). This means that the model learns an embedding space with abstract, high-level visual features which can be modified using simple arithmetic (Larsen et al., 2016). The encoder and decoder of the VAE component are written as,

z = Enc(x) = q(z|x),
x̃ = Dec(z) = p(x|z),

where z is the latent representation of a data sample x taken from the marginal likelihood distribution. The encoder, Enc(x), takes a data-sample, x, and approximates the posterior density q(z|x), whereas the decoder, Dec(z), takes a sample from the latent space, z, and generates a sample from the likelihood density p(x|z). Larsen et al. (2016) define LVAE as the negative of the training objective of a vanilla VAE (given by Equation 24) and provide the objective function,

LVAE = −L(φ, θ; x)
     = DKL(Q(Z|X) ‖ P(Z)) − E_{z∼Q(Z|X)}[log p(x|z)].   (29)

The authors go on to describe the terms of Equation 29 as,

L^{pixel}_{llike} = −E_{z∼Q(Z|X)}[log p(x|z)],
L^{prior} = DKL(Q(Z|X) ‖ P(Z)),

where L^{pixel}_{llike} is the negative expected log likelihood and L^{prior} is the KL-divergence between the approximated posterior density and the prior on the latent variable. The KL-divergence term in Equation 29 can also be interpreted as a regularization term. Therefore, the VAE loss is the sum of the negative expected log likelihood (the reconstruction error) and the regularization term (Larsen et al., 2016).

The authors further propose a technique to exploit the capacity of the discriminator to differentiate between real and generated images. The capacity of a neural network is defined as an upper bound on the number of bits that can be extracted from the training data and stored in the architecture during learning (Baldi and Vershynin, 2019). Larsen et al. (2016) replace the VAE reconstruction error term in Equation 29, for better quality images, with a reconstruction error expressed in the GAN discriminator. For this, they introduce a Gaussian observation model for the hidden representation of the l-th layer of the discriminator, Disl(x), with mean Disl(x̃) and identity covariance,

p(Disl(x)|z) = N(Disl(x) | Disl(x̃), I),

where x̃ = Dec(z) is the output from the decoder for the data point x. Subsequently, the VAE reconstruction error in Equation 29 is replaced with the following,

L^{Disl}_{llike} = −E_{z∼Q(Z|X)}[log p(Disl(x)|z)].

The training procedure for VAE-GAN is illustrated in Algorithm 2 and Figure 8.
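As a rough illustration of how the pieces fit together, the sketch below (ours, with linear stand-ins for the encoder, decoder and discriminator) evaluates the three loss terms discussed above: the prior term L^{prior}, the feature-space reconstruction term L^{Disl}_{llike} built from an intermediate discriminator layer, and the GAN term of Equation 28.

```python
import numpy as np

rng = np.random.default_rng(2)
D, J, H = 16, 4, 8      # data, latent and discriminator-feature sizes (illustrative)

# Stand-in linear "networks"; in Larsen et al. (2016) these are deep conv nets.
W_mu, W_logvar = rng.normal(size=(J, D)), rng.normal(size=(J, D))
W_dec = rng.normal(size=(D, J))
W_featl = rng.normal(size=(H, D))          # l-th discriminator layer Dis_l(.)
w_out = rng.normal(size=H)                 # final discriminator output layer

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dis_l = lambda x: np.tanh(W_featl @ x)     # hidden features Dis_l(x)
dis = lambda x: sigmoid(w_out @ dis_l(x))  # probability that x is real

def vaegan_losses(x):
    # Encode, reparameterize, decode.
    mu, logvar = W_mu @ x, W_logvar @ x
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(J)
    x_tilde = W_dec @ z

    # L_prior: KL(q(z|x) || N(0, I)), closed form as in Equation 27.
    l_prior = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

    # L^{Dis_l}_llike: Gaussian negative log-likelihood of Dis_l(x) with mean Dis_l(x_tilde).
    diff = dis_l(x) - dis_l(x_tilde)
    l_llike_disl = 0.5 * np.sum(diff**2)   # up to an additive constant

    # L_GAN (Eq. 28), here for one real sample and one sample decoded from the prior.
    x_prior = W_dec @ rng.standard_normal(J)
    l_gan = np.log(dis(x)) + np.log(1.0 - dis(x_prior))

    return l_prior, l_llike_disl, l_gan

print(vaegan_losses(rng.normal(size=D)))
```

In the full training procedure, each of the encoder, decoder and discriminator is updated using only the terms that involve it (Larsen et al., 2016).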
• a VAE with learned distance, VAEDisl, where the authors first train a GAN and use the l-th layer of the discriminator network as a learned similarity measure,

• the proposed VAE-GAN framework,

on the CelebA data set. As shown in Figure 9, the visual realism of the VAE-GAN is superior to that of the traditional VAE. Additionally, the learned latent space can be used to modify high-level facial features, such as skin tone and hair colour.
Moreover, we briefly presented the scenarios where VI has been applied to modern machine learning tasks, specifically in computer vision and generative modeling, and investigated how combining deep learning and VI enables us to perform inference on extremely complex posterior distributions.

VI is a powerful tool that allows us to approximate the actual probability density of the latent representation. However, there are still many open avenues for statistical research. One such avenue is to develop better approximations (achieving lower KL-divergence) to the posterior density, while maintaining efficient optimization. For example, the mean-field family makes strong independence assumptions which aid in scalable optimization. However, these assumptions may lead the variance of the approximated density to under-represent that of the target density (Blei et al., 2017). As an alternative to the mean-field method, Minka (2005) uses a fully-factorized approximation with no explicit exponential family constraint, along with loopy belief propagation, to achieve a lower KL-divergence. Another possible area of research is to use α-divergence measures (Zhang et al., 2018) to get a tighter fit to the ELBO. Although research in the field of VI has grown in recent years, efforts to make VI more efficient, accurate, scalable and easier to use are still ongoing.

8 Acknowledgements

We would like to thank our colleagues Sanjana Jain, Dhruba Pujary and Ukrit Watchareeruetai for their constructive input and feedback during the writing of this paper.

References

P. Baldi and R. Vershynin. The capacity of feedforward neural networks. Neural Networks, 116:288–311, 2019.

D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, M. West, et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7(453-464):210, 2003.

C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2180–2188, 2016.

P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.

A. Dembo, T. M. Cover, and J. A. Thomas. Information theoretic inequalities. IEEE Transactions on Information Theory, 37(6):1501–1518, 1991.

C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

F. Gagliardi Cozman. Generalizing variable elimination in Bayesian networks. Probabilistic Reasoning in Artificial Intelligence, pages 1–11, 2000.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.

J. Graving and I. Couzin. VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering. 2020. doi: 10.1101/2020.07.17.207993.

X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141. IEEE, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

M. Klaričić Bakula, M. Matić, and J. Pečarić. On some general inequalities related to Jensen's inequality. Inequalities and Applications, pages 233–243, 2008.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pages 1558–1566. PMLR, 2016.

W. Li, W. Hu, N. Chen, and C. Feng. Stacking VAE with graph neural networks for effective and interpretable time series anomaly detection. arXiv preprint arXiv:2105.08397, 2021.

Z. Li, R. Togo, T. Ogawa, and M. Haseyama. Variational autoencoder based unsupervised domain adaptation for semantic segmentation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 2426–2430. IEEE, 2020.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3730–3738, 2015. ISSN 15505499. doi: 10.1109/ICCV.2015.425.

D. J. MacKay. Information Theory, Inference, and Learning Algorithms, volume 44. 2015. ISBN 9780521642989. doi: 10.1198/jasa.2005.s54.

A. L. Madsen and F. V. Jensen. Lazy propagation: a junction tree inference algorithm based on lazy evaluation. Artificial Intelligence, 113(1-2):203–245, 1999.

C. A. McGrory and D. Titterington. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis, 51(11):5352–5367, 2007.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

T. P. Minka. Divergence measures and message passing. Microsoft Research Technical Report, (MSR-TR-2005-173):17, 2005. ISSN 0735-0015.

S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.

K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, Mass., 2013. ISBN 9780262018029.

D. J. Nott, S. L. Tan, M. Villani, and R. Kohn. Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics, 21(3):797–820, 2012.

S. Pidhorskyi, D. A. Adjeroh, and G. Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.

A. A. Pol, V. Berger, C. Germain, G. Cerminara, and M. Pierini. Anomaly detection with conditional variational autoencoders. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1651–1657. IEEE, 2019.

P. Purkait, C. Zach, and I. Reid. SG-VAE: Scene grammar variational autoencoder to generate new indoor scenes. In European Conference on Computer Vision, pages 155–171. Springer, 2020.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR, 2014.

J. Regier, A. Miller, J. McAuliffe, R. Adams, M. Hoffman, D. Lang, D. Schlegel, and M. Prabhat. Celeste: Variational inference for a generative model of astronomical images. In International Conference on Machine Learning, pages 2095–2103. PMLR, 2015.

C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

J. Shlens. Notes on Kullback-Leibler divergence and likelihood. arXiv preprint arXiv:1404.2000, 2014.

R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon. Amortized inference regularization. arXiv preprint arXiv:1805.08913, 2018.

M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.

K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. 2nd International Conference on Learning Representations, ICLR 2014, Workshop Track Proceedings, pages 1–8, 2014.

T. Tabouy, P. Barbillon, and J. Chiquet. Variational inference for stochastic block models from sampled data. Journal of the American Statistical Association, 115(529):455–466, 2020.

A. Vahdat and J. Kautz. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898, 2020.

Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In International Conference on Machine Learning, pages 3881–3890. PMLR, 2017.

C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.

S. Zhao, J. Song, and S. Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5885–5892, 2019.
A Appendix

We formulate the variational approximation for the mixture of Gaussians in Equation 19 as follows:

q(µ, c) = ∏_{j=1}^{K} q(µj; mj, sj²) ∏_{i=1}^{N} q(ci; φi).   (31)

From Equation 20, we have the following definition of the ELBO:

ELBO_{m,s²,φ} = E[log p(x, µ, c)] − E[log q(µ, c)],   (32)

where the expectations are taken under q, and m, s and φ are the variational parameters.

In order to derive the optimal values of the variational parameters, we first have to express the ELBO in Equation 32 in terms of m, s and φ. We start by simplifying the first term, log p(x, µ, c), on the RHS of Equation 32 as follows,

log p(x, µ, c) = log p(x|µ, c) p(µ, c)
              = log p(x|µ, c) p(µ) p(c)
              = log p(x|µ, c) + log p(µ) + log p(c)
              = Σj log p(µj) + Σi [ log p(ci) + log p(xi|ci, µ) ],   (33)

where p(ci) = 1/K is a constant, and expanding p(µj) we have,

log p(µj) = log [ (1/√(2πσ²)) exp(−µj²/(2σ²)) ]
          ∝ −µj²/(2σ²).   (34)

For p(xi|ci, µ) in Equation 33, we can make use of the fact that ci is a one-hot vector. Therefore, log p(xi|ci, µ) can be expressed as:

log p(xi|ci, µ) = log ∏j p(xi|µj)^{cij}
               = Σj cij log p(xi|µj).   (35)

From the density of the real-valued data-point xi defined in Equation 16, it follows that:

log p(xi|µj) = log [ (1/√(2π)) exp(−(xi − µj)²/2) ]
             ∝ −(xi − µj)²/2.   (36)

We now re-write Equation 33 by combining the derivations from Equations 34, 35 and 36 as,

log p(x, µ, c) ∝ − Σj µj²/(2σ²) + Σi Σj [ −cij (xi − µj)²/2 ].   (37)

We now factorize the variational joint probability density q(µ, c) appearing in Equation 32 as,

log q(µ, c) = log [ ∏j q(µj; mj, sj²) ∏i q(ci; φi) ]
            = Σj log q(µj; mj, sj²) + Σi log q(ci; φi).   (38)

We further expand the terms on the RHS of Equation 38 as follows,

log q(µj; mj, sj²) = log [ (1/√(2πsj²)) exp(−(µj − mj)²/(2sj²)) ]
                   = −(1/2) log(2πsj²) − (µj − mj)²/(2sj²),   (39)

log q(ci; φi) = log ∏j φij^{cij}
             = Σj cij log φij.   (40)

We combine the derivations for the joint variational probability density from Equations 39 and 40 to re-write Equation 38 as,

log q(µ, c) = Σj [ −(1/2) log(2πsj²) − (µj − mj)²/(2sj²) ] + Σi Σj cij log φij.   (41)

The final step towards deriving the ELBO in terms of the variational parameters is to substitute the results from Equations 37 and 41 into Equation 32 as follows,

ELBO_{m,s²,φ} ∝ − E[ Σj µj²/(2σ²) ] + E[ Σi Σj −cij (xi − µj)²/2 ] − E[ Σj ( −(1/2) log(2πsj²) − (µj − mj)²/(2sj²) ) ] − E[ Σi Σj cij log φij ].
The final ELBO objective in terms of the variational parameters is as follows:

ELBO_{m,s²,φ} ∝ − Σj E[ µj²/(2σ²) ] − Σi Σj φij E[ (xi − µj)²/2 ] + (1/2) Σj log(sj²) − Σi Σj φij log φij,   (42)

where the remaining expectations are taken under q(µj; mj, sj²) and the expectations over the cluster assignments have already been evaluated using E[cij] = φij.

Now, to derive the optimal values of the variational parameters, we take partial derivatives of the ELBO in Equation 42 with respect to the variational parameters and equate them to zero.

Deriving mj*, the optimal value of mj:

∂/∂mj { − Σi φij E[ (xi − µj)²/2 ] − E[ µj²/(2σ²) ] }
  ∝ ∂/∂mj { Σi φij [ −(mj² + sj²)/2 + xi mj ] − (mj² + sj²)/(2σ²) }
  ∝ ∂/∂mj { −(1/2) Σi φij mj² + Σi φij xi mj − mj²/(2σ²) }
  ∝ − Σi φij mj + Σi φij xi − mj/σ².

Setting this derivative to zero and solving for mj recovers the optimal value mj* given in Equation 21.