An Introduction to Variational Inference

Ankush Ganguly∗ Samuel W. F. Earp

Sertis Vision Lab†


arXiv:2108.13083v3 [cs.LG] 22 Nov 2021

Abstract
Approximating complex probability densities is a core problem in modern statistics. In this paper,
we introduce the concept of Variational Inference (VI), a popular method in machine learning that
uses optimization techniques to estimate complex probability densities. This property allows VI to
converge faster than classical methods, such as Markov Chain Monte Carlo sampling. Conceptually,
VI works by choosing a family of probability density functions and then finding the one closest to the
actual probability density—often using the Kullback-Leibler (KL) divergence as the optimization
metric. We introduce the Evidence Lower Bound to tractably compute the approximated probability
density and we review the ideas behind mean-field variational inference. Finally, we discuss the
applications of VI to variational auto-encoders (VAE) and VAE-Generative Adversarial Network
(VAE-GAN). With this paper, we aim to explain the concept of VI and assist in future research with
this approach.

1 Introduction

The core principle of Bayesian statistics is to frame inference about unknown variables as a
calculation involving a posterior probability density (Blei et al., 2017). This property of Bayesian
statistics makes inference a recurring problem, especially when the posterior density is difficult
to compute (Barber, 2012). Algorithms like the elimination algorithm (Gagliardi Cozman, 2000), the
message-passing algorithm (belief propagation: Barber, 2012), and the junction tree algorithm
(Madsen and Jensen, 1999) have been used to solve exact inference. This method involves
analytically computing the conditional probability distribution over the variables of interest.
However, exact inference on arbitrary graphical models is NP-hard (Dagum and Luby, 1993). In the
case of large data-sets and complicated posterior probability densities, exact inference algorithms
favour accuracy at the cost of speed. Approximate inference techniques offer an efficient solution
by providing an estimate of the actual posterior probability density.

As a solution to approximate inference, various Markov Chain Monte Carlo (MCMC) methods have been
extensively studied since the early 1950s. The most notable among these methods include the
Metropolis-Hastings algorithm (Metropolis et al., 1953) and Gibbs sampling (Geman and Geman, 1984).
MCMC techniques have since evolved into an indispensable statistical tool for solving approximate
inference. However, these methods are slow to converge and do not scale efficiently.

As an alternative to MCMC sampling, variational methods have been used to tractably approximate
complicated probability densities. In recent years, Variational Inference (VI) (introduced by Jordan
et al., 1999) has gained popularity in statistical physics (Regier et al., 2015), data modeling
(Tabouy et al., 2020), and neural networks (MacKay, 2015). The problem involves using a metric to
select a tractable approximation to the posterior probability density (Blei et al., 2017). This
methodology formulates the statistical inference problem as an optimization problem. Thus, we get
the speed benefits of maximum a posteriori (MAP) estimation (Murphy, 2013) and can easily scale to
large data sets (Blei et al., 2017).

We organize the paper as follows. Section 2 outlines the problem statement and introduces the idea
of using KL-divergence, the metric used to measure the information gap between the approximate and
the actual posterior probability densities. Section 3 discusses the concept of evidence lower bound
and its importance. Section 4 introduces the mean-field variational family. Section 5 applies VI to
a toy problem. Section 6 outlines a few practical applications of VI in the field of deep learning
and computer vision. Finally, Section 7 provides a summary of the paper.

Email: [email protected]

597/5 Sukhumvit Road, Watthana, Bangkok, 10110, Thailand
2 Problem Statement

Figure 1: A directed graphical model showing that the observed variable X is dependent on the
latent variable Z.

Consider the system of random variables illustrated in Figure 1, where X and Z represent the
observed variable and the hidden (latent) variable, respectively. The arrow drawn from Z to X
represents the conditional probability density p(X|Z), referred to as the likelihood. From Bayes'
theorem we compute the posterior probability density as,

\[ p(Z|X) = \frac{p(X|Z)\,p(Z)}{p(X)}. \tag{1} \]

The marginal, p(X), can be computed as,

\[ p(X) = \int_{z \in Z} p(X|z)\,p(z)\,dz, \tag{2} \]

where z is an instance from the sample space of Z. This marginal probability density of
observations is the evidence and p(Z) is referred to as the prior because it captures the prior
information about Z. For many models, this evidence integral depends on the selected model and is
either unavailable in closed form or requires exponential time to compute (Blei et al., 2017).

The purpose of VI is to provide an analytical approximation of the posterior probability density
p(Z|X) for statistical inference over the latent variables. VI enables efficient computation of a
lower bound to the marginal probability density, or the evidence. The idea is that a higher
marginal likelihood is indicative of a better fit to the observed data by the chosen statistical
model. Additionally, VI addresses the approximation problem by choosing a probability density
function q for the latent variable Z from a tractable family (Murphy, 2013).

KL-Divergence

The approximate probability density is chosen using a metric that measures the difference between
it and the actual posterior density (Ranganath et al., 2014). One popular metric used in VI is the
Kullback-Leibler (KL) divergence, suggested by Jordan et al. (1999). The KL-divergence is the
relative entropy between two distributions (Dembo et al., 1991). It is a measure of information
that quantifies how similar a probability distribution P(X) is to a candidate distribution Q(X)
(Shlens, 2014). The entropy is a measure of the mean information or uncertainty of a random
variable X (Shannon, 1948), and is defined as

\[ H(P) = -\sum_{x \in X} P(x) \log P(x), \]

where X is sampled from the distribution P. Subsequently, the KL-divergence can be expressed as

\[ D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}, \tag{3} \]

\[ D_{KL}(P \,\|\, Q) = -H(P) + H(P, Q), \]

where H(P, Q) is the cross-entropy between the two distributions. In other words, the
KL-divergence is the average extra amount of information required to encode the data using the
candidate probability distribution instead of the actual distribution (Murphy, 2013). The
KL-divergence is non-negative, non-symmetric and is equal to zero or infinite for two perfectly
matching and non-matching distributions, respectively (Shlens, 2014).

For a continuous random variable X, Equation 3 can be extended to the form,

\[ D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\,dx, \tag{4} \]

where P and Q are probability distributions of the continuous random variable X and p and q
represent the probability density functions.

Alternatively, we can express the KL-divergence as the expectation of the logarithmic difference
between the probability densities p and q,

\[ D_{KL}(P \,\|\, Q) = E_{x \sim P(X)}\left[\log \frac{p(x)}{q(x)}\right], \tag{5} \]

where the random variable x is sampled from the probability distribution P and E denotes the
expectation.

As established earlier in this Section, the objective of VI is to select an approximate probability
density q from a family of tractable probability densities $\mathcal{Q}$. Each
q(Z) ∈ $\mathcal{Q}$ is a candidate approximation of the actual posterior. The goal is to find the
best candidate, i.e. the one with the minimum KL-divergence (Blei et al., 2017). In our
formulation, we assume the approximate probability density is not conditioned on the observed
variable. Therefore, the inference problem is re-framed as the optimization problem,

\[ q^{*}(Z) = \arg\min_{q(Z) \in \mathcal{Q}} D_{KL}\big(P(Z|X) \,\|\, Q(Z)\big). \tag{6} \]

We optimize Equation 6 to yield the best approximation q*(·) to the actual posterior from the
chosen family of densities. The complexity of the optimization depends on our choice for the family
of probability densities (Blei et al., 2017; Murphy, 2013) and, therefore, most researchers choose
to use the exponential family, motivated by its conjugate nature.

Computing Equation 6 is difficult as taking expectations with respect to P is assumed to be
intractable (Murphy, 2013). Moreover, computing the forward KL-divergence term in Equation 6 would
require us to know the posterior. An alternative is to use the reverse KL-divergence where the
average cross-entropy between the actual posterior and our approximation is computed by
taking expectations with respect to the variational distribution. Hence, the optimization problem
in Equation 6 can be re-formulated as,

\[ q^{*}(Z) = \arg\min_{q(Z) \in \mathcal{Q}} D_{KL}\big(Q(Z) \,\|\, P(Z|X)\big). \tag{7} \]

Since q(Z) is selected from a tractable family of probability densities, computing expectations
with respect to q is also tractable.

Forward vs. Reverse KL

Let P and Q be two distributions with probability density functions p and q, where q is an
approximation of p. As stated earlier, KL-divergence is non-symmetric (Shlens, 2014), i.e.,

\[ D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P), \]

as such, minimizing the forward KL-divergence, $D_{KL}(P \,\|\, Q)$, yields different results than
minimizing the reverse KL-divergence, $D_{KL}(Q \,\|\, P)$.

The forward KL-divergence is also known as the M-projection or moment projection (Murphy, 2013)
and is defined as,

\[ D_{KL}(P \,\|\, Q) = E_{x \sim P(X)}\left[\log \frac{p(x)}{q(x)}\right]. \]

This will be large wherever the approximation fails to cover the actual probability distribution
(Murphy, 2013); i.e.

\[ \lim_{q(x) \to 0} \frac{p(x)}{q(x)} = \infty, \quad \text{where } p(x) > 0. \]

So, if p(x) > 0, we must choose a probability density to ensure that q(x) > 0 (Murphy, 2013). This
particular case of optimizing is zero avoiding and can intuitively be interpreted as q
over-estimating p.

The reverse KL-divergence is also known as the I-projection or information projection (Murphy,
2013) and is defined as,

\[ D_{KL}(Q \,\|\, P) = E_{x \sim Q(X)}\left[\log \frac{q(x)}{p(x)}\right], \]

where,

\[ \lim_{p(x) \to 0} \frac{q(x)}{p(x)} = \infty, \quad \text{where } q(x) > 0. \]

The limit indicates the need to force q(x) = 0 wherever p(x) = 0, otherwise the KL-divergence would
be very large. This is zero forcing (Murphy, 2013) and can be interpreted as q under-estimating p.
The difference between the two methods is illustrated in Figure 2, which is based on Figure 21.1
from Murphy (2013).

3 ELBO: Evidence Lower Bound

As mentioned in Section 2, we select a probability density from a tractable family which has the
lowest KL-divergence from the actual posterior density. Therefore, inference amounts to solving the
optimization problem defined in Equation 7. However, optimizing Equation 7 is still not tractable
because we are required to compute the evidence function. The KL-divergence objective function
from Equation 7 can be written as,

\[ D = E[\log q(z)] - E[\log p(z|x)], \tag{8} \]

where,

\[ D = D_{KL}\big(Q(Z) \,\|\, P(Z|X)\big), \tag{9} \]

and all expectations are taken by sampling z from Q(Z). We now expand the conditional probability
density p(z|x) using Equation 1, giving us,

\[ D = E[\log q(z)] - E[\log p(z, x)] + E[\log p(x)]. \tag{10} \]

Since all expectations are under Q(Z), E[log p(x)] is the constant log p(x). Therefore, we can
re-write Equation 10 as,

\[ D = E[\log q(z)] - E[\log p(z, x)] + \log p(x). \tag{11} \]

The KL-divergence cannot be computed directly as it depends on the evidence. Therefore, we must
optimize an alternative objective function that is equivalent to the negative of $D_{KL}$ up to
an added constant,

\[ -D + \log p(x) = E[\log p(z, x)] - E[\log q(z)], \]

\[ \mathrm{ELBO}(Q) = E[\log p(z, x)] - E[\log q(z)], \tag{12} \]

where the term ELBO is an abbreviation for evidence lower bound. The ELBO is the sum of the
negative KL-divergence and the constant term log p(x). Maximizing the ELBO is equivalent to
minimizing the KL-divergence (Blei et al., 2017).

An intuitive explanation of the ELBO can be derived by re-arranging the terms of Equation 12, as

\[
\begin{aligned}
\mathrm{ELBO}(Q) &= E[\log p(z, x)] - E[\log q(z)], \\
&= E[\log p(x|z)] + E[\log p(z) - \log q(z)], \\
&= E[\log p(x|z)] - D_{KL}\big(Q(Z) \,\|\, P(Z)\big).
\end{aligned}
\tag{13}
\]

Thus, the ELBO is the expected log likelihood of the data minus the KL-divergence between the
approximated posterior probability density and the prior. The expected log likelihood describes
how well the chosen statistical model fits the data. The KL-divergence encourages the variational
probability density to be close to the actual prior. Thus, the ELBO can be seen as a regularised
fit to the data.

The ELBO lower-bounds the (log) evidence, log p(x). This property was explored by Jordan et al.
(1999), where the authors used Jensen's inequality (Klaričić Bakula et al., 2008) to derive the
relationship between the ELBO and the evidence function. The derivation is as follows:

\[
\begin{aligned}
\log p(x) &= \log \int_{z \in Z} p(x, z)\,dz, \\
&= \log \int_{z \in Z} p(x, z)\,\frac{q(z)}{q(z)}\,dz, \\
&= \log E_{z \sim Q(z)}\left[\frac{p(x, z)}{q(z)}\right], \\
&\geq E_{z \sim Q(z)}[\log p(x, z)] - E_{z \sim Q(z)}[\log q(z)], \\
&= \mathrm{ELBO}(Q).
\end{aligned}
\tag{14}
\]
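The bound in Equation 14 and the identity in Equation 11 can be checked numerically on a small discrete example. The following is a minimal sketch, not part of the original derivation: the joint values in `joint` and the density `q` are hypothetical numbers chosen for illustration, and the integral over z is replaced by a sum over three latent states.

```python
import math

def elbo(joint, q):
    """ELBO(Q) = E_q[log p(z, x)] - E_q[log q(z)] for a discrete latent z."""
    return sum(qz * (math.log(pzx) - math.log(qz))
               for pzx, qz in zip(joint, q) if qz > 0.0)

def kl_divergence(q, p):
    """D_KL(Q || P) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)

# Hypothetical joint values p(z, x) at one fixed observation x, for z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))              # log p(x), Equation 2 as a sum
posterior = [pzx / sum(joint) for pzx in joint]  # p(z | x), Equation 1

q = [0.5, 0.3, 0.2]                              # an arbitrary variational density
gap = log_evidence - elbo(joint, q)              # Equation 11: the gap equals D

print(f"log p(x) = {log_evidence:.4f}, ELBO(q) = {elbo(joint, q):.4f}")
print(f"gap = {gap:.4f}, KL(q || posterior) = {kl_divergence(q, posterior):.4f}")
```

The gap between log p(x) and the ELBO coincides with the reverse KL-divergence to the posterior, and it closes when q is set to the posterior itself.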

Figure 2: Figure illustrating forward vs reverse KL-divergence on a bimodal distribution. The blue and the red contours
represent the actual probability density, p, and the unimodal approximation, q, respectively. The left panel shows
the forward KL-divergence minimization where q tends to cover p. The centre and the right panels show the reverse
KL-divergence minimization where q locks on to one of the two modes.
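The behaviour shown in Figure 2 can be reproduced in one dimension with a short numerical experiment. This is a sketch under stated assumptions rather than the procedure used to draw the figure: the bimodal target (an equal mixture of N(-3, 1) and N(3, 1)), the integration grid, and the coarse grid search over the mean and standard deviation of a single Gaussian q are all illustrative choices.

```python
import math

def log_normal(x, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def target_density(x):
    """Bimodal target p: an equal mixture of N(-3, 1) and N(3, 1) (hypothetical)."""
    return 0.5 * math.exp(log_normal(x, -3.0, 1.0)) + 0.5 * math.exp(log_normal(x, 3.0, 1.0))

DX = 0.05
GRID = [-10.0 + DX * i for i in range(401)]      # integration grid on [-10, 10]
P = [target_density(x) for x in GRID]
LOG_P = [math.log(px) for px in P]

def forward_kl(mean, std):
    """Riemann-sum estimate of D_KL(P || Q) with Q = N(mean, std^2)."""
    return DX * sum(px * (lp - log_normal(x, mean, std))
                    for x, px, lp in zip(GRID, P, LOG_P))

def reverse_kl(mean, std):
    """Riemann-sum estimate of D_KL(Q || P) with Q = N(mean, std^2)."""
    total = 0.0
    for x, lp in zip(GRID, LOG_P):
        lq = log_normal(x, mean, std)
        total += math.exp(lq) * (lq - lp)
    return DX * total

# Coarse grid search over the unimodal family (a stand-in for a real optimizer).
candidates = [(-4.0 + 0.5 * i, 0.5 + 0.25 * j) for i in range(17) for j in range(15)]
forward_fit = min(candidates, key=lambda c: forward_kl(*c))
reverse_fit = min(candidates, key=lambda c: reverse_kl(*c))
print("forward KL fit (mean, std):", forward_fit)
print("reverse KL fit (mean, std):", reverse_fit)
```

The forward fit centres itself between the modes with a large spread (zero avoiding), while the reverse fit settles on a single mode with a spread close to that of the mode (zero forcing). Which of the two symmetric modes the reverse fit selects depends only on floating-point rounding.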

This relationship between the ELBO and log p(x) has motivated researchers to use the variational
lower bound as the criterion for model selection. This bound serves as a good approximation to the
marginal likelihood, providing a basis for model selection. Applications of VI for model selection
have been explored in a wide variety of tasks such as for mixture models (McGrory and Titterington,
2007), cross-validation mode selection (Nott et al., 2012) and in a more general setting by
Bernardo et al. (2003).

4 Mean field variational family

We briefly introduce the mean field variational family for VI, where the latent variables are
assumed to be mutually independent, each governed by a distinct factor in the variational
probability density (see Bishop, 2006, for a more detailed explanation). This assumption greatly
simplifies the complexity of the optimization process. A generic member of the mean field
variational family is

\[ q(Z) = \prod_{j=1}^{m} q_j(Z_j), \tag{15} \]

where m is the number of latent variables. The observed data X does not appear in Equation 15,
therefore any probability density from this variational family is not a model of the data. Instead,
it is the ELBO, and the corresponding KL-divergence minimization problem, which connects the fitted
variational probability density to the data and model (Blei et al., 2017; Murphy, 2013).

5 A toy problem

In this section, we explore, in detail, how VI can be used to approximate a mixture of Gaussians.
Consider a distribution of N real-valued data-points $x = x_1, x_2, ..., x_N$ sampled from a
mixture of K univariate Gaussians with means $\mu = \mu_1, \mu_2, ..., \mu_K$. We assume that the
variance of the mean's prior is a fixed hyperparameter $\sigma^2$ while the observation variance is
one. For this problem, we define a single data-point as,

\[ x_i \sim \mathcal{N}(c_i^{T}\mu, 1) \quad \text{for } i = 1, 2, ..., N, \tag{16} \]

which is drawn from a distribution with mean,

\[ \mu_j \sim \mathcal{N}(0, \sigma^2) \quad \text{for } j = 1, 2, ..., K, \tag{17} \]

and is assigned to a cluster using,

\[ c_i \sim \mathcal{U}(K) \quad \text{for } i = 1, 2, ..., N, \tag{18} \]

where $c_i$ is a K-dimensional one-hot vector. The latent variables are $\mu$ and c. In our case,
the one-hot vector is a K-dimensional binary vector where each dimension represents a cluster. A
data-point belonging to the l-th cluster will be represented by a value of one in the l-th
dimension of the one-hot vector, while the (K − 1) remaining dimensions will have a value of zero.

We assume the approximate posterior probability density to be from the mean-field variational
family (Section 4). Thus, the variational parameterization is given by,

\[ q(\mu, c) = \prod_{j=1}^{K} q(\mu_j; m_j, s_j^2) \prod_{i=1}^{N} q(c_i; \phi_i). \tag{19} \]

Each latent variable is governed by its own variational factor (Blei et al., 2017). Here, the
mixture components are Gaussian with variational parameters (mean $m_k$ and variance $s_k^2$)
specific to the k-th cluster. The cluster assignments are categorical with variational parameters
(the K-dimensional vector of cluster probabilities $\phi_i$) specific to the i-th data point.

The definition of ELBO from Equation 12 applied to this specific case is,

\[ \mathrm{ELBO}_{m, s^2, \phi} = E[\log p(x, \mu, c)] - E[\log q(\mu, c)], \tag{20} \]

\[ \mathrm{ELBO}_{m, s^2, \phi} = \mathrm{ELBO}(m, s^2, \phi). \]

We maximize the ELBO to derive the optimal values of the variational parameters (see Appendix A).
The optimal
values for m, $s^2$ and $\phi$ are given by,

\[ m_j^{*} = \frac{\sum_i \phi_{ij} x_i}{\frac{1}{\sigma^2} + \sum_i \phi_{ij}}, \tag{21} \]

\[ (s_j^2)^{*} = \frac{1}{\frac{1}{\sigma^2} + \sum_i \phi_{ij}}, \tag{22} \]

\[ \phi_{ij}^{*} \propto \exp\left(-\tfrac{1}{2}(m_j^2 + s_j^2) + x_i m_j\right). \tag{23} \]

We employ the Coordinate Ascent VI (CAVI) algorithm to optimize the ELBO. Algorithm 1 (Blei et al.,
2017) describes the steps to optimize the ELBO using CAVI.

For our experimental setup, we select K = 3, i.e., a mixture of three univariate Gaussians. We
generate the data by randomly sampling 1000 data-points for each of the three Gaussians. The left
panel in Figure 3 illustrates how the algorithm approximates the parameters of each individual
Gaussian by maximizing the ELBO. We allow a maximum of 1000 iterations; however, the ELBO converges
by iteration 60, as illustrated in the right panel of Figure 3.

Algorithm 1: CAVI for a Gaussian mixture model
    Data: Data $x_{1:N}$, K mixture components and prior variance of component means $\sigma^2$
    Result: Variational densities $q(\mu_j; m_j, s_j^2)$ and $q(c_i; \phi_i)$
    $m = m_{1:K}$, $s^2 = s^2_{1:K}$, $\phi = \phi_{1:N}$ ← initialize variational parameters
    while the ELBO has not converged do
        for i ∈ 1, ..., N do
            Set $\phi_{ij} \propto \exp\left(-\tfrac{1}{2}(m_j^2 + s_j^2) + x_i m_j\right)$
        end
        for j ∈ 1, ..., K do
            Set $m_j \leftarrow \frac{\sum_i \phi_{ij} x_i}{1/\sigma^2 + \sum_i \phi_{ij}}$
            Set $s_j^2 \leftarrow \frac{1}{1/\sigma^2 + \sum_i \phi_{ij}}$
        end
        Compute ELBO(m, $s^2$, $\phi$)
    end
    return q(m, $s^2$, $\phi$)

6 Applications

We have established the optimization process for VI; next we look at some applications in
generative modelling.

6.1 VAE: Variational Auto-Encoder

An auto-encoder is a neural network that aims to learn or encode a low-dimensional representation
for high-dimensional data, e.g. images. Different variants of auto-encoders exist that aim to learn
meaningful representations of high-dimensional data. One such variant is the Variational
Auto-Encoder (VAE). Introduced by Kingma and Welling (2013), the VAE is a statistical model which
is essentially a stochastic variational inference algorithm. The VAE uses the concept of
variational inference to compress the high-dimensional data into a latent vector while assuming a
multi-variate distribution as a prior for the same latent vector. The statistical model uses
gradient backpropagation to approximate the posterior distribution for the latent vector. For large
data sets we update the VAE's parameters using small mini-batches or even single data points.

Since their inception, VAEs have been widely used for generative modelling. They are easy to
implement, converge faster than MCMC methods, and scale efficiently to large data sets; this makes
them ideal for generative modelling of image data. However, the images they generate tend to have
reduced quality compared to the input images. This is the effect of minimizing the reverse
KL-divergence, which results in the approximate distribution being locked to one of the modes, as
explained in Section 2.

Kingma and Welling (2013) introduce the stochastic variational inference algorithm (VAE) that
reparameterizes the variational lower bound, yielding a lower bound estimator that can be optimized
using standard stochastic gradient methods.

Figure 4: Illustration of the type of directed graphical model under consideration with N observed
data-points. Solid lines denote the generative model, dashed lines denote the variational
approximation to the intractable posterior density. The variational parameters $\phi$ are learned
jointly with the generative model parameters $\theta$ (Kingma and Welling, 2013).

The recognition model, $q_\phi(Z|X)$, can be interpreted as a probabilistic encoder, since given a
data point x the encoder produces a latent vector z. This latent vector is then used to generate a
sample from the likelihood density, $p_\theta(X|Z)$, and, therefore, the generative model can be
interpreted as a probabilistic decoder (Kingma and Welling, 2013). The probability density
$q_\phi(Z|X)$ serves as an approximation of the actual posterior probability density
$p_\theta(Z|X)$. Using a mini-batch of data points sampled from X, the encoder transforms these
data points into the latent space, Z, which the decoder uses to generate the samples in X.

As established in Section 3 (Equation 12), maximizing the ELBO is equivalent to minimizing the
KL-divergence (Blei et al., 2017). In order to jointly optimize the recognition model and the
generative model on mini-batches of data, we differentiate and optimize the lower bound with
respect to both the variational and the generative parameters, φ and
[Figure 3 panels. Left: Density against x. Right: ELBO [10e−2] against Iterations.]
Figure 3: Left: a histogram showing the distribution of data sampled from the three univariate Gaussians. The curved
lines indicate the fit by maximising the ELBO. Right: an illustration of the convergence of ELBO using CAVI.
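The CAVI procedure of Algorithm 1, whose fit and convergence Figure 3 summarises, can be sketched in a few lines of standard-library Python. This is an illustrative sketch and not the authors' implementation. Several choices are assumptions made here: the quantile-based initialisation, the stopping rule on the change in m (used in place of the ELBO check), the cluster centres in `true_means`, the sample size of 200 points per cluster, and the prior variance passed as `sigma2`.

```python
import math
import random

def cavi_gmm(x, K, sigma2, max_iter=1000, tol=1e-8):
    """Coordinate ascent VI for the unit-variance Gaussian mixture of Section 5."""
    xs = sorted(x)
    n = len(xs)
    # Assumed initialisation: spread the K means over evenly spaced data quantiles.
    m = [xs[(2 * j + 1) * n // (2 * K)] for j in range(K)]
    s2 = [1.0] * K
    phi = []
    for _ in range(max_iter):
        # Equation 23: phi_ij proportional to exp(-(m_j^2 + s_j^2)/2 + x_i m_j).
        phi = []
        for xi in x:
            logits = [-0.5 * (m[j] ** 2 + s2[j]) + xi * m[j] for j in range(K)]
            top = max(logits)                    # subtract the max for stability
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi.append([w / total for w in weights])
        # Equations 21 and 22: Gaussian factor updates for each component.
        new_m, new_s2 = [], []
        for j in range(K):
            resp = sum(row[j] for row in phi)
            precision = 1.0 / sigma2 + resp
            new_m.append(sum(row[j] * xi for row, xi in zip(phi, x)) / precision)
            new_s2.append(1.0 / precision)
        shift = max(abs(a - b) for a, b in zip(new_m, m))
        m, s2 = new_m, new_s2
        if shift < tol:                          # stand-in for the ELBO check
            break
    return m, s2, phi

rng = random.Random(0)
true_means = (1.0, 5.0, 9.0)                     # hypothetical cluster centres
data = [rng.gauss(mu, 1.0) for mu in true_means for _ in range(200)]
m, s2, phi = cavi_gmm(data, K=3, sigma2=100.0)
print("fitted means:", [round(v, 2) for v in sorted(m)])
```

Monitoring the ELBO as in Algorithm 1 would additionally guard against implementation errors, since the ELBO must not decrease under coordinate ascent; the simpler stopping rule here only keeps the sketch short.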

θ. In this case Equation 13 can be rewritten as,

\[ D_{Z,X,\phi,\theta} = D_{KL}\big(Q_\phi(Z|X) \,\|\, P_\theta(Z)\big), \]

\[ \mathcal{L}(\phi, \theta; x) = -D_{Z,X,\phi,\theta} + E_{z \sim Q_\phi(Z|X)}[\log p_\theta(x|z)]. \tag{24} \]

If we look closely at the terms on the right-hand-side of Equation 24, we can see the connection to
auto-encoders, where the second term is the expected negative reconstruction error and the
KL-divergence term can be interpreted as a regularization term. In order to optimize Equation 24
using standard gradient-based techniques, Kingma and Welling (2013) introduce the Auto-Encoding
Variational Bayes (AEVB) algorithm to efficiently compute the gradient of the ELBO from Equation
24. For a chosen approximate posterior $q_\phi(z|x)$, we re-parameterize the random variable
$z \sim q_\phi(z|x)$ with a differentiable transformation $g_\phi(\epsilon, x)$ of a noise
variable $\epsilon$, such that,

\[ z = g_\phi(\epsilon, x), \qquad \epsilon \sim P(\epsilon). \]

This re-parameterization allows us to form Monte Carlo estimates of expectations of the function
$\log p_\theta(x|z)$. Forming Monte Carlo estimates enables us to numerically evaluate the
expectation,

\[ E_{z \sim Q_\phi(Z|X)}[\log p_\theta(x|z)]. \]

We draw independent samples, $z^{(l)}$, from the variational distribution, $Q_\phi(Z|X)$, and then
compute the average of the function evaluated at these samples (Mohamed et al., 2020). Therefore,
the Monte Carlo estimate of the expectation of the function $\log p_\theta(x^{(i)}|z)$ when
$z \sim q_\phi(z|x^{(i)})$ is as follows,

\[ E_{z \sim Q_\phi(Z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] \simeq \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)}), \]

where $z^{(i,l)} = g_\phi(\epsilon^{(l)}, x^{(i)})$, $\epsilon^{(l)} \sim P(\epsilon)$ and L is
the number of samples per data point.

Applying the above re-parameterization to the variational lower-bound of Equation 24 we can
re-formulate the ELBO to a Stochastic Gradient Variational Bayes (SGVB) estimator
$\mathcal{L}(\phi, \theta; x^{(i)})$, as

\[ \mathcal{L}_{\phi,\theta,x^{(i)}} \simeq -D_{Z,X,\phi,\theta} + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)}), \tag{25} \]

\[ \mathcal{L}_{\phi,\theta,x^{(i)}} = \mathcal{L}(\phi, \theta; x^{(i)}). \]

We re-parameterize the variational lower bound in terms of a deterministic, differentiable function
of a noise variable, which enables us to use gradient based optimizers on mini-batches of data.
This further enables the optimization of the parameters of the distribution while still maintaining
the ability to randomly sample from that distribution (Doersch, 2016).

Figure 5: The learning process in a typical VAE using gradient back-propagation.

In order to simplify the calculations, we assume the variational approximate posterior to be a
multi-variate Gaussian with diagonal co-variance structure (Kingma and Welling, 2013). As for the
prior we assume a multivariate Gaussian $\mathcal{N}(z; 0, I)$ where,

\[ \log q_\phi(z|x^{(i)}) = \log \mathcal{N}(z; \mu^{(i)}, (\sigma^{(i)})^2 I), \]

\[ p_\theta(z) = \mathcal{N}(z; 0, I), \]

\[ z^{(i)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon, \]

\[ \epsilon \sim \mathcal{N}(0, I). \]

Using the above parameterization, the KL-divergence term in Equation 25 can be derived as Equation
27 (as shown in
Figure 6: A diagram showing a variational autoencoder model.

Box 1). Subsequently, Equation 25 can be used to define the loss function for the VAE framework at
$x^{(i)}$, as

\[
\mathcal{L}_{\phi,\theta;x^{(i)}} \simeq \frac{1}{2} \sum_{j=1}^{J} \left[ 1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right]
+ \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)}|z^{(i,l)}), \tag{26}
\]

where J is the dimensionality of z.

A variant of the VAE framework, β-VAE, adds an extra hyperparameter to the VAE objective which
constricts the effective encoding capacity of the latent space. The β-VAE training objective is,

\[ \mathcal{L}(\phi, \theta; x) = -\beta D_{Z,X,\phi,\theta} + E_{z \sim Q_\phi(Z|X)}[\log p_\theta(x|z)], \]

where β = 1 corresponds to the original VAE formulation of Kingma and Welling (2013). This
constriction encourages the latent representation to be more factorised. However, this can lead to
even worse reconstruction quality as compared to the standard VAE framework. This is caused by a
trade-off introduced by the modified training objective that punishes reconstruction quality in
order to encourage disentanglement between the latent representations (Burgess et al., 2018).
Varying β during training encourages the model to learn different latent representations of the
data. A high value of β encourages disentanglement in the latent space but at the cost of reduced
reconstruction quality. To mitigate this reconstruction problem, the authors introduce a capacity
control parameter C. Increasing C from zero to a value large enough produces good quality
reconstructions during training (Burgess et al., 2018). The modified β-VAE training objective is
given by,

\[ \mathcal{L}(\phi, \theta; x) = -\beta \left| D_{Z,X,\phi,\theta} - C \right| + E_{z \sim Q_\phi(Z|X)}[\log p_\theta(x|z)]. \]
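The estimator of Equation 26 and the two β-VAE objectives above can be evaluated for a single data point as follows. This is a minimal sketch under explicit assumptions: the encoder outputs are passed in directly as `mu` and `log_var` (a log-variance parameterization chosen here for convenience), and `toy_log_likelihood` is a hypothetical unit-variance Gaussian decoder standing in for a neural network. No gradients are computed; in practice an automatic differentiation library would differentiate this quantity with respect to φ and θ.

```python
import math
import random

def gaussian_kl(mu, log_var):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)), i.e. the negative of Equation 27."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Re-parameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def sgvb_objective(x, mu, log_var, log_likelihood, rng,
                   n_samples=1, beta=1.0, capacity=None):
    """Monte Carlo estimate of the (beta-)VAE objective for one data point."""
    kl = gaussian_kl(mu, log_var)
    recon = sum(log_likelihood(x, sample_latent(mu, log_var, rng))
                for _ in range(n_samples)) / n_samples
    # Without a capacity target the penalty is the KL term; with one it is |KL - C|.
    penalty = kl if capacity is None else abs(kl - capacity)
    return recon - beta * penalty

def toy_log_likelihood(x, z):
    """Hypothetical decoder: p(x|z) = N(x; z, I), standing in for a neural network."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (xj - zj) ** 2
               for xj, zj in zip(x, z))

x = [0.3, -1.2]
mu, log_var = [1.0, 0.0], [0.0, 0.0]             # pretend encoder outputs
vae = sgvb_objective(x, mu, log_var, toy_log_likelihood, random.Random(0), n_samples=8)
beta_vae = sgvb_objective(x, mu, log_var, toy_log_likelihood, random.Random(0),
                          n_samples=8, beta=4.0)
print("VAE objective:", vae, "beta-VAE objective (beta = 4):", beta_vae)
```

Because both calls reuse the same seed, the reconstruction term is identical and the two objectives differ only through the weight placed on the KL term, which makes the effect of β easy to isolate.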

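Box 1: Derivation of the KL-divergence.

\[
\begin{aligned}
-D_{Z,x^{(i)},\phi,\theta} &= -D_{KL}\big(Q_\phi(Z|x^{(i)}) \,\|\, P_\theta(Z)\big) \\
&= -E_{z \sim Q_\phi(Z|x^{(i)})}\left[\log \frac{q_\phi(z|x^{(i)})}{p_\theta(z)}\right] \\
&= -\int q_\phi(z|x^{(i)}) \log \frac{q_\phi(z|x^{(i)})}{p_\theta(z)}\,dz \\
&= \int q_\phi(z|x^{(i)}) \log p_\theta(z)\,dz - \int q_\phi(z|x^{(i)}) \log q_\phi(z|x^{(i)})\,dz \\
&= \int \mathcal{N}\big(z; \mu^{(i)}, (\sigma^{(i)})^2 I\big) \log \mathcal{N}(z; 0, I)\,dz
 - \int \mathcal{N}\big(z; \mu^{(i)}, (\sigma^{(i)})^2 I\big) \log \mathcal{N}\big(z; \mu^{(i)}, (\sigma^{(i)})^2 I\big)\,dz \\
&= \frac{1}{2}\left[ J \log(2\pi) + \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big)\Big) - J \log(2\pi) - \sum_{j=1}^{J} \Big((\mu_j^{(i)})^2 + (\sigma_j^{(i)})^2\Big) \right] \\
&= \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big)
\end{aligned}
\tag{27}
\]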

6.2 GAN: Generative Adversarial Network

Generative adversarial networks (GANs), introduced by Goodfellow et al. (2014), are deep-learning
based generative models that are extensively used to create realistic data samples across a range
of problems, most notably in computer vision.

A typical GAN framework involves simultaneously training two models: a generator and a
discriminator. The generator tries to capture the data distribution by mapping a latent vector to a
data-point, thereby generating new samples with similar statistical properties as the training
data. The discriminator aims to estimate the probability that a sample was drawn from the training
data rather than created by the generator (Goodfellow et al., 2014). We can train a GAN by
optimizing the minimax objective,

\[
\min_{Gen} \max_{Dis} \mathcal{L}_{GAN} = E_{x \sim P_{data}(X)}[\log(Dis(x))]
+ E_{z \sim P_z(Z)}[\log(1 - Dis(Gen(z)))], \tag{28}
\]

where the generator, Gen(z), takes a sample from the latent distribution, $P_z(Z)$, and creates a
new data-point while the discriminator, Dis(x), takes a data-point from the real distribution,
$P_{data}(X)$, and the new data-point Gen(z) and assigns probabilities to both. To maximize the
objective, the discriminator will try to assign probabilities close to one and zero for data
sampled from the real distribution and anything created by the generator, respectively. The GAN's
objective is to train the discriminator to efficiently discriminate between real and generated data
while encouraging the generator to reproduce the true data distribution (Goodfellow et al., 2014;
Larsen et al., 2016). There is a unique solution where the generator successfully recovers the
training data distribution. At the same time, the discriminator ends up assigning equal probability
to samples from the training data and the generator.

6.3 VAE-GAN

Different variants of the original GAN framework have evolved since its inception, such as the
VAE-GAN introduced by Larsen et al. (2016). This approach uses the learned feature representations
in the GAN discriminator as a basis for the VAE reconstruction objective.

Larsen et al. (2016) show that unsupervised training, like that of a GAN, can result in latent
image representations with disentangled factors of variation (Bengio et al., 2013). This means that
the model learns an embedding space with abstract, high-level visual features which can be modified
using simple arithmetic (Larsen et al., 2016).

Figure 7: Diagram of the VAE-GAN framework (Larsen et al., 2016).

As we see in Section 6.1, the VAE consists of an encoder and a decoder given by,

\[ z \sim Enc(x) = q(z|x), \]

\[ \tilde{x} \sim Dec(z) = p(x|z), \]

where z is the latent representation of a data sample x taken from the marginal likelihood
distribution. The encoder, Enc(x), takes a data-sample, x, and approximates the posterior density
q(z|x), whereas the decoder, Dec(z), takes a sample from the latent space, z, and generates a
sample from the likelihood density p(x|z). Larsen et al. (2016) define $\mathcal{L}_{VAE}$ as the
negative of the training objective of a vanilla VAE (given by Equation 24) and provide the
objective function,

\[
\begin{aligned}
\mathcal{L}_{VAE} &= -\mathcal{L}(\phi, \theta; x), \\
&= D_{KL}\big(Q(Z|X) \,\|\, P(Z)\big) - E_{z \sim Q(Z|X)}[\log p(x|z)].
\end{aligned}
\tag{29}
\]

The authors go on to describe the terms of Equation 29 as,

\[ \mathcal{L}^{pixel}_{llike} = -E_{z \sim Q(Z|X)}[\log p(x|z)], \]

\[ \mathcal{L}_{prior} = D_{KL}\big(Q(Z|X) \,\|\, P(Z)\big), \]

where $\mathcal{L}^{pixel}_{llike}$ is the negative expected log likelihood and
$\mathcal{L}_{prior}$ is the KL-divergence between the approximated posterior density and the prior
on the latent variable. The KL-divergence term in Equation 29 can also be interpreted as a
regularization term. Therefore, the VAE loss is the sum of the negative expected log likelihood
(the reconstruction error) and the regularization term (Larsen et al., 2016).

The authors further propose a technique to exploit the capacity of the discriminator to
differentiate between real and generated images. The capacity of a neural network is defined as an
upper bound on the number of bits that can be extracted from the training data and stored in the
architecture during learning (Baldi and Vershynin, 2019). To obtain better quality images, Larsen
et al. (2016) replace the VAE reconstruction error term in Equation 29 with a reconstruction error
expressed in the GAN discriminator. For this, they introduce a Gaussian observation model for the
hidden representation of the l-th layer of the discriminator $Dis_l(x)$ with mean
$Dis_l(\tilde{x})$ and identity covariance,

\[ p(Dis_l(x)|z) = \mathcal{N}(Dis_l(x) \,|\, Dis_l(\tilde{x}), I), \]

where $\tilde{x} \sim Dec(z)$ is the output from the decoder for the data point x. Subsequently,
the VAE reconstruction error in Equation 29 is replaced with the following,

\[ \mathcal{L}^{Dis_l}_{llike} = -E_{z \sim Q(Z|X)}[\log p(Dis_l(x)|z)]. \]

Thus, the combined training objective for the VAE-GAN is as follows,

\[ \mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}^{Dis_l}_{llike} + \mathcal{L}_{GAN}. \tag{30} \]

The parameters for the decoder model $\theta_{Dec}$ are updated weighing the decoder's
reconstruction ability against the discriminator's discernment. The authors use a parameter, γ, to
weigh the VAE's ability to reconstruct against the discriminator. Therefore, the decoder update
is,

\[ \theta_{Dec} \overset{+}{\leftarrow} -\nabla_{\theta_{Dec}}\big(\gamma \mathcal{L}^{Dis_l}_{llike} - \mathcal{L}_{GAN}\big). \]

The training procedure for VAE-GAN is illustrated in Algorithm 2 and Figure 8.
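The scalar quantities entering Equations 28 and 30 can be written out as small functions. This is only a sketch of the loss arithmetic under stated assumptions: the discriminator probabilities and feature vectors are hypothetical numbers supplied by hand, no networks are trained, and the function names are introduced here for illustration.

```python
import math

def gan_objective(dis_real, dis_fake):
    """Sample estimate of L_GAN in Equation 28 from discriminator probabilities."""
    real_term = sum(math.log(d) for d in dis_real) / len(dis_real)
    fake_term = sum(math.log(1.0 - d) for d in dis_fake) / len(dis_fake)
    return real_term + fake_term

def feature_nll(features_x, features_x_tilde):
    """-log N(Dis_l(x) | Dis_l(x_tilde), I) for one pair of feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(features_x, features_x_tilde))
    return 0.5 * sq + 0.5 * len(features_x) * math.log(2.0 * math.pi)

def vae_gan_objective(l_prior, l_llike, l_gan):
    """Combined objective of Equation 30."""
    return l_prior + l_llike + l_gan

def decoder_objective(l_llike, l_gan, gamma):
    """Quantity whose gradient drives the decoder update: gamma * L_llike - L_GAN."""
    return gamma * l_llike - l_gan

# Hypothetical discriminator outputs: a confident discriminator versus one that
# cannot tell real from generated samples and outputs 0.5 everywhere.
confident = gan_objective(dis_real=[0.9, 0.8], dis_fake=[0.1, 0.2])
equilibrium = gan_objective(dis_real=[0.5, 0.5], dis_fake=[0.5, 0.5])
print("confident:", confident, "equilibrium:", equilibrium)
```

When the discriminator assigns equal probability to real and generated samples, the objective takes the value −log 4, which is the equilibrium described at the end of Section 6.2; a discriminator that separates the two sources achieves a larger value.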
Figure 8: Illustration of the flow through the combined VAE-GAN framework. Larsen et al. (2016) combine a VAE with a GAN by collapsing the decoder and the generator into one. The gray lines represent the training objective.

Algorithm 2: Training algorithm for VAE-GAN, taken from Larsen et al. (2016)
    $\theta_{\mathrm{Enc}}, \theta_{\mathrm{Dec}}, \theta_{\mathrm{Dis}} \leftarrow$ initialize network parameters
    repeat
        $X \leftarrow$ random mini-batch from data set
        $Z \leftarrow \mathrm{Enc}(X)$
        $\mathcal{L}_{\mathrm{prior}} \leftarrow D_{KL}(Q(Z|X) \,\|\, P(Z))$
        $\tilde{X} \leftarrow \mathrm{Dec}(Z)$
        $\mathcal{L}^{\mathrm{Dis}_l}_{\mathrm{llike}} \leftarrow -\mathbb{E}_{q(Z|X)}\left[\log p(\mathrm{Dis}_l(X)|Z)\right]$
        $Z_p \leftarrow$ samples from prior $\mathcal{N}(0, I)$
        $X_p \leftarrow \mathrm{Dec}(Z_p)$
        $\mathcal{L}_{\mathrm{GAN}} \leftarrow \log(\mathrm{Dis}(X)) + \log(1 - \mathrm{Dis}(\tilde{X})) + \log(1 - \mathrm{Dis}(X_p))$
        Update parameters according to gradients:
        $\theta_{\mathrm{Enc}} \xleftarrow{+} -\nabla_{\theta_{\mathrm{Enc}}}\left(\mathcal{L}_{\mathrm{prior}} + \mathcal{L}^{\mathrm{Dis}_l}_{\mathrm{llike}}\right)$
        $\theta_{\mathrm{Dec}} \xleftarrow{+} -\nabla_{\theta_{\mathrm{Dec}}}\left(\gamma \mathcal{L}^{\mathrm{Dis}_l}_{\mathrm{llike}} - \mathcal{L}_{\mathrm{GAN}}\right)$
        $\theta_{\mathrm{Dis}} \xleftarrow{+} -\nabla_{\theta_{\mathrm{Dis}}} \mathcal{L}_{\mathrm{GAN}}$
    until convergence

In recent years, the use of deep convolutional neural networks (CNNs) has resulted in state-of-the-art performance for generative modeling tasks, especially in the field of computer vision (e.g. Chen et al., 2016; Karras et al., 2019; Pidhorskyi et al., 2020; Radford et al., 2015). Such networks are computationally efficient, using convolution operations to extract information from high-dimensional data without human supervision (Simonyan et al., 2014). Motivated by this success, Larsen et al. (2016) train CNNs with batch-normalisation (Ioffe and Szegedy, 2015), ReLU activations (Krizhevsky et al., 2012), and consecutive down- and up-sampling layers in both the encoder and discriminator. They train:
• a traditional VAE,
• a VAE with learned distance, $\mathrm{VAE}_{\mathrm{Dis}_l}$, where the authors first train a GAN and use the $l$-th layer of the discriminator network as a learned similarity measure,
• the proposed VAE-GAN framework,
on the CelebA data set (Liu et al., 2015). As shown in Figure 9, the visual realism of the VAE-GAN is superior to that of the traditional VAE. Additionally, the learned latent space can be used to modify high-level facial features, such as skin tone and hair colour.

Figure 9: Reconstructions from different auto-encoders (Larsen et al., 2016).

The application of VI in machine learning is not limited to these frameworks. Different variants of both VAE and VAE-GAN have been implemented and have continued to produce state-of-the-art research in generative modelling tasks (e.g. Hou et al., 2017; Li et al., 2020; Purkait et al., 2020; Shu et al., 2018; Simonovsky and Komodakis, 2018; Vahdat and Kautz, 2020; Zhang et al., 2018; Zhao et al., 2019). In addition to image generation tasks, machine learning problems like anomaly detection, time series estimation, language modelling, dimensionality reduction and unsupervised representation learning have all used VI in one form or another (e.g. Graving and Couzin, 2020; Li et al., 2021; Pol et al., 2019; Yang et al., 2017).

7 Discussion

We have introduced the concept of VI, a tool to perform approximate statistical inference. VI re-structures the statistical problem of estimating the posterior probability density over the latent variable, given an observed variable, into an optimization problem. The key idea is to select a probability density from a family of tractable densities that is closest to the actual posterior probability density. We have demonstrated how:
• the KL-divergence can be used to measure the closeness between densities,
• the ELBO can be used as a criterion for model selection to better fit the observed data,
• VI can be used to fit a mixture of Gaussians.

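As a small numerical companion to the KL-divergence point, the sketch below evaluates the closed-form KL divergence between two univariate Gaussians and shows that it is not symmetric in its arguments. The function names `gaussian_kl` and `kl_to_standard_normal` are hypothetical, chosen for this illustration only; the second function covers the special case of a diagonal Gaussian measured against a standard normal prior, which is the form a prior term such as D_KL(Q(Z|X) || P(Z)) takes when both densities are Gaussian.

```python
import math


def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians; v1 and v2 are variances."""
    if v1 <= 0 or v2 <= 0:
        raise ValueError("variances must be positive")
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def kl_to_standard_normal(means, variances):
    """KL between a diagonal Gaussian and N(0, I), summed over dimensions."""
    return sum(gaussian_kl(m, v, 0.0, 1.0) for m, v in zip(means, variances))


# Swapping the arguments changes the value: KL is a divergence, not a distance.
forward = gaussian_kl(0.0, 1.0, 0.0, 4.0)
reverse = gaussian_kl(0.0, 4.0, 0.0, 1.0)
print(forward, reverse)
```

The asymmetry is why the direction of the divergence matters when choosing the variational objective.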
Moreover, we briefly presented the scenarios where VI has been applied to modern machine learning tasks, specifically in computer vision and generative modeling, and investigated how combining deep learning and VI enables us to perform inference on extremely complex posterior distributions.

VI is a powerful tool that allows us to approximate the actual probability density of the latent representation. However, there are still many open avenues for statistical research. One such avenue is to develop better approximations (achieving lower KL-divergence) to the posterior density, while maintaining efficient optimization. For example, the mean-field family makes strong independence assumptions which aid in scalable optimization. However, these assumptions may lead the variance of the approximated density to under-represent that of the target density (Blei et al., 2017). As an alternative to the mean-field method, Minka (2005) uses a fully-factorized approximation with no explicit exponential family constraint along with loopy belief propagation to achieve a lower KL-divergence. Another possible area of research is to use α-divergence measures (Zhang et al., 2018) to get a tighter fit to the ELBO. Although research in the field of VI algorithms has grown in recent years, efforts to make VI more efficient, accurate, scalable and easier to use are still ongoing.

8 Acknowledgements

We would like to thank our colleagues Sanjana Jain, Dhruba Pujary, and Ukrit Watchareeruetai for their constructive input and feedback during the writing of this paper.

References

P. Baldi and R. Vershynin. The capacity of feedforward neural networks. Neural Networks, 116:288–311, 2019.
D. Barber. Bayesian reasoning and machine learning. Cambridge University Press, 2012.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, M. West, et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7(453-464):210, 2003.
C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2180–2188, 2016.
P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.
A. Dembo, T. M. Cover, and J. A. Thomas. Information theoretic inequalities. IEEE Transactions on Information Theory, 37(6):1501–1518, 1991.
C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
F. Gagliardi Cozman. Generalizing Variable Elimination in Bayesian Networks. Probabilistic Reasoning in Artificial Intelligence, pages 1–11, 2000.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
J. Graving and I. Couzin. VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering. 2020. doi: 10.1101/2020.07.17.207993.
X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141. IEEE, 2017.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
M. Klaričić Bakula, M. Matić, and J. Pečarić. On some general inequalities related to Jensen's inequality. Inequalities and Applications, pages 233–243, 2008.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.

A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pages 1558–1566. PMLR, 2016.
W. Li, W. Hu, N. Chen, and C. Feng. Stacking VAE with graph neural networks for effective and interpretable time series anomaly detection. arXiv preprint arXiv:2105.08397, 2021.
Z. Li, R. Togo, T. Ogawa, and M. Haseyama. Variational autoencoder based unsupervised domain adaptation for semantic segmentation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 2426–2430. IEEE, 2020.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. Proceedings of the IEEE International Conference on Computer Vision, 2015 International Conference on Computer Vision, ICCV 2015:3730–3738, 2015. ISSN 15505499. doi: 10.1109/ICCV.2015.425.
D. J. MacKay. Information Theory, Inference, and Learning Algorithms, volume 44. 2015. ISBN 9780521642989. doi: 10.1198/jasa.2005.s54.
A. L. Madsen and F. V. Jensen. Lazy propagation: a junction tree inference algorithm based on lazy evaluation. Artificial Intelligence, 113(1-2):203–245, 1999.
C. A. McGrory and D. Titterington. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis, 51(11):5352–5367, 2007.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
T. P. Minka. Divergence measures and message passing. Microsoft Research Technical Report, (MSR-TR-2005-173):17, 2005. ISSN 0735-0015.
S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. Monte Carlo gradient estimation in machine learning. J. Mach. Learn. Res., 21(132):1–62, 2020.
K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, Cambridge, Mass. [u.a.], 2013. ISBN 9780262018029 0262018020.
D. J. Nott, S. L. Tan, M. Villani, and R. Kohn. Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics, 21(3):797–820, 2012.
S. Pidhorskyi, D. A. Adjeroh, and G. Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.
A. A. Pol, V. Berger, C. Germain, G. Cerminara, and M. Pierini. Anomaly detection with conditional variational autoencoders. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1651–1657. IEEE, 2019.
P. Purkait, C. Zach, and I. Reid. SG-VAE: Scene grammar variational autoencoder to generate new indoor scenes. In European Conference on Computer Vision, pages 155–171. Springer, 2020.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR, 2014.
J. Regier, A. Miller, J. McAuliffe, R. Adams, M. Hoffman, D. Lang, D. Schlegel, and M. Prabhat. Celeste: Variational inference for a generative model of astronomical images. In International Conference on Machine Learning, pages 2095–2103. PMLR, 2015.
C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
J. Shlens. Notes on Kullback-Leibler divergence and likelihood. arXiv preprint arXiv:1404.2000, 2014.
R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon. Amortized inference regularization. arXiv preprint arXiv:1805.08913, 2018.
M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. 2nd International Conference on Learning Representations, ICLR 2014 - Workshop Track Proceedings, pages 1–8, 2014.
T. Tabouy, P. Barbillon, and J. Chiquet. Variational inference for stochastic block models from sampled data. Journal of the American Statistical Association, 115(529):455–466, 2020.
A. Vahdat and J. Kautz. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898, 2020.
Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In International Conference on Machine Learning, pages 3881–3890. PMLR, 2017.
C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.
S. Zhao, J. Song, and S. Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5885–5892, 2019.

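A Appendix

We formulate the variational approximation for the mixture of Gaussians in Equation 19 as follows:

$$q(\mu, c) = \prod_{j=1}^{K} q(\mu_j; m_j, s_j^2) \prod_{i=1}^{N} q(c_i; \phi_i). \tag{31}$$

From Equation 20, we have the following definition of ELBO:

$$\mathrm{ELBO}(m, s^2, \phi) = \mathbb{E}[\log p(x, \mu, c)] - \mathbb{E}[\log q(\mu, c)], \tag{32}$$
$$\mathrm{ELBO}_{m,s^2,\phi} = \mathrm{ELBO}(m, s^2, \phi),$$

where expectations are taken under $q$, and $m$, $s^2$ and $\phi$ are the variational parameters.

In order to derive the optimal values of the variational parameters, we will first have to express the ELBO in Equation 32 in terms of $m$, $s^2$ and $\phi$. We start with simplifying the first term, $\log p(x, \mu, c)$, on the RHS of Equation 32 as follows,

$$\begin{aligned}
\log p(x, \mu, c) &= \log p(x|\mu, c)\,p(\mu, c), \\
&= \log p(x|\mu, c)\,p(\mu)\,p(c), \\
&= \log p(x|\mu, c) + \log p(\mu) + \log p(c), \\
&= \sum_j \log p(\mu_j) + \sum_i \left[\log p(c_i) + \log p(x_i|c_i, \mu)\right],
\end{aligned} \tag{33}$$

where $p(c_i) = \frac{1}{K}$ is a constant and expanding $p(\mu_j)$ we have,

$$\begin{aligned}
\log p(\mu_j) &= \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mu_j^2}{2\sigma^2}\right)\right], \\
&\propto -\frac{\mu_j^2}{2\sigma^2}.
\end{aligned} \tag{34}$$

For $p(x_i|c_i, \mu)$, in Equation 33, we can make use of the fact that $c_i$ is a one-hot vector. Therefore, $\log p(x_i|c_i, \mu)$ can be expressed as:

$$\begin{aligned}
\log p(x_i|c_i, \mu) &= \log \prod_j p(x_i|\mu_j)^{c_{ij}}, \\
&= \sum_j c_{ij} \log p(x_i|\mu_j).
\end{aligned} \tag{35}$$

In Equation 16 we define the density function for the real-valued data point $x_i$, from which:

$$\begin{aligned}
\log p(x_i|\mu_j) &= \log\left[\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_j)^2}{2}\right)\right], \\
&\propto -\frac{(x_i - \mu_j)^2}{2}.
\end{aligned} \tag{36}$$

We now re-write Equation 33 by combining the derivations from Equations 34, 35 and 36 as,

$$\log p(x, \mu, c) \propto \sum_j -\frac{\mu_j^2}{2\sigma^2} + \sum_i \sum_j -c_{ij}\frac{(x_i - \mu_j)^2}{2}. \tag{37}$$

We now factorize the variational joint probability density $q(\mu, c)$ in Equation 32 as,

$$\begin{aligned}
\log q(\mu, c) &= \log\left[\prod_j q(\mu_j; m_j, s_j^2) \prod_i q(c_i; \phi_i)\right], \\
&= \sum_j \log q(\mu_j; m_j, s_j^2) + \sum_i \log q(c_i; \phi_i).
\end{aligned} \tag{38}$$

We further expand the terms on the RHS of Equation 38 as follows,

$$\begin{aligned}
\log q(\mu_j; m_j, s_j^2) &= \log\left[\frac{1}{\sqrt{2\pi s_j^2}} \exp\left(-\frac{(\mu_j - m_j)^2}{2 s_j^2}\right)\right], \\
&= -\frac{1}{2}\log(2\pi s_j^2) - \frac{(\mu_j - m_j)^2}{2 s_j^2}.
\end{aligned} \tag{39}$$

Since $c_i$ is a one-hot vector and $\phi_{ij}$ is the probability that $c_{ij} = 1$ under $q$,

$$\begin{aligned}
\log q(c_i; \phi_i) &= \log \prod_j \phi_{ij}^{c_{ij}}, \\
&= \sum_j c_{ij} \log \phi_{ij}.
\end{aligned} \tag{40}$$

We combine the derivations for the joint variational probability density from Equations 39 and 40 to re-write Equation 38 as,

$$\log q(\mu, c) = \sum_j \left[-\frac{1}{2}\log(2\pi s_j^2) - \frac{(\mu_j - m_j)^2}{2 s_j^2}\right] + \sum_i \sum_j c_{ij} \log \phi_{ij}. \tag{41}$$

The final step towards deriving the ELBO in terms of the variational parameters is to substitute the results from Equations 37 and 41 into Equation 32 as follows,

$$\begin{aligned}
\mathrm{ELBO}_{m,s^2,\phi} \propto{} & -\sum_j \mathbb{E}\left[\frac{\mu_j^2}{2\sigma^2}\right] - \sum_i \sum_j \mathbb{E}[c_{ij}]\,\mathbb{E}\left[\frac{(x_i - \mu_j)^2}{2}\right] \\
& + \sum_j \mathbb{E}\left[\frac{1}{2}\log(2\pi s_j^2) + \frac{(\mu_j - m_j)^2}{2 s_j^2}\right] - \sum_i \sum_j \mathbb{E}\left[c_{ij} \log \phi_{ij}\right],
\end{aligned}$$

where the expectation of the product $c_{ij}(x_i - \mu_j)^2$ factorizes because $c$ and $\mu$ are independent under the mean-field density $q$.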

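The expectations in the ELBO can be evaluated in closed form because, under q, each µ_j is Gaussian with mean m_j and variance s_j^2 and each c_i is categorical with parameters φ_i. As a worked step, the standard moment identities used in the remainder of the derivation are listed below; they follow directly from these definitions and are not spelled out in the source.

```latex
\begin{aligned}
\mathbb{E}_q[c_{ij}] &= \phi_{ij}, \\
\mathbb{E}_q[\mu_j] &= m_j, \\
\mathbb{E}_q[\mu_j^2] &= m_j^2 + s_j^2, \\
\mathbb{E}_q\left[(\mu_j - m_j)^2\right] &= s_j^2, \\
\mathbb{E}_q\left[(x_i - \mu_j)^2\right] &= x_i^2 - 2 x_i m_j + m_j^2 + s_j^2 .
\end{aligned}
```

In particular, the term $-\tfrac{1}{2}\mathbb{E}_q[(x_i - \mu_j)^2]$ equals $-\tfrac{1}{2}(m_j^2 + s_j^2) + x_i m_j$ up to the additive constant $-\tfrac{1}{2}x_i^2$, which does not depend on any variational parameter.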
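The final ELBO objective in terms of the variational parameters is as follows:

$$\begin{aligned}
\mathrm{ELBO}_{m,s^2,\phi} \propto{} & -\sum_j \mathbb{E}\left[\frac{\mu_j^2}{2\sigma^2}\right] - \sum_i \sum_j \phi_{ij}\,\mathbb{E}\left[\frac{(x_i - \mu_j)^2}{2}\right] \\
& + \frac{1}{2}\sum_j \log(s_j^2) - \sum_i \sum_j \phi_{ij} \log \phi_{ij},
\end{aligned} \tag{42}$$

where all expectations are taken under $q$. Here we have used $\mathbb{E}[c_{ij}] = \phi_{ij}$ and $\mathbb{E}[(\mu_j - m_j)^2] = s_j^2$, and dropped additive constants.

Now, to derive the optimal values of the variational parameters we take partial derivatives of the ELBO in Equation 42 with respect to the variational parameters and equate them to zero.

Deriving $\phi_{ij}^*$, the optimal value of $\phi_{ij}$:

$$\begin{aligned}
\frac{\partial}{\partial \phi_{ij}} \mathrm{ELBO}_{m,s^2,\phi} &\propto \frac{\partial}{\partial \phi_{ij}} \left\{ -\phi_{ij}\,\mathbb{E}\left[\frac{(x_i - \mu_j)^2}{2}\right] - \phi_{ij} \log \phi_{ij} \right\}, \\
&\propto \mathbb{E}\left[-\frac{(x_i - \mu_j)^2}{2}\right] - \log \phi_{ij} - 1, \\
&\propto -\frac{1}{2}(m_j^2 + s_j^2) + x_i m_j - \log \phi_{ij},
\end{aligned}$$

where terms that do not depend on $j$ are dropped, since they are absorbed when $\phi_{ij}$ is normalized so that $\sum_j \phi_{ij} = 1$.

We derive the optimal value of $\phi_{ij}$, as follows:

$$0 = \frac{\partial}{\partial \phi_{ij}} \mathrm{ELBO}_{m,s^2,\phi},$$
$$\phi_{ij}^* \propto \exp\left\{ -\frac{1}{2}(m_j^2 + s_j^2) + x_i m_j \right\}.$$

Deriving $m_j^*$, the optimal value of $m_j$:

$$\begin{aligned}
\frac{\partial}{\partial m_j} \mathrm{ELBO}_{m,s^2,\phi} &\propto \frac{\partial}{\partial m_j} \left\{ -\sum_i \phi_{ij}\,\mathbb{E}\left[\frac{(x_i - \mu_j)^2}{2}\right] - \mathbb{E}\left[\frac{\mu_j^2}{2\sigma^2}\right] \right\}, \\
&\propto \frac{\partial}{\partial m_j} \left\{ \sum_i \phi_{ij}\left[-\frac{1}{2}(m_j^2 + s_j^2) + x_i m_j\right] - \frac{1}{2\sigma^2}(m_j^2 + s_j^2) \right\}, \\
&\propto \frac{\partial}{\partial m_j} \left\{ \sum_i \left[-\frac{1}{2}\phi_{ij} m_j^2 + \phi_{ij} x_i m_j\right] - \frac{1}{2\sigma^2} m_j^2 \right\}, \\
&\propto \sum_i \left[-\phi_{ij} m_j + \phi_{ij} x_i\right] - \frac{m_j}{\sigma^2}.
\end{aligned}$$

We derive the optimal value of $m_j$, as follows:

$$0 = \frac{\partial}{\partial m_j} \mathrm{ELBO}_{m,s^2,\phi},$$
$$m_j^* = \frac{\sum_i \phi_{ij} x_i}{\frac{1}{\sigma^2} + \sum_i \phi_{ij}}.$$

Deriving $(s_j^2)^*$, the optimal value of $s_j^2$:

$$\begin{aligned}
\frac{\partial}{\partial s_j^2} \mathrm{ELBO}_{m,s^2,\phi} &\propto \frac{\partial}{\partial s_j^2} \left\{ -\sum_i \phi_{ij}\,\mathbb{E}\left[\frac{(x_i - \mu_j)^2}{2}\right] - \mathbb{E}\left[\frac{\mu_j^2}{2\sigma^2}\right] + \frac{1}{2}\log(s_j^2) \right\}, \\
&\propto \frac{\partial}{\partial s_j^2} \left\{ \sum_i \phi_{ij}\left[-\frac{1}{2}(m_j^2 + s_j^2) + x_i m_j\right] - \frac{1}{2\sigma^2}(m_j^2 + s_j^2) + \frac{1}{2}\log(s_j^2) \right\}, \\
&\propto \frac{\partial}{\partial s_j^2} \left\{ -\sum_i \frac{1}{2}\phi_{ij} s_j^2 - \frac{1}{2\sigma^2} s_j^2 + \frac{1}{2}\log s_j^2 \right\}, \\
&\propto -\sum_i \frac{1}{2}\phi_{ij} - \frac{1}{2\sigma^2} + \frac{1}{2 s_j^2}.
\end{aligned}$$

We derive the optimal value of $s_j^2$, as follows:

$$0 = \frac{\partial}{\partial s_j^2} \mathrm{ELBO}_{m,s^2,\phi},$$
$$(s_j^2)^* = \frac{1}{\frac{1}{\sigma^2} + \sum_i \phi_{ij}}.$$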

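The three optimal values derived in this appendix can be iterated as coordinate updates until they stop changing. The following is a minimal Python sketch of that loop, under stated assumptions: the function name `cavi_gmm`, the quantile-based initialization of the means, the fixed iteration count and the toy data are all choices made for this illustration and are not taken from the paper. The model is the one used in the derivation: unit-variance components and a zero-mean Gaussian prior with variance σ² on each component mean.

```python
import math
import random


def cavi_gmm(x, K, sigma2=10.0, iters=50):
    """Iterate the phi, m and s^2 updates for a unit-variance Gaussian mixture.

    x: list of real-valued data points; K: number of components;
    sigma2: prior variance of each component mean.
    """
    n = len(x)
    xs = sorted(x)
    # Assumption of this sketch: start the means at evenly spaced data quantiles.
    m = [xs[(2 * j + 1) * n // (2 * K)] for j in range(K)]
    s2 = [1.0] * K
    phi = [[1.0 / K] * K for _ in range(n)]
    for _ in range(iters):
        # phi_ij proportional to exp(x_i m_j - (m_j^2 + s_j^2) / 2), normalized over j.
        for i in range(n):
            logits = [x[i] * m[j] - 0.5 * (m[j] ** 2 + s2[j]) for j in range(K)]
            top = max(logits)  # subtract the maximum for numerical stability
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi[i] = [w / total for w in weights]
        # m_j = sum_i phi_ij x_i / (1/sigma2 + sum_i phi_ij); s_j^2 = 1 / (same denominator).
        for j in range(K):
            denom = 1.0 / sigma2 + sum(phi[i][j] for i in range(n))
            m[j] = sum(phi[i][j] * x[i] for i in range(n)) / denom
            s2[j] = 1.0 / denom
    return m, s2, phi


# Toy data: two well-separated unit-variance clusters (synthetic, for illustration).
rng = random.Random(0)
data = [rng.gauss(-5.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]
m, s2, phi = cavi_gmm(data, K=2)
print(sorted(m), s2)
```

A practical implementation would monitor the ELBO and stop when it no longer increases, rather than running a fixed number of iterations, and would typically try several initializations because the updates can converge to a local optimum.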
