
Improved Variational Inference

with Inverse Autoregressive Flow

Diederik P. Kingma    Tim Salimans    Rafal Jozefowicz    Xi Chen
Ilya Sutskever    Max Welling∗

arXiv:1606.04934v2 [cs.LG] 30 Jan 2017

Abstract

The framework of normalizing flows provides a general strategy for flexible vari-
ational inference of posteriors over latent variables. We propose a new type of
normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier
published flows, scales well to high-dimensional latent spaces. The proposed flow
consists of a chain of invertible transformations, where each transformation is
based on an autoregressive neural network. In experiments, we show that IAF
significantly improves upon diagonal Gaussian approximate posteriors. In addition,
we demonstrate that a novel type of variational autoencoder, coupled with IAF, is
competitive with neural autoregressive models in terms of attained log-likelihood
on natural images, while allowing significantly faster synthesis.

1 Introduction

Stochastic variational inference (Blei et al., 2012; Hoffman et al., 2013) is a method for scalable
posterior inference with large datasets using stochastic gradient ascent. It can be made especially
efficient for continuous latent variables through latent-variable reparameterization and inference
networks, amortizing the cost, resulting in a highly scalable learning procedure (Kingma and Welling,
2013; Rezende et al., 2014; Salimans et al., 2014). When using neural networks for both the
inference network and the generative model, this results in a class of models called variational
auto-encoders (VAEs) (Kingma and Welling, 2013). A general strategy for building flexible inference
networks is the framework of normalizing flows (Rezende and Mohamed, 2015). In this paper we
propose a new type of flow, inverse autoregressive flow (IAF), which scales well to high-dimensional
latent spaces.
At the core of our proposed method lie Gaussian autoregressive functions that are normally used
for density estimation: functions that take as input a variable with some specified ordering such
as multidimensional tensors, and output a mean and standard deviation for each element of the
input variable conditioned on the previous elements. Examples of such functions are autoregressive
neural density estimators such as RNNs, MADE (Germain et al., 2015), PixelCNN (van den Oord
et al., 2016b) or WaveNet (van den Oord et al., 2016a) models. We show that such functions
can often be turned into invertible nonlinear transformations of the input, with a simple Jacobian
determinant. Since the transformation is flexible and the determinant known, it can be used as a
normalizing flow, transforming a tensor with relatively simple known density, into a new tensor with
more complicated density that is still cheaply computable. In contrast with most previous work on

∗ University of Amsterdam, University of California Irvine, and the Canadian Institute for Advanced Research
(CIFAR).

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Figure 1: (a) Prior distribution; (b) posteriors in a standard VAE; (c) posteriors in a VAE with IAF.
Best viewed in color. We fitted a variational auto-encoder (VAE) with a spherical Gaussian prior,
and with factorized Gaussian posteriors (b) or inverse autoregressive flow (IAF) posteriors (c) to a
toy dataset with four datapoints. Each colored cluster corresponds to the posterior distribution of
one datapoint. IAF greatly improves the flexibility of the posterior distributions, and allows for a
much better fit between the posteriors and the prior.

improving inference models including previously used normalizing flows, this transformation is well
suited to high-dimensional tensor variables, such as spatio-temporally organized variables.
We demonstrate this method by improving inference networks of deep variational auto-encoders.
In particular, we train deep variational auto-encoders with latent variables at multiple levels of the
hierarchy, where each stochastic variable is a three-dimensional tensor (a stack of featuremaps), and
demonstrate improved performance.

2 Variational Inference and Learning


Let x be a (set of) observed variable(s), z a (set of) latent variable(s) and let p(x, z) be the parametric
model of their joint distribution, called the generative model defined over the variables. Given a
dataset X = {x1 , ..., xN } we typically wish to perform maximum marginal likelihood learning of its
parameters, i.e. to maximize
log p(X) = Σ_{i=1}^{N} log p(x^{(i)}),    (1)

but in general this marginal likelihood is intractable to compute or differentiate directly for flexible
generative models, e.g. when components of the generative model are parameterized by neural
networks. A solution is to introduce q(z|x), a parametric inference model defined over the latent
variables, and optimize the variational lower bound on the marginal log-likelihood of each observation
x:
log p(x) ≥ Eq(z|x) [log p(x, z) − log q(z|x)] = L(x; θ) (2)
where θ indicates the parameters of the p and q models. Keeping in mind that Kullback-Leibler
divergences DKL(.) are non-negative, it is clear that L(x; θ) is a lower bound on log p(x), since it
can be written as follows:
L(x; θ) = log p(x) − DKL (q(z|x)||p(z|x))    (3)
There are various ways to optimize the lower bound L(x; θ); for continuous z it can be done efficiently
through a re-parameterization of q(z|x), see e.g. (Kingma and Welling, 2013; Rezende et al., 2014).
As can be seen from equation (3), maximizing L(x; θ) w.r.t. θ will concurrently maximize log p(x)
and minimize DKL (q(z|x)||p(z|x)). The closer DKL (q(z|x)||p(z|x)) is to 0, the closer L(x; θ) will
be to log p(x), and the better an approximation our optimization objective L(x; θ) is to our true objec-
tive log p(x). Also, minimization of DKL (q(z|x)||p(z|x)) can be a goal in itself, if we’re interested
in using q(z|x) for inference after optimization. In any case, the divergence DKL (q(z|x)||p(z|x))
is a function of our parameters through both the inference model and the generative model, and
increasing the flexibility of either is generally helpful towards our objective.

Note that in models with multiple latent variables, the inference model is typically factorized into
partial inference models with some ordering; e.g. q(za , zb |x) = q(za |x)q(zb |za , x). We’ll write
q(z|x, c) to denote such partial inference models, conditioned on both the data x and a further context
c which includes the previous latent variables according to the ordering.

2.1 Requirements for Computational Tractability

Requirements for the inference model, in order to be able to efficiently optimize the bound, are that it
is (1) computationally efficient to compute and differentiate its probability density q(z|x), and (2)
computationally efficient to sample from, since both these operations need to be performed for each
datapoint in a minibatch at every iteration of optimization. If z is high-dimensional and we want
to make efficient use of parallel computational resources like GPUs, then parallelizability of these
operations across dimensions of z is a large factor towards efficiency. This requirement restricts the
class of approximate posteriors q(z|x) that are practical to use. In practice this often leads to the use
of diagonal posteriors, e.g. q(z|x) ∼ N (µ(x), σ²(x)), where µ(x) and σ(x) are often nonlinear
functions parameterized by neural networks. However, as explained above, we also need the density
q(z|x) to be sufficiently flexible to match the true posterior p(z|x).
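
To make these two requirements concrete, the following numpy sketch (illustrative only, not taken from the paper's code release) computes a single-sample Monte Carlo estimate of the bound of eq. (2) with a diagonal Gaussian posterior and the reparameterization z = µ(x) + σ(x) ⊙ ε; the factorized Gaussian prior p(z) = N(0, I), the unit-variance Gaussian likelihood, and the linear stand-in decoder are assumptions made purely for illustration.

```python
# Single-sample estimate of L(x; theta) of eq. (2) with a diagonal Gaussian posterior.
# Toy modeling assumptions: p(z) = N(0, I) and p(x|z) = N(decode_mean(z), I).
import numpy as np

def diag_gaussian_logpdf(z, mu, sigma):
    # log N(z; mu, diag(sigma^2)), summed over dimensions
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - 0.5 * ((z - mu) / sigma) ** 2)

def elbo_single_sample(x, mu_q, sigma_q, decode_mean, rng):
    """One Monte Carlo sample of E_q[log p(x, z) - log q(z|x)].

    mu_q, sigma_q: outputs of a (hypothetical) inference network for datapoint x.
    decode_mean:   a (hypothetical) decoder mapping z to the mean of p(x|z).
    """
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + sigma_q * eps                        # reparameterized sample: cheap, parallel over dimensions
    log_q = diag_gaussian_logpdf(z, mu_q, sigma_q)  # cheap density evaluation, requirement (1)
    log_prior = diag_gaussian_logpdf(z, np.zeros_like(z), np.ones_like(z))
    log_lik = diag_gaussian_logpdf(x, decode_mean(z), np.ones_like(x))
    return log_lik + log_prior - log_q

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((5, 3))               # toy linear "decoder"
x = rng.standard_normal(5)
elbo = elbo_single_sample(x, mu_q=np.zeros(3), sigma_q=np.ones(3),
                          decode_mean=lambda z: W @ z, rng=rng)
```

Both operations used here, sampling and density evaluation, are elementwise over the dimensions of z, which is what makes the diagonal Gaussian family so cheap; the rest of the paper is about retaining these properties while making q(z|x) more flexible.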

2.2 Normalizing Flow

Normalizing Flow (NF), introduced by (Rezende and Mohamed, 2015) in the context of stochastic
gradient variational inference, is a powerful framework for building flexible posterior distributions
through an iterative procedure. The general idea is to start off with an initial random variable with a
relatively simple distribution with known (and computationally cheap) probability density function,
and then apply a chain of invertible parameterized transformations ft , such that the last iterate zT has
a more flexible distribution²:
z0 ∼ q(z0 |x), zt = ft (zt−1 , x) ∀t = 1...T (4)
As long as the Jacobian determinant of each of the transformations ft can be computed, we can still
compute the probability density function of the last iterate:
log q(zT |x) = log q(z0 |x) − Σ_{t=1}^{T} log |det(dz_t/dz_{t−1})|    (5)
However, (Rezende and Mohamed, 2015) experiment with only a very limited family of such
invertible transformations with known Jacobian determinant, namely:
ft (zt−1 ) = zt−1 + u h(wᵀzt−1 + b)    (6)
where u and w are vectors, wᵀ is w transposed, b is a scalar and h(.) is a nonlinearity, such that
u h(wᵀzt−1 + b) can be interpreted as an MLP with a bottleneck hidden layer with a single unit. Since
information goes through the single bottleneck, a long chain of transformations is required to capture
high-dimensional dependencies.
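
As an illustration of eqs. (4)-(6), the numpy sketch below applies a short chain of planar-flow steps and accumulates the log-density via eq. (5); the tanh choice for h(.) and the determinant 1 + h′(wᵀz + b)·uᵀw (matrix determinant lemma) follow (Rezende and Mohamed, 2015), while the constraint on u that guarantees invertibility is omitted here for brevity.

```python
# One planar-flow step f(z) = z + u * tanh(w^T z + b), eq. (6), plus its log-det term for eq. (5).
import numpy as np

def planar_flow_step(z, u, w, b):
    a = w @ z + b
    z_new = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w          # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))    # det(df/dz) = 1 + psi^T u (matrix determinant lemma)
    return z_new, log_det

rng = np.random.default_rng(1)
D, T = 4, 3
z = rng.standard_normal(D)                                 # z_0 ~ q(z_0|x), here N(0, I) for simplicity
log_q = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z ** 2)    # log q(z_0|x)
for _ in range(T):
    u, w, b = rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal()
    z, log_det = planar_flow_step(z, u, w, b)
    log_q -= log_det                                       # eq. (5)
```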

3 Inverse Autoregressive Transformations


In order to find a type of normalizing flow that scales well to high-dimensional space, we consider
Gaussian versions of autoregressive autoencoders such as MADE (Germain et al., 2015) and the
PixelCNN (van den Oord et al., 2016b). Let y be a variable modeled by such a model, with some
chosen ordering on its elements y = {y_i}_{i=1}^D. We will use [µ(y), σ(y)] to denote the function
mapping the vector y to the vectors µ and σ. Due to the autoregressive structure, the Jacobian is lower
triangular with zeros on the diagonal: ∂[µ_i, σ_i]/∂y_j = [0, 0] for j ≥ i. The elements
[µ_i(y_{1:i−1}), σ_i(y_{1:i−1})] are the predicted mean and standard deviation of the i-th element of y,
which are functions of only the previous elements in y.
Sampling from such a model is a sequential transformation from a noise vector ε ∼ N (0, I) to the
corresponding vector y: y_0 = µ_0 + σ_0 · ε_0, and for i > 0, y_i = µ_i(y_{1:i−1}) + σ_i(y_{1:i−1}) · ε_i.
² where x is the context, such as the value of the datapoint. In case of models with multiple levels of latent
variables, the context also includes the value of the previously sampled latent variables.

Algorithm 1: Pseudo-code of an approximate posterior with Inverse Autoregressive Flow (IAF)
Data:
x: a datapoint, and optionally other conditioning information
θ: neural network parameters
EncoderNN(x; θ): encoder neural network, with additional output h
AutoregressiveNN[∗](z, h; θ): autoregressive neural networks, with additional input h
sum(.): sum over vector elements
sigmoid(.): element-wise sigmoid function
Result:
z: a random sample from q(z|x), the approximate posterior distribution
l: the scalar value of log q(z|x), evaluated at sample 'z'
[µ, σ, h] ← EncoderNN(x; θ)
ε ∼ N (0, I)
z ← σ ⊙ ε + µ
l ← −sum(log σ + ½ε² + ½ log(2π))
for t ← 1 to T do
    [m, s] ← AutoregressiveNN[t](z, h; θ)
    σ ← sigmoid(s)
    z ← σ ⊙ z + (1 − σ) ⊙ m
    l ← l − sum(log σ)
end

The computation involved in this sequential transformation is clearly proportional to the dimensionality D.
Since variational inference requires sampling from the posterior, such models are not interesting for direct
use in such applications. However, the inverse transformation is interesting for normalizing flows, as
we will show. As long as we have σ_i > 0 for all i, the sampling transformation above is a one-to-one
transformation, and can be inverted: ε_i = (y_i − µ_i(y_{1:i−1})) / σ_i(y_{1:i−1}).
We make two key observations, important for normalizing flows. The first is that this inverse
transformation can be parallelized, since (in the case of autoregressive autoencoders) the computations of
the individual elements ε_i do not depend on each other. The vectorized transformation is:
ε = (y − µ(y))/σ(y)    (7)
where the subtraction and division are elementwise.
The second key observation is that this inverse autoregressive operation has a simple Jacobian
determinant. Note that due to the autoregressive structure, ∂[µ_i, σ_i]/∂y_j = [0, 0] for j ≥ i. As a
result, the transformation has a lower triangular Jacobian (∂ε_i/∂y_j = 0 for j > i), with a simple
diagonal: ∂ε_i/∂y_i = 1/σ_i(y_{1:i−1}). The determinant of a lower triangular matrix equals the product
of the diagonal terms. As a result, the log-determinant of the Jacobian of the transformation is remarkably
simple and straightforward to compute:
log |det(dε/dy)| = − Σ_{i=1}^{D} log σ_i(y)    (8)
The combination of model flexibility, parallelizability across dimensions, and simple log-determinant,
make this transformation interesting for use as a normalizing flow over high-dimensional latent space.
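
Both observations are easy to check numerically. The sketch below (illustrative only) uses a toy masked linear model in place of MADE or PixelCNN: it samples y with the sequential transformation, recovers ε in a single vectorized pass via eq. (7), and compares the log-determinant of eq. (8) against a finite-difference Jacobian.

```python
# Numerical check of eqs. (7)-(8) with a toy Gaussian autoregressive model
# (strictly lower-triangular masked linear maps stand in for MADE / PixelCNN).
import numpy as np

rng = np.random.default_rng(0)
D = 5
mask = np.tril(np.ones((D, D)), k=-1)          # row i depends only on y_{1:i-1}
Wm = rng.standard_normal((D, D)) * mask
Ws = 0.1 * rng.standard_normal((D, D)) * mask

def mu_sigma(y):
    return Wm @ y, np.exp(Ws @ y)              # sigma_i(y_{1:i-1}) > 0

# Sequential sampling: y_i = mu_i(y_{1:i-1}) + sigma_i(y_{1:i-1}) * eps_i  (D sequential steps)
eps = rng.standard_normal(D)
y = np.zeros(D)
for i in range(D):
    mu, sigma = mu_sigma(y)
    y[i] = mu[i] + sigma[i] * eps[i]

# Parallel inverse, eq. (7): one vectorized pass recovers eps exactly.
mu, sigma = mu_sigma(y)
eps_rec = (y - mu) / sigma
assert np.allclose(eps_rec, eps)

# Log-determinant of eq. (8) versus a finite-difference Jacobian of eps w.r.t. y.
logdet_formula = -np.sum(np.log(sigma))
J, delta = np.zeros((D, D)), 1e-6
for j in range(D):
    y_p = y.copy(); y_p[j] += delta
    mu_p, sigma_p = mu_sigma(y_p)
    J[:, j] = ((y_p - mu_p) / sigma_p - eps_rec) / delta
assert np.isclose(logdet_formula, np.log(np.abs(np.linalg.det(J))), atol=1e-3)
```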

4 Inverse Autoregressive Flow (IAF)


We propose a new type of normalizing flow (eq. (5)), based on transformations that are equivalent to
the inverse autoregressive transformation of eq. (7) up to reparameterization. See algorithm 1 for
pseudo-code of an approximate posterior with the proposed flow. We let an initial encoder neural
network output µ_0 and σ_0, in addition to an extra output h, which serves as an additional input to
each subsequent step in the flow. We draw a random sample ε ∼ N (0, I), and initialize the chain
with:
z_0 = µ_0 + σ_0 ⊙ ε    (9)

[Figure 2 diagrams: an approximate posterior with Inverse Autoregressive Flow (IAF). An encoder NN
maps x to µ, σ and a context h; the initial sample z = µ + σ ⊙ ε is followed by a chain of IAF steps,
each using an autoregressive NN with inputs z and h to produce the next µ_t, σ_t.]
Figure 2: Like other normalizing flows, drawing samples from an approximate posterior with Inverse
Autoregressive Flow (IAF) consists of an initial sample z drawn from a simple distribution, such as a
Gaussian with diagonal covariance, followed by a chain of nonlinear invertible transformations of z,
each with a simple Jacobian determinant.

The flow consists of a chain of T of the following transformations:
z_t = µ_t + σ_t ⊙ z_{t−1}    (10)
where at the t-th step of the flow, we use a different autoregressive neural network with inputs z_{t−1}
and h, and outputs µ_t and σ_t. The neural network is structured to be autoregressive w.r.t. z_{t−1}, such
that for any choice of its parameters, the Jacobians dµ_t/dz_{t−1} and dσ_t/dz_{t−1} are triangular with
zeros on the diagonal. As a result, dz_t/dz_{t−1} is triangular with σ_t on the diagonal, with determinant
Π_{i=1}^{D} σ_{t,i}. (Note that the Jacobian w.r.t. h does not have constraints.) Following eq. (5), the
density under the final iterate is:
log q(z_T |x) = − Σ_{i=1}^{D} ( ½ ε_i² + ½ log(2π) + Σ_{t=0}^{T} log σ_{t,i} )    (11)
The flexibility of the distribution of the final iterate zT , and its ability to closely fit to the true posterior,
increases with the expressivity of the autoregressive models and the depth of the chain. See figure 2
for an illustration.
A numerically stable version, inspired by the LSTM-type update, is where we let the autoregressive
network output [m_t, s_t], two unconstrained real-valued vectors:
[m_t, s_t] ← AutoregressiveNN[t](z_{t−1}, h; θ)    (12)
and compute z_t as:
σ_t = sigmoid(s_t)    (13)
z_t = σ_t ⊙ z_{t−1} + (1 − σ_t) ⊙ m_t    (14)
This version is shown in algorithm 1. Note that this is just a particular version of the update of
eq. (10), so the simple computation of the final log-density of eq. (11) still applies.
We found it beneficial for results to parameterize or initialize the parameters of each
AutoregressiveNN[t] such that its outputs st are, before optimization, sufficiently positive, such as
close to +1 or +2. This leads to an initial behaviour that updates z only slightly with each step of IAF.
Such a parameterization is known as a ’forget gate bias’ in LSTMs, as investigated by Jozefowicz
et al. (2015).
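
The following numpy sketch puts Algorithm 1 together with the numerically stable update of eqs. (12)-(14) and the positive bias initialization discussed above. The encoder and the per-step autoregressive networks are stand-in masked random linear maps chosen only to satisfy the autoregressive structure; they are not the MADE or PixelCNN architectures used in the experiments, and all numbers are illustrative.

```python
# Sketch of an IAF approximate posterior (Algorithm 1, with the sigmoid-gated update).
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 4                                   # latent dimensionality, number of IAF steps

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encoder(x):
    # Stand-in EncoderNN: outputs mu, sigma and the extra context h.
    return 0.1 * x, np.exp(0.1 * x), np.tanh(x)

# One strictly lower-triangular weight matrix per flow step: output i only sees z_{1:i-1}.
mask = np.tril(np.ones((D, D)), k=-1)
W_m = [rng.standard_normal((D, D)) * mask for _ in range(T)]
W_s = [rng.standard_normal((D, D)) * mask for _ in range(T)]
V   = [0.1 * rng.standard_normal((D, D)) for _ in range(T)]   # h enters without constraints

def iaf_posterior_sample(x):
    mu, sigma, h = encoder(x)
    eps = rng.standard_normal(D)
    z = sigma * eps + mu
    logq = -np.sum(np.log(sigma) + 0.5 * eps ** 2 + 0.5 * np.log(2 * np.pi))
    for t in range(T):
        m = W_m[t] @ z + V[t] @ h
        s = W_s[t] @ z + V[t] @ h + 2.0       # +2 bias: the "forget gate bias"-style initialization
        gate = sigmoid(s)
        z = gate * z + (1.0 - gate) * m       # eq. (14)
        logq -= np.sum(np.log(gate))          # determinant contribution of this step
        z = z[::-1]                           # reverse the ordering between steps (volume-preserving)
    return z, logq

z, logq = iaf_posterior_sample(rng.standard_normal(D))
```

With the +2 bias, the gates start close to 1, so each step initially changes z only slightly, matching the behaviour described above.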
Perhaps the simplest special version of IAF is one with a simple step, and a linear autoregressive
model. This transforms a Gaussian variable with diagonal covariance, to one with linear dependencies,
i.e. a Gaussian distribution with full covariance. See appendix A for an explanation.
Autoregressive neural networks form a rich family of nonlinear transformations for IAF. For
non-convolutional models, we use the family of masked autoregressive networks introduced in (Germain
et al., 2015) for the autoregressive neural networks. For CIFAR-10 experiments, which benefit more
from scaling to high-dimensional latent space, we use the family of convolutional autoregressive
autoencoders introduced by (van den Oord et al., 2016b,c).
We found that results improved when reversing the ordering of the variables after each step in the IAF
chain. This is a volume-preserving transformation, so the simple form of eq. (11) remains unchanged.

5 Related work
Inverse autoregressive flow (IAF) is a member of the family of normalizing flows, first discussed
in (Rezende and Mohamed, 2015) in the context of stochastic variational inference. In (Rezende and
Mohamed, 2015) two specific types of flows are introduced: planar flows and radial flows. These
flows are shown to be effective for problems with relatively low-dimensional latent space (at most a few
hundred dimensions). It is not clear, however, how to scale such flows to much higher-dimensional
latent spaces, such as latent spaces of generative models of larger images, and how planar and radial
flows can leverage the topology of latent space, as is possible with IAF. Volume-conserving neural
architectures were first presented in (Deco and Brauer, 1995), as a form of nonlinear independent
component analysis.
Another type of normalizing flow, introduced by (Dinh et al., 2014) (NICE), uses similar transforma-
tions as IAF. In contrast with IAF, this type of transformation updates only half of the latent variables
z_{1:D/2} per step, adding a vector f(z_{D/2+1:D}) which is a neural-network-based function of the
remaining latent variables z_{D/2+1:D}. Such large blocks have the advantage of a computationally cheap
inverse transformation, and the disadvantage of typically requiring longer chains. (Rezende and
Mohamed, 2015) found this type of transformation to be generally less powerful than other types of
normalizing flow in experiments with a low-dimensional latent space. Concurrently
to our work, NICE was extended to high-dimensional spaces in (Dinh et al., 2016) (Real NVP). An
empirical comparison would be an interesting subject of future research.
A potentially powerful transformation is the Hamiltonian flow used in Hamiltonian Variational
Inference (Salimans et al., 2014). Here, a transformation is generated by simulating the flow
of a Hamiltonian system consisting of the latent variables z, and a set of auxiliary momentum
variables. This type of transformation has the additional benefit that it is guided by the exact
posterior distribution, and that it leaves this distribution invariant for small step sizes. Such a
transformation could thus take us arbitrarily close to the exact posterior distribution if we apply
it a sufficient number of times. In practice, however, Hamiltonian Variational Inference is very
demanding computationally. Also, it requires an auxiliary variational bound to account for the
auxiliary variables, which can impede progress if the bound is not sufficiently tight.
An alternative method for increasing the flexibility of variational inference is the introduction
of auxiliary latent variables (Salimans et al., 2014; Ranganath et al., 2015; Tran et al., 2015) and
corresponding auxiliary inference models. Latent variable models with multiple layers of stochastic
variables, such as the one used in our experiments, are often equivalent to such auxiliary-variable
methods. We combine deep latent variable models with IAF in our experiments, benefiting from both
techniques.

6 Experiments
We empirically evaluate IAF by applying the idea to improve variational autoencoders. Please see
appendix C for details on the architectures of the generative model and inference models. Code for
reproducing key empirical results is available online3 .

6.1 MNIST

In this experiment we follow a similar implementation of the convolutional VAE as in (Salimans
et al., 2014) with ResNet (He et al., 2015) blocks. A single layer of Gaussian stochastic units
of dimension 32 is used. To investigate how the expressiveness of the approximate posterior affects
performance, we report results of different IAF posteriors with varying degrees of expressiveness.
We use a 2-layer MADE (Germain et al., 2015) to implement one IAF transformation, and we stack
multiple IAF transformations with ordering reversed between every other transformation.
Results: Table 1 shows results on MNIST for these types of posteriors. Results indicate that as the
approximate posterior becomes more expressive, generative modeling performance becomes better.
Also worth noting is that an expressive approximate posterior also tightens the variational lower bound
as expected, making the gap between the variational lower bound and the marginal likelihood smaller. By
making IAF deep and wide enough, we can achieve the best published log-likelihood on dynamically
³ https://github.com/openai/iaf

Table 1: Generative modeling results on the dynamically sampled binarized MNIST version used
in previous publications (Burda et al., 2015). Shown are averages; the number between brackets
are standard deviations across 5 optimization runs. The right column shows an importance sampled
estimate of the marginal likelihood for each model with 128 samples. Best previous results are repro-
duced in the first segment: [1]: (Salimans et al., 2014) [2]: (Burda et al., 2015) [3]: (Kaae Sønderby
et al., 2016) [4]: (Tran et al., 2015)
Model VLB log p(x) ≈

Convolutional VAE + HVI [1] -83.49 -81.94


DLGM 2hl + IWAE [2] -82.90
LVAE [3] -81.74
DRAW + VGP [4] -79.88

Diagonal covariance -84.08 (± 0.10) -81.08 (± 0.08)


IAF (Depth = 2, Width = 320) -82.02 (± 0.08) -79.77 (± 0.06)
IAF (Depth = 2, Width = 1920) -81.17 (± 0.08) -79.30 (± 0.08)
IAF (Depth = 4, Width = 1920) -80.93 (± 0.09) -79.17 (± 0.08)
IAF (Depth = 8, Width = 1920) -80.80 (± 0.07) -79.10 (± 0.07)

[Figure 3 diagram: deep bidirectional VAE, showing the generative model, the inference model, and
bidirectional inference built from Bottom-Up and Top-Down ResNet blocks (ELU nonlinearities,
identity and convolutional connections), with layer prior z ~ p(z_i|z_{>i}) and layer posterior
z ~ q(z_i|z_{>i}, x).]

Figure 3: Overview of our ResNet VAE with bidirectional inference. The posterior of each layer is
parameterized by its own IAF.

binarized MNIST: -79.10. On Hugo Larochelle’s statically binarized MNIST, our VAE with deep
IAF achieves a log-likelihood of -79.88, which is slightly worse than the best reported result, -79.2,
using the PixelCNN (van den Oord et al., 2016b).

6.2 CIFAR-10

We also evaluated IAF on the CIFAR-10 dataset of natural images. Natural images contain a much
greater variety of patterns and structure than MNIST images; in order to capture this structure well,
we experiment with a novel architecture, ResNet VAE, with many layers of stochastic variables, and
based on residual convolutional networks (ResNets) (He et al., 2015, 2016). Please see our appendix
for details.

Log-likelihood. See table 2 for a comparison to previously reported results. Our architecture with
IAF achieves 3.11 bits per dimension, which is better than other published latent-variable models,
and almost on par with the best reported result using the PixelCNN. See the appendix for more
experimental results. We suspect that the results can be further improved with more steps of flow,
which we leave to future work.

Synthesis speed. Sampling took about 0.05 seconds/image with the ResNet VAE model, versus
52.0 seconds/image with the PixelCNN model, on a NVIDIA Titan X GPU. We sampled from the
PixelCNN naïvely by sequentially generating a pixel at a time, using the full generative model at each
iteration. With custom code that only evaluates the relevant part of the network, PixelCNN sampling
could be sped up significantly; however the speedup will be limited on parallel hardware due to the

Table 2: Our results with ResNet VAEs on CIFAR-10 images, compared to earlier results, in average
number of bits per data dimension on the test set. The number for convolutional DRAW is an upper
bound, while the ResNet VAE log-likelihood was estimated using importance sampling.
Method bits/dim ≤

Results with tractable likelihood models:


Uniform distribution (van den Oord et al., 2016b) 8.00
Multivariate Gaussian (van den Oord et al., 2016b) 4.70
NICE (Dinh et al., 2014) 4.48
Deep GMMs (van den Oord and Schrauwen, 2014) 4.00
Real NVP (Dinh et al., 2016) 3.49
PixelRNN (van den Oord et al., 2016b) 3.00
Gated PixelCNN (van den Oord et al., 2016c) 3.03

Results with variationally trained latent-variable models:


Deep Diffusion (Sohl-Dickstein et al., 2015) 5.40
Convolutional DRAW (Gregor et al., 2016) 3.58
ResNet VAE with IAF (Ours) 3.11

sequential nature of the sampling operation. Efficient sampling from the ResNet VAE is a parallel
computation that does not require custom code.

7 Conclusion
We presented inverse autoregressive flow (IAF), a new type of normalizing flow that scales well to
high-dimensional latent space. In experiments we demonstrated that autoregressive flow leads to
significant performance gains compared to similar models with factorized Gaussian approximate
posteriors, and we report close to state-of-the-art log-likelihood results on CIFAR-10, for a model
that allows much faster sampling.

Acknowledgements
We thank Jascha Sohl-Dickstein, Karol Gregor, and many others at Google Deepmind for interesting
discussions. We thank Harri Valpola for referring us to Gustavo Deco’s relevant pioneering work on
a form of inverse autoregressive flow applied to nonlinear independent component analysis.

References
Blei, D. M., Jordan, M. I., and Paisley, J. W. (2012). Variational Bayesian inference with Stochastic Search. In
Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1367–1374.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences
from a continuous space. arXiv preprint arXiv:1511.06349.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint
arXiv:1509.00519.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by Exponential
Linear Units (ELUs). arXiv preprint arXiv:1511.07289.

Deco, G. and Brauer, W. (1995). Higher order statistical decorrelation without information loss. Advances in
Neural Information Processing Systems, pages 247–254.

Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv
preprint arXiv:1410.8516.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint
arXiv:1605.08803.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: Masked autoencoder for distribution
estimation. arXiv preprint arXiv:1502.03509.

Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. (2016). Towards conceptual compression.
arXiv preprint arXiv:1604.08772.

Gregor, K., Mnih, A., and Wierstra, D. (2013). Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint
arXiv:1512.03385.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. arXiv preprint
arXiv:1603.05027.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of
Machine Learning Research, 14(1):1303–1347.

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network archi-
tectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages
2342–2350.

Kaae Sønderby, C., Raiko, T., Maaløe, L., Kaae Sønderby, S., and Winther, O. (2016). How to train deep
variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. Proceedings of the 2nd International
Conference on Learning Representations.

Ranganath, R., Tran, D., and Blei, D. M. (2015). Hierarchical variational models. arXiv preprint
arXiv:1511.02386.

Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of The
32nd International Conference on Machine Learning, pages 1530–1538.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference
in deep generative models. In Proceedings of the 31st International Conference on Machine Learning
(ICML-14), pages 1278–1286.

Salimans, T. (2016). A structured variational auto-encoder for learning deep hierarchies of sparse features. arXiv
preprint arXiv:1602.08734.

Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training
of deep neural networks. arXiv preprint arXiv:1602.07868.

Salimans, T., Kingma, D. P., and Welling, M. (2014). Markov chain Monte Carlo and variational inference:
Bridging the gap. arXiv preprint arXiv:1410.6460.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning
using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.

Tran, D., Ranganath, R., and Blei, D. M. (2015). Variational Gaussian process. arXiv preprint arXiv:1511.06499.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.,
and Kavukcuoglu, K. (2016a). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016b). Pixel recurrent neural networks. arXiv
preprint arXiv:1601.06759.

van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. (2016c).
Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328.

van den Oord, A. and Schrauwen, B. (2014). Factoring variations in natural images with deep Gaussian mixture
models. In Advances in Neural Information Processing Systems, pages 3518–3526.

Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.

Zeiler, M. D., Krishnan, D., Taylor, G. W., and Fergus, R. (2010). Deconvolutional networks. In Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE.

A Linear IAF
Perhaps the simplest special case of IAF is the transformation of a Gaussian variable with diagonal
covariance to one with linear dependencies.
Any full-covariance multivariate Gaussian distribution with mean m and covariance matrix C can be
expressed as an autoregressive model with
y_i = µ_i(y_{1:i−1}) + σ_i(y_{1:i−1}) · ε_i, with
µ_i(y_{1:i−1}) = m_i + C[i, 1:i−1] C[1:i−1, 1:i−1]^{−1} (y_{1:i−1} − m_{1:i−1}), and
σ_i(y_{1:i−1}) = C[i, i] − C[i, 1:i−1] C[1:i−1, 1:i−1]^{−1} C[1:i−1, i].

Inverting the autoregressive model then gives ε = (y − µ(y))/σ(y) = L(y − m), with L the inverse
Cholesky factorization of the covariance matrix C.
By making L(x) and m(x) part of our variational encoder we can then use this inverse flow to
form a posterior approximation. In experiments, we do this by starting out with a fully-factorized
Gaussian approximate posterior as in e.g. (Kingma and Welling, 2013): y = µ(x) + σ(x) ⊙ ε where
ε ∼ N (0, I), and where µ(x) and σ(x) are vectors produced by our inference network. We then let
the inference network produce an extra output L(x), the lower triangular inverse Cholesky matrix,
which we then use to update the approximation. With this setup, the problem is overparameterized,
so we define the mean vector m above to be the zero vector, and we restrict L to have ones on the
diagonal. One step of linear IAF then turns the fully-factorized distribution of y into an arbitrary
multivariate Gaussian distribution: z = L(x)·y. This results in a simple and computationally efficient
posterior approximation, with scalar density function given by q(z|x) = q(y|x). By optimizing the
variational lower bound we then fit this conditional multivariate Gaussian approximation to the true
posterior distribution.
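
A minimal numpy sketch of this special case is given below; the inference-network outputs µ(x), σ(x) and L(x) are replaced by toy values, and the check confirms that the unit-diagonal L is volume-preserving, so the density value is unchanged, q(z|x) = q(y|x).

```python
# One step of linear IAF: a unit-diagonal lower triangular L turns a factorized
# Gaussian sample y into a full-covariance Gaussian sample z = L y.
import numpy as np

rng = np.random.default_rng(0)
D = 4
mu, sigma = rng.standard_normal(D), np.exp(0.1 * rng.standard_normal(D))   # toy mu(x), sigma(x)
L = np.eye(D) + np.tril(0.3 * rng.standard_normal((D, D)), k=-1)           # toy L(x), ones on the diagonal

eps = rng.standard_normal(D)
y = mu + sigma * eps                   # fully factorized Gaussian posterior sample
z = L @ y                              # linear IAF step

# Volume preservation: det(L) = 1, so log q(z|x) = log q(y|x).
log_q = np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * eps ** 2)
assert np.isclose(np.linalg.slogdet(L)[1], 0.0)

# Implied covariance of q(z|x): L diag(sigma^2) L^T, a general multivariate Gaussian.
cov_z = L @ np.diag(sigma ** 2) @ L.T
```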

B MNIST
In the MNIST experiment we follow a similar implementation of the convolutional VAE as in (Salimans
et al., 2014) with ResNet (He et al., 2015) blocks. A single layer of Gaussian stochastic units of
dimension 32 is used. The inference network has three 2-strided ResNet blocks with 3x3 filters
and [16, 32, 32] feature maps. Between every other strided convolution, there is another ResNet
block with stride 1 and the same number of feature maps. There is one more fully-connected layer
with 450 hidden units after the convolutional layers. The generation network has a symmetric structure
with strided convolution replaced by transposed convolution (Zeiler et al., 2010). When there is a
change in dimensionality, we use strided convolution or transposed convolution to replace the identity
connection in ResNet blocks. Exponential Linear Units (Clevert et al., 2015) are used as activation
functions. The whole network is parameterized according to weight normalization (Salimans and
Kingma, 2016) and data-dependent initialization is used.

C ResNet VAE
For details of the ResNet VAE architecture used for CIFAR-10, please see figure 3 and our code. The
main benefit of this architecture is that it forms a flexible autoregressive prior over latent space, while
still being straightforward to sample from.
For CIFAR-10, we used a novel neural variational autoencoder (VAE) architecture with ResNet (He
et al., 2015, 2016) units and multiple stochastic layers. Our architecture consists of L stacked blocks,
where each block (l = 1..L) is a combination of a bottom-up residual unit for inference, producing
a series of bottom-up activations h_l^(q), and a top-down residual unit used for both inference and
generation, producing a series of top-down activations h_l^(p).
The hidden layer of each residual function in the generative model contains a combination of the
usual deterministic hidden units and a relatively small number of stochastic hidden units with a
heteroscedastic diagonal Gaussian distribution p(z_l | h_l^(p)) given the unit's input h_l^(p), followed by a
nonlinearity. We utilize wide (Zagoruyko and Komodakis, 2016) pre-activation residual units (He
et al., 2015) with single-hidden-layer residual functions.

[Figure 4 diagram: generative ResNet and detail of one layer. Each generative ResNet layer applies a
nonlinearity and 3x3 convolution and slices its features into deterministic features and the parameters
of the prior p(z|.); a latent sample z ~ p(z|.) is drawn, the features are concatenated, and another
nonlinearity and 3x3 convolution produce the residual added to the identity connection.]
Figure 4: Generative ResNet and detail of layer. This is the generative component of our ResNet
VAE.

See figure 5 for an illustration of the generative ResNet. Assuming L layers of latent variables, the
generative model's density function is factorized as p(x, z_1, z_2, z_3, ...) = p(x, z_{1:L}) = p(x|z_{1:L}) p(z_{1:L}).
The second part of this density, the prior over the latent variables, is autoregressive:
p(z_{1:L}) = p(z_L) Π_{l=1}^{L−1} p(z_l | z_{l+1:L}). This autoregressive nature of the prior increases the flexibility of the true
posterior, leading to improved empirical results. This improved performance is easily explained: as
the true posterior is largely a function of this prior, a flexible prior improves the flexibility of the true
posterior, making it easier for the VAE to match the approximate and true posteriors, leading to a
tighter bound, without sacrificing the flexibility of the generative model itself.

C.1 Bottom-Up versus Bidirectional inference

Figure 5 illustrates the difference between a bidirectional inference network, whose topological
ordering over the latent variables equals that of the generative model, and a bottom-up inference
network, whose topological ordering is reversed.
The top left shows a generative model with three levels of latent variables, with topologi-
cal ordering z1 → z2 → z3 → x, and corresponding joint distribution that factorizes as
p(x, z1 , z2 , z3 ) = p(x|z1 , z2 , z3 )p(z1 |z2 , z3 )p(z2 |z3 )p(z3 ). The top middle shows a corresponding
inference model with reversed topological ordering, and corresponding factorization q(z1 , z2 , z3 |x) =
q(z1 |x)q(z2 |z1 , x)q(z3 |z2 , z1 , x). We call this a bottom-up inference model. Right: the resulting
variational autoencoder (VAE). In the VAE, the log-densities of the inference model and generative
model, log p(x, z1 , z2 , z3 ) − log q(z1 , z2 , z3 |x) respectively, are evaluated under a sample from
the inference model, to produce an estimate of the variational bound. Computation of the density
of the samples (x, z1 , z2 , z3 ) under the generative model, requires computation of the conditional
distributions of the generative model, which requires bidirectional computation. Hence, evaluation of
the model requires both bottom-up inference (for sampling the z's, and evaluating the posterior density)
and top-down generation.
In case of bidirectional inference (see also (Salimans, 2016; Kaae Sønderby et al., 2016)), we first
perform a fully deterministic bottom-up pass, before sampling from the posterior in top-down order
in the topological ordering of the generative models.

[Figure 5a diagram: generative model, inference model, and bottom-up inference, built from Bottom-Up
and Top-Down ResNet blocks (ELU nonlinearities, identity and convolutional connections), with layer
prior z ~ p(z_i|z_{>i}) and layer posterior z ~ q(z_i|z_{<i}, x).]
(a) Schematic overview of a ResNet VAE with bottom-up inference.

[Figure 5b diagram: generative model, inference model, and bidirectional inference, with the same
building blocks and layer prior z ~ p(z_i|z_{>i}), but layer posterior z ~ q(z_i|z_{>i}, x).]
(b) Schematic overview of a ResNet VAE with bidirectional inference.

Figure 5: Schematic overview of topological orderings of bottom-up (a) versus bidirectional (b)
inference networks, and their corresponding variational autoencoders (VAEs). See section C.1

C.2 Inference with stochastic ResNet

Like our generative ResNet, both our bottom-up and bidirectional inference models are implemented
through ResNet blocks. See figure 6. As we explained, in the case of bottom-up inference,
latent variables are sampled in bottom-up order, i.e. the reverse of the topological ordering of the
generative model. The residual functions in the bottom-up inference network compute a conditional
approximate posterior, q(z_l | h_{l+1}^(p)), conditioned on the bottom-up residual unit's input. The sample
from this distribution is then, after application of the nonlinearity, treated as part of the hidden layer
of the bottom-up residual function and thus used upstream.
In the bidirectional inference case, the approximate posterior for each layer is conditioned on both
the bottom-up input and top-down input: q(z_l | h_l^(q), h_l^(p)). The sample from this distribution is, again
after application of the nonlinearity, treated as part of the hidden layer of the top-down residual
function and thus used downstream.

C.3 Approximate posterior

The approximate posteriors q(z_l|.) are defined either through a diagonal Gaussian, or through an IAF
posterior. In the case of IAF, the context c is provided by either h_l^(q), or h_l^(q) and h_l^(p), depending on
the inference direction.
We use either diagonal Gaussian posteriors, or IAF with a single step of mask-based PixelCNN
(van den Oord et al., 2016b) with zero, one or two hidden layers with ELU nonlinearities
(Clevert et al., 2015). Note that IAF with zero hidden layers corresponds to a linear
transformation, i.e. a Gaussian with off-diagonal covariance.
We investigate the importance of a full IAF transformation with a learned (dynamic) σ_t(.) rescaling
term (in eq. (10)), versus a fixed σ_t(.) = 1 term. See table 3 for a comparison of the resulting test-set
bpp (bits per pixel) performance on the CIFAR-10 dataset. We found the difference in performance
to be almost negligible.

[Figure 6a diagram: computational flow of one ResNet VAE layer with bottom-up inference. The
inference model layer concatenates features, applies a nonlinearity and 3x3 convolution, and produces
the parameters of the posterior q; a sample z ~ q(z|.) and the deterministic features feed the generative
model layer, which produces the parameters of the prior p; the layer objective is log p(z) − log q(z).]
(a) Computational flow schematic of ResNet VAE with bottom-up inference.

[Figure 6b diagram: computational flow of one ResNet VAE layer with bidirectional inference. Posterior
parameters are combined (e.g. summed) from the bottom-up inference block and the top-down generative
block; a sample z ~ q(z|.) is used downstream; the layer objective is log p(z) − log q(z).]
(b) Computational flow schematic of ResNet VAE with bidirectional inference.

Figure 6: Detail of a single layer of the ResNet VAE, with the bottom-up inference (top) and
bidirectional inference (bottom).

C.4 Bottom layer

The first layer of the encoder is a convolutional layer with 2 × 2 spatial subsampling; the last layer of
the decoder has a matching convolutional layer with 2 × 2 spatial upsampling. The ResNet layers
could perform further up- and downsampling, but we found that this did not improve empirical
results.

Table 3: CIFAR-10 test-set bpp (bits per pixel), when training IAF with location-only perturbation
versus full (location+scale) perturbation.
ResNet depth Inference direction Posterior Location-only Location+scale

4    Bottom-up       IAF with 0 hidden layers    3.67    3.68
4    Bottom-up       IAF with 1 hidden layer     3.61    3.61
4    Bidirectional   IAF with 0 hidden layers    3.66    3.67
4    Bidirectional   IAF with 1 hidden layer     3.58    3.56
8    Bottom-up       IAF with 0 hidden layers    3.54    3.55
8    Bottom-up       IAF with 1 hidden layer     3.48    3.49
8    Bidirectional   IAF with 0 hidden layers    3.51    3.52
8    Bidirectional   IAF with 1 hidden layer     3.45    3.42

[Figure 7 plots: stacked area plots of # nats (0 to 12000) versus # epochs (log scale, 10^0 to 10^2),
for λ = 0, λ = 0.125, λ = 0.5, and λ = 2.]

Figure 7: Shown are stack plots of the number of nats required to encode the CIFAR-10 set images,
per stochastic layer of the 24-layer network, as a function of the number of training epochs, for
different choices of minimum information constraint λ (see section C.8). Enabling the constraint
(λ > 0) results in avoidance of undesirable stable equilibria, and fuller use of the stochastic layers
by the model. The bottom-most (white) area corresponds to the bottom-most (reconstruction) layer,
the second area from the bottom denotes the first stochastic layer, the third area denotes the second
stochastic layer, etc.

C.5 Discretized Logistic Likelihood

The first layer of the encoder, and the last layer of the decoder, consist of convolutions that project
from/to input space. The pixel data is scaled to the range [0, 1], and the data likelihood of pixel values
in the generative model is the probability mass of the pixel value under the logistic distribution. Noting
that the CDF of the standard logistic distribution is simply the sigmoid function, we compute
the probability mass per input pixel using p(x_i |µ_i, s_i) = CDF(x_i + 1/256 |µ_i, s_i) − CDF(x_i |µ_i, s_i),
where the locations µ_i are output by the decoder, and the log-scales log s_i are a learned scalar parameter
per input channel.
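
A numpy sketch of this likelihood is given below; the toy image, the stand-in decoder output, the exact pixel scaling convention, and the absence of special handling for the edge bins are assumptions made for illustration only.

```python
# Discretized logistic likelihood: per-pixel mass CDF(x + 1/256 | mu, s) - CDF(x | mu, s),
# with the logistic CDF given by the sigmoid function.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discretized_logistic_logpmf(x, mu, log_s):
    # x, mu: pixel values scaled to [0, 1]; log_s: log-scale, one scalar per channel.
    s = np.exp(log_s)
    cdf_plus = sigmoid((x + 1.0 / 256.0 - mu) / s)
    cdf = sigmoid((x - mu) / s)
    return np.log(np.maximum(cdf_plus - cdf, 1e-12))   # clip for numerical stability

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(3, 8, 8)) / 255.0                 # toy image, scaled to [0, 1]
mu = np.clip(x + 0.01 * rng.standard_normal(x.shape), 0.0, 1.0)  # stand-in decoder output
log_s = np.full((3, 1, 1), -5.0)                                 # one (learned) scalar per channel
log_px = np.sum(discretized_logistic_logpmf(x, mu, log_s))
```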

C.6 Weight initialisation and normalization

We also found that the noise introduced by batch normalization hurts performance; instead we use
the weight normalization (Salimans and Kingma, 2016) method. We initialized the parameters using the
data-dependent technique described in (Salimans and Kingma, 2016).

C.7 Nonlinearity

We compared ReLU, softplus, and ELU (Clevert et al., 2015) nonlinearities; we found that ELU
resulted in significantly better empirical results, and used the ELU nonlinearity in all reported
experiments, for both the inference model and the generative model.

C.8 Objective with Free Bits

To accelerate optimization and reach better optima, we optimize the bound using a slightly modified
objective with free bits: a constraint on the minimum amount of information per group of latent

Table 4: Our results in average number of bits per data dimension on the test set with ResNet VAEs,
for various choices of posterior, ResNet depth, and IAF depth.
Posterior                                     ResNet depth:   4       8       12

Bottom-up, factorized Gaussians                               3.71    3.55    3.44
Bottom-up, IAF, linear, 1 step                                3.68    3.55    3.41
Bottom-up, IAF, 1 hidden layer, 1 step                        3.61    3.49    3.38
Bidirectional, factorized Gaussians                           3.74    3.60    3.46
Bidirectional, IAF, linear, 1 step                            3.67    3.52    3.40
Bidirectional, IAF, 1 hidden layer, 1 step                    3.56    3.42    3.28
Bidirectional, IAF, 2 hidden layers, 1 step                   3.54    3.39    3.27
Bidirectional, IAF, 1 hidden layer, 2 steps                   3.53    3.36    3.26

variables. Consistent with findings in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016), we
found that stochastic optimization with the unmodified lower bound objective often gets stuck in an
undesirable stable equilibrium. At the start of training, the likelihood term log p(x|z) is relatively
weak, such that an initially attractive state is where q(z|x) ≈ p(z). In this state, encoder gradients
have a relatively low signal-to-noise ratio, resulting in a stable equilibrium from which it is difficult
to escape. The solution proposed in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016) is to use
an optimization schedule where the weight of the latent cost DKL (q(z|x)||p(z)) is slowly annealed
from 0 to 1 over many epochs.
We propose a different solution that does not depend on an annealing schedule, but uses a modified
objective function that is constant throughout training instead. We divide the latent dimensions into
the K groups/subsets within which parameters are shared (e.g. the latent feature maps, or individual
dimensions if no parameters are shared across dimensions). We then use the following objective,
which ensures that using less than λ nats of information per subset j (on average per minibatch M)
is not advantageous:
L̃_λ = E_{x∼M} [ E_{q(z|x)} [log p(x|z)] ] − Σ_{j=1}^{K} maximum(λ, E_{x∼M} [DKL(q(z_j|x)||p(z_j))])    (15)

Since increasing the latent information is generally advantageous for the first (unaffected)
term of the objective (often called the negative reconstruction error), this results in
E_{x∼M}[DKL(q(z_j|x)||p(z_j))] ≥ λ for all j, in practice.
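
A numpy sketch of eq. (15) is given below; the per-datapoint reconstruction terms and the per-group KL values are passed in as arrays, and the toy minibatch numbers are purely illustrative.

```python
# Free-bits objective of eq. (15): per-group KL terms are averaged over the minibatch
# and clipped from below at lambda nats before being summed.
import numpy as np

def free_bits_objective(log_px_given_z, kl_per_group, lam):
    # log_px_given_z: shape (batch,), one reconstruction term per datapoint.
    # kl_per_group:   shape (batch, K), KL of q(z_j|x) from p(z_j) for each of K groups.
    recon = np.mean(log_px_given_z)                                # E_{x~M} E_q [log p(x|z)]
    kl_clipped = np.maximum(lam, np.mean(kl_per_group, axis=0))    # maximum(lambda, E_{x~M}[KL_j])
    return recon - np.sum(kl_clipped)

rng = np.random.default_rng(0)
log_px = -80.0 + rng.standard_normal(16)                # toy minibatch of 16 datapoints
kl = 0.05 * np.abs(rng.standard_normal((16, 32)))       # 32 latent groups, mostly below lambda
objective = free_bits_objective(log_px, kl, lam=0.25)
```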
We experimented with λ ∈ [0, 0.125, 0.25, 0.5, 1, 2, 4, 8] and found that values in the range λ ∈
[0.125, 0.25, 0.5, 1, 2] resulted in more than 0.1 nats improvement in bits/pixel on the CIFAR-10
benchmark.

C.9 IAF architectural comparison

We performed an empirical evaluation on CIFAR-10, comparing various choices of ResNet VAE
depth, approximate posterior, and inference direction. See
table 4 for results. Our best CIFAR-10 result, 3.11 bits/dim in table 2 was produced by a ResNet VAE
with depth 20, bidirectional inference, nonlinear IAF with 2 hidden layers and 1 step. Please see our
code for further details.

D Equivalence with Autoregressive Priors


Earlier work on improving variational auto-encoders has often focused on improving the prior p(z)
of the latent variables in our generative model. For example, (Gregor et al., 2013) use a variational
auto-encoder where both the prior and inference network have recursion. It is therefore worth noting
that our method of improving the fully-factorized posterior approximation with inverse autoregressive
flow, in combination with a factorized prior p(z), is equivalent to estimating a model where the prior
p(z) is autoregressive and our posterior approximation is factorized. This result follows directly from
the analysis of section 3: we can consider the latent variables y to be our target for inference, in

which case our prior is autoregressive. Equivalently, we can consider the whitened representation z to
be the variables of interest, in which case our prior is fully-factorized and our posterior approximation
is formed through the inverse autoregressive flow that whitens the data (equation 7).
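
This equivalence is easy to verify numerically: in the sketch below (a toy masked linear model stands in for the Gaussian autoregressive prior), the autoregressive log-density of y equals the factorized N(0, I) log-density of the whitened ε plus the log-determinant of eq. (8).

```python
# The density of y under a Gaussian autoregressive prior equals the density of
# eps = (y - mu(y)) / sigma(y) under a factorized N(0, I) prior plus the log-determinant
# of the whitening transformation (eqs. (7)-(8)).
import numpy as np

rng = np.random.default_rng(0)
D = 6
mask = np.tril(np.ones((D, D)), k=-1)          # strictly lower triangular: autoregressive
Wm = rng.standard_normal((D, D)) * mask
Ws = 0.1 * rng.standard_normal((D, D)) * mask

def mu_sigma(y):
    return Wm @ y, np.exp(Ws @ y)

def log_autoregressive_prior(y):
    # log p(y) = sum_i log N(y_i; mu_i(y_{1:i-1}), sigma_i(y_{1:i-1})^2)
    mu, sigma = mu_sigma(y)
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2)

def log_whitened_view(y):
    # Equivalent view: factorized prior on the whitened variables plus the change of variables.
    mu, sigma = mu_sigma(y)
    eps = (y - mu) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * eps ** 2) - np.sum(np.log(sigma))

y = rng.standard_normal(D)
assert np.isclose(log_autoregressive_prior(y), log_whitened_view(y))
```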
