Normalizing Flows for Probabilistic Modeling and Inference
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan
Abstract
Normalizing flows provide a general mechanism for defining expressive probability distribu-
tions, only requiring the specification of a (usually simple) base distribution and a series of
bijective transformations. There has been much recent work on normalizing flows, ranging
from improving their expressive power to expanding their application. We believe the field
has now matured and is in need of a unified perspective. In this review, we attempt to
provide such a perspective by describing flows through the lens of probabilistic modeling
and inference. We place special emphasis on the fundamental principles of flow design,
and discuss foundational topics such as expressive power and computational trade-offs. We
also broaden the conceptual framing of flows by relating them to more general probabil-
ity transformations. Lastly, we summarize the use of flows for tasks such as generative
modeling, approximate inference, and supervised learning.
1. Introduction
The search for well-specified probabilistic models—models that correctly describe the pro-
cesses that produce data—is one of the enduring ideals of the statistical sciences. Yet, in
only the simplest of settings are we able to achieve this goal. A central need in all of statis-
tics and machine learning is then to develop the tools and theories that allow ever-richer
probabilistic descriptions to be made, and consequently, that make it possible to develop
better-specified models.
This paper reviews one tool we have to address this need: building probability distributions
as normalizing flows. Normalizing flows operate by pushing a simple density through a series
of transformations to produce a richer, potentially more multi-modal distribution—like a
fluid flowing through a set of tubes. As we will see, repeated application of even simple
transformations to a unimodal initial density leads to models of exquisite complexity. This
flexibility means that flows are ripe for use in the key statistical tasks of modeling, inference,
and simulation.
Normalizing flows are an increasingly active area of machine learning research. Yet there
is an absence of a unifying lens with which to understand the latest advancements and
their relationships to previous work. The thesis of Papamakarios (2019) and the survey
by Kobyzev et al. (2020) have made steps in establishing this broader understanding. Our
review complements these existing papers. In particular, our treatment of flows is more
comprehensive than Papamakarios (2019)’s but shares some organizing principles. Kobyzev
et al. (2020)’s article is commendable in its coverage and synthesis of the literature, dis-
cussing both finite and infinitesimal flows (as we do) and curating the latest results in
density estimation. Our review is more tutorial in nature and provides in-depth discussion
of several areas that Kobyzev et al. (2020) label as open problems (such as extensions to
discrete variables and Riemannian manifolds).
Our exploration of normalizing flows attempts to illuminate enduring principles that will
guide their construction and application for the foreseeable future. Specifically, our review
begins by establishing the formal and conceptual structure of normalizing flows in Section 2.
Flow construction is then discussed in detail, both for finite (Section 3) and infinitesimal
(Section 4) variants. A more general perspective is then presented in Section 5, which
in turn allows for extensions to structured domains and geometries. Lastly, we discuss
commonly encountered applications in Section 6.
Notation We use bold symbols to indicate vectors (lowercase) and matrices (uppercase),
otherwise variables are scalars. We indicate probabilities by Pr(·) and probability densities
by p(·). We will also use p(·) to refer to the distribution with that density function. We often
add a subscript to probability densities—e.g. px (x)—to emphasize which random variable
they refer to. The notation p(x; θ) represents the distribution of random variables x with
distributional parameters θ. The symbol ∇θ represents the gradient operator that collects
all partial derivatives of a function with respect to parameters in the set θ, that is ∇_θ f = [∂f/∂θ_1, . . . , ∂f/∂θ_K] for K-dimensional parameters. The Jacobian of a function f : R^D → R^D is
denoted by Jf (·). Finally, we represent the sampling or simulation of variates x from a
distribution p(x) using the notation x ∼ p(x).
2. Normalizing Flows
Let x be a D-dimensional real vector, and suppose we would like to define a joint distribution over x. The main idea of flow-based modeling is to express x as a transformation T of a real vector u sampled from p_u(u):

x = T(u)  where  u ∼ p_u(u). (1)
We refer to pu (u) as the base distribution of the flow-based model.1 The transformation T
and the base distribution pu (u) can have parameters of their own (denote them as φ and ψ
respectively); this induces a family of distributions over x parameterized by {φ, ψ}.
The defining property of flow-based models is that the transformation T must be invertible
and both T and T −1 must be differentiable. Such transformations are known as diffeo-
morphisms and require that u be D-dimensional as well (Milnor and Weaver, 1997). Under
these conditions, the density of x is well-defined and can be obtained by a change of variables (Rudin, 2006; Bogachev, 2007):

p_x(x) = p_u(u) |det J_T(u)|^{-1}  where  u = T^{-1}(x). (2)

Equivalently, we can write the density in terms of the Jacobian of T^{-1}:

p_x(x) = p_u(T^{-1}(x)) |det J_{T^{-1}}(x)|. (3)
The Jacobian J_T(u) is the D × D matrix of all partial derivatives of T, whose (i, j)-th entry is given by [J_T(u)]_{ij} = ∂T_i/∂u_j(u). (4)
Figure 1: Example of a 4-step flow transforming samples from a standard-normal base den-
sity to a cross-shaped target density.
In terms of functionality, a flow-based model provides two operations: sampling from the
model via Equation 1, and evaluating the model’s density via Equation 3. These operations
have different computational requirements. Sampling from the model requires the ability to
sample from pu (u) and to compute the forward transformation T . Evaluating the model’s
density requires computing the inverse transformation T −1 and its Jacobian determinant,
and evaluating the density pu (u). The application will dictate which of these operations
need to be implemented and how efficient they need to be. We discuss the computational
trade-offs associated with various implementation choices in Sections 3 and 4.
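As a minimal illustration of these two operations (with an arbitrary standard-normal base and the elementwise map T(u) = u³ + u chosen purely for concreteness), sampling via Equation 1 and density evaluation via Equation 3 might look as follows:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Arbitrary illustrative choices: a standard-normal base distribution and the
# elementwise, strictly increasing transformation T(u) = u^3 + u.
base_dist = stats.norm(0.0, 1.0)

def T(u):
    return u**3 + u

def T_inv(x):
    # Invert u^3 + u = x_i elementwise; the map is monotonic, so a bracketing
    # root-finder is sufficient.
    return np.array([brentq(lambda u, xi=xi: u**3 + u - xi, -1e6, 1e6) for xi in x])

def log_abs_det_jacobian_T_inv(x):
    # For an elementwise map the Jacobian is diagonal with entries dT/du = 3u^2 + 1,
    # so log|det J_{T^{-1}}(x)| = -sum_i log(3 u_i^2 + 1) with u = T^{-1}(x).
    u = T_inv(x)
    return -np.sum(np.log(3.0 * u**2 + 1.0))

def sample(D):
    u = base_dist.rvs(size=D)        # u ~ p_u(u)
    return T(u)                      # x = T(u)        (Equation 1)

def log_density(x):
    u = T_inv(x)                     # u = T^{-1}(x)
    return np.sum(base_dist.logpdf(u)) + log_abs_det_jacobian_T_inv(x)   # (Equation 3)
```

Note how sampling only needs the forward transformation, whereas density evaluation only needs the inverse and its Jacobian determinant.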
Before discussing the particulars of flows, a question of foremost importance is: how ex-
pressive are flow-based models? Can they represent any distribution px (x), even if the
base distribution is restricted to be simple? We show that this universal representation
is possible under reasonable conditions on px (x). Specifically, we will show that for any
pair of well-behaved distributions px (x) (the target) and pu (u) (the base), there exists a
diffeomorphism that can turn pu (u) into px (x). The argument is constructive, and is based
on a similar proof of the existence of non-linear ICA by Hyvärinen and Pajunen (1999); a
more formal treatment is provided by e.g. Bogachev et al. (2005).
Suppose that px (x) > 0 for all x ∈ RD , and assume that all conditional probabilities
Pr(x0i ≤ xi | x<i )—with x0i being the random variable this probability refers to—are differ-
entiable with respect to (xi , x<i ). Using the chain rule of probability, we can decompose
px (x) into a product of conditional densities as follows:
p_x(x) = ∏_{i=1}^D p_x(x_i | x_{<i}). (7)
Since px (x) is non-zero everywhere, px (xi | x<i ) > 0 for all i and x. Next, define the trans-
formation F : x ↦ z ∈ (0, 1)^D whose i-th element is given by the cumulative distribution
function of the i-th conditional:
z_i = F_i(x_i, x_{<i}) = ∫_{−∞}^{x_i} p_x(x′_i | x_{<i}) dx′_i = Pr(x′_i ≤ x_i | x_{<i}). (8)
Since each Fi is differentiable with respect to its inputs, F is differentiable with respect to
x. Moreover, each F_i(·, x_{<i}) : R → (0, 1) is invertible, since its derivative ∂F_i/∂x_i = p_x(x_i | x_{<i}) is positive everywhere. Because z_i doesn't depend on x_j for i < j, we can invert F, with its inverse F^{-1} given element-by-element as follows:

x_i = F_i^{-1}(z_i, x_{<i}),  i = 1, . . . , D, (9)

where each x_i is computed from z_i and the previously recovered elements x_{<i}.
The Jacobian of F is lower triangular since ∂F_i/∂x_j = 0 for i < j. Hence, the Jacobian determinant of F is the product of its diagonal elements:

det J_F(x) = ∏_{i=1}^D ∂F_i/∂x_i = ∏_{i=1}^D p_x(x_i | x_{<i}) = p_x(x). (10)
Since px (x) > 0, the Jacobian determinant is non-zero everywhere. Therefore, the inverse
of JF (x) exists, and is equal to the Jacobian of F −1 , so F is a diffeomorphism. Using a
change of variables, we can calculate the density of z as follows:

p_z(z) = p_x(x) |det J_F(x)|^{-1} = p_x(x) / p_x(x) = 1, (11)
which implies z is distributed uniformly in the open unit cube (0, 1)D .
The above argument shows that a flow-based model can express any distribution px (x)
(satisfying the conditions stated above) even if we restrict the base distribution to be uniform
in (0, 1)D . We can extend this statement to any base distribution pu (u) (satisfying the same
conditions as px (x)) by first transforming u to a uniform z ∈ (0, 1)D as an intermediate
step. In particular, given any pu (u) that satisfies the above conditions, define G to be the
following transformation:
z_i = G_i(u_i, u_{<i}) = ∫_{−∞}^{u_i} p_u(u′_i | u_{<i}) du′_i = Pr(u′_i ≤ u_i | u_{<i}). (12)

Like F, the map G transforms u into a sample that is uniformly distributed in (0, 1)^D. The composition T = F^{-1} ∘ G is then a diffeomorphism that transforms p_u(u) into p_x(x), as required.
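As a concrete one-dimensional illustration of this construction (with an arbitrary choice of a standard-normal base and a Gamma target), the composition F⁻¹ ∘ G below maps base samples to uniforms via the base CDF and then to the target via the inverse target CDF:

```python
import numpy as np
from scipy import stats

base = stats.norm(0.0, 1.0)        # p_u: standard normal (the base)
target = stats.gamma(a=2.0)        # p_x: a Gamma(2, 1) target, chosen arbitrarily

u = base.rvs(size=10_000, random_state=0)
z = base.cdf(u)                    # G: base samples -> Uniform(0, 1)
x = target.ppf(z)                  # F^{-1}: uniforms -> target samples

# x now follows the target distribution; T = F^{-1} o G is the diffeomorphism
# whose existence the argument above establishes (here in one dimension).
print(np.mean(x), target.mean())   # both close to 2.0
```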
Similarly to fitting any probabilistic model, fitting a flow-based model px (x; θ) to a target
distribution p∗x (x) can be done by minimizing some divergence or discrepancy between them.
This minimization is performed with respect to the model’s parameters θ = {φ, ψ}, where
φ are the parameters of T and ψ are the parameters of pu (u). In the following sections,
we discuss a number of divergences for fitting flow-based models, with a particular focus on
the Kullback–Leibler (KL) divergence as it is one of the most popular choices.
The forward KL divergence between the target distribution p∗x (x) and the flow-based model
p_x(x; θ) can be written as follows:

L(θ) = D_KL[ p*_x(x) ‖ p_x(x; θ) ]
     = −E_{p*_x(x)}[ log p_x(x; θ) ] + const
     = −E_{p*_x(x)}[ log p_u(T^{-1}(x; φ); ψ) + log |det J_{T^{-1}}(x; φ)| ] + const. (13)
The forward KL divergence is well-suited for situations in which we have samples from the
target distribution (or the ability to generate them), but we cannot necessarily evaluate
the target density p*_x(x). Assuming we have a set of samples {x_n}_{n=1}^N from p*_x(x), we can estimate the expectation over p*_x(x) by Monte Carlo as follows:

L(θ) ≈ −(1/N) ∑_{n=1}^N [ log p_u(T^{-1}(x_n; φ); ψ) + log |det J_{T^{-1}}(x_n; φ)| ] + const. (14)
Minimizing the above Monte Carlo approximation of the KL divergence is equivalent to fit-
ting the flow-based model to the samples {xn }N
n=1 by maximum likelihood estimation.
The update with respect to ψ may also be done in closed form if pu (u; ψ) admits closed-form
maximum likelihood estimates, as is the case for example with Gaussian distributions.
In order to fit a flow-based model via maximum likelihood, we need to compute T −1 ,
its Jacobian determinant and the density pu (u; ψ)—as well as differentiate through all
three, if using gradient-based optimization. That means we can train a flow model with
maximum likelihood even if we are not able to compute T or sample from pu (u; ψ). Yet
these operations will be needed if we want to sample from the model after it is fitted.
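A minimal sketch of this objective, assuming a hypothetical flow object that exposes T⁻¹, its log absolute Jacobian determinant, and the base log density, might look as follows:

```python
import numpy as np

def forward_kl_loss(flow, x_batch):
    """Monte Carlo estimate of the forward KL (Equation 14), up to a constant.

    Assumed (hypothetical) interface of `flow`:
      flow.T_inv(x)                 -> u = T^{-1}(x; phi)
      flow.log_abs_det_jac_T_inv(x) -> log |det J_{T^{-1}}(x; phi)| per sample
      flow.base_log_prob(u)         -> log p_u(u; psi) per sample
    Minimizing this loss corresponds to maximum likelihood estimation of the flow.
    """
    u = flow.T_inv(x_batch)
    log_prob = flow.base_log_prob(u) + flow.log_abs_det_jac_T_inv(x_batch)
    return -np.mean(log_prob)
```

In a gradient-based setting the same computation would be expressed in an automatic-differentiation framework so that it can be differentiated with respect to φ and ψ.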
Alternatively, we may fit the flow-based model by minimizing the reverse KL divergence,
which can be written as follows:
L(θ) = D_KL[ p_x(x; θ) ‖ p*_x(x) ]
     = E_{p_x(x;θ)}[ log p_x(x; θ) − log p*_x(x) ] (17)
     = E_{p_u(u;ψ)}[ log p_u(u; ψ) − log |det J_T(u; φ)| − log p*_x(T(u; φ)) ].
We made use of a change of variable in order to express the expectation with respect to
u. The reverse KL divergence is suitable when we have the ability to evaluate the target
density p∗x (x) but not necessarily sample from it. In fact, we can minimize L(θ) even if
we can only evaluate p∗x (x) up to a multiplicative normalizing constant C, since in that
case log C will be an additive constant in the above expression for L(θ). We may therefore
assume that p*_x(x) = p̃_x(x)/C, where p̃_x(x) is tractable but C = ∫ p̃_x(x) dx is not, and rewrite the reverse KL divergence as:

L(θ) = E_{p_u(u;ψ)}[ log p_u(u; ψ) − log |det J_T(u; φ)| − log p̃_x(T(u; φ)) ] + const. (18)
In practice, we can minimize L(θ) iteratively with stochastic gradient-based methods. Since
we reparameterized the expectation to be with respect to the base distribution pu (u; ψ), we
can easily obtain an unbiased estimate of the gradient of L(θ) with respect to φ by Monte
Carlo. In particular, let {u_n}_{n=1}^N be a set of samples from p_u(u; ψ); the gradient of L(θ) with respect to φ can be estimated as follows:

∇_φ L(θ) ≈ −(1/N) ∑_{n=1}^N [ ∇_φ log |det J_T(u_n; φ)| + ∇_φ log p̃_x(T(u_n; φ)) ]. (19)
In principle, the gradient of L(θ) with respect to ψ can be estimated similarly, by reparameterizing u as u = T′(u′; ψ), with u′ sampled from a fixed distribution p_{u′}(u′), and then writing the expectation with respect to p_{u′}(u′). However, since we can equivalently absorb the reparameterization T′ into T and replace the base distribution with p_{u′}(u′), we
can assume without loss of generality that the parameters ψ are fixed and only optimize
with respect to φ.
In order to minimize the reverse KL divergence as described above, we need to be able to
sample from the base distribution pu (u; ψ) as well as compute and differentiate through the
transformation T and its Jacobian determinant. That means that we can fit a flow-based
model by minimizing the reverse KL divergence even if we cannot evaluate the base density
or compute the inverse transformation T −1 . However, we will need these operations if we
would like to evaluate the density of the trained model.
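A corresponding sketch for the reverse KL objective of Equation 18, assuming hypothetical methods for sampling the base, evaluating the forward transformation and its log absolute Jacobian determinant, and evaluating the unnormalized target log density, might look as follows:

```python
import numpy as np

def reverse_kl_loss(flow, log_p_tilde, num_samples=128):
    """Monte Carlo estimate of the reverse KL objective (Equation 18), up to a constant.

    Assumed (hypothetical) interface: `flow.sample_base(n)` draws u ~ p_u(u; psi),
    `flow.base_log_prob(u)`, `flow.T(u)` and `flow.log_abs_det_jac_T(u)` give the
    remaining terms; `log_p_tilde(x)` evaluates the unnormalized target log density.
    """
    u = flow.sample_base(num_samples)
    x = flow.T(u)
    return np.mean(flow.base_log_prob(u)
                   - flow.log_abs_det_jac_T(u)
                   - log_p_tilde(x))
```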
The reverse KL divergence is often used for variational inference (Wainwright and Jordan,
2008; Blei et al., 2017), a form of approximate Bayesian inference. In this case, the target is
the posterior, making pex (x) the product between a likelihood function and a prior density.
Examples of work using flows in variational inference are given by Rezende and Mohamed
(2015); van den Berg et al. (2018); Kingma et al. (2016); Tomczak and Welling (2016);
Louizos and Welling (2017). We cover this topic in more detail in Section 6.2.3.
Another application of the reverse KL divergence is in the context of model distillation:
a flow model is trained to replace a target model p∗x (x) whose density can be evaluated
but that is otherwise inconvenient. An example of model distillation with flows is given
by van den Oord et al. (2018). In their case, samples cannot be efficiently drawn from the
target model and so they distill it into a flow that supports fast sampling.
An alternative perspective of a flow-based model is to think of the target p∗x (x) as the base
distribution and the inverse flow as inducing a distribution p∗u (u; φ). Intuitively, p∗u (u; φ) is
the distribution that the training data would follow when passed through T −1 . Since the
target and the base distributions uniquely determine each other given the transformation
between them, p∗u (u; φ) = pu (u; ψ) if and only if p∗x (x) = px (x; θ). Therefore, fitting the
model px (x; θ) to the target p∗x (x) can be equivalently thought of as fitting the induced
distribution p∗u (u; φ) to the base pu (u; ψ).
We may now ask: how does fitting px (x; θ) to the target relate to fitting p∗u (u; φ) to the
base? Using a change of variables (see Section A for details), we can show the following
equality (Papamakarios et al., 2017):

D_KL[ p*_x(x) ‖ p_x(x; θ) ] = D_KL[ p*_u(u; φ) ‖ p_u(u; ψ) ]. (21)
The above equality means that fitting the model to the target using the forward KL diver-
gence (maximum likelihood) is equivalent to fitting the induced distribution p∗u (u; φ) to the
base p_u(u; ψ) under the reverse KL divergence. In Section A, we also show that:

D_KL[ p_x(x; θ) ‖ p*_x(x) ] = D_KL[ p_u(u; ψ) ‖ p*_u(u; φ) ], (22)
which means that fitting the model to the target via the reverse KL divergence is equivalent
to fitting p∗u (u; φ) to the base via the forward KL divergence (maximum likelihood).
Learning the parameters of flow models is not restricted to the use of the KL divergence.
Many alternative measures of difference between distributions are available. These alterna-
tives are often grouped into two general families, the f -divergences that use density ratios
to compare distributions, and the integral probability metrics (IPMs) that use differences
for comparison:
f-divergence:  D_f[ p*_x(x) ‖ p_x(x; θ) ] = E_{p_x(x;θ)}[ f( p*_x(x) / p_x(x; θ) ) ] (23)
IPM:           δ_s[ p*_x(x) ‖ p_x(x; θ) ] = E_{p*_x(x)}[ s(x) ] − E_{p_x(x;θ)}[ s(x) ]. (24)
For the f -divergences, the function f is convex; when this function is f (r) = r log r we
recover the KL divergence. For IPMs, the function s can be chosen from a set of test
statistics, or can be a witness function chosen adversarially.
The same considerations that applied to the KL divergence previously apply here and
inform the choice of divergence: can we simulate from the model px (x; θ), do we know
the true distribution up to a multiplicative constant, do we have access to the transform
or its inverse? When considering these divergences, we see a connection between flow-based models, whose design principles use composition and change of variables, and the more general class of implicit probabilistic models (Diggle and Gratton, 1984; Mohamed and
Lakshminarayanan, 2016). If we choose the generator of a generative adversarial network
as a normalizing flow, we can train the flow parameters using adversarial training (Grover
et al., 2018; Danihelka et al., 2017), with Wasserstein losses (Arjovsky et al., 2017), using
maximum mean discrepancy (Bińkowski et al., 2018), or other approaches.
Optimal transport and the Wasserstein metric (Villani, 2008) can also be formulated in
terms of transformations of measures (‘transport of measures’)—also known as the Monge
problem. In particular, triangular maps (a concept deeply related to autoregressive flows)
can be shown to be a limiting solution to a class of Monge–Kantorovich problems (Carlier
et al., 2010). This class of triangular maps itself has a long history, with Rosenblatt (1952)
studying their properties for transforming multivariate distributions uniformly over the
hypercube. Optimal transport could be a tutorial unto itself and therefore we mostly
sidestep this framework, instead choosing to think in terms of the change of variables.
3. Constructing Flows Part I: Finite Compositions

Having described some high-level properties and uses of flows, we transition into describing,
categorizing, and unifying the various ways to construct a flow. As discussed in Section 2.1,
normalizing flows are composable; that is, we can construct a flow with transformation T
by composing a finite number of simple transformations Tk as follows:
T = TK ◦ · · · ◦ T1 . (25)
The idea is to use simple transformations as building blocks—each having a tractable in-
verse and Jacobian determinant—to define a complex transformation with more expressive
power than any of its constituent components. Importantly, the flow’s forward and inverse
evaluation and Jacobian-determinant computation can be localized to the sub-flows. As
illustrated in Figure 2, assuming z_0 = u and z_K = x, the forward evaluation is:

z_k = T_k(z_{k−1})  for  k = 1, . . . , K, (26)

the inverse evaluation is:

z_{k−1} = T_k^{-1}(z_k)  for  k = K, . . . , 1, (27)

and the log absolute Jacobian determinant decomposes over the sub-flows as:

log |det J_T(z_0)| = ∑_{k=1}^K log |det J_{T_k}(z_{k−1})|. (28)
Increasing the ‘depth’ (i.e. number of composed sub-flows) of the transformation crucially
results in only O(K) growth in the computational complexity—a pleasantly practical cost
to pay for the increased expressivity.
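The bookkeeping implied by Equations 26–28 can be sketched as follows, assuming each sub-flow object exposes hypothetical forward, inverse and log-absolute-determinant methods:

```python
def flow_forward(transforms, u):
    # transforms = [T_1, ..., T_K]; each T_k is assumed to expose
    # T_k.forward(z) and T_k.log_abs_det_jac(z), evaluated at its input.
    z, total_log_det = u, 0.0
    for T_k in transforms:                     # z_k = T_k(z_{k-1})
        total_log_det += T_k.log_abs_det_jac(z)
        z = T_k.forward(z)
    return z, total_log_det                    # x = z_K and log |det J_T(u)|

def flow_inverse(transforms, x):
    z = x
    for T_k in reversed(transforms):           # z_{k-1} = T_k^{-1}(z_k)
        z = T_k.inverse(z)
    return z                                   # u = z_0
```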
In practice we implement either Tk or Tk−1 using a model (such as a neural network) with
parameters φk , which we will denote as fφk . That is, we can take the model fφk to
implement either Tk , in which case it will take in zk−1 and output zk , or Tk−1 , in which
case it will take in zk and output zk−1 . In either case, we must ensure that the model is
invertible and has a tractable Jacobian determinant. In the rest of this section, we will
describe several approaches for constructing fφk so that these requirements are satisfied.
An overview of all the methods discussed in this section is shown in Table 1.
Ensuring that fφk is invertible and explicitly calculating its inverse are not synonymous.
In many implementations, even though the inverse of fφk is guaranteed to exist, it can be
expensive or even intractable to compute exactly. As discussed in Section 2, the forward
transformation T is used when sampling, and the inverse transformation T −1 is used when
evaluating densities. If the inverse of fφk is not efficient, either density evaluation or sam-
pling will be inefficient or even intractable, depending on whether fφk implements Tk or
Tk−1 . Whether fφk should be designed to have an efficient inverse and whether it should be
taken to implement Tk or Tk−1 are decisions that ought to be based on intended usage.
We should also clarify what we mean by ‘tractable Jacobian determinant’. We can always
compute the Jacobian matrix of a differentiable function with D inputs and D outputs,
using D passes of either forward-mode or reverse-mode automatic differentiation. Then, we
can explicitly calculate the determinant of that Jacobian. However, this computation has a time cost of O(D³), which can be intractable for large D. For most applications of flow-
based models, the Jacobian-determinant computation should be at most O(D). Hence, in
the following sections, we will describe functional forms that allow the Jacobian determinant
to be computed in linear time with respect to the input dimensionality.
To simplify notation from here on, we will drop the dependence of the model parameters on k and denote the model by f_φ. Also, we will denote the model's input by z and its output by z′, regardless of whether the model implements T_k or T_k^{-1}.
Autoregressive flows were one of the first classes of flows developed and remain among the
most popular. In Section 2.2 we saw that, under mild conditions, we can transform any
distribution px (x) into a uniform distribution in (0, 1)D using maps with a triangular Jaco-
bian. Autoregressive flows are a direct implementation of this construction, specifying fφ
to have the following form (as described by e.g. Huang et al., 2018; Jaini et al., 2019):

z′_i = τ(z_i; h_i)  where  h_i = c_i(z_{<i}), (29)
where τ is termed the transformer and ci the i-th conditioner. This is illustrated in Fig-
ure 3a. The transformer is a strictly monotonic function of zi (and therefore invertible),
is parameterized by hi , and specifies how the flow acts on zi in order to output z0i . The
conditioner determines the parameters of the transformer, and in turn, can modify the
transformer’s behavior. The conditioner does not need to be a bijection. Its one constraint
is that the i-th conditioner can take as input only the variables with dimension indices less
than i. The parameters φ of f_φ are typically the parameters of the conditioner (not shown
above for notational simplicity), but sometimes the transformer has its own parameters too
(in addition to hi ).
It is easy to check that the above construction is invertible for any choice of τ and ci as long
as the transformer is invertible. Given z′, we can compute z iteratively as follows:

z_i = τ^{-1}(z′_i; h_i)  where  h_i = c_i(z_{<i}), (30)
This is illustrated in Figure 3b. In the forward computation, each hi and therefore each
z0i can be computed independently in any order or in parallel. In the inverse computation
however, all z<i need to have been computed before zi , so that z<i is available to the
conditioner for computing hi .
It is also easy to show that the Jacobian of the above transformation is triangular, and thus
the Jacobian determinant is tractable. Since each z′_i doesn't depend on z_{>i}, the partial derivative of z′_i with respect to z_j is zero whenever j > i. Hence, the Jacobian of f_φ can be written in the following lower-triangular form:

J_{f_φ}(z) = \begin{bmatrix} \frac{\partial \tau}{\partial z_1}(z_1; h_1) & & 0 \\ & \ddots & \\ L(z) & & \frac{\partial \tau}{\partial z_D}(z_D; h_D) \end{bmatrix}. (31)
The Jacobian is a lower-triangular matrix whose diagonal elements are the derivatives of the
transformer for each of the D elements of z. Since the determinant of any triangular matrix
is equal to the product of its diagonal elements, the log-absolute-determinant of Jfφ (z) can
be calculated in O(D) time as follows:
log |det J_{f_φ}(z)| = log | ∏_{i=1}^D ∂τ/∂z_i(z_i; h_i) | = ∑_{i=1}^D log | ∂τ/∂z_i(z_i; h_i) |. (32)
The lower-triangular part of the Jacobian— denoted here by L(z)—is irrelevant. The deriva-
tives of the transformer can be computed either analytically or via automatic differentiation,
depending on the implementation.
Autoregressive flows are universal approximators (under the conditions discussed in Sec-
tion 2.2) provided the transformer and the conditioner are flexible enough to represent any
function arbitrarily well. This follows directly from the fact that the universal transfor-
mation from Section 2.2, which is based on the cumulative distribution functions of the
conditionals, is indeed an autoregressive flow. Yet, this is just a statement of representa-
tional power and makes no guarantees about the flow’s behavior in practice.
An alternative, but mathematically equivalent, formulation of autoregressive flows is to
have the conditioner ci take in z0<i instead of z<i . This is equivalent to swapping τ with
τ −1 and z with z0 in the formulation presented above. Both formulations are common in
the literature; here we use the convention that ci takes in z<i without loss of generality.
The computational differences between the two alternatives are discussed in more detail by
e.g. Kingma et al. (2016); Papamakarios et al. (2017).
Affine transformers One of the simplest possible choices for the transformer—and one
of the first used—is the class of affine functions:

τ(z_i; h_i) = α_i z_i + β_i  where  h_i = {α_i, β_i}. (33)
The above can be thought of as a location-scale transformation, where αi controls the scale
and β_i controls the location. Invertibility is guaranteed if α_i ≠ 0, and this can be easily
achieved by e.g. taking αi = exp α̃i , where α̃i is an unconstrained parameter (in which case
hi = {α̃i , βi }). The derivative of the transformer with respect to zi is equal to αi ; hence
the log absolute Jacobian determinant is:
log |det J_{f_φ}(z)| = ∑_{i=1}^D log |α_i| = ∑_{i=1}^D α̃_i. (34)
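A minimal sketch of the affine transformer and its inverse, following Equations 33–34 (how the conditioner produces α̃_i and β_i is left abstract):

```python
import numpy as np

def affine_transformer_forward(z, alpha_tilde, beta):
    # z' = alpha * z + beta with alpha = exp(alpha_tilde) > 0, so the map is invertible.
    alpha = np.exp(alpha_tilde)
    z_out = alpha * z + beta
    log_abs_det = np.sum(alpha_tilde)          # Equation 34
    return z_out, log_abs_det

def affine_transformer_inverse(z_out, alpha_tilde, beta):
    z = (z_out - beta) * np.exp(-alpha_tilde)
    log_abs_det = -np.sum(alpha_tilde)         # log-determinant of the inverse map
    return z, log_abs_det
```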
Autoregressive flows with affine transformers are attractive because of their simplicity and
analytical tractability, but their expressivity is limited. To illustrate why, suppose z fol-
lows a Gaussian distribution; then, each z′_i conditioned on z′_{<i} will also follow a Gaussian distribution. In other words, a single affine autoregressive transformation of a multivariate Gaussian results in a distribution whose conditionals p_{z′}(z′_i | z′_{<i}) will necessarily be
Gaussian. Nonetheless, expressive flows can still be obtained by stacking multiple affine
autoregressive layers, but it’s unknown whether affine autoregressive flows with multiple
layers are universal approximators or not. Affine transformers are popular in the literature,
having been used in models such as NICE (Dinh et al., 2015), Real NVP (Dinh et al.,
2017), IAF (Kingma et al., 2016), MAF (Papamakarios et al., 2017), and Glow (Kingma
and Dhariwal, 2018).
where α_i = ∑_{k=1}^K α_{ik0}^2. It can be shown that, for large enough L, the sum-of-squares poly-
k=1 αik0 . It can be shown that, for large enough L, the sum-of-squares poly-
nomial transformer can approximate arbitrarily well any monotonically increasing function
(Jaini et al., 2019, Theorem 3). Nonetheless, since only polynomials of degree up to 4 can
be solved analytically, the sum-of-squares polynomial transformer is not analytically invert-
ible for L ≥ 2, and can only be inverted iteratively using e.g. bisection search (Burden and
Faires, 1989).
Spline-based transformers are distinguished by the type of spline they use, i.e. by the func-
tional form of the segments. The following options have been explored thus far, in order of
increasing flexibility: linear and quadratic splines (Müller et al., 2019), cubic splines (Durkan
et al., 2019a), linear-rational splines (Dolatabadi et al., 2020), and rational-quadratic splines
(Durkan et al., 2019b). Spline-based transformers are as fast to invert as to evaluate, while
maintaining exact analytical invertibility. Evaluating or inverting a spline-based transformer
is done by first locating the right segment—which can be done in O(log K) time using binary
search—and then evaluating or inverting that segment, which is assumed to be analytically
tractable. By increasing the number of segments K, a spline-based transformer can be
made arbitrarily flexible.
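As an illustration of this recipe in its simplest form, the following sketch implements a monotonic piecewise-linear transformer on [0, 1] in the spirit of Müller et al. (2019); the exact parameterization used in the cited works may differ:

```python
import numpy as np

def linear_spline_forward(z, w_unnorm):
    """Monotonic piecewise-linear map of z in [0, 1] onto [0, 1].

    The K unnormalized parameters w_unnorm come from the conditioner; a softmax
    turns them into bin probabilities, and the transformer is the corresponding
    piecewise-linear CDF over K equal-width bins.
    """
    K = len(w_unnorm)
    w = np.exp(w_unnorm - np.max(w_unnorm))
    w = w / np.sum(w)                          # probability mass of each bin
    knots = np.concatenate([[0.0], np.cumsum(w)])
    b = min(int(z * K), K - 1)                 # locate the active bin
    frac = z * K - b                           # position within the bin
    z_out = knots[b] + frac * w[b]
    log_abs_det = np.log(w[b] * K)             # slope of the active segment
    return z_out, log_abs_det

def linear_spline_inverse(z_out, w_unnorm):
    K = len(w_unnorm)
    w = np.exp(w_unnorm - np.max(w_unnorm))
    w = w / np.sum(w)
    knots = np.concatenate([[0.0], np.cumsum(w)])
    b = max(int(np.searchsorted(knots, z_out, side='right')) - 1, 0)
    b = min(b, K - 1)                          # locate the active segment of the output
    z = (b + (z_out - knots[b]) / w[b]) / K
    return z, -np.log(w[b] * K)
```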
The conditioner ci (z<i ) can be any function of z<i , meaning that each conditioner can, in
principle, be implemented as an arbitrary model with input z<i and output hi . However, a
naïve implementation in which each c_i(z_{<i}) is a separate model would scale poorly with the
dimensionality D, requiring D model evaluations, each with a vector of average size D/2.
This is in addition to the cost of storing and estimating the parameters of D independent
models. In fact, early work on flow precursors (Chen and Gopinath, 2000) dismissed the
autoregressive approach as prohibitively expensive.
Nonetheless, this problem can be effectively addressed in practice by sharing parameters
across the conditioners ci (z<i ), or even by combining the conditioners into a single model. In
the following paragraphs, we will discuss some practical implementations of the conditioner
that allow it to scale to high dimensions.
Recurrent conditioners One way to share parameters across the conditioners is to implement them with a recurrent neural network (RNN), whereby each h_i is computed from a hidden state s_i that summarizes the preceding variables:

h_i = c(s_i)  where  s_1 = initial state,  s_i = RNN(z_{i−1}, s_{i−1}) for i > 1. (39)
The RNN processes z<D = (z1 , . . . , zD−1 ) one element at a time, and at each step it updates
a fixed-size internal state si that summarizes the subsequence z<i = (z1 , . . . , zi−1 ). The
network c that computes hi from si can be the same for each step. The initial state s1 can
be fixed or it can be a learned parameter of the RNN. Any RNN architecture can be used,
such as LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014).
RNNs have been used extensively to share parameters across the conditional distributions of
autoregressive models. Examples of RNN-based autoregressive models include distribution
estimators (Larochelle and Murray, 2011; Uria et al., 2013, 2014), sequence models (Mikolov
et al., 2010; Graves, 2013; Sutskever et al., 2014), and image/video models (Theis and
Bethge, 2015; van den Oord et al., 2016b; Kalchbrenner et al., 2017). Section 3.1.3 discusses
the relationship between autoregressive models and autoregressive flows in detail.
In the autoregressive-flows literature, RNN-based conditioners have been proposed by e.g.
Oliva et al. (2018) and Kingma et al. (2016), but are relatively uncommon compared to
alternatives. The main downside of RNN-based conditioners is that they turn an inherently
parallel computation into a sequential one: the states s1 , . . . , sD must be computed sequen-
tially, even though each hi can in principle be computed independently and in parallel from
z<i . Since this recurrent computation involves O(D) sequential steps, it can be slow for
high-dimensional data such as images or videos.
Masked conditioners Another approach that shares parameters across conditioners but
avoids the sequential computation of an RNN is based on masking. This approach uses a
single, typically feedforward neural network that takes in z and outputs the entire sequence
(h1 , . . . , hD ) in one pass. The only requirement is that this network must obey the autore-
gressive structure of the conditioner: an output hi cannot depend on inputs z≥i .
To construct such a network, one takes an arbitrary neural network and removes connections
until there is no path from input zi to outputs (h1 , . . . , hi ). A simple way to remove
connections is by multiplying each weight matrix elementwise with a binary matrix of
the same size. This has the effect of removing the connections corresponding to weights
that are multiplied by zero, while leaving all other connections unmodified. These binary
matrices can be thought of as ‘masking out’ connections, hence the term masking. The
masked network will have the same architecture and size as the original network. In turn, it
retains the computational properties of the original network, such as parallelism or ability
to evaluate efficiently on a GPU.
A general procedure for constructing masks for multilayer perceptrons with arbitrarily many
hidden layers or hidden units was proposed by Germain et al. (2015). The key idea is to
assign a ‘degree’ between 1 and D to each input, hidden, and output unit, and mask-out
the weights between subsequent layers such that no unit feeds into a unit with lower or
equal degree. In convolutional networks, masking can be done by multiplying the filter
with a binary matrix of the same size, which leads to a type of convolution often referred
to as autoregressive or causal convolution (van den Oord et al., 2016c,a; Hoogeboom et al.,
2019b). In architectures that use self-attention, masking can be done by zeroing out the
softmax probabilities (Vaswani et al., 2017).
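A simplified sketch of this degree-based mask construction (hidden-unit degrees are drawn uniformly from 1 to D−1, which is one of several valid choices; Germain et al. (2015) discuss alternatives):

```python
import numpy as np

def made_masks(D, hidden_sizes, rng=np.random.default_rng(0)):
    """Binary masks for a masked MLP conditioner with autoregressive structure.

    Input unit i gets degree i (1..D); hidden units get degrees in 1..D-1; the
    output block producing h_i gets degree i. A hidden unit only receives from
    units of lower-or-equal degree, and an output unit only from units of
    strictly lower degree, so h_i never depends on inputs z_{>=i}.
    """
    degrees = [np.arange(1, D + 1)]                        # input degrees
    for H in hidden_sizes:
        degrees.append(rng.integers(1, D, size=H))         # hidden degrees in 1..D-1
    degrees.append(np.arange(1, D + 1))                    # output degrees (one block per h_i)

    masks = []
    for layer in range(len(degrees) - 1):
        d_in, d_out = degrees[layer], degrees[layer + 1]
        last = (layer == len(degrees) - 2)
        if last:
            mask = (d_out[:, None] > d_in[None, :]).astype(float)   # strict: h_i <- z_{<i}
        else:
            mask = (d_out[:, None] >= d_in[None, :]).astype(float)
        masks.append(mask)        # shape (fan_out, fan_in), multiplied into the weights
    return masks
```

Multiplying these masks elementwise into the weight matrices of a standard MLP yields a conditioner that outputs all of (h_1, . . . , h_D) in a single pass while respecting the autoregressive structure.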
Masked autoregressive flows have two main advantages. First, they are efficient to evaluate.
Given z, the parameters (h1 , . . . , hD ) can be obtained in one neural-network pass, and
then each dimension of z0 can be computed in parallel via z0i = τ (zi ; hi ). Second, masked
autoregressive flows are universal approximators. Given a large enough conditioner and
a flexible enough transformer, they can represent any autoregressive transformation with
monotonic transformers and thus transform between any two distributions (as discussed in
Section 2.2).
On the other hand, the main disadvantage of masked autoregressive flows is that they are
not as efficient to invert as to evaluate. This is because the parameters hi that are needed
to obtain z_i = τ^{-1}(z′_i; h_i) cannot be computed until all of (z_1, . . . , z_{i−1}) have been obtained.
Following this logic, we must first compute h1 by which we obtain z1 , then compute h2 by
which we obtain z2 , and so on until zD has been obtained. Using a masked conditioner c,
the above procedure can be implemented in pseudocode as follows:
Initialize z to an arbitrary value
for i = 1, . . . , D:
    (h_1, . . . , h_D) = c(z)
    z_i = τ^{-1}(z′_i; h_i)          (40)
To see why this procedure is correct, observe that if z≤i−1 is correct before the i-th iteration,
then hi will be computed correctly (due to the autoregressive structure of c) and thus z≤i
will be correct before the (i + 1)-th iteration. Since z≤0 = ∅ is correct before the first
iteration (in a degenerate sense, but still), by induction it follows that z≤D = z will be
correct at the end of the loop. Even though the above procedure can invert the flow exactly
(provided the transformer is easy to invert), it requires calling the conditioner D times.
This means that inverting a masked autoregressive flow using the above method is about
D times more expensive than evaluating the forward transformation. For high-dimensional
data such as images or video, this can be prohibitively expensive.
An alternative way to invert the flow, proposed by Song et al. (2019), is to solve the equation
z′ = f_φ(z) approximately, by iterating the following Newton-style fixed-point update:

z_{k+1} = z_k − α diag(J_{f_φ}(z_k))^{-1} (f_φ(z_k) − z′), (41)
where α is a step-size hyperparameter, and diag(·) returns a diagonal matrix whose diag-
onal is the same as that of its input. A convenient initialization is z_0 = z′. Song et al. (2019) showed that the above procedure is locally convergent for 0 < α < 2, and since f_φ^{-1}(z′) is the only fixed point, the procedure must either converge to it or diverge. With a masked autoregressive flow, computing both f_φ(z_k) and diag(J_{f_φ}(z_k)) can be done effi-
ciently by calling the conditioner once. Hence the above Newton-like procedure can be more
efficient than inverting the flow exactly when the number of iterations to convergence in
practice is significantly less than D. On the other hand, the above Newton-like procedure
is approximate and guaranteed to converge only locally.
Despite the computational difficulties associated with inversion, masking remains one of
the most popular techniques for implementing autoregressive flows. It is well suited to
situations for which inverting the flow is not needed or the data dimensionality is not too
large. Examples of flow-based models that use masking include IAF (Kingma et al., 2016),
MAF (Papamakarios et al., 2017), NAF (Huang et al., 2018), block-NAF (De Cao et al.,
2019), MintNet (Song et al., 2019) and MaCow (Ma et al., 2019). Masking can also be
used and has been popular in implementing non-flow-based autoregressive models such as
MADE (Germain et al., 2015), PixelCNN (van den Oord et al., 2016c; Salimans et al., 2017)
and WaveNet (van den Oord et al., 2016a, 2018).
Coupling layers As we have seen, masked autoregressive flows have computational asym-
metry that impacts their application and usability. Either sampling or density evaluation
will be D times slower than the other. If both of these operations are required to be fast,
a different implementation of the conditioner is needed. One such implementation that
is computationally symmetric, i.e. equally fast to evaluate or invert, is the coupling layer
(Dinh et al., 2015, 2017). The idea is to choose an index d (a common choice is D/2 rounded
to an integer) and design the conditioner such that:
• Parameters (hd+1 , . . . , hD ) are functions of z≤d only, i.e. they don’t depend on z>d .
In other words, a coupling layer leaves the first d elements unchanged, z′_{≤d} = z_{≤d}, and transforms the remaining elements elementwise, z′_i = τ(z_i; h_i) for i > d, with parameters (h_{d+1}, . . . , h_D) = F(z_{≤d}) computed by an arbitrary function F such as a neural network. The Jacobian of a coupling layer is therefore block lower triangular: an identity block for z_{≤d}, a dense block capturing the dependence of z′_{>d} on z_{≤d}, and a diagonal block D whose entries are the transformer derivatives. Its determinant is simply the product of the diagonal elements of D, which are equal to the derivatives of the transformers τ(·; h_{d+1}), . . . , τ(·; h_D).
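A minimal sketch of a coupling layer with affine transformers, in the style of Real NVP; here `conditioner_net` stands for any function (typically a neural network) mapping z_{≤d} to the 2(D−d) transformer parameters:

```python
import numpy as np

def coupling_forward(z, d, conditioner_net):
    # First d elements pass through unchanged; they parameterize the rest.
    z1, z2 = z[:d], z[d:]
    params = conditioner_net(z1)                   # assumed shape (2, D - d)
    alpha_tilde, beta = params[0], params[1]
    z2_out = np.exp(alpha_tilde) * z2 + beta       # elementwise affine transformer
    log_abs_det = np.sum(alpha_tilde)
    return np.concatenate([z1, z2_out]), log_abs_det

def coupling_inverse(z_out, d, conditioner_net):
    z1, z2_out = z_out[:d], z_out[d:]
    params = conditioner_net(z1)                   # same single pass as the forward direction
    alpha_tilde, beta = params[0], params[1]
    z2 = (z2_out - beta) * np.exp(-alpha_tilde)
    return np.concatenate([z1, z2])
```

Both directions call the conditioner once on the unchanged half, which is what makes coupling layers equally fast to evaluate and to invert.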
Coupling layers and fully autoregressive flows are two extremes on a spectrum of possible
implementations. A coupling layer splits z into two parts and transforms the second part
elementwise as a function of the first part, whereas a fully autoregressive flow splits the
input into D parts (each with one element in it) and transforms each part as a function of
all previous parts. Clearly, there are intermediate choices: one can split the input into K
parts and transform the k-th part elementwise as a function of parts 1 to k − 1, with K = 2
corresponding to a coupling layer and K = D to a fully autoregressive flow. Using masking,
inverting the transformation will be O(K) times more expensive than evaluating it, hence
K could be chosen based on the computational requirements of the task.
The efficiency of coupling layers comes at the cost of reduced expressive power. Unlike a
recurrent or masked autoregressive flow, a single coupling layer can no longer represent any
autoregressive transformation, regardless of how expressive the function F is. As a result,
an autoregressive flow with a single coupling layer is no longer a universal approximator.
Nonetheless, the expressivity of the flow can be increased by composing multiple coupling
layers. When composing coupling layers, the elements of z need to be permuted between
layers so that all dimensions have a chance to be transformed as well as interact with
one another. Previous work across various domains (see e.g. Kingma and Dhariwal, 2018;
Prenger et al., 2019; Durkan et al., 2019b) has shown that composing coupling layers can
indeed create flexible flows.
Coupling layers are one of the most popular methods for implementing flow-based models
because they allow both density evaluation and sampling to be fast. A flow based on cou-
pling layers can be tractably fitted with maximum likelihood and then be sampled from
efficiently. Thus coupling layers are often found in generative models of high-dimensional
data such as images, audio and video. Examples of flow-based models with coupling layers
include NICE (Dinh et al., 2015), Real NVP (Dinh et al., 2017), Glow (Kingma and Dhari-
wal, 2018), WaveGlow (Prenger et al., 2019), FloWaveNet (Kim et al., 2019) and Flow++
(Ho et al., 2019).
Alongside normalizing flows, another popular class of models for high-dimensional distri-
butions is the class of autoregressive models. Autoregressive models have a long history,
from the general framework of Bayesian networks (Pearl, 1988; Frey, 1998) to more recent
neural-network-based implementations (Bengio and Bengio, 2000; Uria et al., 2016).
To construct an autoregressive model of px (x), we first decompose px (x) into a product of
1-dimensional conditionals using the chain rule of probability:
p_x(x) = ∏_{i=1}^D p_x(x_i | x_{<i}). (46)

Each conditional is then modeled by a parametric density whose parameters h_i are a function of the preceding variables:

p_x(x_i | x_{<i}) = p_x(x_i; h_i)  where  h_i = c_i(x_{<i}). (47)
For example, px (xi ; hi ) can be a Gaussian parameterized by its mean and variance, or a
mixture of Gaussians parameterized by the mean, variance and mixture coefficient of each
component. The functions ci are analogous to the conditioners of an autoregressive flow,
and are often implemented with neural networks using either RNNs or masking as discussed
in previous sections. Apart from continuous data, autoregressive models can be readily
used for discrete or even mixed data. If xi is discrete for some i, then px (xi ; hi ) can be a
parametric probability mass function such as a categorical or a mixture of Poissons.
We now show that all autoregressive models of continuous variables are in fact autoregressive
flows with a single autoregressive layer. Let τ (xi ; hi ) be the cumulative distribution function
of px (xi ; hi ), defined as follows:
τ(x_i; h_i) = ∫_{−∞}^{x_i} p_x(x′_i; h_i) dx′_i. (48)
Then, as shown in Section 2.2, the vector u with elements u_i = τ(x_i; h_i), where h_i = c_i(x_{<i}), (49) is always distributed uniformly in (0, 1)^D. The above expression has exactly the same form as the definition of an autoregressive flow in Equation 29, with z = x and z′ = u. There-
fore, an autoregressive model is in fact an autoregressive flow with a single autoregressive
layer. Moreover, the layer’s transformers are the cumulative distribution functions of the
conditionals of the autoregressive model, and the layer’s base distribution is a uniform in
(0, 1)D . We can make the connection explicit by writing the density under the change of
variables:
log p_x(x) = log ∏_{i=1}^D Uniform(τ(x_i; h_i); 0, 1) + log ∏_{i=1}^D p_x(x_i; h_i) = ∑_{i=1}^D log p_x(x_i | x_{<i}). (50)
The term involving the uniform base density drops from the expression, leaving just the
Jacobian determinant.2 Following Equation 30, the inverse autoregressive flow that maps u back to x is given elementwise by x_i = τ^{-1}(u_i; h_i), where h_i = c_i(x_{<i}). (51)
2. Inouye and Ravikumar (2018) termed flows of this form—whereby the density is fully determined by the
Jacobian determinant—density destructors.
The above corresponds exactly to sampling from the autoregressive model one element at
a time, where at each step the corresponding conditional is sampled from using inverse
transform sampling.
Yet the transformer is not necessarily limited to being the inverse CDF. We can make
further connections between specific types of autoregressive models and the transformers
discussed in Section 3.1.1. For example, consider an autoregressive model with Gaussian
conditionals of the form:
px (xi ; hi ) = N xi ; µi , σi2
where hi = {µi , σi } . (52)
As discussed in the previous section, autoregressive flows restrict an output variable z0i to
depend only on inputs z≤i , making the flow dependent on the order of the input variables.
As we showed, in the limit of infinite capacity, this restriction doesn’t limit the flexibility
of the flow-based model. However, in practice we don’t operate at infinite capacity. The
order of the input variables will determine the set of distributions the model can represent.
Moreover, the target transformation may be easy to learn for some input orderings and
hard to learn for others. The problem is further exacerbated when using coupling layers
since only part of the input variables is transformed.
To cope with these limitations in practice, it often helps to permute the input variables
between successive autoregressive layers. For coupling layers it is in fact necessary: if we
don’t permute the input variables between successive layers, part of the input will never
be transformed. A permutation of the input variables is itself an easily invertible trans-
formation, and its absolute Jacobian determinant is always 1 (i.e. it is volume-preserving).
Hence, permutations can seamlessly be composed with other invertible and differentiable
transformations in the usual way.
An approach that generalizes the idea of a permutation of input variables is that of a linear
flow. A linear flow is essentially an invertible linear transformation of the form:

z′ = Wz, (54)

where W is a D × D invertible matrix that parameterizes the transformation.
For suitably structured parameterizations of W—for example, ones based on triangular or LU-type factorizations—computing the determinant costs O(D). In Appendix B, we discuss in more detail this and a few more parameterizations that restrict the form of W in various ways.
In any case, it is important to note that it is impossible to parameterize all invertible matrices
of size D ×D in a continuous way, so any continuous parameterization of W that guarantees
its invertibility will unavoidably leave out some invertible matrices. That’s because there
is no continuous surjective function from R^{D²} to the set of D × D invertible matrices.
To see why, consider two invertible matrices WA and WB such that det WA > 0 and
det WB < 0. If there exists a continuous parameterization of all invertible matrices, then
there exists a continuous path that connects WA and WB . However, since the determinant
is a continuous function of the matrix entries, any such path must include a matrix with
zero determinant, i.e. a non-invertible matrix, which is a contradiction. This argument
shows that the set of D × D invertible matrices contains two disconnected ‘islands’—one
containing matrices with positive determinant, the other with negative determinant—that
are fully separated by the set of non-invertible matrices. In practice, this means that we
can only hope to continuously parameterize one of these two islands, fixing the sign of the
determinant from the outset.
A contractive residual flow has the form z′ = f_φ(z) = z + g_φ(z), where g_φ is Lipschitz continuous with constant L < 1 with respect to some distance δ; such a flow can be inverted by iterating the fixed-point update z_{k+1} = z′ − g_φ(z_k). The Banach fixed-point theorem guarantees that the above procedure converges to z_* = f_φ^{-1}(z′) for any choice of starting point z_0. Moreover, it can be shown that the rate of
convergence (with respect to δ) is exponential in the number of iterations k, and can be
quantified as follows:
δ(z_k, z_*) ≤ (L^k / (1 − L)) δ(z_0, z_1). (59)
The smaller the Lipschitz constant is, the faster z_k converges to z_*. We can think of L as trading off flexibility for efficiency: the smaller L is, the fewer iterations it takes to approximately invert the flow, but the more constrained—i.e. less flexible—the residual transformation becomes. In the extreme case of L = 0, the inversion procedure converges after one iteration, but the transformation reduces to adding a constant.
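A sketch of this fixed-point inversion, with `g` standing for the contractive residual function g_φ (the example residual function is an arbitrary choice with Lipschitz constant 0.5):

```python
import numpy as np

def invert_residual_flow(g, z_prime, num_iters=50, tol=1e-8):
    # Invert z' = z + g(z) for contractive g by iterating z <- z' - g(z);
    # the Banach fixed-point theorem guarantees convergence when Lip(g) < 1.
    z = z_prime.copy()                       # any starting point works
    for _ in range(num_iters):
        z_next = z_prime - g(z)
        if np.max(np.abs(z_next - z)) < tol:
            return z_next
        z = z_next
    return z

# Example with a residual function whose Lipschitz constant is 0.5:
g = lambda z: 0.5 * np.tanh(z)
z = np.array([0.3, -1.2])
z_prime = z + g(z)
print(np.allclose(invert_residual_flow(g, z_prime), z))   # True
```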
A challenge in building contractive residual flows is designing the function gφ to be con-
tractive without impinging upon its flexibility. It is easy to see that the composition of K Lipschitz-continuous functions F_1, . . . , F_K is also Lipschitz continuous, with a Lipschitz constant equal to ∏_{k=1}^K L_k, where L_k is the Lipschitz constant of F_k. Hence, if g_φ is a com-
position of neural-network layers (as is common in deep learning), it is sufficient to make
each layer Lipschitz continuous with Lk ≤ 1, with at least one layer having Lk < 1, for the
entire network to be contractive. Many elementwise nonlinearities used in deep learning—
including the logistic sigmoid, hyperbolic tangent (tanh), and rectified linear (ReLU)—are
in fact already Lipschitz continuous with a constant no greater than 1. Furthermore, lin-
ear layers (including dense layers and convolutional layers) can be made contractive with
respect to a norm by dividing them with a constant strictly greater than their induced op-
erator norm. One such implementation was proposed by Behrmann et al. (2019): spectral
normalization (Miyato et al., 2018) was used to make linear layers contractive with respect
to the Euclidean norm.
One drawback of contractive residual flows is that there is no known general, efficient
procedure for computing their Jacobian determinant. Rather, one would have to revert to
automatic differentiation to obtain the Jacobian and an explicit determinant computation
to obtain the Jacobian determinant, which costs O(D³) as discussed earlier. Without an
efficient way to compute the Jacobian determinant, exactly evaluating the density of the flow
model is costly and potentially infeasible for high-dimensional data such as images.
Nonetheless, it is possible to obtain an unbiased estimate of the log absolute Jacobian de-
terminant, and hence of the log density, which is enough to train the flow model e.g. with
maximum likelihood using stochastic gradients. We begin by writing the log absolute Ja-
cobian determinant as a power series:3
log |det J_{f_φ}(z)| = log |det(I + J_{g_φ}(z))| = ∑_{k=1}^∞ ((−1)^{k+1} / k) Tr{ J_{g_φ}^k(z) }, (60)

where J_{g_φ}^k(z) is the k-th power of the Jacobian of g_φ evaluated at z. The above series converges if ‖J_{g_φ}(z)‖ < 1 for some submultiplicative matrix norm ‖·‖, which in our case

3. This power series is essentially the Maclaurin series log(1 + x) = x − x²/2 + x³/3 − · · · extended to matrices.
holds due to g_φ being contractive. The trace of J_{g_φ}^k(z) can be efficiently estimated using the Hutchinson trace estimator (Hutchinson, 1990):

Tr{ J_{g_φ}^k(z) } ≈ v^T J_{g_φ}^k(z) v, (61)
where v can be any D-dimensional random vector with zero mean and unit covariance. The vector-Jacobian product v^T J_{g_φ}^k(z) can then be computed with k backpropagation passes.
Finally, the infinite sum can be estimated by a finite sum of appropriately re-weighted terms
using the Russian-roulette estimator (Chen et al., 2019).
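The following sketch combines the power series, the Hutchinson estimator and vector-Jacobian products using PyTorch autograd; for clarity it truncates the series at a fixed number of terms, whereas an unbiased estimator would instead re-weight a randomly truncated sum as in Chen et al. (2019):

```python
import torch

def log_det_series_estimate(g, z, num_terms=10):
    """Truncated power-series estimate of log|det(I + J_g(z))| for contractive g.

    `g` must be built from differentiable torch operations. A single Hutchinson
    probe is used; each term v^T J^k v requires one additional vector-Jacobian
    product (i.e. one backpropagation pass).
    """
    z = z.clone().requires_grad_(True)
    out = g(z)
    v = torch.randn_like(z)                     # probe with zero mean, unit covariance
    w = v
    estimate = torch.zeros(())
    for k in range(1, num_terms + 1):
        # w <- w^T J_g(z), so after k steps w = v^T J_g^k(z)
        w = torch.autograd.grad(out, z, grad_outputs=w, retain_graph=True)[0]
        estimate = estimate + (-1.0) ** (k + 1) / k * torch.dot(w.flatten(), v.flatten())
    return estimate
```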
Unlike autoregressive flows, which are based on constraining the Jacobian to be sparse,
contractive residual flows have a dense Jacobian in general, which allows all input variables
to affect all output variables. As a result, contractive residual flows can be very flexible and
have demonstrated good results in practice. On the other hand, unlike the one-pass density
evaluation and sampling offered by flows based on coupling layers, exact density evaluation
is computationally expensive and sampling is done iteratively, which limits the applicability
of contractive residual flows in certain tasks.
The matrix determinant lemma states that, for an invertible D × D matrix A and D × M matrices V and W,

det(A + VW^T) = det(I + W^T A^{-1} V) det(A). (62)

If the determinant and inverse of A are tractable and M is less than D, the matrix determinant lemma can provide a computationally efficient way to compute the determinant of A + VW^T. For example, if A is diagonal, computing the left-hand side costs O(D³ + D²M), whereas computing the right-hand side costs O(M³ + DM²), which is preferable if M < D.
In the context of flows, the matrix determinant lemma can be used to efficiently compute
the Jacobian determinant. In this section, we will discuss examples of residual flows that
are specifically designed such that application of the matrix determinant lemma leads to
efficient Jacobian-determinant computation.
Planar flow One early example is the planar flow (Rezende and Mohamed, 2015), where
the function g_φ is a one-layer neural network with a single hidden unit:

z′ = z + v σ(w^T z + b), (63)

where v ∈ R^D, w ∈ R^D and b ∈ R are the parameters of the flow and σ is a differentiable activation function such as the hyperbolic tangent. The Jacobian is given by:

J_{f_φ}(z) = I + σ′(w^T z + b) v w^T, (64)
where σ′ is the derivative of the activation function. The Jacobian has the form of a diagonal matrix plus a rank-1 update. Using the matrix determinant lemma, the Jacobian determinant can be computed in time O(D) as follows:

det J_{f_φ}(z) = 1 + σ′(w^T z + b) w^T v. (65)
In general, the planar flow is not invertible for all values of v and w. However, assuming that σ′ is positive everywhere and bounded from above (which is the case if σ is the hyperbolic tangent, for example), a sufficient condition for invertibility is w^T v > −1 / sup_x σ′(x).
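A minimal sketch of the planar flow and its O(D) determinant, using the hyperbolic tangent as the activation:

```python
import numpy as np

def planar_flow(z, v, w, b):
    # z' = z + v * sigma(w^T z + b) with sigma = tanh   (Equation 63)
    # det J = 1 + sigma'(w^T z + b) w^T v               (Equation 65), O(D) to compute
    a = np.dot(w, z) + b
    z_out = z + v * np.tanh(a)
    sigma_prime = 1.0 - np.tanh(a) ** 2
    det_jac = 1.0 + sigma_prime * np.dot(w, v)
    return z_out, det_jac
```

With σ = tanh we have sup_x σ′(x) = 1, so the invertibility condition above reduces to w^T v > −1.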
Sylvester flow Planar flows can be extended to M hidden units, in which case they are
known as Sylvester flows (van den Berg et al., 2018) and can be written as:
z′ = z + V σ(W^T z + b). (66)

The parameters of the flow are now V ∈ R^{D×M}, W ∈ R^{D×M} and b ∈ R^M, and the activation function σ is understood elementwise. The Jacobian can be written as:

J_{f_φ}(z) = I + V S(z) W^T, (67)

where S(z) is an M × M diagonal matrix whose diagonal is equal to σ′(W^T z + b). Applying the matrix determinant lemma we get:

det J_{f_φ}(z) = det(I + S(z) W^T V). (68)

van den Berg et al. (2018) proposed the parameterization V = QU and W = QL, where Q is a D × M matrix whose columns are an orthonormal set of vectors (this requires M ≤ D), U is M × M upper triangular, and L is M × M lower triangular. Since Q^T Q = I and the product of upper-triangular matrices is also upper triangular, the Jacobian determinant becomes:

det J_{f_φ}(z) = det(I + S(z) L^T U) = ∏_{i=1}^M (1 + S_{ii}(z) L_{ii} U_{ii}). (69)
Similar to planar flows, Sylvester flows are not invertible for all values of their parameters.
Assuming σ′ is positive everywhere and bounded from above, a sufficient condition for invertibility is L_{ii} U_{ii} > −1 / sup_x σ′(x) for all i ∈ {1, . . . , M}.
Radial flow Radial flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015) take
the following form:
z′ = z + (β / (α + r(z))) (z − z_0)  where  r(z) = ‖z − z_0‖. (70)

The parameters of the flow are α ∈ (0, +∞), β ∈ R and z_0 ∈ R^D, and ‖·‖ is the Euclidean norm. The above transformation can be thought of as a contraction/expansion radially with center z_0. The Jacobian can be written as follows:

J_{f_φ}(z) = (1 + β / (α + r(z))) I − (β / (r(z)(α + r(z))²)) (z − z_0)(z − z_0)^T, (71)
which is a diagonal matrix plus a rank-1 update. Applying the matrix determinant lemma
and rearranging, we get the following expression for the Jacobian determinant, which can
be computed in O(D):
det J_{f_φ}(z) = (1 + αβ / (α + r(z))²) (1 + β / (α + r(z)))^{D−1}. (72)
The radial flow is not invertible for all values of β. A sufficient condition for invertibility is
β > −α.
In summary, planar, Sylvester and radial flows have Jacobian determinants that cost O(D)
to compute, and can be made invertible by suitably restricting their parameters. However,
there is no analytical way of computing their inverse, which is why these flows have mostly
been used to approximate posteriors for variational autoencoders. Moreover, each individual
transformation is fairly simple, and it’s not clear how the flexibility of the flow can be
increased other than by increasing the number of transformations.
Normalization Like with deep neural networks trained with gradient-based methods,
normalizing the intermediate representations zk is crucial for maintaining stable gradients
throughout the flow. Batch normalization or batch norm (Ioffe and Szegedy, 2015) has been
widely demonstrated to be effective in stabilizing and improving neural-network training,
thus making it attractive for use in deep flows as well. Viewing the batch statistics as fixed,
batch norm is essentially a composition of two affine transformations. The first has scale
and translation parameters set by the batch statistics, and the second has free parameters
α (scale) and β (translation):
BN(z) = α ⊙ (z − µ̂) / √(σ̂² + ε) + β,    BN^{-1}(z′) = µ̂ + √(σ̂² + ε) ⊙ (z′ − β) / α, (73)

where the operations are elementwise and ε is a small constant added for numerical stability.
Moreover, batch norm has an easy-to-compute Jacobian determinant due to it acting ele-
mentwise (and thus having a diagonal Jacobian):
det J_BN(z) = ∏_{i=1}^D α_i / √(σ̂_i² + ε). (74)
The above formulas assume that batch statistics are fixed, which is true for a trained model.
During training, however, batch statistics are not fixed, but are functions of all examples
in the batch. This makes batch norm not invertible, unless the batch statistics have been
cached during a forward pass. Also, the Jacobian determinant as written above makes little
sense mathematically, since batch norm is now a function of the whole batch. Yet, using
this Jacobian-determinant formula as an approximation often suffices for training, at least
if the batch is large enough (Dinh et al., 2017; Papamakarios et al., 2017).
Glow employs a variant termed activation normalization or act norm (Kingma and Dhari-
wal, 2018) that doesn’t use batch statistics µ̂ and σ̂. Instead, before training begins, a batch
is passed through the flow, and α and β are set such that the transformed batch has zero
mean and unit variance. After this data-dependent initialization, α and β are optimized
as model parameters. Act norm is preferable when training with small mini-batches since
batch norm’s statistics become noisy and can destabilize training.
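A minimal sketch of act norm under these assumptions (the class and variable names are ours): the per-dimension scale and translation are initialized from one batch so that the transformed batch has zero mean and unit variance, after which they would be treated as free parameters of the model.

```python
import numpy as np

class ActNorm:
    """Elementwise affine map z' = alpha * z + beta with data-dependent initialization."""

    def __init__(self):
        self.alpha = None
        self.beta = None

    def initialize(self, batch, eps=1e-6):
        # Set alpha, beta so that the transformed batch has zero mean and unit variance.
        mu = batch.mean(axis=0)
        sigma = batch.std(axis=0) + eps
        self.alpha = 1.0 / sigma
        self.beta = -mu / sigma

    def forward(self, z):
        z_out = self.alpha * z + self.beta
        logabsdet = np.sum(np.log(np.abs(self.alpha)))  # same value for every input
        return z_out, logabsdet

    def inverse(self, z_out):
        return (z_out - self.beta) / self.alpha

rng = np.random.default_rng(1)
batch = 3.0 + 2.0 * rng.normal(size=(256, 4))

layer = ActNorm()
layer.initialize(batch)                  # data-dependent initialization, before training
y, logabsdet = layer.forward(batch)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # approximately 0 and 1
print(np.allclose(layer.inverse(y), batch))             # True
```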
Multi-scale architectures As mentioned in Section 2.1, x and u must have the same
dimensionality and every sub-transformation Tk must preserve dimensionality. This means
that evaluating T incurs an increasing computational cost as dimensionality grows. This
constraint is at direct odds with our desire to use as many steps in the flow as possible.
Dinh et al. (2017) proposed side-stepping this issue by way of a multi-scale architecture.
At regular intervals in the steps of the flow when going from x to u, some number of
sub-dimensions of zk are clamped and no additional transformation is applied. One can
think of this as implementing a skip-connection, mapping those dimensions directly to the
corresponding dimensions in the final representation u: (u_j, . . . , u_i) = (z_{k,j}, . . . , z_{k,i}), where
k is the step at which the clamping is applied. All K steps are applied to only a small
subset of dimensions, which is less costly than applying all steps to all dimensions. Dinh
et al. (2017) also argue that these skip-connections help with optimization, distributing the
objective throughout the full depth of the flow.
Besides having this practical benefit, multi-scale architectures are a natural modeling choice
for granular data types such as pixels (Dinh et al., 2017; Kingma and Dhariwal, 2018) and
waveforms (Prenger et al., 2019; Kim et al., 2019). The macro-structures that we often
care about—such as shapes and textures, in the case of images—typically do not need all
D dimensions to be described. Dinh et al. (2017) showed that multi-scale architectures do
indeed encode more global, semantic information in the dimensions that undergo all trans-
formations. On the other hand, dimensions that are factored out earlier in the flow represent
lower-level information; see Dinh et al. (2017)’s Appendix D for demonstrations.
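The bookkeeping behind a multi-scale architecture can be sketched as follows. This is our own schematic: the `flow_step` function is a stand-in for a real coupling or autoregressive layer, and the choice of clamping half of the remaining dimensions after each scale is an assumption of the illustration.

```python
import numpy as np

def flow_step(z, k):
    """Stand-in for one flow step: a fixed, invertible, elementwise affine map.
    In a real model this would be a coupling or autoregressive layer with learned
    parameters; only the bookkeeping around it matters for this illustration."""
    scale = 1.0 + 0.1 * (k + 1)
    return scale * z, z.shape[-1] * np.log(scale)

def multiscale_forward(x, num_scales=3, steps_per_scale=2):
    """Map x -> u, clamping half of the remaining dimensions after each scale."""
    z = x
    clamped = []            # dimensions factored out early (skip-connections to u)
    total_logabsdet = 0.0
    for s in range(num_scales):
        for k in range(steps_per_scale):
            z, logabsdet = flow_step(z, s * steps_per_scale + k)
            total_logabsdet += logabsdet
        if s < num_scales - 1:
            half = z.shape[-1] // 2
            clamped.append(z[..., half:])   # these dimensions receive no further steps
            z = z[..., :half]
    clamped.append(z)                       # dimensions that went through all steps
    u = np.concatenate(clamped[::-1], axis=-1)
    return u, total_logabsdet

x = np.arange(8.0)
u, total_logabsdet = multiscale_forward(x)
print(u, total_logabsdet)
```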
Flows can also be constructed in continuous time, by parameterizing the flow's infinitesimal dynamics with an ordinary differential
equation (ODE) that describes the flow's evolution in time. We call these 'continuous-time'
flows, as they evolve according to a real-valued scalar variable analogous to the number of
steps. We call this scalar 'time', as it determines how long the dynamics are run. In this
section, we describe this class of continuous-time flows and summarize the numerical tools
necessary for their implementation.
4.1 Definition
Let zt denote the flow’s state at time t (or ‘step’ t, thinking in the discrete setting). Time t
is assumed to run continuously from t0 to t1 , such that zt0 = u and zt1 = x. A continuous-
time flow is constructed by parameterizing the time derivative of zt with a function gφ with
parameters φ, yielding the following ordinary differential equation (ODE):
\frac{dz_t}{dt} = g_φ(t, z_t).   (75)
The function gφ takes as inputs both the time t and the flow’s state zt , and outputs the
time derivative of zt at time t. The only requirements for gφ are that it be uniformly
Lipschitz continuous in zt (meaning that there is a single Lipschitz constant that works
for all t) and continuous in t (Chen et al., 2018). From Picard’s existence theorem, it
follows that satisfying these requirements ensures that the above ODE has a unique solution
(Coddington and Levinson, 1955). Many neural-network layers meet these requirements
(Gouk et al., 2018), and unlike the architectures described in Section 3 that require careful
structural assumptions to ensure invertibility and tractability of their Jacobian determinant,
gφ has no such requirements.
To compute the transformation x = T (u), we need to run the dynamics forward in time by
integrating:
x = z_{t_1} = u + \int_{t_0}^{t_1} g_φ(t, z_t)\, dt.   (76)
The inverse transform T^{−1} is then:
u = z_{t_0} = x + \int_{t_1}^{t_0} g_φ(t, z_t)\, dt = x − \int_{t_0}^{t_1} g_φ(t, z_t)\, dt,   (77)
where in the right-most expression we used the fact that switching the limits of integration
is equivalent to negating the integral. We write the inverse in this last form to show that,
unlike many flows comprised of discrete compositions (Section 3), continuous-time flows
have the same computational complexity in each direction. In consequence, choosing which
direction is the forward and which is the inverse is not a crucial implementation choice as
it is for e.g. autoregressive flows.
The change in log density for continuous-time flows can be characterized directly as (Chen
et al., 2018):
\frac{d \log p(z_t)}{dt} = −\operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\}   (78)
where Tr{·} denotes the trace operator and Jgφ (t,·) (zt ) is the Jacobian of gφ (t, ·) evaluated at
zt . The above equation can be obtained as a special case of the Fokker–Planck equation for
zero diffusion (Risken, 1996). While the trace operator at first glance seems more compu-
tationally tractable than a determinant, in practice it still requires O(D) backpropagation
passes to obtain the diagonal elements of Jgφ (t,·) (zt ). Similarly to contractive residual flows
(Section 3.3.1), Hutchinson's trace estimator (Hutchinson, 1990) can be used to obtain
an approximation in high-dimensional settings (Grathwohl et al., 2019):
\operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\} ≈ v^⊤ J_{g_φ(t,·)}(z_t)\, v,   (79)
where v can be any D-dimensional random vector with zero mean and unit covariance. The
Jacobian-vector product v> Jgφ (t,·) (zt ) can be computed in a single backpropagation pass,
which makes the Hutchinson trace estimator about D times more efficient than calculating
the trace exactly. Chen and Duvenaud (2019) propose an alternative solution in which the
architecture of gφ is carefully constrained so that the exact Jacobian trace can be computed
in a single backpropagation pass.
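A small NumPy check of the estimator in Equation 79, using an explicit random matrix in place of the Jacobian; in an actual continuous-time flow, v^⊤J would be obtained as a vector-Jacobian product from automatic differentiation rather than from an explicit matrix, and the matrix below is purely an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50
J = rng.normal(size=(D, D)) / np.sqrt(D)     # stands in for the Jacobian of g_phi(t, .)

exact_trace = np.trace(J)

# Hutchinson estimator: E[v^T J v] = Tr(J) for any v with zero mean and unit covariance.
num_samples = 10_000
v = rng.choice([-1.0, 1.0], size=(num_samples, D))   # Rademacher noise
estimates = np.einsum('sd,de,se->s', v, J, v)        # v^T J v, one value per sample

print(exact_trace, estimates.mean(), estimates.std() / np.sqrt(num_samples))
```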
By integrating the derivative of log p(zt ) over time, we obtain an expression for the log
density of x under the continuous-time flow:
\log p_x(x) = \log p_u(u) − \int_{t_0}^{t_1} \operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\}\, dt.   (80)
Evaluating the forward transform and the log density can be done simultaneously by com-
puting the following combined integral:
\begin{bmatrix} x \\ \log p_x(x) \end{bmatrix} = \begin{bmatrix} u \\ \log p_u(u) \end{bmatrix} + \int_{t_0}^{t_1} \begin{bmatrix} g_φ(t, z_t) \\ −\operatorname{Tr}\{ J_{g_φ(t,·)}(z_t) \} \end{bmatrix} dt.   (81)
Due to continuous-time flows being defined by an ODE, the vast literature on numerical
ODE solvers and the corresponding software can be leveraged to implement these flows.
See Süli (2010) for an accessible introduction to numerical methods for ODEs. While there
are numerous numerical methods that could be employed, below we briefly describe Euler’s
method and the adjoint method.
Perhaps the simplest numerical technique one can apply is Euler’s method. The idea is to
first discretize the ODE using a small step-size ε > 0 as follows:
z_{t+ε} = z_t + ε\, g_φ(t, z_t),   (82)
with the approximation becoming exact as ε → 0. The ODE can then be (approximately)
solved by iterating the above computation starting from z_{t_0} = u until obtaining z_{t_1} = x.
This way, the parameters φ can be optimized with gradients computed via backpropagation
through the ODE solver. It is relatively straightforward to use other discrete solvers such as
any in the Runge–Kutta family. The discretized forward solution would be backpropagated
through just as with Euler’s method.
Interestingly, the Euler approximation implements the continuous-time flow as a discrete-
time residual flow of the class described in Equation 55. Having assumed that g_φ(t, ·) is
uniformly Lipschitz continuous with a Lipschitz constant L independent of t, it immediately
follows that ε g_φ(t, ·) is contractive for any ε < 1/L. Hence, for small enough ε we can think
of the above Euler discretization as a particular instance of a contractive residual flow
(Section 3.3.1).
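As a toy sketch of how Equation 81 is solved with Euler steps, the snippet below integrates a hand-written dynamics function and tracks the log density. The finite-difference Jacobian trace is affordable only because D is tiny here; a real implementation would use automatic differentiation and, in high dimensions, the estimator of Equation 79. The dynamics g and the base point are assumptions of this illustration.

```python
import numpy as np

def g(t, z):
    """Toy dynamics standing in for the neural network g_phi(t, z)."""
    return np.tanh(z) * np.cos(t)

def jacobian_trace(t, z, eps=1e-5):
    # Exact enough for a tiny D; in practice use autodiff and, in high dimensions,
    # the Hutchinson estimator of Equation 79.
    return sum((g(t, z + eps * e)[i] - g(t, z)[i]) / eps
               for i, e in enumerate(np.eye(z.shape[0])))

def euler_flow(u, log_pu, t0=0.0, t1=1.0, num_steps=1000):
    """Integrate Equation 81 with Euler steps, returning x and log p_x(x)."""
    step = (t1 - t0) / num_steps
    z, log_p, t = u.copy(), log_pu, t0
    for _ in range(num_steps):
        log_p -= step * jacobian_trace(t, z)   # d log p / dt = -Tr{J_g}
        z = z + step * g(t, z)                 # d z / dt = g(t, z)
        t += step
    return z, log_p

D = 3
u = np.array([0.5, -1.0, 2.0])
log_pu = -0.5 * (u @ u) - 0.5 * D * np.log(2 * np.pi)   # standard-normal base density
x, log_px = euler_flow(u, log_pu)
print(x, log_px)
```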
Approximating continuous-time flows using discrete-time residual flows gives us insight
on Equation 78’s description of the time evolution of log p(zt ). Using the Taylor-series
expansion of Equation 60, we can write the log absolute Jacobian determinant of fφ as
follows:
\log \big| \det J_{f_φ}(z_t) \big| = \sum_{k=1}^{∞} \frac{(−1)^{k+1}}{k} \operatorname{Tr}\big\{ ε^k J_{g_φ(t,·)}^{k}(z_t) \big\} = ε \operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\} + O(ε^2).   (83)
Substituting the above into the change-of-variables formula and rearranging we get:
\log p(z_{t+ε}) = \log p(z_t) − ε \operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\} + O(ε^2)   (84)
⇒ \frac{\log p(z_{t+ε}) − \log p(z_t)}{ε} = −\operatorname{Tr}\big\{ J_{g_φ(t,·)}(z_t) \big\} + O(ε),   (85)
from which we directly obtain Equation 78 by letting ε → 0.
Chen et al. (2018) proposed an elegant alternative to the discrete, fixed-step methods men-
tioned above. For a general optimization target L(x; φ) (such as log likelihood), they show
that the gradient ∂L/∂zt with respect to the flow’s intermediate state zt can be character-
ized by the following ODE:
\frac{d}{dt} \frac{∂L}{∂z_t} = −\Big(\frac{∂L}{∂z_t}\Big)^{⊤} J_{g_φ(t,·)}(z_t).
This adjoint ODE can be solved backward in time alongside the state z_t, together with a similar integral for the parameter gradient ∂L/∂φ, all within a second call to the ODE solver. This yields the required gradients without the need to
backpropagate through the solver’s computational graph. This results in significant practi-
cal benefits since backpropagating through a solver is costly both in terms of computation
and memory requirements. Another benefit is that a more sophisticated solver can allocate
instance-dependent computation based on a user-specified tolerance, a common hyperpa-
rameter in most off-the-shelf solvers. At test time, this tolerance level can be tuned based
on runtime or other constraints on computation.
5. Generalizations
In its most general form, a transformation T : U → X only needs to conserve probability measure: for any set ω ⊆ U with image γ = T(ω) ⊆ X,
\int_{u ∈ ω} p_u(u)\, dμ(u) = \int_{x ∈ γ} p_x(x)\, dν(x).
The LHS is the probability that a sample from p_u(u) falls in ω ⊆ U, whereas the RHS is the probability that a sample from p_x(x) falls in γ ⊆ X.
Since γ is the image of ω under T, these two probabilities must be the same for any ω.
Standard flow-based models are a special case of this formula, where T is a diffeomorphism
(i.e. a differentiable invertible transformation with differentiable inverse), the sets U and
X are equal to the D-dimensional Euclidean space RD , and dµ(u) and dν(x) are equal to
the Lebesgue measure in RD . In that case, the conservation of probability measure can be
written as:
\int_{u ∈ ω} p_u(u)\, du = \int_{x ∈ γ} p_x(x)\, dx.   (89)
Since T is a diffeomorphism, we can express the LHS integral via the change of variable
u = T −1 (x) as follows (Rudin, 2006):
\int_{x ∈ γ} p_u\big(T^{−1}(x)\big) \big|\det J_{T^{−1}}(x)\big|\, dx = \int_{x ∈ γ} p_x(x)\, dx.   (90)
Since this must hold for any γ, the integrands are equal, which recovers the standard change-of-variables formula:
p_x(x) = p_u\big(T^{−1}(x)\big) \big|\det J_{T^{−1}}(x)\big|.   (91)
We can extend the transformation T to be many-to-one, whereby multiple u's are mapped
to the same x. One possibility is for T to be piecewise invertible, in which case we can
partition U into a countable collection of non-overlapping subsets {Ui }i∈I such that the
restriction of T to Ui is an invertible transformation Ti : Ui → X (Figure 8a). Then, the
conservation of probability measure can be written as:
\sum_{i ∈ I} \int_{u ∈ ω_i} p_u(u)\, du = \int_{x ∈ γ} p_x(x)\, dx,   (92)
where each ωi ⊆ Ui is the image of γ under Ti−1 . Using the change of variables x = Ti (u)
for each corresponding integral on the LHS, we obtain:
\sum_{i ∈ I} \int_{x ∈ γ} p_u\big(T_i^{−1}(x)\big) \big|\det J_{T_i^{−1}}(x)\big|\, dx = \int_{x ∈ γ} p_x(x)\, dx.   (93)
Since this holds for any γ, it follows that
p_x(x) = \sum_{i ∈ I} p_u\big(T_i^{−1}(x)\big) \big|\det J_{T_i^{−1}}(x)\big|.   (94)
The above can be interpreted as a mixture of flows, where the i-th flow has transformation
T_i, base density p_u(u) restricted to U_i, and mixture weight equal to Pr(u ∈ U_i).
Alternatively, we can extend T to be one-to-many by partitioning X into non-overlapping subsets {X_i}_{i ∈ I} and mapping u into X_i with probability p(i | u) via an invertible transformation T_i : U → X_i, whose inverses combine into a single map R : X → U. In that case, the conservation of probability measure can be written as:
\sum_{i ∈ ξ} \int_{u ∈ ω} p_u(u)\, p(i \mid u)\, du = \sum_{i ∈ ξ} \int_{x ∈ γ_i} p_x(x)\, dx   ∀ ξ ⊆ I, ∀ ω ⊆ U,   (95)
where γi ⊆ Xi is the image of ω under Ti . Using the change of variables x = Ti (u) for each
corresponding integral on the LHS, we obtain:
\sum_{i ∈ ξ} \int_{x ∈ γ_i} p_u\big(T_i^{−1}(x)\big)\, p\big(i \mid T_i^{−1}(x)\big) \big|\det J_{T_i^{−1}}(x)\big|\, dx = \sum_{i ∈ ξ} \int_{x ∈ γ_i} p_x(x)\, dx.   (96)
The above must be true for all ξ ⊆ I and γi ⊆ X , and by definition Ti−1 (x) = R(x) for all
x ∈ Xi , therefore:
px (x) = pu (R(x)) p(i(x) | R(x)) |det JR (x)| , (97)
where i(x) indexes the subset of X in which x belongs. The above can be interpreted
as a mixture of flows with non-overlapping components, where the i-th component uses
transformation T_i : U → X_i, base distribution p_u(u | i) ∝ p_u(u)\, p(i | u), and mixture weight
p(i) = \int p_u(u)\, p(i | u)\, du.
When both U and X are discrete (countable) sets and T is a bijection between them, the change of variables involves no Jacobian term, and the probability mass function of x is simply:
p_x(x) = p_u\big(T^{−1}(x)\big).   (100)
We will refer to the above type of normalizing flow as a discrete flow. Unlike standard flows,
discrete flows don’t involve a Jacobian term in their density calculation.
Hoogeboom et al. (2019a) proposed a discrete flow for U = X = ZD based on affine
autoregressive flows (Section 3.1). Specifically, they implement the transformer τ : Z → Z
as follows:
τ (zi ; βi ) = zi + round(βi ), (101)
where βi is given by the conditioner and is a function of z<i , and round(·) maps its input
to the nearest integer. Similarly, Tran et al. (2019) proposed a discrete flow for U =
X = {0, . . . , K − 1}D also based on affine autoregressive flows and whose transformer τ :
{0, . . . , K − 1} → {0, . . . , K − 1} is given by:
τ(z_i; α_i, β_i) = (α_i z_i + β_i) mod K,   (102)
where αi and βi are each given by the argmax of a K-dimensional vector outputted by
the conditioner. The above transformer can be shown to be bijective whenever αi and K
are coprime. To backpropagate through the discrete-valued functions round(·) and argmax,
both Hoogeboom et al. (2019a) and Tran et al. (2019) use the straight-through gradient
estimator (Bengio et al., 2013).
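A short NumPy check of the bijectivity condition stated above, using the modular-affine transformer written in Equation 102; the particular values of K and β used here are arbitrary choices of ours.

```python
import numpy as np
from math import gcd

def transformer(z, alpha, beta, K):
    """tau(z) = (alpha * z + beta) mod K, applied elementwise on {0, ..., K-1}."""
    return (alpha * z + beta) % K

K = 10
z = np.arange(K)
for alpha in range(1, K):
    out = transformer(z, alpha, beta=3, K=K)
    bijective = len(set(out.tolist())) == K
    print(f"alpha={alpha}  gcd(alpha, K)={gcd(alpha, K)}  bijective={bijective}")
# The transformer is bijective exactly when alpha and K are coprime (gcd = 1).
```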
Compared to flows on RD , discrete flows have notable theoretical limitations. As shown in
Section 2.2, under mild conditions, flows on RD can transform any base density pu (u) to
any target density px (x). However, this is not true for discrete flows—at least not as long
as they are defined using a bijective transformation. For example, if the base distribution
pu (u) of a discrete flow is uniform, px (x) will necessarily be uniform too. More generally,
due to bijectivity, for every x there must be a u such that px (x) = pu (u). In other words, a
discrete flow can never change the values of pu (u), only permute them. Recalling the change-
of-variables formula for discrete flows (Equation 100), the absence of a Jacobian term may
4. The probability mass function px (x) can be thought of as a density with respect to the counting measure.
seem to be a computational boon that makes discrete flows preferable to continuous ones.
However, this absence is also a restriction.
Another useful property of standard flows is that they can model any target density using
a fully factorized base distribution p_u(u) = \prod_{i=1}^{D} p_u(u_i). Such a base density can be eval-
uated and sampled from in parallel, which is important for scalability to high dimensions.
Nonetheless, a discrete flow with a fully factorized base distribution may not be able to
model all target densities, even if the base distribution is learned. To see this, consider the
target density px (x1 , x2 ) with x1 ∈ {0, 1}, x2 ∈ {0, 1} given by:
p_x(0, 0) = 0.1    p_x(0, 1) = 0.3
p_x(1, 0) = 0.2    p_x(1, 1) = 0.4.   (103)
Assuming U = X = {0, 1}2 , all a discrete flow can do to px (x1 , x2 ) is permute the 4 values
of the probability table:
0.1 0.3
0.2 0.4
into a new 2 × 2 probability table. Therefore, to model px (x1 , x2 ) with a factorized base
pu (u1 , u2 ) = pu (u1 )pu (u2 ), there must be a permutation of the above table such that the
permuted table is of rank 1 (so it can be factorized as the outer product of two vectors).
However, after checking all 4! = 24 permutations, we find that all permuted tables are of
rank 2, which shows that the above target can’t be modeled using a factorized base. This
limitation means that, in practice, we may need to explicitly incorporate dependencies in
the base distribution to increase the model’s capacity. Both Hoogeboom et al. (2019a) and
Tran et al. (2019) take this approach, modeling the base distribution autoregressively.
A possible way to overcome this limitation, pointed out by van den Berg et al. (2020), is to
embed U and X into extended spaces U 0 and X 0 , such that the base distribution factorizes
in the extended space. In the above example, we could take U 0 = X 0 = {0, 1, 2, 3}2 , so that
the probability table becomes:
0.1  0.3  0  0
0.2  0.4  0  0
0    0    0  0
0    0    0  0.
Then, a discrete flow on the extended space can rearrange the probability table into:
0.1  0  0  0
0.2  0  0  0
0.3  0  0  0
0.4  0  0  0,
which is rank 1 and thus can be factorized.
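The permutation argument above is easy to verify numerically. The following NumPy snippet checks that every one of the 4! = 24 permutations of the probability table has rank 2, and that the extended 4 × 4 table admits a rank-1 rearrangement.

```python
import itertools
import numpy as np

values = [0.1, 0.3, 0.2, 0.4]

# Every one of the 4! = 24 permutations of the 2x2 table has rank 2 ...
ranks = {np.linalg.matrix_rank(np.array(p).reshape(2, 2))
         for p in itertools.permutations(values)}
print(ranks)   # {2}

# ... whereas the 4x4 extended table can be rearranged into a rank-1 table.
extended = np.zeros((4, 4))
extended[:, 0] = [0.1, 0.2, 0.3, 0.4]
print(np.linalg.matrix_rank(extended))   # 1
```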
which gives the density on RD as a function of the density on the manifold. If we restrict
the range of T to X , we can define the inverse mapping T −1 : X → RD and then use it to
obtain the density on the manifold:
p_x(x) = p_u\big(T^{−1}(x)\big) \big[\det G\big(T^{−1}(x)\big)\big]^{−1/2}.   (107)
One approach is to start from a density on the base manifold U, map it to R^D using the inverse embedding map for U, perform any number of standard flow
steps on RD , and finally transform the resulting density on the target manifold X using
the embedding map for X . We illustrate this approach in Figure 9, where U and X are
2-dimensional spheres embedded in R3 .
An important limitation of the above approach is that it assumes the existence of a differ-
entiable invertible map T : RD → X , whose inverse maps the manifold onto RD . However,
such a map can only exist for manifolds that are homeomorphic, i.e. topologically equiv-
alent, to RD (Kobayashi and Nomizu, 1963). There are several manifolds that are not
homeomorphic to RD , including spheres, tori, and arbitrary products thereof, for which the
above approach will be problematic. In the sphere example in Figure 9, any embedding
map will create at least one coordinate singularity (akin to how all longitude lines meet at
Earth’s poles). Such coordinate singularities can manifest as points of infinite density, and
thus create numerical instabilities in practice.
An alternative approach was proposed by Falorsi et al. (2019) for the special case where
the manifold is also a group, in which case it is known as a Lie group. A Lie group can
be reparameterized with respect to its Lie algebra (i.e. the tangent space at the identity
element) using the exponential map, which maps the Lie algebra to the Lie group. Falorsi
et al. (2019) show that this parameterization can be done analytically in certain cases;
in general however, computing the exponential map has a cubic cost with respect to the
dimensionality D.
Due to T being a diffeomorphism, a flow must preserve topological properties, which means
that U and X must be homeomorphic, i.e. topologically equivalent (Kobayashi and Nomizu,
1963). For example, a flow cannot map R^D to the sphere S^D, or R^D to R^{D′} for D ≠ D′,
as these spaces have different topologies. For the same reason, every intermediate space
Zk along the flow’s trajectory must be topologically equivalent to both U and X . In the
previous section, we discussed how this constraint can create problems for flows that map
between Riemannian manifolds and Euclidean space.
This constraint’s effect on continuous-time flows is prominent. In this case there is a contin-
uum of intermediate spaces Zt for t ∈ [t0 , t1 ], all of which must be topologically equivalent.
For example, a continuous-time flow cannot implement a transformation T : R → R with
T (1) = −1 and T (−1) = 1, due to any such transformation requiring an intersection at
some intermediate stage (Dupont et al., 2019). One way to bypass this restriction is to
instantiate the flow in a lifted space, as was proposed by Dupont et al. (2019). Introducing
ρ auxiliary variables, the flow’s transformation is now a map between D + ρ dimensional
spaces, T : RD+ρ → RD+ρ . Although the lifted spaces must still be topologically equivalent,
T can now represent a wider class of functions when projected to the original D-dimensional
space. However, density evaluation in the original D-dimensional space can no longer be
done analytically, which removes a benefit of flows that makes them an attractive modeling
choice in the first place. Rather, the ρ auxiliary variables must be numerically integrated in
practice. Dupont et al. (2019)’s augmentation method can be thought of as defining a mix-
ture of flows whereby the auxiliary variables serve as an index; integrating out the auxiliary
variables is then akin to summing over the (infinitely many) mixture components.
For another example, if the target density is comprised of disconnected modes, the base
density must have the same number of disconnected modes. This is true for all flows, not
just continuous-time ones. If the base density has fewer modes than the target (which
will likely be the case in practice), the flow will be forced to assign a non-zero amount of
probability mass to the ‘empty’ space between the disconnected modes. To alleviate this
issue, Cornish et al. (2019) propose indexing the flow's transformation by a latent variable, so that the marginal model is no longer restricted to be a bijection.
In many real-world applications, we have domain knowledge of the symmetries of the target
density. For example, if we want to model a physical system composed of many interact-
ing particles, the physical interactions between particles (e.g. gravitational or electrostatic
forces) are invariant to translations. Another example is modeling the configuration of a
molecular structure, where there may be rotational symmetries in the molecule. Such sym-
metries result in a probability density (e.g. of molecular configurations) that is invariant to
specific transformations (e.g. rotations).
In such situations, it is desirable to build the known symmetries directly into the model, not
only so that the model has the right properties, but also so that estimating the model is more
data efficient. However, in general it is not trivial to build expressive probability densities
that are also invariant to a prescribed set of symmetries. Flows are a good candidate class
of models to combine with such domain knowledge because we have a lot of control of how
they deform an initial density. In this section, we explore ways to build flow-based models
that respect a prescribed set of symmetries.
We say that g is a symmetry of the density p : RD → [0, +∞) with representation an
invertible matrix Rg ∈ RD×D if the density remains invariant after transforming its input
by Rg , that is, if p(Rg x) = p(x) for all x ∈ RD . For example, a standard-normal density is
symmetric with respect to rotations, reflections, and axes permutations. It is easy to show
that |det R_g| = 1 for any symmetry g. Using the change of variables x′ = R_g^{−1} x, we have
the following:
1 = \int p(x)\, dx = \int p(R_g x′)\, |\det R_g|\, dx′ = |\det R_g| \int p(x′)\, dx′ = |\det R_g|.   (110)
The set of all symmetries of a target density is closed under composition, is associative, has
an identity element, and each element has an inverse—therefore forming a group G.
An important concept for dealing with symmetries in group theory is that of equivariance.
We say that a transformation T : RD → RD is equivariant with respect to the group G if
T (Rg u) = Rg T (u) for all g ∈ G and u ∈ RD —that is, we get the same result regardless of
whether we transform the input or the output of T by Rg . It is straightforward to see that
the composition of two equivariant transformations is also equivariant, and if T is bijective
its inverse is also equivariant.
These observations lead to the result shown in Lemma 1, which provides a general mecha-
nism for constructing flow-based models whose density is invariant with respect to a pre-
scribed symmetry group G.
Lemma 1 (Equivariant flows) Let px (x) be the density function of a flow-based model
with transformation T : RD → RD and base density pu (u). If T is equivariant with respect
to G and pu (u) is invariant with respect to G, then px (x) is invariant with respect to G.
Proof  From the equivariance of T, and hence of T^{−1}, we have that T^{−1}(R_g x) = R_g T^{−1}(x). Taking
the Jacobian of both sides, we obtain J_{T^{−1}}(R_g x)\, R_g = R_g\, J_{T^{−1}}(x), and since |det R_g| = 1, taking absolute determinants gives |det J_{T^{−1}}(R_g x)| = |det J_{T^{−1}}(x)|.
Finally, from the invariance of p_u(u) we have that p_u\big(R_g T^{−1}(x)\big) = p_u\big(T^{−1}(x)\big). Therefore,
p_x(R_g x) = p_u\big(T^{−1}(R_g x)\big) \big|\det J_{T^{−1}}(R_g x)\big| = p_u\big(T^{−1}(x)\big) \big|\det J_{T^{−1}}(x)\big| = p_x(x),
which shows that p_x(x) is invariant with respect to G.
In practice, taking the base density to be invariant with respect to a prescribed symmetry
group is usually easy. What is more challenging is constructing the transformation T to
be equivariant. One approach is based on invariant functions with respect to G, that is,
functions f for which f (Rg u) = f (u) for all g ∈ G and u ∈ RD . This approach is outlined
in Lemma 2 below.
Lemma 2  Let G be a symmetry group whose representation matrices R_g are orthogonal, and let f : R^D → R be differentiable and invariant with respect to G. Then the gradient map u ↦ ∇_u f(u) is equivariant with respect to G.
Proof  From the invariance of f we have that f(R_g u) = f(u). Taking the gradient on both
sides, we obtain:
R_g^{⊤} ∇_{R_g u} f(R_g u) = ∇_u f(u)  ⇒  ∇_{R_g u} f(R_g u) = R_g^{−⊤} ∇_u f(u)  ⇒  ∇_{R_g u} f(R_g u) = R_g ∇_u f(u),
where the last step uses the orthogonality of R_g, i.e. R_g^{−⊤} = R_g.
Lemma 2 gives a general mechanism for constructing equivariant transformations with re-
spect to symmetries with orthogonal representations. Several symmetries fall into this
category, including rotations, reflections and axes permutations. The practical significance
of Lemma 2 is that it is often easier to construct an invariant function than an equivariant
one. For instance, any function of the ℓ_2 norm ‖u‖ is invariant with respect to symmetries
with orthogonal representations, since ‖Ru‖ = ‖u‖ for any orthogonal matrix R.
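A numeric illustration of this mechanism (our own toy example, not from the paper): the gradient of a function that depends on u only through ‖u‖ is equivariant with respect to a random rotation, i.e. ∇f(Ru) = R ∇f(u).

```python
import numpy as np

def f(u):
    """Invariant function: depends on u only through its l2 norm."""
    return np.sin(np.linalg.norm(u))

def grad_f(u, eps=1e-6):
    # Central finite differences; accurate enough for this check.
    return np.array([(f(u + eps * e) - f(u - eps * e)) / (2 * eps)
                     for e in np.eye(len(u))])

rng = np.random.default_rng(0)
D = 4
u = rng.normal(size=D)

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # a random orthogonal matrix

lhs = grad_f(Q @ u)      # gradient evaluated at the rotated input
rhs = Q @ grad_f(u)      # rotated gradient
print(np.allclose(lhs, rhs, atol=1e-5))        # True: the gradient map is equivariant
```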
6. Applications
Normalizing flows have two primitive operations: density calculation and sampling. In turn,
flows are effective in any application requiring a probabilistic model with either of those
capabilities. In this section, we summarize applications to probabilistic modeling, inference,
supervised learning, and reinforcement learning.
Normalizing flows, due to their ability to be expressive while still allowing for exact likeli-
hood calculations, are often used for probabilistic modeling of data. For this application, we
assume access to a finite number of draws x from some unknown generative process p^*_x(x).
These draws constitute a size-N data set X = {x_n}_{n=1}^{N}. Our goal then is to fit a flow-based
model px (x; θ) to X such that the model serves as a good approximation for p∗x (x).
Often, the data {x_n}_{n=1}^{N} are discrete; for example, they could be images with pixel values in
{0, 1, . . . , 255}. Flow-based models are defined over continuous random variables (with the
exception of discrete flows, Section 5.3), so they are not directly applicable to discrete data.
To use flows with discrete data, we often dequantize {x_n}_{n=1}^{N} by adding continuous noise.
The noise distribution can be fixed (e.g. uniform in [0, 1] for the image example above), or
learned, as for example in variational dequantization (Ho et al., 2019).
One of the most popular methods for fitting p_x(x; θ) is maximum likelihood estimation, which
exploits the forward KL divergence first introduced in Section 2.3.1:
D_{KL}[\, p^*_x(x) \,‖\, p_x(x; θ)\, ] = −E_{p^*_x(x)}[\, \log p_x(x; θ)\, ] + const
≈ −\frac{1}{N} \sum_{n=1}^{N} \log p_x(x_n; θ) + const
= −\frac{1}{N} \sum_{n=1}^{N} \Big[ \log p_u\big(T^{−1}(x_n; φ); ψ\big) + \log \big|\det J_{T^{−1}}(x_n; φ)\big| \Big] + const.   (112)
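As a minimal illustration of maximum likelihood training, the sketch below fits a one-dimensional affine flow T(u) = μ + e^s u with a standard-normal base by gradient ascent on the objective of Equation 112; the data, the flow, and the hand-derived gradients are all toy choices of ours, not a recipe from the papers discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
data = 2.0 + 0.5 * rng.normal(size=1000)   # draws from the unknown p*_x

# Flow: x = T(u) = mu + exp(s) * u with a standard-normal base, so
# log p_x(x) = log N(u; 0, 1) - s  with  u = T^{-1}(x) = (x - mu) * exp(-s).
mu, s = 0.0, 0.0
lr = 0.1
for _ in range(500):
    u = (data - mu) * np.exp(-s)
    # Gradients of the mean log likelihood, derived by hand for this toy flow.
    grad_mu = np.mean(u) * np.exp(-s)
    grad_s = np.mean(u ** 2) - 1.0
    mu += lr * grad_mu
    s += lr * grad_s

print(mu, np.exp(s))   # approaches the data mean (2.0) and standard deviation (0.5)
```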
However, in principle any valid divergence or integral probability metric can be used as an
optimization target, as discussed in Section 2.3.4. Typically, there are two downstream uses
for the resulting model px (x; θ): density estimation and generation.
6.1.1 Density estimation
The first task is primarily quantitative: we can use the model to estimate densities, expecta-
tions, marginals, or other quantities of interest on never-before-seen data. Early work (Chen
and Gopinath, 2000; Tabak and Turner, 2013) considered only synthetic low-dimensional
cases, showing that normalizing flows could indeed represent skewed, multi-modal densities
as well as kernel density estimators could. It was Laparra et al. (2011) that first applied
Gaussianization to real data, using the density function to perform one-class classification
to detect urban areas in satellite images. Rippel and Adams (2013) next showed that their
deep flow model’s density could detect rotations and corruptions of images. Papamakarios
et al. (2017) performed a systematic comparison of their masked autoregressive flow on un-
conditional and conditional density estimation tasks, showing that the composition enabled
by their framework allows for better density estimation than other variants (namely MADE
and Real NVP). Grathwohl et al. (2019) performed similar experiments to validate the
effectiveness of continuous-time flows.
6.1.2 Generation
The second task is generation: sampling from the model novel instances that could have
plausibly been sampled from p∗x (x). In this latter case, the availability of exact likelihood
values is not the end goal so much as a principled training target that one would expect to
result in good generative performance. Generation has been a popular application of flows
in machine learning, and below we summarize their use for various categories of data.
Images & video Image generation has been given serious effort since the earliest work
on flows. Laparra et al. (2011), in the same work mentioned above, used Gaussianization
to generate gray-scale images of faces. Rippel and Adams (2013) also demonstrated early
success in generative performance as MNIST samples from their model looked rather com-
pelling. Dinh et al. (2015), through the use of their coupling parameterization, showed
further improvements including density estimation competitive with other high-capacity
models (such as deep mixtures of factor analysers) and respectable generation of SVHN dig-
its. In follow-up work, Dinh et al. (2017) increased the capacity of their model by including
scale transformations (instead of just translations), being the first to demonstrate that flows
could produce sharp, visually compelling full-color images. Specifically, Dinh et al. (2017)
showed compelling samples from models trained on CelebA, ImageNet (64×64), CIFAR-10,
and LSUN. Kingma and Dhariwal (2018), using a similar model but with additional convo-
lutional layers, further improved upon Dinh et al. (2017)’s results in density estimation and
generation of high-dimensional images. Continuous (Grathwohl et al., 2019) and residual
(Behrmann et al., 2019; Chen et al., 2019) flows have been demonstrated to produce sharp,
high-dimensional images as well. Finally, Kumar et al. (2019) propose a normalizing flow
for modeling video data, adapting the Glow architecture (Kingma and Dhariwal, 2018) to
synthesize raw RGB frames.
Audio The autoregressive model WaveNet (van den Oord et al., 2016a) demonstrated
impressive performance in audio synthesis. While WaveNet is not a normalizing flow, in
follow-up work van den Oord et al. (2018) defined a proper flow for audio synthesis by
distilling WaveNet into an inverse autoregressive flow so as to make test-time sampling
more efficient. Prenger et al. (2019) and Kim et al. (2019) have since formulated WaveNet
variants built from coupling layers to enable fast likelihood and sampling, in turn obviating
the need for van den Oord et al. (2018)’s post-training distillation step.
Text The most direct way to apply normalizing flows to text data is to define a discrete
flow over characters or a vocabulary. Tran et al. (2019) take this approach, showing per-
formance in character-level language modeling competitive to RNNs while having superior
generation runtime. An alternative approach that has found wider use is to define a latent
variable model with a discrete likelihood but a continuous latent space. A normalizing flow
can then be defined on the latent space as usual. Ziegler and Rush (2019) use such an
approach for character-level language modeling. Zhou et al. (2019), He et al. (2018), and
Jin et al. (2019) define normalizing flows on the continuous space of word embeddings as a
subcomponent of models for translation, syntactic structure, and parsing respectively.
6.2 Inference
In the previous section our focus was on modeling data and recovering its underlying distri-
bution. We now turn to inference: estimating unknown quantities within a model. The most
common setting is the computation of high-dimensional, analytically intractable integrals
of the form:
\int π(η)\, dη.   (113)
Bayesian inference usually runs into such an obstacle when computing the posterior’s nor-
malizing constant or when computing expectations under the posterior. Below we summa-
rize the use of flows for sampling, variational inference, and likelihood-free inference.
A standard tool for such integrals is importance sampling (IS), which rewrites the integral as an expectation under a proposal distribution and estimates it with samples:
\int π(η)\, dη = E_{q(η)}\Big[ \frac{π(η)}{q(η)} \Big] ≈ \frac{1}{S} \sum_{s=1}^{S} \frac{π(η̂_s)}{q(η̂_s)},   (114)
where q(η) is a user-specified density function and η̂_s is a sample from q(η). Clearly, IS
requires both sampling and density evaluation. Since both operations are tractable for many
flows, they make for an attractive model from which to construct the proposal.
Müller et al. (2019) do just this: they implement q(η) using normalizing flows. The prac-
ticality of IS crucially depends on the choice of proposal and thus the flow’s parameters
must be optimized. Müller et al. (2019) discuss two strategies for fitting q(η). When π(η)
can be interpreted as an unnormalized density, the first is to minimize the KL divergence
between the normalized target and the flow: D_{KL}[\, p(η) \,‖\, q(η)\, ], where p(η) = π(η)/Z with
Z = \int π(η)\, dη being an intractable normalizing constant (equal to the very quantity that we
wish to compute). While D_{KL}[\, p(η) \,‖\, q(η)\, ] cannot be computed, IS can be used to compute
the divergence’s gradient with respect to the flow’s parameters. The second is to minimize
the variance of the IS estimator directly. When π(η) is again an unnormalized density,
this is equivalent to minimizing a χ2 -divergence between the proposal and p(η) = π(η)/Z.
Flows have also been used for the proposal distribution in similar ways by Noé et al. (2019)
and Wirnsberger et al. (2020).
The related technique of rejection sampling (RS) aims to draw samples from p(η) = π(η)/Z,
where π(η) is again an unnormalized density. Both density evaluation and sampling are
required from the proposal in RS, again making normalizing flows well-suited. Bauer and
Mnih (2019) use the Real NVP architecture (Dinh et al., 2017) to parameterize a pro-
posal distribution for RS since the coupling layers allow for fast density evaluation and
sampling.
The application of flows in Markov chain Monte Carlo (MCMC) precedes the appearance
of flows in deep learning by at least a few decades. One prominent example is Hamiltonian
Monte Carlo (HMC), also known as Hybrid Monte Carlo (Duane et al., 1987; Neal, 2010).
HMC operates on the ‘phase space’ (η, v), where η are the variables of interest and v are
additional ‘momentum’ variables of the same dimensionality as η. HMC generates samples
from a joint distribution p(η, v) constructed so that its marginal over η is the distribution
of interest. Central to HMC is the Hamiltonian defined by H(η, v) = − log p(η, v). Given
a state (η, v), HMC proposes a new state (η 0 , v0 ) = T (η, v) deterministically, where T is a
Hamiltonian flow followed by negation of the momentum variables. The proposed state is
then accepted/rejected using the usual Metropolis–Hastings step. The Hamiltonian flow is
the continuous-time flow generated by the following ODE:
\frac{d(η, v)}{dt} = \Big( \frac{∂H}{∂v},\; −\frac{∂H}{∂η} \Big).   (115)
This flow is volume-preserving, meaning that its absolute Jacobian determinant is 1 every-
where, which, in combination with the negation of the momentum variables, ensures that
the proposal is symmetric and thus cancels in the Metropolis–Hastings ratio.
It is also possible to construct MCMC algorithms with flows other than the Hamiltonian flow
described above. One example is A-NICE-MC (Song et al., 2017), which is similar to HMC
but constructs the proposal using an arbitrary volume-preserving flow T (·; φ) parameterized
by φ. Song et al. (2017) use the NICE model of Dinh et al. (2015), but their method applies
to any volume-preserving flow more generally. Given a state (η, v), a new state is proposed
which is equal to either T (η, v; φ) or T −1 (η, v; φ) with equal probability. This proposal is
symmetric and so it cancels in the Metropolis–Hastings ratio. The parameters φ are tuned
to the distribution of interest using adversarial training.
Another way of applying flows to MCMC is to use the flow to reparameterize the target
distribution. It is well understood that the efficiency of MCMC drastically depends on the
target distribution being easy to explore. If the target is highly skewed and/or multi-modal,
the performance of MCMC suffers, resulting in slow mixing and convergence. Normalizing
flows can effectively ‘smooth away’ these pathologies in the target’s geometry by allowing
MCMC to be run on the simpler and better-behaved base density. Given the unnormalized
target π(η), we can reparameterize the model in terms of a base density pu (u) such that
η = T (u; φ). Assuming a symmetric proposal distribution for simplicity, applying the
Metropolis–Hastings ratio to the reparameterized model yields:
r(û^*; û_t) = \frac{p_u(û^*)}{p_u(û_t)} = \frac{π\big(T(û^*; φ)\big)\, \big|\det J_T(û^*; φ)\big|}{π\big(T(û_t; φ)\big)\, \big|\det J_T(û_t; φ)\big|},   (116)
where û∗ denotes the proposed value and ût the current value. Assuming T is sufficiently
powerful such that T −1 (η; φ) is truly distributed according to the simpler base distribution,
exploring the target should become considerably easier. In practice, it is still useful to
generate proposals via Hamiltonian dynamics rather than from a simple isotropic proposal,
even if pu (u) is isotropic (Hoffman et al., 2019).
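A toy sketch of running Metropolis–Hastings on a reparameterized target as in Equation 116, assuming a fixed flow T(u) = exp(u) and an unnormalized Gamma target (both our own choices); in practice T would be a learned flow and the proposal would typically come from Hamiltonian dynamics rather than a random walk.

```python
import numpy as np

def log_pi(eta):
    """Unnormalized log density of a Gamma(shape=3, rate=2) target on (0, inf)."""
    return 2.0 * np.log(eta) - 2.0 * eta

# Fixed flow T(u) = exp(u): maps R onto (0, inf), with log|det J_T(u)| = u.
def log_target_u(u):
    # Pulled-back unnormalized log density, as in the ratio of Equation 116.
    return log_pi(np.exp(u)) + u

rng = np.random.default_rng(0)
u, samples = 0.0, []
for _ in range(20_000):
    u_prop = u + 0.5 * rng.normal()                   # symmetric random-walk proposal
    log_r = log_target_u(u_prop) - log_target_u(u)    # log of the Metropolis-Hastings ratio
    if np.log(rng.uniform()) < log_r:
        u = u_prop
    samples.append(np.exp(u))                         # map back to eta = T(u)

print(np.mean(samples), np.var(samples))   # Gamma(3, 2) has mean 1.5 and variance 0.75
```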
While the reparameterization above is relatively straightforward, there is still the crucial
issue of how to set or optimize the parameters of T . Titsias (2017) interleaves runs of the
chain with updates to the flow’s parameters, performing optimization by maximizing the
unnormalized reparameterized target under the last sample from a given run:
\arg\max_{φ}\; \log π\big(T(û_{t_{final}}; φ)\big) + \log \big|\det J_T(û_{t_{final}}; φ)\big|.   (117)
However, this choice is a heuristic, and is not guaranteed to encourage the chain’s mixing.
As Hoffman et al. (2019) point out, such a choice may emphasize mode finding. As an
alternative, Hoffman et al. (2019) propose fitting the flow model to p(η) = π(η)/Z first via
variational inference and then running Hamiltonian Monte Carlo on the reparameterized
model, using a sample from the flow to initialize the chain.
We can also use normalizing flows to fit distributions over latent variables or model param-
eters. Specifically, flows can usefully serve as posterior approximations for local (Rezende
and Mohamed, 2015; van den Berg et al., 2018; Kingma et al., 2016; Tomczak and Welling,
2016) and global (Louizos and Welling, 2017) variables.
For example, suppose we wish to infer variables η given some observation x. In variational
inference with normalizing flows, we use a (trained) flow-based model q(η; φ) to approximate
the posterior p(η | x), defined as a flow-based density:
q(η; φ) = q_u\big(T^{−1}(η; φ)\big) \big|\det J_{T^{−1}}(η; φ)\big|,   (118)
where q_u(u) is the base distribution (which here is typically fixed) and T(·; φ) is the trans-
formation (parameterized by φ). If we want to approximate the posterior for multiple values
of x, we can make the flow model conditional on x and amortize the cost of inference across
values of x. The flow is trained by maximizing the evidence lower bound (ELBO), which
can be written as:
log p(x) ≥ Eq(η;φ) [ log p(x, η) ] − Eq(η;φ) [ log q(η; φ) ]
= Equ (u) [ log p(x, T (u; φ)) ] − Equ (u) [ log qu (u) ] + Equ (u) [ log |det JT (u; φ)| ] (119)
= Equ (u) [ log p(x, T (u; φ)) ] + H [ qu (u) ] + Equ (u) [ log |det JT (u; φ)| ] ,
where H [ qu (u) ] is the differential entropy of the base distribution, which is a constant with
respect to φ. The expectation terms can be estimated by Monte Carlo, using samples from
the base distribution as follows:
S
1X
Equ (u) [ log p(x, T (u; φ)) ] ≈ log p(x, T (ûs ; φ)),
S
s=1
(120)
S
1X
Equ (u) [ log |det JT (u; φ)| ] ≈ log |det JT (ûs ; φ)| .
S
s=1
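The sketch below evaluates the Monte Carlo ELBO estimate of Equations 119–120 for a toy conjugate-Gaussian model and a one-dimensional affine flow; the joint density, the flow, and the evaluation points are assumptions of this illustration, and the constants of the log joint are dropped, so ELBO values are reported up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint: p(x, eta) = N(eta; 0, 1) * N(x; eta, 0.5^2), with observation x = 1.3.
x_obs = 1.3

def log_joint(eta):
    # log p(x_obs, eta) up to additive constants (the Gaussian normalizers are dropped).
    return -0.5 * eta ** 2 - 0.5 * ((x_obs - eta) / 0.5) ** 2

# Variational flow: eta = T(u; phi) = mu + exp(s) * u with a standard-normal base q_u.
def elbo_estimate(mu, s, num_samples=5000):
    u = rng.normal(size=num_samples)
    eta = mu + np.exp(s) * u
    entropy_qu = 0.5 * np.log(2.0 * np.pi * np.e)   # H[q_u] for a standard normal
    log_abs_det = s                                 # log|det J_T(u)| = s for every u
    return np.mean(log_joint(eta)) + entropy_qu + log_abs_det

# The exact posterior is N(1.04, 0.2); the ELBO is larger near those parameter values.
print(elbo_estimate(mu=0.0, s=0.0))
print(elbo_estimate(mu=1.04, s=0.5 * np.log(0.2)))
```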
Models are often implicit, meaning they are not defined in terms of a likelihood func-
tion p(x | η) that describes how observable variables x depend on model parameters η.
Rather, they come in the form of a simulator that takes in parameters η and simulates vari-
ables x (Diggle and Gratton, 1984). Such simulator-based models are common in scientific
fields such as cosmology (Alsing et al., 2018), high-energy physics (Brehmer et al., 2018),
and computational neuroscience (Gonçalves et al., 2020). Inferring the parameters η of a
simulator-based model given observed data x is often referred to as likelihood-free inference
(Papamakarios, 2019), simulation-based inference (Cranmer et al., 2020), or approximate
Bayesian computation (Beaumont et al., 2002; Beaumont, 2010). The typical assumption
in likelihood-free inference is that it is easy to simulate variables x from the model given η,
but it is intractable to evaluate the likelihood p(x | η).
Normalizing flows are a natural fit for likelihood-free inference, especially flows condi-
tioned on side information (e.g. Winkler et al., 2019; Ardizzone et al., 2019). Assuming
a tractable prior distribution p(η) over the parameters of interest, we can generate a data
set {(η_n, x_n)}_{n=1}^{N}, where η_n ∼ p(η) and x_n is simulated from the model with parameters
ηn . In other words, (ηn , xn ) is a joint sample from p(η, x) = p(η) p(x | η). Then, using
the techniques described in Section 6.1, we can fit a flow-based model q(η | x) conditioned
on x to the generated data set {(η_n, x_n)}_{n=1}^{N} in order to approximate the posterior p(η | x)
(Greenberg et al., 2019; Gonçalves et al., 2020). Alternatively, we can fit a flow-based model
q(x | η) conditioned on η in order to approximate the intractable likelihood p(x | η) (Pa-
pamakarios et al., 2019). In either case, the trained flow-based model is useful for various
downstream tasks that require density evaluation or sampling.
Flows also have applications as building blocks for downstream tasks. We discuss two such
cases, namely supervised learning and reinforcement learning.
We can think of the flow as defining the first L − 1 layers of the architecture, and of
the last layer as a (generalized) linear model operating on the features u = T −1 (x). The
second term then makes these features distribute according to pu (u), which could be viewed
as a regularizer
on the feature space. For instance, if p_u(u) is standard Normal, then
log p_u\big(T^{−1}(x)\big) effectively acts as an ℓ_2 penalty. The Jacobian determinant serves its usual
role in ensuring the density is properly normalized. Hence, a hybrid model in the form of
Equation 121 can compute the joint density of labels and features at little additional cost to
standard forward propagation. The extra computation is introduced by the right-most two
terms and depends on their particular forms. If pu (u) is standard Normal and T defined
via coupling layers, then the additional computation is O(DL) with D being the number
of input dimensions and L the number of layers.
Finally, in this section we give two examples of how normalizing flows have been used thus
far for reinforcement learning (RL).
Reparameterized policies The most popular use for flows in RL has been to model
(continuous) policies. An action at ∈ RD taken in state st at time t is sampled according
to ât = T (ût ; st , φ), ût ∼ pu (ut ) where pu denotes the base density. The corresponding
conditional density is written as:
π(a_t | s_t) = p_u\big(T^{−1}(a_t; s_t, φ)\big) \big|\det J_{T^{−1}}(a_t; s_t, φ)\big|.
Haarnoja et al. (2018) and Ward et al. (2019) use such a policy within the maximum entropy
and soft actor-critic frameworks respectively.
Imitation learning Schroecker et al. (2019) use normalizing flows within the imitation-
learning paradigm for control problems. Given the observed expert (continuous) state-action
pairs (s̄, ā), a core challenge to imitation learning is accounting for unobserved intermediate
states. Schroecker et al. (2019) use conditional flows to simulate these intermediate states
and their corresponding actions. Specifically, for a state s̄_{t+j} (j ≥ 1), predecessor state-action
pairs are sampled from a learned model that factorizes into a conditional density over the predecessor state and a conditional density over the corresponding action;
both conditional densities are defined by MAFs (Papamakarios et al., 2017).
7. Conclusions
We have described normalizing flows and their use for probabilistic modeling and inference.
We addressed key issues such as their expressive power (Section 2.2) and the fundamentals
underlying their construction (both in discrete and continuous time). We also described
the general principle of probability transformations (Section 5.1) and its implications for
defining flows beyond Euclidean space. In particular, we showed that discrete domains,
mixtures of flows, and extensions to Riemannian manifolds all follow from this generalized
perspective. Lastly, we summarized the primary applications of flows (Section 6): tasks
ranging from density estimation to likelihood-free inference to classification.
While many flow designs and specific implementations will inevitably become out-of-date as
work on normalizing flows continues, we have attempted to isolate foundational ideas that
will continue to guide the field well into the future. One of these keystone principles is the
chain rule of probability and its relationship to transformations with a triangular Jacobian.
Autoregressive flows stand on these two pillars, with the former underlying their expressive
power and the latter providing their efficient implementation. Similarly, the Banach fixed-
point theorem provides the mathematical foundation for contractive residual flows. While
alternative parameterizations of the translation function gφ or normalization strategies may
be developed, the underlying Lipschitz constraints cannot be deserted without violating the
fixed-point theorem.
Throughout the text we emphasized crucial implementation notes that guide a successful
application of flows. Perhaps of foremost importance is determining the computational con-
straints on evaluating the forward and inverse transformations. As we showed in Section 2,
sampling and density evaluation place distinct demands on the transformation. Since flows
have no inherent requirement as to whether T should implement U → X or vice versa, we
are free to choose which direction better suits our application. If either sampling or den-
sity estimation is the primary objective—but not both—then autoregressive flows present
an attractive, flexible class of model. Yet if both sampling and density evaluation must
be done often or quickly, then implementing the autoregressive flow with a coupling-based
conditioner will make both operations efficient at the cost of expressive power. On the
other hand, non-autoregressive flows such as linear and residual flows allow for interaction
between all dimensions at each step in the flow. While these interactions can be useful at
times—making linear flows good for permuting variables between successive autoregressive
flows and residual flows highly expressive—other limitations arise. For instance, contrac-
tive residual flows typically require iterative algorithms for sampling and density evaluation.
As all flow constructions present trade-offs of some form, we hope this article provides a
coherent and accessible summary to guide practitioners through these choice points.
Looking forward, the obstacles preventing wider application of normalizing flows are simi-
lar in spirit to those faced by any probabilistic model. However, unlike other probabilistic
models that require approximate inference as they scale, flows usually admit analytical cal-
culations and exact sampling even in high dimensions. Rather, the difficulty is transferred
to the construction of the flow’s transformation: how can we define ever more flexible trans-
formations while keeping exact density evaluation and sampling computationally tractable?
This is currently the focus of much work and will likely remain a core issue for some time.
More study of the theoretical properties of flows is also needed. Understanding their ap-
proximation capabilities for finite sample and finite depth settings would help practitioners
select which flow classes are best for a given application. Our discussion of generalizations
in Section 5 will hopefully provide grounding as well as inspiration for this next wave of
developments in the theory and application of normalizing flows.
Acknowledgments
We would like to thank Ivo Danihelka for his invaluable feedback on the manuscript, and the
anonymous reviewers for their many improvement suggestions. We also thank Hyunjik Kim
and Sébastien Racanière for useful discussions on a wide variety of flow-related topics.
Let px (x) be the distribution induced by a flow with transformation T and base distribution
pu (u). Also, let p∗u (u) be the distribution induced by the inverse flow with transformation
T −1 and base distribution p∗x (x). Using the formula for the density of a flow-based model
and a change of variables, we have the following:
D_{KL}[\, p^*_x(x) \,‖\, p_x(x)\, ] = E_{p^*_x(x)}\big[ \log p^*_x(x) − \log p_x(x) \big]
= E_{p^*_x(x)}\big[ \log p^*_x(x) − \log |\det J_{T^{−1}}(x)| − \log p_u\big(T^{−1}(x)\big) \big]
= E_{p^*_u(u)}\big[ \log p^*_x(T(u)) + \log |\det J_T(u)| − \log p_u(u) \big]   (124)
= E_{p^*_u(u)}\big[ \log p^*_u(u) − \log p_u(u) \big]
= D_{KL}[\, p^*_u(u) \,‖\, p_u(u)\, ].
Similarly, we have the following:
D_{KL}[\, p_x(x) \,‖\, p^*_x(x)\, ] = E_{p_x(x)}\big[ \log p_x(x) − \log p^*_x(x) \big]
= E_{p_x(x)}\big[ \log p_u\big(T^{−1}(x)\big) + \log |\det J_{T^{−1}}(x)| − \log p^*_x(x) \big]
= E_{p_u(u)}\big[ \log p_u(u) − \log |\det J_T(u)| − \log p^*_x(T(u)) \big]   (125)
= E_{p_u(u)}\big[ \log p_u(u) − \log p^*_u(u) \big]
= D_{KL}[\, p_u(u) \,‖\, p^*_u(u)\, ].
One way to parameterize the weight matrix of an invertible linear flow z′ = Wz is via the PLU decomposition:
W = PLU,   (126)
where P is a permutation matrix, L is lower triangular, U is upper triangular, and all three
are of size D ×D. We can easily constrain W to be invertible by restricting L and U to have
positive diagonal entries. In that case, the absolute determinant of W can be computed in
O(D) time by:
|\det W| = \prod_{i=1}^{D} L_{ii} U_{ii}.   (127)
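A NumPy/SciPy sketch of this idea: apply W = PLU to a vector with the triangular and permutation factors, and read off log|det W| from the diagonals in O(D). Here the factors come from SciPy's LU decomposition of a random matrix for convenience; in a flow they would be free parameters with positivity-constrained diagonals, as described above.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
D = 6
W = rng.normal(size=(D, D))

P, L, U = lu(W)            # W = P L U, with L unit-lower-triangular and U upper triangular
z = rng.normal(size=D)

# Matrix-vector products with the factors, and an O(D) log-determinant from the diagonals.
z_out = P @ (L @ (U @ z))
log_abs_det = np.sum(np.log(np.abs(np.diag(L) * np.diag(U))))

print(np.allclose(z_out, W @ z))                        # True
print(log_abs_det, np.log(np.abs(np.linalg.det(W))))    # equal up to rounding error
```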
Inverting the linear system QRz = z′ can be done in O(D^2) time by first multiplying by
Q^{⊤} and then solving a triangular system by forward/backward substitution. We can also
think of the QR flow as a linear autoregressive flow followed by an orthogonal flow; in the
next paragraph we will discuss in more detail how to parameterize an orthogonal flow. The
QR flow was proposed by Hoogeboom et al. (2019b).
parameterization also takes O(D^3) time to compute, and can only parameterize those
References
Justin Alsing, Benjamin D. Wandelt, and Stephen M. Feeney. Massive optimal data
compression and density estimation for scalable, likelihood-free inference in cosmology.
Monthly Notices of the Royal Astronomical Society, 477(3):2874–2885, 2018.
Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe.
Guided image generation with conditional invertible neural networks. ArXiv Preprint
arXiv:1907.02392, 2019.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial
networks. In Proceedings of the 34th International Conference on Machine Learning,
pages 214–223, 2017.
Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In
Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics,
pages 66–75, 2019.
Mark A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual
Review of Ecology, Evolution, and Systematics, 41(1):379–406, 2010.
Mark A. Beaumont, Wenyang Zhang, and David J. Balding. Approximate Bayesian com-
putation in population genetics. Genetics, 162:2025–2035, 2002.
Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David K. Duvenaud, and Jörn-Henrik
Jacobsen. Invertible residual networks. In Proceedings of the 36th International Confer-
ence on Machine Learning, pages 573–582, 2019.
Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer
neural networks. In Advances in Neural Information Processing Systems, pages 400–406,
2000.
Mikolaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying
MMD GANs. In International Conference on Learning Representations, 2018.
David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for
statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Johann Brehmer, Kyle Cranmer, Gilles Louppe, and Juan Pavez. Constraining effective
field theories with machine learning. Physical Review Letters, 121(11):111801, 2018.
Richard L. Burden and J. Douglas Faires. Numerical Analysis. The Prindle, Weber and
Schmidt Series in Mathematics. PWS-Kent Publishing Company, fourth edition, 1989.
Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From Knothe’s transport
to Brenier’s map and a continuation method for optimal transport. SIAM Journal on
Mathematical Analysis, 41(6):2554–2576, 2010.
Ricky T. Q. Chen and David K. Duvenaud. Neural networks with cheap differential oper-
ators. In Advances in Neural Information Processing Systems, 2019.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural
ordinary differential equations. In Advances in Neural Information Processing Systems,
pages 6571–6583, 2018.
Ricky T. Q. Chen, Jens Behrmann, David K. Duvenaud, and Jörn-Henrik Jacobsen. Resid-
ual flows for invertible generative modeling. In Advances in Neural Information Processing
Systems, 2019.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
Earl A. Coddington and Norman Levinson. Theory of ordinary differential equations. In-
ternational Series in Pure and Applied Mathematics. McGraw-Hill, 1955.
Rob Cornish, Anthony L. Caterini, George Deligiannidis, and Arnaud Doucet. Localised
generative flows. ArXiv Preprint arXiv:1909.13833, 2019.
Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based in-
ference. Proceedings of the National Academy of Sciences, 2020. doi: 10.1073/pnas.
1912789117.
Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan.
Comparison of maximum likelihood and GAN-based training of Real NVPs. ArXiv
Preprint arXiv:1705.05263, 2017.
Nicola De Cao, Ivan Titov, and Wilker Aziz. Block neural autoregressive flow. In Proceedings
of the 35th Conference on Uncertainty in Artificial Intelligence, 2019.
Zhiwei Deng, Megha Nawhal, Lili Meng, and Greg Mori. Continuous graph flow. ArXiv
Preprint arXiv:1908.02436, 2019.
Peter J. Diggle and Richard J. Gratton. Monte Carlo methods of inference for implicit
statistical models. Journal of the Royal Statistical Society. Series B (Methodological),
pages 193–227, 1984.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent compo-
nents estimation. ICLR Workshop Track, 2015.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real
NVP. In International Conference on Learning Representations, 2017.
Laurent Dinh, Jascha Sohl-Dickstein, Razvan Pascanu, and Hugo Larochelle. A RAD ap-
proach to deep mixture models. ICLR Workshop on Deep Generative Models for Highly
Structured Data, 2019.
Hadi M. Dolatabadi, Sarah Erfani, and Christopher Leckie. Invertible generative modeling
using linear rational splines. In Proceedings of the 23nd International Conference on
Artificial Intelligence and Statistics, 2020.
Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid
Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Advances
in Neural Information Processing Systems, 2019.
Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Cubic-spline flows.
ICML Workshop on Invertible Neural Networks and Normalizing Flows, 2019a.
Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows.
In Advances in Neural Information Processing Systems, 2019b.
Luca Falorsi, Pim de Haan, Tim R. Davidson, and Patrick Forré. Reparameterizing distri-
butions on Lie groups. In Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics, pages 3244–3253, 2019.
Brendan J. Frey. Graphical models for machine learning and digital communication. MIT
Press, 1998.
Mevlana C. Gemici, Danilo Jimenez Rezende, and Shakir Mohamed. Normalizing flows on
Riemannian manifolds. NeurIPS Workshop on Bayesian Deep Learning, 2016.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked au-
toencoder for distribution estimation. In Proceedings of the 32nd International Conference
on Machine Learning, pages 881–889, 2015.
Adam Golinski, Mario Lezcano-Casado, and Tom Rainforth. Improving normalizing flows
via better orthogonal parameterizations. ICML Workshop on Invertible Neural Networks
and Normalizing Flows, 2019.
Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible
residual network: Backpropagation without storing activations. In Advances in Neural
Information Processing Systems, pages 2214–2224, 2017.
Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural
networks by enforcing Lipschitz continuity. ArXiv Preprint arXiv:1804.04368, 2018.
Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David K. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
Alex Graves. Generating sequences with recurrent neural networks. ArXiv Preprint
arXiv:1308.0850, 2013.
Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Combining maximum like-
lihood and adversarial learning in generative models. In Proceedings of the 32nd AAAI
Conference on Artificial Intelligence, 2018.
Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space
policies for hierarchical reinforcement learning. In Proceedings of the 35th International
Conference on Machine Learning, pages 1851–1860, 2018.
Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. Unsupervised learning of syn-
tactic structure with invertible neural projections. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, pages 1292–1302, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 770–778, 2016.
Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. MoGlow: Probabilistic and
controllable motion synthesis using normalising flows. ArXiv Preprint arXiv:1905.06598,
2019.
Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Im-
proving flow-based generative models with variational dequantization and architecture
design. In Proceedings of the 36th International Conference on Machine Learning, pages
2722–2730, 2019.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation,
9(8):1735–1780, 1997.
Matthew Hoffman, Pavel Sountsov, Joshua V. Dillon, Ian Langmore, Dustin Tran, and
Srinivas Vasudevan. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using
neural transport. ArXiv Preprint arXiv:1903.03704, 2019.
Shion Honda, Hirotaka Akita, Katsuhiko Ishiguro, Toshiki Nakanishi, and Kenta Oono.
Graph residual flow for molecular graph generation. ArXiv Preprint arXiv:1909.13521,
2019.
Emiel Hoogeboom, Jorn W. T. Peters, Rianne van den Berg, and Max Welling. Integer
discrete flows and lossless compression. In Advances in Neural Information Processing
Systems, 2019a.
Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. In Proceedings of the 36th International Conference on Machine Learning, pages 2771–2780, 2019b.
Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autore-
gressive flows. In Proceedings of the 35th International Conference on Machine Learning,
pages 2078–2087, 2018.
Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics—Simulation and Computation, 19(2):433–450, 1990.
Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Exis-
tence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
David Inouye and Pradeep Ravikumar. Deep density destructors. In Proceedings of the
35th International Conference on Machine Learning, pages 2167–2175, 2018.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. In Proceedings of the 32nd International Confer-
ence on Machine Learning, pages 448–456, 2015.
Jörn-Henrik Jacobsen, Arnold W. M. Smeulders, and Edouard Oyallon. i-RevNet: Deep
invertible networks. In International Conference on Learning Representations, 2018.
Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge. Excessive
invariance causes adversarial vulnerability. In International Conference on Learning Rep-
resentations, 2019.
Priyank Jaini, Kira A. Selby, and Yaoliang Yu. Sum-of-squares polynomial flow. In Pro-
ceedings of the 36th International Conference on Machine Learning, pages 3009–3018,
2019.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, and William Schuler. Un-
supervised learning of PCFGs with normalizing flow. In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, pages 2442–2452, 2019.
Richard M. Johnson. The minimal transformation to orthonormality. Psychometrika, 31:61–66, 1966.
Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals,
Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In Proceedings of the 34th
International Conference on Machine Learning, pages 1771–1779, 2017.
Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A generative flow for raw audio. In Proceedings of the 36th International Conference on Machine Learning, pages 3370–3378, 2019.
Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1 × 1
convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224,
2018.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International
Conference on Learning Representations, 2014a.
Diederik P. Kingma and Max Welling. Efficient gradient-based inference through transfor-
mations between Bayes nets and neural nets. In Proceedings of the 31st International
Conference on Machine Learning, pages 1782–1790, 2014b.
Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
Shoshichi Kobayashi and Katsumi Nomizu. Foundations of differential geometry, volume 1.
Interscience Publishers, 1963.
Ivan Kobyzev, Simon Prince, and Marcus A. Brubaker. Normalizing flows: An introduction
and review of current methods. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2020. doi: 10.1109/TPAMI.2020.2992934.
Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: Sampling configurations for
multi-body systems with symmetric energies. ArXiv Preprint arXiv:1910.00753, 2019.
Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine,
Laurent Dinh, and Diederik P. Kingma. VideoFlow: A flow-based generative model for
video. ICML Workshop on Invertible Neural Networks and Normalizing Flows, 2019.
Valero Laparra, Gustavo Camps-Valls, and Jesús Malo. Iterative Gaussianization: From
ICA to random rotations. IEEE Transactions on Neural Networks, 22(4):537–549, 2011.
Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics,
pages 29–37, 2011.
Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Proceedings of the 36th International Conference on Machine Learning, 2019.
Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian
neural networks. In Proceedings of the 34th International Conference on Machine Learn-
ing, pages 2218–2227, 2017.
Xuezhe Ma, Xiang Kong, Shanghang Zhang, and Eduard Hovy. MaCow: Masked convolu-
tional generative flow. In Advances in Neural Information Processing Systems, 2019.
Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve
neural network acoustic models. ICML Workshop on Deep Learning for Audio, Speech,
and Language Processing, 2013.
Kaushalya Madhawa, Katsuhiko Ishiguro, Kosuke Nakago, and Motoki Abe. GraphNVP: An invertible flow model for generating molecular graphs. ArXiv Preprint arXiv:1905.11600, 2019.
Murray Marshall. Positive polynomials and sums of squares. American Mathematical So-
ciety, 2008.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010.
John W. Milnor and David W. Weaver. Topology from the differentiable viewpoint. Princeton
University Press, 1997.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral nor-
malization for generative adversarial networks. In International Conference on Learning
Representations, 2018.
Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models.
NeurIPS Workshop on Adversarial Training, 2016.
Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neu-
ral importance sampling. ACM Transactions on Graphics, 38(5):145, 2019.
Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshmi-
narayanan. Hybrid models with deep and invertible features. In Proceedings of the 36th
International Conference on Machine Learning, pages 4723–4732, 2019.
Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte
Carlo, 54:113–162, 2010.
Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu. Boltzmann generators: Sampling
equilibrium states of many-body systems with deep learning. Science, 365, 2019.
Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric
Xing, and Jeff Schneider. Transformation autoregressive networks. In Proceedings of the
35th International Conference on Machine Learning, pages 3898–3907, 2018.
George Papamakarios. Neural density estimation and likelihood-free inference. PhD thesis,
University of Edinburgh, 2019. Available at https://arxiv.org/abs/1910.13233.
George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for
density estimation. In Advances in Neural Information Processing Systems, pages 2338–
2347, 2017.
George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast
likelihood-free inference with autoregressive flows. In Proceedings of the 22nd Interna-
tional Conference on Artificial Intelligence and Statistics, pages 837–848, 2019.
Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative
network for speech synthesis. In Proceedings of the 2019 IEEE International Conference
on Acoustics, Speech and Signal Processing, pages 3617–3621. IEEE, 2019.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing
flows. In Proceedings of the 32nd International Conference on Machine Learning, pages
1530–1538, 2015.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropaga-
tion and approximate inference in deep generative models. In Proceedings of the 31st
International Conference on Machine Learning, pages 1278–1286, 2014.
Danilo Jimenez Rezende, Sébastien Racanière, Irina Higgins, and Peter Toth. Equivariant
Hamiltonian flows. ArXiv Preprint arXiv:1909.13739, 2019.
Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. ArXiv Preprint arXiv:1302.5125, 2013.
Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete
flows: Invertible generative models of discrete data. In Advances in Neural Information
Processing Systems, 2019.
Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autore-
gressive density-estimator. In Advances in Neural Information Processing Systems, pages
2175–2183, 2013.
Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator.
In Proceedings of the 31st International Conference on Machine Learning, pages 467–475,
2014.
Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.
Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.
Rianne van den Berg, Alexey A. Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby,
and Tim Salimans. IDF++: Analyzing and improving integer discrete flows for lossless
compression. ArXiv Preprint arXiv:2006.12459, 2020.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A gener-
ative model for raw audio. ArXiv Preprint arXiv:1609.03499, 2016a.
Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. In Proceedings of The 33rd International Conference on Machine Learning,
pages 1747–1756, 2016b.
Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and
Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Advances
in Neural Information Processing Systems, pages 4797–4805, 2016c.
Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray
Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg,
Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal
Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis
Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proceedings of the
35th International Conference on Machine Learning, pages 3918–3926, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in
Neural Information Processing Systems, pages 5998–6008, 2017.
Cédric Villani. Optimal transport: Old and new, volume 338. Springer Science & Business
Media, 2008.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and
variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
Prince Zizhuang Wang and William Yang Wang. Riemannian normalizing flow on vari-
ational Wasserstein autoencoder for text modeling. ArXiv Preprint arXiv:1904.02399,
2019.
Patrick Nadeem Ward, Ariella Smofsky, and Avishek Joey Bose. Improving exploration in
soft-actor-critic with normalizing flows policies. ICML Workshop on Invertible Neural
Networks and Normalizing Flows, 2019.
Antoine Wehenkel and Gilles Louppe. Unconstrained monotonic neural networks. In Ad-
vances in Neural Information Processing Systems, 2019.
Christina Winkler, Daniel E. Worrall, Emiel Hoogeboom, and Max Welling. Learning
likelihoods with conditional normalizing flows. ArXiv Preprint arXiv:1912.00042, 2019.
Peter Wirnsberger, Andrew J. Ballard, George Papamakarios, Stuart Abercrombie,
Sébastien Racanière, Alexander Pritzel, Danilo Jimenez Rezende, and Charles Blundell.
Targeted free energy estimation via learned mappings. The Journal of Chemical Physics,
153(14):144112, 2020.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge J. Belongie, and Bharath
Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In
Proceedings of the International Conference on Computer Vision, 2019.
Chunting Zhou, Xuezhe Ma, Di Wang, and Graham Neubig. Density matching for bilingual
word embedding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics, pages 1588–1598, 2019.
Zachary Ziegler and Alexander Rush. Latent normalizing flows for discrete sequences. In
Proceedings of the 36th International Conference on Machine Learning, pages 7673–7682,
2019.