Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan September 10, 2024
Stanley Chan1
Contents
1 Variational Auto-Encoder (VAE) 2
1.1 Building Blocks of VAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Evidence Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Optimization in VAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Concluding Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6 Conclusion 86
1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907.
Email: [email protected].
The latent variable z has two special roles in this setup. With respect to the input, the latent variable
encapsulates the information that can be used to describe x. The encoding procedure could be a lossy
process, but our goal is to preserve the important content of x as much as we can. With respect to the
output, the latent variable serves as the “seed” from which an image x̂ can be generated. Two different z’s
should in theory give us two different generated images.
A slightly more formal definition of a latent variable is given below.
Definition 1.1. Latent Variables[24]. In a probabilistic model, latent variables z are variables that
we do not observe and hence are not part of the training dataset, although they are part of the model.
Example 1.1. Getting a latent representation of an image is not an alien thing. Back in the time of
JPEG compression (which is arguably a dinosaur), we used discrete cosine transform (DCT) basis functions
φn to encode the underlying image/patches of an image. The coefficient vector z = [z1, . . . , zN]ᵀ
is obtained by projecting the image x onto the space spanned by the basis, via zn = ⟨φn, x⟩. So, given
an image x, we can produce a coefficient vector z. From z, we can use the inverse transform to recover
(i.e. decode) the image.
In this example, the coefficient vector z is the latent variable. The encoder is the DCT transform,
and the decoder is the inverse DCT transform.
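To make this concrete, here is a minimal Python sketch of the DCT encoder/decoder idea, using SciPy's dctn/idctn as the transform pair. The 8×8 patch and the thresholding rule below are illustrative assumptions, not part of JPEG itself.

```python
import numpy as np
from scipy.fft import dctn, idctn  # type-II DCT and its inverse

# A toy 8x8 image patch playing the role of x.
rng = np.random.default_rng(0)
x = rng.random((8, 8))

# Encoder: project x onto the DCT basis to get the latent coefficient vector z.
z = dctn(x, norm="ortho")

# Decoder: the inverse DCT recovers the image from the latent code.
x_hat = idctn(z, norm="ortho")
print(np.allclose(x, x_hat))  # True: with all coefficients kept, the code is invertible

# A lossy version, in the spirit of JPEG: keep only the large coefficients.
z_lossy = np.where(np.abs(z) > 0.5, z, 0.0)
x_lossy = idctn(z_lossy, norm="ortho")
```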
The term “variational” in VAE is related to the subject of calculus of variations which studies opti-
mization over functions. In VAE, we are interested in searching for the optimal probability distributions to
describe x and z. In light of this, we need to consider a few distributions: the marginal p(x), the prior p(z),
and the two conditionals p(z|x) (encoding) and p(x|z) (decoding).
Definition 1.2. Deep Latent Variables[24]. Deep Latent Variables are latent variables whose
distributions p(z), p(x|z), or p(z|x) are parameterized by a neural network.
The advantage of deep latent variables is that they can model very complex data distributions p(x) even
though the structures of the prior distributions and the conditional distributions are relatively simple (e.g.
Gaussian). One way to think about this is that the neural networks can be used to estimate the mean of
a Gaussian. Although the Gaussian itself is simple, the mean is a function of the input data, which passes
through a neural network to generate a data-dependent mean. So the expressiveness of the Gaussian is
significantly improved.
Let’s go back to the four distributions above. Here is a somewhat trivial but educational example that
can illustrate the idea:
Example 1.2. Consider a random variable X distributed according to a Gaussian mixture model with
a latent variable z ∈ {1, . . . , K} denoting the cluster identity such that pZ(k) = P[Z = k] = πk for
k = 1, . . . , K. We assume ∑_{k=1}^K πk = 1. Then, if we are told that we need to look at the k-th cluster
only, the conditional distribution of X given Z is
pX|Z(x|k) = N(x | µk, σk² I).
The marginal distribution of x can be found using the law of total probability, giving us
pX(x) = ∑_{k=1}^K pX|Z(x|k) pZ(k) = ∑_{k=1}^K πk N(x | µk, σk² I).    (1)
Therefore, if we start with pX (x), the design question for the encoder is to build a magical encoder such
that for every sample x ∼ pX (x), the latent code will be z ∈ {1, . . . , K} with a distribution z ∼ pZ (k).
To illustrate how the encoder and decoder work, let’s assume that the mean and variance are known
and are fixed. Otherwise we will need to estimate the mean and variance through an expectation-
maximization (EM) algorithm. It is doable, but the tedious equations will defeat the educational
purpose of this illustration.
Encoder: How do we obtain z from x? This is easy because at the encoder, we know pX(x) and
pZ(k). Imagine that you only have two classes z ∈ {1, 2}. Effectively you are just making a binary
decision of where the sample x should belong. There are many ways to make this binary decision, for
example a maximum-a-posteriori rule that picks the cluster with the larger posterior probability,
and this will return you a simple decision: you give us x, we tell you z ∈ {1, 2}.
Decoder: On the decoder side, if we are given a latent code z ∈ {1, . . . , K}, the magical decoder
just needs to return us a sample x which is drawn from pX|Z (x|k) = N (x | µk , σk2 I). A different z will
give us one of the K mixture components. If we have enough samples, the overall distribution will
follow the Gaussian mixture.
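Below is a small sketch of this oversimplified encoder/decoder for a two-cluster, one-dimensional Gaussian mixture with known means and variances. The MAP decision rule and the specific numbers are illustrative choices, not prescribed by the text.

```python
import numpy as np
from scipy.stats import norm

# Known mixture parameters (1-D for simplicity); the numbers are illustrative.
pi    = np.array([0.3, 0.7])
mu    = np.array([-2.0, 2.0])
sigma = np.array([0.2, 1.0])
rng = np.random.default_rng(0)

def encode(x):
    """MAP 'encoder': pick the cluster with the largest posterior p(z=k|x)."""
    post = pi * norm.pdf(x, loc=mu, scale=sigma)   # proportional to p(x|k) p(k)
    return int(np.argmax(post))

def decode(k):
    """'Decoder': draw a sample from the k-th mixture component p(x|z=k)."""
    return rng.normal(mu[k], sigma[k])

x = rng.normal(mu[1], sigma[1])   # a sample from cluster 2
k = encode(x)                     # latent code z in {0, 1}
x_hat = decode(k)                 # regenerated sample from the same cluster
```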
This example is certainly oversimplified because real-world problems can be much harder than a Gaussian
mixture model with known means and known variances. But one thing we realize is that if we want to find
the magical encoder and decoder, we must have a way to find the two conditional distributions p(z|x) and
p(x|z). However, they are both high-dimensional.
In order for us to say something more meaningful, we need to impose additional structures so that we can
generalize the concept to harder problems. To this end, we consider the following two proxy distributions:
• qϕ (z|x): The proxy for p(z|x), which is also the distribution associated with the encoder. qϕ (z|x) can
be any directed graphical model and it can be parameterized using deep neural networks [24, Section
2.1]. For example, we can define
(µ, σ 2 ) = EncoderNetworkϕ (x),
qϕ (z|x) = N (z | µ, diag(σ 2 )). (2)
This model is widely used because of its tractability and computational efficiency.
• pθ (x|z): The proxy for p(x|z), which is also the distribution associated with the decoder. Like the
encoder, the decoder can be parameterized by a deep neural network. For example, we can define
fθ(z) = DecoderNetworkθ(z),
pθ(x|z) = N(x | fθ(z), σ²dec I),    (3)
where σdec is a hyperparameter that can be pre-determined or learned.
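As a rough illustration (not a prescribed architecture), the two proxy distributions in Eqn (2) and Eqn (3) can be parameterized by small fully connected networks. The sketch below assumes PyTorch; predicting log σ² instead of σ² is an assumption for numerical stability.

```python
import torch
import torch.nn as nn

class EncoderNetwork(nn.Module):
    """Maps x to the Gaussian parameters (mu, log sigma^2) of q_phi(z|x), cf. Eqn (2)."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)   # log sigma^2 for stability

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class DecoderNetwork(nn.Module):
    """Maps z to f_theta(z), the mean of p_theta(x|z), cf. Eqn (3)."""
    def __init__(self, z_dim, x_dim, hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, z):
        return self.f(z)
```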
The relationship between the input x and the latent z, as well as the conditional distributions, are
summarized in Figure 1. There are two nodes x and z. The “forward” relationship is specified by p(z|x)
(and approximated by qϕ (z|x)), whereas the “reverse” relationship is specified by p(x|z) (and approximated
by pθ (x|z)).
Figure 1: In a variational autoencoder, the variables x and z are connected by the conditional distributions
p(x|z) and p(z|x). To make things work, we introduce proxy distributions pθ (x|z) and qϕ (z|x).
Example 1.3. Suppose that we have a random variable x ∈ Rᵈ and a latent variable z ∈ Rᵈ such that
x ∼ p(x) = N(x | µ, σ²I),
z ∼ p(z) = N(z | 0, I).
We want to construct a VAE. By this, we mean that we want to build two mappings Encoder(·) and
Decoder(·). The encoder will take a sample x and map it to the latent variable z, whereas the decoder
will take the latent variable z and map it to the generated variable x̂. If we knew what p(x) is, then
there is a trivial solution where z = (x − µ)/σ and x̂ = µ + σz. In this case, the true conditional
distributions are delta functions: p(z|x) = δ(z − (x − µ)/σ) and p(x|z) = δ(x − (µ + σz)).
Suppose now that we do not know p(x), so we need to build an encoder and a decoder to estimate
z and x̂. Let's first define the encoder. Our encoder in this example takes the input x and generates
a pair of parameters µ̂(x) and σ̂(x)², denoting the parameters of a Gaussian. For the purpose of
discussion, we assume that µ̂ is affine so that µ̂(x) = ax + b for some parameters a and b, and
σ̂(x)² = t² for some scalar t. Then, we define qϕ(z|x) as a Gaussian:
qϕ(z|x) = N(z | ax + b, t²I).
Similarly, the decoder takes a latent z and generates a pair of parameters µ̃(z) and σ̃(z)². Again,
for the purpose of discussion, we assume that µ̃ is affine so that µ̃(z) = cz + v for some parameters
c and v, and σ̃(z)² = s² for some scalar s. Therefore, pθ(x|z) takes the form of:
pθ(x|z) = N(x | cz + v, s²I).
Definition 1.3. (Evidence Lower Bound) The Evidence Lower Bound is defined as
ELBO(x) ≜ E_{qϕ(z|x)}[ log ( p(x, z) / qϕ(z|x) ) ].    (4)
You are certainly puzzled how on Earth people can come up with this loss function!? Let's see what
ELBO means and how it is derived.
In a nutshell, ELBO is a lower bound for the log evidence log p(x), because we can show that
log p(x) = some magical steps to be derived
= E_{qϕ(z|x)}[ log ( p(x, z) / qϕ(z|x) ) ] + DKL( qϕ(z|x) ‖ p(z|x) )    (5)
≥ E_{qϕ(z|x)}[ log ( p(x, z) / qϕ(z|x) ) ]
≜ ELBO(x),
where the inequality follows from the fact that the KL divergence is always non-negative. Therefore, ELBO
is a valid lower bound for log p(x). Since we never have access to log p(x), if we somehow have access to
ELBO, we can maximize ELBO as a (more tractable) surrogate for maximizing log p(x).
Figure 2: Visualization of log p(x) and ELBO. The gap between the two is determined by the KL divergence
DKL (qϕ (z|x)∥p(z|x)).
Theorem 1.1. Decomposition of Log-Likelihood. The log-likelihood log p(x) can be decomposed as
log p(x) = E_{qϕ(z|x)}[ log ( p(x, z) / qϕ(z|x) ) ] + DKL( qϕ(z|x) ‖ p(z|x) ),    (6)
where the first term is, by definition, ELBO(x).
Proof. The trick is to use our magical proxy qϕ(z|x) to poke around p(x) and derive the bound.
log p(x) = log p(x) × ∫ qϕ(z|x) dz        (multiply by 1, since ∫ qϕ(z|x) dz = 1)
= ∫ log p(x) × qϕ(z|x) dz        (move the constant log p(x) into the integral)
= E_{qϕ(z|x)}[ log p(x) ],    (7)
where the last equality uses the fact that ∫ a × pZ(z) dz = E[a] = a for any random variable Z and scalar a.
See, we have already got E_{qϕ(z|x)}[·]. Just a few more steps. Let's use Bayes theorem which states
that p(x, z) = p(z|x)p(x):
E_{qϕ(z|x)}[ log p(x) ] = E_{qϕ(z|x)}[ log ( p(x, z) / p(z|x) ) ]        (Bayes theorem)
= E_{qϕ(z|x)}[ log ( p(x, z) / p(z|x) × qϕ(z|x) / qϕ(z|x) ) ]        (multiply and divide by qϕ(z|x))
= E_{qϕ(z|x)}[ log ( p(x, z) / qϕ(z|x) ) ] + E_{qϕ(z|x)}[ log ( qϕ(z|x) / p(z|x) ) ],    (8)
where we recognize that the first term is exactly ELBO, whereas the second term is exactly the KL
divergence DKL( qϕ(z|x) ‖ p(z|x) ). Comparing Eqn (8) with Eqn (5), we complete the proof.
The equality holds if and only if the KL-divergence term is zero. For the KL divergence to be zero, it
is necessary that qϕ(z|x) = p(z|x). However, since p(z|x) is a delta function, the only possibility is to
have
qϕ(z|x) = N(z | (x − µ)/σ, 0) = δ(z − (x − µ)/σ),    (9)
i.e., we set the standard deviation to be t = 0. To determine pθ(x|z), we need some additional steps to
simplify ELBO.
We now have ELBO. But this ELBO is still not too useful because it involves p(x, z), something we have
no access to. So, we need to do a little more work.
This is a beautiful result. We just showed something very easy to understand. Let's look at the two
terms in Eqn (10), which decomposes ELBO(x) into E_{qϕ(z|x)}[ log pθ(x|z) ] − DKL( qϕ(z|x) ‖ p(z) ):
• Reconstruction. The first term is about the decoder. We want the decoder to produce a good image
x if we feed a latent z into the decoder (of course!!). So, we want to maximize log pθ (x|z). It is
similar to maximum likelihood where we want to find the model parameter to maximize the likelihood
of observing the image. The expectation here is taken with respect to the samples z (conditioned on
x). This shouldn’t be a surprise because the samples z are used to assess the quality of the decoder.
It cannot be an arbitrary noise vector but a meaningful latent vector. So, z needs to be sampled from
qϕ (z|x).
• Prior Matching. The second term is the KL divergence for the encoder. We want the encoder to
turn x into a latent vector z such that the latent vector will follow our choice of distribution, e.g.,
z ∼ N(0, I). To be slightly more general, we write p(z) as the target distribution. Because the KL
divergence is a distance-like measure (which increases when the two distributions become more dissimilar),
we put a negative sign in front so that the overall term increases when the two distributions become more similar.
Returning to the example, with qϕ(z|x) the delta function in Eqn (9) and pθ(x|z) = N(x | cz + v, s²I),
the reconstruction term becomes
E_{qϕ(z|x)}[ log pθ(x|z) ] = −(1/2) log 2π − log s − (c²/(2s²)) E_{δ(z − (x−µ)/σ)}[ ‖z − (x − v)/c‖² ]
= −(1/2) log 2π − log s − (c²/(2s²)) ‖ (x − µ)/σ − (x − v)/c ‖²
≤ −(1/2) log 2π − log s,
where the upper bound is tight if and only if the norm-square term is zero, which holds when v = µ
and c = σ. For the remaining terms, it is clear that − log s is a monotonically decreasing function in s
with − log s → ∞ as s → 0. Therefore, when v = µ and c = σ, it follows that Eqϕ (z|x) [log pθ (x|z)] is
maximized when s = 0. This implies that
pθ (x|z) = N (x | σz + µ, 0)
= δ(x − (σz + µ)). (11)
Limitation of ELBO. ELBO is practically useful, but it is not the same as the true likelihood log p(x).
As we mentioned, ELBO is exactly equal to log p(x) if and only if DKL (qϕ (z|x)∥p(z|x)) = 0 which happens
when qϕ (z|x) = p(z|x). In the following example, we will show a case where the qϕ (z|x) obtained from
maximizing ELBO is not the same as p(z|x).
Example 1.6. (Limitation of ELBO). In the previous example, if we have no idea about p(z|x), we
need to train the VAE by maximizing ELBO. However, since ELBO is only a lower bound of the true
distribution log p(x), maximizing ELBO will not return us the delta functions as we hope. Instead, we
will obtain something that is quite meaningful but not exactly the delta functions.
For simplicity, let’s consider the distributions that will return us unbiased estimates of the mean
but with unknown variances:
qϕ(z|x) = N(z | (x − µ)/σ, t²I),
pθ(x|z) = N(x | σz + µ, s²I).
This is partially “cheating” because in theory we should not assume anything about the estimates of
the means. But from an intuitive angle, since qϕ (z|x) and pθ (x|z) are proxies to p(z|x) and p(x|z),
they must resemble some properties of the delta functions. The closest choice is to define qϕ (z|x) and
pθ (x|z) as Gaussians with means consistent with those of the two delta functions. The variances are
unknown, and they are the subject of interest in this example.
Our focus here is to maximize ELBO, which consists of the prior matching term and the reconstruction
term. For the prior matching term, we want to minimize the KL divergence DKL( qϕ(z|x) ‖ p(z) ).
The KL divergence of two multivariate Gaussians N(z | µ0, Σ0) and N(z | µ1, Σ1) has a closed form:
DKL = (1/2)[ tr(Σ1⁻¹Σ0) − d + (µ1 − µ0)ᵀΣ1⁻¹(µ1 − µ0) + log( det Σ1 / det Σ0 ) ].
Using this result (and with some algebra), we can show that
DKL( N(z | (x − µ)/σ, t²I) ‖ N(z | 0, I) ) = (1/2)( t²d − d + ‖(x − µ)/σ‖² − 2 log t ),
where d is the dimension of x and z. To minimize the KL divergence, we take the derivative with respect
to t and show that
∂/∂t [ (1/2)( t²d − d + ‖(x − µ)/σ‖² − 2 log t ) ] = t·d − 1/t.
Equating this to zero gives t = 1/√d. Therefore,
qϕ(z|x) = N(z | (x − µ)/σ, (1/d) I).
Similarly, for the reconstruction term, taking the derivative with respect to s gives
d/ds [ −(d/2) log 2π − d log s − σ²/(2s²) ] = −d/s + σ²/s³ = 0.
Equating this to zero will give us s = σ/√d. Therefore,
pθ(x|z) = N(x | σz + µ, (σ²/d) I).
As we can see in this example and the previous example, while the ideal distributions are delta
functions, the proxy distributions we obtain have a finite variance. This finite variance adds additional
randomness to the samples generated by the VAE. There is nothing wrong with this VAE — we do it
correctly by maximizing ELBO. It is just that maximizing the ELBO is not the same as maximizing
log p(x).
An interesting observation in this example is that the variance in qϕ (z|x) and pθ (x|z) scales linearly
with 1/d where d is the dimension of the input x. Therefore, for high-dimensional problems when d is
large, the two distributions qϕ (z|x) and pθ (x|z) will approach delta functions. So, even though qϕ (z|x)
and pθ (x|z) are not exactly delta functions for finite d, they are asymptotically approaching the true
distributions.
Intractability of ELBO’s Gradient. The challenge associated with the above optimization is that
the gradient of ELBO with respect to (ϕ, θ) is intractable. Since the majority of today’s neural network opti-
mizers use first order methods and backpropagate the gradient to update the network weights, an intractable
gradient will pose difficulties in training the VAE.
Let’s elaborate more about the intractability of the gradient. We first substitute Definition 1.3 into the
above objective function. The gradient of ELBO is:²
∇_{θ,ϕ} ELBO(x) = ∇_{θ,ϕ} E_{qϕ(z|x)}[ log ( pθ(x, z) / qϕ(z|x) ) ]
= ∇_{θ,ϕ} { E_{qϕ(z|x)}[ log pθ(x, z) − log qϕ(z|x) ] }.    (13)
The gradient contains two parameters. Let's first look at θ. We can show that
∇_θ ELBO(x) = ∇_θ { E_{qϕ(z|x)}[ log pθ(x, z) − log qϕ(z|x) ] }
= ∇_θ ∫ [ log pθ(x, z) − log qϕ(z|x) ] · qϕ(z|x) dz
= ∫ ∇_θ { log pθ(x, z) − log qϕ(z|x) } · qϕ(z|x) dz
= E_{qϕ(z|x)}[ ∇_θ { log pθ(x, z) − log qϕ(z|x) } ]
= E_{qϕ(z|x)}[ ∇_θ { log pθ(x, z) } ]
≈ (1/L) ∑_{ℓ=1}^L ∇_θ { log pθ(x, z(ℓ)) },   where z(ℓ) ∼ qϕ(z|x),    (14)
where the last step is the Monte Carlo approximation of the expectation.
In the above equation, if pθ (x, z) is realized by a computable model such as a neural network, then its
gradient ∇θ {log pθ (x, z)} can be computed via automatic differentiation. Thus, the maximization can be
achieved by backpropagating the gradient.
² The original definition of ELBO uses the true joint distribution p(x, z). In practice, since p(x, z) is not accessible, we replace it with the parameterized proxy pθ(x, z).
Now consider the gradient with respect to ϕ. As we can see, even though we wish to maintain a similar
structure as we did for θ, the sampling distribution qϕ(z|x) itself depends on ϕ, so the expectation and
the gradient operators in the above derivations cannot be switched. This forbids us from doing any
backpropagation of the gradient to maximize ELBO.
Reparameterization Trick. The intractability of ELBO's gradient is inherited from the fact that we
need to draw samples z from a distribution qϕ(z|x) which itself is a function of ϕ. As noted by Kingma and
Welling [23], for continuous latent variables, it is possible to compute an unbiased estimate of ∇_{θ,ϕ} ELBO(x)
so that we can approximately calculate the gradient and hence maximize ELBO. The idea is to employ a
technique known as the reparameterization trick [23].
Recall that the latent variable z is a sample drawn from the distribution qϕ(z|x). The idea of the repa-
rameterization trick is to express z as some differentiable and invertible transformation of another random
variable ϵ whose distribution is independent of x and ϕ. That is, we define a differentiable and invertible
function g such that
z = g(ϵ, ϕ, x),    (16)
for some random variable ϵ ∼ p(ϵ). To make our discussions easier, we pose an additional requirement that
qϕ(z|x) · det( ∂z/∂ϵ ) = p(ϵ),    (17)
where ∂z/∂ϵ is the Jacobian and det(·) is the matrix determinant. This requirement is related to the change
of variables in multivariate calculus. The following example will make it clear.
Example 1.7. Suppose z ∼ qϕ(z|x) = N(z | µ, diag(σ²)). We can define
z = g(ϵ, ϕ, x) ≜ ϵ ⊙ σ + µ,    (18)
where ϵ ∼ N(0, I) and “⊙” means elementwise multiplication. The parameter ϕ is ϕ = (µ, σ²). For
this choice of the distribution, we can show that, by letting ϵ = (z − µ)/σ:
qϕ(z|x) · det( ∂z/∂ϵ ) = ∏_{i=1}^d [ 1/√(2πσi²) ] exp( −(zi − µi)²/(2σi²) ) · ∏_{i=1}^d σi
= ( 1/(√(2π))ᵈ ) exp( −‖ϵ‖²/2 ) = N(0, I) = p(ϵ).
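A minimal PyTorch sketch of Eqn (18): the sample z is written as a deterministic function of (µ, σ) and an independent ϵ, so gradients with respect to ϕ = (µ, σ²) can be backpropagated. The log-variance parameterization is an assumption for convenience.

```python
import torch

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, diag(sigma^2)) via z = mu + sigma * eps with eps ~ N(0, I), cf. Eqn (18).

    Because eps carries all the randomness, gradients flow through mu and sigma.
    """
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

# Gradient check: dz/dmu exists even though z is "sampled".
mu = torch.zeros(4, requires_grad=True)
log_var = torch.zeros(4, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()
print(mu.grad)   # all ones: the sample is differentiable w.r.t. mu
```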
With this re-parameterization of z by expressing it in terms of ϵ, we can look at ∇_ϕ E_{qϕ(z|x)}[f(z)] for
some general function f(z). (Later we will consider f(z) = − log qϕ(z|x).) For notational simplicity, we
note that the requirement in Eqn (17) lets us replace the expectation over qϕ(z|x) by an expectation over
p(ϵ). So, if we want to take the gradient with respect to ϕ, we can show that
∇_ϕ E_{qϕ(z|x)}[f(z)] = ∇_ϕ E_{p(ϵ)}[f(z)] = ∇_ϕ ∫ f(z) · p(ϵ) dϵ
= ∫ ∇_ϕ { f(z) · p(ϵ) } dϵ
= ∫ { ∇_ϕ f(z) } · p(ϵ) dϵ
= E_{p(ϵ)}[ ∇_ϕ f(z) ],
which can be approximated by Monte Carlo. Substituting f(z) = − log qϕ(z|x), we can show that the resulting
gradient is again an expectation over p(ϵ) that can be approximated by Monte Carlo. So, as long as the
determinant in Eqn (17) is differentiable with respect to ϕ, the Monte Carlo approximation can be
numerically computed.
Example 1.8. Suppose that the parameters and the distribution qϕ are defined as follows:
As we can see in the above example, for some specific choices of the distributions (e.g., Gaussian), the
gradient of ELBO can be significantly easier to derive.
VAE Encoder. After discussing the reparameterizing trick, we can now discuss the specific structure of
the encoder in VAE. To make our discussions focused, we assume a relatively common choice of the encoder:
(µ, σ 2 ) = EncoderNetworkϕ (x)
qϕ (z|x) = N (z | µ, σ 2 I).
The parameters µ and σ are technically neural networks because they are the outputs of EncoderNetworkϕ (·).
Therefore, it will be helpful if we denote them as
µ = µϕ(x)   (a neural network),
σ² = σϕ²(x)   (a neural network).
Our notation is slightly more complicated because we want to emphasize that µ is a function of x; You give
us an image x, our job is to return you the parameters of the Gaussian (i.e., mean and variance). If you give
us a different x, then the parameters of the Gaussian should also be different. The parameter ϕ specifies
that µ is controlled (or parameterized) by ϕ.
Suppose that we are given the ℓ-th training sample x(ℓ) . From this x(ℓ) we want to generate a latent
variable z(ℓ) which is a sample from qϕ (z|x). Because of the Gaussian structure, it is equivalent to say that
z(ℓ) ∼ N( z | µϕ(x(ℓ)), σϕ²(x(ℓ)) I ).    (21)
The interesting thing about this equation is that we use a neural network EncoderNetworkϕ (·) to estimate
the mean and variance of the Gaussian. Then, from this Gaussian we draw a sample z(ℓ) , as illustrated in
Figure 3.
Figure 3: Implementation of a VAE encoder. We use a neural network to take the image x and estimate the
mean µϕ and variance σϕ² of the Gaussian distribution.
A more convenient way of expressing Eqn (21) is to realize that the sampling operation z ∼ N (µ, σ 2 I)
can be done using the reparameterization trick.
Proof. We will prove a general case for an arbitrary covariance matrix Σ instead of a diagonal matrix σ²I.
For any high-dimensional Gaussian z ∼ N(z | µ, Σ), the sampling process can be done via the transformation
of white noise
z = µ + Σ^(1/2) ϵ,    (23)
where ϵ ∼ N(0, I). The half matrix Σ^(1/2) can be obtained through eigen-decomposition or Cholesky
factorization. If Σ has an eigen-decomposition Σ = USUᵀ, then Σ^(1/2) = US^(1/2)Uᵀ. The square root of
the eigenvalue matrix S is well-defined because Σ is a positive semi-definite matrix.
We can calculate the expectation and covariance of z:
E[z] = E[µ + Σ^(1/2) ϵ] = µ + Σ^(1/2) E[ϵ] = µ,        (since E[ϵ] = 0)
Cov(z) = E[(z − µ)(z − µ)ᵀ] = E[ Σ^(1/2) ϵϵᵀ (Σ^(1/2))ᵀ ] = Σ^(1/2) E[ϵϵᵀ] (Σ^(1/2))ᵀ = Σ        (since E[ϵϵᵀ] = I).
Given the VAE encoder structure and qϕ (z|x), we can go back to ELBO. Recall that ELBO consists of
the prior matching term and the reconstruction term. The prior matching term is measured in terms of the
KL divergence DKL (qϕ (z|x)∥p(z)). Let’s evaluate this KL divergence.
To evaluate the KL divergence, we (re)use the closed-form result summarized earlier: for two Gaussians
N(z | µ0, Σ0) and N(z | µ1, Σ1), DKL = (1/2)[ tr(Σ1⁻¹Σ0) − d + (µ1 − µ0)ᵀΣ1⁻¹(µ1 − µ0) + log( det Σ1 / det Σ0 ) ].
VAE Decoder. The decoder is implemented through a neural network. For notational simplicity, let's
define it as DecoderNetworkθ(·), where θ denotes the network parameters. The job of the decoder network
is to take a latent variable z and generate an image fθ(z):
fθ(z) = DecoderNetworkθ(z).    (28)
The distribution pθ(x|z) can be defined as
pθ(x|z) = N(x | fθ(z), σ²dec I),   for some hyperparameter σdec.    (29)
The interpretation of pθ(x|z) is that we estimate fθ(z) through a network and put it as the mean of the
Gaussian. If we draw a sample x from pθ(x|z), then by the reparameterization trick we can write the
generated image x̂ as
x̂ = fθ(z) + σdec ϵ,   ϵ ∼ N(0, I).
Moreover, if we take the log of the likelihood, we can show that
log pθ(x|z) = log N(x | fθ(z), σ²dec I)
= log [ ( 1/√((2πσ²dec)ᵈ) ) exp( −‖x − fθ(z)‖² / (2σ²dec) ) ]
= −‖x − fθ(z)‖² / (2σ²dec) − log √((2πσ²dec)ᵈ),    (30)
where the last term is independent of θ so we can drop it.
Going back to ELBO, we want to compute E_{qϕ(z|x)}[ log pθ(x|z) ]. If we directly calculate the expectation,
we will need to compute an integration
E_{qϕ(z|x)}[ log pθ(x|z) ] = ∫ log N(x | fθ(z), σ²dec I) · N(z | µϕ(x), σϕ²(x) I) dz
= −∫ [ ‖x − fθ(z)‖² / (2σ²dec) ] · N(z | µϕ(x), σϕ²(x) I) dz + C,
where the constant C coming out of the log of the Gaussian can be dropped. By using the reparameterization
trick, we write z = µϕ(x) + σϕ(x)ϵ and substitute it into the above equation. This will give us³
E_{qϕ(z|x)}[ log pθ(x|z) ] = −∫ [ ‖x − fθ(z)‖² / (2σ²dec) ] · N(z | µϕ(x), σϕ²(x) I) dz
≈ −(1/M) ∑_{m=1}^M ‖x − fθ(z(m))‖² / (2σ²dec)    (31)
= −(1/M) ∑_{m=1}^M ‖x − fθ( µϕ(x) + σϕ(x)ϵ(m) )‖² / (2σ²dec).
The approximation above is due to Monte Carlo where the randomness is based on the sampling of the
ϵ ∼ N (ϵ | 0, I). The index M specifies the number of Monte Carlo samples we want to use to approximate
the expectation. Note that the input image x is fixed because Eqϕ (z|x) [log pθ (x|z)] is a function of x.
The gradient of E_{qϕ(z|x)}[ log pθ(x|z) ] with respect to θ is relatively easy to compute. Since only fθ depends
on θ, we can use automatic differentiation. The gradient with respect to ϕ is slightly harder, but it is still
computable because we can use the chain rule and go into µϕ(x) and σϕ(x).
Inspecting Eqn (31), we notice that the loss function is simply the ℓ2 norm between the reconstructed
image fθ(z) and the ground truth image x. This means that if we have the generated image fθ(z), we can
do a direct comparison with the ground truth x via the usual ℓ2 loss, as illustrated in Figure 4.
³ The negative sign here is not a mistake. We want to maximize E_{qϕ(z|x)}[ log pθ(x|z) ], which, because of
the negative sign, is equivalent to minimizing the ℓ2 norm.
where the summation is taken with respect to the entire training dataset. The individual ELBO is based on
the sum of the terms we derived above
ELBOϕ,θ (x) = Eqϕ (z|x) [log pθ (x|z)] − DKL qϕ (z|x) ∥ p(z) . (32)
Theorem 1.4. (VAE Training). To train a VAE, we need to solve the optimization problem
argmax_{θ,ϕ} ∑_{x∈X} ELBOϕ,θ(x),
where
ELBOϕ,θ(x) = −(1/M) ∑_{m=1}^M ‖x − fθ( µϕ(x) + σϕ(x)ϵ(m) )‖² / (2σ²dec)
− (1/2)( σϕ²(x)d − d + ‖µϕ(x)‖² − 2 log σϕ(x) ).    (35)
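A hedged sketch of the training objective in Theorem 1.4, assuming PyTorch, an encoder returning (µ, log σ²), and a decoder returning fθ(z); the standard diagonal-Gaussian KL is used as the prior matching term, and M and σdec are hyperparameters.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder, sigma_dec=1.0, M=1):
    """Monte Carlo estimate of -ELBO for one batch (cf. Theorem 1.4).

    Reconstruction: ||x - f_theta(mu + sigma*eps)||^2 / (2 sigma_dec^2), averaged over M draws.
    Prior matching: closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    """
    mu, log_var = encoder(x)
    recon = 0.0
    for _ in range(M):
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
        recon = recon + F.mse_loss(decoder(z), x, reduction="sum") / (2 * sigma_dec**2)
    recon = recon / M
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + kl   # minimizing this maximizes ELBO
```

In a training loop, one would call negative_elbo(x, encoder, decoder), backpropagate, and step the optimizer; at inference time, one draws z ∼ N(0, I) and returns decoder(z).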
VAE Inference. The inference of a VAE is relatively simple. Once the VAE is trained, we can drop
the encoder and only keep the decoder, as shown in Figure 5. To generate a new image from the model, we
pick a random latent vector z ∈ Rᵈ (e.g., drawn from the prior p(z) = N(0, I)). By sending this z through
the decoder fθ, we will be able to generate a new image x̂ = fθ(z).
Diffusion models are incremental updates where the assembly of the whole gives us the
encoder-decoder structure.
Why incremental? It's like turning the direction of a giant ship. You need to turn the ship slowly towards
your desired direction, or otherwise you will lose control. The same principle applies to your company HR
and your university administration.
DDPM has a lot of linkage to a piece of earlier work by Sohl-Dickstein et al in 2015 [38]. Sohl-Dickstein
et al asked the question of how to convert from one distribution to another distribution. VAE provides
one approach: Referring to the previous section, we can think of the source distribution being the latent
variable z ∼ p(z) and the target distribution being the input variable x ∼ p(x). Then by setting up the
proxy distributions pθ (x|z) and qϕ (z|x), we can train the encoder and decoder so that the decoder will serve
the goal of generating images. But VAE is largely a one-step generation: if you give us a latent code z,
we ask the neural network fθ(·) to immediately return us the generated signal x ∼ N(x | fθ(z), σ²dec I). In
some sense, this is asking a lot from the neural network. We are asking it to use a few layers of neurons to
immediately convert from one distribution p(z) to another distribution p(x). This is too much.
The idea Sohl-Dickstein et al proposed was to construct a chain of conversions instead of a one-step
process. To this end they defined two processes analogous to the encoder and decoder in a VAE. They call
the encoder the forward process and the decoder the reverse process. In both processes, they consider
a sequence of variables x0, . . . , xT whose joint distributions are denoted as qϕ(x0:T) and pθ(x0:T) for the
forward and reverse processes, respectively. To make both processes tractable (and also flexible), they impose
a Markov chain structure (i.e., memoryless) where
forward from x0 to xT:   qϕ(x0:T) = q(x0) ∏_{t=1}^T qϕ(xt | xt−1),
reverse from xT to x0:   pθ(x0:T) = p(xT) ∏_{t=1}^T pθ(xt−1 | xt).
In both equations, each transition distribution depends only on its immediate previous state. Therefore,
if each transition is realized through some form of neural network, the overall generation process is broken
down into many smaller tasks. It does not mean that we will need T times more neural networks. We are
just reusing one network T times.
Breaking the overall process into smaller steps allows us to use simple distributions at each step. As
will be discussed in the following subsections, we can use Gaussian distributions for the transitions. Thanks
to the properties of a Gaussian, the posterior will remain a Gaussian if the likelihood and the prior are
both Gaussians. Therefore, if each transition distribution above is Gaussian, the joint distribution is
also Gaussian. Since a Gaussian is fully characterized by its first two moments (mean and variance), the
computation is highly tractable. In the original paper of Sohl-Dickstein et al, there is also a case study of
binomial diffusion processes.
After providing a high-level overview of the concepts, let's talk about some details. The starting point
of the diffusion model is to consider the VAE structure and make it a chain of incremental updates, as shown in Figure 6.
Figure 6: Variational diffusion model by Kingma et al [22]. In this model, the input image is x0 and the
white noise is xT . The intermediate variables (or states) x1 , . . . , xT −1 are latent variables. The transition
from xt−1 to xt is analogous to the forward step (encoder) in VAE, whereas the transition from xt to xt−1
is analogous to the reverse step (decoder) in VAE. In variational diffusion models, the input dimension and
the output dimension of the encoders/decoders are identical.
Figure 7: The transition block of a variational diffusion model consists of three nodes. The transition
distributions p(xt |xt+1 ) and p(xt |xt−1 ) are not accessible, but we can approximate them by Gaussians.
• The first path is the forward transition going from xt−1 to xt . The associated transition distribution
is p(xt |xt−1 ). In plain words, if you tell us xt−1 , we can draw a sample xt according to p(xt |xt−1 ).
Initial Block The initial block of the variational diffusion model focuses on the state x0 . Since we
start at x0 , we only need the reverse transition from x1 to x0 . The forward transition from x−1 to x0
can be dropped. Therefore, we only need to consider p(x0 |x1 ). But since p(x0 |x1 ) is never accessible, we
approximate it by a Gaussian pθ (x0 |x1 ) where the mean is computed through a neural network. See Figure 8
for illustration.
Figure 8: The initial block of a variational diffusion model focuses on the node x0 . Since there is no state
before time t = 0, we only have a reverse transition from x1 to x0 .
Final Block. The final block focuses on the state xT . Remember that xT is supposed to be our final
latent variable which is a white Gaussian noise vector. Because it is the final block, we only need a forward
transition from xT −1 to xT , and nothing such as xT +1 to xT . The forward transition is approximated by
qϕ (xT |xT −1 ) which is a Gaussian. See Figure 9 for illustration.
Figure 9: The final block of a variational diffusion model focuses on the node xT . Since there is no state
after time t = T , we only have a forward transition from xT −1 to xT .
Understanding the Transition Distribution. Before we proceed further, we need to explain the
transition distribution qϕ (xt |xt−1 ). We know that it is a Gaussian. But what is the mean and variance of
this Gaussian?
Definition 2.1. Transition Distribution qϕ(xt|xt−1). In a variational diffusion model (and also
DDPM, which we will discuss later), the transition distribution qϕ(xt|xt−1) is defined as
qϕ(xt|xt−1) ≜ N(xt | √αt xt−1, (1 − αt) I).    (36)
In other words, qϕ(xt|xt−1) is a Gaussian whose mean is √αt xt−1 and whose variance is (1 − αt) I. The choice
of the scaling factor √αt is to make sure that the variance magnitude is preserved so that it will not explode
or vanish after many iterations.
Our goal is to see whether this iterative procedure (using the above transition probability) will give us
a white Gaussian in the equilibrium state (i.e., when t → ∞).
For a mixture model, it is not difficult to show that the probability distribution of xt can be
calculated recursively via the following algorithm for t = 1, 2, . . . , T (the proof will be shown later):
xt ∼ pt(x) = π1 N(x | √αt µ1,t−1, αt σ²1,t−1 + (1 − αt)) + π2 N(x | √αt µ2,t−1, αt σ²2,t−1 + (1 − αt)),    (37)
where µ1,t−1 is the mean for class 1 at time t − 1, with µ1,0 = µ1 being the initial mean. Similarly,
σ²1,t−1 is the variance for class 1 at time t − 1, with σ²1,0 = σ1² being the initial variance.
In the figure below, we show a numerical example where π1 = 0.3, π2 = 0.7, µ1 = −2, µ2 = 2,
σ1 = 0.2, and σ2 = 1. The rate is defined as αt = 0.97 for all t. We plot the probability distribution
function for different t.
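The evolution described above can be reproduced with a few lines of NumPy/SciPy by iterating Eqn (37); the evaluation grid and the number of steps T below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Parameters from the numerical example: pi1=0.3, pi2=0.7, mu=(-2,2), sigma=(0.2,1), alpha_t=0.97.
pi, mu, var = np.array([0.3, 0.7]), np.array([-2.0, 2.0]), np.array([0.2, 1.0]) ** 2
alpha, T = 0.97, 100
x = np.linspace(-5, 5, 501)

for t in range(1, T + 1):
    # Recursion of Eqn (37): means shrink by sqrt(alpha), variances mix with (1 - alpha).
    mu = np.sqrt(alpha) * mu
    var = alpha * var + (1 - alpha)
    p_t = sum(pi[k] * norm.pdf(x, loc=mu[k], scale=np.sqrt(var[k])) for k in range(2))
    # p_t is one slice of the plot described above; as t grows it approaches N(0, 1).
```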
Proof of Eqn (37). For those who would like to understand how we derive the probability density of
a mixture model in Eqn (37), we can show a simple derivation. Consider a mixture model
p(x) = ∑_{k=1}^K πk N(x | µk, σk² I),
where the k-th component is p(x|k). If we consider a new variable y = √α x + √(1 − α) ϵ, where ϵ ∼ N(0, I),
then the distribution of y can be derived by using the law of total probability:
p(y) = ∑_{k=1}^K p(y|k) p(k) = ∑_{k=1}^K πk p(y|k).
Since y|k = √α x|k + √(1 − α) ϵ is a linear combination of a (conditioned) Gaussian random variable
x|k and another Gaussian random variable ϵ, the sum y|k will remain a Gaussian. The mean and
variance are
E[y|k] = √α E[x|k] + √(1 − α) E[ϵ] = √α µk,
Var[y|k] = α Var[x|k] + (1 − α) Var[ϵ] = α σk² + (1 − α),
so p(y|k) = N(y | √α µk, (α σk² + 1 − α) I). Applying this to each forward transition gives Eqn (37).
To understand why the scaling factors are √αt and √(1 − αt), consider a generic recursion
xt = a xt−1 + b ϵt−1, with ϵt−1 ∼ N(0, I). Unrolling it gives
xt = a xt−1 + b ϵt−1
= a (a xt−2 + b ϵt−2) + b ϵt−1        (substitute xt−1 = a xt−2 + b ϵt−2)
= a² xt−2 + ab ϵt−2 + b ϵt−1        (regroup terms)
⋮
= aᵗ x0 + b ( ϵt−1 + a ϵt−2 + a² ϵt−3 + . . . + aᵗ⁻¹ ϵ0 ),    (41)
where we define wt ≜ b ( ϵt−1 + a ϵt−2 + a² ϵt−3 + . . . + aᵗ⁻¹ ϵ0 ).
The finite sum above is a sum of independent Gaussian random variables. The mean vector E[wt]
remains zero because every term has a zero mean. The covariance matrix (for a zero-mean vector) is
Cov[wt] ≜ E[wt wtᵀ] = b² ( Cov(ϵt−1) + a² Cov(ϵt−2) + . . . + (aᵗ⁻¹)² Cov(ϵ0) )
= b² ( 1 + a² + a⁴ + . . . + a^(2(t−1)) ) I
= b² · ( 1 − a^(2t) ) / ( 1 − a² ) · I.
As t → ∞, aᵗ → 0 for any a between 0 and 1. Therefore, in the limit,
lim_{t→∞} Cov[wt] = b² / (1 − a²) · I.
If we want the variance of wt to stay at I in this limit (so that the overall magnitude neither explodes
nor vanishes), we need
1 = b² / (1 − a²),
or equivalently b = √(1 − a²). Now, if we let a = √α, then b = √(1 − α). This will give us
xt = √α xt−1 + √(1 − α) ϵt−1.    (42)
Distribution qϕ(xt|x0). With the understanding of the magical scalars, we can talk about the distri-
bution qϕ(xt|x0). That is, we want to know how xt will be distributed if we are given x0.
Theorem 2.2. (Conditional Distribution qϕ(xt|x0)). The conditional distribution qϕ(xt|x0) is
given by
qϕ(xt|x0) = N(xt | √ᾱt x0, (1 − ᾱt) I),    (43)
where ᾱt = ∏_{i=1}^t αi.
Proof. To see how Eqn (43) is derived, we can re-do the recursion, but this time we use √αt xt−1 and
(1 − αt) I as the mean and covariance, respectively. This will give us
xt = √αt xt−1 + √(1 − αt) ϵt−1
= √αt ( √αt−1 xt−2 + √(1 − αt−1) ϵt−2 ) + √(1 − αt) ϵt−1
= √(αt αt−1) xt−2 + [ √αt √(1 − αt−1) ϵt−2 + √(1 − αt) ϵt−1 ],    (44)
where we call the bracketed term w1. Therefore, we have a sum of two Gaussians. Since the sum of two
independent zero-mean Gaussians remains a Gaussian, we can just calculate its new covariance (because
the mean remains zero). The new covariance is
E[w1 w1ᵀ] = [ ( √αt √(1 − αt−1) )² + ( √(1 − αt) )² ] I = [ αt (1 − αt−1) + 1 − αt ] I = [ 1 − αt αt−1 ] I.
Returning to Eqn (44), we can show that the recursion becomes a linear combination of xt−2 and a
noise vector ϵt−2:
xt = √(αt αt−1) xt−2 + √(1 − αt αt−1) ϵt−2
= √(αt αt−1 αt−2) xt−3 + √(1 − αt αt−1 αt−2) ϵt−3
⋮
= √( ∏_{i=1}^t αi ) x0 + √( 1 − ∏_{i=1}^t αi ) ϵ0.    (45)
So, if we define ᾱt = ∏_{i=1}^t αi, we can show that
xt = √ᾱt x0 + √(1 − ᾱt) ϵ0.    (46)
The utility of the new distribution qϕ(xt|x0) is that it gives a one-shot forward diffusion step, compared to
running the chain x0 → x1 → . . . → xT−1 → xT. Since we already know x0 and we assume that all subsequent
transitions are Gaussian, we will know xt for any t. The situation can be understood from Figure 10.
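In code, the one-shot step is a single reparameterized draw from Eqn (43). The sketch below assumes PyTorch and a simple linear schedule for αt; both are illustrative choices.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I), cf. Eqn (43).

    alpha_bar is a length-T tensor of cumulative products, alpha_bar[t] = prod_{i<=t} alpha_i.
    """
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps, eps

# Example schedule: each alpha_t is close to 1, so alpha_bar decays slowly toward 0.
T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(alphas, dim=0)
```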
Example 2.2. For a Gaussian mixture model such that x0 ∼ p0(x) = ∑_{k=1}^K πk N(x | µk, σk² I), we can
show that the distribution at time t is
xt ∼ pt(x) = ∑_{k=1}^K πk N(x | √ᾱt µk, (1 − ᾱt) I + ᾱt σk² I)    (48)
= ∑_{k=1}^K πk N(x | √(αᵗ) µk, (1 − αᵗ) I + αᵗ σk² I),   if αt = α so that ᾱt = ∏_{i=1}^t α = αᵗ.
If you are curious about how the probability distribution pt evolves over time t, we can visualize
the trajectory of a Gaussian mixture distribution we discussed in Example 2.1. We use Eqn (48) to plot
the heatmap. You can see that when t = 0, the initial distribution is a mixture of two Gaussians. As
we progress by following the transition defined in Eqn (48), we can see that the distribution gradually
becomes the single Gaussian N (0, I).
In the same plot, we overlay and show a few instantaneous trajectories of the random samples xt
as a function of time t. The equation we used to generate the samples is
xt = √αt xt−1 + √(1 − αt) ϵ,   ϵ ∼ N(0, I).
As you can see, the trajectories of xt more or less follow the distribution pt (x).
A confusing point for many readers is that if the goal is to convert from an image distribution p(x0) to white
noise p(xT), what is the point of deriving qϕ(xt|xt−1) and qϕ(xt|x0)? The answer is that so far we have only
been talking about the forward process. In a diffusion model, the forward process is chosen such that it
can be expressed in closed form. The more interesting part is the reverse process. As will be discussed, the
reverse process is realized through a chain of denoising operations. Each denoising step should be coupled
with the corresponding step in the forward process. qϕ(xt|x0) just provides us a slightly more convenient
way to implement the forward process.
If you are a casual reader of this tutorial, we hope that this equation does not throw you off. While it
appears to be a monster, it does have structure. We just need to be patient when we try to understand it.
Reconstruction (Initial Block). Let’s first look at the term
E_{qϕ(x1|x0)}[ log pθ(x0|x1) ].
This term is based on the initial block and it is analogous to Eqn (10). The subject inside the expectation is
the log-likelihood log pθ (x0 |x1 ). This log-likelihood measures how good the neural network (associated with
pθ ) can recover x0 from the latent variable x1 .
The expectation is taken with respect to the samples drawn from qϕ(x1|x0). Recall that qϕ(x1|x0) is the
distribution that generates x1. We require x1 to be drawn from this distribution because x1 does not come
from the sky but is created by the forward transition qϕ(x1|x0). The conditioning on x0 is needed here because
we need to know what the original image is.
The reason why expectation is used here is that pθ (x0 |x1 ) is a function of x1 (and so if x1 is random then
pθ (x0 |x1 ) is random too). For a different intermediate state x1 , the probability pθ (x0 |x1 ) will be different.
The expectation eliminates the dependency on x1 .
Prior Matching (Final Block). The next term is the prior matching term E_{qϕ(xT−1|x0)}[ DKL( qϕ(xT|xT−1) ‖ p(xT) ) ],
and it is based on the final block. We use the KL divergence to measure the difference between qϕ(xT|xT−1)
and p(xT ). The distribution qϕ (xT |xT −1 ) is the forward transition from xT −1 to xT . This describes how
xT is generated. The second distribution is p(xT ). Because of our laziness, we assume that p(xT ) = N (0, I).
We want qϕ (xT |xT −1 ) to be as close to N (0, I) as possible.
When computing the KL-divergence, the variable xT is a dummy variable. However, since qϕ is condi-
tioned on xT −1 , the KL-divergence calculated here is a function of the conditioned variable xT −1 . Where
does xT −1 come from? It is generated by qϕ (xT −1 |x0 ). We use a conditional distribution qϕ (xT −1 |x0 )
because xT −1 depends on what x0 we use in the first place. The expectation over qϕ (xT −1 |x0 ) says that
for each of the xT −1 generated by qϕ (xT −1 |x0 ), we will have a value of the KL divergence. We take the
expectation over all the possible xT −1 generated to eliminate the dependency.
Proof of Theorem 2.3. Let’s define the following notation: x0:T = {x0 , . . . , xT } means the collection
of all state variables from t = 0 to t = T . We also recall that the prior distribution p(x) is the
distribution for the image x0 . So it is equivalent to p(x0 ). With these in mind, we can show that
log p(x) = log ∫ p(x0:T) dx1:T        (marginalize over x1:T)
= log ∫ p(x0:T) · ( qϕ(x1:T|x0) / qϕ(x1:T|x0) ) dx1:T        (multiply and divide by qϕ(x1:T|x0))
= log ∫ ( p(x0:T) / qϕ(x1:T|x0) ) · qϕ(x1:T|x0) dx1:T        (rearrange terms)
= log E_{qϕ(x1:T|x0)}[ p(x0:T) / qϕ(x1:T|x0) ].        (definition of expectation)
Now, we need to use Jensen’s inequality, which states that for any random variable X and any concave
function f , it holds that f (E[X]) ≥ E[f (X)]. By recognizing that f (·) = log(·), we can show that
log p(x) = log E_{qϕ(x1:T|x0)}[ p(x0:T) / qϕ(x1:T|x0) ] ≥ E_{qϕ(x1:T|x0)}[ log ( p(x0:T) / qϕ(x1:T|x0) ) ].    (52)
Let’s take a closer look at p(x0:T ). Inspecting Figure 7, we notice that if we want to decouple p(x0:T ),
we should do conditioning for xt−1 |xt . This leads to:
p(x0:T) = p(xT) ∏_{t=1}^T p(xt−1|xt) = p(xT) p(x0|x1) ∏_{t=2}^T p(xt−1|xt).    (53)
As for qϕ (x1:T |x0 ), Figure 7 suggests that we need to do the conditioning for xt |xt−1 . However, because
of the sequential relationship, we can write
qϕ(x1:T|x0) = ∏_{t=1}^T qϕ(xt|xt−1) = qϕ(xT|xT−1) ∏_{t=1}^{T−1} qϕ(xt|xt−1).    (54)
Substituting Eqn (53) and Eqn (54) back into Eqn (52), we can show that
log p(x) ≥ E_{qϕ(x1:T|x0)}[ log ( p(x0:T) / qϕ(x1:T|x0) ) ]
= E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) ∏_{t=2}^T p(xt−1|xt) ) / ( qϕ(xT|xT−1) ∏_{t=1}^{T−1} qϕ(xt|xt−1) ) ]
= E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) ∏_{t=1}^{T−1} p(xt|xt+1) ) / ( qϕ(xT|xT−1) ∏_{t=1}^{T−1} qϕ(xt|xt−1) ) ]        (shift t to t + 1)
= E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) / qϕ(xT|xT−1) ) ] + E_{qϕ(x1:T|x0)}[ log ∏_{t=1}^{T−1} ( p(xt|xt+1) / qϕ(xt|xt−1) ) ],        (split expectation)
where we used the fact that the conditioning x1:T |x0 is equivalent to x1 |x0 when the subject of interest
(i.e., log p(x0 |x1 )) only involves x0 and x1 .
The Prior Matching term is
E_{qϕ(x1:T|x0)}[ log ( p(xT) / qϕ(xT|xT−1) ) ] = E_{qϕ(xT,xT−1|x0)}[ log ( p(xT) / qϕ(xT|xT−1) ) ],
where we note that the conditional expectation can be simplified to samples xT and xT−1 only, because
log ( p(xT) / qϕ(xT|xT−1) ) only depends on xT and xT−1. For the expectation term, we note that xT−1 and xT
are conditionally independent given x0, because if we know x0, we can define xT−1 and xT directly
using Eqn (43). As a result, the joint expectation E_{qϕ(xT,xT−1|x0)} can be written as a product of two
expectations E_{qϕ(xT−1|x0)} E_{qϕ(xT|x0)}. This will give us
E_{qϕ(xT,xT−1|x0)}[ log ( p(xT) / qϕ(xT|xT−1) ) ] = E_{qϕ(xT−1|x0)}[ E_{qϕ(xT|x0)}[ log ( p(xT) / qϕ(xT|xT−1) ) ] ]
= −E_{qϕ(xT−1|x0)}[ DKL( qϕ(xT|xT−1) ‖ p(xT) ) ].
where again we use the fact that the expectation only needs xt−1, xt, and xt+1. Then, by using the same
conditional independence argument, we can show that
∑_{t=1}^{T−1} E_{qϕ(xt−1,xt,xt+1|x0)}[ log ( p(xt|xt+1) / qϕ(xt|xt−1) ) ] = ∑_{t=1}^{T−1} E_{qϕ(xt−1,xt+1|x0)}[ E_{qϕ(xt|x0)}[ log ( p(xt|xt+1) / qϕ(xt|xt−1) ) ] ]
= −∑_{t=1}^{T−1} E_{qϕ(xt−1,xt+1|x0)}[ DKL( qϕ(xt|xt−1) ‖ p(xt|xt+1) ) ].
By replacing p(x0 |x1 ) with pθ (x0 |x1 ) and p(xt |xt+1 ) with pθ (xt |xt+1 ), we are done.
Rewrite the Consistency Term. The nightmare of the above variational diffusion model is that
we need to draw samples (xt−1 , xt+1 ) from a joint distribution qϕ (xt−1 , xt+1 |x0 ). We don’t know what
qϕ (xt−1 , xt+1 |x0 ) is! It is a Gaussian by our choice, but we still need to use future samples xt+1 to draw the
current sample xt . This is odd.
Inspecting the consistency term, we notice that qϕ (xt |xt−1 ) and pθ (xt |xt+1 ) are moving along two
opposite directions. Thus, it is unavoidable that we need to use xt−1 and xt+1. The question we need to
ask is whether we can change the order of the conditioning. By the Bayes theorem,
q(xt−1|xt, x0) = q(xt|xt−1, x0) q(xt−1|x0) / q(xt|x0).    (55)
With this change of the conditioning order, we can switch q(xt|xt−1, x0) to q(xt−1|xt, x0) by adding one
more condition variable x0 . (If you do not condition on x0 , there is no way that we can draw samples from
q(xt−1 ), for example, because the specific state of xt−1 depends on the initial image x0 .) The direction
q(xt−1 |xt , x0 ) is now parallel to pθ (xt−1 |xt ) as shown in Figure 11. So, if we want to rewrite the consistency
term, a natural option is to calculate the KL divergence between qϕ (xt−1 |xt , x0 ) and pθ (xt−1 |xt ).
Figure 11: If we consider the Bayes theorem in Eqn (55), we can define a distribution qϕ (xt−1 |xt , x0 ) that
has a direction parallel to pθ (xt−1 |xt ).
If we manage to go through a few (boring) algebraic derivations, we can show that the ELBO is now:
Theorem 2.4. (ELBO for Variational Diffusion Model). Let x = x0, and xT ∼ N(0, I). The
ELBO for a variational diffusion model in Theorem 2.3 can be equivalently written as
ELBOϕ,θ(x) = E_{qϕ(x1|x0)}[ log pθ(x0|x1) ]        (reconstruction, same as before)
− DKL( qϕ(xT|x0) ‖ p(xT) )        (new prior matching)
− ∑_{t=2}^T E_{qϕ(xt|x0)}[ DKL( qϕ(xt−1|xt, x0) ‖ pθ(xt−1|xt) ) ].        (new consistency)    (56)
where the last equation uses the fact that for any sequence a1, . . . , aT, we have ∏_{t=2}^T (at−1/at) =
(a1/a2) × (a2/a3) × . . . × (aT−1/aT) = a1/aT. Going back to Eqn (57), we can see that
E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) / qϕ(x1|x0) ) ] + E_{qϕ(x1:T|x0)}[ log ∏_{t=2}^T ( p(xt−1|xt) / qϕ(xt|xt−1, x0) ) ]
= E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) / qϕ(x1|x0) ) + log ( qϕ(x1|x0) / qϕ(xT|x0) ) ] + E_{qϕ(x1:T|x0)}[ log ∏_{t=2}^T ( p(xt−1|xt) / qϕ(xt−1|xt, x0) ) ]
= E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) / qϕ(xT|x0) ) ] + E_{qϕ(x1:T|x0)}[ log ∏_{t=2}^T ( p(xt−1|xt) / qϕ(xt−1|xt, x0) ) ],
where we canceled qϕ(x1|x0) in the numerator and denominator since log(a/b) + log(b/c) = log(a/c) for any
positive constants a, b, and c. This will give us
E_{qϕ(x1:T|x0)}[ log ( p(xT) p(x0|x1) / qϕ(xT|x0) ) ] = E_{qϕ(x1:T|x0)}[ log p(x0|x1) ] + E_{qϕ(x1:T|x0)}[ log ( p(xT) / qϕ(xT|x0) ) ]
= E_{qϕ(x1|x0)}[ log p(x0|x1) ]   (reconstruction)   − DKL( qϕ(xT|x0) ‖ p(xT) )   (prior matching).
Finally, replace p(xt−1 |xt ) by pθ (xt−1 |xt ), and p(x0 |x1 ) by pθ (x0 |x1 ). Done!
where
µq(xt, x0) = [ √αt (1 − ᾱt−1) / (1 − ᾱt) ] xt + [ √ᾱt−1 (1 − αt) / (1 − ᾱt) ] x0,    (60)
Σq(t) = [ (1 − αt)(1 − ᾱt−1) / (1 − ᾱt) ] I ≜ σq²(t) I,    (61)
where ᾱt = ∏_{i=1}^t αi.
Eqn (60) reveals an interesting fact that the mean µq (xt , x0 ) is a linear combination of xt and x0 .
Geometrically, µq (xt , x0 ) lives on the straight line connecting xt and x0 , as illustrated in Figure 12.
Figure 12: According to Eqn (60), the mean µq (xt , x0 ) is a linear combination of xt and x0 .
Proof of Theorem 2.5. Using the Bayes theorem stated in Eqn (55), q(xt−1|xt, x0) can be determined
if we evaluate the following product of Gaussians:
q(xt−1|xt, x0) = N(xt | √αt xt−1, (1 − αt) I) · N(xt−1 | √ᾱt−1 x0, (1 − ᾱt−1) I) / N(xt | √ᾱt x0, (1 − ᾱt) I).    (62)
For simplicity we will treat the vectors as scalars. Then the above product of Gaussians becomes
q(xt−1|xt, x0) ∝ exp{ −[ (xt − √αt xt−1)² / (2(1 − αt)) + (xt−1 − √ᾱt−1 x0)² / (2(1 − ᾱt−1)) − (xt − √ᾱt x0)² / (2(1 − ᾱt)) ] }.    (63)
To identify the mean and variance, it is convenient to use the shorthand
x = xt,  a = αt,
y = xt−1,  b = ᾱt−1,
z = x0,  c = ᾱt.
Similarly, for the variance, we can check the curvature f″(y). We can easily show that
f″(y) = (1 − ab) / ((1 − a)(1 − b)) = (1 − ᾱt) / ((1 − αt)(1 − ᾱt−1)).
Figure 13: The trajectory of the coefficients for xt and for x0 , as t grows.
As t goes from T to 1, the variance σq²(t) will also change. Figure 14 shows the trajectory of xt as a
function of t, obtained by sampling xt according to qϕ(xt−1|xt, x0). On the same plot, we show the radius of the
Gaussian, defined by σq²(t). Our plot indicates that when t = T, the variance σq²(t) is fairly large, so that
xt is closer to white Gaussian noise. As t drops to t = 1, the variance σq²(t) also drops to zero. This makes
sense because eventually we want x0 to be the clean image, which is noise-free.
Constructing pθ (xt−1 |xt ). The interesting part of Eqn (59) is that qϕ (xt−1 |xt , x0 ) is completely char-
acterized by xt and x0 . There is no neural network required to estimate the mean and variance! (You can
compare this with VAE where a network is needed.) Since a network is not needed, there is really nothing
to “learn”. The distribution qϕ (xt−1 |xt , x0 ) is automatically determined if we know xt and x0 .
The realization here is important. Let’s look at the consistency term in Eqn (56):
ELBOϕ,θ(x) = E_{qϕ(x1|x0)}[ log pθ(x0|x1) ] − DKL( qϕ(xT|x0) ‖ p(xT) )
− ∑_{t=2}^T E_{qϕ(xt|x0)}[ DKL( qϕ(xt−1|xt, x0) ‖ pθ(xt−1|xt) ) ].        (from Eqn (56))
There is no “learning” for qϕ(xt−1|xt, x0) because it is defined once the hyperparameters αt are defined.
Therefore, the consistency term is a summation of many KL divergence terms, where the t-th term compares
qϕ(xt−1|xt, x0) against pθ(xt−1|xt). For pθ(xt−1|xt), we choose a Gaussian whose mean vector is determined
using a neural network. As for the variance, we choose the variance to be σq²(t), which is identical to Eqn
(61)! Thus, if we put Eqn (59) side by side with pθ(xt−1|xt), we notice a parallel relation between the two:
qϕ(xt−1|xt, x0) = N( xt−1 | µq(xt, x0), σq²(t) I ),    (70)   [both the mean and the variance are known]
pθ(xt−1|xt) = N( xt−1 | µθ(xt), σq²(t) I ).    (71)   [the mean is a neural network; the variance is known]
Theorem 2.6. The ELBO for a variational diffusion model in Eqn (56) can be simplified to
ELBOθ(x) = E_{q(x1|x0)}[ log pθ(x0|x1) ] − DKL( q(xT|x0) ‖ p(xT) )        (nothing to train)
− ∑_{t=2}^T E_{q(xt|x0)}[ (1/(2σq²(t))) ‖ µq(xt, x0) − µθ(xt) ‖² ],    (73)
One remark for Theorem 2.6 is that the subscript ϕ is dropped because the distribution qϕ defined in
Theorem 2.5 is fully characterized by xt and x0. There is nothing to learn, and so the optimization does
not need to include ϕ. Because of this, we can drop the KL-divergence term in Eqn (73). This leaves us
with the reconstruction term E_{q(x1|x0)}[ log pθ(x0|x1) ] and the transition term
∑_{t=2}^T E_{q(xt|x0)}[ (1/(2σq²(t))) ‖ µq(xt, x0) − µθ(xt) ‖² ].
The reconstruction term can be simplified, since
log pθ(x0|x1) = log N(x0 | µθ(x1), σq²(1) I)
= log [ ( 1/√((2πσq²(1))ᵈ) ) exp( −‖x0 − µθ(x1)‖² / (2σq²(1)) ) ].
So, as soon as we know x1 , we can send it to a network µθ (x1 ) to return us a mean estimate. The mean
estimate will then be used to compute the likelihood.
Substituting Eqn (75) and Eqn (76) into Eqn (74) will give us
(1/(2σq²(t))) ‖ µq(xt, x0) − µθ(xt) ‖² = (1/(2σq²(t))) ‖ [ √ᾱt−1 (1 − αt) / (1 − ᾱt) ] ( x̂θ(xt) − x0 ) ‖²
= (1/(2σq²(t))) [ (1 − αt)² ᾱt−1 / (1 − ᾱt)² ] ‖ x̂θ(xt) − x0 ‖².    (77)
Proof. Substituting Eqn (77) into Eqn (73), we can see that
ELBOθ(x) = E_{q(x1|x0)}[ log pθ(x0|x1) ] − ∑_{t=2}^T E_{q(xt|x0)}[ (1/(2σq²(t))) ‖ µq(xt, x0) − µθ(xt) ‖² ]
= E_{q(x1|x0)}[ log pθ(x0|x1) ] − ∑_{t=2}^T E_{q(xt|x0)}[ (1/(2σq²(t))) ( (1 − αt)² ᾱt−1 / (1 − ᾱt)² ) ‖ x̂θ(xt) − x0 ‖² ].    (80)
The first term is
log pθ(x0|x1) = log N(x0 | µθ(x1), σq²(1) I) ∝ −(1/(2σq²(1))) ‖ µθ(x1) − x0 ‖²        (definition)
= −(1/(2σq²(1))) ‖ [ √α1 (1 − ᾱ0) / (1 − ᾱ1) ] x1 + [ √ᾱ0 (1 − α1) / (1 − ᾱ1) ] x̂θ(x1) − x0 ‖²        (recall ᾱ0 = 1)
= −(1/(2σq²(1))) ‖ [ (1 − α1) / (1 − ᾱ1) ] x̂θ(x1) − x0 ‖²
= −(1/(2σq²(1))) ‖ x̂θ(x1) − x0 ‖².        (recall ᾱ1 = α1)    (81)
This is nothing but a denoising problem, because we need to find a network x̂θ such that the denoised image
x̂θ(xt) will be close to the ground truth x0. What makes it not a typical denoiser are the following reasons:
• E_{q(xt|x0)}: We are not trying to denoise any random noisy image. Instead, we are carefully choosing
the noisy image to be
xt ∼ q(xt|x0) = N(xt | √ᾱt x0, (1 − ᾱt) I)   ⇔   xt = √ᾱt x0 + √(1 − ᾱt) ϵt,   where ϵt ∼ N(0, I).
• (1/(2σq²(t))) (1 − αt)² ᾱt−1 / (1 − ᾱt)²: We do not weight the denoising loss equally for all steps. Instead,
there is a scheduler to control the relative emphasis on each denoising loss. Considering this, and using
Monte Carlo to approximate the expectation, we can write the optimization problem as
argmax_θ ∑_{x0∈X} ELBO(x0)
= argmin_θ ∑_{x0∈X} ∑_{t=1}^T (1/(2σq²(t))) ( (1 − αt)² ᾱt−1 / (1 − ᾱt)² ) E_{q(xt|x0)}[ ‖ x̂θ( √ᾱt x0 + √(1 − ᾱt) ϵt ) − x0 ‖² ]
= argmin_θ ∑_{x0∈X} ∑_{t=1}^T (1/M) ∑_{m=1}^M (1/(2σq²(t))) ( (1 − αt)² ᾱt−1 / (1 − ᾱt)² ) ‖ x̂θ( √ᾱt x0 + √(1 − ᾱt) ϵt(m) ) − x0 ‖²,    (82)
where ϵt(m) ∼ N(0, I), and the summation over x0 ∈ X means we consider all samples in the training
set X. As you can see, the training of this model involves training a denoiser x̂θ(·). For this reason,
the resulting model is known as the denoising diffusion probabilistic model (DDPM).
Forward Diffusion in DDPM. The training of a DDPM involves two parallel branches. The first
branch is the forward diffusion. The goal of forward diffusion is to generate the intermediate variables
x1 , . . . , xT −1 by using
xt ∼ q(xt|x0) = N(xt | √ᾱt x0, (1 − ᾱt) I),   t = 1, . . . , T − 1.
The forward diffusion does not require any training. If you give us the clean image x0 , we can run the
forward diffusion and prepare the images x1 , . . . , xT . A pictorial illustration is shown in Figure 15.
Figure 16: Training of a denoising diffusion probabilistic model. For the same neural network x̂θ, we send
noisy inputs xt to the network. The gradient of the loss is back-propagated to update the network. Note
that the noisy images are not arbitrary. They are generated according to the forward sampling process.
The training of a denoiser is no different from any conventional supervised learning. Given a pair of
clean and noisy images, which in our case is (x0, xt), we train the denoiser x̂θ(·). The training loss in Eqn
(82) has three summations. If we run stochastic gradient descent, we can simplify the above optimization
into the following procedure.
Training Algorithm for DDPM. For every image x0 in your training dataset, repeat the following until
convergence: pick a random time stamp t, draw a sample xt ∼ N(xt | √ᾱt x0, (1 − ᾱt) I), and take a gradient
descent step on the denoising loss ‖x̂θ(xt) − x0‖².
You can do this in batches, just like how you train any other neural network. Note that, here, you are
training one denoising network x̂θ for all noisy conditions. A minimal sketch of one such training step is
given below.
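This sketch assumes PyTorch, a denoiser denoiser(xt, t) that predicts x0, and unit loss weights (the scheduler weights in Eqn (82) are dropped for simplicity); these are illustrative assumptions, not the only valid choices.

```python
import torch

def ddpm_training_step(x0, denoiser, alpha_bar, optimizer):
    """One stochastic-gradient step of the DDPM loss in Eqn (82) (unit weights, M = 1)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (1,)).item()                   # pick a random timestep
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    xt = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps    # forward sampling, Eqn (43)
    loss = ((denoiser(xt, t) - x0) ** 2).mean()            # denoiser predicts x0 from xt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```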
Inference of DDPM – the Reverse Diffusion. Once the denoiser x̂θ is trained, we can apply it
to do the inference. The inference is about sampling images from the distributions pθ(xt−1|xt) over the
sequence of states xT, xT−1, . . . , x1. Since it is the reverse diffusion process, we need to do it recursively via
xt−1 ∼ pθ(xt−1|xt) = N( xt−1 | µθ(xt), σq²(t) I ),   t = T, T − 1, . . . , 1.
By reparameterization, we have the following procedure.
Inference of DDPM.
• You give us a white noise vector xT ∼ N (0, I).
• Repeat the following for t = T, T − 1, . . . , 1.
• We calculate x̂θ(xt) using our trained denoiser.
• Update according to
xt−1 = [ √αt (1 − ᾱt−1) / (1 − ᾱt) ] xt + [ √ᾱt−1 (1 − αt) / (1 − ᾱt) ] x̂θ(xt) + σq(t) ϵ,   ϵ ∼ N(0, I).    (83)
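A hedged implementation sketch of this reverse recursion, assuming PyTorch and a denoiser denoiser(x, t) that returns x̂θ(xt); the coefficients follow Eqn (83), σq(t) follows Eqn (61), and the convention ᾱ0 = 1 is used for the final step.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, alphas, alpha_bar, shape):
    """Reverse diffusion following the update rule in Eqn (83)."""
    T = alphas.shape[0]
    x = torch.randn(shape)                                             # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        a_t, ab_t = alphas[t], alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)     # convention: alpha_bar_0 = 1
        x0_hat = denoiser(x, t)                                        # trained denoiser x_hat_theta(x_t)
        mean = ((1 - ab_prev) * torch.sqrt(a_t) * x
                + (1 - a_t) * torch.sqrt(ab_prev) * x0_hat) / (1 - ab_t)
        sigma_q = torch.sqrt((1 - a_t) * (1 - ab_prev) / (1 - ab_t))   # Eqn (61)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + sigma_q * noise
    return x
```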
We remark that this ELBO is a function of x0 and ϵ0 . Therefore, to train the denoiser, we need to solve the
optimization
where the superscript denotes the m-th initial noise term. If we run stochastic gradient descent, we can
simplify the above description to the following procedure.
There is no particularly strong physical meaning for this choice, except that it makes the notation simpler.
With this choice, the product term simplifies to ∏_{i=1}^t (αi / αi−1) = αt, assuming that α0 = 1. Therefore,
q(xt|x0) = N( xt | √αt x0, (1 − αt) I ).    (88)
Expressing Eqn (88) by means of reparametrization, we note that xt in Eqn (88) can be represented in
terms of x0 as follows
xt = √αt x0 + √(1 − αt) ϵ,   where ϵ ∼ N(0, I).
So here comes an interesting trick. Let’s replace ϵ by something so that xt−1 is no longer x0 perturbed by
white noise. Perhaps we can consider the following derivation
xt = √αt x0 + √(1 − αt) ϵ
⟹ √(1 − αt) ϵ = xt − √αt x0
⟹ ϵ = ( xt − √αt x0 ) / √(1 − αt).
The difference between Eqn (90) and Eqn (89) is that in Eqn (89), the noise term is ϵ which is N (0, I).
It is this Gaussian that makes the derivations of DDPM easy, but it is also this Gaussian that makes the
reverse diffusion slow. In contrast, Eqn (90) replaces the Gaussian by an estimate. This estimate uses the
previous signal xt combined with the initial signal x0. Of course, one can argue that DDPM (e.g., Eqn
(59)) also uses a combination of xt and x0. The difference is that the combination in Eqn (90) allows us to
do something that Eqn (59) does not: it lets us derive the marginal distribution q(xt−1|x0) and shape it
into a desired form.
Let’s elaborate more on the marginal distribution. Referring to Eqn (90), we notice that we can choose
q(xt−1|xt, x0) = N( √αt−1 x0 + √(1 − αt−1) · ( xt − √αt x0 ) / √(1 − αt),  something ),
where “something” stands for the variance of the Gaussian, which can be made σt² I for some hyperpa-
rameter σt. One important (likely the most important) argument in DDIM is that we want the marginal
distribution q(xt−1|x0) to have the same form as q(xt|x0):
q(xt−1|x0) = N( √αt−1 x0, (1 − αt−1) I ).
The reason for aiming for this distribution is that ultimately we care about the marginal distribution q(xt|x0),
which we want to become pure white noise when t = T and the original image when t = 0. Therefore,
while we can have millions of different choices of the transition distribution q(xt−1|xt, x0), only some very
specialized transition probabilities can ensure that q(xt−1|x0) takes a form we like.
Derivation of the Transition Distribution. With this goal in mind, we now state our mathematical
problem. Suppose that
$$q(x_t|x_0) = \mathcal{N}\big(\sqrt{\alpha_t}\, x_0,\ (1-\alpha_t)I\big),$$
$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\left(\sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1-\alpha_{t-1}}\cdot\frac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\right), \tag{91}$$
can we ensure that $q(x_{t-1}|x_0) = \mathcal{N}\big(\sqrt{\alpha_{t-1}}\, x_0,\ (1-\alpha_{t-1})I\big)$? If not, what additional changes do we need?
The answer to this mathematical question requires some tools from textbooks. We recall the following
result from Bishop’s textbook [4].
Theorem 2.8. Bishop [4, Eqn 2.115] Suppose that we have two random variables x and y following the distributions $p(x) = \mathcal{N}(x\,|\,\mu, \Lambda^{-1})$ and $p(y|x) = \mathcal{N}(y\,|\,Ax + b,\ L^{-1})$. Then the marginal distribution of y is
$$p(y) = \mathcal{N}\big(y\,\big|\,A\mu + b,\ L^{-1} + A\Lambda^{-1}A^T\big).$$
Let’s see how we can apply this result to our problem. Looking at Eqn (91), we can identify the following
qualities:
$$A = \sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_t}}, \qquad \mu = \sqrt{\alpha_t}\, x_0, \qquad b = \sqrt{\alpha_{t-1}}\, x_0 - \sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_t}}\cdot\sqrt{\alpha_t}\, x_0.$$
$$\begin{aligned}
&= \mathcal{N}\Bigg(\sqrt{\alpha_{t-1}}\cdot\frac{1}{\sqrt{\alpha_t}}\Big(x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)\Big) + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\alpha_t}\cdot\frac{1}{\sqrt{\alpha_t}}\big(x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)\big)}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\Bigg)\\
&= \mathcal{N}\Bigg(\sqrt{\alpha_{t-1}}\cdot\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t),\ \sigma_t^2 I\Bigg). \tag{93}
\end{aligned}$$
For the special case where $t = 1$, we define $p_\theta(x_0|x_1) = \mathcal{N}\big(f_\theta^{(1)}(x_1),\ \sigma_1^2 I\big)$ so that the reverse process is supported everywhere. Looking at Eqn (93), we use reparametrization to write it as follows and interpret the equation according to [39].
$$\text{(DDIM)}\quad x_{t-1} = \sqrt{\alpha_{t-1}}\underbrace{\left(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted }x_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t)}_{\text{direction pointing to }x_t} + \underbrace{\sigma_t\,\epsilon_t}_{\epsilon_t\sim\mathcal{N}(0,I)}. \tag{94}$$
It would be helpful to compare this equation with the DDPM equation in Eqn (87):
$$\text{(DDPM)}\quad x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\,\epsilon_\theta^{(t)}(x_t)\right) + \sigma_t\,\epsilon_t, \qquad \epsilon_t\sim\mathcal{N}(0,I). \tag{95}$$
The main difference between DDPM and DDIM is subtle. While they both use $x_t$ and $\epsilon_\theta^{(t)}(x_t)$ in their updates, the specific update formulas lead to different convergence speeds. In fact, later in the differential equation literature, where people connect DDIM and DDPM with stochastic differential equations, it was observed that DDIM employs a special accelerated first-order numerical scheme when solving the differential equation.
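To see the structural difference, here is a sketch of a single DDIM update in the sense of Eqn (94). The noise predictor `eps_model(x, t)`, the cumulative schedule array `alpha`, and the choice of `t_prev` are assumptions for illustration; setting `sigma_t = 0` gives the deterministic DDIM update, and because the formula only involves $\alpha_t$ and $\alpha_{t_\text{prev}}$, `t_prev` need not be the immediately preceding index, which is one way the sampler can be accelerated.

```python
# A sketch of one DDIM update (Eqn (94)); not a definitive implementation.
import numpy as np

def ddim_step(x_t, t, t_prev, eps_model, alpha, sigma_t=0.0, rng=np.random.default_rng()):
    eps = eps_model(x_t, t)
    x0_pred = (x_t - np.sqrt(1 - alpha[t]) * eps) / np.sqrt(alpha[t])   # "predicted x0"
    direction = np.sqrt(1 - alpha[t_prev] - sigma_t**2) * eps           # "direction pointing to x_t"
    noise = sigma_t * rng.standard_normal(x_t.shape)
    return np.sqrt(alpha[t_prev]) * x0_pred + direction + noise
```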
Figure 18: If our goal is to generate a fruit image, then we should expect the underlying distribu-
tion p(x) will have a higher value for those “normal-looking” fruit images than those “weird-looking”
images. Therefore, to sample from a distribution, it is more natural to pick a sample from a higher
position of the distribution. The fruit image is taken from https://cognitiveseo.com/blog/4224/
unnatural-links-definition-examples/
Therefore, in a non-rigorous way, we can argue that if we are given p(x), we should aim to draw samples
from a location where p(x) has a high value. This idea of searching for a higher probability can be translated
to an optimization
$$x^* = \operatorname*{argmax}_{x}\ \log p(x),$$
where the goal is to maximize the log-likelihood of the distribution p(x). Certainly, this maximization does not provide any clue on how to draw a low-probability sample, which we will explain through the lens of the Langevin equation. For now, we want to make a remark about the difference between the maximization here and maximum likelihood estimation. In maximum likelihood, the data point x is fixed but the model parameters are changing. Here, the model parameters are fixed but the data point is changing. We are given a fixed model, and our goal is to draw the most likely sample from this model. The table below shows the difference between our sampling problem and maximum likelihood estimation.
Let’s continue our argument about the maximization. If p(x) is a simple parameteric model, the max-
imization will have analytic solutions. However, in general, optimizations in a high-dimensional space are
ill-posed with many local minima. Therefore, there is no single algorithm that is globally converging in
where ∇x log p(xt ) denotes the gradient of log p(x) evaluated at xt , and τ is the step size. Here we use “+”
instead of the typical “−” because we are solving a maximization problem.
If you agree with the above approach, we can now provide an informal introduction to the Langevin equation. Without worrying too much about its roots in physics, we can treat the Langevin equation as an iterative procedure that allows us to draw samples.
Definition 3.1. The (discrete-time) Langevin equation for sampling from a known distribution p(x)
is an iterative procedure for t = 1, . . . , T :
$$x_{t+1} = x_t + \tau\,\nabla_x\log p(x_t) + \sqrt{2\tau}\, z, \qquad z\sim\mathcal{N}(0,I), \tag{96}$$
where τ is the step size which users can control, and x0 is white noise.
Example 3.1. Consider a Gaussian distribution $p(x) = \mathcal{N}(x\,|\,\mu, \sigma^2)$. We can show that the Langevin equation is
$$\begin{aligned}
x_{t+1} &= x_t + \tau\cdot\nabla_x\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_t-\mu)^2}{2\sigma^2}}\right) + \sqrt{2\tau}\, z\\
&= x_t - \tau\cdot\frac{x_t-\mu}{\sigma^2} + \sqrt{2\tau}\, z, \qquad z\sim\mathcal{N}(0,1),
\end{aligned}$$
where the initial state can be set as $x_0\sim\mathcal{N}(0,1)$.
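The example above can be simulated in a few lines. The following NumPy sketch runs the recursion with illustrative values of µ, σ and τ; after a burn-in period the collected values should have a mean near µ and a standard deviation near σ.

```python
# Langevin sampling from N(mu, sigma^2) as in Example 3.1 (illustrative parameters).
import numpy as np

mu, sigma, tau, T = 1.0, 2.0, 0.05, 5000
rng = np.random.default_rng(0)
x = rng.standard_normal()                     # x_0 ~ N(0, 1)
samples = []
for _ in range(T):
    score = -(x - mu) / sigma**2              # grad_x log p(x)
    x = x + tau * score + np.sqrt(2 * tau) * rng.standard_normal()
    samples.append(x)
print(np.mean(samples[1000:]), np.std(samples[1000:]))   # roughly (mu, sigma)
```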
If we ignore the noise term $\sqrt{2\tau}\,z$, the Langevin equation in Eqn (96) is exactly gradient descent, but for a particular function, namely the log-likelihood of the random variable x. Therefore, while gradient descent is a generic first-order optimization algorithm for any objective function, the Langevin equation, when used in the context of generative models, focuses on the distribution. Gradient descent plus a small noise perturbation gives us a simple summary.
In generative models,
Langevin equation = gradient descent + noise
But why do we want gradient descent plus noise instead of plain gradient descent? One interpretation is that we are not interested in solving the optimization problem; instead, we are interested in sampling from a distribution. By introducing random noise into the gradient descent step, we randomly pick a sample that follows the objective function's trajectory while not staying where it is. If we are close to the peak, we will move left and right slightly. If we are far from the peak, the gradient direction will pull us towards the peak. If the curvature around the peak is sharp, most of the steady-state points xT will concentrate there. If the curvature around the peak is flat, the points will spread around. Therefore, by repeatedly initializing the gradient descent (plus noise) algorithm at uniformly distributed locations, we will eventually collect samples that follow the distribution we designate.
A slightly more formal way to justify the Langevin equation is the Fokker-Planck equation. The Fokker-Planck equation is a fundamental result about stochastic processes. For Markovian processes (e.g., the Wiener process and Brownian motion), the dynamics of the solution xt are described by a stochastic differential equation (i.e., a Langevin equation). However, since xt is a random variable at any time t, there is a probability distribution describing xt at every t.
Theorem 3.1. In the context of our problem, the solution xt of the Langevin equation will have a probability distribution p(x, t) at time t satisfying the Fokker-Planck Equation
$$\partial_t\, p(x,t) = -\partial_x\Big\{\big[\partial_x\log p(x)\big]\, p(x,t)\Big\} + \partial_x^2\, p(x,t). \tag{97}$$
Deriving the Fokker-Planck equation will take a tremendous amount of effort. However, if we are given a
candidate solution, verifying whether it satisfies the Fokker-Planck equation is not hard.
Verification of Theorem. Suppose that we have run the Langevin equation long enough that we have reached a converging solution xt as t → ∞. We argue that this limiting distribution is p(x). Indeed, we can show that
$$\partial_x\big\{\log p(x)\big\} = \frac{\partial_x\, p(x)}{p(x)},$$
so the first term of Eqn (97) becomes $-\partial_x\big\{\frac{\partial_x p(x)}{p(x)}\cdot p(x)\big\} = -\partial_x^2\, p(x)$, which cancels the second term. On the other hand, when t → ∞, it holds that ∂t p(x) = 0. Therefore, the Fokker-Planck equation is verified.
Example 3.2. Consider a Gaussian mixture p(x) = π1 N (x | µ1 , σ12 )+π2 N (x | µ2 , σ22 ). We can calculate
the gradient ∇x log p(x) analytically or numerically. For demonstration, we choose π1 = 0.6, µ1 = 2, σ1 = 0.5, π2 = 0.4, µ2 = −2, σ2 = 0.2. We initialize x0 = 0 and choose τ = 0.05. We run the above gradient descent iteration for T = 500 steps, and we plot the trajectory of the values p(xt) for t = 1, . . . , T. As we can see in the figure below, the sequence {x1, x2, . . . , xT} simply follows the shape of the Gaussian and climbs to one of the peaks.
What is more interesting is when we add the noise term. Instead of landing at the peak, the
sequence xt moves around the peak and finishes somewhere near the peak. (Remark: To terminate the
algorithm, we can gradually make τ smaller or we can early stop.)
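A short NumPy sketch of this experiment is given below. It contrasts the pure gradient iteration with the Langevin iteration using the parameters quoted in the example; the score is computed by finite differences here purely for brevity (the analytic form appears later in Example 3.5).

```python
# Gradient-only versus Langevin iterations on the two-component Gaussian mixture.
import numpy as np

pi1, mu1, s1 = 0.6, 2.0, 0.5
pi2, mu2, s2 = 0.4, -2.0, 0.2

def p(x):
    g1 = pi1 * np.exp(-(x - mu1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
    g2 = pi2 * np.exp(-(x - mu2)**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
    return g1 + g2

def score(x, h=1e-4):
    # finite-difference approximation of grad_x log p(x)
    return (np.log(p(x + h)) - np.log(p(x - h))) / (2 * h)

tau, T = 0.05, 500
rng = np.random.default_rng(1)
x_gd, x_lang = 0.0, 0.0
for _ in range(T):
    x_gd = x_gd + tau * score(x_gd)                                    # climbs to a peak and stays
    x_lang = x_lang + tau * score(x_lang) + np.sqrt(2 * tau) * rng.standard_normal()
print(x_gd, x_lang)   # x_lang wanders around a mode instead of landing exactly on it
```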
$$x_{t+1} = x_t + \tau\,\nabla_x\log p(x_t) \qquad\qquad x_{t+1} = x_t + \tau\,\nabla_x\log p(x_t) + \sqrt{2\tau}\, z$$
Figure 19: Trajectory of sample evolutions using the Langevin dynamics. We colored the two modes of the
Gaussian mixture in different colors for better visualization. The setting here is identical to the example
above, except that the step size is τ = 0.001.
Example 3.3. Following the previous example we again consider a Gaussian mixture
We should be careful not to confuse Stein’s score function with the ordinary score function which is
defined as
$$s_x(\theta) \overset{\text{def}}{=} \nabla_\theta\log p_\theta(x). \tag{99}$$
The ordinary score function is the gradient (with respect to θ) of the log-likelihood. In contrast, Stein's score function is the gradient with respect to the data point x. Maximum likelihood estimation uses the ordinary score function, whereas Langevin dynamics uses Stein's score function. However, since most people in the diffusion literature call Stein's score function simply the score function, we follow this convention.
Example 3.4. If p(x) is a Gaussian with $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, then
$$s(x) = \nabla_x\log p(x) = -\frac{x-\mu}{\sigma^2}.$$
Example 3.5. If p(x) is a Gaussian mixture with $p(x) = \sum_{i=1}^{N}\pi_i\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$, then
$$s(x) = \nabla_x\log p(x) = -\frac{\displaystyle\sum_{j=1}^{N}\pi_j\frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x-\mu_j)^2}{2\sigma_j^2}}\cdot\frac{x-\mu_j}{\sigma_j^2}}{\displaystyle\sum_{i=1}^{N}\pi_i\frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}}.$$
The probability density function and the corresponding score function of the above two examples are
shown in Figure 20.
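As a quick sanity check of the mixture expression in Example 3.5, the analytic score can be compared against a finite-difference gradient of log p(x). The mixture parameters below are illustrative.

```python
# Analytic Gaussian-mixture score versus a numerical gradient of log p(x).
import numpy as np

pis  = np.array([0.6, 0.4])
mus  = np.array([2.0, -2.0])
sigs = np.array([0.5, 0.2])

def p(x):
    comps = pis * np.exp(-(x - mus)**2 / (2 * sigs**2)) / np.sqrt(2 * np.pi * sigs**2)
    return comps.sum()

def score_analytic(x):
    comps = pis * np.exp(-(x - mus)**2 / (2 * sigs**2)) / np.sqrt(2 * np.pi * sigs**2)
    return -(comps * (x - mus) / sigs**2).sum() / comps.sum()

def score_numeric(x, h=1e-5):
    return (np.log(p(x + h)) - np.log(p(x - h))) / (2 * h)

for x in [-3.0, -1.0, 0.5, 2.5]:
    print(x, score_analytic(x), score_numeric(x))   # the two columns should match closely
```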
Geometric Interpretations of the Score Function. The way to understand the score function is to remember that it is the gradient with respect to the data x. For any high-dimensional distribution p(x), the gradient gives us a vector field. There are a few useful interpretations of the score function:
Figure 21: The contour map of the score function, and the corresponding trajectory of two samples.
where h is just some hyperparameter for the kernel function K(·), and x(m) is the m-th sample in the training set. Figure 22 illustrates the idea of kernel density estimation. In the cartoon figure shown on the left, we show multiple kernels K(·) centered at different data points x(m). The sum of all these individual kernels gives us the overall kernel density estimate q(x). On the right hand side we show a real histogram and the corresponding kernel density estimate. We remark that q(x) is at best an approximation to the true data distribution p(x), which is never known.
Since q(x) is an approximation to p(x), which is never accessible, we can learn sθ(x) based on q(x). This leads to the following definition of a loss function which can be used to train a network.
By substituting the kernel density estimation, we can show that the loss is
So, we have derived a loss function that can be used to train the network. Once we train the network sθ ,
we can replace it in the Langevin dynamics equation to obtain the recursion:
$$x_{t+1} = x_t + \tau\, s_\theta(x_t) + \sqrt{2\tau}\, z. \tag{103}$$
The issue of explicit score matching is that the kernel density estimate is a fairly poor non-parametric estimate of the true distribution. Especially when we have a limited number of samples living in a high-dimensional space, the kernel density estimate can perform poorly.
Implicit Score Matching [18]. In implicit score matching, the explicit score matching loss is replaced
by an implicit one.
$$J_{\text{ISM}}(\theta) \overset{\text{def}}{=} \mathbb{E}_{p(x)}\left[\operatorname{Tr}\big(\nabla_x s_\theta(x)\big) + \frac{1}{2}\|s_\theta(x)\|^2\right], \tag{104}$$
where $\nabla_x s_\theta(x)$ denotes the Jacobian of $s_\theta(x)$. The implicit score matching loss can be approximated by Monte Carlo:
$$J_{\text{ISM}}(\theta) \approx \frac{1}{M}\sum_{m=1}^{M}\sum_i\left[\partial_i\big[s_\theta(x^{(m)})\big]_i + \frac{1}{2}\Big|\big[s_\theta(x^{(m)})\big]_i\Big|^2\right],$$
where $\partial_i\big[s_\theta(x^{(m)})\big]_i = \frac{\partial}{\partial x_i}\big[s_\theta(x)\big]_i = \frac{\partial^2}{\partial x_i^2}\log p(x)$. If the model for the score function is realized by a deep neural network, the trace operator can be difficult to compute, hence making implicit score matching not scalable [40].
Denoising Score Matching. Given the potential drawbacks of explicit and implicit score matching,
we now introduce a more popular score matching known as the denoising score matching (DSM) by Vincent
[43]. In DSM, the loss function is defined as follows.
$$J_{\text{DSM}}(\theta) \overset{\text{def}}{=} \mathbb{E}_{q(x,x')}\left[\frac{1}{2}\big\|s_\theta(x) - \nabla_x\log q(x|x')\big\|^2\right]. \tag{105}$$
The key difference here is that we replace the distribution q(x) by a conditional distribution q(x|x′ ). The
former requires an approximation, e.g., via kernel density estimation, whereas the latter does not. Here is
an example.
$$\begin{aligned}
\nabla_x\log q(x|x') &= \nabla_x\log\left(\frac{1}{(\sqrt{2\pi\sigma^2})^d}\exp\left(-\frac{\|x-x'\|^2}{2\sigma^2}\right)\right)\\
&= \nabla_x\left(-\frac{\|x-x'\|^2}{2\sigma^2} - \log\big(\sqrt{2\pi\sigma^2}\big)^d\right)\\
&= -\frac{x-x'}{\sigma^2} = -\frac{z}{\sigma},
\end{aligned}$$
where in the last equality we write $x = x' + \sigma z$ with $z\sim\mathcal{N}(0,I)$.
As a result, the loss function of the denoising score matching becomes
$$J_{\text{DSM}}(\theta) \overset{\text{def}}{=} \mathbb{E}_{q(x,x')}\left[\frac{1}{2}\big\|s_\theta(x) - \nabla_x\log q(x|x')\big\|^2\right] = \mathbb{E}_{q(x')}\left[\frac{1}{2}\Big\|s_\theta(x'+\sigma z) + \frac{z}{\sigma}\Big\|^2\right].$$
If we replace the dummy variable x′ by x, and we note that sampling from q(x) can be replaced by sampling
from p(x) when we are given a training dataset, we can conclude the following.
Theorem 3.3. The Denoising Score Matching loss function is
$$J_{\text{DSM}}(\theta) = \mathbb{E}_{p(x)}\left[\frac{1}{2}\Big\|s_\theta(x+\sigma z) + \frac{z}{\sigma}\Big\|^2\right]. \tag{106}$$
The beauty of Eqn (106) is that it is highly interpretable. The quantity x + σz is effectively a clean image x corrupted by noise σz. The score function sθ is supposed to take this noisy image and predict the noise −z/σ. Predicting the noise is equivalent to denoising, because any denoised image plus the predicted noise gives us back the noisy observation. Therefore, Eqn (106) is a denoising step.
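The loss in Eqn (106) translates almost directly into a training step. The sketch below uses PyTorch with a hypothetical score network `score_net(x)` and a single fixed noise level σ; it is an illustration under these assumptions, not the tutorial's reference code.

```python
# One SGD step on the denoising score matching loss of Eqn (106).
import torch

def dsm_step(score_net, optimizer, x, sigma):
    """x: a batch of clean training samples; sigma: assumed scalar noise level."""
    z = torch.randn_like(x)
    x_noisy = x + sigma * z
    loss = 0.5 * ((score_net(x_noisy) + z / sigma) ** 2).mean()   # Eqn (106)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```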
The following theorem, proven by Vincent [43], establishes the equivalence between DSM and ESM. It
is this equivalence that allows us to use DSM to estimate the score function.
Theorem 3.4. [Vincent [43]] Up to a constant C which is independent of the variable θ, it holds that
$$J_{\text{DSM}}(\theta) = J_{\text{ESM}}(\theta) + C. \tag{107}$$
Proof of Theorem 3.4. The proof here is based on [43]. We start with the explicit score matching loss function, which is given by
$$\begin{aligned}
J_{\text{ESM}}(\theta) &= \mathbb{E}_{q(x)}\left[\frac{1}{2}\|s_\theta(x) - \nabla_x\log q(x)\|^2\right]\\
&= \mathbb{E}_{q(x)}\bigg[\frac{1}{2}\|s_\theta(x)\|^2 - s_\theta(x)^T\nabla_x\log q(x) + \underbrace{\frac{1}{2}\|\nabla_x\log q(x)\|^2}_{\overset{\text{def}}{=}\,C_1,\ \text{independent of }\theta}\bigg].
\end{aligned}$$
The cross term can be handled as follows:
$$\begin{aligned}
\int s_\theta(x)^T\nabla_x q(x)\, dx &= \int s_\theta(x)^T\nabla_x\underbrace{\left(\int q(x')\, q(x|x')\, dx'\right)}_{=q(x)} dx && \text{(conditional)}\\
&= \int s_\theta(x)^T\left(\int q(x')\,\nabla_x q(x|x')\, dx'\right) dx && \text{(move gradient)}\\
&= \int s_\theta(x)^T\left(\int q(x')\,\nabla_x q(x|x')\times\frac{q(x|x')}{q(x|x')}\, dx'\right) dx && \text{(multiply and divide)}\\
&= \int s_\theta(x)^T\left(\int q(x')\underbrace{\frac{\nabla_x q(x|x')}{q(x|x')}}_{=\nabla_x\log q(x|x')}\, q(x|x')\, dx'\right) dx && \text{(rearrange terms)}\\
&= \int s_\theta(x)^T\left(\int q(x')\big[\nabla_x\log q(x|x')\big]\, q(x|x')\, dx'\right) dx\\
&= \int\!\!\int\underbrace{q(x|x')\, q(x')}_{=q(x,x')}\, s_\theta(x)^T\big[\nabla_x\log q(x|x')\big]\, dx'\, dx. && \text{(move integration)}
\end{aligned}$$
So, if we substitute this result back into the definition of ESM, we can show that
$$J_{\text{ESM}}(\theta) = \mathbb{E}_{q(x)}\left[\frac{1}{2}\|s_\theta(x)\|^2\right] - \mathbb{E}_{q(x,x')}\Big[s_\theta(x)^T\nabla_x\log q(x|x')\Big] + C_1.$$
Comparing this with the definition of DSM, we can observe that
$$\begin{aligned}
J_{\text{DSM}}(\theta) &\overset{\text{def}}{=} \mathbb{E}_{q(x,x')}\left[\frac{1}{2}\|s_\theta(x) - \nabla_x\log q(x|x')\|^2\right]\\
&= \mathbb{E}_{q(x,x')}\bigg[\frac{1}{2}\|s_\theta(x)\|^2 - s_\theta(x)^T\nabla_x\log q(x|x') + \underbrace{\frac{1}{2}\|\nabla_x\log q(x|x')\|^2}_{\overset{\text{def}}{=}\,C_2,\ \text{independent of }\theta}\bigg]\\
&= \mathbb{E}_{q(x)}\left[\frac{1}{2}\|s_\theta(x)\|^2\right] - \mathbb{E}_{q(x,x')}\Big[s_\theta(x)^T\nabla_x\log q(x|x')\Big] + C_2.
\end{aligned}$$
Therefore, we conclude that
JDSM (θ) = JESM (θ) − C1 + C2 .
The training procedure of a score matching model is typically done by minimizing the denoising score matching loss function. If we are given a training dataset $\{x^{(m)}\}_{m=1}^{M}$, the optimization goal is
$$\begin{aligned}
\theta^* &= \operatorname*{argmin}_{\theta}\ \mathbb{E}_{p(x)}\left[\frac{1}{2}\Big\|s_\theta(x+\sigma z) + \frac{z}{\sigma}\Big\|^2\right]\\
&\approx \operatorname*{argmin}_{\theta}\ \frac{1}{M}\sum_{m=1}^{M}\frac{1}{2}\Big\|s_\theta\big(x^{(m)}+\sigma z^{(m)}\big) + \frac{z^{(m)}}{\sigma}\Big\|^2, \qquad z^{(m)}\sim\mathcal{N}(0,I).
\end{aligned}$$
where the individual loss function is defined according to the noise levels σ1, . . . , σL:
$$\ell(\theta;\sigma) = \mathbb{E}_{p(x)}\left[\frac{1}{2}\Big\|s_\theta(x+\sigma z) + \frac{z}{\sigma}\Big\|^2\right].$$
The coefficient function λ(σi) is often chosen as λ(σ) = σ² based on empirical findings [40]. The noise level sequence often satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$.
For inference, we assume that we have already trained the score estimator sθ. To generate an image, we use the Langevin equation to iteratively draw samples by denoising the image. In the case of NCSN, the corresponding Langevin equation is implemented via annealed Langevin dynamics:
$$x_{t+1} = x_t + \frac{\alpha_i}{2}\, s_\theta(x_t, \sigma_i) + \sqrt{\alpha_i}\, z_t, \qquad z_t\sim\mathcal{N}(0,I),$$
where $\alpha_i = \sigma_i^2/\sigma_L^2$ is the step size and $s_\theta(x_t, \sigma_i)$ denotes the score matching function for noise level σi. The iteration over t is repeated sequentially for each σi from i = 1 to L. For additional details of the implementation, we refer readers to Algorithm 1 of the original paper by Song and Ermon [40].
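A structural sketch of this annealed procedure is shown below. The score model `score_net(x, sigma)`, the number of steps per level, and the small constant `eps` multiplying the step size are assumptions made for illustration; the exact constants and schedules used in practice are those of [40].

```python
# Annealed Langevin dynamics in the spirit of NCSN inference (a sketch, not the reference code).
import numpy as np

def annealed_langevin(score_net, shape, sigmas, steps_per_level=100, eps=2e-5, seed=0):
    """sigmas: noise levels ordered from largest to smallest."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    sigma_L = sigmas[-1]                               # smallest noise level
    for sigma_i in sigmas:
        alpha_i = eps * sigma_i**2 / sigma_L**2        # step size scaled by (sigma_i / sigma_L)^2
        for _ in range(steps_per_level):
            z = rng.standard_normal(shape)
            x = x + 0.5 * alpha_i * score_net(x, sigma_i) + np.sqrt(alpha_i) * z
    return x
```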
Example 4.1. Simple First-Order ODE. Imagine that we are given a discrete-time algorithm with
the iterations defined by the recursion:
$$x_i = \left(1 - \frac{\beta\Delta t}{2}\right) x_{i-1}, \qquad i = 1, 2, \ldots, N, \tag{109}$$
for some hyperparameter β and a step-size parameter ∆t. We can turn this iterative scheme into a
continuous-time differential equation.
Suppose that there is a continuous-time function x(t). We define a discretization scheme by letting $x_i = x\big(\tfrac{i}{N}\big)$ for i = 1, . . . , N, with $\Delta t = \tfrac{1}{N}$ and $t\in\{0, \tfrac{1}{N}, \ldots, \tfrac{N-1}{N}\}$. Then the above recursion can be written as
$$x(t+\Delta t) = \left(1 - \frac{\beta\Delta t}{2}\right) x(t).$$
Verification of the solution can be done by substituting Eqn (111) into Eqn (110).
The power of the ODE is that it offers us an analytic solution. Instead of resorting to the iterative
scheme (which will take hundreds to thousands of iterations), the analytic solution tells us exactly the
• The discrete-time iterative scheme can be written as a continuous-time ODE. In fact, many discrete-time algorithms have associated ODEs.
• For simple ODEs, we can write down the analytic solution in closed form. For more complicated ODEs, it may be hard to write down an analytic solution, but we can still use ODE tools to analyze the behavior of the solution. We can also derive the limiting solution as t → 0.
Example 4.2. Gradient Descent. Recall that a gradient descent algorithm for a (well-behaved)
convex function f is the following recursion. For i = 1, 2, . . . , N , do
for step-size parameter βi. Using the same discretization as we did in the previous example, we can show that (by letting βi−1 = β(t)Δt):
$$\begin{aligned}
\frac{d}{dt} f(x(t)) &= \nabla f(x(t))^T\,\frac{dx(t)}{dt} && \text{(chain rule)}\\
&= \nabla f(x(t))^T\big[-\beta\,\nabla f(x(t))\big] && \text{(Eqn (113))}\\
&= -\beta\,\nabla f(x(t))^T\nabla f(x(t))\\
&= -\beta\,\|\nabla f(x(t))\|^2 \le 0 && \text{(norm-squares)}.
\end{aligned}$$
Therefore, as we move from xi−1 to xi, the objective value f(x(t)) has to go down. This is consistent with our expectation because a gradient descent algorithm should bring the cost down as the iterations go on. Second, in the limit t → ∞, we know that $\frac{dx(t)}{dt}\to 0$. Hence, $\frac{dx(t)}{dt} = -\beta\,\nabla f(x(t))$ implies that
$$\nabla f(x(t)) \to 0, \qquad \text{as } t\to\infty. \tag{114}$$
Therefore, the solution trajectory x(t) will approach the minimizer of the function f.
Let’s use the gradient descent example to illustrate one more aspect of the ODE. Going back to Eqn
(112), we recognize that the recursion can be written equivalently as (assuming β(t) = β):
We call this as the forward equation because we update x by x + ∆x assuming that t ← t + ∆t.
Now, consider a sequence of iterates i = N, N − 1, . . . , 2, 1. If we are told that the progression of the
iterates follows Eqn (115), then the time-reversal iterates will be
Note the change in sign when reversing the progression direction. We call this the reverse equation.
SDE. In an SDE, in addition to a deterministic function f (t, x), we consider a stochastic perturbation.
For example, the stochastic perturbation can take the following form:
$$\frac{dx(t)}{dt} = f(t, x) + g(t, x)\,\xi(t), \qquad \xi(t)\sim\mathcal{N}(0,I),$$
where ξ(t) is a noise function, e.g., white noise. We can define dw = ξ(t)dt, where dw is often known as the
differential form of the Brownian motion. Then, the differential form of this SDE can be written as
Because ξ(t) is random, the solution to this differential equation is also random. To be explicit about the randomness of the solution, we should interpret the differential form via the integral equation
$$x(t,\omega) = x_0 + \int_0^t f\big(s, x(s,\omega)\big)\, ds + \int_0^t g\big(s, x(s,\omega)\big)\, dw(s,\omega),$$
where ω denotes the index of the state of x. Therefore, as we pick a particular state of the random process
w(s, ω), we solve a differential equation corresponding to this particular ω.
$$dx = a\, dw,$$
for some constant a. Based on our discussions above, the solution trajectory will take the form
$$x(t) = x_0 + \int_0^t a\, dw(s) = x_0 + a\int_0^t \xi(s)\, ds,$$
where the last equality uses the fact that dw = ξ(t)dt. We can visualize the solution trajectory by numerically implementing dx = a dw via the discrete-time iteration
$$x_i = x_{i-1} + a z_i,\ \ z_i\sim\mathcal{N}(0,1) \quad\Longrightarrow\quad x_i = x_0 + a z_1 + \cdots + a z_i.$$
For example, if a = 0.05, one possible trajectory of x(t) behaves as shown below. The initial point x0 = 0 is marked in red to indicate that the process is moving forward in time.
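The trajectory can be generated with a few lines of NumPy, using a = 0.05 and x0 = 0 as in the text; the number of steps N is an illustrative choice.

```python
# Discrete-time simulation of dx = a dw: x_i = x_{i-1} + a z_i with x_0 = 0.
import numpy as np

a, N = 0.05, 1000
rng = np.random.default_rng(0)
z = rng.standard_normal(N)
x = np.concatenate(([0.0], np.cumsum(a * z)))   # x_i = x_0 + a z_1 + ... + a z_i
print(x[:5], x[-1])
```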
To visualize the trajectory, we consider α = 1 and β = 0.1. The discretization gives us
$$x_i - x_{i-1} = -\frac{\alpha}{2}\, x_{i-1} + \beta(w_i - w_{i-1}) \quad\Longrightarrow\quad x_i = \left(1 - \frac{\alpha}{2}\right) x_{i-1} + \beta z_{i-1}.$$
Forward and Reverse Diffusion. Diffusion models involve a pair of equations: the forward diffusion
process and the reverse diffusion process. Expressed in terms of derivatives, a forward diffusion process can
be written as
$$\frac{dx(t)}{dt} = f(x,t) + g(t)\,\xi(t), \qquad \xi(t)\sim\mathcal{N}(0,I).$$
We emphasize that this is a particular diffusion process based on Brownian motion. The random perturbation
ξ is assumed to be a random process with ξ(t) being an i.i.d. Gaussian random variable at all t. Thus, the
autocorrelation function is a delta function E[ξ(t)ξ(t′ )] = δ(t − t′ ). For this diffusion process, we can write
it in terms of the differential:
Here, the differential is dw = ξ(t)dt. This suggests that we can view ξ(t) as some rate of change (over t)
from which the integration of ξ(t)dt will give us dw.
The two terms f (x, t) and g(t) carry physical meanings. The drift coefficient is a vector-valued function
f (x, t) defining how molecules in a closed system would move in the absence of random effects. For the
gradient descent algorithm, the drift is defined by the negative gradient of the objective function. That
is, we want the solution trajectory to follow the gradient of the objective. The diffusion coefficient g(t) is
a scalar function describing how the molecules would randomly walk from one position to another. The
function g(t) determines how strong the random movement is.
The reverse direction of the diffusion equation is to move backward in time. The reverse-time SDE, according to Anderson [2], is given as follows:
$$dx = \big[f(x,t) - g(t)^2\,\nabla_x\log p_t(x)\big]\, dt + g(t)\, d\overline{w}, \tag{119}$$
where pt(x) is the probability distribution of x at time t, and $\overline{w}$ is the Wiener process when time flows backward.
Let’s briefly talk about the reverse-time diffusion. The reverse-time diffusion is nothing but a random
process that proceeds in the reverse time order. So while the forward diffusion defines zi = wi+1 − wi , the
reverse diffusion defines zi = wi−1 − wi . The following is an example.
dx = adw. (120)
In the figure below we show the trajectory of this reverse-time process. Note that the initial point
marked in red is at xN . The process is tracked backward to x0 .
We can show that this equation can be derived from the forward SDE equation below.
Theorem 4.1. The forward sampling equation of DDPM can be written as an SDE via
$$dx = \underbrace{-\frac{\beta(t)}{2}\, x}_{=f(x,t)}\, dt + \underbrace{\sqrt{\beta(t)}}_{=g(t)}\, dw. \tag{122}$$
Proof. We define a step size $\Delta t = \frac{1}{N}$, and consider an auxiliary noise level $\{\overline{\beta}_i\}_{i=1}^{N}$ where $\beta_i = \frac{\overline{\beta}_i}{N}$. Then
$$\beta_i = \underbrace{\overline{\beta}\Big(\tfrac{i}{N}\Big)}_{\overline{\beta}_i}\cdot\frac{1}{N} = \beta(t+\Delta t)\,\Delta t,$$
where we assume that as N → ∞, $\overline{\beta}_i \to \beta(t)$, which is a continuous-time function for 0 ≤ t ≤ 1.
Similarly, we define
Hence, we have
$$\begin{aligned}
x_i &= \sqrt{1-\beta_i}\ x_{i-1} + \sqrt{\beta_i}\ z_{i-1}\\
\Rightarrow\quad x_i &= \sqrt{1-\tfrac{\overline{\beta}_i}{N}}\ x_{i-1} + \sqrt{\tfrac{\overline{\beta}_i}{N}}\ z_{i-1}\\
\Rightarrow\quad x(t+\Delta t) &= \sqrt{1-\beta(t+\Delta t)\,\Delta t}\ x(t) + \sqrt{\beta(t+\Delta t)\,\Delta t}\ z(t)\\
\Rightarrow\quad x(t+\Delta t) &\approx \Big(1 - \tfrac{1}{2}\beta(t+\Delta t)\,\Delta t\Big)\, x(t) + \sqrt{\beta(t+\Delta t)\,\Delta t}\ z(t)\\
\Rightarrow\quad x(t+\Delta t) &\approx x(t) - \tfrac{1}{2}\beta(t)\,\Delta t\ x(t) + \sqrt{\beta(t)\,\Delta t}\ z(t).
\end{aligned}$$
Thus, as Δt → 0, we have
$$dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw. \tag{123}$$
Therefore, we showed that the DDPM forward update iteration can be equivalently written as an SDE.
Being able to write the DDPM forward update iteration as an SDE means that the DDPM estimates can be determined by solving the SDE. In other words, for an appropriately defined SDE solver, we can throw the SDE into the solver. The solution returned by an appropriately chosen solver will be the DDPM estimate.
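A direct way to see this is to simulate Eqn (123) with the simplest stochastic solver, the Euler-Maruyama discretization. The sketch below uses a constant β(t) and an initial Gaussian-mixture population purely as illustrative assumptions; the forward SDE should push the samples toward a standard Gaussian.

```python
# Euler-Maruyama simulation of the forward VP SDE dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dw.
import numpy as np

def forward_vp_sde(x0, beta_fn, N=1000, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / N
    x = np.array(x0, dtype=float)
    for i in range(N):
        b = beta_fn(i * dt)
        x = x - 0.5 * b * x * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(1)
comp = rng.integers(0, 2, size=5000)
x0 = np.where(comp == 0, rng.normal(2.0, 0.5, 5000), rng.normal(-2.0, 0.5, 5000))
xT = forward_vp_sde(x0, beta_fn=lambda t: 10.0)      # constant beta(t), illustrative only
print(x0.mean(), x0.std(), xT.mean(), xT.std())      # xT statistics should be close to (0, 1)
```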
Example 4.6. Consider the DDPM forward equation with βi = 0.05 for all i = 0, . . . , N − 1. We
initialize the sample x0 by drawing it from a Gaussian mixture such that
$$x_0 \sim \sum_{k=1}^{K}\pi_k\,\mathcal{N}\big(x_0\,|\,\mu_k,\ \sigma_k^2 I\big),$$
The reverse diffusion equation follows from Eqn (119) by substituting the appropriate quantities: $f(x,t) = -\frac{\beta(t)}{2}x$ and $g(t) = \sqrt{\beta(t)}$. This will give us
Theorem 4.2. The reverse sampling equation of DDPM can be written as an SDE via
$$dx = -\beta(t)\left[\frac{x}{2} + \nabla_x\log p_t(x)\right] dt + \sqrt{\beta(t)}\, dw. \tag{124}$$
Proof. The iterative update scheme can be written by considering dx = x(t) − x(t−Δt) and dw = w(t−Δt) − w(t) = −z(t). Then, letting dt = Δt, we can show that
$$\begin{aligned}
x(t) - x(t-\Delta t) &= -\beta(t)\Delta t\left[\frac{x(t)}{2} + \nabla_x\log p_t(x(t))\right] - \sqrt{\beta(t)\Delta t}\ z(t)\\
\Rightarrow\quad x(t-\Delta t) &= x(t) + \beta(t)\Delta t\left[\frac{x(t)}{2} + \nabla_x\log p_t(x(t))\right] + \sqrt{\beta(t)\Delta t}\ z(t).
\end{aligned}$$
Then, following the discretization scheme by letting $t\in\{0,\ldots,\frac{N-1}{N}\}$, Δt = 1/N, x(t−Δt) = x_{i−1}, x(t) = x_i, and β(t)Δt = βi, we can show that
$$x_{i-1} = \Big(1+\frac{\beta_i}{2}\Big)\, x_i + \beta_i\,\nabla_x\log p_i(x_i) + \sqrt{\beta_i}\, z_i \approx \frac{1}{\sqrt{1-\beta_i}}\Big[x_i + \beta_i\,\nabla_x\log p_i(x_i)\Big] + \sqrt{\beta_i}\, z_i, \tag{125}$$
where pi (x) is the probability density function of x at time i. For practical implementation, we can
replace ∇x log pi (xi ) by the estimated score function sθ (xi ).
So, we have recovered the DDPM iteration that is consistent with the one defined by Song and Ermon in [42]. This is an interesting result, because it allows us to connect DDPM's iteration with the score function. Song and Ermon [42] called this SDE a variance preserving (VP) SDE.
Example 4.7. Following from the previous example, we perform the reverse diffusion using
$$x_{i-1} = \frac{1}{\sqrt{1-\beta_i}}\Big[x_i + \beta_i\,\nabla_x\log p_i(x_i)\Big] + \sqrt{\beta_i}\, z_i,$$
Stochastic Differential Equation for SMLD. The score-matching Langevin Dynamics model can
also be described by an SDE. To start with, we notice that in the SMLD setting, there isn’t really a “forward
diffusion step”. However, we can roughly argue that if we divide the noise scale in the SMLD training into
N levels, then the recursion should follow a Markov chain
$$x_i = x_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\ z_{i-1}, \qquad i = 1, 2, \ldots, N. \tag{126}$$
This is not too hard to see. If we assume that the variance of $x_{i-1}$ is $\sigma_{i-1}^2$, then we can show that
$$\operatorname{Var}[x_i] = \operatorname{Var}[x_{i-1}] + (\sigma_i^2 - \sigma_{i-1}^2) = \sigma_{i-1}^2 + (\sigma_i^2 - \sigma_{i-1}^2) = \sigma_i^2.$$
Therefore, given a sequence of noise levels, Eqn (126) will indeed generate estimates xi such that the noise statistics satisfy the desired property. If we agree with Eqn (126), it is easy to derive the SDE associated with it. We summarize our result as follows.
Theorem 4.4. The reverse sampling equation of SMLD can be written as an SDE via
$$dx = -\left(\frac{d[\sigma(t)^2]}{dt}\,\nabla_x\log p_t(x)\right) dt + \sqrt{\frac{d[\sigma(t)^2]}{dt}}\ dw, \tag{128}$$
which is identical to the SMLD reverse update equation. Song and Ermon [42] called this SDE a variance exploding (VE) SDE.
Equivalence between VP and VE's Inference. As we have just seen, DDPM and SMLD correspond to two variants of the stochastic differential equation: the variance preserving (VP) and the variance exploding (VE) SDEs. An observation made by Kawar et al. [21] was that the inference processes determined by VP and VE are equivalent. Therefore, if we want to use a diffusion model as part of another task such as image restoration, it does not matter whether we use VE or VP. Note, however, that the specific choice of hyperparameters will still affect the training and hence the model itself.
$$\frac{dx(t)}{dt} = f(t, x(t)). \tag{130}$$
Geometrically, this ODE means that the derivative of x(t) equals the function f(t, x) evaluated at x(t). For this to happen, we need the slope of x(t) to match the functional value. In the 1D case, the above geometric interpretation can be translated to the following equation:
$$\frac{x(t) - x(t_0)}{t - t_0} = f(t_0, x_0),$$
where t0 is a time fairly close to t. By rearranging the terms, we see that this equation is equivalent to
So we have recovered something very similar to gradient descent. This is the Euler method.
Euler Method. The Euler method is a first-order numerical method for solving the ODE. Given $\frac{dx(t)}{dt} = f(t,x)$ and x(t0) = x0, the Euler method solves the problem via the iterative scheme
$$x_{i+1} = x_i + \alpha\cdot f(t_i, x_i), \qquad i = 0, 1, \ldots, N-1,$$
for a step size α. For example, consider the ODE
$$\frac{dx(t)}{dt} = \frac{x(t) + t^2 - 2}{t + 1}.$$
If we apply the Euler method with a step size α, then the iteration will take the form
$$x_{i+1} = x_i + \alpha\cdot f(t_i, x_i) = x_i + \alpha\cdot\frac{x_i + t_i^2 - 2}{t_i + 1}.$$
Runge-Kutta (RK) Method. Another popularly used ODE solver is the Runge-Kutta (RK) method.
The classical RK-4 algorithm solves the ODE via the iteration
α
xi+1 = xi + · k1 + 2k2 + 2k3 + k4 , i = 1, 2, . . . , N,
6
where the quantities k1 , k2 , k3 and k4 are defined as
k1 = f (xi , ti ),
k2 = f ti + α2 , xi + α k21 ,
k3 = f ti + α2 , xi + α k22 ,
k4 = f (ti + α, xi + αk3 ) .
For more details, you can consult numerical methods textbooks such as [3].
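Both solvers are easy to implement. The sketch below applies them to the example ODE above, with an assumed initial condition x(0) = 2 and an assumed integration horizon; a fine-step RK-4 run is used as a reference to compare the accuracy of the two schemes.

```python
# Euler and RK-4 on dx/dt = (x + t^2 - 2) / (t + 1); initial condition is illustrative.
import numpy as np

def f(t, x):
    return (x + t**2 - 2) / (t + 1)

def euler(x0, t0, t1, N):
    a, t, x = (t1 - t0) / N, t0, x0
    for _ in range(N):
        x = x + a * f(t, x)
        t = t + a
    return x

def rk4(x0, t0, t1, N):
    a, t, x = (t1 - t0) / N, t0, x0
    for _ in range(N):
        k1 = f(t, x)
        k2 = f(t + a / 2, x + a * k1 / 2)
        k3 = f(t + a / 2, x + a * k2 / 2)
        k4 = f(t + a, x + a * k3)
        x = x + (a / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + a
    return x

ref = rk4(2.0, 0.0, 1.0, 100000)                       # fine-step reference value of x(1)
for N in (10, 100, 1000):
    print(N, abs(euler(2.0, 0.0, 1.0, N) - ref), abs(rk4(2.0, 0.0, 1.0, N) - ref))
```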
Predictor-Corrector Algorithm [42]. Since different numerical solvers have different behavior in
terms of the error of approximation, throwing the ODE (or SDE) into an off-the-shelf numerical solver will
result in various degrees of error [20]. However, if we are specifically trying to solve the reverse diffusion
equation, it is possible to use techniques other than numerical ODE/SDE solvers to make the appropriate
corrections, as illustrated in Figure 24.
Let’s use DDPM as an example. In DDPM, the reverse diffusion equation is given by
$$x_{i-1} = \frac{1}{\sqrt{1-\beta_i}}\Big[x_i + \beta_i\,\nabla_x\log p_i(x_i)\Big] + \sqrt{\beta_i}\, z_i.$$
We can consider it as an Euler method for the reverse diffusion. However, if we have already trained the score function sθ(xi, i), we can run the score-matching equation, i.e.,
$$x_{i-1} = x_i + \epsilon_i\, s_\theta(x_i, i) + \sqrt{2\epsilon_i}\, z_i,$$
for M times to make the correction. The algorithm below summarizes the idea. (Note that we have replaced the score function by its estimate.)
for m = 1, . . . , M do
$$\text{(Correction)}\quad x_{i-1} = x_i + \epsilon_i\, s_\theta(x_i, i) + \sqrt{2\epsilon_i}\, z_i, \tag{132}$$
end for
end for
$$x_{i-1} = x_i + \epsilon_i\, s_\theta(x_i, \sigma_i) + \sqrt{\epsilon_i}\, z \qquad \text{Correction}.$$
We can pair them up, as in the case of DDPM's predictor-corrector algorithm, by repeating the correction iteration a few times.
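A structural sketch of the predictor-corrector loop is shown below. The score model `score(x, i)` and the schedules `beta` and `eps` are placeholders; the loop mirrors the structure of the algorithm above rather than reproducing the exact implementation of [42].

```python
# Predictor-corrector sampling sketch: one reverse-diffusion step, then M Langevin corrections.
import numpy as np

def pc_sampler(score, shape, beta, eps, M=1, seed=0):
    rng = np.random.default_rng(seed)
    N = len(beta)
    x = rng.standard_normal(shape)
    for i in range(N - 1, 0, -1):
        # Prediction: reverse DDPM step using the score estimate
        z = rng.standard_normal(shape)
        x = (x + beta[i] * score(x, i)) / np.sqrt(1 - beta[i]) + np.sqrt(beta[i]) * z
        # Correction: M Langevin steps at the same noise level (Eqn (132))
        for _ in range(M):
            z = rng.standard_normal(shape)
            x = x + eps[i] * score(x, i) + np.sqrt(2 * eps[i]) * z
    return x
```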
Derivation of Brownian Motion. So, what is Brownian motion and how is it related to diffusion
models? Assume that there is a particle suspended in fluid. Stoke’s law states that the friction applied to
the particle is given by
F (t) = −αv(t), (133)
where F is the friction, v is the velocity, and α = 6πµR. Here, R is the radius of the particle, and µ is the
viscosity of the fluid. By Newton’s second law, we further know that F (t) = mv̇(t), where m is the mass of
the particle. Equating the two equations
$$\begin{cases} F(t) = -\alpha v(t),\\[2pt] F(t) = m\,\dfrac{dv(t)}{dt}, \end{cases}$$
Remark. Properties (i) and (ii) are special cases of a wide-sense stationary process. A wide-sense stationary process is a random process that has a constant mean function (the constant is not necessarily zero), and whose autocorrelation function $R(t,t') \overset{\text{def}}{=} \mathbb{E}[\Gamma(t)\Gamma(t')]$ is a function of the difference t − t′. This function is not necessarily a delta function. For example, $R(t,t') = e^{-|t-t'|}$ can be a valid autocorrelation function of a wide-sense stationary process.
A random process satisfying properties (i) and (ii) is sometimes called a delta-correlated process in the statistical mechanics literature. There are different ways to construct a delta-correlated process. For example, we can assume Γ(t) ∼ N(0, 1) for every t, or any other independent and identically distributed random variables defined in the same way. Gaussian distributions are more often used because many physical phenomena can be described by a Gaussian, e.g., thermal noise. A Gaussian random process satisfying properties (i) and (ii) is called a Gaussian white noise.
For any wide sense stationary process, Wiener-Khinchin Theorem says that the power spectral
density can be defined through the Fourier transform of the autocorrelation function. More specifically,
if $R(\tau) = \mathbb{E}[\Gamma(t+\tau)\Gamma(t)]$ is the autocorrelation function (we can write R(t, t′) as R(τ) if Γ(t) is a wide-sense stationary process), the Wiener-Khinchin Theorem states that the power spectral density is
$$S(\omega) = \int_{-\infty}^{\infty} R(\tau)\, e^{-j\omega\tau}\, d\tau. \tag{137}$$
So if R(τ ) is a delta function, S(ω) will have a constant value for all ω.
Remark. The name “Gaussian white noise” comes from the fact that the power spectral density
S(ω) is uniform for every frequency ω (so it contains all the colors in the visible spectrum). A white
noise is defined as Γ(t) ∼ N (0, σ 2 ) for all t. It is easy to show that such a Γ(t) would satisfy the above
two criteria.
Firstly, E[Γ(t)] = 0 by construction (since Γ(t) ∼ N(0, σ²)). Secondly, if Γ(t) ∼ N(0, σ²), it is necessary that R(τ) = E[Γ(t + τ)Γ(t)] is a delta function. The Wiener-Khinchin Theorem then states that the power spectral density is flat because it is the Fourier transform of a delta function.
From Physics to Generative AI. Because of the randomness exhibited in Γ(t), the differential equa-
tion given by Eqn (136) is a stochastic differential equation (SDE). The solution to this SDE is therefore
a random process where the value v(t) is a random variable at any time t. Brownian motion refers to the
trajectory of this random process v(t) as a function of time. The resulting SDE in Eqn (136) is a special case
of the Langevin equation. We call it a linear Langevin equation with a δ-correlated Langevin force:
Definition 5.1. A linear Langevin equation with a δ-correlated Langevin force is a stochastic differential equation of the form
$$\dot{\xi} + \gamma\xi = \Gamma(t), \tag{138}$$
where Γ(t) is a random process satisfying two properties: (i) E[Γ(t)] = 0 for all t, and (ii) E[Γ(t)Γ(t′)] = qδ(t − t′) for all t and t′.
At this point we can connect the Langevin equation in Eqn (138) with a diffusion model, e.g., DDPM.
Example 5.1. Forward DDPM. Recall that the DDPM forward diffusion equation is given by
$$dx = \underbrace{-\frac{\beta(t)}{2}\, x}_{=f(t)}\, dt + \underbrace{\sqrt{\beta(t)}}_{=g(t)}\, dw.$$
Example 5.2. Reverse DDPM. The reverse DDPM diffusion is given by Eqn (124):
$$dx = \underbrace{-\beta(t)\Big[\frac{x}{2} + \nabla_x\log p_t(x)\Big]}_{=f(\xi,t)}\, dt + \underbrace{\sqrt{\beta(t)}}_{=g(t)}\, dw.$$
We can continue these examples for other diffusion models such as SMLD. We leave these as exercises for the readers. Our bottom-line message is that the diffusion equations we saw in the previous sections can all be written as Langevin equations. In the absence of the random term Γ(t), the equation reduces to
$$\dot{\xi}(t) + \gamma\xi(t) = 0,$$
and it is called a first-order homogeneous differential equation. The solution of this differential equation is as follows.
as follows.
Theorem 5.1. Consider the following differential equation:
$$\dot{\xi}(t) + \gamma\xi(t) = 0,$$
where we assume that ξ(t) ≠ 0 for all t so that we can take 1/ξ(t). Integrating both sides gives us
$$\int_0^t \frac{\dot{\xi}(t')}{\xi(t')}\, dt' = -\int_0^t \gamma\, dt'.$$
The left hand side gives us log ξ(t) − log ξ(0), whereas the right hand side gives us −γt. Equating them yields
$$\xi(t) = \xi(0)\, e^{-\gamma t}.$$
Now let’s consider the case where Γ(t) is present. The differential equation becomes
ξ˙ + γξ = Γ(t),
which is called a first-order non-homogeneous differential equation. To solve this differential equation, we
employ a technique known as the variation of parameter or variation of constant [29, Theorem 1.2.3].
The idea can be summarized in two steps. We know from our previous derivation that the solution to a
homogeneous equation is ξ(t) = ξ0 e−γt . So let’s make an educated guess about the solution of the non-
homogenous case that the solution takes the form of s(t) = A(t)e−γt for some A(t). For notation simplicity
we define h(t) = e−γt . If s(t) is indeed the solution to the differential equation, then we can evaluate
ṡ(t) + γs(t) = Γ(t). The left hand side of this equation is
where the last equality follows from the fact that h(t) = e−γt is a solution to the homogeneous equation,
hence h′ (t) + γh(t) = 0. Therefore, for ṡ(t) + γs(t) = Γ(t), it is necessary for A′ (t)h(t) = Γ(t) by finding an
appropriate A′ (t). But this is not difficult. The equation A′ (t)h(t) = Γ(t) can be written as
Therefore, the complete solution (which is the sum of the homogeneous part and the non-homogeneous part) is
$$\xi(t) = \xi_0\, e^{-\gamma t} + \int_0^t e^{-\gamma(t-t')}\,\Gamma(t')\, dt'.$$
We summarize the result as follows.
Theorem 5.2. Consider the following differential equation:
$$\dot{\xi}(t) + \gamma\xi(t) = \Gamma(t).$$
Its solution is
$$\xi(t) = \xi_0\, e^{-\gamma t} + \int_0^t e^{-\gamma(t-t')}\,\Gamma(t')\, dt'.$$
Distribution at Equilibrium. The previous result shows that the solution ξ(t) is a function of a
random process Γ(t). Since we do not know the particular realization of Γ(t) every time we run the (Brownian
motion) experiment, it is often more useful to characterize ξ(t) by looking at the probability distribution of
ξ(t). In what follows, we follow Risken [33] to analyze the probability distribution at the equilibrium where
t → ∞ and ξ(t) → x.
$$\dot{\xi} + \gamma\xi = \Gamma(t), \tag{141}$$
and Γ(t) is a white Gaussian noise satisfying the aforementioned properties. Let ξ(t) = x be the solution at equilibrium of this SDE, and let p(x) be the probability distribution of x. It holds that
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}, \tag{142}$$
where $\sigma = \sqrt{\frac{q}{2\gamma}}$. In other words, the solution ξ(t) = x at equilibrium is a zero-mean Gaussian random variable.
Proof. Let ξ0 = ξ(0) be the initial condition. Then, the solution of the SDE takes the form
$$\xi(t) = \xi_0\, e^{-\gamma t} + \int_0^t e^{-\gamma(t-t')}\,\Gamma(t')\, dt'. \tag{143}$$
At equilibrium, when t → ∞, we can drop ξ0e−γt. Moreover, by letting τ = t − t′, we can write the solution as
$$\xi(t) = \int_0^{\infty} e^{-\gamma(t-t')}\,\Gamma(t')\, dt' = \int_0^{\infty} e^{-\gamma\tau}\,\Gamma(t-\tau)\, d\tau.$$
The probability density function p(ξ) can be determined by taking the inverse Fourier transform of
So, to find C(u) we need to determine the moments E[ξ(t)ⁿ]. Using a result in Risken (Chapter 3, Eqns 3.26 and 3.27), we can show that
$$\begin{aligned}
\mathbb{E}[\xi(t)^{2n+1}] &= 0,\\
\mathbb{E}[\xi(t)^{2n}] &= \frac{(2n)!}{2^n n!}\left[\int_0^{\infty}\!\!\int_0^{\infty} e^{-\gamma(\tau_1+\tau_2)}\, q\,\delta(\tau_1-\tau_2)\, d\tau_1\, d\tau_2\right]^n \qquad (144)\\
&= \frac{(2n)!}{2^n n!}\left[q\int_0^{\infty} e^{-2\gamma\tau_2}\, d\tau_2\right]^n = \frac{(2n)!}{2^n n!}\left(\frac{q}{2\gamma}\right)^n. \qquad (145)
\end{aligned}$$
Recognizing that this is the characteristic function of a Gaussian, we can use the inverse Fourier transform to retrieve the probability density function
$$p(x) = \sqrt{\frac{\gamma}{\pi q}}\; e^{-\frac{\gamma x^2}{q}}.$$
Example 5.3. (Forward DDPM Distribution at Equilibrium) Let’s do a sanity check by applying
our result to the forward DDPM equation, and see what probability distribution will we obtain at the
equilibrium state.
For simplicity let’s assume a constant learning rate for the DDPM equation
β p
dx = − x dt + βdw.
2
The associate Langevin equation is
˙ + β ξ(t) = βΓ(t).
p
ξ(t)
2
As t → ∞, our theorem above suggests that
r
γ − γxq 2 1 x2
p(x) = e = √ e− 2 ,
πq 2π
√ 2
where we substituted γ = β/2, and q = β = β. Therefore, the probability distribution of ξ(t) when
t → ∞ is N (0, 1). This is consistent with what we expect.
which is also known as the Wiener process. The probability distribution of the solution of the Wiener process
can be derived as follows.
Theorem 5.4. Wiener Process. Consider the Wiener process
$$\dot{\xi} = \Gamma(t), \tag{147}$$
where Γ(t) is Gaussian white noise with E[Γ(t)] = 0 and E[Γ(t)Γ(t′)] = qδ(t − t′). The probability distribution p(x, t) of the solution ξ(t) = x is
$$p(x,t) = \frac{1}{\sqrt{2\pi q t}}\, e^{-\frac{(x-\xi_0)^2}{2qt}}. \tag{148}$$
Proof. The main difference between this result and Theorem 5.3 is that here we are interested in the distribution at any time t. To do so, we notice that $\xi(t) = \xi_0 + \int_0^t \Gamma(t')\, dt'$. So, to eliminate the non-zero mean, we can consider ξ(t) − ξ0 instead. Substituting this into Eqn (145), we can show that
$$\mathbb{E}[(\xi(t)-\xi_0)^{2n+1}] = 0, \qquad \mathbb{E}[(\xi(t)-\xi_0)^{2n}] = \frac{(2n)!}{2^n n!}\left[q\int_0^t d\tau_2\right]^n = \frac{(2n)!}{2^n n!}\,(qt)^n.$$
Taking the inverse Fourier transform gives us the probability distribution for ξ(t):
$$p(x,t) = \frac{1}{\sqrt{2\pi q t}}\, e^{-\frac{(x-\xi_0)^2}{2qt}}. \tag{150}$$
To gain insight into this equation, let's assume ξ0 = 0 and q = 2k for some constant k. This gives us
$$p(x,t) = \frac{1}{\sqrt{4\pi k t}}\, e^{-\frac{x^2}{4kt}}.$$
An interesting observation about this result, which can be found in many thermodynamics textbooks, is that the probability distribution p(x, t) derived above is in fact the solution of the heat equation:
$$\frac{\partial}{\partial t}\, p(x,t) = k\,\frac{\partial^2}{\partial x^2}\, p(x,t), \tag{151}$$
assuming that the initial condition is p(x, 0) = δ(x). To see this, we just need to substitute the probability distribution into the heat equation. We can then see that
$$\begin{aligned}
\frac{\partial}{\partial t}\, p(x,t) &= \frac{1}{2t}\left(\frac{x^2}{2kt} - 1\right)\cdot\frac{1}{\sqrt{4\pi k t}}\, e^{-\frac{x^2}{4kt}},\\
\frac{\partial^2}{\partial x^2}\, p(x,t) &= \frac{1}{2kt}\left(\frac{x^2}{2kt} - 1\right)\cdot\frac{1}{\sqrt{4\pi k t}}\, e^{-\frac{x^2}{4kt}},
\end{aligned}$$
so that $\frac{\partial}{\partial t} p(x,t) = k\,\frac{\partial^2}{\partial x^2} p(x,t)$ indeed holds.
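The identity can also be checked numerically with finite differences, as the short sketch below shows; the value of k and the probe point are illustrative.

```python
# Finite-difference check that p(x,t) = (4*pi*k*t)^(-1/2) exp(-x^2/(4kt)) solves dp/dt = k d^2p/dx^2.
import numpy as np

k = 0.3
def p(x, t):
    return np.exp(-x**2 / (4 * k * t)) / np.sqrt(4 * np.pi * k * t)

x, t, h = 0.7, 1.5, 1e-4
dp_dt   = (p(x, t + h) - p(x, t - h)) / (2 * h)
d2p_dx2 = (p(x + h, t) - 2 * p(x, t) + p(x - h, t)) / h**2
print(dp_dt, k * d2p_dx2)   # the two numbers should agree to several digits
```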
Figure 25: Realization of a Wiener process. (a) The random process follows the stochastic differential
equation. We show a few realizations of the random process. (b) The underlying probability distribution
p(x, t). As t increases, the variance of the Gaussian also increases.
For more complicated Langevin equations (involving nonlinear terms), it seems natural to expect a similar partial differential equation characterizing the probability distribution. More specifically, it seems reasonable to expect that on one side of the equation we will have ∂/∂t, and on the other side of the equation we will have ∂²/∂x². As we will show later, the Fokker-Planck equation has a form similar to this. Indeed, one can derive the heat equation from the Fokker-Planck equation.
with E[Γi(t)] = 0 and E[Γi(t)Γj(t′)] = qij δ(t − t′), where qij = qji, the corresponding random process is known as the Ornstein-Uhlenbeck process.
where h(ξ, t) and g(ξ, t) are functions denoting the drift and diffusion, respectively. Like before, we
assume that Γ(t) is a Gaussian white noise so that E[Γ(t)] = 0 for all t, and E[Γ(t)Γ(t′ )] = 2δ(t − t′ ).
Readers can refer to Example 5.2 to see how the reverse DDPM would fit this equation.
Markov Property. Let’s first define a Markov process. Suppose that ξ(t) has a value xn = ξ(tn ) at
time tn , and let t1 ≤ t2 . . . ≤ tn . We will use the notation p(xn , tn ) to describe the probability density of
having ξ(tn ) = xn . We also introduce the following short-hand notation
That is, the probability of getting state xn at tn given all the previous states is the same as if we are
only conditioning on the immediate previous state xn−1 at tn−1 .
The random process ξ(t) satisfying the nonlinear Langevin equation defined in Definition 5.2 is Markov, as long as Γ(t) is δ-correlated. That means the conditional probability at tn only depends on the value at tn−1. The reason was summarized by Risken [33]: (i) a first-order differential equation is uniquely determined by its initial value; (ii) a δ-correlated Langevin force Γ(t) at a former time t < tn−1 cannot change the conditional probability at a later time t > tn−1. Risken further elaborates that the Markovian property is destroyed if Γ(t) is no longer δ-correlated. For example, if Γ(t) is such that $\mathbb{E}[\Gamma(t)\Gamma(t')] = \frac{q}{2\gamma}\, e^{-\gamma|t-t'|}$, then the process described by $\dot{\xi}(t) = h(\xi) + \Gamma(t)$ will be non-Markovian. From now on, we will focus only on Markov processes.
Chapman-Kolmogorov Equation. Consider a Markov process ξ(t). We can derive a useful result
known as the Chapman-Kolmogorov equation. The Chapman-Kolmogorov equation states that the joint
distribution at t3 and t1 can be found by integrating the conditional probabilities of t3 given t2 and then t2
given t1 . The two key arguments here is the Bayes Theorem plus the definition of marginalization, and the
memoryless property of a Markov process.
Theorem 5.5. Chapman-Kolmogorov Equation. Let ξ(t) be a Markov process, and let xn = ξ(tn )
be the state of ξ(t) at time tn . Then
Z
p(x3 , t3 | x1 , t1 ) = p(x3 , t3 |x2 , t2 )p(x2 , t2 |x1 , t1 )dx2 , (154)
assuming t1 ≤ t2 ≤ t3 .
Proof. For notational simplicity, let’s denote xn = {xn , . . . , x1 } and tn = {tn , . . . , t1 }. If ξ(t) is
Markov, then by the memoryless property of Markov, we have that
Masters Equation. Based on the Chapman-Kolmogorov equation, we can derive a fundamental equa-
tion for Markov processes. This equation is called the Masters Equation.
Theorem 5.6. Let ξ(t) be a Markov process. The Masters Equation states that
$$\frac{\partial}{\partial t}\, p(x,t) = \int\Big[W(x|x')\, p(x',t) - W(x'|x)\, p(x,t)\Big]\, dx', \tag{157}$$
(x1 , t1 ) −→ (x0 , t0 )
(x2 , t2 ) −→ (x, t)
(x3 , t3 ) −→ (x, t + ∆t)
Since our goal is to obtain the partial derivative in time, we consider the time derivative of p(x, t | x0, t0):
$$\begin{aligned}
\frac{\partial}{\partial t}\, p(x,t\,|\,x_0,t_0) &= \lim_{\Delta t\to 0}\frac{p(x,t+\Delta t\,|\,x_0,t_0) - p(x,t\,|\,x_0,t_0)}{\Delta t}\\
&= \lim_{\Delta t\to 0}\frac{\int p(x,t+\Delta t\,|\,x',t)\, p(x',t\,|\,x_0,t_0)\, dx' - p(x,t\,|\,x_0,t_0)}{\Delta t}.
\end{aligned}$$
We note that on the right hand side of the equation above there is an integration. If we switch the variables x and x′, we can use the following observation:
$$\int p(x',t+\Delta t\,|\,x,t)\, dx' = 1.$$
Next, we can move the limits into the integration. Let's define
$$W(x,t\,|\,x',t) = \lim_{\Delta t\to 0}\frac{1}{\Delta t}\, p(x,t+\Delta t\,|\,x',t), \qquad W(x',t\,|\,x,t) = \lim_{\Delta t\to 0}\frac{1}{\Delta t}\, p(x',t+\Delta t\,|\,x,t).$$
So, we have
$$\frac{\partial}{\partial t}\, p(x,t\,|\,x_0,t_0) = \int\Big[W(x,t\,|\,x',t)\, p(x',t\,|\,x_0,t_0) - W(x',t\,|\,x,t)\, p(x,t\,|\,x_0,t_0)\Big]\, dx'. \tag{160}$$
If we fix (x0, t0), then we can drop the conditioning. This gives us
$$\frac{\partial}{\partial t}\, p(x,t) = \int\Big[W(x,t\,|\,x',t)\, p(x',t) - W(x',t\,|\,x,t)\, p(x,t)\Big]\, dx'. \tag{161}$$
In the derivation above, the terms W(x, t | x′, t) and W(x′, t | x, t) are known as the transition rates. They are the transition probability per unit time, with unit [time⁻¹]. Thus, if we integrate them with respect to time, we obtain
$$\int W(x,t\,|\,x',t)\, dt = p(x,t\,|\,x',t), \qquad \int W(x',t\,|\,x,t)\, dt = p(x',t\,|\,x,t).$$
One way to visualize the Masters Equation is to consider $\int W(x,t\,|\,x',t)\, p(x',t)\, dx'$ as the in-flow and $\int W(x',t\,|\,x,t)\, p(x,t)\, dx'$ as the out-flow of the transition probability from state x′ to x (and from x to x′). So if we view the probability as the density of particles in a room, then the Masters Equation says that the rate of change of the density is the difference between the in-flow and the out-flow of the particles:
$$\underbrace{\frac{\partial}{\partial t}\, p(x,t)}_{\text{rate of change}} = \underbrace{\int\Big[W(x,t\,|\,x',t)\, p(x',t)\Big]\, dx'}_{\text{in-flow of probability}} - \underbrace{\int\Big[W(x',t\,|\,x,t)\, p(x,t)\Big]\, dx'}_{\text{out-flow of probability}}. \tag{162}$$
The Masters Equation is used widely in chemistry, biology, and many other disciplines. The notion of in-flow and out-flow of particles is particularly useful for studying the dynamics of a system. Another important aspect of the Masters Equation is that it relates time ∂t with the state dx′. This will become prevalent in the Fokker-Planck equation.
One complaint about the above proof is that although it is rigorous, it lacks physical intuition. In what follows, we present an alternative proof which is more intuitive but less rigorous. The proof is based on a lecture note by Luca Donati [13].
Let us define the rates W(x2|x1) and W(x1|x2) such that
Notice here we have implicitly assumed that the transition distribution is Markov so that the current
state only depends on its previous state and not the entire history. Then the above equation can be
written as
p(x1 , t + dt) = p(x1 , t)(1 − W (x2 |x1 )dt) + p(x2 , t)W (x1 |x2 )dt + O(dt2 ).
The high-order term is there to account for multiple jumps during (t, t + dt), e.g., jump from x1 to x2
and then from x2 to x1 within the interval. However, this term will vanish if dt → 0. By rearranging
the terms, we can write
$$\frac{d\, p(x_1,t)}{dt} = -W(x_2|x_1)\, p(x_1,t) + W(x_1|x_2)\, p(x_2,t).$$
We can generalize this result to multiple states to and from x1. For example,
$$\frac{d\, p(x_1,t)}{dt} = \sum_{j\ne 1}\Big[-W(x_j|x_1)\, p(x_1,t) + W(x_1|x_j)\, p(x_j,t)\Big].$$
To make it even more general, we can consider a continuum of xj. By rearranging the terms, we obtain
$$\frac{d\, p(x,t)}{dt} = \int\Big[W(x|x')\, p(x',t) - W(x'|x)\, p(x,t)\Big]\, dx',$$
which is the Masters equation.
Theorem 5.7. Let ξ(t) be a Markov process and let p(x, t) be the probability distribution of ξ(t)
taking a value x at time t. The Kramers-Moyal Expansion states that
$$\frac{\partial}{\partial t}\, p(x,t) = \sum_{m=1}^{\infty}(-1)^m\,\frac{\partial^m}{\partial x^m}\Big[D^{(m)}(x,t)\, p(x,t)\Big],$$
Proof. Let’s start with the Masters Equation. The Masters Equation states that
"Z
∂ 1
p(x, t | x0 , t0 ) = lim p(x, t + ∆t | x′ , t)p(x′ , t | x0 , t0 )dx′
∂t ∆t→0 ∆t
Z #
′ ′
− p(x , t + ∆t | x, t)p(x, t | x0 , t0 )dx .
∞
(x − x′ )m ∂ m
ZZ X
+ m
φ(x) p(x, t + ∆t | x′ , t)p(x′ , t | x0 , t0 )dx′ dx
m=1
m! ∂x x=x′
ZZ #
− φ(x)p(x′ , t + ∆t | x, t)p(x, t | x0 , t0 )dx′ dx (163)
We notice that the last double integral in the equation above has dummy variables x′ and x. We can switch the dummy variables and write
$$\int\!\!\int\varphi(x)\, p(x',t+\Delta t\,|\,x,t)\, p(x,t\,|\,x_0,t_0)\, dx'\, dx = \int\!\!\int\varphi(x')\, p(x,t+\Delta t\,|\,x',t)\, p(x',t\,|\,x_0,t_0)\, dx'\, dx.$$
Then, the first and the third double integrals in Eqn (163) can be canceled. This leaves us with
$$\frac{\partial}{\partial t}\int\varphi(x)\, p(x,t\,|\,x_0,t_0)\, dx = \lim_{\Delta t\to 0}\frac{1}{\Delta t}\int\!\!\int\sum_{m=1}^{\infty}\frac{(x-x')^m}{m!}\,\frac{\partial^m\varphi(x)}{\partial x^m}\bigg|_{x=x'}\, p(x,t+\Delta t\,|\,x',t)\, p(x',t\,|\,x_0,t_0)\, dx'\, dx, \tag{164}$$
where the last step is known as generalized integration by parts, which states that for any continuously differentiable functions f and g,
$$\int g\cdot\frac{\partial^m f}{\partial x^m}\, dx = (-1)^m\int f\cdot\frac{\partial^m g}{\partial x^m}\, dx.$$
Combining all of these, and recognizing that the above result holds for any arbitrary φ(x), it follows that
$$\frac{\partial}{\partial t}\, p(x,t\,|\,x_0,t_0) = \sum_{m=1}^{\infty}(-1)^m\,\frac{\partial^m}{\partial x^m}\Big[D^{(m)}(x,t)\, p(x,t\,|\,x_0,t_0)\Big]. \tag{165}$$
The Kramers-Moyal expansion expresses the time derivative ∂t of the probability distribution of any Markov process (including the solution of the nonlinear Langevin equation) through the spatial derivative ∂x. However, the expansion has infinitely many terms. An important question now is: are we allowed to truncate any of these terms? If so, how many terms can be truncated? The Pawula Theorem provides an answer to this question [30]:
Theorem 5.8. Pawula Theorem. The Kramers-Moyal expansion may stop at one of the following
three cases:
• m = 1: The resulting differential equation is known as the Liouville’s Equation which is a deter-
ministic process.
• m = 2: The resulting differential equation is known as the Fokker-Planck Equation.
Denote x′ = ξ(t + ∆t), the subject of interest here is the m-th moment E [(x′ − x)m ].
Note that we cannot apply the above arguments for m = 0, 1, 2 because they will give trivial equalities.
From these two relationships, and denoting Dm = D(m)(x, t), the above two cases can be written as
$$D_m^2 \le D_{m-1}\, D_{m+1}, \quad m \text{ odd and } m\ge 3, \qquad\qquad D_m^2 \le D_{m-2}\, D_{m+2}, \quad m \text{ even and } m\ge 4.$$
Our goal now is to show that this recurring relationship gives us Dm = 0 for any m ≥ 3.
Suppose first that D4 = 0. Then D6² ≤ D4D8 implies that D6 = 0. But if D6 = 0, then D8² ≤ D6D10 implies that D8 = 0. Repeating the process gives us Dm = 0 for m = 4, 6, 8, 10, . . .. Similarly, suppose that D6 = 0. Then D4² ≤ D2D6 implies that D4 = 0. But if D4 = 0, we can go back to the first case and show that Dm = 0 for m = 4, 6, 8, 10, . . .. In general, all even m ≥ 4 must be zero if any one of these even m ≥ 4 is zero. For the odd m's: if D4 = 0, then D3² ≤ D2D4 implies that D3 = 0. Similarly, if D6 = 0, we will have D5 = 0. So if D4 = D6 = D8 = . . . = 0, then D3 = D5 = D7 = . . . = 0. Therefore, if Dm = 0 for any even m ≥ 4, then Dm = 0 for all integers m ≥ 3.
The above analysis suggests that if Kramers-Moyal expansion is truncated up to m = 3 so that
D3 ̸= 0 and D4 = D5 = . . . = 0, then D4 = 0 will force D3 = 0. So we will have Dm = 0 for all
m ≥ 3. Similarly, if the Kramers-Moyal expansion is truncated up to m = 4 so that D4 ̸= 0 and
D5 = D6 = . . . = 0, then D6 = 0 will force D4 = 0. So we will have Dm = 0 for all m ≥ 3 again.
By repeating the above argument for other m ≥ 3, we see that it is impossible to have Kramers-
Moyal expansion be truncated for any m ≥ 3. In other words, we can either truncate the expansion for
m = 1, m = 2, or we never truncate it.
Pawula Theorem does not say that the Fokker-Planck Equation (truncating Kramers-Moyal Expansion
up to m = 2) is a good approximation to the underlying Masters Equation. It only says that we can either
exactly approximate the Masters Equation using m = 1 or m = 2, or we cannot approximate at all.
Definition 5.4. The Fokker-Planck Equation is obtained by truncating the Kramers-Moyal expan-
sion to m = 2. That is, for any Markov process ξ(t), the probability distribution p(x, t) of ξ(t) = x at
time t will satisfy the following partial differential equation:
$$\frac{\partial}{\partial t}\, p(x,t) = -\frac{\partial}{\partial x}\Big[D^{(1)}(x,t)\, p(x,t)\Big] + \frac{\partial^2}{\partial x^2}\Big[D^{(2)}(x,t)\, p(x,t)\Big]. \tag{167}$$
we can evaluate the Kramers-Moyal coefficients D(m)(x, t). The following theorem summarizes the coefficients. We remark that during the proof of this theorem, it will become clear why only the coefficients for m = 1 and m = 2 survive.
Theorem 5.9. Fokker-Planck for the nonlinear Langevin Equation. Consider the nonlinear Langevin equation
$$\dot{\xi} = h(\xi,t) + g(\xi,t)\,\Gamma(t),$$
for functions h(ξ, t) and g(ξ, t). The Fokker-Planck Equation for this nonlinear Langevin equation has Kramers-Moyal coefficients
$$D^{(m)}(x,t) = \frac{1}{m!}\lim_{\tau\to 0}\frac{1}{\tau}\,\mathbb{E}\big[(\xi(t+\tau)-x)^m\big]\Big|_{\xi(t)=x},$$
which evaluate to $D^{(1)}(x,t) = h(x,t) + g'(x,t)\, g(x,t)$ and $D^{(2)}(x,t) = g^2(x,t)$.
where we only write the terms involving h, g and Γ. Terms involving ξ(t′′ ) − x are not dropped.
Take expectation, and noticing that E[Γ(t)] = 0, we can show that only the first two terms and the
last term in Eqn (170) will survive. Thus, we have
$$\begin{aligned}
\mathbb{E}[\xi(t+\tau)-x] &= \int_t^{t+\tau} h(x,t')\, dt' + \int_t^{t+\tau}\!\!\int_t^{t'} h'(x,t')\, h(x,t'')\, dt''\, dt' + \cdots\\
&\quad + \int_t^{t+\tau} g'(x,t')\underbrace{\int_t^{t'} g(x,t'')\, 2\delta(t''-t')\, dt''}_{=g(x,t')}\, dt' + \cdots, \tag{171}
\end{aligned}$$
where we follow Risken's definition that $\int_t^{t'} 2\delta(t''-t')\, dt'' = 1$ [33]. As τ → 0, it follows that the first and third terms of Eqn (171) give
$$\lim_{\tau\to 0}\frac{1}{\tau}\int_t^{t+\tau} h(x,t')\, dt' = h(x,t), \qquad \lim_{\tau\to 0}\frac{1}{\tau}\int_t^{t+\tau} g'(x,t')\, g(x,t')\, dt' = g'(x,t)\, g(x,t).$$
The derivation of D(2)(x, t) follows essentially the same set of arguments. The key to note here is that when we take the square in E[(ξ(t+τ) − x)²], the integrals in Eqn (171) give contributions proportional to τ². When τ → 0, all these terms vanish because there is only one 1/τ in the definition of D(2)(x, t). As a result, the only term that survives is
$$D^{(2)}(x,t) = \frac{1}{2}\lim_{\tau\to 0}\frac{1}{\tau}\int_t^{t+\tau}\!\!\int_t^{t+\tau} g(x,t')\, g(x,t'')\, 2\delta(t'-t'')\, dt'\, dt'' = g^2(x,t).$$
ξ˙ = A(ξ)ξ + σΓ(t).
Then, the probability distribution p(x, t) of the solution ξ(t) satisfies the following Fokker-Planck equation:
$$\begin{aligned}
\frac{\partial}{\partial t}\, p(x,t) &= -\frac{\partial}{\partial x}\Big[\big(h(x,t) + g'(x,t)\, g(x,t)\big)\, p(x,t)\Big] + \frac{\partial^2}{\partial x^2}\Big[g^2(x,t)\, p(x,t)\Big]\\
&= -\frac{\partial}{\partial x}\Big[\big(A(x) + 0\cdot\sigma\big)\, p(x,t)\Big] + \frac{\partial^2}{\partial x^2}\Big[\sigma^2\, p(x,t)\Big]\\
&= -\frac{\partial}{\partial x}\Big[A(x)\, p(x,t)\Big] + \sigma^2\,\frac{\partial^2}{\partial x^2}\, p(x,t).
\end{aligned}$$
Example 5.5. For the special case where A(x) = 0, the Langevin equation simplifies to a Wiener process:
$$\dot{\xi} = \sigma\,\Gamma(t).$$
The corresponding Fokker-Planck equation is
$$\frac{\partial}{\partial t}\, p(x,t) = \sigma^2\,\frac{\partial^2}{\partial x^2}\, p(x,t).$$
This equation is known as the heat equation or the diffusion equation. If the initial condition is p(x, 0) = δ(x), the solution is (see the derivation below)
$$p(x,t) = \frac{1}{\sqrt{4\pi\sigma^2 t}}\, e^{-\frac{x^2}{4\sigma^2 t}}.$$
Solution to the Heat Equation. The heat equation can be solved using Fourier transforms. For notational simplicity we denote $u_t = \frac{\partial u}{\partial t}$ and $u_{xx} = \frac{\partial^2 u}{\partial x^2}$. Consider a generic heat equation
$$u_t = k\, u_{xx},$$
with initial condition u(x, 0) = φ(x). We can take the Fourier transform (mapping x ↔ ω) on both sides by defining
$$\widehat{u}_t(\omega,t) = \mathcal{F}\{u_t(x,t)\} = \int_{-\infty}^{\infty} u_t(x,t)\, e^{j\omega x}\, dx, \qquad \widehat{u}_{xx}(\omega,t) = \mathcal{F}\{u_{xx}(x,t)\} = \int_{-\infty}^{\infty} u_{xx}(x,t)\, e^{j\omega x}\, dx.$$
This gives us
$$\widehat{u}_t(\omega,t) = k\,\widehat{u}_{xx}(\omega,t).$$
Using the differentiation property of Fourier transform, we can write the right hand side of the equation
as
(Remark: Eqn (172) is just a simple differential equation f ′ (t) = af (t) whose solution can be found by
integration.) Therefore, if we take the inverse Fourier transform of û(ω, t) (with respect to ω ↔ x), we will have
$$u(x,t) = \mathcal{F}^{-1}\{\widehat{u}(\omega,t)\} = \mathcal{F}^{-1}\Big\{\widehat{\phi}(\omega)\, e^{-k\omega^2 t}\Big\},$$
which is the inverse Fourier transform of the product of $\widehat{\phi}(\omega)$ and $\widehat{f}(\omega) \overset{\text{def}}{=} e^{-k\omega^2 t}$. Since multiplication in the Fourier domain is convolution in the spatial domain, it follows that u(x, t) is the convolution of φ(x) and $f(x) = \mathcal{F}^{-1}(e^{-k\omega^2 t})$. But $\mathcal{F}^{-1}(e^{-k\omega^2 t}) = \frac{1}{\sqrt{2kt}}\, e^{-x^2/(4kt)}$. Therefore, we can show that the solution is
$$u(x,t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\phi(x-x')\, f(x')\, dx' = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\phi(x-x')\,\frac{1}{\sqrt{2kt}}\, e^{-(x')^2/(4kt)}\, dx'.$$
Probability Current. The Fokker-Planck Equation has some interesting physics interpretations. Recall
that the Fokker-Planck Equation is
$$\frac{\partial}{\partial t}\, p(x,t) = -\frac{\partial}{\partial x}\Big[D^{(1)}(x,t)\, p(x,t)\Big] + \frac{\partial^2}{\partial x^2}\Big[D^{(2)}(x,t)\, p(x,t)\Big]. \tag{175}$$
Let's define a quantity
$$S(x,t) = \left[D^{(1)}(x,t) - \frac{\partial}{\partial x}\, D^{(2)}(x,t)\right] p(x,t). \tag{176}$$
Then, the Fokker-Planck Equation can be written as
$$\frac{\partial p}{\partial t} + \frac{\partial S}{\partial x} = 0. \tag{177}$$
One way to interpret Eqn (177) is through the notion of a probability current.
Intuitive Derivation of Eqn (177). Conservation tells us that if the amount of some quantity in a spatial region (e.g., particles or charges) increases or decreases, the change must be accounted for by the flow across the boundary of that region. So, if p(x, t) represents some sort of density, then p(x, t) dx is the amount of particles sitting between (x, x + dx) at time t, and S(x, t) can be viewed as the current of particles flowing per unit time across the point x. For a time interval (t, t + dt) and a spatial interval (x, x + dx), the change in the amount of particles is
$$\big[p(x, t + dt) - p(x, t)\big]\, dx \approx \frac{\partial p}{\partial t}\, dx\, dt.$$
Because of conservation, for this change to happen there must be a net flow of the current into the interval. The net current over the interval is
$$\big[S(x, t) - S(x + dx, t)\big]\, dt \approx -\frac{\partial S}{\partial x}\, dx\, dt.$$
Equating the two quantities and canceling dx dt gives Eqn (177).
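The bookkeeping in this derivation can be checked numerically. The Python sketch below (ours; the interval [a, b], σ, and t are arbitrary) uses the heat-kernel solution from Example 5.5, for which D^{(1)} = 0 and D^{(2)} = σ², so that S(x, t) = −σ² ∂p/∂x, and verifies that the rate of change of the probability contained in [a, b] equals the net current S(a, t) − S(b, t) flowing into the interval.

```python
import numpy as np

# Numerical check of Eqn (177) for the heat-kernel solution of Example 5.5,
# where D1 = 0 and D2 = sigma^2, so S(x, t) = -sigma^2 * dp/dx.
# The interval [a, b], sigma, and t below are arbitrary.
sigma, t = 0.8, 1.0
a, b = -0.5, 1.2

def p(x, t):
    return np.exp(-x**2 / (4 * sigma**2 * t)) / np.sqrt(4 * np.pi * sigma**2 * t)

def S(x, t, eps=1e-5):
    dpdx = (p(x + eps, t) - p(x - eps, t)) / (2 * eps)   # finite-difference dp/dx
    return -sigma**2 * dpdx

x = np.linspace(a, b, 20001)
dx = x[1] - x[0]
dt = 1e-5
mass_now = p(x, t).sum() * dx            # probability inside [a, b] at time t
mass_next = p(x, t + dt).sum() * dx      # probability inside [a, b] at time t + dt

lhs = (mass_next - mass_now) / dt        # rate of change of the probability in [a, b]
rhs = S(a, t) - S(b, t)                  # net current flowing into [a, b]
print(lhs, rhs)                          # the two numbers should nearly agree
```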
Example 5.7. Connection with SMLD. Let's try to map our results with Eqn (96) defined in Definition 3.1. We shall consider the 1D case. Consider the following Langevin equation:
$$\underbrace{\frac{\partial x}{\partial t}}_{\dot{\xi}} = \underbrace{\tau \frac{\partial}{\partial x}\log p(x)}_{h(\xi,\, t)} + \underbrace{\sigma}_{g(\xi,\, t)}\,\Gamma(t).$$
To avoid notational confusion, we let W(x, t) be the probability distribution of the solution x(t) of this Langevin equation. The Kramers-Moyal coefficients for this Langevin equation are
$$D^{(1)}(x, t) = h(x, t) + g'(x, t)\, g(x, t) = \tau \frac{\partial}{\partial x}\log p(x) \overset{\text{def}}{=} A(x),
\qquad
D^{(2)}(x, t) = g(x, t)^2 = \sigma^2.$$
The corresponding Fokker-Planck equation is therefore
$$\frac{\partial}{\partial t} W(x, t) = -\frac{\partial}{\partial x}\Big[D^{(1)}(x, t)\, W(x, t)\Big] + \frac{\partial^2}{\partial x^2}\Big[D^{(2)}(x, t)\, W(x, t)\Big]
= -\frac{\partial}{\partial x}\big[A(x)\, W(x, t)\big] + \sigma^2\, \frac{\partial^2}{\partial x^2} W(x, t).$$
At equilibrium, when t → ∞, the probability distribution W(x, t) settles to a time-independent distribution W(x). Since the probability current vanishes at equilibrium, it follows that
$$A(x)\, W(x) = \sigma^2\, \frac{\partial}{\partial x} W(x).$$
Substituting A(x) = τ (∂/∂x) log p(x) and dividing both sides by W(x), we obtain
$$\tau \frac{\partial}{\partial x}\log p(x) = \sigma^2\, \frac{1}{W(x)}\frac{\partial}{\partial x} W(x) = \sigma^2\, \frac{\partial}{\partial x}\log W(x).$$
Since we have the freedom to choose σ, we will just make it σ = √τ. Then the above equation simplifies to (∂/∂x) log p(x) = (∂/∂x) log W(x). Integrating both sides with respect to x gives
$$\log p(x) = \log W(x) + C \tag{179}$$
for some constant C. Exponentiating Eqn (179) gives p(x) = W(x) e^C. Since p(x) and W(x) are both probability distributions, we must have ∫ p(x) dx = 1 and ∫ W(x) dx = 1. Thus e^C = 1, i.e., C = 0.
Therefore, we conclude that if we run the Langevin equation until convergence, the probability distribution W(x) of the solution is exactly the ground-truth distribution p(x). Moreover, the noise level σ and the step size τ are related by σ = √τ.
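To see this convergence in action, here is a Python sketch (ours, not part of the tutorial) that iterates the discrete-time update x_{k+1} = x_k + τ (∂/∂x) log p(x_k) + √(2τ) z_k with z_k ∼ N(0, 1), which corresponds to the σ = √τ choice together with the 2δ(t − t') noise convention used above. The target p(x) is a hypothetical two-component Gaussian mixture; the step size, the number of chains, and the number of iterations are arbitrary. After many iterations the sample mean and variance match those of p(x).

```python
import numpy as np

# Langevin sampling from a hypothetical 1D Gaussian mixture (illustration only):
#   p(x) = 0.3 * N(x; -1.0, 0.6^2) + 0.7 * N(x; 1.5, 0.8^2)
means = np.array([-1.0, 1.5])
stds = np.array([0.6, 0.8])
weights = np.array([0.3, 0.7])

def grad_log_p(x):
    # gradient of log p(x) from the mixture responsibilities
    comp = weights * np.exp(-(x[:, None] - means) ** 2 / (2 * stds**2)) / (np.sqrt(2 * np.pi) * stds)
    resp = comp / (comp.sum(axis=1, keepdims=True) + 1e-300)
    return (resp * (means - x[:, None]) / stds**2).sum(axis=1)

tau = 1e-3                               # step size (arbitrary, but small)
n_chains, n_steps = 10_000, 10_000
rng = np.random.default_rng(2)
x = rng.normal(size=n_chains)            # arbitrary initialization

for _ in range(n_steps):
    x = x + tau * grad_log_p(x) + np.sqrt(2 * tau) * rng.normal(size=n_chains)

true_mean = (weights * means).sum()
true_var = (weights * (stds**2 + means**2)).sum() - true_mean**2
print("sample mean:", x.mean(), "  true mean:", true_mean)
print("sample var :", x.var(),  "  true var :", true_var)
```

A smaller τ reduces the discretization bias of the update, at the cost of needing more iterations for the chains to forget their initialization.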
Acknowledgement
This work is supported, in part, by the National Science Foundation under awards 2030570, 2134209, and 2133032, as well as by the SRC JUMP 2.0 Center and research awards from Samsung Research America. Since this draft was posted on the internet in March 2024, we have received a great deal of constructive feedback from readers all over the world. Thank you all for your input. Thanks also to the many graduate students at Purdue who shared good thoughts about the content of the tutorial. We want to give special thanks to William Chi-Kin Yau, who worked tirelessly with us on the section about the Langevin and Fokker-Planck equations.