Elucidating the Design Space of Diffusion-Based Generative Models
Abstract
We argue that the theory and practice of diffusion-based generative models are
currently unnecessarily convoluted and seek to remedy the situation by presenting
a design space that clearly separates the concrete design choices. This lets us
identify several changes to both the sampling and training processes, as well as
preconditioning of the score networks. Together, our improvements yield new
state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in
an unconditional setting, with much faster sampling (35 network evaluations per
image) than prior designs. To further demonstrate their modular nature, we show
that our design changes dramatically improve both the efficiency and quality ob-
tainable with pre-trained score networks from previous work, including improving
the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55,
and after re-training with our proposed improvements to a new SOTA of 1.36.
1 Introduction
Diffusion-based generative models [45] have emerged as a powerful new framework for neural image
synthesis, in both unconditional [16, 36, 48] and conditional [17, 35, 36, 38, 39, 41, 42, 48] settings,
even surpassing the quality of GANs [13] in certain situations [9]. They are also rapidly finding use
in other domains such as audio [27, 37] and video [19] generation, image segmentation [4, 54] and
language translation [34]. As such, there is great interest in applying these models and improving
them further in terms of image/distribution quality, training cost, and generation speed.
The literature on these models is dense on theory, and derivations of sampling schedule, training
dynamics, noise level parameterization, etc., tend to be based as directly as possible on theoretical
frameworks, which ensures that the models are on a solid theoretical footing. However, this approach
has a danger of obscuring the available design space — a proposed model may appear as a tightly
coupled package where no individual component can be modified without breaking the entire system.
As our first contribution, we take a look at the theory behind these models from a practical standpoint,
focusing more on the “tangible” objects and algorithms that appear in the training and sampling
phases, and less on the statistical processes from which they might be derived. The goal is to obtain
better insights into how these components are linked together and what degrees of freedom are
available in the design of the overall system. We focus on the broad class of models where a neural
network is used to model the score [22] of a noise level dependent marginal distribution of the training
data corrupted by Gaussian noise. Thus, our work is in the context of denoising score matching [51].
(a) Noisy images drawn from p(x; σ) (b) Ideal denoiser outputs D(x; σ)
Figure 1: Denoising score matching on CIFAR-10. (a) Images from the training set corrupted with
varying levels of additive Gaussian noise. High levels of noise lead to oversaturated colors; we
normalize the images for cleaner visualization. (b) Optimal denoising result from minimizing Eq. 2
analytically (see Appendix B.3). With increasing noise level, the result approaches dataset mean.
Our second set of contributions concerns the sampling processes used to synthesize images using
diffusion models. We identify the best-performing time discretization for sampling, apply a higher-
order Runge–Kutta method for the sampling process, evaluate different sampler schedules, and
analyze the usefulness of stochasticity in the sampling process. The result of these improvements is a
significant drop in the number of sampling steps required during synthesis, and the improved sampler
can be used as a drop-in replacement with several widely used diffusion models [36, 48].
The third set of contributions focuses on the training of the score-modeling neural network. While
we continue to rely on the commonly used network architectures (DDPM [16], NCSN [47]), we
provide the first principled analysis of the preconditioning of the networks’ inputs, outputs, and loss
functions in a diffusion model setting and derive best practices for improving the training dynamics.
We also suggest an improved distribution of noise levels during training, and note that non-leaking
augmentation [25] — typically used with GANs — is beneficial for diffusion models as well.
Taken together, our contributions enable significant improvements in result quality, e.g., leading to
record FIDs of 1.79 for CIFAR-10 [28] and 1.36 for ImageNet [8] in 64×64 resolution. With all key
ingredients of the design space explicitly tabulated, we believe that our approach will allow easier
innovation on the individual components, and thus enable more extensive and targeted exploration of
the design space of diffusion models. Our implementation and pre-trained models are available at
https://github.com/NVlabs/edm
2 Expressing diffusion models in a common framework
Let us denote the data distribution by pdata(x), with standard deviation σdata, and consider the family
of mollified distributions p(x; σ) obtained by adding i.i.d. Gaussian noise of standard deviation σ to
the data. For σmax ≫ σdata, p(x; σmax) is practically indistinguishable from pure Gaussian noise. The
idea of diffusion models is to randomly sample a noise image x0 ∼ N(0, σmax² I), and sequentially
denoise it into images xi with noise levels σ0 = σmax > σ1 > · · · > σN = 0 so that at each noise
level xi ∼ p(xi ; σi ). The endpoint xN of this process is thus distributed according to the data.
Song et al. [48] present a stochastic differential equation (SDE) that maintains the desired distribution
p as sample x evolves over time. This allows the above process to be implemented using a stochastic
solver that both removes and adds noise at each iteration. They also give a corresponding “probability
flow” ordinary differential equation (ODE) where the only source of randomness is the initial noise
image x0 . Contrary to the usual order of treatment, we begin by examining the ODE, as it offers a
fruitful setting for analyzing sampling trajectories and their discretizations. The insights carry over to
stochastic sampling, which we reintroduce as a generalization in Section 4.
ODE formulation. A probability flow ODE [48] continuously increases or reduces the noise level of
the image when moving forward or backward in time, respectively. To specify the ODE, we must first
choose a schedule σ(t) that defines the desired noise level at time t. For example, setting σ(t) ∝ √t
is mathematically natural, as it corresponds to constant-speed heat diffusion [12]. However, we will
show in Section 3 that the choice of schedule has major practical implications and should not be
made on the basis of theoretical convenience.
The defining characteristic of the probability flow ODE is that evolving a sample xa ∼ p(xa; σ(ta))
from time ta to tb (either forward or backward in time) yields a sample xb ∼ p(xb; σ(tb)). Following
previous work [48], this requirement is satisfied (see Appendix B.1 and B.2) by
dx = −σ̇(t) σ(t) ∇x log p(x; σ(t)) dt,   (1)
Table 1: Specific design choices employed by different model families. N is the number of ODE
solver iterations that we wish to execute during sampling. The corresponding sequence of time
steps is {t0, t1, ..., tN}, where tN = 0. If the model was originally trained for specific choices
of N and {ti}, the originals are denoted by M and {uj}, respectively. The denoiser is defined as
Dθ(x; σ) = cskip(σ) x + cout(σ) Fθ(cin(σ) x; cnoise(σ)); Fθ represents the raw neural network layers.

Sampling (Section 3)
  ODE solver            VP [48]: Euler | VE [48]: Euler | iDDPM [36] + DDIM [46]: Euler | Ours ("EDM"): 2nd order Heun
  Time steps ti<N       VP: 1 + (i/(N−1)) (εs − 1) | VE: σmax² (σmin²/σmax²)^(i/(N−1)) | iDDPM: u⌊j0 + ((M−1−j0)/(N−1)) i + ½⌋, where uM = 0 and uj−1 = √((uj² + 1)/max(ᾱj−1/ᾱj, C1) − 1) | Ours: (σmax^(1/ρ) + (i/(N−1)) (σmin^(1/ρ) − σmax^(1/ρ)))^ρ
  Schedule σ(t)         VP: √(e^(½ βd t² + βmin t) − 1) | VE: √t | iDDPM: t | Ours: t
  Scaling s(t)          VP: 1/√(e^(½ βd t² + βmin t)) | VE: 1 | iDDPM: 1 | Ours: 1

Network and preconditioning (Section 5)
  Architecture of Fθ    VP: DDPM++ | VE: NCSN++ | iDDPM: DDPM | Ours: (any)
  Skip scaling cskip(σ)   VP: 1 | VE: 1 | iDDPM: 1 | Ours: σdata²/(σ² + σdata²)
  Output scaling cout(σ)  VP: −σ | VE: σ | iDDPM: −σ | Ours: σ·σdata/√(σ² + σdata²)
  Input scaling cin(σ)    VP: 1/√(σ² + 1) | VE: 1 | iDDPM: 1/√(σ² + 1) | Ours: 1/√(σ² + σdata²)
  Noise cond. cnoise(σ)   VP: (M−1) σ⁻¹(σ) | VE: ln(½ σ) | iDDPM: M−1−arg minj |uj − σ| | Ours: ¼ ln(σ)

Training (Section 5)
  Noise distribution    VP: σ⁻¹(σ) ∼ U(εt, 1) | VE: ln(σ) ∼ U(ln(σmin), ln(σmax)) | iDDPM: σ = uj, j ∼ U{0, ..., M−1} | Ours: ln(σ) ∼ N(Pmean, Pstd²)
  Loss weighting λ(σ)   VP: 1/σ² | VE: 1/σ² | iDDPM: 1/σ² (note: ∗) | Ours: (σ² + σdata²)/(σ·σdata)²
  Parameters            VP: βd = 19.9, βmin = 0.1, εs = 10⁻³, εt = 10⁻⁵, M = 1000 | VE: σmin = 0.02, σmax = 100 | iDDPM: ᾱj = sin²(π/2 · j/(M(C2 + 1))), C1 = 0.001, C2 = 0.008, M = 1000, j0 = 8† | Ours: σmin = 0.002, σmax = 80, σdata = 0.5, ρ = 7, Pmean = −1.2, Pstd = 1.2

∗ iDDPM also employs a second loss term Lvlb.  † In our tests, j0 = 8 yielded better FID than j0 = 0 used by iDDPM.
where the dot denotes a time derivative. ∇x log p(x; σ) is the score function [22], a vector field that
points towards higher density of data at a given noise level. Intuitively, an infinitesimal forward step
of this ODE nudges the sample away from the data, at a rate that depends on the change in noise level.
Equivalently, a backward step nudges the sample towards the data distribution.
Denoising score matching. The score function has the remarkable property that it does not depend
on the generally intractable normalization constant of the underlying density function p(x; σ) [22],
and thus can be much easier to evaluate. Specifically, if D(x; σ) is a denoiser function that minimizes
the expected L2 denoising error for samples drawn from pdata separately for every σ, i.e.,
Ey∼pdata En∼N(0,σ²I) ∥D(y + n; σ) − y∥²₂,  then  ∇x log p(x; σ) = (D(x; σ) − x)/σ²,   (2, 3)
where y is a training image and n is noise. In this light, the score function isolates the noise
component from the signal in x, and Eq. 1 amplifies (or diminishes) it over time. Figure 1 illustrates
the behavior of ideal D in practice. The key observation in diffusion models is that D(x; σ) can be
implemented as a neural network Dθ (x; σ) trained according to Eq. 2. Note that Dθ may include
additional pre- and post-processing steps, such as scaling x to an appropriate dynamic range; we will
return to such preconditioning in Section 5.
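For a finite training set, the minimizer of Eq. 2 can be written in closed form as a Gaussian-weighted average of the training images; this is what Figure 1b visualizes (see Appendix B.3). Below is a minimal NumPy sketch of this ideal denoiser and the corresponding score of Eq. 3; the function and array names are ours, not from the released code.
```python
import numpy as np

def ideal_denoiser(x, sigma, data):
    """Closed-form minimizer of Eq. 2 for a finite dataset.
    x: (d,) noisy sample; data: (n, d) training images as flat vectors; sigma: noise level."""
    logw = -np.sum((x - data) ** 2, axis=1) / (2 * sigma ** 2)  # log N(x; y_i, sigma^2 I) up to a constant
    w = np.exp(logw - logw.max())                               # subtract max for numerical stability
    return (w / w.sum()) @ data                                 # Gaussian-weighted average of training images

def ideal_score(x, sigma, data):
    """Eq. 3: grad_x log p(x; sigma) expressed through the denoiser."""
    return (ideal_denoiser(x, sigma, data) - x) / sigma ** 2
```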
Time-dependent signal scaling. Some methods (see Appendix C.1) introduce an additional scale
schedule s(t) and consider x = s(t)x̂ to be a scaled version of the original, non-scaled variable x̂.
This changes the time-dependent probability density, and consequently also the ODE solution
trajectories. The resulting ODE is a generalization of Eq. 1:
dx = [ (ṡ(t)/s(t)) x − s(t)² σ̇(t) σ(t) ∇x log p(x/s(t); σ(t)) ] dt.   (4)
Note that we explicitly undo the scaling of x when evaluating the score function to keep the definition
of p(x; σ) independent of s(t).
Solution by discretization. The ODE to be solved is obtained by substituting Eq. 3 into Eq. 4 to
define the point-wise gradient, and the solution can be found by numerical integration, i.e., taking
[Figure 2 plots FID as a function of NFE for each model; the curves are: Original sampler, Our reimplementation, + Heun & our {ti}, + Our σ(t) & s(t), and Black-box RK45.]
(a) Uncond. CIFAR-10, VP ODE (b) Uncond. CIFAR-10, VE ODE (c) Class-cond. ImageNet-64, DDIM
Figure 2: Comparison of deterministic sampling methods using three pre-trained models. For each
curve, the dot indicates the lowest NFE whose FID is within 3% of the lowest observed FID.
finite steps over discrete time intervals. This requires choosing both the integration scheme (e.g.,
Euler or a variant of Runge–Kutta), as well as the discrete sampling times {t0 , t1 , . . . , tN }. Many
prior works rely on Euler’s method, but we show in Section 3 that a 2nd order solver offers a better
computational tradeoff. For brevity, we do not provide a separate pseudocode for Euler’s method
applied to our ODE here, but it can be extracted from Algorithm 1 by omitting lines 6–8.
Putting it together. Table 1 presents formulas for reproducing deterministic variants of three
earlier methods in our framework. These methods were chosen because they are widely used and
achieve state-of-the-art performance, but also because they were derived from different theoretical
foundations. Some of our formulas appear quite different from the original papers as indirection
and recursion have been removed; see Appendix C for details. The main purpose of this reframing
is to bring to light all the independent components that often appear tangled together in previous
work. In our framework, there are no implicit dependencies between the components — any choices
(within reason) for the individual formulas will, in principle, lead to a functioning model. In other
words, changing one component does not necessitate changes elsewhere in order to, e.g., maintain the
property that the model converges to the data in the limit. In practice, some choices and combinations
will of course work better than others.
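As a concrete example, the time-step formula in the "Ours" column of Table 1 can be evaluated directly; a small sketch with the default parameters σmin = 0.002, σmax = 80, ρ = 7 (the function name is ours):
```python
import numpy as np

def edm_time_steps(N, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """t_i for i < N from the "Ours" column of Table 1, followed by t_N = 0."""
    i = np.arange(N)
    t = (sigma_max ** (1 / rho)
         + i / (N - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(t, 0.0)

print(edm_time_steps(5))  # decreases from sigma_max = 80 down to sigma_min = 0.002, then 0
```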
Algorithm 1 Deterministic sampling using Heun’s 2nd order method with arbitrary σ(t) and s(t).
1: procedure HeunSampler(Dθ(x; σ), σ(t), s(t), ti∈{0,...,N})
2:   sample x0 ∼ N(0, σ²(t0) s²(t0) I)                                                  ▷ Generate initial sample at t0
3:   for i ∈ {0, . . . , N − 1} do                                                      ▷ Solve Eq. 4 over N time steps
4:     di ← (σ̇(ti)/σ(ti) + ṡ(ti)/s(ti)) xi − (σ̇(ti) s(ti)/σ(ti)) Dθ(xi/s(ti); σ(ti))      ▷ Evaluate dx/dt at ti
5:     xi+1 ← xi + (ti+1 − ti) di                                                       ▷ Take Euler step from ti to ti+1
6:     if σ(ti+1) ≠ 0 then                                                              ▷ Apply 2nd order correction unless σ goes to zero
7:       di′ ← (σ̇(ti+1)/σ(ti+1) + ṡ(ti+1)/s(ti+1)) xi+1 − (σ̇(ti+1) s(ti+1)/σ(ti+1)) Dθ(xi+1/s(ti+1); σ(ti+1))   ▷ Evaluate dx/dt at ti+1
8:       xi+1 ← xi + (ti+1 − ti) (½ di + ½ di′)                                         ▷ Explicit trapezoidal rule at ti+1
9:   return xN                                                                          ▷ Return noise-free sample at tN
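A NumPy transcription of Algorithm 1 may help make the structure concrete (a sketch: the denoiser D, the schedules σ(t), s(t) and their time derivatives, and the sample shape are supplied by the caller; the function names are ours):
```python
import numpy as np

def heun_sampler(D, sigma, sigma_dot, s, s_dot, t_steps, shape, rng):
    """Algorithm 1: Heun's 2nd order method for Eq. 4; t_steps = [t_0, ..., t_N] with t_N = 0."""
    def dxdt(x, t):  # Eq. 4 with the score substituted from Eq. 3 (line 4 of Algorithm 1)
        sg, sc = sigma(t), s(t)
        return (sigma_dot(t) / sg + s_dot(t) / sc) * x - (sigma_dot(t) * sc / sg) * D(x / sc, sg)

    x = rng.normal(size=shape) * sigma(t_steps[0]) * s(t_steps[0])   # x_0 ~ N(0, sigma^2(t_0) s^2(t_0) I)
    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        d_cur = dxdt(x, t_cur)
        x_next = x + (t_next - t_cur) * d_cur                        # Euler step from t_cur to t_next
        if sigma(t_next) != 0:                                       # 2nd order correction (lines 6-8)
            x_next = x + (t_next - t_cur) * 0.5 * (d_cur + dxdt(x_next, t_next))
        x = x_next
    return x
```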
Trajectory curvature and noise schedule. The shape of the ODE solution trajectories is defined
by functions σ(t) and s(t). The choice of these functions offers a way to reduce the truncation errors
discussed above, as their magnitude can be expected to scale proportional to the curvature of dx/dt.
We argue that the best choice for these functions is σ(t) = t and s(t) = 1, which is also the choice
made in DDIM [46]. With this choice, the ODE of Eq. 4 simplifies to dx/dt = (x − D(x; t))/t, and
σ and t become interchangeable.
An immediate consequence is that at any x and t, a single Euler step to t = 0 yields the denoised
image Dθ (x; t). The tangent of the solution trajectory therefore always points towards the denoiser
output. This can be expected to change only slowly with the noise level, which corresponds to largely
linear solution trajectories. The 1D ODE sketch of Figure 3c supports this intuition; the solution
trajectories approach linear at both large and small noise levels, and have substantial curvature in
only a small region in between. The same effect can be seen with real data in Figure 1b, where the
(a) Variance preserving ODE [48] (b) Variance exploding ODE [48] (c) DDIM [46] / Our ODE
Figure 3: A sketch of ODE curvature in 1D where pdata is two Dirac peaks at x = ±1. Horizontal t
axis is chosen to show σ ∈ [0, 25] in each plot, with insets showing σ ∈ [0, 1] near the data. Example
local gradients are shown with black arrows. (a) Variance preserving ODE of Song et al. [48] has
solution trajectories that flatten out to horizontal lines at large σ. Local gradients start pointing
towards data only at small σ. (b) Variance exploding variant has extreme curvature near data and the
solution trajectories are curved everywhere. (c) With the schedule used by DDIM [46] and us, as
σ increases the solution trajectories approach straight lines that point towards the mean of data. As
σ → 0, the trajectories become linear and point towards the data manifold.
change between different denoiser targets occurs in a relatively narrow σ range. With the advocated
schedule, this corresponds to high ODE curvature being limited to this same range.
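The sketch of Figure 3c can be reproduced with a few lines. For pdata consisting of two equally weighted Dirac peaks at x = ±1, the ideal denoiser of Eq. 2 works out to D(x; σ) = tanh(x/σ²), so with σ(t) = t the ODE becomes dx/dt = (x − tanh(x/t²))/t; a simple Euler integration (our own toy code, not from the paper):
```python
import numpy as np

def dxdt(x, t):
    # dx/dt = (x - D(x; t)) / t with D(x; sigma) = tanh(x / sigma^2) for two Diracs at +/-1
    return (x - np.tanh(x / t ** 2)) / t

t_steps = np.linspace(25.0, 1e-3, 200)                 # integrate backward from sigma = 25 towards 0
x = np.random.default_rng(0).normal(0.0, t_steps[0], size=8)
for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
    x = x + (t_next - t_cur) * dxdt(x, t_cur)          # Euler step along the probability flow ODE
print(np.round(x, 3))                                  # samples end up near +1 or -1
```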
The effect of setting σ(t) = t and s(t) = 1 is shown as the red curves in Figure 2. As DDIM already
employs these same choices, the red curve is identical to the green one for ImageNet-64. However,
VP and VE benefit considerably from switching away from their original schedules.
Discussion. The choices that we made in this section to improve deterministic sampling are
summarized in the Sampling part of Table 1. Together, they reduce the NFE needed to reach high-
quality results by a large factor: 7.3× for VP, 300× for VE, and 3.2× for DDIM, corresponding to
the highlighted NFE values in Figure 2. In practice, we can generate 26.3 high-quality CIFAR-10
images per second on a single NVIDIA V100. The consistency of improvements corroborates our
hypothesis that the sampling process is orthogonal to how each model was originally trained. As
further validation, we show results for the adaptive RK45 method [11] using our schedule as the
dashed black curves in Figure 2; the cost of this sophisticated ODE solver outweighs its benefits.
4 Stochastic sampling
Deterministic sampling offers many benefits, e.g., the ability to turn real images into their corre-
sponding latent representations by inverting the ODE. However, it tends to lead to worse output
quality [46, 48] than stochastic sampling that injects fresh noise into the image in each step. Given
that ODEs and SDEs recover the same distributions in theory, what exactly is the role of stochasticity?
Background. The SDEs of Song et al. [48] can be generalized [20, 55] as a sum of the probability
flow ODE of Eq. 1 and a time-varying Langevin diffusion SDE [14] (see Appendix B.5):
dx± = −σ̇(t) σ(t) ∇x log p(x; σ(t)) dt  ±  β(t) σ(t)² ∇x log p(x; σ(t)) dt  +  √(2β(t)) σ(t) dωt,   (6)
(the first term is the probability flow ODE of Eq. 1; the last two terms together form a Langevin diffusion SDE, consisting of a deterministic noise decay term and a noise injection term)
where ωt is the standard Wiener process. dx+ and dx− are now separate SDEs for moving forward
and backward in time, related by the time reversal formula of Anderson [1]. The Langevin term can
further be seen as a combination of a deterministic score-based denoising term and a stochastic noise
injection term, whose net noise level contributions cancel out. As such, β(t) effectively expresses the
relative rate at which existing noise is replaced with new noise. The SDEs of Song et al. [48] are
recovered with the choice β(t) = σ̇(t)/σ(t), whereby the score vanishes from the forward SDE.
This perspective reveals why stochasticity is helpful in practice: The implicit Langevin diffusion
drives the sample towards the desired marginal distribution at a given time, actively correcting for
any errors made in earlier sampling steps. On the other hand, approximating the Langevin term
Algorithm 2 Our stochastic sampler with σ(t) = t and s(t) = 1.
1: procedure StochasticSampler(Dθ(x; σ), ti∈{0,...,N}, γi∈{0,...,N−1}, Snoise)
2:   sample x0 ∼ N(0, t0² I)
3:   for i ∈ {0, . . . , N − 1} do        ▷ γi = min(Schurn/N, √2 − 1) if ti ∈ [Stmin, Stmax], otherwise 0
4:     sample εi ∼ N(0, Snoise² I)
5:     t̂i ← ti + γi ti                            ▷ Select temporarily increased noise level t̂i
6:     x̂i ← xi + √(t̂i² − ti²) εi                  ▷ Add new noise to move from ti to t̂i
7:     di ← (x̂i − Dθ(x̂i; t̂i))/t̂i                 ▷ Evaluate dx/dt at t̂i
8:     xi+1 ← x̂i + (ti+1 − t̂i) di                 ▷ Take Euler step from t̂i to ti+1
9:     if ti+1 ≠ 0 then
10:      di′ ← (xi+1 − Dθ(xi+1; ti+1))/ti+1       ▷ Apply 2nd order correction
11:      xi+1 ← x̂i + (ti+1 − t̂i) (½ di + ½ di′)
12:  return xN
with discrete SDE solver steps introduces error in itself. Previous results [3, 24, 46, 48] suggest that
non-zero β(t) is helpful, but as far as we can tell, the implicit choice for β(t) in Song et al. [48] enjoys
no special properties. Hence, the optimal amount of stochasticity should be determined empirically.
Our stochastic sampler. We propose a stochastic sampler that combines our 2nd order deterministic
ODE integrator with explicit Langevin-like “churn” of adding and removing noise. A pseudocode is
given in Algorithm 2. At each step i, given the sample xi at noise level ti (= σ(ti )), we perform two
sub-steps. First, we add noise to the sample according to a factor γi ≥ 0 to reach a higher noise level
t̂i = ti + γi ti . Second, from the increased-noise sample x̂i , we solve the ODE backward from t̂i to
ti+1 with a single step. This yields a sample xi+1 with noise level ti+1 , and the iteration continues.
We stress that this is not a general-purpose SDE solver, but a sampling procedure tailored for the
specific problem. Its correctness stems from the alternation of two sub-steps that each maintain the
correct distribution (up to truncation error in the ODE step). The predictor-corrector sampler of Song
et al. [48] has a conceptually similar structure to ours.
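For concreteness, a NumPy transcription of Algorithm 2 (a sketch: the denoiser D is passed in, and the S_churn, S_tmin, S_tmax, S_noise defaults below are placeholders — in practice the paper tunes these per model by grid search, Appendix E.2):
```python
import numpy as np

def stochastic_sampler(D, t_steps, shape, rng,
                       S_churn=40.0, S_tmin=0.05, S_tmax=50.0, S_noise=1.0):
    """Algorithm 2 with sigma(t) = t and s(t) = 1; t_steps = [t_0, ..., t_N] with t_N = 0."""
    N = len(t_steps) - 1
    x = rng.normal(size=shape) * t_steps[0]                      # x_0 ~ N(0, t_0^2 I)
    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        gamma = min(S_churn / N, np.sqrt(2) - 1) if S_tmin <= t_cur <= S_tmax else 0.0
        t_hat = t_cur + gamma * t_cur                            # temporarily increased noise level
        eps = S_noise * rng.normal(size=shape)
        x_hat = x + np.sqrt(t_hat ** 2 - t_cur ** 2) * eps       # add new noise to move from t_cur to t_hat
        d_cur = (x_hat - D(x_hat, t_hat)) / t_hat                # dx/dt at t_hat
        x = x_hat + (t_next - t_hat) * d_cur                     # Euler step to t_next
        if t_next != 0:                                          # 2nd order correction
            d_next = (x - D(x, t_next)) / t_next
            x = x_hat + (t_next - t_hat) * 0.5 * (d_cur + d_next)
    return x
```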
To analyze the main difference between our method and Euler–Maruyama, we first note a subtle
discrepancy in the latter when discretizing Eq. 6. One can interpret Euler–Maruyama as first adding
noise and then performing an ODE step, not from the intermediate state after noise injection, but
assuming that x and σ remained at the initial state at the beginning of the iteration step. In our
method, the parameters used to evaluate Dθ on line 7 of Algorithm 2 correspond to the state after
noise injection, whereas an Euler–Maruyama-like method would use xi and ti instead of x̂i and t̂i. In the
limit of ∆t approaching zero there may be no difference between these choices, but the distinction
appears to become significant when pursuing low NFE with large steps.
[Figure 4: FID as a function of NFE for stochastic sampling; the curves compare deterministic sampling, our stochastic sampler with Stmin,tmax = [0, ∞], with Snoise = 1, and with optimal settings, the original samplers, and Jolicoeur-Martineau et al. [24].]
Evaluation. Figure 4 shows that our stochastic sampler outperforms previous samplers [24, 36, 48]
by a significant margin, especially at low step counts. Jolicoeur-Martineau et al. [24] use a standard
higher-order adaptive SDE solver [40] and its performance is a good baseline for such solvers in
general. Our sampler has been tailored to the use case by, e.g., performing noise injection and ODE
step sequentially, and it is not adaptive. It is an open question if adaptive solvers can be a net win
over a well-tuned fixed schedule in sampling diffusion models.
Through sampler improvements alone, we are able to bring the ImageNet-64 model that originally
achieved FID 2.07 [9] to 1.55 that is very close to the state-of-the-art; previously, FID 1.48 has been
reported for cascaded diffusion [17], 1.55 for classifier-free guidance [18], and 1.52 for StyleGAN-
XL [44]. While our results showcase the potential gains achievable through sampler improvements,
they also highlight the main shortcoming of stochasticity: For best results, one must make several
heuristic choices — either implicit or explicit — that depend on the specific model. Indeed, we had
to find the optimal values of {Schurn , Stmin , Stmax , Snoise } on a case-by-case basis using grid search
(Appendix E.2). This raises a general concern that using stochastic sampling as the primary means of
evaluating model improvements may inadvertently end up influencing the design choices related to
model architecture and training.
Table 2: Evaluation of our training improvements. The starting point (config A) is VP & VE using
our deterministic sampler. At the end (configs E, F), VP & VE only differ in the architecture of Fθ.
                            CIFAR-10 [28] 32×32               FFHQ [26] 64×64    AFHQv2 [7] 64×64
                            Conditional      Unconditional    Unconditional      Unconditional
Training configuration      VP     VE        VP     VE        VP     VE          VP     VE
A Baseline [48] (∗ pre-trained) 2.48 3.11 3.01∗ 3.77∗ 3.39 25.95 2.58 18.52
B + Adjust hyperparameters 2.18 2.48 2.51 2.94 3.13 22.53 2.43 23.12
C + Redistribute capacity 2.08 2.52 2.31 2.83 2.78 41.62 2.54 15.04
D + Our preconditioning 2.09 2.64 2.29 3.10 2.94 3.39 2.79 3.81
E + Our loss function 1.88 1.86 2.05 1.99 2.60 2.81 2.29 2.28
F + Non-leaky augmentation 1.79 1.79 1.97 1.98 2.39 2.53 1.96 2.16
NFE 35 35 35 35 79 79 79 79
λ(σ). We can equivalently express this loss with respect to the raw network output Fθ in Eq. 7:
Eσ,y,n [ λ(σ) cout(σ)² ∥ Fθ(cin(σ)·(y + n); cnoise(σ)) − (y − cskip(σ)·(y + n))/cout(σ) ∥²₂ ],   (8)
where λ(σ) cout(σ)² is the effective weight, Fθ(·) is the network output, and (y − cskip(σ)·(y + n))/cout(σ) is the effective training target.
This form reveals the effective training target of Fθ , allowing us to determine suitable choices for the
preconditioning functions from first principles. As detailed in Appendix B.6, we derive our choices
shown in Table 1 by requiring network inputs and training targets to have unit variance (cin , cout ), and
amplifying errors in Fθ as little as possible (cskip ). The formula for cnoise is chosen empirically.
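A small sketch of the resulting preconditioning (the "Ours" column of Table 1) wrapped around an arbitrary raw network Fθ, with σdata = 0.5 as in our configuration; the Python names are ours:
```python
import numpy as np

SIGMA_DATA = 0.5  # sigma_data from Table 1

def c_skip(sigma):  return SIGMA_DATA ** 2 / (sigma ** 2 + SIGMA_DATA ** 2)
def c_out(sigma):   return sigma * SIGMA_DATA / np.sqrt(sigma ** 2 + SIGMA_DATA ** 2)
def c_in(sigma):    return 1.0 / np.sqrt(sigma ** 2 + SIGMA_DATA ** 2)
def c_noise(sigma): return 0.25 * np.log(sigma)

def D_theta(F_theta, x, sigma):
    """Preconditioned denoiser: D = c_skip * x + c_out * F_theta(c_in * x; c_noise)."""
    return c_skip(sigma) * x + c_out(sigma) * F_theta(c_in(sigma) * x, c_noise(sigma))
```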
Table 2 shows FID for a series of training setups, evaluated using our deterministic sampler from
Section 3. We start with the baseline training setup of Song et al. [48], which differs considerably
between the VP and VE cases; we provide separate results for each (config A). To obtain a more
meaningful point of comparison, we re-adjust the basic hyperparameters (config B) and improve the
expressive power of the model (config C) by removing the lowest-resolution layers and doubling the
capacity of the highest-resolution layers instead; see Appendix F.3 for further details. We then replace
the original choices of {cin , cout , cnoise , cskip } with our preconditioning (config D), which keeps the
results largely unchanged — except for VE, which improves considerably at 64×64 resolution. Instead
of improving FID per se, the main benefit of our preconditioning is that it makes the training more
robust, enabling us to turn our focus on redesigning the loss function without adverse effects.
Loss weighting and sampling. Eq. 8 shows that training Fθ as preconditioned in Eq. 7 incurs
an effective per-sample loss weight of λ(σ)cout (σ)2 . To balance the effective loss weights, we set
λ(σ) = 1/cout (σ)2 , which also equalizes the initial training loss over the entire σ range as shown in
Figure 5a (green curve). Finally, we need to select ptrain (σ), i.e., how to choose noise levels during
training. Inspecting the per-σ loss after training (blue and orange curves) reveals that a significant
reduction is possible only at intermediate noise levels; at very low levels, it is both difficult and
irrelevant to discern the vanishingly small noise component, whereas at high levels the training targets
are always dissimilar from the correct answer that approaches dataset average. Therefore, we target
the training efforts to the relevant range using a simple log-normal distribution for ptrain (σ) as detailed
in Table 1 and illustrated in Figure 5a (red curve).
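Putting the pieces together, a sketch of the per-batch training loss with our ptrain and λ (a minimal NumPy version; the function and variable names are ours, and D stands for any preconditioned denoiser):
```python
import numpy as np

P_MEAN, P_STD, SIGMA_DATA = -1.2, 1.2, 0.5   # from Table 1 ("Ours")

def sample_sigma(batch_size, rng):
    return np.exp(rng.normal(P_MEAN, P_STD, size=batch_size))          # ln(sigma) ~ N(P_mean, P_std^2)

def loss_weight(sigma):
    return (sigma ** 2 + SIGMA_DATA ** 2) / (sigma * SIGMA_DATA) ** 2  # lambda(sigma) = 1 / c_out(sigma)^2

def training_loss(D, y, rng):
    """E[ lambda(sigma) * ||D(y + n; sigma) - y||^2 ] for one batch of images y with shape (B, C, H, W)."""
    sigma = sample_sigma(y.shape[0], rng).reshape(-1, 1, 1, 1)          # one noise level per image
    n = rng.normal(size=y.shape) * sigma                                # n ~ N(0, sigma^2 I)
    return np.mean(loss_weight(sigma) * (D(y + n, sigma) - y) ** 2)
```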
Table 2 shows that our proposed ptrain and λ (config E) lead to a dramatic improvement in FID
in all cases when used in conjunction with our preconditioning (config D). In concurrent work,
Choi et al. [6] propose a similar scheme to prioritize noise levels that are most relevant w.r.t. forming
the perceptually recognizable content of the image. However, they only consider the choice of λ in
isolation, which results in a smaller overall improvement.
Augmentation regularization. To prevent potential overfitting that often plagues diffusion models
with smaller datasets, we borrow an augmentation pipeline from the GAN literature [25]. The pipeline
consists of various geometric transformations (see Appendix F.2) that we apply to a training image
prior to adding noise. To prevent the augmentations from leaking to the generated images, we provide
the augmentation parameters as a conditioning input to Fθ; during inference we set them to zero
to guarantee that only non-augmented images are generated. Table 2 shows that data augmentation
provides a consistent improvement (config F) that yields new state-of-the-art FIDs of 1.79 and 1.97
for conditional and unconditional CIFAR-10, beating the previous records of 1.85 [44] and 2.10 [50].
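A minimal sketch of the non-leaky conditioning mechanism (the transform set and the way the parameters enter Fθ are simplified assumptions here; Appendix F.2 lists the actual pipeline):
```python
import numpy as np

def augment(image, rng):
    """Apply an example geometric augmentation and return its parameters for conditioning."""
    flip = float(rng.random() < 0.5)
    if flip:
        image = image[:, :, ::-1].copy()       # horizontal flip of a (C, H, W) image
    return image, np.array([flip])             # the parameter vector is fed to F_theta as conditioning

# Training: image, aug_params = augment(clean_image, rng); noise is added afterwards and
# F_theta is conditioned on aug_params. Sampling: pass aug_params = np.zeros(1) so that
# only non-augmented images are generated.
```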
(a) Loss & noise distribution (b) Stochasticity on CIFAR-10 (c) Stochasticity on ImageNet-64
Figure 5: (a) Observed initial (green) and final loss per noise level, representative of the 32×32
(blue) and 64×64 (orange) models considered in this paper. The shaded regions represent the standard
deviation over 10k random samples. Our proposed training sample density is shown by the dashed
red curve. (b) Effect of Schurn on unconditional CIFAR-10 with 256 steps (NFE = 511). For the
original training setup of Song et al. [48], stochastic sampling is highly beneficial (blue, green), while
deterministic sampling (Schurn = 0) leads to relatively poor FID. For our training setup, the situation
is reversed (orange, red); stochastic sampling is not only unnecessary but harmful. (c) Effect of Schurn
on class-conditional ImageNet-64 with 256 steps (NFE = 511). In this more challenging scenario,
stochastic sampling turns out to be useful again. Our training setup improves the results for both
deterministic and stochastic sampling.
6 Conclusions
Our approach of putting diffusion models into a common framework exposes a modular design. This
allows a targeted investigation of individual components, potentially helping to better cover the viable
design space. In our tests this let us simply replace the samplers in various earlier models, drastically
improving the results. For example, in ImageNet-64 our sampler turned an average model (FID 2.07)
into a challenger (1.55) for the previous SOTA model (1.48) [17], and with training improvements
achieved SOTA FID of 1.36. We also obtained new state-of-the-art results on CIFAR-10 while using
only 35 model evaluations, deterministic sampling, and a small network. The current high-resolution
diffusion models rely either on separate super-resolution steps [17, 35, 39], subspace projection [23],
very large networks [9, 48], or hybrid approaches [38, 41, 50] — we believe that our contributions are
orthogonal to these extensions. That said, many of our parameter values may need to be re-adjusted
for higher resolution datasets. Furthermore, we feel that the precise interaction between stochastic
sampling and the training objective remains an interesting question for future work.
Societal impact. Our advances in sample quality can potentially amplify negative societal effects
when used in a large-scale system like DALL·E 2, including types of disinformation or emphasizing
stereotypes and harmful biases [33]. The training and sampling of diffusion models require a lot of
electricity; our project consumed ∼250MWh on an in-house cluster of NVIDIA V100s.
Acknowledgments. We thank Jaakko Lehtinen, Ming-Yu Liu, Tuomas Kynkäänniemi, Axel Sauer,
Arash Vahdat, and Janne Hellsten for discussions and comments, and Tero Kuosmanen, Samuel
Klenberg, and Janne Hellsten for maintaining our compute infrastructure.
References
[1] B. D. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications,
12(3):313–326, 1982.
[2] U. M. Ascher and L. R. Petzold. Computer Methods for Ordinary Differential Equations and Differential-
Algebraic Equations. Society for Industrial and Applied Mathematics, 1998.
[3] F. Bao, C. Li, J. Zhu, and B. Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in
diffusion probabilistic models. In Proc. ICLR, 2022.
[4] D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko. Label-efficient semantic segmenta-
tion with diffusion models. In Proc. ICLR, 2022.
[5] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, USA, 1995.
[6] J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon. Perception prioritized training of diffusion models.
In Proc. CVPR, 2022.
[7] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proc.
CVPR, 2020.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image
database. In Proc. CVPR, 2009.
[9] P. Dhariwal and A. Q. Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS, 2021.
[10] T. Dockhorn, A. Vahdat, and K. Kreis. Score-based generative modeling with critically-damped Langevin
diffusion. In Proc. ICLR, 2022.
[11] J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of computational
and applied mathematics, 6(1):19–26, 1980.
[12] J. B. J. Fourier, G. Darboux, et al. Théorie analytique de la chaleur, volume 504. Didot Paris, 1822.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial networks. In Proc. NIPS, 2014.
[14] U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal
Statistical Society: Series B (Methodological), 56(4):549–581, 1994.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale
update rule converge to a local Nash equilibrium. In Proc. NIPS, 2017.
[16] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020.
[17] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high
fidelity image generation. Journal of Machine Learning Research, 23, 2022.
[18] J. Ho and T. Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2021.
[19] J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In
Proc. ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022.
[20] C.-W. Huang, J. H. Lim, and A. C. Courville. A variational perspective on diffusion-based generative
models and score matching. In Proc. NeurIPS, 2021.
[21] L. Huang, J. Qin, Y. Zhou, F. Zhu, L. Liu, and L. Shao. Normalization techniques in training DNNs:
Methodology, analysis and application. CoRR, abs/2009.12836, 2020.
[22] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine
Learning Research, 6(24):695–709, 2005.
[23] B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. In Proc. ECCV,
2022.
[24] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas. Gotta go fast when
generating data with score-based models. CoRR, abs/2105.14080, 2021.
[25] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial
networks with limited data. In Proc. NeurIPS, 2020.
[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks.
In Proc. CVPR, 2018.
[27] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio
synthesis. In Proc. ICLR, 2021.
[28] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of
Toronto, 2009.
[29] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise:
Learning image restoration without clean data. In Proc. ICML, 2018.
[30] L. Liu, Y. Ren, Z. Lin, and Z. Zhao. Pseudo numerical methods for diffusion models on manifolds. In
Proc. ICLR, 2022.
[31] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: A fast ODE solver for diffusion
probabilistic model sampling in around 10 steps. In Proc. NeurIPS, 2022.
[32] E. Luhman and T. Luhman. Knowledge distillation in iterative generative models for improved sampling
speed. CoRR, abs/2101.02388, 2021.
[33] P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 preview – risks and limitations.
OpenAI, 2022.
[34] E. Nachmani and S. Dovrat. Zero-shot translation using diffusion models. CoRR, abs/2111.01471, 2021.
[35] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE:
Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML,
2022.
[36] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In Proc. ICML, volume
139, pages 8162–8171, 2021.
[37] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov. Grad-TTS: A diffusion probabilistic model
for text-to-speech. In Proc. ICML, volume 139, pages 8599–8608, 2021.
[38] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn. Diffusion autoencoders: Toward a
meaningful and decodable representation. In Proc. CVPR, 2022.
[39] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation
with CLIP latents. Technical report, OpenAI, 2022.
[40] A. J. Roberts. Modify the improved Euler scheme to integrate stochastic differential equations. CoRR,
abs/1210.0933, 2012.
[41] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with
latent diffusion models. In Proc. CVPR, 2022.
[42] C. Saharia, W. Chan, H. Chang, C. A. Lee, J. Ho, T. Salimans, D. J. Fleet, and M. Norouzi. Palette:
Image-to-image diffusion models. In Proc. SIGGRAPH, 2022.
[43] T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In Proc. ICLR, 2022.
[44] A. Sauer, K. Schwarz, and A. Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In Proc.
SIGGRAPH, 2022.
[45] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In Proc. ICML, pages 2256–2265, 2015.
[46] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021.
[47] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proc.
NeurIPS, 2019.
[48] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative
modeling through stochastic differential equations. In Proc. ICLR, 2021.
[49] E. Süli and D. F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003.
[50] A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. In Proc. NeurIPS,
2021.
[51] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation,
23(7):1661–1674, 2011.
[52] D. Watson, W. Chan, J. Ho, and M. Norouzi. Learning fast samplers for diffusion models by differentiating
through sample quality. In Proc. ICLR, 2022.
[53] D. Watson, J. Ho, M. Norouzi, and W. Chan. Learning to efficiently sample from diffusion probabilistic
models. CoRR, abs/2106.03802, 2021.
[54] J. Wolleb, R. Sandkühler, F. Bieder, P. Valmaggia, and P. C. Cattin. Diffusion models for implicit image
segmentation ensembles. In Medical Imaging with Deep Learning, 2022.
[55] Q. Zhang and Y. Chen. Diffusion normalizing flow. In Proc. NeurIPS, 2021.
[56] Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator. CoRR,
abs/2204.13902, 2022.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Section 6. The main limitations
of the analysis relate to the set of tested datasets and their limited resolution.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 6.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [No] We follow
common application-specific assumptions about the probability distributions, functions
and other components, but do not exhaustively specify them, or consider pathological
corner cases.
(b) Did you include complete proofs of all theoretical results? [No] Our equations and
algorithms build on previously known results, and highlight their practical aspects
through mostly readily verifiable algebraic manipulations (Appendix B). We do not
explicitly present all details of the derivations, and assume that the previous results are
sufficiently rigorously proven in the respective literature.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [Yes] Our implemen-
tation and pre-trained models are available at https://github.com/NVlabs/edm
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] Appendix F.
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] Shaded regions in Figures 4 and 5.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] Section 6.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] Appendix F.5.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]