
Synthetic Data from Diffusion Models Improves ImageNet Classification

Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia*, Mohammad Norouzi*, David J. Fleet
Google Research, Brain Team†
* Work done at Google Research.
† {shekazizi, skornblith, davidfleet}@google.com

arXiv:2304.08466v1 [cs.CV] 17 Apr 2023

Abstract

Deep generative models are becoming increasingly powerful, now generating diverse high fidelity photo-realistic samples given text prompts. Have they reached the point where models of natural images can be used for generative data augmentation, helping to improve challenging discriminative tasks? We show that large-scale text-to-image diffusion models can be fine-tuned to produce class-conditional models with SOTA FID (1.76 at 256×256 resolution) and Inception Score (239 at 256×256). The model also yields a new SOTA in Classification Accuracy Scores (64.96 for 256×256 generative samples, improving to 69.24 for 1024×1024 samples). Augmenting the ImageNet training set with samples from the resulting models yields significant improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines.

Figure 1. Top: Classification Accuracy Scores [45] show that models trained on generated data are approaching those trained on real data. Bottom: Augmenting real training data with generated images from our ImageNet model boosts classification accuracy for ResNet and Transformer models.

1. Introduction

Deep generative models are becoming increasingly mature to the point that they can generate high fidelity photo-realistic samples [12, 25, 59]. Most recently, denoising diffusion probabilistic models (DDPMs) [25, 59] have emerged as a new category of generative techniques that are capable of generating images comparable to generative adversarial networks (GANs) in quality while introducing greater stability during training [12, 26]. This has been shown both for class-conditional generative models on classification datasets, and for open vocabulary text-to-image generation [40, 44, 47, 50].

It is therefore natural to ask whether current models are powerful enough to generate natural image data that are effective for challenging discriminative tasks; i.e., generative data augmentation. Specifically, are diffusion models capable of producing image samples of sufficient quality and diversity to improve performance on well-studied benchmark tasks like ImageNet classification? Such tasks set a high bar, since existing architectures, augmentation strategies, and training recipes have been heavily tuned. A closely related question is, to what extent large-scale text-to-image models can serve as good representation learners or foundation models for downstream tasks? We explore this issue in the context of generative data augmentation, showing that these models can be fine-tuned to produce state-of-the-art class-conditional generative models on ImageNet.

To this end, we demonstrate three key findings. First, we show that an Imagen model fine-tuned on ImageNet training data produces state-of-the-art class-conditional ImageNet models at multiple resolutions, according to their Fréchet Inception Distance (FID) [22] and Inception Score (IS) [52]; e.g., we obtain an FID of 1.76 and IS of 239 on 256×256 image samples. These models outperform existing state-of-the-art models, with or without the use of guidance to improve model sampling. We further establish that data from such fine-tuned class-conditional models also provide new state-of-the-art Classification Accuracy Scores (CAS) [45], computed by training ResNet-50 models on synthetic data and then evaluating them on the real ImageNet validation set (Fig. 1, top). Finally, we show that the performance of models trained on generative data further improves by combining synthetic data with real data, with larger amounts of synthetic data, and with longer training times. These results hold across a host of convolutional and Transformer-based architectures (Fig. 1, bottom).

Figure 2. Example 1024×1024 images from the fine-tuned Imagen model (left) vs. vanilla Imagen (right), for the classes bittern bird, harvestman, Leonberger, Schipperke, and brussels griffon. Fine-tuning and careful choice of guidance weights and other sampling parameters help to improve the alignment of images with class labels and sample diversity. More samples are provided in the Appendix.

2. Related Work

Synthetic Data. The use of synthetic data has been widely explored for generating large amounts of labeled data for vision tasks that require extensive annotation. Examples include tasks like semantic image segmentation [3, 10, 36, 37, 54, 66], optical flow estimation [14, 32, 63], human motion understanding [19, 29, 38, 67], and other dense prediction tasks [3, 70]. Previous work has explored 3D-rendered datasets [18, 73] and simulation environments with physically realistic engines [11, 15, 16]. Unlike methods that use model-based rendering, here we focus on the use of data-driven generative models of natural images, for which GANs have remained the predominant approach to date [6, 17, 36]. Very recent work has also explored the use of publicly available text-to-image diffusion models to generate synthetic data. We discuss this work further below.

Distillation and Transfer. In our work, we use a diffusion model that has been pretrained on a large multimodal dataset and fine-tuned on ImageNet to provide synthetic data for a classification model. This setup has connections to previous work that has directly trained classification models on large-scale datasets and then fine-tuned them on ImageNet [34, 39, 43, 62, 72]. It is also related to knowledge distillation [7, 23] in that we transfer knowledge from the diffusion model to the classifier, although it differs from the traditional distillation setup in that we transfer this knowledge through generated data rather than labels. Our goal in this work is to show the viability of this kind of generative knowledge transfer with modern diffusion models.
Diffusion Model Applications. Diffusion models have been successfully applied to image generation [25–27], speech generation [8, 35], and video generation [24, 58, 68], and have found applications in various image processing areas, including image colorization, super-resolution, inpainting, and semantic editing [49, 51, 61, 69]. One notable application of diffusion models is large-scale text-to-image generation. Several text-to-image models including Stable Diffusion [47], DALL-E 2 [44], Imagen [50], eDiff [1], and GLIDE [40] have produced evocative high-resolution images. However, the use of large-scale diffusion models to support downstream tasks is still in its infancy.

Very recently, large-scale text-to-image models have been used to augment training data. He et al. [21] show that synthetic data generated with GLIDE [40] improves zero-shot and few-shot image classification performance. They further demonstrate that a synthetic dataset generated by fine-tuning GLIDE on CIFAR-100 images can provide a substantial boost to CIFAR-100 classification accuracy. Trabucco et al. [65] explore strategies to augment individual images using a pretrained diffusion model, demonstrating improvements in few-shot settings. Most closely related to our work, two recent papers train ImageNet classifiers on images generated by diffusion models, although they explore only the pretrained Stable Diffusion model and do not fine-tune it [2, 56]. They find that images generated in this fashion do not improve accuracy on the clean ImageNet validation set. Here, we show that the Imagen text-to-image model can be fine-tuned for class-conditional ImageNet, yielding SOTA models.

3. Background

Diffusion. Diffusion models work by gradually destroying the data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process [25, 59]. In broad terms, in a forward process random noise is slowly added to the data as time t increases from 0 to T. A learned reverse process inverts the forward process, gradually refining a sample of noise into an image. To this end, samples at the current time step, x_{t-1}, are drawn from a learned Gaussian distribution N(x_{t-1}; μθ(x_t, t), Σθ(x_t, t)), where the mean of the distribution, μθ(x_t, t), is conditioned on the sample at the previous time step. The variance of the distribution, Σθ(x_t, t), follows a fixed schedule. In conditional diffusion models, the reverse process is associated with a conditioning signal, such as a class label in class-conditional models [26].

Diffusion models have been the subject of many recent papers including important innovations in architectures and training (e.g., [1, 26, 41, 50]). Importantly for what follows, [26] propose cascades of diffusion models at increasing image resolutions for high resolution images. Other work has explored the importance of the generative sampling process, introducing new noise schedules, guidance mechanisms to trade off diversity with image quality, distillation for efficiency, and different parameterizations of the denoising objective (e.g., [28, 31, 50, 53]).
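As a concrete illustration of the reverse process described above, the sketch below implements ancestral sampling for a class-conditional diffusion model. The `denoiser` network, its signature, and the variable names are hypothetical placeholders chosen for clarity; they are not the models or code used in this paper.

```python
import torch

@torch.no_grad()
def reverse_step(denoiser, x_t, t, class_label):
    # One reverse-process step: draw x_{t-1} ~ N(mu_theta(x_t, t), Sigma_theta(x_t, t)).
    # `denoiser` is a hypothetical class-conditional network returning the predicted
    # mean and log-variance of the reverse Gaussian.
    mean, log_var = denoiser(x_t, t, class_label)
    if t == 0:
        return mean                      # final step: return the mean, no added noise
    noise = torch.randn_like(x_t)
    return mean + torch.exp(0.5 * log_var) * noise

@torch.no_grad()
def sample(denoiser, class_label, shape, num_steps=1000):
    # Start from pure Gaussian noise x_T and iterate t = T-1, ..., 0.
    x = torch.randn(shape)
    for t in reversed(range(num_steps)):
        x = reverse_step(denoiser, x, t, class_label)
    return x
```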
Classification Accuracy Score. It is standard practice to use FID [22] and Inception Score [52] to evaluate the visual quality of generative models. Due to their relatively low computation cost, these metrics are widely used as proxies for generative model training and tuning. However, both methods tend to penalize non-GAN models harshly, and Inception Score produces overly optimistic scores for methods with sampling modifications [27, 45]. More importantly, Ravuri and Vinyals [45] argued that these metrics do not show a consistent correlation with metrics that assess performance on downstream tasks like classification accuracy.

An alternative way to evaluate the quality of samples from generative models is to examine the performance of a classifier that is trained on generated data and evaluated on real data [55, 71]. To this end, Ravuri and Vinyals [45] propose the classification accuracy score (CAS), which measures classification performance on the ImageNet validation set for ResNet-50 models [20] trained on generated data. It is an intriguing proxy, as it requires generative models to produce high fidelity images across a broad range of categories, competing directly with models trained on real data.

To date, CAS performance has not been particularly compelling. Models trained exclusively on generated samples underperform those trained on real data. Moreover, performance drops when even relatively small amounts of synthetic data are added to real data during training [45]. This performance drop may arise from issues with the quality of generated samples, their diversity (e.g., due to mode collapse in GAN models), or both. Cascaded diffusion models [26] have recently been shown to outperform BigGAN-deep [6] and VQ-VAE-2 [46] on CAS (and other metrics). That said, there remains a sizeable gap in ImageNet test performance between models trained on real data and those trained on synthetic data [26]. Here, we explore the use of diffusion models in greater depth, with much stronger results, demonstrating the advantage of large-scale models and fine-tuning.

4. Generative Model Training and Sampling

In what follows we address two main questions: whether large-scale text-to-image models can be fine-tuned as class-conditional ImageNet models, and to what extent such models are useful for generative data augmentation. For this purpose, we undertake a series of experiments to construct and evaluate such models, focused primarily on data sampling for use in training ImageNet classifiers. ImageNet classification accuracy is a high bar as a domain for generative data augmentation, as the task is widely studied, and existing architectures and training recipes are very well-honed.
Figure 3. Sampling refinement for the 64×64 base model. Left: Validation set FID vs. guidance weight for different values of log-variance. Center: Pareto frontiers for training set FID and IS at different values of the guidance weight. Right: Dependence of CAS on guidance weight.

The ImageNet ILSVRC 2012 dataset [48] (ImageNet-1K) comprises 1.28 million labeled training images and 50K validation images spanning 1000 categories. We adopt ImageNet-1K as our benchmark to assess the efficacy of the generated data, as this is one of the most widely and thoroughly studied benchmarks for which there is an extensive literature on architectures and training procedures, making it challenging to improve performance. Since the images of the ImageNet-1K dataset vary in dimensions and resolution, with an average image resolution of 469×387 [48], we examine synthetic data generation at different resolutions, including 64×64, 256×256, and 1024×1024.

In contrast to previous work that trains diffusion models directly on ImageNet data (e.g., [12, 26, 28]), here we leverage a large-scale text-to-image diffusion model [50] as a foundation, in part to explore the potential benefits of pretraining on a larger, generic corpus. A key challenge in doing so concerns the alignment of the text-to-image model with ImageNet classes. If, for example, one naively uses short text descriptors like those produced for CLIP by [43] as text prompts for each ImageNet class, the data generated by the Imagen models is easily shown to produce very poor ImageNet classifiers. One problem is that a given text label may be associated with multiple visual concepts in the wild, or visual concepts that differ systematically from ImageNet (see Figure 2). This poor performance may also be a consequence of the high guidance weights used by Imagen, effectively sacrificing generative diversity for sample quality. While there are several ways in which one might re-purpose a text-to-image model as a class-conditional model, e.g., optimizing prompts in order to minimize the distribution shift, here we fix the prompts to be the one or two word class names from [43], and fine-tune the weights and sampling parameters of the diffusion-based generative model.

4.1. Imagen Fine-tuning

We leverage the large-scale Imagen text-to-image model described in detail in [50] as the backbone text-to-image generator that we fine-tune using the ImageNet training set. It includes a pretrained text encoder that maps text to contextualized embeddings, and a cascade of conditional diffusion models that map these embeddings to images of increasing resolution. Imagen uses a frozen T5-XXL encoder as a semantic text encoder to capture the complexity and compositionality of text inputs. The cascade begins with a 2B parameter 64×64 text-to-image base model. Its outputs are then fed to a 600M parameter super-resolution model to upsample from 64×64 to 256×256, followed by a 400M parameter model to upsample from 256×256 to 1024×1024. The base 64×64 model is conditioned on text embeddings via a pooled embedding vector added to the diffusion time-step embedding, like previous class-conditional diffusion models [26]. All three stages of the diffusion cascade include text cross-attention layers [50].

Given the relative paucity of high resolution images in ImageNet, we fine-tune only the 64×64 base model and the 64×64→256×256 super-resolution model on the ImageNet-1K train split, keeping the final super-resolution module and text encoder unchanged. The 64×64 base model is fine-tuned for 210K steps and the 64×64→256×256 super-resolution model is fine-tuned for 490K steps, on 256 TPU-v4 chips with a batch size of 2048. As suggested in the original Imagen training process, Adafactor [57] is used to fine-tune the base 64×64 model because it has a smaller memory footprint compared to Adam [33]. For the 256×256 super-resolution model, we used Adam for better sample quality. Throughout fine-tuning experiments, we select models based on FID score calculated over 10K samples from the default Imagen sampler and the ImageNet-1K validation set.
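The fine-tuning setup above can be summarized in a small configuration sketch. The field names and structure below are illustrative only; they are not the Imagen codebase or training configuration format.

```python
# Illustrative summary of the cascade fine-tuning setup described in Sec. 4.1.
CASCADE_FINETUNE_CONFIG = [
    {"stage": "base_64x64",          "params": "2B",   "fine_tuned": True,
     "optimizer": "Adafactor", "steps": 210_000},
    {"stage": "sr_64_to_256",        "params": "600M", "fine_tuned": True,
     "optimizer": "Adam",      "steps": 490_000},
    {"stage": "sr_256_to_1024",      "params": "400M", "fine_tuned": False},  # kept frozen
    {"stage": "t5_xxl_text_encoder",                   "fine_tuned": False},  # kept frozen
]

BATCH_SIZE = 2048   # on 256 TPU-v4 chips
# Model selection: FID over 10K samples vs. the ImageNet-1K validation set.
```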
4.2. Sampling Parameters

The quality, diversity, and speed of text-conditioned diffusion model sampling are strongly affected by multiple factors, including the number of diffusion steps, noise conditioning augmentation [50], guidance weights for classifier-free guidance [27, 40], and the log-variance mixing coefficient used for prediction (Eq. 15 in [41]), described in further detail in Appendix A.1. We conduct a thorough analysis of the dependence of FID, IS and classification accuracy scores (CAS) on these parameters in order to select good sampling parameters for the downstream classification task.
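To make the role of the guidance weight concrete, the sketch below shows the standard classifier-free guidance rule of [27]: the noise prediction is evaluated with and without the conditioning signal and the two are combined with weight w, where w = 1 recovers the conditional prediction and larger w trades diversity for fidelity. The `eps_model` name and signature are assumptions for illustration, not the Imagen sampler.

```python
def guided_eps(eps_model, x_t, t, class_emb, guidance_weight):
    # Classifier-free guidance [27]: combine conditional and unconditional
    # noise predictions. guidance_weight = 1.0 recovers the conditional model;
    # larger weights sharpen samples at the cost of diversity.
    eps_cond = eps_model(x_t, t, class_emb)   # conditioned on the class embedding
    eps_uncond = eps_model(x_t, t, None)      # null / dropped conditioning
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```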
The sampling parameters for the 64×64 base model establish the overall quality and diversity of image samples. We first sweep over guidance weight, log-variance, and number of sampling steps to identify good hyperparameters based on FID-50K (vs. the ImageNet validation set). Using the DDPM sampler [25] for the base model, we sweep over guidance values of [1.0, 1.25, 1.5, 1.75, 2.0, 5.0] and log-variance values of [0.0, 0.2, 0.3, 0.4, 1.0], and denoise for 128, 500, or 1000 steps. The results of this sweep, summarized in Figure 3, suggest that optimal FID is obtained with a log-variance of 0 and 1000 denoising steps. Given these parameter choices we then complete a more compute intensive sweep, sampling 1.2M images from the fine-tuned base model for different values of the guidance weight. We measure FID, IS and CAS for these samples on the validation set in order to select the guidance weight for the model. Figure 3 shows the Pareto frontiers for FID vs. IS across different guidance weights, as well as the dependence of CAS on guidance weight, suggesting that optimal FID and CAS are obtained at a guidance weight of 1.25.

Given 64×64 samples obtained with the optimal hyperparameters, we then analyze the impact of guidance weight, noise augmentation, and log-variance to select sampling parameters for the super-resolution models. The noise augmentation value specifies the level of noise augmentation applied to the input to super-resolution stages in the Imagen cascade to regulate sample diversity (and improve robustness during model training). Here, we sweep over guidance values of [1.0, 2.0, 5.0, 10.0, 30.0], noise conditioning augmentation values of [0.0, 0.1, 0.2, 0.3, 0.4], and log-variance mixing coefficients of [0.1, 0.3], and denoise for 128, 500, or 1000 steps. Figure 4 shows Pareto curves of FID vs. CAS for the 64×64→256×256 super-resolution module across different noise conditioning augmentation values using a guidance weight of 1.0. These curves demonstrate the combined impact of the log-variance mixing coefficient and conditioning noise augmentation in achieving an optimal balance between FID and CAS.

Figure 4. Training set FID vs. CAS Pareto curves under varying noise conditions when the guidance weight is set to 1.0 for resolution 256×256. These curves depict the joint influence of the log-variance mixing coefficient [41] and noise conditioning augmentation [26] on FID and CAS.

Overall, the results suggest that FID and CAS are highly correlated, with smaller guidance weights leading to better CAS but negatively affecting Inception Score. We observe that using noise augmentation of 0 yields the lowest FID score for all values of guidance weights for the super-resolution models. Nevertheless, it is worth noting that while larger amounts of noise augmentation tend to increase FID, they also produce more diverse samples, as also observed by [50]. Results of these studies are available in the Appendix.

Based on these sweeps, taking FID and CAS into account, we selected a guidance weight of 1.25 when sampling from the base model, and 1.0 for other resolutions. We use DDPM sampler [25] log-variance mixing coefficients of 0.0 and 0.1 for 64×64 samples and 256×256 samples respectively, with 1000 denoising steps. At resolution 1024×1024 we use a DDIM sampler [60] with 32 steps, as in [50]. We do not use noise conditioning augmentation for sampling.
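The sweeps above amount to a grid search over sampling hyper-parameters scored by FID (and, for the larger sweeps, CAS). A minimal sketch of that selection loop is shown below; `generate_samples` and `compute_fid` are hypothetical helpers standing in for the actual sampling and evaluation pipeline.

```python
import itertools

GUIDANCE_WEIGHTS = [1.0, 1.25, 1.5, 1.75, 2.0, 5.0]
LOGVAR_COEFFS = [0.0, 0.2, 0.3, 0.4, 1.0]
DENOISING_STEPS = [128, 500, 1000]

def select_sampling_params(generate_samples, compute_fid, reference_images):
    # Grid search over sampling hyper-parameters for the 64x64 base model,
    # keeping the configuration with the lowest FID against a reference set.
    best = None
    for w, logvar, steps in itertools.product(
            GUIDANCE_WEIGHTS, LOGVAR_COEFFS, DENOISING_STEPS):
        samples = generate_samples(guidance_weight=w,
                                   logvar_coeff=logvar,
                                   num_steps=steps)
        fid = compute_fid(samples, reference_images)
        if best is None or fid < best[0]:
            best = (fid, {"guidance_weight": w,
                          "logvar_coeff": logvar,
                          "num_steps": steps})
    return best
```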
4.3. Generation Protocol

We use the fine-tuned Imagen model with the optimized sampling hyper-parameters to generate synthetic data resembling the training split of the ImageNet dataset. This means that we aim to produce the same quantity of images for each class as found in the real ImageNet dataset, keeping the same class balance as the original dataset. We then constructed large-scale training datasets ranging from 1.2M to 12M images, i.e., between 1× and 10× the size of the original ImageNet training set.
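A minimal sketch of this class-balanced generation protocol is given below, assuming a hypothetical `sample_images(class_id, n)` helper that wraps the fine-tuned cascade; the real pipeline is sharded across accelerators, but the balancing logic is the same.

```python
def generate_synthetic_imagenet(sample_images, images_per_class, scale=1):
    # Mirror the class balance of the real ImageNet-1K train split:
    # for each of the 1000 classes, draw `scale` times as many synthetic
    # images as the class has real training images.
    dataset = []
    for class_id, n_real in images_per_class.items():   # e.g. {0: 1300, 1: 1300, ...}
        images = sample_images(class_id, n_real * scale)
        dataset.extend((img, class_id) for img in images)
    return dataset
```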
5. Results

5.1. Sample Quality: FID and IS

Despite the shortcomings described in Sec. 3, FID [22] and Inception Score [52] remain standard metrics for evaluating generative models. Table 1 reports FID and IS for our approach and existing class-conditional and guidance-based approaches. Our fine-tuned model outperforms all the existing methods, including state-of-the-art methods that use U-Nets [26] and larger U-ViT models trained solely on ImageNet data [28]. This suggests that large-scale pretraining followed by fine-tuning on domain-specific target data is an effective strategy to achieve better visual quality with diffusion models, as measured by FID and IS. Figure 2 shows image samples from the fine-tuned model (see Appendix for more). Note that our state-of-the-art FID and IS on ImageNet are obtained without any design changes, i.e., by simply adapting an off-the-shelf, diffusion-based text-to-image model to new data through fine-tuning. This is a promising result, indicating that in a resource-limited setting one can improve the performance of diffusion models by fine-tuning model weights and adjusting sampling parameters.
Model FID train FID validation IS
64x64 resolution
BigGAN-deep (Dhariwal & Nichol, 2021) [12] 4.06 - -
Improved DDPM (Nichol & Dhariwal, 2021) [41] 2.92 - -
ADM (Dhariwal & Nichol, 2021) [12] 2.07 - -
CDM (Ho et al., 2022) [26] 1.48 2.48 67.95 ± 1.97
RIN (Jabri et al., 2022) [30] 1.23 - 66.5
RIN + noise schedule (Chen, 2023) [9] 2.04 - 55.8
Ours (Fine-tuned Imagen) 1.21 2.51 85.77 ± 0.06
256x256 resolution
BigGAN-deep (Brock et al., 2019) [6] 6.9 - 171.4 ± 2.00
VQ-VAE-2 (Razavi et al., 2019) [46] 31.11 - -
SR3 (Saharia et al., 2021) [49] 11.30 - -
LDM-4 (Rombach et al., 2022) [47] 10.56 - 103.49
DiT-XL/2 (Peebles & Xie, 2022) [42] 9.62 - 121.5
ADM (Dhariwal & Nichol, 2021) [12] 10.94 - 100.98
ADM+upsampling (Dhariwal & Nichol, 2021) [12] 7.49 - 127.49
CDM (Ho et al., 2022) [26] 4.88 3.76 158.71 ± 2.26
RIN (Jabri et al., 2022) [30] 4.51 4.51 161.0
RIN + noise schedule (Chen, 2023) [9] 3.52 - 186.2
Simple Diffusion (U-Net) (Hoogeboom et al., 2023) [28] 3.76 2.88 171.6 ± 3.07
Simple Diffusion (U-ViT L) (Hoogeboom et al., 2023) [28] 2.77 3.23 211.8 ± 2.93
Ours (Fine-tuned Imagen) 1.76 2.81 239.18 ± 1.14

Table 1. Comparison of sample quality of synthetic ImageNet datasets measured by FID and Inception Score (IS) between our fine-tuned
Imagen model and generative models in the literature. We achieve SOTA FID and IS on ImageNet generation among other existing models,
including class-conditional and guidance-based sampling without any design changes.

5.2. Classification Accuracy Score

As noted above, classification accuracy score (CAS) [45] is a better proxy than FID and IS for the performance of downstream training on generated data. CAS measures ImageNet classification accuracy on the real test data for a model trained solely on synthetic samples. In keeping with the CAS protocol [45], we train a standard ResNet-50 architecture on a single crop from each training image. Models are trained for 90 epochs with a batch size of 1024 using SGD with momentum (see Appendix A.4 for details). Regardless of the resolution of the generated data, for CAS training and evaluation, we resize images to 256×256 (or, for real images, to 256 pixels on the shorter side) and then take a 224×224 pixel center crop.
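The sketch below spells out this preprocessing with torchvision transforms. It is our reading of the protocol above, not released evaluation code.

```python
from torchvision import transforms

# Preprocessing for CAS training and evaluation: generated images are resized
# to 256x256 (real images to 256 px on the shorter side), then a single
# 224x224 center crop is taken.
synthetic_transform = transforms.Compose([
    transforms.Resize((256, 256)),   # generated data: resize both sides to 256
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

real_transform = transforms.Compose([
    transforms.Resize(256),          # real data: shorter side to 256
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```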
Table 2 reports CAS for samples from our fine-tuned models at resolutions 256×256 and 1024×1024. CAS for real data and for other models are taken from [45] and [26]. The results indicate that our fine-tuned class-conditional models outperform the previous methods in the literature at 256×256 resolution by a good margin, for both Top-1 and Top-5 accuracy. Interestingly, results are markedly better for 1024×1024 samples, even though these samples are down-sampled to 256×256 during classifier training. As reported in Table 2, we achieve the SOTA Top-1 classification accuracy score of 69.24% at resolution 1024×1024. This greatly narrows the gap with the ResNet-50 model trained on real data.

Figure 5 shows the accuracy of models trained on generative data (red) compared to a model trained on real data (blue) for each of the 1000 ImageNet classes (cf. Fig. 2 in [45]). From Figure 5 (left) one can see that the ResNet-50 trained on CDM samples is weaker than the model trained on real data, as most red points fall below the blue points. For our fine-tuned Imagen models (Figure 5 middle and right), however, there are more classes for which the models trained on generated data outperform the model trained on real data. This is particularly clear at 1024×1024.
Figure 5. Class-wise classification accuracy comparison of models trained on real data (blue) and generated data (red). Left: The 256×256 CDM model [26]. Middle and right: Our fine-tuned Imagen model at 256×256 and 1024×1024.

Model Top-1 Accuracy (%) Top-5 Accuracy (%)

Real 73.09 91.47
BigGAN-deep (Brock et al., 2019) [6] 42.65 65.92
VQ-VAE-2 (Razavi et al., 2019) [46] 54.83 77.59
CDM (Ho et al., 2022) [26] 63.02 84.06
Ours (256×256 resolution) 64.96 85.66
Ours (1024×1024 resolution) 69.24 88.10

Table 2. Classification Accuracy Scores (CAS) for 256×256 and 1024×1024 generated samples. CAS for real data and other models are obtained from [45] and [26]. Our results indicate that the fine-tuned generative diffusion model outperforms the previous methods by a substantial margin.

5.3. Classification Accuracy with Different Models

To further evaluate the discriminative power of the synthetic data, compared to the real ImageNet-1K data, we analyze the classification accuracy of models with different architectures, input resolutions, and model capacities. We consider multiple ResNet-based and Vision Transformer (ViT)-based [13] classifiers including ResNet-50 [20], ResNet-RS-50, ResNet-RS-152x2, ResNet-RS-350x2 [4], ViT-S/16 [5], and DeiT-B [64]. The models trained on real, synthetic, and the combination of real and synthetic data are all trained in the same way, consistent with the training recipes specified by the authors of these models on ImageNet-1K, and our results on real data agree with the published results. The Appendix has more details on model training.

Table 3 reports the Top-1 validation accuracy of multiple ConvNet and Transformer models when trained with the 1.2M real ImageNet training images, with 1.2M generated images, and when the generative samples are used to augment the real data. As one might expect, models trained solely on generated samples perform worse than models trained on real data. Nevertheless, augmenting real data with synthetic images from the diffusion model yields a substantial boost in performance across all classifiers tested.

5.4. Merging Real and Synthetic Data at Scale

We next consider how the performance of a ResNet-50 classifier depends on the amount of generated data that is used to augment the real data. Here we follow a conventional training recipe and train with random crops for 130 epochs, resulting in a higher ResNet-50 accuracy here than in the CAS results in Table 2. The Appendix provides training details.
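A minimal sketch of one straightforward way to combine real and synthetic data for these experiments is shown below, using a plain PyTorch ConcatDataset; the paper does not publish its input pipeline, so this is an assumption about the mixing rather than the actual implementation.

```python
from torch.utils.data import ConcatDataset, DataLoader

def build_augmented_loader(real_dataset, synthetic_dataset, batch_size=1024):
    # Combine the 1.2M real ImageNet images with generated images and
    # draw mini-batches uniformly from the union.
    merged = ConcatDataset([real_dataset, synthetic_dataset])
    return DataLoader(merged, batch_size=batch_size, shuffle=True,
                      num_workers=8, drop_last=True)
```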
Ravuri and Vinyals [45] (Fig. 5) found that for almost all models tested, mixing generated samples with real data degrades Top-5 classifier accuracy. For BigGAN-deep [6] with low truncation values (sacrificing diversity for sample quality), accuracy increases marginally with small amounts of generated data, but then drops below models trained solely on real data when the amount of generated data approaches the size of the real train set.
Model Input Size Params (M) Real Only Generated Only Real + Generated Performance ∆
ConvNets
ResNet-50 224×224 36 76.39 69.24 78.17 +1.78
ResNet-101 224×224 45 78.15 71.31 79.74 +1.59
ResNet-152 224×224 64 78.59 72.38 80.15 +1.56
ResNet-RS-50 160×160 36 79.10 70.72 79.97 +0.87
ResNet-RS-101 160×160 64 80.11 72.73 80.89 +0.78
ResNet-RS-101 190×190 64 81.29 73.63 81.80 +0.51
ResNet-RS-152 224×224 87 82.81 74.46 83.10 +0.29
Transformers
ViT-S/16 224×224 22 79.89 71.88 81.00 +1.11
DeiT-S 224×224 22 78.97 72.26 80.49 +1.52
DeiT-B 224×224 87 81.79 74.55 82.84 +1.04
DeiT-B 384×384 87 83.16 75.45 83.75 +0.59
DeiT-L 224×224 307 82.22 74.60 83.05 +0.83

Table 3. Comparison of classifier Top-1 Accuracy (%) performance when 1.2M generated images are used for generative data augmentation.
Models trained solely on generated samples perform worse than models trained on real data. Nevertheless, augmenting the real data with
data generated from the fine-tuned diffusion model provides a substantial boost in performance across many different classifiers.

Figure 6. Improved classification accuracy of ResNet-50 with increasing numbers of synthetic images added to real training data at resolution 64×64.

Figure 6 shows that, for 64×64 images, performance continues to improve as the amount of generated data increases up to nine times the amount of real data, to a total dataset size of 12M images. Performance with higher resolution images, however, does not continue to improve with similarly large amounts of generative data augmentation. Table 4 reports performance as the amount of generated data is increased over the same range, up to 9× the amount of real data, at resolutions 256×256 and 1024×1024. The performance boost remains significant with fine-tuned diffusion models for synthetic data up to a factor of 4 or 5 times the size of the real ImageNet training set, a significant improvement over results reported in [45].

Train Set (M) 256×256 1024×1024

1.2 76.39 ± 0.21 76.39 ± 0.21
2.4 77.61 ± 0.08 (+1.22) 78.12 ± 0.05 (+1.73)
3.6 77.16 ± 0.04 (+0.77) 77.48 ± 0.04 (+1.09)
4.8 76.52 ± 0.04 (+0.13) 76.75 ± 0.07 (+0.36)
6.0 76.09 ± 0.08 (-0.30) 76.34 ± 0.13 (-0.05)
7.2 75.81 ± 0.08 (-0.58) 75.87 ± 0.09 (-0.52)
8.4 75.44 ± 0.06 (-0.95) 75.49 ± 0.07 (-0.90)
9.6 75.28 ± 0.10 (-1.11) 74.72 ± 0.20 (-1.67)
10.8 75.11 ± 0.12 (-1.28) 74.14 ± 0.13 (-2.25)
12.0 75.04 ± 0.05 (-1.35) 73.70 ± 0.09 (-2.69)

Table 4. Scaling the training dataset by adding synthetic images, at resolutions 256×256 and 1024×1024. The baseline Top-1 accuracy of the classifier trained on real data is 76.39 ± 0.21. The number in parentheses shows the change obtained over the baseline with the addition of generated data.

6. Conclusion

This paper asks to what extent generative data augmentation is effective with current diffusion models. We do so in the context of ImageNet classification, a challenging domain as it is extensively explored with highly tuned architectures and training recipes. Here we show that large-scale text-to-image diffusion models can be fine-tuned to produce class-conditional models with SOTA FID (1.76 at 256×256 resolution) and Inception Score (239 at 256×256). The resulting generative model also yields a new SOTA in Classification Accuracy Scores (64.96 for 256×256 models, improving to 69.24 for 1024×1024 generated samples). And we have shown that improvements to ImageNet classification accuracy extend to large amounts of generated data, across a range of ResNet and Transformer-based models.

While these results are encouraging, many questions remain. One concerns the boost in CAS at resolution 1024×1024, suggesting that the larger images capture more useful image structure than those at 256×256, even though the 1024×1024 images are downsampled to 256×256 before being center-cropped to 224×224 for input to ResNet-50. Another concerns the sustained gains in classification accuracy with large amounts of synthetic data at 64×64 (Fig. 6); there is less information at low resolutions for training, and hence a greater opportunity for augmentation with synthetic images. At high resolutions (Tab. 4) performance drops for synthetic datasets larger than 1M images, which may indicate bias in the generative model, and the need for more sophisticated training methods with synthetic data. These issues remain topics of on-going research.
Acknowledgments

We thank Jason Baldridge and Ting Chen for their valuable feedback. We also extend thanks to William Chan, Saurabh Saxena, and Lala Li for helpful discussions, feedback, and their support with the Imagen code.

References

[1] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[2] Hritik Bansal and Aditya Grover. Leaving reality to imagination: Robust classification via generated datasets. arXiv preprint arXiv:2302.02503, 2023.
[3] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.
[4] Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting ResNets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34:22614–22627, 2021.
[5] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain ViT baselines for ImageNet-1k. arXiv preprint arXiv:2205.01580, 2022.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. International Conference on Learning Representations, 2019.
[7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006.
[8] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
[9] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
[10] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2019.
[11] Celso M de Melo, Antonio Torralba, Leonidas Guibas, James DiCarlo, Rama Chellappa, and Jessica Hodgins. Next-generation deep learning based on simulators and synthetic data. Trends in Cognitive Sciences, 2021.
[12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[14] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[15] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning, pages 1–16. PMLR, 2017.
[16] C Gan, J Schwartz, S Alter, M Schrimpf, J Traer, J De Freitas, J Kubilius, A Bhandwaldar, N Haber, M Sano, et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. Advances in Neural Information Processing Systems (NeurIPS), 2021.
[17] Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.
[18] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, and Andrea Tagliasacchi. Kubric: A scalable dataset generator. 2022.
[19] Xi Guo, Wei Wu, Dongliang Wang, Jing Su, Haisheng Su, Weihao Gan, Jian Huang, and Qin Yang. Learning video representations of human motion from synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20197–20207, 2022.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574, 2022.
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[23] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[24] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[25] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[26] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(47):1–33, 2022.
[27] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[28] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
[29] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 559–568, 2011.
[30] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022.
[31] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. NeurIPS, 2022.
[32] Yo-whan Kim. How Transferable are Video Representations Based on Synthetic Data? PhD thesis, Massachusetts Institute of Technology, 2022.
[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[34] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 491–507. Springer, 2020.
[35] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
[36] Daiqing Li, Huan Ling, Seung Wook Kim, Karsten Kreis, Sanja Fidler, and Antonio Torralba. BigDatasetGAN: Synthesizing ImageNet with pixel-wise annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21330–21340, 2022.
[37] Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, and Sanja Fidler. Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8300–8311, 2021.
[38] Jianxin Ma, Shuai Bai, and Chang Zhou. Pretrained diffusion models for unified human motion synthesis. arXiv preprint arXiv:2212.02837, 2022.
[39] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[41] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[42] William Peebles, Jun-Yan Zhu, Richard Zhang, Antonio Torralba, Alexei A Efros, and Eli Shechtman. GAN-supervised dense visual alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13470–13481, 2022.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[45] Suman Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. Advances in Neural Information Processing Systems, 32, 2019.
[46] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
[47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[48] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
[49] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[50] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[51] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[52] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
[53] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
[54] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3752–3761, 2018.
[55] Shibani Santurkar, Ludwig Schmidt, and Aleksander Madry. A classification-based study of covariate shift in GAN distributions. In International Conference on Machine Learning, pages 4480–4489. PMLR, 2018.
[56] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning(s) from a synthetic ImageNet clone. arXiv preprint arXiv:2212.08420, 2022.
[57] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
[58] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[59] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[60] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representations, 2021.
[61] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[62] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–852, 2017.
[63] Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, William T Freeman, and Ce Liu. AutoFlow: Learning a better training set for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10093–10102, 2021.
[64] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[65] Brandon Trabucco, Kyle Doherty, Max Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944, 2023.
[66] Nontawat Tritrong, Pitchaporn Rewatbowornwong, and Supasorn Suwajanakorn. Repurposing GANs for one-shot semantic part segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4475–4485, 2021.
[67] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 109–117, 2017.
[68] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
[69] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting. arXiv preprint arXiv:2212.06909, 2022.
[70] Yinghao Xu, Yujun Shen, Jiapeng Zhu, Ceyuan Yang, and Bolei Zhou. Generative hierarchical features from synthesizing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4432–4442, 2021.
[71] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In International Conference on Learning Representations, 2017.
[72] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
[73] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A large photo-realistic dataset for structured 3D modeling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX, pages 519–535. Springer, 2020.
Appendix

A.1. Hyper-parameters for Imagen fine-tuning and sample generation.

The quality, diversity, and speed of text-conditioned diffusion model sampling are strongly affected by multiple hyper-parameters. These include the number of diffusion steps, where larger numbers of diffusion steps are often associated with higher quality images and lower FID. Another hyper-parameter is the amount of noise-conditioning augmentation [50], which adds Gaussian noise to the output of one stage of the Imagen cascade at training time, prior to it being input to the subsequent super-resolution stage. We considered noise levels between 0 and 0.5 (with images in the range [0, 1]), where adding more noise during training degrades more fine-scale structure, thereby forcing the subsequent super-resolution stage to be more robust to variability in the images generated from the previous stage.

During sampling, we use classifier-free guidance [27, 40], but with smaller guidance weights than Imagen, favoring diversity over image fidelity to some degree. With smaller guidance weights, one does not require dynamic thresholding [50] during inference; instead we opt for a static threshold to clip large pixel values at each step of denoising. Ho et al. [27] identify upper and lower bounds on the predictive variance, Σθ(xt, t), used for sampling at each denoising step. Following [41] (Eq. 15) we use a linear (convex) combination of the log upper and lower bounds, the mixing parameter for which is referred to as the logvar parameter. Figures 3 and 4 show the dependence of FID, IS and Classification Accuracy Scores on guidance weight and logvar mixing coefficient for the base model at resolution 64×64 and the 64→256 super-resolution model. These were used to help choose model hyper-parameters for large-scale sample generation.
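The logvar interpolation and the static threshold can be written down directly; the sketch below follows Eq. 15 of [41] as described above, with variable names chosen for clarity rather than taken from any released code.

```python
import torch

def mixed_log_variance(log_beta_tilde_t, log_beta_t, logvar_coeff):
    # Eq. 15 of [41]: interpolate in log space between the lower bound
    # (beta_tilde_t) and upper bound (beta_t) on the reverse-process variance.
    return logvar_coeff * log_beta_t + (1.0 - logvar_coeff) * log_beta_tilde_t

def static_threshold(x):
    # Clip predicted pixel values to [-1, 1] at each denoising step
    # (used here instead of dynamic thresholding [50]).
    return torch.clamp(x, -1.0, 1.0)
```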
Below are further results related to hyperparameter selection and its impact on model metrics.

Figure A.1. Left: CAS vs. IS Pareto curves for train set resolution of 64×64 showing the impact of guidance weights. Right: Train set FID vs. IS Pareto curves for resolution of 64×64 showing the impact of guidance weights.

Figure A.2. Sampling refinement for the 64×64 base model. Left: Validation set FID vs. guidance weights for different values of log-variance. Right: Validation set FID vs. Inception Score (IS) when increasing guidance from 1.0 to 5.0.
Figure A.3. Top-1 and Top-5 classification accuracy score (CAS) vs. train FID Pareto curves (sweeping over guidance weight) showing the impact of conditioning noise augmentation at 256×256 when sampling with different numbers of steps. As indicated by the numbers overlaid on each trend line, the guidance weight decreases from 30 to 1.

Figure A.4. Top-1 and Top-5 classification accuracy score (CAS) vs. train FID Pareto curves (sweeping over guidance weight) showing the impact of conditioning noise augmentation at 256×256 when sampling with different numbers of steps at a fixed noise level. As indicated by the numbers overlaid on each trend line, the guidance weight decreases from 30 to 1. At the highest noise level (0.5), lowering the number of sampling steps and decreasing guidance can lead to better joint FID and CAS values. At the lowest noise level (0.0), this effect is subtle, and increasing the number of sampling steps with lower guidance weight can help improve CAS.
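As a reference for the sweeps above, a minimal sketch of conditioning noise augmentation on the low-resolution input; this assumes one common variance-preserving parameterization, whereas the exact corruption schedule follows [26] and may differ:

```python
import numpy as np

def augment_low_res(low_res, aug_level, rng=None):
    # Corrupt the low-resolution conditioning image with Gaussian noise before
    # feeding it to the super-resolution model; the same aug_level is provided
    # to the model as conditioning. Sketch only: aug_level in [0, 1] acts here
    # as the noise standard deviation of a variance-preserving corruption.
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal(low_res.shape)
    return np.sqrt(1.0 - aug_level ** 2) * low_res + aug_level * eps
```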
Figure A.5. Fine-tuning the SR model helps to jointly improve both classification accuracy and FID relative to vanilla Imagen.

Figure A.6. Sampling refinement for the 1024×1024 super-resolution model. Left: CAS vs. guidance weights under varying noise conditions. Right: CAS vs. Inception score (IS) as the guidance weight increases, under varying noise conditions.

Figure A.7. Training set FID vs. classification top-1 and top-5 accuracy Pareto curves under varying noise conditions when the guidance weight is set to 1.0 for resolution 256×256. These curves depict the joint influence of the log-variance mixing coefficient [41] and noise conditioning augmentation [26] on FID and CAS.

A.2. Class Alignment of Imagen vs. Fine-Tuned Imagen

Additional samples comparing our fine-tuned model with the vanilla Imagen model are provided in Figures A.8, A.9, and A.10. In this comparison we sample our fine-tuned model using two strategies. First, we sample using the vanilla Imagen hyper-parameters, which use a guidance weight of 10 for sampling the base 64×64 model, while the subsequent super-resolution (SR) models are sampled with guidance weights of 20 and 8, respectively. This is called the high guidance strategy in these figures. Second, we use the sampling hyper-parameters proposed in the paper, which include sampling the base model with a guidance weight of 1.25 and the subsequent SR models with a guidance weight of 1.0. This is called the low guidance strategy in these figures; the two configurations are summarized in the sketch below.
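A minimal summary of the two sampling configurations (dictionary keys are illustrative), together with the usual classifier-free guidance combination, assuming the convention in which a guidance weight of 1 recovers plain conditional sampling:

```python
# Guidance weights per cascade stage for the comparisons in Figures A.8-A.10.
HIGH_GUIDANCE = {"base_64": 10.0, "sr_64_to_256": 20.0, "sr_256_to_1024": 8.0}
LOW_GUIDANCE = {"base_64": 1.25, "sr_64_to_256": 1.0, "sr_256_to_1024": 1.0}

def guided_eps(eps_cond, eps_uncond, w):
    # Classifier-free guidance: w = 1 returns the conditional prediction;
    # larger w pushes samples toward the class condition, reducing diversity.
    return eps_uncond + w * (eps_cond - eps_uncond)
```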
Figure A.8. Example 1024×1024 images (classes: bittern bird, cardoon, Brussels griffon) from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with higher guidance weight can improve photorealism, but lessens diversity.
Figure A.9. Example 1024×1024 images (classes: Schipperke, harvestman, Leonberger) from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with higher guidance weight can improve photorealism, but lessens diversity.
Figure A.10. Example 1024×1024 images (classes: dowitcher, kuvasz) from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with higher guidance weight can improve photorealism, but lessens diversity.
A.3. High Resolution Random Samples from the ImageNet Model

Figure A.11. Random samples at 1024×1024 resolution generated by our fine-tuned model. The classes are snail (113), panda (388),
orange (950), badger (362), indigo bunting (14), steam locomotive (820), carved pumpkin (607), lion (291), loggerhead sea turtle (33),
golden retriever (207), tree frog (31), clownfish (393), dowitcher (142), lorikeet (90), school bus (779), macaw (88), marmot (336), green
mamba (64).
A.4. Hyper-parameters and model selection for ImageNet classifiers
This section details all the hyper-parameters used to train our ResNet-based model for CAS calculation, as well as the other ResNet-based, ResNet-RS-based, and Transformer-based models used to report classifier accuracy in Table 3. Table A.1 and Table A.2 summarize the hyper-parameters used to train the ConvNet architectures and the vision transformer architectures, respectively.
For classification accuracy score (CAS) calculation, as discussed earlier, we follow the protocol suggested in [45]. Our CAS ResNet-50 classifier is trained using a single crop. To train the classifier, we employ an SGD momentum optimizer and run it for 90 epochs. The learning rate is scheduled to increase linearly from 0.0 to 0.4 over the first five epochs and then decrease by a factor of 10 at epochs 30, 60, and 80. For the other ResNet-based classifiers we employ more advanced mechanisms, such as a cosine schedule instead of stepwise learning rate decay, a larger batch size, random augmentation, dropout, and label smoothing, to reach competitive performance [62]. It is also important to emphasize that ResNet-RS achieves higher performance than ResNet models through a combination of enhanced scaling strategies, improved training methodologies, and techniques such as the Squeeze-and-Excitation module [4]. We follow the training strategy and hyper-parameters suggested in [4] to train our ResNet-RS-based models.
For the vision transformer architectures we mainly follow the recipe provided in [5] to train a competitive ViT-S/16 model, and [64] to train the DeiT family of models. In all cases we re-implement and train all of our models from scratch, until convergence, on real data only, real + generated data, and generated data only.
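A minimal sketch of the stepwise schedule used for the CAS ResNet-50 classifier (epoch granularity for clarity; in practice the warmup may be applied per step):

```python
def cas_resnet50_lr(epoch, peak_lr=0.4, warmup_epochs=5):
    # Linear warmup from 0.0 to 0.4 over the first five epochs, then the
    # learning rate is divided by 10 at epochs 30, 60, and 80 (90 epochs total).
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    lr = peak_lr
    for milestone in (30, 60, 80):
        if epoch >= milestone:
            lr /= 10.0
    return lr
```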

Model ResNet-50 (CAS) ResNet-50 ResNet-101 ResNet-152 ResNet-RS-50 ResNet-RS-101 ResNet-RS-152


Epochs 90 130 200 200 350 350 350
Batch size 1024 4096 4096 4096 4096 4096 4096
Optimizer Momentum Momentum Momentum Momentum Momentum Momentum Momentum
Learning rate 0.4 1.6 1.6 1.6 1.6 1.6 1.6
Decay method Stepwise Cosine Cosine Cosine Cosine Cosine Cosine
Weight decay 1e-4 1e-4 1e-4 1e-4 4e-5 4e-5 4e-5
Warmup epochs 5 5 5 5 5 5 5
Label smoothing - 0.1 0.1 0.1 0.1 0.1 0.1
Dropout rate - 0.25 0.25 0.25 0.25 0.25 0.25
Rand Augment - 10 15 15 10 15 15

Table A.1. Hyper-parameters used to train ConvNet architectures including ResNet-50 (CAS) [45], ResNet-50, ResNet-101, ResNet-152,
ResNet-RS-50, ResNet-RS-101, and ResNet-RS-152 [4].

Model ViT-S/16 DeiT-S DeiT-B DeiT-L


Epochs 300 300 300 300
Batch size 1024 4096 4096 4096
Optimizer AdamW AdamW AdamW AdamW
Learning rate 0.001 0.004 0.004 0.004
Learning rate decay Cosine Cosine Cosine Cosine
Weight decay 0.0001 - - -
Warmup epochs 10 5 5 5
Label smoothing - 0.1 0.1 0.1
Rand Augment 10 9 9 9
Mixup prob. 0.2 0.8 0.8 0.8
Cutmix prob. - 1.0 1.0 1.0

Table A.2. Hyper-parameters used to train the vision transformer architectures, i.e., ViT-S/16 [5], DeiT-S [64], DeiT-B [64], and DeiT-
L [64].
