Synthetic Data From Diffusion Models Improves ImageNet Classification
Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia* , Mohammad Norouzi*, David J. Fleet
Google Research, Brain Team †
* Work done at Google Research.
† {shekazizi, skornblith, davidfleet}@google.com

Augmenting the ImageNet training set with samples from the resulting models yields significant improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines.

Figure 1. Top: Classification Accuracy Scores [45] show that models trained on generated data are approaching those trained on real data. Bottom: Augmenting real training data with generated images from our ImageNet model boosts classification accuracy for ResNet and Transformer models.

1. Introduction

Deep generative models are becoming increasingly mature, to the point that they can generate high-fidelity photo-realistic samples [12, 25, 59]. Most recently, denoising diffusion probabilistic models (DDPMs) [25, 59] have emerged as a new category of generative techniques that are capable of generating images comparable in quality to generative adversarial networks (GANs) while offering greater training stability [12, 26]. This has been shown both for class-conditional generative models on classification datasets and for open-vocabulary text-to-image generation [40, 44, 47, 50].

It is therefore natural to ask whether current models are powerful enough to generate natural image data that are effective for challenging discriminative tasks, i.e., generative data augmentation. Specifically, are diffusion models capable of producing image samples of sufficient quality and diversity to improve performance on well-studied benchmark tasks like ImageNet classification? Such tasks set a high bar, since existing architectures, augmentation strategies, and training recipes have been heavily tuned. A closely related question is to what extent large-scale text-to-image models can serve as good representation learners or foundation models for downstream tasks. We explore this issue in the context of generative data augmentation, showing that these models can be fine-tuned to produce state-of-the-art class-conditional generative models on ImageNet.

To this end, we demonstrate three key findings. First, we show that an Imagen model fine-tuned on ImageNet training data produces state-of-the-art class-conditional ImageNet models at multiple resolutions, according to their Fréchet Inception Distance (FID) [22] and Inception Score (IS) [52]; e.g., we obtain an FID of 1.76 and IS of 239 on 256×256 image samples. These models outperform existing state-of-the-art models, with or without the use of guidance to improve model sampling.
Figure 2. Example 1024×1024 images from the fine-tuned Imagen model (left) vs. vanilla Imagen (right). Fine-tuning and careful choice of guidance weights and other sampling parameters help to improve the alignment of images with class labels and sample diversity. More samples are provided in the Appendix. Classes shown (top to bottom): bittern bird, harvestman, Leonberger, Schipperke, brussels griffon.
We further establish that data from such fine-tuned class-conditional models also provide new state-of-the-art Classification Accuracy Scores (CAS) [45], computed by training ResNet-50 models on synthetic data and then evaluating them on the real ImageNet validation set (Fig. 1, top). Finally, we show that performance of models trained on generative data further improves by combining synthetic data with real data, with larger amounts of synthetic data, and with longer training times. These results hold across a host of convolutional and Transformer-based architectures (Fig. 1, bottom).

2. Related Work

Synthetic Data. The use of synthetic data has been widely explored for generating large amounts of labeled data for vision tasks that require extensive annotation. Examples include tasks like semantic image segmentation [3, 10, 36, 37, 54, 66], optical flow estimation [14, 32, 63], human motion understanding [19, 29, 38, 67], and other dense prediction tasks [3, 70]. Previous work has explored 3D-rendered datasets [18, 73] and simulation environments with physically realistic engines [11, 15, 16]. Unlike methods that use model-based rendering, here we focus on the use of data-driven generative models of natural images, for which GANs have remained the predominant approach to date [6, 17, 36]. Very recent work has also explored the use of publicly available text-to-image diffusion models to generate synthetic data. We discuss this work further below.

Distillation and Transfer. In our work, we use a diffusion model that has been pretrained on a large multimodal dataset and fine-tuned on ImageNet to provide synthetic data for a classification model. This setup has connections to previous work that has directly trained classification models on large-scale datasets and then fine-tuned them on ImageNet [34, 39, 43, 62, 72]. It is also related to knowledge distillation [7, 23] in that we transfer knowledge from the diffusion model to the classifier, although it differs from the traditional distillation setup in that we transfer this knowledge through generated data rather than labels. Our goal in this work is to show the viability of this kind of generative knowledge transfer with modern diffusion models.
Diffusion Model Applications. Diffusion models have been successfully applied to image generation [25–27], speech generation [8, 35], and video generation [24, 58, 68], and have found applications in various image processing areas, including image colorization, super-resolution, inpainting, and semantic editing [49, 51, 61, 69]. One notable application of diffusion models is large-scale text-to-image generation. Several text-to-image models, including Stable Diffusion [47], DALL-E 2 [44], Imagen [50], eDiff [1], and GLIDE [40], have produced evocative high-resolution images. However, the use of large-scale diffusion models to support downstream tasks is still in its infancy.

Very recently, large-scale text-to-image models have been used to augment training data. He et al. [21] show that synthetic data generated with GLIDE [40] improves zero-shot and few-shot image classification performance. They further demonstrate that a synthetic dataset generated by fine-tuning GLIDE on CIFAR-100 images can provide a substantial boost to CIFAR-100 classification accuracy. Trabucco et al. [65] explore strategies to augment individual images using a pretrained diffusion model, demonstrating improvements in few-shot settings. Most closely related to our work, two recent papers train ImageNet classifiers on images generated by diffusion models, although they explore only the pretrained Stable Diffusion model and do not fine-tune it [2, 56]. They find that images generated in this fashion do not improve accuracy on the clean ImageNet validation set. Here, we show that the Imagen text-to-image model can be fine-tuned for class-conditional ImageNet, yielding SOTA models.

3. Background

Diffusion. Diffusion models work by gradually destroying the data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process [25, 59]. In broad terms, in a forward process random noise is slowly added to the data as time t increases from 0 to T. A learned reverse process inverts the forward process, gradually refining a sample of noise into an image. To this end, samples at the current time step, x_{t−1}, are drawn from a learned Gaussian distribution N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t)), where the mean of the distribution, μ_θ(x_t, t), is conditioned on the sample at the previous time step. The variance of the distribution, Σ_θ(x_t, t), follows a fixed schedule. In conditional diffusion models, the reverse process is associated with a conditioning signal, such as a class label in class-conditional models [26].

Diffusion models have been the subject of many recent papers, including important innovations in architectures and training (e.g., [1, 26, 41, 50]). Importantly for what follows, [26] propose cascades of diffusion models at increasing resolutions to generate high-resolution images. Other work has explored the importance of the generative sampling process, introducing new noise schedules, guidance mechanisms to trade off diversity with image quality, distillation for efficiency, and different parameterizations of the denoising objective (e.g., [28, 31, 50, 53]).

Classification Accuracy Score. It is standard practice to use FID [22] and Inception Score [52] to evaluate the visual quality of generative models. Due to their relatively low computation cost, these metrics are widely used as proxies for generative model training and tuning. However, both methods tend to penalize non-GAN models harshly, and Inception Score produces overly optimistic scores for methods with sampling modifications [27, 45]. More importantly, Ravuri and Vinyals [45] argued that these metrics do not show a consistent correlation with metrics that assess performance on downstream tasks like classification accuracy.

An alternative way to evaluate the quality of samples from generative models is to examine the performance of a classifier that is trained on generated data and evaluated on real data [55, 71]. To this end, Ravuri and Vinyals [45] propose the classification accuracy score (CAS), which measures classification performance on the ImageNet validation set for ResNet-50 models [20] trained on generated data. It is an intriguing proxy, as it requires generative models to produce high fidelity images across a broad range of categories, competing directly with models trained on real data.

To date, CAS performance has not been particularly compelling. Models trained exclusively on generated samples underperform those trained on real data. Moreover, performance drops when even relatively small amounts of synthetic data are added to real data during training [45]. This performance drop may arise from issues with the quality of generated samples, their diversity (e.g., due to mode collapse in GAN models), or both. Cascaded diffusion models [26] have recently been shown to outperform BigGAN-deep [6] and VQ-VAE-2 [46] on CAS (and other metrics). That said, there remains a sizeable gap in ImageNet test performance between models trained on real data and those trained on synthetic data [26]. Here, we explore the use of diffusion models in greater depth, with much stronger results, demonstrating the advantage of large-scale models and fine-tuning.

4. Generative Model Training and Sampling

In what follows we address two main questions: whether large-scale text-to-image models can be fine-tuned as class-conditional ImageNet models, and to what extent such models are useful for generative data augmentation. For this purpose, we undertake a series of experiments to construct and evaluate such models, focused primarily on data sampling for use in training ImageNet classifiers. ImageNet classification accuracy is a high bar as a domain for generative data augmentation, as the task is widely studied and existing architectures and training recipes are very well-honed.
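As a point of reference for the sampling choices studied below, here is a minimal NumPy sketch of the ancestral sampling step described in Section 3, using the epsilon-prediction parameterization of [25]; the denoiser eps_model, the schedule betas, and the random generator rng are hypothetical placeholders, not the Imagen implementation.

```python
import numpy as np

def ddpm_sampling_step(x_t, t, eps_model, betas, rng):
    """One ancestral sampling step x_t -> x_{t-1} for an epsilon-prediction
    DDPM. `eps_model(x, t)` is a hypothetical denoising network; `betas` is
    the fixed noise schedule (beta_1 ... beta_T, indexed from 0 here)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    beta_t, alpha_t, alpha_bar_t = betas[t], alphas[t], alpha_bars[t]

    # Mean of the learned Gaussian p(x_{t-1} | x_t), written in terms of the
    # predicted noise (Ho et al. [25]).
    eps = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

    if t == 0:  # no noise is added at the final step
        return mean
    # Fixed variance from the schedule (sigma_t^2 = beta_t in the simplest case).
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
```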
Figure 3. Sampling refinement for the 64×64 base model. Left: Validation set FID vs. guidance weight for different values of the log-variance coefficient. Center: Pareto frontiers for training set FID and IS at different values of the guidance weight. Right: Dependence of CAS on guidance weight.
The ImageNet ILSVRC 2012 dataset [48] (ImageNet-1K) comprises 1.28 million labeled training images and 50K validation images spanning 1000 categories. We adopt ImageNet-1K as our benchmark to assess the efficacy of the generated data, as this is one of the most widely and thoroughly studied benchmarks, for which there is an extensive literature on architectures and training procedures, making it challenging to improve performance. Since the images of the ImageNet-1K dataset vary in dimensions and resolution, with an average image resolution of 469×387 [48], we examine synthetic data generation at different resolutions, including 64×64, 256×256, and 1024×1024.

In contrast to previous work that trains diffusion models directly on ImageNet data (e.g., [12, 26, 28]), here we leverage a large-scale text-to-image diffusion model [50] as a foundation, in part to explore the potential benefits of pre-training on a larger, generic corpus. A key challenge in doing so concerns the alignment of the text-to-image model with ImageNet classes. If, for example, one naively uses short text descriptors like those produced for CLIP by [43] as text prompts for each ImageNet class, the data generated by the Imagen models is easily shown to produce a very poor ImageNet classifier. One problem is that a given text label may be associated with multiple visual concepts in the wild, or with visual concepts that differ systematically from ImageNet (see Figure 2). This poor performance may also be a consequence of the high guidance weights used by Imagen, which effectively sacrifice generative diversity for sample quality. While there are several ways in which one might re-purpose a text-to-image model as a class-conditional model, e.g., optimizing prompts in order to minimize the distribution shift, here we fix the prompts to be the one- or two-word class names from [43], and fine-tune the weights and sampling parameters of the diffusion-based generative model.

4.1. Imagen Fine-tuning

We leverage the large-scale Imagen text-to-image model described in detail in [50] as the backbone text-to-image generator that we fine-tune using the ImageNet training set. It includes a pretrained text encoder that maps text to contextualized embeddings, and a cascade of conditional diffusion models that map these embeddings to images of increasing resolution. Imagen uses a frozen T5-XXL encoder as a semantic text encoder to capture the complexity and compositionality of text inputs. The cascade begins with a 2B parameter 64×64 text-to-image base model. Its outputs are then fed to a 600M parameter super-resolution model to upsample from 64×64 to 256×256, followed by a 400M parameter model to upsample from 256×256 to 1024×1024. The base 64×64 model is conditioned on text embeddings via a pooled embedding vector added to the diffusion time-step embedding, like previous class-conditional diffusion models [26]. All three stages of the diffusion cascade include text cross-attention layers [50].

Given the relative paucity of high resolution images in ImageNet, we fine-tune only the 64×64 base model and the 64×64→256×256 super-resolution model on the ImageNet-1K train split, keeping the final super-resolution module and the text encoder unchanged. The 64×64 base model is fine-tuned for 210K steps and the 64×64→256×256 super-resolution model is fine-tuned for 490K steps, on 256 TPU-v4 chips with a batch size of 2048. As suggested in the original Imagen training process, Adafactor [57] is used to fine-tune the base 64×64 model because it has a smaller memory footprint than Adam [33]. For the 256×256 super-resolution model, we used Adam for better sample quality. Throughout the fine-tuning experiments, we select models based on the FID score calculated over 10K samples from the default Imagen sampler and the ImageNet-1K validation set.

4.2. Sampling Parameters

The quality, diversity, and speed of text-conditioned diffusion model sampling are strongly affected by multiple factors, including the number of diffusion steps, noise conditioning augmentation [50], guidance weights for classifier-free guidance [27, 40], and the log-variance mixing coefficient used for prediction (Eq. 15 in [41]), described in further detail below.
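To make the roles of these knobs concrete, the sketch below shows, in simplified form, how a classifier-free guidance weight, the log-variance mixing coefficient of [41], and noise conditioning augmentation for a super-resolution stage might enter a sampler; the function names and signatures are illustrative assumptions, not the actual Imagen sampler.

```python
import numpy as np

def guided_eps(eps_model, x_t, t, class_emb, guidance_weight):
    """Classifier-free guidance: extrapolate from the unconditional to the
    conditional noise prediction. `eps_model` and `class_emb` are
    hypothetical; guidance_weight = 1.0 recovers the conditional model,
    while larger values trade diversity for fidelity."""
    eps_uncond = eps_model(x_t, t, None)      # conditioning dropped
    eps_cond = eps_model(x_t, t, class_emb)   # class/text conditional
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

def mixed_log_variance(beta_t, beta_tilde_t, mix_coeff):
    """Log-variance mixing: interpolate, in log space, between the upper
    bound (beta_t) and the lower bound (posterior variance beta_tilde_t)
    on the reverse-process variance, in the spirit of Eq. 15 of [41]."""
    return mix_coeff * np.log(beta_t) + (1.0 - mix_coeff) * np.log(beta_tilde_t)

def noise_condition_augment(x_low_res, aug_level, rng):
    """Noise conditioning augmentation for a super-resolution stage: corrupt
    the low-resolution conditioning image with Gaussian noise (aug_level
    playing the role of 1 - alpha_bar); the same level is also passed to the
    model as conditioning (not shown)."""
    noise = rng.standard_normal(x_low_res.shape)
    return np.sqrt(1.0 - aug_level) * x_low_res + np.sqrt(aug_level) * noise
```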
[Figure: FID curves for the super-resolution module across different noise conditioning augmentation values, using a guidance weight of 1.0. These curves demonstrate the combined impact of the log-variance mixing coefficient and conditioning noise augmentation.]
Table 1. Comparison of the sample quality of synthetic ImageNet datasets, measured by FID and Inception Score (IS), between our fine-tuned Imagen model and generative models in the literature. We achieve SOTA FID and IS for ImageNet generation among existing models, including class-conditional models and those using guidance-based sampling, without any design changes.
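For reference, once Inception activations of real and generated samples are summarized by their means and covariances, FID reduces to the Fréchet distance between two Gaussians; a standard sketch (not code from this paper) follows.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_real, cov_real, mu_gen, cov_gen):
    """FID between Gaussians fitted to Inception activations of real and
    generated images [22]; extracting the activations themselves
    (InceptionV3 pool features) is omitted here."""
    diff = mu_real - mu_gen
    covmean = linalg.sqrtm(cov_real @ cov_gen)   # matrix square root
    if np.iscomplexobj(covmean):                 # drop tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```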
This result indicates that, in a resource-limited setting, one can improve the performance of diffusion models by fine-tuning model weights and adjusting sampling parameters.

5.2. Classification Accuracy Score

As noted above, classification accuracy score (CAS) [45] is a better proxy than FID and IS for the performance of downstream training on generated data. CAS measures ImageNet classification accuracy on the real test data for a model trained solely on synthetic samples. In keeping with the CAS protocol [45], we train a standard ResNet-50 architecture on a single crop from each training image. Models are trained for 90 epochs with a batch size of 1024 using SGD with momentum (see Appendix A.4 for details). Regardless of the resolution of the generated data, for CAS training and evaluation we resize images to 256×256 (or, for real images, to 256 pixels on the shorter side) and then take a 224×224 pixel center crop.

Table 2 reports CAS for samples from our fine-tuned models at resolutions 256×256 and 1024×1024. CAS for real data and for other models are taken from [45] and [26]. The results indicate that our fine-tuned class-conditional models outperform the previous methods in the literature at 256×256 resolution by a good margin, for both Top-1 and Top-5 accuracy. Interestingly, results are markedly better for 1024×1024 samples, even though these samples are down-sampled to 256×256 during classifier training. As reported in Table 2, we achieve the SOTA Top-1 classification accuracy score of 69.24% at resolution 1024×1024. This greatly narrows the gap with the ResNet-50 model trained on real data.

Figure 5 shows the accuracy of models trained on generative data (red) compared to a model trained on real data (blue) for each of the 1000 ImageNet classes (cf. Fig. 2 in [45]). From Figure 5 (left) one can see that the ResNet-50 trained on CDM samples is weaker than the model trained on real data, as most red points fall below the blue points. For our fine-tuned Imagen models (Figure 5, middle and right), however, there are more classes for which the models trained on generated data outperform the model trained on real data. This is particularly clear at 1024×1024.

5.3. Classification Accuracy with Different Models

To further evaluate the discriminative power of the synthetic data, compared to the real ImageNet-1K data, we analyze the classification accuracy of models with different architectures.
Figure 5. Class-wise classification accuracy comparison of models trained on real data (blue) and generated data (red). Left: The 256×256 CDM model [26]. Middle and right: Our fine-tuned Imagen model at 256×256 and 1024×1024.
Table 2. Classification Accuracy Scores (CAS) for 256 × 256 and 1024 × 1024 generated samples. CAS for real data and other models
are obtained from [45] and [26]. Our results indicate that the fine-tuned generative diffusion model outperforms the previous methods by a
substantial margin.
Table 3. Comparison of classifier Top-1 accuracy (%) when 1.2M generated images are used for generative data augmentation. Models trained solely on generated samples perform worse than models trained on real data. Nevertheless, augmenting the real data with data generated from the fine-tuned diffusion model provides a substantial boost in performance across many different classifiers.
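As a rough illustration of the data mixing behind Table 3, the sketch below interleaves real and generated labeled examples into training batches; the dataset objects, the mixing ratio, and the batch size are assumptions for illustration, not the exact pipeline used here.

```python
import random

def mixed_batches(real_examples, generated_examples, real_fraction=0.5,
                  batch_size=1024, seed=0):
    """Yield training batches mixing real and generated (image, label) pairs.
    Both inputs are any indexable collections of labeled images; the 50/50
    ratio and batch size are placeholders, not the paper's exact settings."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = real_examples if rng.random() < real_fraction else generated_examples
            batch.append(pool[rng.randrange(len(pool))])
        yield batch
```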
Figure A.1. Left: CAS vs. IS Pareto curves for the training set at 64×64 resolution, showing the impact of guidance weights. Right: Training set FID vs. IS Pareto curves at 64×64 resolution, showing the impact of guidance weights.
Figure A.2. Sampling refinement for the 64×64 base model. Left: Validation set FID vs. guidance weight for different values of the log-variance coefficient. Right: Validation set FID vs. Inception Score (IS) as guidance increases from 1.0 to 5.0.
Figure A.3. Top-1 and Top-5 classification accuracy score (CAS) vs. train FID Pareto curves (sweeping over guidance weight), showing the impact of conditioning noise augmentation at 256×256 when sampling with different numbers of steps (128, 500, and 1000). As indicated by the numbers overlaid on each trend line, the guidance weight decreases from 30 to 1.
Figure A.4. Top-1 and Top-5 classification accuracy score (CAS) vs. train FID Pareto curves (sweeping over guidance weight), showing the impact of conditioning noise augmentation at 256×256 when sampling with different numbers of steps (128, 500, and 1000) at a fixed noise level. As indicated by the numbers overlaid on each trend line, the guidance weight decreases from 30 to 1. At the highest noise level (0.5), lowering the number of sampling steps and decreasing guidance can lead to better joint FID and CAS values. At the lowest noise level (0.0), this effect is subtle, and increasing the number of sampling steps with a lower guidance weight can help to improve CAS.
Figure A.5. Fine-tuning the SR model helps to jointly improve classification accuracy as well as FID relative to vanilla Imagen.
Figure A.6. Sampling refinement for the 1024×1024 super-resolution model. Left: CAS vs. guidance weight under varying noise conditions. Right: CAS vs. Inception Score (IS) as guidance increases from 1.0 to 5.0 under varying noise conditions.
Figure A.7. Training set FID vs. classification Top-1 and Top-5 accuracy Pareto curves under varying noise conditions when the guidance weight is set to 1.0 for resolution 256×256. These curves depict the joint influence of the log-variance mixing coefficient [41] and noise conditioning augmentation [26] on FID and CAS.
Additional samples comparing our fine-tuned model with the vanilla Imagen model are provided in Figures A.8, A.9, and A.10. In this comparison we sample our fine-tuned model using two strategies. First, we sample using the vanilla Imagen hyper-parameters, which use a guidance weight of 10 when sampling the base 64×64 model, while the subsequent super-resolution (SR) models are sampled with guidance weights of 20 and 8, respectively. This is called the high guidance strategy in these figures. Second, we use the sampling hyper-parameters proposed in this paper, which sample the base model with a guidance weight of 1.25 and the subsequent SR models with a guidance weight of 1.0. This is called the low guidance strategy in these figures; the two configurations are summarized in the sketch below.
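The per-stage guidance weights of the two strategies can be written as the following configuration sketch (the dictionary keys are illustrative names for the three cascade stages, not identifiers from the actual codebase):

```python
# Per-stage guidance weights for the two sampling strategies described above.
HIGH_GUIDANCE = {"base_64": 10.0, "sr_256": 20.0, "sr_1024": 8.0}  # vanilla Imagen defaults
LOW_GUIDANCE = {"base_64": 1.25, "sr_256": 1.0, "sr_1024": 1.0}    # proposed in this work
```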
Figure A.8. Example 1024×1024 images from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with a higher guidance weight can improve photorealism, but lessens diversity. Classes shown: bittern bird, cardoon, brussels griffon.
Figure A.9. Example 1024×1024 images from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with a higher guidance weight can improve photorealism, but lessens diversity. Classes shown: Schipperke, harvestman, Leonberger.
Figure A.10. Example 1024×1024 images from vanilla Imagen (first row) vs. fine-tuned Imagen sampled with Imagen hyper-parameters (high guidance, second row) vs. fine-tuned Imagen sampled with our proposed hyper-parameters (low guidance, third row). Fine-tuning and careful choice of sampling parameters help to improve the alignment of images with class labels, and also improve sample diversity. Sampling with a higher guidance weight can improve photorealism, but lessens diversity. Classes shown: dowitcher, kuvasz.
A.3. High Resolution Random Samples from the ImageNet Model
Figure A.11. Random samples at 1024×1024 resolution generated by our fine-tuned model. The classes are snail (113), panda (388),
orange (950), badger (362), indigo bunting (14), steam locomotive (820), carved pumpkin (607), lion (291), loggerhead sea turtle (33),
golden retriever (207), tree frog (31), clownfish (393), dowitcher (142), lorikeet (90), school bus (779), macaw (88), marmot (336), green
mamba (64).
A.4. Hyper-parameters and model selection for ImageNet classifiers.
This section details all the hyper-parameters used in training our ResNet-based model for CAS calculation, as well as the other ResNet-based, ResNet-RS-based, and Transformer-based models used to report classifier accuracy in Table 3. Table A.1 and Table A.2 summarize the hyper-parameters used to train the ConvNet architectures and vision transformer architectures, respectively.
For classification accuracy score (CAS) calculation, as discussed above, we follow the protocol suggested in [45]. Our CAS ResNet-50 classifier is trained using a single crop. To train the classifier, we employ an SGD momentum optimizer and run it for 90 epochs. The learning rate is scheduled to increase linearly from 0.0 to 0.4 over the first five epochs and then decrease by a factor of 10 at epochs 30, 60, and 80 (sketched below). For the other ResNet-based classifiers we employ more advanced mechanisms, such as a cosine schedule instead of step-wise learning rate decay, a larger batch size, random augmentation, dropout, and label smoothing, to reach competitive performance [62]. It is also important to emphasize that ResNet-RS achieves higher performance than ResNet models through a combination of enhanced scaling strategies, improved training methodologies, and techniques like the Squeeze-and-Excitation module [4]. We follow the training strategy and hyper-parameters suggested in [4] to train our ResNet-RS-based models.
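A minimal sketch of the CAS learning-rate schedule just described follows; it is illustrative rather than the exact training code.

```python
def cas_learning_rate(epoch, peak_lr=0.4, warmup_epochs=5, drops=(30, 60, 80)):
    """Learning rate for the CAS ResNet-50 recipe: linear warmup from 0 to
    peak_lr over `warmup_epochs`, then decay by a factor of 10 at each epoch
    in `drops` (values taken from the description above)."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    lr = peak_lr
    for drop in drops:
        if epoch >= drop:
            lr /= 10.0
    return lr
```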
For the vision transformer architectures, we mainly follow the recipe provided in [5] to train a competitive ViT-S/16 model, and that of [64] to train the DeiT family of models. In all cases, we re-implemented and trained all of our models from scratch until convergence, using real data only, real + generated data, and generated data only.
Table A.1. Hyper-parameters used to train ConvNet architectures including ResNet-50 (CAS) [45], ResNet-50, ResNet-101, ResNet-152,
ResNet-RS-50, ResNet-RS-101, and ResNet-RS-152 [4].
Table A.2. Hyper-parameters used to train the vision transformer architectures, i.e., ViT-S/16 [5], DeiT-S [64], DeiT-B [64], and DeiT-
L [64].