Tiny Models From Tiny Data: Textual and Null-text Inversion for Few-Shot Distillation
Abstract
Few-shot image classification involves classifying images using very few training
examples. Recent vision foundation models show excellent few-shot transfer
abilities, but are large and slow at inference. Using knowledge distillation, the
capabilities of high-performing but slow models can be transferred to tiny, efficient
models. However, common distillation methods require a large set of unlabeled
data, which is not available in the few-shot setting. To overcome this lack of data,
there has been a recent interest in using synthetic data. We expand on this work
by presenting a novel diffusion model inversion technique (TINT) combining the
diversity of textual inversion with the specificity of null-text inversion. Using this
method in a few-shot distillation pipeline leads to state-of-the-art accuracy among
small student models on popular benchmarks, while being significantly faster than
prior work. This allows us to push even tiny models to high accuracy using only a
tiny application-specific dataset, albeit relying on extra data for pre-training.
Popular few-shot benchmarks involve evaluation over a large number of episodes,
which is computationally cumbersome for methods involving synthetic data gener-
ation. Therefore, we also present a theoretical analysis of how the variance of the
accuracy estimator depends on the number of episodes and query examples, and
use these results to lower the computational effort required for method evaluation.
In addition, to further motivate the use of generative models in few-shot distillation,
we demonstrate that our method performs better than training on real data
mined from the dataset used to train the diffusion model.
Source code will be made available at https://github.com/pixwse/tiny2.
1 Introduction
In recent years, deep learning has enabled unprecedented performance in most computer vision
tasks. To reach state-of-the-art, increasingly large foundation models [1, 2, 3, 4, 5] are often used,
trained on increasingly huge datasets [6, 7]. In contrast, many computer vision applications have
the opposite requirements. Representative training data can be scarce due to privacy concerns (e.g.
medical data) or other practical issues around data collection, and lightning-fast inference is desired
both in embedded systems and for cost-efficient cloud-hosted solutions. However, many
applications have no restrictions on using large pretrained models in the training pipelines, as long as
the final model is small and efficient. Our problem statement can thus be formulated as: How can we
maximize the accuracy of tiny classifiers if only a tiny amount of application-specific labeled data is
available (few-shot setting), but there are no restrictions on using common pre-trained models in the
training pipeline?
Large foundation models often exhibit good few-shot transfer abilities when equipped with a small
classifier head [8, 9, 10]. While large foundation models may be required to handle any few-shot
task well, a specific task (containing just a few classes) is likely solvable using a significantly smaller
Figure 1: Overview of the TINT method. Left: From a set of input examples, we optimize all external
quantities (latents and conditional/unconditional embeddings). Right: A new image is generated by
blending a latent with noise and feeding it through the diffusion model.
Figure 2: Examples of images generated by our method (bottom) from randomly selected support
images (top). Left: Class Scoreboard from miniImageNet. Right: Class Caspian Tern from CUB.
model. Using knowledge distillation [11, 12], a small, efficient classifier could be trained to mimic a
large general few-shot model on a specific task. However, while typical distillation methods require a
large set of representative unlabeled data [11, 13, 14], we only assume access to a few application-
specific examples. This leads us to our overall approach: Adapt a generative model to produce a
large set of images representative of the novel classes. Then use this data to distill knowledge from a
well-performing (but large and slow) few-shot classifier to a small model, specialized on a single task.
We base the generator on diffusion models, due to their impressive image generation capabilities
[15, 16, 17, 18]. The core challenge is how to best specialize such models to the novel few-shot
classes. To this end, we propose a novel combination of textual inversion and null-text inversion that
is illustrated in Figure 1, with some examples of generated training images shown in Figure 2.
Common few-shot benchmarks such as miniImageNet are often strictly interpreted in the sense that
only training data from within the dataset itself may be used. In their recent work [8], Hu et al. argue
that maintaining this strict interpretation may lead to an undesirable divergence between the few-shot
learning and semi-supervised learning [19] communities. We adhere to this view, as our goal is to
present a pipeline that is useful for real-world applications without any restrictions on additional data
sources.
Our contributions can be summarized as:
• A novel diffusion inversion method (TINT) combining the diversity of textual inversion with
the specificity of null-text inversion.
• A demonstration that using a generative model in few-shot distillation outperforms using the
original training data (LAION [6]) of the generative model directly.
2 Background
In this section, we briefly review the most important theory and prior work that our pipeline builds
upon. This also serves to introduce our notation.
In few-shot classification [20, 21], the task is to classify a query x(q) into one of N classes identified
by integer labels y ∈ {1, ..., N }, where each class is defined using a small set of labeled support
examples. To simplify the notation, we focus on the case with K support examples per class (N-way,
K-shot), and use x(s)n,k to denote the k'th support example of class n. All theory transfers naturally to
the case where K is different for each class.
Denoising diffusion probabilistic models [22] model the distribution of a random variable x0 by
transforming it into a tractable distribution (noise) over timesteps t ∈ {0, ..., T }. Our generator is
built upon the popular Stable Diffusion (SD) [15] model due to its public availability and use in prior
work on diffusion inversion [23, 24]. SD defines the diffusion process on a latent variable zt of lower
dimensionality than xt . The model is trained to minimize the loss
L = E_{z0,t,ϵ} ∥ϵ − ϵ(zt; t, v)∥², (1)
where the noise estimation ϵ(zt ; t, v) is implemented using a U-net [25], and v is a CLIP [26]
embedding of a text prompt fed as an additional input to this U-net. A VQ-VAE model [27] maps
between x0 and z0 .
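To make the objective concrete, Eq. 1 can be sketched in a few lines of PyTorch. The sketch is illustrative rather than the actual Stable Diffusion training code: it assumes a generic noise-prediction network unet(zt, t, v) and a precomputed tensor alphas_cumprod of cumulative noise-schedule coefficients.

import torch

def diffusion_loss(unet, z0, v, alphas_cumprod):
    # Sample a random timestep and noise for each latent in the batch.
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    eps = torch.randn_like(z0)
    # Forward diffusion: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps.
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    zt = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps
    # Eq. 1: squared error between true and predicted noise, conditioned on v.
    return ((eps - unet(zt, t, v)) ** 2).mean()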
Using Denoising Diffusion Implicit Models (DDIM) [28], the forward and backward processes can be
made deterministic. We write this as zt+1 = ft(zt), zt−1 = gt(zt). We are mostly interested in the mapping between z0 and zT, writing it as zT = F(z0) with F = fT−1 ◦ ... ◦ f0 and z0 = G(zT) with G = g1 ◦ ... ◦ gT. The strength of the conditioning input can be controlled by classifier-free
guidance (CFG) [29]. For each timestep t, the noise is predicted using both a text embedding v and
an unconditional embedding u. The noise prediction is then shifted in the direction produced by v
according to
ϵt = ϵ(zt; t, u) + β (ϵ(zt; t, v) − ϵ(zt; t, u)). (2)
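In code, the guided prediction of Eq. 2 is a single extrapolation from the unconditional towards the conditional prediction; a minimal sketch using the same generic unet interface as above:

def cfg_noise(unet, zt, t, u, v, beta):
    # Classifier-free guidance (Eq. 2): shift the unconditional prediction
    # in the direction suggested by the text embedding v.
    eps_u = unet(zt, t, u)
    eps_v = unet(zt, t, v)
    return eps_u + beta * (eps_v - eps_u)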
When using classifier-free guidance, the DDIM inversion no longer reproduces the original input
unless additional techniques are used. Null-text inversion (NTI) [24] is one such technique. In NTI,
latents zT = F (z0 ) are first computed using DDIM. The original objective (Eq. 1) is then minimized
with the unconditional embeddings u1:T as variables (introducing a separate embedding ut for each
timestep t). The output of the inversion is the final latent zT and adjusted embeddings u1:T .
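A schematic sketch of the NTI optimization loop is given below, assuming a deterministic DDIM update ddim_step(zt, eps, t) -> z_{t-1} and the cfg_noise helper above. Following [24], each u_t is optimized so that the guided backward step reproduces the latent trajectory recorded during the DDIM forward pass; this is a simplified illustration, not the reference implementation.

import torch

def null_text_inversion(unet, ddim_step, z_traj, v, u_init, beta, steps=10, lr=1e-2):
    # z_traj[t] holds the latents from the deterministic forward pass,
    # so z_traj[T] = F(z_0). One embedding u_t is optimized per timestep.
    T = len(z_traj) - 1
    zt, u_per_step = z_traj[T], []
    for t in range(T, 0, -1):
        u = u_init.clone().requires_grad_(True)
        opt = torch.optim.Adam([u], lr=lr)
        for _ in range(steps):
            z_prev = ddim_step(zt, cfg_noise(unet, zt, t, u, v, beta), t)
            loss = ((z_prev - z_traj[t - 1]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        u_per_step.append(u.detach())
        with torch.no_grad():
            zt = ddim_step(zt, cfg_noise(unet, zt, t, u, v, beta), t)
    return z_traj[T], list(reversed(u_per_step))  # final latent z_T and u_{1:T}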
Textual inversion (TI) [23] is a technique for making diffusion models output images that look similar
(but not identical) to a small set of input images. The idea is to adjust the text embeddings of one
or more words in a conditioning text prompt, essentially creating new words representing the input
images. The optimization objective is still the original Eq. 1, but the variables to optimize are now
selected embeddings of individual words (parts of v).
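Textual inversion can similarly be sketched as optimizing a single token embedding under the loss of Eq. 1. The helpers encode_word, encode_prompt and sample_latents below are hypothetical stand-ins for the text encoder and data loading; the actual implementation follows [23] (and, in our case, restricts the sampled timestep range, see Section 3.1).

import torch

def textual_inversion(unet, alphas_cumprod, support_latents, iters=5000, lr=1e-3):
    # One learnable embedding for the pseudo-word #object, initialized
    # from the embedding of the word "object".
    emb = encode_word("object").clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(iters):
        z0 = sample_latents(support_latents)            # VQ-VAE latents of support images
        v = encode_prompt("a photo of an", emb)          # splice #object into the prompt
        loss = diffusion_loss(unet, z0, v, alphas_cumprod)  # Eq. 1; only emb is updated
        opt.zero_grad(); loss.backward(); opt.step()
    return emb.detach()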
Knowledge distillation [30, 11, 12] refers to training a student network to mimic a teacher network
for the purpose of model compression or performance improvement. We base our pipeline on the
LabelHalluc method [31], where distillation is used in few-shot classification, using base classes as
distillation training data, pseudo-labelled by a teacher sharing the same architecture as the student.
Figure 3: Generated images with increasing α ∈ {0.2, 0.4, 0.6, 0.8} (latent space noise), together with the corresponding support examples, showing how the method gradually transitions from a more augmentation-like behavior to full synthetic image generation as α increases.
3 Method
We first describe our new diffusion inversion technique (Section 3.1), and then show how to apply it
in a complete few-shot distillation pipeline (Section 3.2).
3.1 TINT
As distillation data, we would like new examples that are similar to the support examples, but still
provide sufficient variation to make the student generalize to unseen query examples. This could be
done by running null-text inversion (NTI) on the support examples, perturbing the latents and feeding
them back through the diffusion model. However, the generated images would depart significantly
from the right class if the level of perturbation is high, unless some guiding conditional input is
applied. Using textual inversion (TI), we can construct such a conditioning input from the support
examples. This leads to a combined method that we call TINT (Textual Inversion + Null-Text).
The method is illustrated in Figure 1 and works as follows: First, all support examples x(s) n,k of
class n are fed jointly to a TI procedure to find a text embedding vn capable of generating similar
examples. We use the generic prompt a photo of an #object, where #object represents the embedding
to optimize. The embedding is initialized from the word object, i.e. we don’t require a known text
description of each support class. We then run NTI on each support example separately, conditioned
by the embeddings obtained from TI. This produces latent vectors zn,k at time t = T and adjusted
unconditional embeddings un,1:T,k . Note that we have now optimized all embeddings present in the
CFG procedure (Eq. 2) - both the conditional and unconditional ones.
Given the optimized embeddings, we want to make a random perturbation of a latent zn,k and feed it
back through the generative model G, using the adjusted un,1:T,k in the CFG. A naive way to do the
random perturbation would be to blend zn,k with noise, i.e. letting
z = (1 − α)zn,k + αn, n ∼ N (0, I), α ∈ [0, 1]. (3)
However, as noted in [32], a simple linear interpolation between two latents will often result in an
output with lower norm than the inputs and lead to an overly bland, low-contrast image. To avoid this
effect, we could simply rescale the latent norm by letting z̃ = z∥z∥−1 , but that would not reproduce
zn,k as α → 0. One option could be to use the norm-guided interpolation suggested by [32], but
we opt for a simpler approach. We compute a desired latent norm by interpolating the norms of the
inputs to the blending operation from Eq. 3 and rescaling z to this norm value, by
z̃ = z ∥z∥⁻¹ ((1 − α)∥zn,k∥ + α∥n∥). (4)
To generate a new example, simply select a k at random, draw a noise vector n, interpolate and
normalize using Eq. 3 and Eq. 4, and feed z̃ into the diffusion model. Pseudo-code for the entire
procedure is provided in Appendix A.1. As illustrated in Figure 3, when α is increased, our method
gradually transforms from performing deep data augmentation (staying relatively faithful to the
support examples) to complete synthetic data generation. For the few-shot case, we expect that a
large α will work best, in order to provide as much variation as possible, while smaller α may be
useful in situations where a substantial amount of training data already exists.
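The perturbation itself takes only a few lines; a sketch of Eqs. 3-4 follows, with a commented-out usage example in which generate stands for the guided backward process G followed by VQ-VAE decoding (Appendix A.1 gives the full procedure).

import torch

def tint_perturb(z_support, alpha):
    # Eq. 3: blend a support latent with Gaussian noise.
    n = torch.randn_like(z_support)
    z = (1.0 - alpha) * z_support + alpha * n
    # Eq. 4: rescale to an interpolated norm to avoid overly bland outputs.
    target_norm = (1.0 - alpha) * z_support.norm() + alpha * n.norm()
    return z * (target_norm / z.norm())

# Usage: pick a random support index k for class n, perturb its latent and
# run the guided backward process with the optimized embeddings:
#   z_tilde = tint_perturb(z[n][k], alpha=1.0)
#   x_new = generate(z_tilde, v[n], u[n][k])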
Figure 4: Overview of our few-shot transfer pipeline. First, the TINT generator and teacher are specialized on the novel classes (a), and then a distillation procedure is run to transfer the knowledge from the teacher to the student (b), using a KL-divergence loss against teacher pseudo-labels on base-class and TINT-generated images and a cross-entropy loss against the support labels on the support images. The modules adjusted in each step are outlined in red.
Note that there is a significant resolution mismatch between available pre-trained Stable Diffusion
models and popular few-shot classification datasets. We found that a naive upsampling before the
textual inversion made the inversion prone to reproduce sampling artifacts, sometimes prioritizing
this over semantically meaningful details. Using a pre-trained super-resolution (SR) model [33] for
upscaling largely eliminated this problem. We also found it beneficial to apply the TI optimization
target (Eq. 1) only for a restricted range t ∈ [250, 1000], as including smaller t made the textual
inversion more prone to focus on remaining sampling artifacts. See Appendix B.2 for more details.
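As a sketch, the upscaling chain detailed in Appendix B.2 (84×84 bilinear to 96×96, HAT super-resolution to 384×384, bilinear to 512×512) can be written as follows; sr_model denotes a pre-trained 4x HAT model, and its exact interface is an assumption made for illustration.

import torch.nn.functional as F

def upscale_for_inversion(x84, sr_model):
    # 84x84 -> 96x96 (bilinear) -> 384x384 (HAT, 4x) -> 512x512 (bilinear),
    # matching the Stable Diffusion input resolution.
    x96 = F.interpolate(x84, size=96, mode='bilinear', align_corners=False)
    x384 = sr_model(x96)
    return F.interpolate(x384, size=512, mode='bilinear', align_corners=False)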
3.2 Distillation
The distillation pipeline is illustrated in Figure 4. It largely follows prior work [31, 34, 32], but uses a
strong off-the-shelf teacher and our novel image generation method. As noted in [31], images from
the base classes (meta-training split) of a few-shot dataset can be pseudo-labeled by a teacher and
used for distillation. We include this idea, leaving us with 3 types of distillation training data: (1)
training images from base classes, (2) TINT-generated synthetic images of novel classes, and (3)
support images from novel classes. The base and synthetic images are pseudo-labeled using a strong
off-the-shelf teacher. See Appendix B.4 for more details.
An option would be to use direct label supervision for the synthetic data, skipping the teacher. How-
ever, we expect that generating representative images is a harder problem than correctly classifying
given images. For example, if the generator mistakenly produces an image that looks more like
another class, the student is supervised with pseudo-labels encoding what the image actually looks
like according to a good teacher, rather than with labels representing what the generator believed
it was doing. Thus, a mistake by the generator may lead to a sub-optimal distribution over which
the empirical loss is measured, but not to an incorrect learning signal akin to a mislabeled training
example.
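A sketch of the resulting per-batch training loss (cf. Figure 4): KL divergence against softened teacher pseudo-labels on base-class and TINT-generated images, and cross-entropy against the known labels on the support images. The temperature tau and the equal weighting of the two terms are illustrative assumptions, not the settings used in the experiments.

import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, x_pseudo, x_support, y_support, tau=4.0):
    # Pseudo-labeled data (base classes + TINT images): match softened
    # teacher and student distributions with a KL-divergence loss.
    with torch.no_grad():
        t_logits = teacher(x_pseudo)
    s_logits = student(x_pseudo)
    kl = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction='batchmean') * tau * tau
    # Labeled support examples: plain cross-entropy.
    ce = F.cross_entropy(student(x_support), y_support)
    return kl + ce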
Few-shot classifiers are often evaluated over randomly drawn episodes, where each episode contains
a random set of novel classes with random support and query examples. For example, results on
miniImageNet [20] are typically evaluated over 600 episodes, each containing 5 classes with K
(1 or 5) support examples and Q = 15 query examples per class. In our pipeline, specialization
involves generating thousands of synthetic images, and repeating this process 600 times or more
is computationally cumbersome. However, evaluating over a larger query set would use negligible
compute in comparison. Instead of the most common choice of 15 query examples per class, we
could easily use the remaining 595 queries per class for miniImageNet.
A question then arises: can we trade episodes for query examples, i.e. run the evaluation over fewer
episodes but with more query examples per episode, keeping a similar statistical significance in the
result? As a guide for what happens when varying the number of episodes and queries, we present
the following theorem:
Theorem 1 Let P, Q ∈ Z+ be the number of episodes and query examples per episode respectively.
Let ap ∈ [0, 1] be the true (unknown) accuracy of an evaluated method on episode p ∈ {1, ..., P } and
let a = E[ap] and σa² = V[ap] be the true (unknown) mean and variance of ap over i.i.d. episodes p.
Furthermore, let ãp be our estimate of ap , computed by measuring the empirical accuracy over Q
i.i.d. query examples from episode p, and let ã be the final empirical accuracy computed as the
average of ãp over P independently drawn episodes.
Then, for any choice of Q and P , we have that
E[ã] = a, (5)
V[ã] = (1/P) [ (1/Q) a(1 − a) + (1 − 1/Q) σa² ]. (6)
Eq. 5 means that the estimated accuracy ã is an unbiased estimate of the true unknown a regardless of
P and Q. Eq. 6 shows how P and Q affect the estimator variance, and can be used e.g. to determine
suitable values for P and Q to reach a certain desired estimator variance. To arrive at Eq. 6, we model
the evaluation over each episode as a set of Bernoulli trials and derive an expression of V[ã] using the
law of total variance. Motivated by this relation, we decided to reduce P to 120 while increasing Q to 595N in our evaluation. The full proof and a numerical example motivating this choice can be
found in Appendix A.3.
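Eq. 6 is cheap to evaluate, so it can be used directly for experiment planning; a minimal helper is sketched below (the concrete numbers behind our choice of 120 episodes are worked out in Appendix A.3).

def accuracy_estimator_variance(P, Q, a, sigma_a):
    # Eq. 6: variance of the empirical accuracy over P episodes with Q query
    # examples each, given true mean accuracy a and inter-episode std sigma_a.
    return (1.0 / P) * ((1.0 / Q) * a * (1.0 - a)
                        + (1.0 - 1.0 / Q) * sigma_a ** 2)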
4 Related Work
The state-of-the-art performance on miniImageNet [20] has increased steadily over the years [35, 36,
37, 38, 39, 40]. Most recent works rely on techniques other than generative distillation, and often use
relatively large models. Regarding tiny models, the TRIDENT method [41] produced good results
using a model only 4x as large as the Conv4 model (see Section 5.1), but due to their transductive
setting, these results are not directly comparable to our work. Recently, the top performers have been models
using extra training data [8, 42].
Knowledge distillation, as a means of improving the accuracy of a baseline model or for model compression, has been widely studied [11, 12, 13, 43]. One notable example is SimCLRv2 [44], where a
self-supervised model is trained on a large set of unlabeled data, specialized on a small labeled set,
and distilled to a potentially smaller model. Like most related methods, the distillation requires a
representative unlabeled dataset.
There is a growing interest in using generated training data for supervised learning [45, 46]. A recent
work [47] showed that generated data improves ImageNet classification, but did not address the few-
shot case or transfer to smaller models. Other works use generated data for few-shot classification,
but often in the feature domain and without combining it with distillation [48, 49, 50]. Some prior
works combine generative models with distillation. One work [51] addresses distillation with limited
data using a CVAE and MixUp regularization. However, the authors don’t address the case where the
teacher is also a few-shot model, and don’t present few-shot classification results on miniImageNet.
One work [52] uses GANs [53] to generate distillation data, also without addressing the few-shot
case. Another work [54] concludes that contrastive pre-training is preferable to using GANs. They
also include a distillation step in their pipeline, but assume access to a large unlabeled representative
set, which we don’t have. In DatasetGAN [55] and related works [56], GANs are used to generate
pixelwise labeled distillation training images. In contrast to our work, they require manual annotation
of GAN-generated images (rather than using inversion techniques) and a large set of unlabeled
application-specific images.
The original textual inversion technique [23] is related to a larger body of literature on text-to-image
personalization [57, 58, 59, 60, 61]. Null-text inversion [24] is also related to GAN inversion [62].
Most works in these directions have image generation or editing as the intended application, and
don’t show results for few-shot classification.
Most related to our approach, a recent line of work [63, 32] uses diffusion models for few-shot learning with good results. As in our work, their few-shot pipeline can be traced back to LabelHalluc [31].
However, those methods require a known text description of the novel classes, and require more
computational effort due to running a latent space optimization for every image generation.
Figure 5: Accuracy over iterations for two choices of teachers and data sources: (a) PMF, all; (b) PMF, syn+base; (c) PMF, syn+supp; (d) PMF, supp+base; (e) PMF, syn only; (f) LH, all; (g) LH, supp+base. PMF denotes the P>M>F teacher [8] (DINO + ProtoNet), and LH denotes the LabelHalluc teacher [31] (ResNet12 + logistic classifier). The left plot is a zoomed-in version of the right plot (note the scale on both axes).
5 Experiments
5.1 Models
As students, we used two of the smallest models that are widely used in the few-shot literature. The
Conv4 model is a tiny 4-layer CNN with only 113k parameters that was used already in the original
ProtoNet paper [64]. The ResNet12 model (8M parameters) is significantly larger, but still one of the
smaller models used frequently in prior art [31, 65, 66, 67, 32]. The main choice of teacher was the
P>M>F method [8] due to its simplicity and good performance. Appendix B contains more details.
To tune selected hyperparameters and perform ablations and comparisons to prior work, we first ran experiments with the ResNet12 model over a limited set of 5 fixed episodes from the validation
split of miniImageNet. The same episodes were used in all experiments, ensuring comparable results
despite the low number of episodes. Initial hyperparameters were based on previous work [31], with
minor adjustments described in Appendix B. For the core TINT method, we found that α = 1 works
best on few-shot classification benchmarks, and used that option for all presented results.
The accuracy over training iterations for a few combinations of data sources and teachers is shown in
Figure 5. While the original LabelHalluc [31] accuracy (plot g) saturated after around 200 iterations,
the accuracy of our full pipeline (plot a) kept improving. We therefore also tried training cycles up to
10k iterations (dropping the learning rate by a factor 0.1 after 5k iterations). Interestingly, including
synthetic data but keeping the original LabelHalluc teacher gave no significant accuracy gain (plot
f ). Conversely, switching to a stronger teacher without adding more representative data (plot d) did
lead to increased accuracy, but with a slow onset. When running for only 300 iterations (as in [31]),
one may mistakenly believe that this combination has no potential, whereas it will in fact start to
outperform LabelHalluc after around 3k iterations. A plausible explanation is that distillation requires
significantly more iterations when transferring knowledge between different architectures. The final
accuracy of our method after the full 10k iterations is shown in the top part of Table 1. Note that in
order to reach the highest accuracy, both a strong teacher and the synthetic data are required.
Table 2 shows ablations where we kept the few-shot pipeline fixed while simplifying the image
generation. Most notably, we ran one ablation where the generative model was removed altogether
and replaced with real data mined from the LAION dataset [6] (original SD training data) using CLIP
embeddings of the class names. Interestingly, this setup performed significantly worse than our full
pipeline even though real images and known class names were used. This shows that the generative
model contributes substantial value to the pipeline. More details are provided in Appendix B.5.
We also compared with data generated by a competing method, NAO+SeedSelect [32]. The authors
provided source code for their core image generation method, but not for the few-shot experiments,
and we were unable to reproduce their results exactly. Instead, we used their image generation method
as a drop-in replacement of TINT in our few-shot pipeline. Their previous results were reported using
1k generated examples, so we used the same setting for TINT in the comparison.

Table 1: Comparing teacher and synthetic data options (1k or 4k synthetic examples per class).

Synthetic data      Synthetic teacher  Base teacher  Acc
TINT (4k)           LH                 LH            87.7
TINT (4k)           None               LH            89.5
TINT (4k)           None               PMF           91.9
TINT (4k)           PMF                PMF           94.1
NAO+SS [32] (1k)    None               PMF           89.3
NAO+SS [32] (1k)    PMF                PMF           90.4
TINT (1k)           PMF                PMF           93.6

Table 2: Results using various simplifications of the synthetic data aspect of our method.

Method                       Acc
No synthetic data            91.4
Plain TI                     92.1
TI + SR + full t range       93.3
TI + SR + limited t range    93.5
Full TINT                    94.1
LAION mining                 92.9

The results (bottom
part of Table 1) show that TINT performs better in this context. Some additional comparisons using
the Conv4 backbone are included in Appendix B.7, due to space constraints. Most notably, using
TINT, the Conv4 backbone reached 81.7%, while replacing TINT with NAO+SeedSelect [32] failed
to produce good results (67.3%). The latter could likely be improved by hyperparameter tuning, but
we did not proceed further in that direction. The take-home message is that it is not straightforward
to get good results using this competing method for tiny students.
Finally, we also compared the TINT method with competing methods using the FID measure [68],
reaching 2.7 for the full TINT method, compared to 8.0 for plain textual inversion and 13.0 for
NAO+SeedSelect. Due to space constraints, the full results table and experiment details are provided
in Appendix B.8.
The final evaluation was made using the test splits of 3 popular few-shot datasets: miniImageNet [20],
CUB [69] and CIFAR-FS [70]. Our main results for both student models are summarized in Table 3.
The use of additional data is indicated in column Extras, where SD indicates Stable Diffusion [15]
trained on the LAION dataset [6] and DINO indicates the pre-trained DINO model [9]. Note that
DINO was trained on ImageNet, which causes some semantic overlap between pre-training and novel
classes. However, the DINO training is fully unsupervised, and this issue was discussed extensively in
the original P>M>F paper [8]. We also acknowledge that a direct comparison between methods with
different use of extra data is not fair, as the setting without extra data is of course more challenging.
Nevertheless, we chose to include methods that do not rely on extra data in the comparison, to show
how large the gains from using pre-trained models in the pipeline can be.
Prior top Conv4 results are clustered around 72-74% accuracy, while larger networks routinely reach
better performance. This may lead practitioners to believe that model size is the main bottleneck for
Conv4, and that larger models are required in order to reach higher performance. Our results show
that it is indeed possible to significantly push the performance even of the tiny Conv4 model if extra
data is allowed. For the ResNet12 backbone, there are more representative methods to compare with.
Our method is on par with NAO+SeedSelect [32] on miniImageNet, but slightly worse on CUB and
CIFAR-FS. Note however that their method requires a text description of the classes, which ours does
not. On the other hand, we require an additional pre-trained model (DINO) that they do not. Also
recall that using NAO+SeedSelect as a drop-in replacement of TINT in our few-shot pipeline leads to
lower accuracy (see Section 5.2). Since we used their code for the image generation, this discrepancy
is likely explained by some aspect of their training setup outside of the core image generation,
indicating that even better results could be possible by using TINT under similar conditions.
Measured using one NVidia A40 GPU, the total time required for specialization to 5 novel classes
from one miniImageNet episode is 6.41 hours, with 5.59h for generating 20k images (4k per class)
using TINT and 0.82 h for distillation. In contrast, NAO+SeedSelect [32] requires 47.5 GPU hours
for generating 5k images (1k per class). This equals 1.0s per generated image for our method
compared to 34.2s for NAO+SeedSelect. See Appendix A.4 for a detailed breakdown.
Table 3: Main results on miniImageNet, CUB and CIFAR-FS, 5-way 5-shot, with 95% confidence
intervals. *: Uses textual descriptions of classes and external text-based prior knowledge. **: Results
reported by [70]. †: Requires text descriptions of the novel classes. ‡: Uses a pre-trained CLIP
backbone and was further trained on ImageNet1k [71], Fungi [72], MSCOCO [73] and WikiArt [74].
Method Model Extras miniImageNet CUB CIFAR-FS
ProtoNet [64] Conv4 - 68.2 ± 0.66 - 72.0 ± 0.6 **
R2-D2 [70] Conv4 - 65.4 ± 0.2 - 77.4 ± 0.2
MetaQDA [75] Conv4 - 72.64 ± 0.62 - 77.33 ± 0.73
BL++/IlluAug[76] Conv4 - - 79.74 ± 0.60 -
MELR [77] Conv4 - 72.27 ± 0.35 85.01 ± 0.32 -
FRN+TDM [78] Conv4 - - 88.89 -
KSTNet [67] Conv4 Text* 73.72 ± 0.63 - -
TINT (ours) Conv4 DINO, SD 80.29 ± 1.14 89.56 ± 0.80 85.43 ± 1.03
RENet [79] RN12 - 82.58 ± 0.30 91.11 ± 0.24 86.60 ± 0.32
MELR [77] RN12 - 83.40 ± 0.28 85.01 -
FRN+TDM [78] RN12 - - 92.80 -
LabelHalluc [31] RN12 - 86.54 ± 0.48 - 90.5 ± 0.6
KSTNet [67] RN12 Text* 82.61 ± 0.48 - -
DiffAlign [34] RN12 SD, Text† 88.63 ± 0.3 - 91.96 ± 0.5
NAO+SS [32] RN12 SD, Text† 93.21 ± 0.6 97.92 ± 0.2 94.87 ± 0.4
TINT (ours) RN12 DINO, SD 93.13 ± 0.50 94.57 ± 0.60 93.18 ± 0.86
P>M>F [8] ViT/B DINO 98.4 97.0 92.2
CAML [42] ViT/B DINO, Misc‡ 98.6 ± 0.0 97.1 85.5 ± 0.1
6 Discussion
Summary We proposed a new diffusion inversion method combining the properties of textual and
null-text inversion. We also showed how this method can be used in few-shot transfer pipelines with
accuracy on par with prior state-of-the-art, but with significantly faster training. Furthermore, we
demonstrated that significantly higher accuracy than what has been previously reported is possible
even for the tiny Conv4 model when extra data is allowed in the training pipeline.
Limitations The TINT method is partly constrained by the diffusion model resolution. While a
super-resolution method can be used to handle a significant upscaling, the method cannot immediately
be applied to generate images of higher resolution than supported by the diffusion model. The
few-shot pipeline requires that a pre-trained generative model and teacher with a domain-relevant pre-
training are available, which may not be the case in niche applications. Furthermore, the specialization
is relatively compute-heavy. This can be limiting for applications where specialization is made
frequently, and makes full multi-episode evaluation on few-shot benchmarks cumbersome. However,
our method is not worse in this respect than previous methods using diffusion models in few-shot
settings, and we expect that 6-7 hours of specialization time is acceptable in many applications.
Fairness of using extra data Few-shot classification benchmarks have traditionally been used in a
strict setting, without allowing extra data. Nowadays, this strict view is often relaxed, and there are
several published few-shot results relying on extra data [8, 42, 32]. Other benchmarks, e.g. regular
ImageNet classification, have undergone similar transitions, where most top results now rely on extra
training data. As discussed in Section 5.3, we find it reasonable to compare results with/without extra
data as long as the use of extra data is clearly disclosed. We fully acknowledge that the strict setting
is more challenging, but note that many real-world applications have no rule against extra data. We
believe that practitioners will find it useful to see how large accuracy gains are achievable by
allowing extra data in the training.
Outlook Image-generating diffusion models improve at a rapid pace. In our pipeline, it is straight-
forward to replace the diffusion model with an improved one. As such models become even more
general, we expect more and more applications to be fully solvable using our method, even when
small, highly efficient final models are required and only a tiny set of application-specific data is
available. We hope that this approach will be adopted by practitioners, and that our work will inspire
the use of generative models in a wider range of data-constrained problems.
Acknowledgements
This work was partially funded by Pixelwise AB and was supported by the Wallenberg Artificial
Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice
Wallenberg Foundation. The experiments were enabled by resources provided by the National
Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish
Research Council through grant agreement no. 2022-06725.
References
[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[2] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.
Training data-efficient image transformers & distillation through attention. In International conference on
machine learning, pages 10347–10357. PMLR, 2021.
[3] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang,
and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[4] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete
Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint
arXiv:2304.02643, 2023.
[5] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal,
Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in
vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023.
[6] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti,
Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale
dataset for training next generation image-text models. Advances in Neural Information Processing
Systems, 35:25278–25294, 2022.
[7] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–
12113, 2022.
[8] Shell Xu Hu, Da Li, Jan Stühmer, Minyoung Kim, and Timothy M Hospedales. Pushing the limits of
simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9077, 2022.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand
Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 9650–9660, 2021.
[10] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng
Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In European conference on
computer vision, pages 493–510. Springer, 2022.
[11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.
[12] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey.
International Journal of Computer Vision, 129:1789–1819, 2021.
[13] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint
arXiv:1910.10699, 2019.
[14] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational
information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 9163–9171, 2019.
[15] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 10684–10695, 2022.
[16] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar
Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-
image diffusion models with deep language understanding. Advances in Neural Information Processing
Systems, 35:36479–36494, 2022.
[17] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[18] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala,
Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble
of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[19] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine learning,
109(2):373–440, 2020.
[20] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot
learning. Advances in neural information processing systems, 29, 2016.
[21] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International
conference on learning representations, 2016.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural
information processing systems, 33:6840–6851, 2020.
[23] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel
Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.
arXiv preprint arXiv:2208.01618, 2022.
[24] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing
real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 6038–6047, 2023.
[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015:
18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages
234–241. Springer, 2015.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from
natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,
2021.
[27] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2.
Advances in neural information processing systems, 32, 2019.
[28] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint
arXiv:2010.02502, 2020.
[29] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
2022.
[30] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of
the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541,
2006.
[31] Yiren Jian and Lorenzo Torresani. Label hallucination for few-shot classification. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 36, pages 7005–7014, 2022.
[32] Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, and Gal Chechik. Norm-guided latent space
exploration for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
[33] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image
super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 22367–22377, 2023.
[34] Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, and Rama Chellappa. Diffalign: Few-shot learning
using diffusion based synthesis and alignment. arXiv preprint arXiv:2212.05404, 2022.
[35] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting
parameters from activations. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 7229–7238, 2018.
[36] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for
improved few-shot learning. Advances in neural information processing systems, 31, 2018.
[37] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation
with set-to-set functions. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 8808–8817, 2020.
[38] Puneet Mangla, Nupur Kumari, Abhishek Sinha, Mayank Singh, Balaji Krishnamurthy, and Vineeth N
Balasubramanian. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of
the IEEE/CVF winter conference on applications of computer vision, pages 2218–2227, 2020.
[39] Da Chen, Yuefeng Chen, Yuhong Li, Feng Mao, Yuan He, and Hui Xue. Self-supervised learning for
few-shot image classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 1745–1749. IEEE, 2021.
[40] Peyman Bateni, Jarred Barber, Jan-Willem Van de Meent, and Frank Wood. Enhancing few-shot im-
age classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision, pages 2796–2805, 2022.
[41] Anuj Singh and Hadi Jamali-Rad. Transductive decoupled variational inference for few-shot classification.
arXiv preprint arXiv:2208.10559, 2022.
[42] Christopher Fifty, Dennis Duan, Ronald G Junkins, Ehsan Amid, Jure Leskovec, Christopher Ré, and
Sebastian Thrun. Context-aware meta-learning. ICLR 2024-12th International Conference on Learning
Representations, 2024.
[43] Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation
with the reused teacher classifier. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 11933–11942, 2022.
[44] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-
supervised models are strong semi-supervised learners. Advances in neural information processing systems,
33:22243–22255, 2020.
[45] Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese, Jan Mathias Koehler, and Lukas Schott.
Mind the gap between synthetic and real: Utilizing transfer learning to probe the boundaries of stable
diffusion generated data. arXiv preprint arXiv:2405.03243, 2024.
[46] Mert Bülent Sarıyıldız, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it:
Learning transferable representations from synthetic imagenet clones. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 8011–8021, 2023.
[47] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic
data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023.
[48] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features.
In Proceedings of the IEEE international conference on computer vision, pages 3018–3027, 2017.
[49] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary
data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7278–7286,
2018.
[50] Jingyi Xu and Hieu Le. Generating representative samples for few-shot classification. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9003–9013, 2022.
[51] Dang Nguyen, Sunil Gupta, Kien Do, and Svetha Venkatesh. Black-box few-shot knowledge distillation.
In European Conference on Computer Vision, pages 196–211. Springer, 2022.
[52] Xin Ding, Yongwei Wang, Zuheng Xu, Z Jane Wang, and William J Welch. Distilling and transferring
knowledge via cgan-generated samples for image classification and regression. Expert Systems with
Applications, 213:119060, 2023.
[53] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing
systems, 27, 2014.
[54] Oindrila Saha, Zezhou Cheng, and Subhransu Maji. Ganorcon: Are generative models useful for few-shot
segmentation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 9991–10000, 2022.
[55] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio
Torralba, and Sanja Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10145–
10155, 2021.
[56] Nontawat Tritrong, Pitchaporn Rewatbowornwong, and Supasorn Suwajanakorn. Repurposing gans for
one-shot semantic part segmentation. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 4475–4485, 2021.
[57] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-
based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics
(TOG), 42(4):1–13, 2023.
[58] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image
personalization. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
[59] Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. “this is my unicorn, fluffy”:
Personalizing frozen vision-language representations. In European Conference on Computer Vision, pages
558–577. Springer, 2022.
[60] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept
customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 1931–1941, 2023.
[61] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream-
booth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[62] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion:
A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3121–3138, 2022.
[63] Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare con-
cepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 38, pages 4695–4703, 2024.
[64] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in
neural information processing systems, 30, 2017.
[65] Jiangtao Xie, Fei Long, Jiaming Lv, Qilong Wang, and Peihua Li. Joint distribution matters: Deep brownian
distance covariance for few-shot classification. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 7972–7981, 2022.
[66] Xu Luo, Longhui Wei, Liangjian Wen, Jinrong Yang, Lingxi Xie, Zenglin Xu, and Qi Tian. Rectifying
the shortcut learning of background for few-shot learning. Advances in Neural Information Processing
Systems, 34:13073–13085, 2021.
[67] Zechao Li, Hao Tang, Zhimao Peng, Guo-Jun Qi, and Jinhui Tang. Knowledge-guided semantic transfer
network for few-shot image recognition. IEEE Transactions on Neural Networks and Learning Systems,
2023.
[68] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information
processing systems, 30, 2017.
[69] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset.
Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[70] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable
closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[71] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[72] Yin Cui and Brigit Schroeder. Fgvcx fungi classification challenge 2018. https://github.com/visipedia/
fgvcx_fungi_comp, 2018.
[73] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014:
13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages
740–755. Springer, 2014.
[74] Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right
metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
[75] Xueting Zhang, Debin Meng, Henry Gouk, and Timothy M Hospedales. Shallow bayesian meta learning
for real-world few-shot recognition. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 651–660, 2021.
[76] Haipeng Zhang, Zhong Cao, Ziang Yan, and Changshui Zhang. Sill-net: Feature augmentation with
separated illumination representation. arXiv preprint arXiv:2102.03539, 2021.
[77] Nanyi Fei, Zhiwu Lu, Tao Xiang, and Songfang Huang. Melr: Meta-learning via modeling episode-level
relationships for few-shot learning. In International Conference on Learning Representations, 2020.
[78] SuBeen Lee, WonJun Moon, and Jae-Pil Heo. Task discrepancy maximization for fine-grained few-shot
classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 5331–5340, 2022.
[79] Dahyun Kang, Heeseung Kwon, Juhong Min, and Minsu Cho. Relational embedding for few-shot
classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
8822–8833, 2021.
[80] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang,
Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie
Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[81] Hugging Face. Diffusers. https://huggingface.co/docs/diffusers/index, 2014.
[82] Joseph K Blitzstein and Jessica Hwang. Introduction to probability. Crc Press, 2019.
[83] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image
classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part V 16, pages 18–35. Springer, 2020.
[84] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-
shot visual learning with self-supervision. In Proceedings of the IEEE/CVF international conference on
computer vision, pages 8059–8068, 2019.
[85] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differen-
tiable convex optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 10657–10665, 2019.
[86] Mamshad Nayeem Rizve, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Exploring comple-
mentary strengths of invariant and equivariant representations for few-shot learning. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 10836–10846, 2021.
[87] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
A Additional method details
A.1 Pseudo-code
This section presents pseudo-code for the TINT method. Algorithm 1 shows how to specialize the
generator given a set of support examples representing novel classes, while Algorithm 2 shows how
to generate an example given a specialized generator.
Algorithm 2 ends by running the guided backward process, z0 ← G(z̃; vn, un,1:T,k), and returning vqvae_decode(z0).
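A condensed Python-style sketch of the two procedures is given below. The helper names (textual_inversion, null_text_inversion, tint_perturb, upscale_for_inversion, guided_backward, vqvae_decode) are schematic and their argument lists are abbreviated; see Section 3.1 for the underlying definitions.

import random

def specialize_generator(support_images, sd_model, sr_model):
    # Algorithm 1 (sketch): specialize the generator to the novel classes.
    gen = {}
    for n, images in support_images.items():          # K support examples per class n
        x = [upscale_for_inversion(img, sr_model) for img in images]
        v_n = textual_inversion(sd_model, x)           # shared class embedding (TI)
        latents, u_embs = [], []
        for x_k in x:
            z_k, u_k = null_text_inversion(sd_model, x_k, v_n)  # per-example z_T, u_{1:T}
            latents.append(z_k)
            u_embs.append(u_k)
        gen[n] = (v_n, latents, u_embs)
    return gen

def generate_example(gen, n, alpha, sd_model):
    # Algorithm 2 (sketch): draw one synthetic image for class n.
    v_n, latents, u_embs = gen[n]
    k = random.randrange(len(latents))
    z_tilde = tint_perturb(latents[k], alpha)                  # Eqs. 3-4
    z0 = guided_backward(sd_model, z_tilde, v_n, u_embs[k])    # G with CFG (Eq. 2)
    return vqvae_decode(sd_model, z0)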
The code required to run all experiments was implemented in PyTorch [80] using the Diffusers
library [81] with Stable Diffusion 1.5. The overall experiment framework was implemented by us
from scratch, but includes code parts from prior work [24, 23, 8, 31, 32]. Full source code will be
made available at https://github.com/pixwse/tiny2.
A.3 Accuracy estimation analysis

This section presents more details on the analysis of the accuracy estimation, including a proof of
Theorem 1.
Proof of Theorem 1 Computing the estimated accuracy ãp over an episode p involves classifying Q
independent query examples using a fixed classifier. This can be modeled as performing Q Bernoulli
trials with success rate ap . ãp is then the mean of Q Bernoulli-distributed variables, which follows a
binomial distribution with
E[ãp | ap] = ap, (7)
V[ãp | ap] = (1/Q) ap(1 − ap). (8)
Considering that episodes are drawn randomly and that each episode may have a different true
accuracy ap , we model each ap as a random variable with E[ap ] = a and V[ap ] = σa2 . Since episodes
are drawn randomly and independently, all ap are i.i.d.
For the expectation, we simply have E[ãp] = E[E[ãp | ap]] = E[ap] = a. To determine V[ãp], we need to consider
two sources of variance: due to estimating each ãp from a finite query set (intra-episode variance) and
due to each episode having a different true accuracy (inter-episode variance) based on the intrinsic
difficulty of each episode. Starting from the law of total variance (Eve's law) [82], we can write
V[ãp] = E[V[ãp | ap]] + V[E[ãp | ap]] (9)
      = E[(1/Q) ap(1 − ap)] + V[ap] (10)
      = (1/Q) (E[ap] − E[ap²]) + σa². (11)
From the well-known variance formula V[x] = E[x²] − E[x]², we get E[ap²] = V[ap] + E[ap]² = σa² + a², and arrive at
V[ãp] = (1/Q) (a − σa² − a²) + σa² (12)
      = (1/Q) a(1 − a) + (1 − 1/Q) σa². (13)
Our estimate ã is computed by averaging ãp over P independently drawn episodes, leading to a final
estimate ã with E[ã] = a and
V[ã] = (1/P) [ (1/Q) a(1 − a) + (1 − 1/Q) σa² ], (14)
which concludes the proof.
Discussion and usage example Theorem 1 allows us to study the expected impact of varying Q
and P . Note that for moderately large Q, Eq. 14 can be approximated as
V[ã] ≈ (1/P) [ (1/Q) a(1 − a) + σa² ]. (15)
As Q increases, V[ã] approaches σa²/P, representing the case where the accuracy of each episode is
estimated perfectly, and only the inter-episode variance remains.
Eq. 14 (or Eq. 15) can be used for experiment planning, assessing the expected variance of an
evaluation using a given P and Q, or giving guidance around the number of evaluation episodes
required for demonstrating a certain improvement with a desired statistical significance. As a concrete
example, for the LabelHalluc method [31], a ≈ 0.87, σ ≈ 0.05, giving an estimator variance of 7e-6
over 600 episodes using 75 query examples (15 per class). In preliminary experiments, the accuracy of our method was a = 0.93 with σ = 0.028. Increasing Q to include all remaining examples (Q = 595N), requiring a similar estimator variance as the original LabelHalluc method, and solving
for P gives us P = 118. We therefore settled for 120 evaluation episodes as a reasonable trade-off
between computational requirements and confidence. As can be seen in Table 3, our 95% confidence
interval for ResNet12 on miniImageNet is indeed similar to that of the LabelHalluc method (0.50 vs
0.48), showing that the practical results are consistent with the theory.
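For reference, the calculation above can be reproduced with the helper sketched in Section 3.2; the numbers are approximate since a and σ are rounded.

# LabelHalluc-style evaluation: 600 episodes, 75 queries (15 per class).
v_target = accuracy_estimator_variance(P=600, Q=75, a=0.87, sigma_a=0.05)  # ~7e-6

# Our setting: Q = 595 * 5 = 2975 queries per episode; solve Eq. 6 for P.
a, sigma_a = 0.93, 0.028
per_episode = (1.0 / 2975) * a * (1 - a) + (1 - 1.0 / 2975) * sigma_a ** 2
P_needed = per_episode / v_target   # roughly 120 episodes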
A.4 Compute requirements

The compute time required for various parts of the method is listed in Table 4. The total compute
required for one episode is 6.41 hours. While this is cumbersome for multi-episode evaluation, it
is not a problem in practice for applications where specialization is done rarely. Note that the final
models are very fast, due to the use of small student models (not evaluated here, since the student
models themselves are not our contributions). The compute times were evaluated using an NVidia
A40 with 48 GB of VRAM, but the method itself requires no more than 16 GB of VRAM. The per-image numbers for the null-text inversion and textual inversion were computed by dividing the time required for these operations by the number of images generated.
Students The network used in the original ProtoNet method [64] was dubbed Conv4 in later works
[37, 83, 84]. It is made up of 4 blocks, each consisting of a 3×3 convolution, batchnorm, ReLU
Table 4: Compute time required for various parts of the method, evaluated on miniImageNet using one NVidia A40 GPU with 48 GB VRAM.

                               1 class                  5 classes
Algorithm step             1 image     4k images       20k images
Textual inversion          0.30 s      0.33 h          1.67 h
Null-text inversion        0.05 s      0.05 h          0.25 h
Generate images            0.66 s      0.73 h          3.67 h
Complete TINT              1.01 s      1.12 h          5.59 h
Distillation                                           0.82 h
Complete specialization                                6.41 h
and 2×2 max pooling. The first block increases the channel count to 64, which is kept throughout
the 4 blocks. The ResNet12 is another popular choice [31, 85, 65, 37]. With 8M parameters, it is
significantly larger than Conv4, but still one of the smaller networks used in prior art. It consists
of 4 residual modules, each one consisting of 3 blocks. One block consists of a 3×3 convolution,
batchnorm and ReLU. A skip connection connects the input of the first block to the input of the ReLU
of the last block. The scale is decreased by a 2×2 max pooling at the end of each module. The first
convolution of each module increases the channel count to 64, 128, 256 and 512 channels respectively.
Both our implementation and the IER pre-training [86] directly follow prior work [31].
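For concreteness, a minimal PyTorch sketch of the Conv4 student is given below; the linear head dimensioning assumes 84×84 inputs and is our own convention rather than a statement about the exact prior-work implementation.

```python
# Minimal PyTorch sketch of the Conv4 student (4x [3x3 conv, batchnorm, ReLU,
# 2x2 max pool] at 64 channels). The linear head assumes 84x84 inputs, giving
# a 5x5 feature map after four poolings.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

class Conv4(nn.Module):
    def __init__(self, num_classes: int = 5, width: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, width),
            conv_block(width, width),
            conv_block(width, width),
            conv_block(width, width),
        )
        self.head = nn.Linear(width * 5 * 5, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

print(Conv4()(torch.randn(2, 3, 84, 84)).shape)  # torch.Size([2, 5])
```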
Teacher The teacher backbone is a DINO ViT-B/16 with pre-trained weights. The teacher pre-training largely followed the original P>M>F paper [8], with some minor variations due to implementation
details. We used the ADAM optimizer with initial learning rate 3e-3, lowered by a cosine schedule
over 100 epochs. Each epoch consisted of 400 episodes of 5 classes with 5 support examples and
15 query examples per class. The optimizer used a momentum term of 0.9 and weight decay 1e-6.
No test-time fine-tuning was performed for the teacher, since this is only important when evaluating
cross-domain performance.
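A minimal sketch of this fine-tuning setup is shown below; fetching the DINO ViT-B/16 backbone via torch.hub is an assumption about one convenient way to obtain the weights, the ADAM momentum term corresponds to the first beta parameter, and the episodic loop body is elided.

```python
# Sketch of the teacher fine-tuning schedule: ADAM, lr 3e-3 with cosine decay
# over 100 epochs, weight decay 1e-6. The torch.hub call is one convenient way
# to obtain the DINO ViT-B/16 backbone; the episodic loop body is elided.
import torch

backbone = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")
optimizer = torch.optim.Adam(
    backbone.parameters(),
    lr=3e-3,
    betas=(0.9, 0.999),   # beta1 = 0.9 plays the role of the momentum term
    weight_decay=1e-6,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... 400 episodes of 5-way / 5-shot / 15-query updates ...
    scheduler.step()
```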
Diffusion model For data generation, we used Stable Diffusion 1.5, with the DDIM scheduler
running over 50 time steps. The textual inversion was implemented largely following the original
paper [23], with 5k optimization iterations, using the ADAM optimizer with fixed learning rate 1e-3.
The null-text inversion was run with 10 iterations per timestep, using the ADAM optimizer with a learning rate varying over timesteps according to the original implementation [24].
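For orientation, the sketch below shows one way to set up this backbone with the diffusers library; in TINT, the optimized conditional and unconditional (null-text) embeddings replace the plain text prompt used here.

```python
# Sketch of the generation backbone: Stable Diffusion 1.5 with a 50-step DDIM
# scheduler via the diffusers library. In TINT, the optimized conditional and
# unconditional (null-text) embeddings replace the plain text prompt used here.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a goose", num_inference_steps=50).images[0]
image.save("sample.png")
```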
Resolution mismatch To handle the resolution mismatch between mini-imagenet and the pre-
trained Stable Diffusion model, the 84×84 images were first upsampled to 96×96, then upscaled
with super resolution using HAT [33] to 384×384, and then upsampled again to 512×512. Bilinear
interpolation was used in the two upsampling steps. Some hand-picked examples of generated images
with and without super-resolution are shown in Figure 6. Note that some images in the top row are
completely degenerate (1 and 5), and that there are traces of a grid-like structure also in the other images (carpets in images 2-4), showing why the super-resolution step is needed.
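A sketch of this upscaling chain is given below; `hat_x4` is a placeholder for a separately loaded HAT super-resolution model and is not an actual API of that project.

```python
# Sketch of the resolution chain: bilinear 84->96, HAT super resolution to 384,
# bilinear 384->512. `hat_x4` stands in for a separately loaded HAT model that
# maps (B, 3, 96, 96) to (B, 3, 384, 384).
import torch
import torch.nn.functional as F

def upscale_for_inversion(img_84: torch.Tensor, hat_x4) -> torch.Tensor:
    """img_84: (B, 3, 84, 84) tensor in [0, 1]; returns (B, 3, 512, 512)."""
    x = F.interpolate(img_84, size=96, mode="bilinear", align_corners=False)
    with torch.no_grad():
        x = hat_x4(x)  # super resolution, 96 -> 384
    return F.interpolate(x, size=512, mode="bilinear", align_corners=False)
```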
Figure 7: Examples of data mined from LAION (classes goose and rhinoceros beetle from the
mini-imagenet val split). Compare with support examples and generated examples in Figure 2.
B.3 Datasets
For miniImageNet, we used the common train-val-test split from [21]. For CUB, we used the common 100/50/50 split (randomly selected), with images cropped and downsampled to 84×84 resolution.
For CIFAR-FS, we used the original splits from [70]. The exact episodes used are provided with
the source code. We encourage follow-up work to use the same episodes when comparing with our
work, but also to run a final validation on a new set of random episodes to avoid over-engineering
hyper-parameters for one specific fixed episode set.
B.4 Distillation
Similar to prior work [31], we used an SGD optimizer with momentum 0.9 and weight decay 5e-4.
We adjusted the batch size to 150 to enable constructing a batch from equal parts of 1-3 data sources.
The learning rate was set to 0.02 (adjusted slightly due to the change of batch size). For the longer
runs, it was reduced by a factor 0.1 after 5k iterations. The loss was computed using cross-entropy
for the support examples and KL-divergence for the base and synthetic data sources.
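A minimal sketch of this objective is shown below; the function name and the random tensors in the usage example are ours, and temperature scaling is omitted.

```python
# Sketch of the distillation objective: cross-entropy on labeled support
# examples, KL divergence against teacher logits for base/synthetic images.
import torch
import torch.nn.functional as F

def distillation_loss(support_logits, support_labels, student_logits, teacher_logits):
    ce = F.cross_entropy(support_logits, support_labels)
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return ce + kl

# Quick check with random 5-way logits (batch of 150 split across data sources).
s_sup, y_sup = torch.randn(50, 5), torch.randint(0, 5, (50,))
s_soft, t_soft = torch.randn(100, 5), torch.randn(100, 5)
print(distillation_loss(s_sup, y_sup, s_soft, t_soft))
```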
The LAION dataset consists of 5.85 billion image-text pairs. Working with the entire dataset is impractical, as it weighs in at 240 TB (and already the metadata requires 800 GB). Furthermore, many images in LAION are of rather poor quality. In Stable Diffusion, the final iterations of training are performed on subsets of LAION with extra-high aesthetic quality, assessed by an automated method.
As a practical compromise, we use the subset improved aesthetics 6+, consisting of 12M images.
To select images representative of our support classes, we compared the CLIP embeddings of LAION
images and of text descriptions of the mini-imagenet validation split classes. We selected the best matches (at most 1k), requiring a cosine similarity larger than T = 0.45. Some examples of
images mined this way are shown in Figure 7. The main issue with training using mined LAION data
is that the mined images are often not representative of the actual class (as for the Beetle images in
Figure 7). Diffusion inversion provides a more consistent way of obtaining images that are visually
related to the support examples.
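The sketch below illustrates this mining step, assuming precomputed, L2-normalized CLIP image embeddings (as shipped with the LAION metadata) and a precomputed text embedding per class; the function name is ours.

```python
# Sketch of the CLIP-based mining: rank LAION images by cosine similarity to a
# class text embedding, keep at most 1k matches above the threshold T = 0.45.
# Embeddings are assumed precomputed and L2-normalized.
import numpy as np

def mine_class(image_emb: np.ndarray, text_emb: np.ndarray,
               top_k: int = 1000, threshold: float = 0.45) -> np.ndarray:
    """image_emb: (N, D) CLIP image embeddings, text_emb: (D,) class text embedding.
    Returns indices of the selected images, best matches first."""
    sims = image_emb @ text_emb                 # cosine similarity of unit vectors
    keep = np.flatnonzero(sims > threshold)     # enforce the similarity threshold
    order = keep[np.argsort(-sims[keep])]       # sort remaining indices by similarity
    return order[:top_k]                        # at most top_k best matches
```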
Theorem 1 was used to guide experiment planning, arriving at 120 episodes while evaluating on all remaining query examples per episode, as a reasonable trade-off between significance and computation time (see Section A.3). The final 95% confidence intervals reported in Table 3 are computed from the individual episode accuracies using the Student's-t distribution as usual, directly following prior work [31].
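For completeness, a sketch of this interval computation is given below; the function name is ours and the input is simply the vector of per-episode accuracies.

```python
# Sketch of the confidence-interval computation: 95% half-width over the
# per-episode accuracies using the Student's-t distribution.
import numpy as np
from scipy import stats

def ci95_halfwidth(episode_acc: np.ndarray) -> float:
    """episode_acc: per-episode accuracies (e.g. 120 values, in percent)."""
    n = len(episode_acc)
    sem = episode_acc.std(ddof=1) / np.sqrt(n)
    return stats.t.ppf(0.975, df=n - 1) * sem
```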
In addition to the experiments shown in Section 5.2, we ran a few selected ablations/comparisons using the Conv4 backbone on the same 5 fixed episodes of the validation split of miniImageNet. The results are shown in Table 5. Note that no hyperparameter tuning was applied here, so the results
Table 5: Results for the Conv4 backbone on the 5 fixed episodes, miniImageNet validation split.

Method              Syn teacher    Acc
Plain TI (4k)       PMF            70.4
TINT (4k)           PMF            81.7
NAO+SS [32] (1k)    None           67.3
NAO+SS [32] (1k)    PMF            64.0
TINT (1k)           PMF            80.9

Table 6: FID measure (64 features) for different synthetic data generation choices.

Method                             FID ↓
NAO+SeedSelect [32]                13.0
Plain TI                           8.0
TI + superres + limited t range    4.0
Full TINT                          2.7
can likely be improved. However, the results show that it is at least not straightforward to reach good performance using the competing NAO+SeedSelect method [32]. The results also show that TINT significantly outperforms plain textual inversion for this tiny student model as well.
One way of evaluating generative models is using the FID measure [68]. The original measure used
the 2048-dimensional feature vector from an Inception V3 network. In our case, we want to compare
generated images with real images drawn from miniImageNet, where there are only 600 examples per
class. Estimating a 2048 × 2048 covariance matrix from only a few hundred images is numerically
problematic. Instead, following prior work [32], we use a feature space of only 64 dimensions, using
pooled features from an earlier layer of the Inception V3 model. We generated 300 random images
per class, compared to 300 randomly drawn real images of the same class, repeated the process for 5
episodes (with 5 classes per episode) and computed the average FID score. The results are shown in
Table 6. Even though this is a rather coarse measure, it shows that TINT, in this respect, generates images that follow the novel-class image distribution better than competing methods.
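A sketch of this evaluation using the 64-dimensional feature option of torchmetrics' FID implementation is shown below; the random uint8 tensors stand in for batches of real and generated images of one class.

```python
# Sketch of the 64-dimensional FID evaluation with torchmetrics; the random
# uint8 tensors below stand in for 300 real and 300 generated images per class.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)
real = torch.randint(0, 256, (300, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (300, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())
```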
C External assets
This section gives an overview of external assets, download links and licenses. All assets that have no
explicit license were made publicly available together with wording encouraging research use.
• miniImageNet
Dataset introduced by [20], derivative of ImageNet (https://image-net.org/) that was released for non-commercial and educational use. Full terms of use are at https://www.image-net.org/download.php. Preprocessed by and downloaded from https://github.com/hushell/pmf_cvpr22.
• CUB
Dataset introduced by [69], available at https://www.vision.caltech.edu/datasets/cub_200_2011/. No explicit license provided.
• CIFAR-FS
Dataset introduced by [70], derivative of CIFAR-100 [87], available at https://www.cs.toronto.edu/~kriz/cifar.html. No explicit license provided.
• Null-text inversion
Our code for null-text inversion [24] was adapted from https://github.com/google/prompt-to-prompt/. Apache 2.0 license.
• Textual inversion
Our code for textual inversion [23] was adapted from https://github.com/google/prompt-to-prompt/. MIT license.
• P>M>F
Our code for P>M>F [8] was adapted from https://github.com/hushell/pmf_cvpr22. No explicit license provided.
• LabelHalluc
Our code for LabelHalluc [31] was adapted from https://github.com/yiren-jian/LabelHalluc and linked code for IER pretraining [86] at https://github.com/yiren-jian/embedding-learning-FSL. No explicit license provided.
Figure 8: More examples of generated images, classes crate, guitar, roundworm, king crab, hourglass,
dalmatian, bowl and school bus. For each class, 5 random support examples are shown on top,
followed by 10 randomly selected generated examples.
D Generated image examples
Figure 8 shows more examples of images generated by TINT. The classes were selected manually,
but the support examples and generated images were randomly selected without cherry picking. Note
how for example the generated guitar images capture the setting of the support examples (focus
on electric guitars / stage), rather than producing more generic guitar images. Also note how the
generated king crab images cover both whole crabs and prepared as food, as well as covering both
the red and brown color scales of the support examples. Sometimes food-like images are depicted
in a brown hue, which is understandable based on the few support examples. The generator also
sometimes extrapolates a bit too far in the food direction (lower-right image). The roundworm class
is the one where the method struggles the most. Here, generated images are frequently reminiscent
of atmospheric phenomena or pyrotechnics. This is understandable based on the two microscopy
images present in the support set. Recall however that the negative effects of non-desired variation in
the generated images are mitigated by the use of a teacher.