Dataset Distillation
Abstract—Recent success of deep learning can be largely attributed to the huge amount of data used for training deep neural networks. However, the sheer volume of data significantly increases the burden on storage and transmission. It would also consume considerable time and computational resources to train models on such large datasets. Moreover, directly publishing raw data inevitably raises concerns about privacy and copyright. Focusing on these inconveniences, dataset distillation (DD), also known as dataset condensation (DC), has become a popular research topic in recent years. Given an original large dataset, DD aims at a much smaller dataset containing a few synthetic samples, such that models trained on the synthetic dataset have performance comparable with those trained on the original real one. This paper presents a comprehensive review and summary of recent advances in DD and its applications. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Then, we provide a systematic taxonomy of current methodologies in this area; their theoretical relationship will also be discussed. We also point out current challenges in DD through extensive experiments and envision possible directions for future works.
1 INTRODUCTION
This seminal work has attracted many follow-up studies in recent years. On the one hand, considerable progress has been made in improving the performance of DD by a series of methods [19], [20], [21], [22], [23], [24], such that the real-world performance of models trained on synthetic datasets can match that of models trained on the original ones as closely as possible. On the other hand, a variety of works have extended the application of DD to various research fields, such as continual learning [25], [26], [27], [28], [29] and federated learning [30], [31], [32], [33], [34], [35], [36].

This paper aims to provide an overview of current research in dataset distillation. Our contributions are as follows:

• We explore and summarize existing works in dataset distillation and its applications comprehensively.
• We provide a systematic classification of current methodologies in DD. Categorized by the optimization objective, there are three mainstream solutions: performance matching, parameter matching, and distribution matching. Their relationship will also be discussed.
• By abstracting all the key components in DD, we construct an overall algorithmic framework followed by all existing DD methods.
• We point out current challenges in DD and envision possible future directions for improvements.

The rest of the article is organized as follows. We begin with discussions of works closely related to DD in Section 2, including coreset methods, hyperparameter optimization, and generative models. Then, in Section 3, we first define the dataset distillation problem formally, along with a general framework of dataset distillation methods. Based on this, various methods and their relationships are systematically introduced. Section 4 introduces applications of DD in various research fields. Section 5 evaluates the performance of representative DD methods. Section 6 discusses the current challenges in the field and proposes some possible directions for future works. Finally, Section 7 concludes the paper.

2 RELATED WORKS

Dataset distillation (DD) aims at distilling the knowledge of a dataset into some synthetic data while preserving the performance of models trained on it. Its motivations, objectives, and solutions are closely related to some existing fields, like knowledge distillation, coreset selection, generative modeling, and hyperparameter optimization. Their relationship with dataset distillation is discussed in this section.

2.1 Knowledge Distillation

Knowledge distillation (KD) [37], [38], [39], [40] aims to transfer knowledge from a large teacher network to a smaller student network, such that the student network can preserve the performance of the teacher with reduced computational overhead. The seminal work by Hinton et al. [37] leads the student to mimic the outputs of the teacher, which represent knowledge acquired by the teacher network. Afterward, improvements of KD have focused on four aspects: representations of knowledge, teacher-student architectures, distillation algorithms, and distillation schemes. First, knowledge can be represented by model responses/outputs [37], [41], features [38], [42], [43], and relations [44], [45], [46]. Second, teacher-student architectures refer to the network architectures of the teacher and student models, which determine the quality of knowledge acquisition and distillation from teacher to student [40]. Third, distillation algorithms determine the ways of knowledge transfer. A simple and typical way is to match the knowledge captured by the teacher and student models directly [37], [38]. Beyond that, many different algorithms have been proposed to handle more complex settings, such as adversarial distillation [47], attention-based distillation [39], and data-free distillation [48], [49]. Finally, distillation schemes control the training configurations of teacher and student, and there are offline- [37], [38], online- [50], and self-distillation [51]. As for applications, KD is widely used in ensemble learning [52] and model compression [38], [53], [54].

The concept of DD is inspired by KD [18]. Specifically, DD aims at a lightweight dataset, while KD aims at a lightweight model. In this view, DD and KD are only conceptually related but technically orthogonal. It is worth noting that, similar to DD, recent data-free KD methods [48], [49], [55] are also concerned with the generation of synthetic training samples, since original training datasets are unavailable. Their differences are two-fold. On the one hand, data-free KD takes a teacher model as input, while DD takes an original dataset as input. On the other hand, data-free KD aims at a student model with performance similar to the teacher; to obtain a satisfactory student, it generally relies on a large number of synthetic samples. DD, in contrast, aims at a smaller dataset, whose size is pre-defined and typically small due to limited storage budgets.

2.2 Coreset or Instance Selection

Coreset or instance selection is a classic selection-based method to reduce the size of the training set. It only preserves a subset of the original training dataset containing only valuable or representative samples, such that models trained on it can achieve performance similar to those trained on the original. Since simply searching for the subset with the best performance is NP-hard, current coreset methods select samples mainly based on some heuristic strategies. Some early methods generally expect a consistent data distribution between coresets and the original datasets [56], [57], [58], [59]. For example, Herding [56], [57], one of the classic coreset methods, aims to minimize the distance in feature space between the centers of the selected subset and the original dataset by incrementally and greedily adding one sample each time. In recent years, more strategies beyond matching data distributions have been proposed. For instance, bi-level optimization-based methods are proposed and applied in the fields of continual learning [60], semi-supervised learning [61], and active learning [62]. For another example, Craig et al. [63] propose to find the coreset with the minimal difference between the gradients produced by the coreset and by the original dataset with respect to the same neural network.

Aiming to synthesize condensed samples, many objectives in DD share similar spirits with coreset techniques. As
a matter of fact, on the functional level, any valid objective in coreset selection is also applicable in DD, such as distribution matching, bi-level optimization, and gradient matching. However, coreset and DD have distinct differences. On the one hand, a coreset is built from samples selected from the original dataset, while DD is based on synthesizing new data. In other words, the former contains raw samples from real datasets, while the latter does not require synthetic samples to look real, potentially providing better privacy protection. On the other hand, due to the NP-hard nature of the problem, coreset methods typically rely on heuristic criteria or greedy strategies to achieve a trade-off between efficiency and performance. Thus, they are more prone to sub-optimal results compared with DD, where all samples in condensed datasets are learnable.

2.3 Generative Models

Generative models are capable of generating new data by learning the latent distribution of a given dataset and sampling from it. Their primary target is to generate realistic data, like images, text, and video, which can fool human beings. DD is related to this field in two aspects. For one thing, one classic optimization objective in DD is to match the distributions of synthetic datasets and original ones, which is conceptually similar to generative modeling, e.g., generative adversarial networks (GAN) [64], [65] and variational autoencoders (VAE) [66]. Nevertheless, generative models try to synthesize realistic images, while DD cares about generating informative images that can boost training efficiency and save storage costs.

For another, generative models can also be effective in DD. In some works, synthetic samples are not learned directly; instead, they are produced by generative models. For example, IT-GAN [67] adopts conditional GANs [65] and GAN inversion [68], [69] for synthetic data generation, so that it only needs to update latent codes rather than directly optimizing image pixels; KFS [70] uses latent codes together with several decoders to output synthetic data. This usage of generative models is known as a kind of synthetic dataset parameterization, whose motivation is that latent codes provide a more compact representation of knowledge than raw samples and thus improve storage efficiency.

2.4 Hyperparameter Optimization

The performance of machine learning models is typically sensitive to the choice of hyperparameters [9], [10]. Research in hyperparameter optimization aims to automatically tune hyperparameters to yield optimal performance for machine learning models, eliminating the tedious work of manual adjustment and thus improving efficiency and performance. Hyperparameter optimization has a rich history. Early works, like [71] and [72], are gradient-free model-based methods, choosing hyperparameters to optimize the validation loss after fully training the model. However, the efficiency of these methods is limited when dealing with tens of hyperparameters [9]. Thus, gradient-based optimization methods [9], [10], [73], [74] are proposed and tremendously succeed in scaling up: introducing gradients into hyperparameter optimization problems opens up a new way of optimizing high-dimensional hyperparameters [10].

DD can also be converted into a hyperparameter optimization problem if each sample in the synthetic dataset is regarded as a high-dimensional hyperparameter. In this spirit, the seminal algorithm proposed by Wang et al. [18], which optimizes the loss on real data for models trained on synthetic datasets with gradient descent, shares essentially the same idea with gradient-based hyperparameter optimization. However, the focuses and final goals of DD and hyperparameter optimization are different. The former focuses more on obtaining synthetic datasets to improve the efficiency of the training process, while the latter tries to tune hyperparameters to optimize model performance. Given such orthogonal objectives, other streams of DD methods, like parameter matching and distribution matching, can hardly be related to hyperparameter optimization.

3 DATASET DISTILLATION METHODS

Dataset distillation (DD), also known as dataset condensation (DC), is first formally proposed by Wang et al. [18]. The target is to extract knowledge from a large-scale dataset and build a much smaller synthetic dataset, such that models trained on it have performance comparable to those trained on the original dataset. This section first provides a comprehensive definition of dataset distillation and summarizes a general workflow for DD. Then, we categorize current DD methods by different optimization objectives: performance matching, parameter matching, and distribution matching. Algorithms associated with these objectives are illustrated in detail, and their potential relationships are also discussed. Some works also focus on synthetic data parameterization and label distillation, which are introduced in the following parts.

3.1 Definition of Dataset Distillation

The canonical dataset distillation problem involves learning a small set of synthetic data from an original large-scale dataset, so that models trained on the synthetic dataset can perform comparably to those trained on the original. Given a real dataset consisting of |T| pairs of training images and corresponding labels, denoted as T = (X_t, Y_t), where X_t ∈ R^{N×d}, N is the number of real samples, d is the number of features, Y_t ∈ R^{N×C}, and C is the number of output entries, the synthetic dataset is denoted as S = (X_s, Y_s), where X_s ∈ R^{M×D}, M is the number of synthetic samples, Y_s ∈ R^{M×C}, M ≪ N, and D is the number of features for each sample. For typical image classification tasks, D is height × width × channels, and y is a one-hot vector whose dimension C is the number of classes. Formally, we formulate the problem as follows:

\mathcal{S} = \arg\min_{\mathcal{S}} \mathcal{L}(\mathcal{S}, \mathcal{T}),   (1)

where L is some objective for dataset distillation, which will be elaborated in the following contents.
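In practice, Eq. 1 is solved as a nested loop: some measure of how well S substitutes for T is computed with the help of one or more networks, and its gradient is backpropagated into the synthetic samples themselves. The sketch below illustrates this generic outer loop only; it is not taken from any particular implementation, and the `distillation_loss` callable, the optimizer settings, and all other names are placeholders for the concrete objectives introduced in the rest of this section.

```python
import torch

def distill(real_loader, num_classes, ipc=10, im_shape=(3, 32, 32),
            steps=1000, lr=0.1, distillation_loss=None):
    """Generic outer loop of dataset distillation (Eq. 1), sketched.

    `distillation_loss(x_syn, y_syn, x_real, y_real)` stands for any of the
    objectives L(S, T) discussed in Section 3 (performance, parameter, or
    distribution matching); it must be differentiable w.r.t. x_syn.
    """
    # Synthetic images are free parameters; labels are fixed one-hot here.
    x_syn = torch.randn(num_classes * ipc, *im_shape, requires_grad=True)
    y_syn = torch.arange(num_classes).repeat_interleave(ipc)

    opt = torch.optim.SGD([x_syn], lr=lr, momentum=0.5)
    real_iter = iter(real_loader)

    for _ in range(steps):
        try:
            x_real, y_real = next(real_iter)
        except StopIteration:
            real_iter = iter(real_loader)
            x_real, y_real = next(real_iter)

        loss = distillation_loss(x_syn, y_syn, x_real, y_real)
        opt.zero_grad()
        loss.backward()          # gradient flows only into the synthetic images
        opt.step()

    return x_syn.detach(), y_syn
```

The surveyed methods differ mainly in how such a loss is defined and in whether, and how, the networks used inside it are updated along the way.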
to compute the gradient of the validation loss w.r.t. synthetic datasets by backpropagating through the entire training graph. In this way, outer optimization steps are computationally expensive, and the GPU memory required is proportional to the number of inner loops. Thus, the number of inner loops is limited, which results in insufficient inner optimization and bottlenecks the performance [22]. It is also inconvenient for this routine to be scaled up to large models. There is a class of methods tackling this problem [19], [20], [21], [85] based on kernel ridge regression (KRR), which performs convex optimization and results in a closed-form solution for the linear model, avoiding extensive inner-loop training. Projecting samples into a high-dimensional feature space with a non-linear neural network f_θ parameterized by θ, where θ is sampled from some distribution Θ, the performance matching metric can be represented as:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta \sim \Theta}\big[\|Y_t - f_\theta(X_t) W^{*}_{\mathcal{S},\theta}\|^2\big],
W^{*}_{\mathcal{S},\theta} = \arg\min_{W_{\mathcal{S},\theta}} \big\{\|Y_s - f_\theta(X_s) W_{\mathcal{S},\theta}\|^2 + \lambda \|W_{\mathcal{S},\theta}\|^2\big\}
              = f_\theta(X_s)^T \big(f_\theta(X_s) f_\theta(X_s)^T + \lambda I\big)^{-1} Y_s.   (4)

Here λ is a small number for regularization. Rewriting Eq. 4 in kernel version, we have:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta \sim \Theta}\big[\|Y_t - K^{\theta}_{X_t X_s}\big(K^{\theta}_{X_s X_s} + \lambda I\big)^{-1} Y_s\|^2\big],   (5)

where K^θ_{X_1 X_2} = f_θ(X_1) f_θ(X_2)^T. Nguyen et al. [20], [21] propose to perform KRR with the neural tangent kernel (NTK) instead of training neural networks for multiple steps to learn synthetic datasets, based on the property that KRR with the NTK approximates the training of the corresponding wide neural network [86], [87], [88], [89].

To alleviate the high complexity of computing the NTK, Loo et al. [85] propose RFAD, based on the Empirical Neural Network Gaussian Process (NNGP) kernel [90], [91] instead. They also adopt Platt scaling [92] by applying a cross-entropy loss to labels of real data instead of the mean squared error, which further improves the performance and is demonstrated to be more suitable for classification tasks.

Concurrently, FRePo [19] decomposes a neural network into a feature extractor and a linear classifier, considering θ and W_{S,θ} in Eq. 4 as the parameters of the feature extractor and the linear classifier, respectively. Instead of pursuing a fully-optimized network in an entire inner training loop, it only obtains optimal parameters for the linear classifier via Eq. 4, and the feature extractor is trained on the current synthetic dataset, which results in a decomposed two-phase algorithm. Formally, its optimization objective can be written as:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta^{(0)} \sim \Theta}\Big[\sum_{t=0}^{T} \|Y_t - K^{\theta^{(t)}}_{X_t X_s}\big(K^{\theta^{(t)}}_{X_s X_s} + \lambda I\big)^{-1} Y_s\|^2\Big],
\theta_{\mathcal{S}}^{(t)} = \theta_{\mathcal{S}}^{(t-1)} - \eta \nabla l(\mathcal{S}; \theta_{\mathcal{S}}^{(t-1)}).   (6)

For FRePo, updates of synthetic data and networks are decoupled, unlike the meta-learning-based method in Eq. 2, which requires backpropagation through multiple updates.

3.3.3 Parameter Matching

The approach of matching parameters of neural networks in DD is first proposed by Zhao et al. [22] and extended by a series of following works [79], [80], [81], [85]. Unlike performance matching, which optimizes the performance of networks trained on synthetic datasets, the key idea of parameter matching is to train the same network using the synthetic dataset and the original dataset for some steps, respectively, and to encourage the consistency of their trained neural parameters. According to the number of training steps using S and T, parameter matching methods can be further divided into two streams: single-step parameter matching and multi-step parameter matching.

Single-Step Parameter Matching. In single-step parameter matching, as shown in Fig. 3(a), a network is updated using S and T for only 1 step, respectively, and their resultant gradients with respect to θ are encouraged to be consistent, which is also known as gradient matching. It is first proposed by Zhao et al. [22] and extended by a series of following works [79], [80], [82], [93]. After each step of updating the synthetic data, the network used for computing gradients is trained on S for T steps. In this case, the objective function can be formalized as follows:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta^{(0)} \sim \Theta}\Big[\sum_{t=0}^{T} D(\mathcal{S}, \mathcal{T}; \theta^{(t)})\Big],
\theta^{(t)} = \theta^{(t-1)} - \eta \nabla l(\mathcal{S}; \theta^{(t-1)}),   (7)

where the metric D measures the distance between the gradients ∇l(S; θ^{(t)}) and ∇l(T; θ^{(t)}). Since only a single-step gradient is necessary, and the updates of synthetic data and networks are decoupled, this approach is memory-efficient compared with meta-learning-based performance matching. Particularly, in image classification tasks, when updating the synthetic dataset S, Zhao et al. [22] sample each synthetic and real batch pair S_c and T_c from S and T respectively, containing samples from the c-th class, and each class of synthetic data is updated separately in each iteration:

D(\mathcal{S}, \mathcal{T}; \theta) = \sum_{c=0}^{C-1} d\big(\nabla l(\mathcal{S}_c; \theta), \nabla l(\mathcal{T}_c; \theta)\big),
d(A, B) = \sum_{i=1}^{L} \sum_{j=1}^{J_i} \Big(1 - \frac{A^{(i)}_j \cdot B^{(i)}_j}{\|A^{(i)}_j\|\,\|B^{(i)}_j\|}\Big),   (8)

where C is the total number of classes, L is the number of layers in the neural network, i is the layer index, J_i is the number of output channels for the i-th layer, and j is the channel index. As shown in Eq. 8, the original idea proposed by Zhao et al. [22] adopts the negative cosine similarity to evaluate the distance between two gradients. It also preserves the layer-wise and channel-wise structure of the network to get an effective distance measurement. Nevertheless, this method has some limitations, e.g., the distance metric between two gradients considers each class independently and ignores relationships underlying different classes. Thus, class-discriminative features are largely neglected. To remedy this issue, Lee et al. [79] propose a new distance metric for gradient matching considering class-
Fig. 3. (a) Single-step parameter matching. (b) Multi-step parameter matching. They optimize the consistency of trained model parameters using real data and synthetic data. For single-step parameter matching, it is equivalent to matching gradients, while for multi-step parameter matching, it is also known as matching training trajectories.
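To make the single-step (gradient matching) objective of Eqs. 7 and 8 concrete, the sketch below computes the layer-wise sum of one minus cosine similarity between the gradients of a class-c synthetic batch and a class-c real batch. It is a simplified reading of the published formulation rather than code from any official repository; in particular, flattening each parameter tensor along its leading dimension is one possible interpretation of the channel-wise grouping in Eq. 8.

```python
import torch
import torch.nn.functional as F

def layerwise_cosine_distance(grads_syn, grads_real):
    """d(A, B) in Eq. 8: sum over layers and output channels of
    (1 - cosine similarity) between corresponding gradient slices."""
    dist = 0.0
    for g_s, g_r in zip(grads_syn, grads_real):
        # Group each layer's gradient by its leading (output-channel) dim.
        a = g_s.reshape(g_s.shape[0], -1)
        b = g_r.reshape(g_r.shape[0], -1)
        dist = dist + (1.0 - F.cosine_similarity(a, b, dim=1)).sum()
    return dist

def gradient_matching_loss(net, x_syn_c, y_syn_c, x_real_c, y_real_c):
    """D(S, T; theta) for one class c at the current parameters of `net`."""
    loss_real = F.cross_entropy(net(x_real_c), y_real_c)
    grads_real = torch.autograd.grad(loss_real, net.parameters())
    grads_real = [g.detach() for g in grads_real]

    loss_syn = F.cross_entropy(net(x_syn_c), y_syn_c)
    # create_graph=True so the distance can be backpropagated to x_syn_c.
    grads_syn = torch.autograd.grad(loss_syn, net.parameters(),
                                    create_graph=True)
    return layerwise_cosine_distance(grads_syn, grads_real)
```

Summing this quantity over all classes, as in Eq. 8, and alternating it with short training steps of the network on S reproduces the overall single-step matching loop described above.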
Fig. 4. Distribution matching. It matches statistics of features in some networks for synthetic and real datasets.

and layers. It is demonstrated that this multi-step parameter matching strategy yields better performance than the single-step counterpart.

As a following work along this routine, Li et al. [94] find that a few parameters are difficult to match in dataset distillation with multi-step parameter matching, which negatively affects the condensation performance. To remedy this problem and generate a more robust synthetic dataset, they adopt parameter pruning, removing the difficult-to-match parameters when the similarity between the model parameters trained on the synthetic dataset, θ_S^{(T_s)}, and those trained on the real dataset, θ_T^{(T_t)}, is less than a threshold after updating networks in each dataset distillation iteration. The synthetic dataset is then updated by optimizing the objective function L calculated with the pruned parameters.

Optimizing the multi-step parameter matching objective defined in Eq. 12 involves backpropagation through the unrolled computational graph for T_s network updates, which triggers the problem of memory efficiency, similar to meta-learning-based methods. To solve this problem in MTT, Cui et al. [95] propose TESLA, with constant memory complexity w.r.t. the number of update steps. The key step in TESLA lies in that, in the i-th step, 0 ≤ i < T_s, it calculates the gradient of the loss function l(θ_S^{(i)}; S^{(i)}) with respect to both the current network parameters θ_S^{(i)} and the current synthetic samples S^{(i)}. The gradient for each synthetic sample is accumulated. Thus, during backpropagation, there is no need to record the computational graph for T_s training steps. Also, it is worth noting that adopting soft labels for synthetic data is crucial when condensing datasets with many classes.

Moreover, Du et al. [96] find that MTT results in accumulated trajectory error for downstream training and propose FTD. To be specific, in the objective of MTT shown in Eq. 12, θ^{(0)} may be sampled from training checkpoints of latter epochs on the original dataset, and the trajectory matching error on these checkpoints is minimized. However, it is hard to guarantee that the desired parameters seen in dataset distillation optimization can be faithfully reached when training on the synthetic dataset, which causes accumulated error, especially for latter epochs. To alleviate this problem, FTD adds flat regularization when training with original datasets, which results in flat training trajectories and makes the target networks more robust to weight perturbation. Thus, it yields lower accumulated error compared with the baseline.

3.3.4 Distribution Matching

Instead of matching the training effects of synthetic datasets on real datasets, e.g., the performance of models trained on S, distribution matching directly optimizes the distance between the two data distributions using some metrics, e.g., the Maximum Mean Discrepancy (MMD) leveraged by Zhao et al. [24]. The intuition is shown in Fig. 4. Since directly estimating the real data distribution can be expensive and inaccurate as images are high-dimensional data, distribution matching adopts a set of embedding functions, i.e., neural networks, each providing a partial interpretation of the input and their combination providing a comprehensive interpretation, to approximate MMD. Here, we denote the parametric function as f_θ, and distribution matching is defined as:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta \sim \Theta}\big[D(\mathcal{S}, \mathcal{T}; \theta)\big],   (14)

where Θ is a specific distribution for random neural network initialization, X_{s,c} and X_{t,c} denote samples from the c-th class in the synthetic and real datasets respectively, and D is some metric measuring the distance between two distributions. DM by Zhao et al. [24] adopts classifier networks without the last linear layer as embedding functions. The centers, i.e., mean vectors, of the output embeddings for each class of the synthetic and real datasets are encouraged to be close. Formally, D is defined as:

D(\mathcal{S}, \mathcal{T}; \theta) = \sum_{c=0}^{C-1} \|\mu_{\theta,s,c} - \mu_{\theta,t,c}\|^2,
\mu_{\theta,s,c} = \frac{1}{M_c}\sum_{j=1}^{M_c} f_\theta(X^{(j)}_{s,c}), \quad
\mu_{\theta,t,c} = \frac{1}{N_c}\sum_{j=1}^{N_c} f_\theta(X^{(j)}_{t,c}),   (15)

where M_c and N_c are the numbers of samples for the c-th class in the synthetic and real datasets respectively, and j is the sample index.

Instead of matching only features before the last linear layer, Wang et al. [97] propose CAFE, which forces the statistics of features for synthetic and real samples extracted by each network layer except the final one to be consistent. The distance function is as follows:

D(\mathcal{S}, \mathcal{T}; \theta) = \sum_{c=0}^{C-1} \sum_{i=1}^{L-1} \|\mu^{(i)}_{\theta,s,c} - \mu^{(i)}_{\theta,t,c}\|^2,   (16)

where L is the number of layers in the neural network and i is the layer index. Also, to explicitly learn discriminative synthetic images, CAFE adopts a discrimination loss Dr. Here, the center of the synthetic samples for each class is used as a classifier to classify real samples. The discriminative loss for the c-th class tries to maximize the probability of this class for the classifier:

Dr(\mathcal{S}, \mathcal{T}; \theta) = -\sum_{c=0}^{C-1} \sum_{j=1}^{N_c} \log p\big(c \mid X^{(j)}_{t,c}, \mathcal{S}, \theta\big),
p\big(c \mid X^{(j)}_{t,c}, \mathcal{S}, \theta\big) = \frac{\exp\{\mu^{(L-1)T}_{\theta,s,c} f^{(L-1)}_\theta(X^{(j)}_{t,c})\}}{\sum_{c'=0}^{C-1}\exp\{\mu^{(L-1)T}_{\theta,s,c'} f^{(L-1)}_\theta(X^{(j)}_{t,c})\}}.   (17)
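The class-wise mean-embedding objective of Eqs. 14 and 15 translates almost directly into code. The sketch below assumes an `embed` network mapping images to feature vectors (e.g., a classifier with its final linear layer removed, as in DM [24]); it is illustrative only and omits the random re-sampling of θ across iterations.

```python
import torch

def distribution_matching_loss(embed, x_syn, y_syn, x_real, y_real, num_classes):
    """D(S, T; theta) of Eq. 15: squared distance between per-class mean
    embeddings of synthetic and real samples, summed over classes."""
    f_syn = embed(x_syn)             # (M, F) synthetic features
    with torch.no_grad():
        f_real = embed(x_real)       # (N, F) real features, no grad needed
    loss = 0.0
    for c in range(num_classes):
        syn_c = f_syn[y_syn == c]
        real_c = f_real[y_real == c]
        if len(syn_c) == 0 or len(real_c) == 0:
            continue                 # class absent from this batch
        loss = loss + ((syn_c.mean(0) - real_c.mean(0)) ** 2).sum()
    return loss
```

Because no network update is required inside this loss, the same embedding can be re-sampled cheaply at every outer step, which is the main source of DM's efficiency.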
Meanwhile, networks are updated alternately with synthetic data by a dedicated schedule. Overall, the objective function can be written as:

\mathcal{L}(\mathcal{S}, \mathcal{T}) = \mathbb{E}_{\theta^{(0)} \sim \Theta}\Big[\sum_{t=0}^{T} \big\{D(\mathcal{S}, \mathcal{T}; \theta^{(t)}) + \lambda\, Dr(\mathcal{S}, \mathcal{T}; \theta^{(t)})\big\}\Big],
\theta^{(t)} = \theta^{(t-1)} - \eta \nabla l(\mathcal{S}; \theta^{(t-1)}),   (18)

where λ is a hyperparameter for balance.

3.3.5 Connections between Objectives in DD

In this part, we show connections between the above three optimization objectives in DD and give proof that they are essentially related. For simplicity, we assume that only the last linear layer of the neural network is considered for updating synthetic data, while previous network layers, i.e., the feature extractor f_θ, are fixed in this step. The parameter of the linear layer is W ∈ R^{F×C}, where F is the dimension of the feature from f_θ, and C is the total number of classes. The least-square function is adopted for model optimization, as is the case in FRePo [19].

Performance Matching v.s. Optimal Parameter Matching. For performance matching, the goal is to optimize the performance on T for models trained on S, as shown in Eq. 2. Here, for simplicity, we ignore the regularization term and assume λ = 0, given that λ is generally a small constant for numerical stability. Then, the performance matching objective in Eq. 2 for some given θ can be written as:

\mathcal{L}_{perfM} = \|Y_t - f_\theta(X_t) f_\theta(X_s)^T \big(f_\theta(X_s) f_\theta(X_s)^T\big)^{-1} Y_s\|^2.   (19)

Under this circumstance, we have the following proposition:

Proposition 1. Performance matching objective for kernel ridge regression models is equivalent to optimal parameter matching objective, or infinity-step parameter matching.

Proof. Given that KRR is a convex optimization problem and the optimal analytical solution can be achieved by a sufficient number of training steps, the optimal parameter matching, in this case, is essentially infinity-step parameter matching, which can be written as:

\mathcal{L}_{paraM} = \|W^{*}_{\mathcal{S},\theta} - W^{*}_{\mathcal{T},\theta}\|^2,
W^{*}_{\mathcal{S},\theta} = \arg\min_{W_{\mathcal{S},\theta}}\{\|Y_s - f_\theta(X_s) W_{\mathcal{S},\theta}\|^2\} = f_\theta(X_s)^T \big(f_\theta(X_s) f_\theta(X_s)^T\big)^{-1} Y_s,
W^{*}_{\mathcal{T},\theta} = \arg\min_{W_{\mathcal{T},\theta}}\{\|Y_t - f_\theta(X_t) W_{\mathcal{T},\theta}\|^2\} = \big(f_\theta(X_t)^T f_\theta(X_t)\big)^{-1} f_\theta(X_t)^T Y_t.   (20)

Here, we denote (f_θ(X_t)^T f_θ(X_t))^{-1} f_θ(X_t)^T as M. Then we get:

\|M\|^2 \mathcal{L}_{perfM} = \mathcal{L}_{paraM}.   (21)

As M is a constant matrix and not related to the optimization problem w.r.t. S, we conclude from Eq. 21 that performance matching is equivalent to optimal parameter matching under this circumstance.

Single-Step Parameter Matching v.s. First-Moment Distribution Matching. In this part, we reveal connections between single-step parameter matching, i.e., gradient matching, and distribution matching based on the first moment, e.g., [24]. Some insights are adapted from Zhao et al. [24]. We have the following proposition:

Proposition 2. First-order distribution matching objective is approximately equal to gradient matching objective of each class for kernel ridge regression models following a random feature extractor.

Proof. Denote the parameter of the final linear layer as W = [w_0, w_1, ..., w_{C-1}], where w_c ∈ R^F is the weight vector connected to the c-th class. For each training sample x_j with class label c, we get the loss function as follows:

l = \|y_j - f_\theta(x_j) W\|^2 = \sum_{c'=0}^{C-1} \big(\mathbb{1}_{c'=c} - f_\theta(x_j) w_{c'}\big)^2,   (22)

where y_j is a one-hot vector with the c-th entry 1 and others 0, and 1 is the indicator function. Then, the partial derivative of the classification loss on the j-th sample w.r.t. the c'-th neuron is:

g_{c',j} = (p_{c',j} - \mathbb{1}_{c'=c}) \cdot f_\theta(x_j)^T,   (23)

where p_{c',j} = f_θ(x_j) w_{c'}. For the seminal works on gradient matching [22] and distribution matching [24], objective functions are calculated for each class separately. For the c-th class, the gradient of the c'-th weight vector w_{c'} is:

g_{c'} = \frac{1}{N_c}\sum_{j=1}^{N_c} g_{c',j}
       = \frac{1}{N_c}\sum_{j=1}^{N_c} (p_{c',j} - \mathbb{1}_{c'=c}) \cdot f_\theta(x_j)^T
       \approx (q - \mathbb{1}_{c'=c}) \frac{1}{N_c}\sum_{j=1}^{N_c} f_\theta(x_j)^T,   (24)

where the approximation is valid since the network parameterized by θ is randomly initialized and the prediction for each class is near uniform, denoted as q. The term (1/N_c) Σ_{j=1}^{N_c} f_θ(x_j)^T is actually the first moment of the output distribution for samples of the c-th class, i.e., μ_{θ,s,c} for S and μ_{θ,t,c} for T. Thus, first-moment distribution matching is approximately equivalent to gradient matching for each class in this case.

Single-Step Parameter Matching v.s. Second-Order Distribution Matching. The relationship between gradient matching and first-order distribution matching discussed above relies on an important assumption that the outputs of networks are uniformly distributed, which requires fully random neural networks without any learning bias. Here we show in the following proposition that without such a strong assumption, we can still bridge gradient matching and distribution matching.

Proposition 3. Second-order distribution matching objective optimizes an upper bound of gradient matching objective for kernel ridge regression models.
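The approximation in Eq. 24, which underlies Proposition 2, can also be checked numerically: for a near-zero random linear head on random features and a squared loss on one-hot targets, the averaged last-layer gradient of a class is close to a scaled mean feature of that class. The snippet below is a self-contained sanity check under these assumptions only (a factor of 2 from differentiating the squared loss is kept explicit here, whereas Eq. 23 absorbs it).

```python
import torch

torch.manual_seed(0)
N_c, F_dim, C, c = 512, 64, 10, 3           # features of class-c samples only

phi = torch.randn(N_c, F_dim)               # f_theta(x_j): random features
W = (0.001 * torch.randn(F_dim, C)).requires_grad_(True)
y = torch.zeros(N_c, C)
y[:, c] = 1.0                               # one-hot labels, all of class c

p = phi @ W                                 # predictions p_{c', j}
loss = ((y - p) ** 2).sum(dim=1).mean()     # Eq. 22, averaged over samples
(grad_W,) = torch.autograd.grad(loss, W)    # exact last-layer gradient, (F, C)

# Eq. 24: replace each per-sample prediction by its class average q_{c'},
# so column c' of the gradient becomes ~ 2 (q_{c'} - 1[c'=c]) * mean feature.
approx = 2.0 * phi.mean(0).unsqueeze(1) * (p.detach().mean(0) - y.mean(0)).unsqueeze(0)

rel_err = (grad_W - approx).norm() / grad_W.norm()
print(f"relative error of the first-moment approximation: {rel_err.item():.3f}")
```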
TABLE 1
Taxonomy of existing dataset distillation methods. ’v’ denotes some variant of an optimization objective here. Please refer to Fig. 5 for a more
intuitive explanation.
The gradient matching objective in this case satisfies:

\mathcal{L} = \Big\|\frac{1}{|\mathcal{S}|}\nabla l(f_\theta(X_s), W) - \frac{1}{|\mathcal{T}|}\nabla l(f_\theta(X_t), W)\Big\|^2
 = \Big\|\frac{1}{|\mathcal{S}|} f_\theta(X_s)^T (f_\theta(X_s)W - Y_s) - \frac{1}{|\mathcal{T}|} f_\theta(X_t)^T (f_\theta(X_t)W - Y_t)\Big\|^2
 = \Big\|\Big(\frac{1}{|\mathcal{S}|} f_\theta(X_s)^T f_\theta(X_s) - \frac{1}{|\mathcal{T}|} f_\theta(X_t)^T f_\theta(X_t)\Big)W - \Big(\frac{1}{|\mathcal{S}|} f_\theta(X_s)^T Y_s - \frac{1}{|\mathcal{T}|} f_\theta(X_t)^T Y_t\Big)\Big\|^2
 \le \Big\|\frac{1}{|\mathcal{S}|} f_\theta(X_s)^T f_\theta(X_s) - \frac{1}{|\mathcal{T}|} f_\theta(X_t)^T f_\theta(X_t)\Big\|^2 \|W\|^2 + \Big\|\frac{1}{|\mathcal{S}|} f_\theta(X_s)^T Y_s - \frac{1}{|\mathcal{T}|} f_\theta(X_t)^T Y_t\Big\|^2.   (25)

The r.h.s. of the inequality in Eq. 25 measures the difference between the first- and second-order statistics of synthetic and real data, i.e., mean and correlation. In this way, distribution matching essentially optimizes an upper bound of gradient matching.

3.4 Synthetic Data Parameterization

One of the essential goals of dataset distillation is to synthesize informative datasets to improve training efficiency, given a limited storage budget. In other words, for the same limited storage, more information from the original dataset is expected to be preserved, so that the model trained on the condensed dataset can achieve comparable and satisfactory performance. In the image classification task, the typical approach of dataset distillation is to distill the information of the dataset into a few synthetic images with the same resolution and number of channels as real images. However, due to the limited storage, the information carried by a small number of images is limited. Moreover, with exactly the same format as real data, it is unclear whether synthetic data contain useless or redundant information. Focusing on these concerns, and orthogonal to optimization objectives in DD, a series of works propose different ways of synthetic data parameterization. In a general form, for a synthetic dataset, some codes z ∈ Z ⊂ R^{D'}, Z = {(z_j, y_j)}_{j=1}^{|Z|}, in a format other than the raw shape are used for storage. And there is some function g_φ : R^{D'} → R^{D}, parameterized by φ, that maps a code with D' dimensions to the format of raw images for downstream training. In this case, the synthetic dataset is denoted as:

\mathcal{S} = (X_s, Y_s), \quad X_s = \{g_\phi(z_j)\}_{j=1}^{|\mathcal{S}|}.   (26)

As shown in Fig. 6, when training, φ and Z can be updated in an end-to-end fashion by further backpropagating the gradient for S to them, since the process of data generation is differentiable.

3.4.1 Differentiable Siamese Augmentation

Differentiable siamese augmentation (DSA) is a set of augmentation policies designed to improve data efficiency, including crop [1], cutout [99], flip, scale, rotate, and color jitter [1] operations. It is first applied to the dataset distillation task by Zhao et al. [23] to increase data efficiency and thus generalizability. Here, g_φ(·) is a family of image transformations parameterized with φ ∼ Φ, where φ is the parameter for data augmentation, and a code z still holds the same format as a raw image. It is differentiable, so that the synthetic data can be optimized via gradient descent. Moreover, the parameter for data augmentation is the same for synthetic and real samples within each iteration to avoid the averaging effect. This technique has been applied in many following DD methods, e.g., DM [24],
Fig. 5. Taxonomy of existing DD methods. A DD method can be analyzed from 4 aspects: optimization objective, fashion of updating networks, synthetic data parameterization, and fashion of learning labels.
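The generic parameterization of Eq. 26 can be sketched as follows: latent codes Z and a shared decoder g_φ are stored instead of raw pixels, and both receive gradients because the generation step is differentiable. The decoder architecture below is only a placeholder; HaBa, KFS, and the linear memory-addressing scheme of Eq. 29 each instantiate g_φ differently.

```python
import torch
import torch.nn as nn

class DecodedSyntheticSet(nn.Module):
    """Eq. 26: S = (X_s, Y_s) with X_s = { g_phi(z_j) }.
    Latent codes Z and decoder parameters phi are both learnable."""

    def __init__(self, num_codes, code_dim, im_shape=(3, 32, 32)):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, code_dim))   # Z
        out_dim = im_shape[0] * im_shape[1] * im_shape[2]
        self.decoder = nn.Sequential(                                 # g_phi
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )
        self.im_shape = im_shape

    def forward(self):
        x = self.decoder(self.codes)
        return x.view(-1, *self.im_shape)

# Usage: any objective L(S, T) from Section 3 backpropagates through
# forward() into both the codes and the decoder.
syn = DecodedSyntheticSet(num_codes=100, code_dim=64)
opt = torch.optim.Adam(syn.parameters(), lr=1e-3)
# x_syn = syn()                       # (100, 3, 32, 32) synthetic images
# loss = distillation_loss(x_syn, ...); loss.backward(); opt.step()
```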
accessed through learnable addressing matrices to construct the synthetic dataset. Thus, the size of the synthetic dataset does not necessarily correlate linearly with the number of classes, and the high compression rate achieves better data efficiency. In this case, the shared common memory, or basis, serves as the parameter φ for the mapping function g, and the address, or coefficient, serves as the latent code z for each sample. Specifically, assume that there are C classes and we want to retrieve R samples for each class. Then we have R addressing matrices A_j ∈ R^{C×K} for 1 ≤ j ≤ R, K is the number of basis vectors, and the memory is φ ∈ R^{K×D}, where D is the dimension of each sample. The j-th synthetic sample for the c-th class, x_{c,j}, is constructed by the following linear transformation:

z_{c,j} = y_c A_j, \quad x_{c,j} = z_{c,j} \phi,   (29)

where y_c is the corresponding one-hot vector. For optimization, the method proposes an improved meta-learning-based performance matching framework, with momentum used for inner loops.

Decoders. Beyond the linear transformation-based method by Deng et al. [84], HaBa [98] and KFS [70] adopt non-linear transformations for the mapping function g. HaBa uses the multi-step parameter matching method MTT [81] for optimization, while KFS is based on distribution matching [24]. Nevertheless, they use similar techniques for synthetic data parameterization. Specifically, there are |Z| latent codes (bases) and |Φ| neural decoders (hallucinators). Latent codes may include distinct knowledge for different samples, while decoders contain shared knowledge across all samples. A synthetic sample can be generated by sending the j-th latent code z_j to the k-th decoder parameterized by φ_k:

x_{j,k} = g_{\phi_k}(z_j),   (30)

where z_j ∈ Z, 1 ≤ j ≤ |Z|, and φ_k ∈ Φ, 1 ≤ k ≤ |Φ|. In other words, different latent codes and decoders can be combined interchangeably and arbitrarily, such that the size of the synthetic dataset can reach up to |Z| × |Φ|, which increases the data efficiency considerably.

3.5 Label Distillation Methods

In dataset distillation for image classification tasks, most methods fix the class labels Y_s in a one-hot format and only learn the synthetic images X_s. Some works [19], [20], [84], [100] find that making labels learnable can improve performance. Bohdal et al. [101] reveal that even only learning labels without learning images can achieve satisfactory performance. Moreover, Cui et al. [95] take soft labels predicted by a teacher model trained on real datasets, which is demonstrated to be significantly effective when condensing datasets with a large number of classes. Generally, as plug-and-play schemes, the ways of dealing with labels in DD are orthogonal to optimization objectives and compatible with all existing ones.

4 APPLICATIONS

Benefiting from its highly condensed nature, the research of dataset distillation has led to many successful innovative applications in various fields, such as continual learning. This section will introduce the applications of dataset distillation.

4.1 Continual Learning

Continual learning [102] aims to remedy the catastrophic forgetting problem caused by the inability of neural networks to learn a stream of tasks in sequence. A commonly used strategy is rehearsal, maintaining representative samples that can be re-used to retain knowledge about past tasks [15], [103], [104], [105]. In this case, how to preserve as much knowledge of previous tasks as possible in the buffer space, especially with limited memory, becomes the essential point. Due to its highly condensed nature, data condensation is able to capture the essence of the whole dataset, making it an inspiring application in continual learning. Several works [19], [22], [24], [25], [26] have directly applied dataset distillation techniques to the continual learning scenario: instead of selecting representative samples of the historical data, they train a synthetic dataset of the historical data to keep the knowledge of past tasks.

Besides, some synthetic dataset parameterization methods are also successfully applied to improve memory efficiency. Kim et al. [82] use data partition and upsampling, increasing the number of training samples under the same storage budget. Deng et al. [84] propose using addressable memories and addressing matrices to construct synthetic data of historical tasks, so that the memory cost does not increase linearly with the number of past tasks: the space to keep the knowledge of one specific past task can be M/T, where M is the buffer size and T is the number of tasks. Sangermano et al. [27] propose obtaining the synthetic dataset by linearly combining weighted historical images, where the optimization target is the coefficient matrix rather than the pixels of synthetic data. Wiewel et al. [28] propose a method that learns a weighted combination of shared features, which only need to be stored once, between samples of one particular class, rather than directly learning the synthetic dataset. Masarczyk et al. [29] use a generative model to create the synthetic dataset and form a sequence of tasks to train the model, and the parameters of the generative model are fine-tuned by the evaluation loss of the learner on real data.

4.2 Federated Learning

Federated learning (FL) [106], [107], [108] develops a privacy-preserving distributed model training schema such that multiple clients collaboratively learn a model without sharing their private data. A standard way for federated learning to keep data private is to transmit model updates instead of private user data. However, this may cause an increased communication cost, for the size of model updates may be very large, making it burdensome for clients to upload the updates to the server.

To remedy this problem and improve communication efficiency, some research [30], [31], [32], [33], [34], [35] on federated learning transmits locally generated synthetic datasets instead of model updates from clients to the server. The motivation is quite straightforward: model parameters are usually much larger than a small number of data points, which can reduce the costs significantly in
upload transmission from clients back to the server, and the synthetic dataset can preserve the essence of the whole original dataset, assuring that the global model can have a full view of the knowledge from all clients. More specifically, for the local update process, the clients adopt synthetic techniques to generate a synthetic dataset of their local private data to approximate the model gradient update in standard federated learning. It is worth noting that label distillation [101] is usually considered to prevent synthetic data from representing an actual label for privacy issues [30]. To further strengthen privacy protection, Zhou et al. [31] combine dataset distillation and distributed one-shot learning, such that for every local update step, each synthetic sample successively updates the network for one gradient descent step. Thus synthetic data are closely bonded with one specific network weight, and an eavesdropper cannot reproduce the result with only leaked synthetic data. Xiong et al. [32] adopt distribution matching [24] to generate the synthetic data and update synthetic data with the Gaussian mechanism to protect the privacy of local data. Song et al. [33] apply dataset distillation in one-shot federated learning and propose a novel evaluation metric, γ-accuracy gain, to tune the importance of accuracy and analyze communication efficiency. Liu et al. [34] develop two mechanisms for local updates: dynamic weight assignment, which assigns dynamic weights to each sample based on its training loss when training the synthetic dataset; and meta-knowledge sharing, which shares local synthetic datasets among clients to mitigate heterogeneous data distributions among clients.

In addition to reducing the communication cost of each round, dataset distillation can also be applied to reduce the number of communication epochs and thus the total communication consumption. Pi et al. [36] propose to use fewer rounds of standard federated learning to generate the synthetic dataset on the server by applying multi-step parameter matching on the global model trajectory, and then using the generated synthetic dataset to complete subsequent training.

4.3 Neural Architecture Search

Neural architecture search [13], [109] aims to discover the optimal structure of a neural network from a search space satisfying a specific requirement. However, it typically requires expensive training of numerous candidate neural network architectures on complete datasets.

In this case, dataset distillation, whose highly condensed nature keeps the essence of the whole dataset in a small set of data, can serve as a proxy set to accelerate model evaluation in neural architecture search, which is proved feasible in some works [22], [23], [24]. Besides, Such et al. [11] propose a method named generative teaching networks, combining a generative model and a learner to create a synthetic dataset that is learner agnostic, i.e., generalizes to different latent learner architectures and initializations.

4.4 Privacy, Security and Robustness

Machine learning suffers from a wide range of privacy attacks [110], e.g., model inversion attack [111], membership inference attack [112], and property inference attack [113]. Dataset distillation provides a perspective to start from the data alone, protect the privacy, and improve model robustness.

There are some straightforward applications of dataset distillation to protect the privacy of the dataset, for synthetic data may look unreal enough that the actual label cannot be recognized, and it is hard to reproduce the result with the synthetic dataset without knowing the architecture and initialization of the target model. For example, remote training [114] transmits the synthetic dataset with distilled labels instead of the original dataset to protect data privacy.

Further for data privacy protection, Dong et al. [16] point out that dataset distillation can offer privacy protection to prevent unintentional data leakage. They bring dataset distillation techniques into the privacy community and give a theoretical analysis of the connection between dataset distillation and differential privacy. Also, they validate empirically that the synthetic dataset is irreversible to original data in terms of the similarity metrics L2 and LPIPS [115]. Chen et al. [116] apply dataset distillation to generate high-dimensional data with differential privacy guarantees for private data sharing with lower memory and computation consumption costs.

As for improving model robustness, Tsilivis et al. [117] propose that dataset distillation can provide a new perspective for solving robust optimization. They combine adversarial training and KIP [20], [21] to optimize the training data instead of model parameters with high efficiency. It is beneficial that the optimized data can be deployed with other models and give favorable transfer properties. Also, it enjoys high robustness against PGD attacks [118]. Huang et al. [119] study the robust dataset learning problem such that the network trained on the dataset is adversarially robust. They formulate robust dataset learning as a min-max, tri-level optimization problem where the robust error of adversarial data on the robust-data-parameterized model is minimized.

Furthermore, traditional backdoor attacks, which inject triggers into the original data and use the malicious data to train the model, cannot work on the distilled synthetic dataset, as it is too small for injection and inspection can quickly mitigate such attacks. However, Liu et al. [120] propose a new backdoor attack method, DOORPING, which attacks during the dataset distillation process rather than the subsequent model training. Specifically, DOORPING continuously optimizes the trigger in every epoch before updating the synthetic dataset to ensure the trigger is preserved in the synthetic dataset. Experiment results show that nine defense mechanisms for backdoor attacks at three levels, i.e., model-level [121], [122], [123], input-level [124], [125], [126], and dataset-level [127], [128], [129], are unable to mitigate the attacks effectively.

4.5 Graph Neural Network

Graph neural networks (GNNs) [130], [131], [132], [133] are developed to analyze graph-structured data, which represent many real-world data such as social networks [134] and chemical molecules [135]. Despite their effectiveness, GNNs suffer from data hunger like traditional deep neural networks, such that large-scale datasets are required for them to learn powerful representations. Motivated by dataset distillation, graph-structured data can also be condensed into
a synthetic, simplified one to improve GNN training efficiency while preserving the performance.

Jin et al. [136] apply gradient matching [22] in graph condensation, condensing both graph structure and node attributes. It is worth noting that the graph structure and node features may have connections, such that node features can represent the graph structure matrix [137]. In this case, the optimization target is only node features, which avoids the quadratic increase of computation complexity with the number of synthetic graph nodes. Based on that, Jin et al. [138] propose a more efficient graph condensation method via one-step gradient matching without training the network weights. To handle the discrete data, they formulate the graph structure as a probabilistic model that can be learned in a differentiable manner. Liu et al. [139] adopt distribution matching [24] in graph condensation, synthesizing a small graph that shares a similar distribution of receptive fields with the original graph.

4.6 Recommender System

Recommender systems [140], [141] aim to give users personalized recommendations, content, or services through a large volume of dynamically generated information, such as item features, user past behaviors, and similar decisions made by other users. Recommender system research also faces a series of excessive resource consumption challenges caused by training models on massive datasets, which can involve billions of user-item interaction logs. At the same time, the security of user data privacy should also be considered [142]. In this case, Sachdeva et al. [143] develop a data condensation framework to condense the collaborative filtering data into small, high-fidelity data summaries. They take inspiration from KIP [20], [21] to build an efficient framework. To deal with the discrete nature of the recommendation problem, instead of directly optimizing the interaction matrix, this method learns a continuous prior for each user-item pair, and the interaction matrix is sampled from this learnable distribution.

4.7 Text Classification

Text classification [144] is one of the classical problems in natural language processing, which aims to assign labels or tags to textual units. In order to solve much more complex and challenging problems, the latest developed language models [145], which require massive datasets, are burdensome to train, fine-tune, and use. Motivated by dataset distillation, it is possible to generate a much smaller dataset that covers the knowledge of the original dataset and provides a practical dataset for models to train on. However, the discrete nature of text data makes dataset distillation a challenging problem. In this case, Li et al. [146] propose to generate word embeddings instead of actual text words to form the synthetic dataset, and adopt performance matching [18] as the data optimization target. Moreover, label distillation can also be applied to the text classification task. Sucholutsky et al. [100] propose to embed text into a continuous space and then train the embedded text data with the soft-label image distillation method.

4.8 Knowledge Distillation

Most of the existing KD studies [37], [38], [39], [147], [148] transfer the knowledge hints of the whole sample space during the training process, which neglects the changing capacity of the student model as training proceeds. Such redundant knowledge causes two issues: 1. more resource costs, e.g., memory storage and GPU time; 2. poor performance, distracting the attention of the student model from the proper knowledge and weakening the learning efficiency. In this case, Li et al. [149] propose a novel KD paradigm of knowledge condensation, performing knowledge condensation and model distillation alternately. The knowledge points are condensed according to their value, determined iteratively by the feedback from the student model.

4.9 Medical

Medical dataset sharing is crucial to establish the cross-hospital flow of medical information and improve the quality of medical services [150], e.g., constructing high-accuracy computer-aided diagnosis systems [151]. However, privacy protection [152], transmission, and storage costs are unneglectable issues behind medical dataset sharing. To solve this problem, Li et al. [153], [154], [155] leverage dataset distillation to extract the essence of the dataset and construct a much smaller anonymous synthetic dataset with different distributions for data sharing. They handle large-size and high-resolution medical datasets by dividing them into several patches and labeling them into different categories manually. Also, they successfully apply performance matching [18] and multi-step parameter matching [81], along with label distillation [101], to generate the synthetic medical dataset.

4.10 Fashion, Art and Design

The images produced by dataset distillation have a certain degree of visual appeal: they retain the characteristics of the category of objects and produce artistic fragments. Cazenavette et al. [156] propose a method that generates tileable distilled texture, which is aesthetically pleasing enough to be applied to practical tasks, e.g., clothing. They update the canvas by applying random crops on a padded distilled canvas and performing dataset distillation on those crops. For fashion compatibility learning, Chen et al. [157] propose using designer-generated data to guide outfit compatibility modeling. They extract disentangled features from the set of fashion outfits, generate the fashion graph, and leverage the designer-generated data through a dataset distillation scheme, which benefits fashion compatibility prediction.

5 EXPERIMENTS

In this section, we conduct two sets of quantitative experiments, performance and training cost evaluation, on some representative dataset distillation methods that cover the three classes of primary condensation metrics. The collection of methods includes Dataset Distillation (DD) [18], Dataset Condensation with Gradient Matching (DC) [22],
TABLE 2
Comparison for different dataset distillation methods on the same architecture with training. † adopt momentum [84]. For performance of all the
existing DD methods, see Appendix A.
Method
Dataset Img/Cls
DD† [18], [84] DC [22] DSA [23] DM [24] MTT [81] FRePo [19]
1 95.2 ± 0.3 91.7 ± 0.5 88.7 ± 0.6 89.9 ± 0.8 91.4 ± 0.9 93.8 ± 0.6
MNIST 10 98.0 ± 0.1 97.4 ± 0.2 97.9 ± 0.1 97.6 ± 0.1 97.3 ± 0.1 98.4 ± 0.1
50 98.8 ± 0.1 98.8 ± 0.2 99.2 ± 0.1 98.6 ± 0.1 98.5 ± 0.1 99.2 ± 0.1
1 83.4 ± 0.3 70.5 ± 0.6 70.6 ± 0.6 71.5 ± 0.5 75.1 ± 0.9 75.6 ± 0.5
Fashion-MNIST 10 87.6 ± 0.4 82.3 ± 0.4 84.8 ± 0.3 83.6 ± 0.2 87.2 ± 0.3 86.2 ± 0.3
50 87.7 ± 0.3 83.6 ± 0.4 88.8 ± 0.2 88.2 ± 0.1 88.3 ± 0.1 89.6 ± 0.1
1 46.6 ± 0.6 28.3 ± 0.5 28.8 ± 0.7 26.5 ± 0.4 46.3 ± 0.8 46.8 ± 0.7
CIFAR-10 10 60.2 ± 0.4 44.9 ± 0.5 53.2 ± 0.8 48.9 ± 0.6 65.3 ± 0.7 65.5 ± 0.6
50 65.3 ± 0.4 53.9 ± 0.5 60.6 ± 0.5 63.0 ± 0.4 71.6 ± 0.2 71.7 ± 0.2
1 19.6 ± 0.4 12.6 ± 0.4 13.9 ± 0.3 11.4 ± 0.3 24.3 ± 0.3 27.2 ± 0.4
CIFAR-100 10 32.7 ± 0.4 25.4 ± 0.3 32.3 ± 0.3 29.7 ± 0.3 40.1 ± 0.4 41.3 ± 0.2
50 35.0 ± 0.3 29.7 ± 0.3 42.8 ± 0.4 43.6 ± 0.4 47.7 ± 0.2 44.3 ± 0.2
1 - 5.3 ± 0.2 6.6 ± 0.2 3.9 ± 0.2 8.8 ± 0.3 15.4 ± 0.3
Tiny-ImageNet 10 - 11.1 ± 0.3 16.3 ± 0.2 13.5 ± 0.3 23.2 ± 0.2 24.9 ± 0.2
50 - 11.2 ± 0.3 25.3 ± 0.2 24.1 ± 0.3 28.2 ± 0.5 -
TABLE 3
Cross-architecture transfer performance on CIFAR-10 with 10 Img/Cls. ConvNet is the default evalutation model used for each method. NN, IN and
BN stand for no normalization, Instance Normalization and Batch Normalization respectively. † adopt momentum [84].
Evaluation Model
Train Arch
Conv Conv-NN AlexNet-NN AlexNet-IN ResNet18-IN ResNet18-BN VGG11-IN VGG11-BN
†
DD [18], [84] Conv-IN 60.2 ± 0.4 17.8 ± 2.7 11.4 ± 0.1 38.5 ± 1.3 33.9 ± 1.1 16.0 ± 1.2 40.0 ± 1.7 21.2 ± 2.4
DC [22] Conv-IN 44.9 ± 0.5 31.9 ± 0.6 27.3 ± 1.6 45.5 ± 0.3 43.3 ± 0.6 32.7 ± 0.9 43.7 ± 0.6 37.3 ± 1.1
DSA [23] Conv-IN 53.2 ± 0.8 36.4 ± 1.5 34.0 ± 2.3 45.5 ± 0.6 42.3 ± 0.9 34.9 ± 0.5 43.1 ± 0.9 40.6 ± 0.5
DM [24] Conv-IN 49.2 ± 0.8 35.2 ± 0.5 34.9 ± 1.1 44.2 ± 0.9 40.2 ± 0.8 40.1 ± 0.8 41.7 ± 0.7 43.9 ± 0.4
MTT [81] Conv-IN 64.4 ± 0.9 41.6 ± 1.3 34.2 ± 2.6 51.9 ± 1.3 45.8 ± 1.2 42.9 ± 1.5 48.5 ± 0.8 45.4 ± 0.9
FRePo [19] Conv-BN 65.6 ± 0.6 65.6 ± 0.6 58.2 ± 0.5 39.0 ± 0.3 47.4 ± 0.7 53.0 ± 1.0 35.0 ± 0.7 56.8 ± 0.6
Differentiable Siamese Augmentation (DSA) [23], Distribution Matching (DM) [24], Matching Training Trajectories (MTT) [81], and Neural Feature Regression with Pooling (FRePo) [19].

5.1 Experimental Setup

The performance of dataset distillation methods is mainly evaluated on the classification task. The following are the details of our experiment settings.

Datasets. Here we adopt five datasets, including
• 28 × 28 MNIST [158] and Fashion-MNIST [159],
• 32 × 32 CIFAR-10 [160] and CIFAR-100 [160], and
• 64 × 64 Tiny-ImageNet [161]
as the evaluation datasets. These datasets are widely used as benchmarks in many works on dataset distillation.

Networks. We use the default ConvNet [1] architecture provided by the authors of each method, in particular keeping their default choice of normalization layer. As shown in Table 3, DD, DC, DSA, DM, and MTT adopt the instance normalization layer, while FRePo adopts batch normalization. Also, to gain a deeper understanding of the transferability of dataset distillation methods, we evaluate the performance of synthetic datasets generated by different methods on various heterogeneous network architectures. The network architectures of the evaluation models are as follows:
• ConvNet [1] with no normalization layer,
• AlexNet [1] with no normalization layer and with an instance normalization layer,
• ResNet [4] with an instance normalization layer and with a batch normalization layer,
• VGG [162] with an instance normalization layer and with a batch normalization layer.

Evaluation Protocol. We evaluate the quality of the synthetic datasets generated by the selected methods under the default evaluation strategies provided by the authors. More specifically, for performance evaluation, we first generate the synthetic dataset through the selected method. Then, we train the target network using the generated synthetic dataset. Finally, the model trained on the synthetic dataset is evaluated on the test set of the corresponding original dataset. As for training cost evaluation, all methods are evaluated under full batch training for fair comparisons.
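To make the protocol concrete, the following is a minimal sketch of this train-on-synthetic / test-on-real loop, assuming placeholder tensors x_syn and y_syn, a conv_net factory, and a real test_loader; it is illustrative only and not the official evaluation code of any of the compared methods.

```python
# Minimal sketch of the evaluation protocol (not any method's official code):
# train a randomly initialized network on the synthetic set, then report its
# accuracy on the real test set. `conv_net`, `x_syn`, `y_syn` are placeholders.
import torch
import torch.nn.functional as F

def evaluate_synthetic(x_syn, y_syn, test_loader, conv_net,
                       epochs=300, lr=0.01, device="cuda"):
    model = conv_net().to(device)                      # fresh evaluation model
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    x_syn, y_syn = x_syn.to(device), y_syn.to(device)
    for _ in range(epochs):                            # the synthetic set is tiny, so
        opt.zero_grad()                                # full-batch training is feasible
        loss = F.cross_entropy(model(x_syn), y_syn)
        loss.backward()
        opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():                              # accuracy on the real test set
        for x, y in test_loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
    return correct / total
```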
Regarding the data augmentation strategies applied to the synthetic dataset when training the homogeneous-architecture evaluation models, DD and DC do not adopt any data augmentation, while DSA, DM, MTT, and FRePo use DSA augmentation [23] as the default setting provided by the authors. For the performance evaluation on heterogeneous-architecture models, all methods adopt the DSA data augmentation strategy for fair comparisons. DSA is a set of augmentation policies originally designed to improve the data efficiency of GANs, including crop, cutout, flip, scale, rotate, and color-jitter operations. DC augmentation is a subset of DSA, including crop, scale, rotate, and noise operations.

For compression ratios, we measure the performance of condensed datasets on homogeneous networks with 1, 10, and 50 images per class (IPC), respectively. We conduct the performance evaluation on heterogeneous networks on CIFAR-10 with 10 IPC. Efficiency evaluation experiments are conducted on the CIFAR-10 dataset but over a wide range of IPC.

Implementation Details. Here, we run DC, DSA, DM, MTT, and FRePo using the official code provided by the authors. For DD, we modify the inner-loop optimization by adding a momentum term: as shown in [84], adding momentum leads to a substantial performance boost regardless of the number of inner loops. FRePo is implemented in both JAX and PyTorch, and the rest of the methods are implemented in PyTorch. For fair comparisons, all efficiency evaluation experiments are conducted on a single A100-SXM4-40GB GPU.
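As an illustration of this modification to DD, the sketch below shows a differentiable inner loop with a heavy-ball momentum buffer, assuming a functional forward_fn and a list of initial parameters; it is a simplified stand-in for, not a copy of, the implementation in [84].

```python
# Hedged sketch of a momentum-augmented inner loop for DD-style performance
# matching (illustrative, not the authors' exact code [84]). The updates are
# kept differentiable (create_graph=True) so the outer matching loss can be
# backpropagated to the synthetic images x_syn.
import torch
import torch.nn.functional as F

def unrolled_inner_loop(init_params, forward_fn, x_syn, y_syn,
                        steps=10, lr=0.01, momentum=0.9):
    params = [p.detach().clone().requires_grad_(True) for p in init_params]
    bufs = [torch.zeros_like(p) for p in params]        # momentum buffers
    for _ in range(steps):
        loss = F.cross_entropy(forward_fn(params, x_syn), y_syn)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # heavy-ball update; with momentum=0.0 this reduces to plain SGD
        bufs = [momentum * b + g for b, g in zip(bufs, grads)]
        params = [p - lr * b for p, b in zip(params, bufs)]
    return params  # weights after unrolled training on the synthetic set
```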
5.2 Performance Evaluation

The distillation performance of the selected dataset distillation methods is evaluated on both homogeneous and heterogeneous networks. We learn 1, 10, and 50 IPC to evaluate performance on homogeneous networks for the five benchmark datasets, and we learn 10 images per class of CIFAR-10 to measure the cross-architecture generalization of the chosen methods on four network architectures with different normalization layers.

Distillation Performance. We evaluate the distillation performance of the selected dataset distillation methods on models with the same architecture as the default training models. The experiment settings are the defaults provided by the authors. The testing results are shown in Table 2; for the performance of all existing DD methods, see Appendix A. Due to the limitation of GPU memory, some experimental results for Tiny-ImageNet are missing. As shown in Table 2, in most cases FRePo achieves state-of-the-art performance, especially for the more complex datasets, e.g., CIFAR-10, CIFAR-100, and Tiny-ImageNet. At the same time, MTT often achieves the second-best performance. Comparing the results of DC and DSA, although data augmentation cannot be guaranteed to benefit distillation performance, it influences the results significantly. Thanks to the momentum term, the performance of DD is greatly improved, and it obtains the SOTA performance in the cases of MNIST 1 IPC and Fashion-MNIST 1 and 10 IPC. The performance of DM is often not as good as that of the other methods.

Cross-Architecture Generalization. The transferability of the different condensation methods is evaluated on a wide range of network architectures with different normalization layers, unseen during training, in the case of CIFAR-10 with 10 IPC. The experimental setup for the evaluation is consistent with the official one provided by the authors. Here, DC uses DC augmentation (including crop, scale, rotate, and noise operations, where the noise operation is added for the network architectures with a batch normalization layer), as the official code implies. The testing results are shown in Table 3. From the results, the instance normalization layer seems to be a vital ingredient in several methods (DD, DC, DSA, DM, and MTT), which may be harmful to transferability. The performance degrades significantly for most methods except FRePo when no normalization is adopted (Conv-NN, AlexNet-NN). Also, most methods except DM are susceptible to the normalization layers adopted by the evaluation models: if the normalization layer is inconsistent with that of the training model, the test results drop significantly. This suggests that the synthetic datasets generated by those methods encode the inductive bias of a particular training architecture. However, as DM does not need to update the training model while generating the condensed data, the influence of changing architectures is relatively low. As for DD, the condensed data rely heavily on the training model, such that the performance degrades significantly on heterogeneous networks, and it is even unable to train models with different normalization layers, e.g., batch normalization. Comparing the transferability of DC and DSA, it can be found that DSA data augmentation helps train models with different architectures, especially those with batch normalization layers.
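The evaluation models above differ mainly in their normalization layers; the following hypothetical factory sketches how such variants (no normalization, instance normalization, batch normalization) can be instantiated for a ConvNet-style backbone. The widths and depths here are illustrative assumptions, not the exact architectures used in the experiments.

```python
# Hedged sketch of evaluation models with different normalization layers for the
# cross-architecture study (hypothetical helper, not the official model code).
import torch.nn as nn

def conv_block(in_ch, out_ch, norm="in"):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)]
    if norm == "in":                                   # e.g., Conv-IN, VGG11-IN
        layers.append(nn.InstanceNorm2d(out_ch, affine=True))
    elif norm == "bn":                                 # e.g., Conv-BN, ResNet18-BN
        layers.append(nn.BatchNorm2d(out_ch))
    # norm == "nn": no normalization (e.g., Conv-NN, AlexNet-NN)
    layers += [nn.ReLU(inplace=True), nn.AvgPool2d(2)]
    return nn.Sequential(*layers)

def make_convnet(norm="in", width=128, depth=3, num_classes=10, in_ch=3, img=32):
    blocks, ch = [], in_ch
    for _ in range(depth):
        blocks.append(conv_block(ch, width, norm))
        ch = width
    feat = img // (2 ** depth)                         # spatial size after pooling
    return nn.Sequential(*blocks, nn.Flatten(),
                         nn.Linear(width * feat * feat, num_classes))
```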
Fig. 7. (a) Run-time per loop (including the time required for updating synthetic data and networks in one loop). (b) Run-time per step (including only the time required for updating synthetic data). (c) Peak GPU memory usage (during the whole process), plotted against images per class for DD, DC, DSA, DM, MTT, FRePo-JAX, and FRePo-PyTorch. All evaluated under full batch training for CIFAR-10.
5.3 Training Cost Evaluation

We mainly focus on run time and peak GPU memory for training cost evaluation. We evaluate the time the selected dataset distillation methods take to perform one gradient step on the synthetic dataset, and the time of a whole epoch of synthetic dataset updating, which may include the training network updates for some methods. We also compare the peak GPU memory usage of the selected methods during the dataset distillation process. For fair comparisons, in this section the number of inner loops, if any, is entirely based on the experiment settings provided by the authors, since the training cost is only meaningful when compared under the default settings that achieve the performance shown in Table 2. Moreover, for fair comparisons, the evaluation is conducted under full batch training.

Run-Time Evaluation. We evaluate the required time for one entire epoch of the outer loop and the time to perform one gradient step on the synthetic dataset, respectively. The main difference is whether the run time of updating the training network is included. The run time of calculating the gradient on the synthetic dataset differs due to the different dataset distillation metrics and the various model training strategies. As for DD, the synthetic dataset is generated based on the training model updates; thus, the required time for DD to process an entire outer loop and to perform one gradient step on the synthetic dataset is the same. For MTT, the model training process is unrolled, so the two kinds of run time are also the same. Likewise, as DM does not update the model, the required time for these two processes is the same. The evaluation results are shown in Fig. 7.

The results show that DD requires a significantly longer run time to update the synthetic dataset for a single epoch. The reason is that the gradient computation for the performance matching loss over the synthetic dataset involves bi-level optimization, and PyTorch provides no usable acceleration mechanism for the gradient computation in this case. DC, DSA, and FRePo implemented in PyTorch have similar run times. Comparing DSA and DC, the use of DSA augmentation makes the running time slightly longer, but it significantly improves the results, so DSA is a good strategy for improving the effectiveness of the algorithm. When the IPC is small, FRePo (both the JAX and the PyTorch version) is significantly faster than the other methods, but as the IPC increases, the PyTorch version becomes similar to the second-fastest group of methods, and the JAX version approaches DM. MTT runs the second slowest, but there is no more data to analyze because it runs out of memory.
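The per-step and per-loop timings can be collected with a simple synchronized timer; the sketch below is an assumed measurement harness (update_synthetic_data and one_outer_loop are placeholders), not the exact benchmarking script used here.

```python
# Hedged sketch of the timing measurement (assumed methodology). CUDA kernels
# run asynchronously, so we synchronize before reading the clock.
import time
import torch

def time_one(fn, *args, warmup=3, iters=10, device="cuda"):
    for _ in range(warmup):                 # warm-up excludes one-time setup costs
        fn(*args)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize(device)
    return (time.perf_counter() - start) / iters

# usage (placeholders):
#   step_time = time_one(update_synthetic_data, x_syn, y_syn)   # per gradient step
#   loop_time = time_one(one_outer_loop, x_syn, y_syn)          # per outer-loop epoch
```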
Peak GPU Memory Usage. We also evaluate the peak GPU memory usage during the dataset distillation process. Here, we record the peak GPU memory in actual operation. Since JAX pre-allocates 90% of the GPU memory by default, to facilitate testing we disable the default pre-allocation mechanism and switch to on-demand allocation. Evaluation results are shown in Fig. 7.

The results show that DM requires the least GPU memory, while MTT requires the most: when IPC is only 50, MTT already runs out of memory. In addition, with the gradual growth of IPC, the JAX version of FRePo reveals its advantage in memory efficiency. DD and the PyTorch version of FRePo require relatively large GPU memory: the former needs to backpropagate through the entire training graph, while the latter adopts a wide network. In addition, there is no significant difference in the GPU memory required by DC and DSA, indicating that DSA augmentation is space-friendly.
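A sketch of how the peak-memory numbers can be recorded in PyTorch is given below; the JAX comment refers to the standard XLA_PYTHON_CLIENT_PREALLOCATE switch for disabling pre-allocation. This is an assumed methodology, not the exact script used for Fig. 7.

```python
# Hedged sketch of the peak-memory measurement (assumed methodology).
# For the JAX runs, pre-allocation can be disabled with
#   export XLA_PYTHON_CLIENT_PREALLOCATE=false
# so that memory is allocated on demand and the true peak can be observed.
import torch

def peak_gpu_memory_mb(fn, *args, device="cuda"):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)   # clear previous peak statistics
    fn(*args)                                    # e.g., one distillation outer loop
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```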
5.4 Empirical Studies

The following conclusions can be drawn from the above experimental results:
• The performance of DD is significantly improved by the momentum term. However, the performance evaluation on heterogeneous networks shows that DD has relatively poor generalizability. In addition, DD requires considerable time and GPU memory during training.
• Comparing DC and DSA, the performance evaluation shows that adopting the DSA data augmentation strategy can significantly improve the performance of the synthetic dataset, whether evaluated on homogeneous or heterogeneous networks. Moreover, adopting DSA does not significantly increase the running time or memory usage.
• DM does not perform as well as the other methods in the homogeneous-network evaluation. However, since the training process of DM is network-independent, it generalizes well to heterogeneous networks. Further, DM has significant advantages regarding run time and GPU memory requirements.
• The MTT algorithm generally achieves the second-best performance but requires a large amount of running time and memory because of the unrolled gradient computation through backpropagation.
• FRePo achieves SOTA performance when IPC is small, in terms of both accuracy and training cost. However, as IPC increases, FRePo no longer has comparable performance.

6 CHALLENGES AND POSSIBLE IMPROVEMENTS

The research on dataset distillation has promising prospects, and many algorithms have been applied in various fields. Although the existing methods have achieved competitive performance, there are still some challenges and issues. In this section, we summarise the key challenges and discuss possible directions for improvement.

TABLE 4
The performance results of different dataset distillation methods on ImageNet-1K. Results are derived from [95] and our experiments. † adopt momentum [84].
Method | IPC=1 | IPC=2 | IPC=10 | IPC=50
DD† [18], [84] | - | - | - | -
DC [22] | - | - | - | -
DSA [23] | - | - | - | -
DM [24] | 1.5 ± 0.1 | 1.7 ± 0.1 | - | -
MTT [81] | - | - | - | -
FRePo [19] | 7.5 ± 0.3 | 9.7 ± 0.2 | - | -
TESLA [95] | 7.7 ± 0.2 | 10.5 ± 0.3 | 17.8 ± 1.3 | 27.9 ± 1.2

Fig. 8. Performance comparison with different compression ratios on CIFAR-10 using Random selection, K-Center, DD, DC, DSA, DM, MTT, and FRePo-JAX. Results are derived from [75] and our experiments. [Plot: testing accuracy (%) versus images per class (0–1000), with the whole-dataset accuracy included.]

TABLE 5
The evaluation performance results on CIFAR-10 with 10 Img/Cls for more extensive networks. † adopt momentum [84].
Method | ResNet18 | ResNet34 | ResNet50 | ResNet101 | ResNet152
DD† [18], [84] | 33.9 ± 1.1 | 32.2 ± 2.7 | 19.0 ± 2.0 | 18.2 ± 1.4 | 12.3 ± 0.7
DC [22] | 43.3 ± 0.6 | 35.4 ± 1.3 | 23.7 ± 0.4 | 17.6 ± 1.0 | 16.3 ± 1.0
DSA [23] | 42.3 ± 0.9 | 33.2 ± 1.0 | 22.9 ± 0.7 | 18.7 ± 1.3 | 15.7 ± 1.0
DM [24] | 39.5 ± 0.6 | 31.2 ± 1.1 | 23.9 ± 0.6 | 17.7 ± 1.3 | 16.8 ± 1.4
MTT [81] | 45.8 ± 1.2 | 34.6 ± 3.3 | 22.5 ± 0.4 | 18.1 ± 2.0 | 18.0 ± 1.1
FRePo [19] | 47.4 ± 0.7 | 41.3 ± 2.0 | 41.1 ± 1.8 | 28.7 ± 1.2 | 22.7 ± 1.0

6.1 Scaling Up

The existing dataset distillation methods mainly evaluate the performance of the synthetic dataset up to about 50 IPC. At the same time, the original datasets are focused on some
synthetic dataset. Among them, the choice of normalization layer significantly impacts transferability.

To remedy this problem, one intuitive idea is to unbind the network and the synthetic dataset, as in DM [24] and IDC [82]: DM treats the network as a feature extractor without updating it, and IDC updates the training network on the real dataset instead of the synthetic dataset. The results show that both strategies are effective. However, DM has relatively poor performance, and IDC needs extra run time to train the networks on a larger dataset (especially when condensing large-scale datasets, e.g., ImageNet). As the example given by [93] shows, generating 500 synthetic images for CIFAR-10 with the IDC [82] method requires approximately 30 hours, while in the same amount of time, 60 ConvNet-3 models can be trained on the original dataset. Another way is to increase the variety of the candidate training model pool, e.g., MTT [81] and FRePo [19]. However, this kind of strategy costs extra GPU time to prepare the model pool and extra memory to store these models. Recently, Zhang et al. [93] adopted model augmentation strategies to increase the diversity of the model pool while using less memory and computation, which may be a good direction for this problem.
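The sketch below illustrates this unbinding in a DM-style distribution-matching step, where a freshly initialized embedding network is drawn at every iteration and never updated; make_embedder, the real batch (x_real, y_real), and the optimizer over x_syn are placeholder assumptions, and the code is not the official DM implementation [24].

```python
# Hedged sketch of a DM-style distribution-matching step (illustrative only):
# the embedding network is sampled fresh each step and never trained, so the
# synthetic set is not tied to any particular trained model.
import torch

def dm_style_step(x_syn, y_syn, x_real, y_real, make_embedder, opt, device="cuda"):
    net = make_embedder().to(device).eval()          # random network, never updated
    opt.zero_grad()
    loss = 0.0
    for c in y_syn.unique():                         # match class-wise mean embeddings
        f_syn = net(x_syn[y_syn == c]).mean(dim=0)
        f_real = net(x_real[y_real == c]).mean(dim=0).detach()
        loss = loss + ((f_syn - f_real) ** 2).sum()
    loss.backward()                                  # gradients reach only x_syn
    opt.step()                                       # opt optimizes [x_syn]
    return float(loss)
```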
6.3 Computational Cost

The motivation of DD is to train a small, informative dataset such that a model trained on it achieves performance comparable to training on the original dataset, thereby reducing GPU time. However, generating the synthetic dataset is itself typically expensive, and the required time increases rapidly when condensing large-scale datasets. This high computational cost is mainly caused by the bi-level optimization process, as computing the matching loss w.r.t. the synthetic dataset requires backpropagating through the entire training graph.
To tackle this problem, many works have been proposed within different DD frameworks to reduce the computational cost. One intuitive way is to avoid the inner-loop network training. For performance matching, KRR-based methods [19], [20], [21], [93] are proposed to improve the computational efficiency of DD [18]: they perform convex optimization that yields a closed-form solution for a linear model, which avoids extensive inner-loop training.
As for distribution matching, DM [24] improves efficiency by using networks as feature extractors without updating them, which avoids the expensive bi-level optimization; however, this efficiency comes at the cost of performance. Besides the above improvements that avoid the inner loop of network updating, there are also directions from other aspects. As existing gradient matching methods are computationally expensive when aiming for synthetic datasets with satisfactory generalizability, Zhang et al. [93] propose model augmentation methods to reduce the computational cost of forming a diverse candidate model pool. Multi-step parameter matching in MTT [81] has good performance but high memory costs due to the unrolled computational graph for network updates. To solve this problem in MTT, Cui et al. [95] re-organize the computation flow for the gradients, which reduces the memory cost to constant complexity.

6.4 Design for Other Tasks and Applications

The existing DD methods mainly focus on the classification task. Here, we expect future works to apply DD, with sophisticated designs, to more tasks in computer vision, e.g., semantic segmentation [164], [165] and object detection [166], in natural language processing, e.g., machine translation [167], and in multi-modal scenarios [6].

6.5 Security and Privacy

Existing works focus on improving the methods to generate more informative synthetic datasets while neglecting the potential security and privacy issues of DD. A recent work [120] provides a new backdoor attack method, DOORPING, which attacks during the dataset distillation process rather than during the subsequent model training, as introduced in section 4.4, and shows that nine defense mechanisms against backdoor attacks at three levels, i.e., model-level [121], [122], [123], input-level [124], [125], [126], and dataset-level [127], [128], [129], cannot effectively mitigate the attack. Possible defense mechanisms and further security and privacy issues are left for future work.

7 CONCLUSION

This paper aims at a comprehensive review of dataset distillation (DD), a recently popular research topic, which synthesizes a small dataset given an original large one for the sake of similar performance. We present a systematic taxonomy of existing DD methods and categorize them into three major streams by the optimization objective: performance matching, parameter matching, and distribution matching. Theoretical analysis reveals their underlying connections. For general understanding, we abstract a common algorithmic framework for all current approaches. Applications of DD to various topics, including continual learning, neural architecture search, privacy, etc., are also covered. Different approaches are experimentally compared in terms of accuracy, time efficiency, and scalability, indicating some critical challenges in this area left for future research.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. 1, 9, 14
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018. 1
[3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in International conference on machine learning. PMLR, 2016, pp. 173–182. 1
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. 1, 14
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020. 1
[6] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763. 1, 18
[7] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hi- [28] F. Wiewel and B. Yang, “Condensed composite memory continual
erarchical text-conditional image generation with clip latents,” learning,” in 2021 International Joint Conference on Neural Networks
arXiv preprint arXiv:2204.06125, 2022. 1 (IJCNN). IEEE, 2021, pp. 1–8. 2, 11
[8] C. Chen, Y. Zhang, J. Fu, X. Liu, and M. Coates, “Bidirectional [29] W. Masarczyk and I. Tautkute, “Reducing catastrophic forgetting
learning for offline infinite-width model-based optimization,” with learning on synthetic data,” in Proceedings of the IEEE/CVF
in Thirty-Sixth Conference on Neural Information Processing Systems, Conference on Computer Vision and Pattern Recognition Workshops,
2022. [Online]. Available: https://openreview.net/forum?id= 2020, pp. 252–253. 2, 11
j8yVIyp27Q 1 [30] J. Goetz and A. Tewari, “Federated learning via synthetic data,”
[9] D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based arXiv preprint arXiv:2008.04489, 2020. 2, 11, 12
hyperparameter optimization through reversible learning,” in [31] Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, “Distilled one-shot
International conference on machine learning. PMLR, 2015, pp. federated learning,” arXiv preprint arXiv:2009.07999, 2020. 2, 11,
2113–2122. 1, 3 12
[10] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions [32] Y. Xiong, R. Wang, M. Cheng, F. Yu, and C.-J. Hsieh, “Feddm:
of hyperparameters by implicit differentiation,” in International Iterative distribution matching for communication-efficient fed-
Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. erated learning,” arXiv preprint arXiv:2207.09653, 2022. 2, 11, 12
1540–1552. 1, 3, 4 [33] R. Song, D. Liu, D. Z. Chen, A. Festag, C. Trinitis, M. Schulz, and
[11] F. P. Such, A. Rawal, J. Lehman, K. Stanley, and J. Clune, “Gener- A. Knoll, “Federated learning via decentralized dataset distilla-
ative teaching networks: Accelerating neural architecture search tion in resource-constrained edge environments,” arXiv preprint
by learning to generate synthetic training data,” in International arXiv:2208.11311, 2022. 2, 11, 12
Conference on Machine Learning. PMLR, 2020, pp. 9206–9216. 1, [34] P. Liu, X. Yu, and J. T. Zhou, “Meta knowledge condensation for
12 federated learning,” arXiv preprint arXiv:2209.14851, 2022. 2, 11,
[12] L. Li and A. Talwalkar, “Random search and reproducibility for 12
neural architecture search,” in Uncertainty in artificial intelligence. [35] S. Hu, J. Goetz, K. Malik, H. Zhan, Z. Liu, and Y. Liu, “Fedsynth:
PMLR, 2020, pp. 367–377. 1 Gradient compression via synthetic data in federated learning,”
[13] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: arXiv preprint arXiv:2204.01273, 2022. 2, 11
A survey,” The Journal of Machine Learning Research, vol. 20, no. 1, [36] R. Pi, W. Zhang, Y. Xie, J. Gao, X. Wang, S. Kim, and Q. Chen,
pp. 1997–2017, 2019. 1, 12 “Dynafed: Tackling client data heterogeneity with global dynam-
[14] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Ben- ics,” arXiv preprint arXiv:2211.10878, 2022. 2, 12
gio, “An empirical investigation of catastrophic forgetting in [37] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in
gradient-based neural networks,” arXiv preprint arXiv:1312.6211, a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7,
2013. 1 2015. 2, 13
[15] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: [38] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Incremental classifier and representation learning,” in Proceedings Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint
of the IEEE conference on Computer Vision and Pattern Recognition, arXiv:1412.6550, 2014. 2, 13
2017, pp. 2001–2010. 1, 11 [39] S. Zagoruyko and N. Komodakis, “Paying more attention to
[16] T. Dong, B. Zhao, and L. Lyu, “Privacy for free: How does dataset attention: Improving the performance of convolutional neural
condensation help privacy?” arXiv preprint arXiv:2206.00240, networks via attention transfer,” arXiv preprint arXiv:1612.03928,
2022. 1, 12 2016. 2, 13
[17] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” [40] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation:
in Proceedings of the 22nd ACM SIGSAC conference on computer and A survey,” International Journal of Computer Vision, vol. 129, no. 6,
communications security, 2015, pp. 1310–1321. 1 pp. 1789–1819, 2021. 2
[18] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros, “Dataset distilla- [41] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa,
tion,” arXiv preprint arXiv:1811.10959, 2018. 1, 2, 3, 4, 9, 13, 14, 17, and H. Ghasemzadeh, “Improved knowledge distillation via
18, 23, 24 teacher assistant,” in Proceedings of the AAAI conference on artificial
[19] Y. Zhou, E. Nezhadarya, and J. Ba, “Dataset distillation using intelligence, vol. 34, no. 04, 2020, pp. 5191–5198. 2
neural feature regression,” arXiv preprint arXiv:2206.00719, 2022. [42] K. Xu, L. Rui, Y. Li, and L. Gu, “Feature normalized knowledge
2, 4, 5, 8, 9, 11, 14, 17, 18, 23, 24 distillation for image classification,” in European Conference on
[20] T. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from Computer Vision. Springer, 2020, pp. 664–680. 2
kernel ridge-regression,” arXiv preprint arXiv:2011.00050, 2020. 2, [43] X. Wang, T. Fu, S. Liao, S. Wang, Z. Lei, and T. Mei, “Exclusivity-
5, 9, 11, 12, 13, 18, 23, 24 consistency regularized knowledge distillation for face recogni-
[21] T. Nguyen, R. Novak, L. Xiao, and J. Lee, “Dataset distillation tion,” in European Conference on Computer Vision. Springer, 2020,
with infinitely wide convolutional networks,” Advances in Neural pp. 325–342. 2
Information Processing Systems, vol. 34, pp. 5186–5198, 2021. 2, 5, [44] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher
9, 12, 13, 18, 23, 24 networks,” in Proceedings of the 23rd ACM SIGKDD International
[22] B. Zhao, K. R. Mopuri, and H. Bilen, “Dataset condensation with Conference on Knowledge Discovery and Data Mining, 2017, pp.
gradient matching.” ICLR, vol. 1, no. 2, p. 3, 2021. 2, 4, 5, 6, 8, 9, 1285–1294. 2
11, 12, 13, 14, 17, 23, 24 [45] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge dis-
[23] B. Zhao and H. Bilen, “Dataset condensation with differentiable tillation,” in Proceedings of the IEEE/CVF Conference on Computer
siamese augmentation,” in International Conference on Machine Vision and Pattern Recognition, 2019, pp. 3967–3976. 2
Learning. PMLR, 2021, pp. 12 674–12 685. 2, 4, 6, 9, 12, 14, 15, 17, [46] J. Liu, D. Wen, H. Gao, W. Tao, T.-W. Chen, K. Osa, and M. Kato,
23, 24 “Knowledge representing: efficient, sparse representation of prior
[24] Bo Zhao and Hakan Bilen, “Dataset condensation with distribu- knowledge for knowledge distillation,” in Proceedings of the
tion matching,” CoRR, vol. abs/2110.04181, 2021. 2, 7, 8, 9, 11, 12, IEEE/CVF Conference on Computer Vision and Pattern Recognition
13, 14, 17, 18, 23, 24 Workshops, 2019, pp. 0–0. 2
[25] Y. Liu, Y. Su, A.-A. Liu, B. Schiele, and Q. Sun, “Mnemonics [47] X. Wang, R. Zhang, Y. Sun, and J. Qi, “Kdgan: Knowledge
training: Multi-class incremental learning without forgetting,” distillation with generative adversarial networks,” Advances in
in Proceedings of the IEEE/CVF conference on Computer Vision and neural information processing systems, vol. 31, 2018. 2
Pattern Recognition, 2020, pp. 12 245–12 254. 2, 11 [48] R. G. Lopes, S. Fenu, and T. Starner, “Data-free knowledge distil-
[26] A. Rosasco, A. Carta, A. Cossu, V. Lomonaco, and D. Bac- lation for deep neural networks,” arXiv preprint arXiv:1710.07535,
ciu, “Distilled replay: Overcoming forgetting through synthetic 2017. 2
samples,” in International Workshop on Continual Semi-Supervised [49] G. Fang, K. Mo, X. Wang, J. Song, S. Bei, H. Zhang, and M. Song,
Learning. Springer, 2022, pp. 104–117. 2, 11 “Up to 100x faster data-free knowledge distillation,” in Proceed-
[27] M. Sangermano, A. Carta, A. Cossu, and D. Bacciu, “Sample ings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 6,
condensation in online continual learning,” in 2022 International 2022, pp. 6597–6604. 2
Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 01– [50] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely
08. 2, 11 efficient convolutional neural network for mobile devices,” in
Proceedings of the IEEE conference on computer vision and pattern International Conference on Machine Learning. PMLR, 2017, pp.
recognition, 2018, pp. 6848–6856. 2 1165–1173. 3
[51] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, “Be your [74] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pon-
own teacher: Improve the performance of convolutional neural til, “Bilevel programming for hyperparameter optimization and
networks via self distillation,” in Proceedings of the IEEE/CVF meta-learning,” in International Conference on Machine Learning.
International Conference on Computer Vision, 2019, pp. 3713–3722. 2 PMLR, 2018, pp. 1568–1577. 3
[52] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He, [75] J. Cui, R. Wang, S. Si, and C.-J. Hsieh, “DC-BENCH: Dataset
“Data distillation: Towards omni-supervised learning,” in Pro- condensation benchmark,” in Proceedings of the Advances in Neural
ceedings of the IEEE conference on computer vision and pattern Information Processing Systems (NeurIPS), 2022. 4, 17, 23
recognition, 2018, pp. 4119–4128. 2 [76] O. Sener and S. Savarese, “Active learning for convolu-
[53] J. Ba and R. Caruana, “Do deep nets really need to be deep?” tional neural networks: A core-set approach,” arXiv preprint
Advances in neural information processing systems, vol. 27, 2014. 2 arXiv:1708.00489, 2017. 4
[54] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, [77] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rec-
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient tifiers: Surpassing human-level performance on imagenet clas-
convolutional neural networks for mobile vision applications,” sification,” in Proceedings of the IEEE international conference on
arXiv preprint arXiv:1704.04861, 2017. 2 computer vision, 2015, pp. 1026–1034. 4
[55] G. K. Nayak, K. R. Mopuri, V. Shaj, V. B. Radhakrishnan, and [78] X. Glorot and Y. Bengio, “Understanding the difficulty of training
A. Chakraborty, “Zero-shot knowledge distillation in deep net- deep feedforward neural networks,” in Proceedings of the thir-
works,” in International Conference on Machine Learning. PMLR, teenth international conference on artificial intelligence and statistics.
2019, pp. 4743–4751. 2 JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
[56] M. Welling, “Herding dynamical weights to learn,” in Proceedings 4
of the 26th Annual International Conference on Machine Learning, [79] S. Lee, S. Chun, S. Jung, S. Yun, and S. Yoon, “Dataset conden-
2009, pp. 1121–1128. 2 sation with contrastive signals,” in Proceedings of the International
[57] Y. Chen, M. Welling, and A. Smola, “Super-samples from kernel Conference on Machine Learning (ICML), 2022, pp. 12 352–12 364. 4,
herding,” arXiv preprint arXiv:1203.3472, 2012. 2 5, 6, 9, 23, 24
[58] D. Feldman, M. Faulkner, and A. Krause, “Scalable training [80] Z. Jiang, J. Gu, M. Liu, and D. Z. Pan, “Delving into effec-
of mixture models via coresets,” Advances in neural information tive gradient matching for dataset condensation,” arXiv preprint
processing systems, vol. 24, 2011. 2 arXiv:2208.00311, 2022. 4, 5, 6, 9, 23, 24
[59] O. Bachem, M. Lucic, and A. Krause, “Coresets for nonparametric [81] G. Cazenavette, T. Wang, A. Torralba, A. A. Efros, and J.-Y.
estimation-the case of dp-means,” in International Conference on Zhu, “Dataset distillation by matching training trajectories,” in
Machine Learning. PMLR, 2015, pp. 209–217. 2 Proceedings of the IEEE/CVF Conference on Computer Vision and
[60] Z. Borsos, M. Mutny, and A. Krause, “Coresets via bilevel op- Pattern Recognition, 2022, pp. 4750–4759. 4, 5, 6, 9, 10, 11, 13,
timization for continual learning and streaming,” Advances in 14, 17, 18, 23, 24
Neural Information Processing Systems, vol. 33, pp. 14 879–14 890, [82] J.-H. Kim, J. Kim, S. J. Oh, S. Yun, H. Song, J. Jeong, J.-W. Ha,
2020. 2 and H. O. Song, “Dataset condensation via efficient synthetic-
[61] K. Killamsetty, X. Zhao, F. Chen, and R. Iyer, “Retrieve: Coreset data parameterization,” arXiv preprint arXiv:2205.14959, 2022. 4,
selection for efficient and robust semi-supervised learning,” Ad- 5, 6, 9, 10, 11, 18, 23, 24
vances in Neural Information Processing Systems, vol. 34, pp. 14 488– [83] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-
14 501, 2021. 2 learning for fast adaptation of deep networks,” in Proceedings of
[62] K. Killamsetty, D. Sivasubramanian, G. Ramakrishnan, and the 34th International Conference on Machine Learning, ICML 2017,
R. Iyer, “Glister: Generalization based data subset selection for Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of
efficient and robust learning,” in Proceedings of the AAAI Confer- Machine Learning Research, D. Precup and Y. W. Teh, Eds.,
ence on Artificial Intelligence, vol. 35, no. 9, 2021, pp. 8110–8118. vol. 70. PMLR, 2017, pp. 1126–1135. [Online]. Available:
2 http://proceedings.mlr.press/v70/finn17a.html 4
[63] B. Mirzasoleiman, J. Bilmes, and J. Leskovec, “Coresets for data- [84] Z. Deng and O. Russakovsky, “Remember the past: Distilling
efficient training of machine learning models,” in International datasets into addressable memories for neural networks,” arXiv
Conference on Machine Learning. PMLR, 2020, pp. 6950–6960. 2 preprint arXiv:2206.02916, 2022. 4, 9, 10, 11, 14, 15, 17, 23, 24
[64] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- [85] N. Loo, R. Hasani, A. Amini, and D. Rus, “Efficient dataset
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- distillation using random feature approximation,” in Proceedings
sarial networks,” Communications of the ACM, vol. 63, no. 11, pp. of the Advances in Neural Information Processing Systems (NeurIPS),
139–144, 2020. 3 2022. 5, 9, 23, 24
[65] M. Mirza and S. Osindero, “Conditional generative adversarial [86] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel:
nets,” arXiv preprint arXiv:1411.1784, 2014. 3 Convergence and generalization in neural networks,” Advances
[66] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in neural information processing systems, vol. 31, 2018. 5
arXiv preprint arXiv:1312.6114, 2013. 3 [87] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-
[67] B. Zhao and H. Bilen, “Synthesizing informative training samples Dickstein, and J. Pennington, “Wide neural networks of any
with gan,” arXiv preprint arXiv:2204.07513, 2022. 3, 9, 10, 23, 24 depth evolve as linear models under gradient descent,” Advances
[68] R. Abdal, Y. Qin, and P. Wonka, “Image2stylegan: How to embed in neural information processing systems, vol. 32, 2019. 5
images into the stylegan latent space?” in Proceedings of the [88] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and
IEEE/CVF International Conference on Computer Vision, 2019, pp. R. Wang, “On exact computation with an infinitely wide neural
4432–4441. 3, 10 net,” Advances in Neural Information Processing Systems, vol. 32,
[69] J. Zhu, Y. Shen, D. Zhao, and B. Zhou, “In-domain gan inversion 2019. 5
for real image editing,” in European conference on computer vision. [89] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak,
Springer, 2020, pp. 592–608. 3 and J. Sohl-Dickstein, “Finite versus infinite neural networks:
[70] H. B. Lee, D. B. Lee, and S. J. Hwang, “Dataset condensation with an empirical study,” Advances in Neural Information Processing
latent space knowledge factorization and sharing,” arXiv preprint Systems, vol. 33, pp. 15 156–15 172, 2020. 5
arXiv:2208.10494, 2022. 3, 9, 11, 23, 24 [90] R. M. Neal, Bayesian learning for neural networks. Springer Science
[71] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian & Business Media, 2012, vol. 118. 5
optimization of machine learning algorithms,” Advances in neural [91] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and
information processing systems, vol. 25, 2012. 3 J. Sohl-Dickstein, “Deep neural networks as gaussian processes,”
[72] J. Bergstra, D. Yamins, and D. Cox, “Making a science of model arXiv preprint arXiv:1711.00165, 2017. 5
search: Hyperparameter optimization in hundreds of dimensions [92] J. Platt et al., “Probabilistic outputs for support vector machines
for vision architectures,” in International conference on machine and comparisons to regularized likelihood methods,” Advances in
learning. PMLR, 2013, pp. 115–123. 3 large margin classifiers, vol. 10, no. 3, pp. 61–74, 1999. 5
[73] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward [93] L. Zhang, J. Zhang, B. Lei, S. Mukherjee, X. Pan, B. Zhao, C. Ding,
and reverse gradient-based hyperparameter optimization,” in Y. Li, and X. Dongkuan, “Accelerating dataset distillation via
model augmentation,” arXiv preprint arXiv:2212.06152, 2022. 5, [116] D. Chen, R. Kerkouche, and M. Fritz, “Private set generation
6, 9, 18, 23, 24 with discriminative information,” arXiv preprint arXiv:2211.04446,
[94] Li, Guang and Togo, Ren and Ogawa, Takahiro and Haseyama, 2022. 12
Miki, “Dataset distillation using parameter pruning,” arXiv [117] N. Tsilivis, J. Su, and J. Kempe, “Can we achieve robustness from
preprint arXiv:2209.14609, 2022. 7, 9, 23, 24 data alone?” arXiv preprint arXiv:2207.11727, 2022. 12
[95] J. Cui, R. Wang, S. Si, and C.-J. Hsieh, “Scaling up dataset [118] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu,
distillation to imagenet-1k with constant memory,” arXiv preprint “Towards deep learning models resistant to adversarial attacks,”
arXiv:2211.10586, 2022. 7, 9, 10, 11, 17, 18, 23, 24 arXiv preprint arXiv:1706.06083, 2017. 12
[96] J. Du, Y. Jiang, V. T. F. Tan, J. T. Zhou, and H. Li, “Minimizing [119] Y. Wu, X. Li, F. Kerschbaum, H. Huang, and H. Zhang,
the accumulated trajectory error to improve dataset distillation,” “Towards robust dataset learning,” 2022. [Online]. Available:
arXiv preprint arXiv:2211.11004, 2022. 7, 9, 23, 24 https://arxiv.org/abs/2211.10752 12
[97] K. Wang, B. Zhao, X. Peng, Z. Zhu, S. Yang, S. Wang, G. Huang, [120] Y. Liu, Z. Li, M. Backes, Y. Shen, and Y. Zhang, “Backdoor attacks
H. Bilen, X. Wang, and Y. You, “Cafe: Learning to condense against dataset distillation,” arXiv preprint arXiv:2301.01197, 2023.
dataset by aligning features,” in Proceedings of the IEEE/CVF 12, 18
Conference on Computer Vision and Pattern Recognition, 2022, pp. [121] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural
12 196–12 205. 7, 9, 23, 24 attention distillation: Erasing backdoor triggers from deep neural
[98] S. Liu, K. Wang, X. Yang, J. Ye, and X. Wang, “Dataset distillation networks,” arXiv preprint arXiv:2101.05930, 2021. 12, 18
via factorization,” in Proceedings of the Advances in Neural Informa- [122] Y. Liu, W.-C. Lee, G. Tao, S. Ma, Y. Aafer, and X. Zhang, “Abs:
tion Processing Systems (NeurIPS), 2022. 9, 11, 23, 24 Scanning neural networks for back-doors by artificial brain stim-
[99] T. DeVries and G. W. Taylor, “Improved regularization of ulation,” in Proceedings of the 2019 ACM SIGSAC Conference on
convolutional neural networks with cutout,” arXiv preprint Computer and Communications Security, 2019, pp. 1265–1282. 12,
arXiv:1708.04552, 2017. 9 18
[100] I. Sucholutsky and M. Schonlau, “Soft-label dataset distillation [123] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and
and text dataset distillation,” in 2021 International Joint Conference B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor
on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8. 11, 13 attacks in neural networks,” in 2019 IEEE Symposium on Security
[101] O. Bohdal, Y. Yang, and T. Hospedales, “Flexible dataset and Privacy (SP). IEEE, 2019, pp. 707–723. 12, 18
distillation: Learn labels instead of images,” arXiv preprint [124] S. Cho, T. J. Jun, B. Oh, and D. Kim, “Dapas: Denoising autoen-
arXiv:2006.08572, 2020. 11, 12, 13 coder to prevent adversarial attack in semantic segmentation,”
in 2020 International Joint Conference on Neural Networks (IJCNN).
[102] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Des-
IEEE, 2020, pp. 1–8. 12, 18
jardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-
Barwinska et al., “Overcoming catastrophic forgetting in neural [125] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and
networks,” Proceedings of the national academy of sciences, vol. 114, S. Nepal, “Strip: A defence against trojan attacks on deep neural
no. 13, pp. 3521–3526, 2017. 11 networks,” in Proceedings of the 35th Annual Computer Security
Applications Conference, 2019, pp. 113–125. 12, 18
[103] P. Buzzega, M. Boschini, A. Porrello, and S. Calderara, “Rethink-
[126] H. Kwon, “Defending deep neural networks against backdoor
ing experience replay: a bag of tricks for continual learning,” in
attack by using de-trigger autoencoder,” IEEE Access, 2021. 12, 18
2020 25th International Conference on Pattern Recognition (ICPR).
IEEE, 2021, pp. 2180–2187. 11 [127] J. Hayase, W. Kong, R. Somani, and S. Oh, “Spectre: defending
against backdoor attacks using robust statistics,” arXiv preprint
[104] Y. Liu, B. Schiele, and Q. Sun, “Adaptive aggregation networks
arXiv:2104.11315, 2021. 12, 18
for class-incremental learning,” in Proceedings of the IEEE/CVF
[128] D. Tang, X. Wang, H. Tang, and K. Zhang, “Demon in the variant:
Conference on Computer Vision and Pattern Recognition, 2021, pp.
Statistical analysis of dnns for robust backdoor contamination
2544–2553. 11
detection,” arXiv preprint arXiv:1908.00686, 2019. 12, 18
[105] A. Prabhu, P. H. Torr, and P. K. Dokania, “Gdumb: A simple
[129] B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor
approach that questions our progress in continual learning,” in
attacks,” Advances in neural information processing systems, vol. 31,
European conference on computer vision. Springer, 2020, pp. 524–
2018. 12, 18
540. 11
[130] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez,
[106] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Fed-
V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro,
erated optimization: Distributed machine learning for on-device
R. Faulkner et al., “Relational inductive biases, deep learning, and
intelligence,” arXiv preprint arXiv:1610.02527, 2016. 11
graph networks,” arXiv preprint arXiv:1806.01261, 2018. 12
[107] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. [131] T. N. Kipf and M. Welling, “Semi-supervised classification with
y Arcas, “Communication-efficient learning of deep networks graph convolutional networks,” arXiv preprint arXiv:1609.02907,
from decentralized data,” in Artificial intelligence and statistics. 2016. 12
PMLR, 2017, pp. 1273–1282. 11
[132] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio,
[108] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learn- and Y. Bengio, “Graph attention networks,” arXiv preprint
ing: Concept and applications,” ACM Transactions on Intelligent arXiv:1710.10903, 2017. 12
Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019. 11 [133] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip,
[109] M. Wistuba, A. Rawat, and T. Pedapati, “A survey on neural “A comprehensive survey on graph neural networks,” IEEE
architecture search,” arXiv preprint arXiv:1905.01392, 2019. 12 transactions on neural networks and learning systems, vol. 32, no. 1,
[110] L. Lyu, H. Yu, and Q. Yang, “Threats to federated learning: A pp. 4–24, 2020. 12
survey,” arXiv preprint arXiv:2003.02133, 2020. 12 [134] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin, “Graph
[111] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks neural networks for social recommendation,” in The world wide
that exploit confidence information and basic countermeasures,” web conference, 2019, pp. 417–426. 12
in Proceedings of the 22nd ACM SIGSAC conference on computer and [135] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec,
communications security, 2015, pp. 1322–1333. 12 “Hierarchical graph representation learning with differentiable
[112] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership pooling,” Advances in neural information processing systems, vol. 31,
inference attacks against machine learning models,” in 2017 IEEE 2018. 12
symposium on security and privacy (SP). IEEE, 2017, pp. 3–18. 12 [136] W. Jin, L. Zhao, S. Zhang, Y. Liu, J. Tang, and N. Shah,
[113] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploiting “Graph condensation for graph neural networks,” arXiv preprint
unintended feature leakage in collaborative learning,” in 2019 arXiv:2110.07580, 2021. 13
IEEE symposium on security and privacy (SP). IEEE, 2019, pp. [137] J. J. Pfeiffer III, S. Moreno, T. La Fond, J. Neville, and B. Gal-
691–706. 12 lagher, “Attributed graph models: Modeling network structure
[114] I. Sucholutsky and M. Schonlau, “Secdd: Efficient and secure with correlated attributes,” in Proceedings of the 23rd international
method for remotely training neural networks,” arXiv preprint conference on World wide web, 2014, pp. 831–842. 13
arXiv:2009.09155, 2020. 12 [138] W. Jin, X. Tang, H. Jiang, Z. Li, D. Zhang, J. Tang, and B. Yin,
[115] M. Kettunen, E. Härkönen, and J. Lehtinen, “E-lpips: robust per- “Condensing graphs via one-step gradient matching,” in Proceed-
ceptual image similarity via random transformation ensembles,” ings of the 28th ACM SIGKDD Conference on Knowledge Discovery
arXiv preprint arXiv:1906.03973, 2019. 12 and Data Mining, 2022, pp. 720–730. 13
[139] M. Liu, S. Li, X. Chen, and L. Song, “Graph condensa- [163] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
tion via receptive field distribution matching,” arXiv preprint Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet
arXiv:2206.13697, 2022. 13 large scale visual recognition challenge,” International journal of
[140] J. A. Konstan and J. Riedl, “Recommender systems: from al- computer vision, vol. 115, no. 3, pp. 211–252, 2015. 17
gorithms to user experience,” User modeling and user-adapted [164] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-
interaction, vol. 22, no. 1, pp. 101–123, 2012. 13 works for semantic segmentation,” in Proceedings of the IEEE
[141] F. O. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh, “Recommen- conference on computer vision and pattern recognition, 2015, pp.
dation systems: Principles, methods and evaluation,” Egyptian 3431–3440. 18
informatics journal, vol. 16, no. 3, pp. 261–273, 2015. 13 [165] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
[142] A. Narayanan and V. Shmatikov, “Robust de-anonymization of Yuille, “Deeplab: Semantic image segmentation with deep convo-
large sparse datasets,” in 2008 IEEE Symposium on Security and lutional nets, atrous convolution, and fully connected crfs,” IEEE
Privacy (sp 2008). IEEE, 2008, pp. 111–125. 13 transactions on pattern analysis and machine intelligence, vol. 40,
[143] N. Sachdeva, M. P. Dhaliwal, C.-J. Wu, and J. McAuley, “Infi- no. 4, pp. 834–848, 2017. 18
nite recommendation networks: A data-centric approach,” arXiv [166] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
preprint arXiv:2206.02626, 2022. 13 hierarchies for accurate object detection and semantic segmenta-
[144] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, tion,” in Proceedings of the IEEE conference on computer vision and
M. Chenaghlu, and J. Gao, “Deep learning–based text classifica- pattern recognition, 2014, pp. 580–587. 18
tion: a comprehensive review,” ACM Computing Surveys (CSUR), [167] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
vol. 54, no. 3, pp. 1–40, 2021. 13 lation by jointly learning to align and translate,” arXiv preprint
arXiv:1409.0473, 2014. 18
[145] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari-
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al.,
“Language models are few-shot learners,” Advances in neural
information processing systems, vol. 33, pp. 1877–1901, 2020. 13
[146] Y. Li and W. Li, “Data distillation for text classification,” arXiv
preprint arXiv:2104.08448, 2021. 13
[147] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation
distillation,” arXiv preprint arXiv:1910.10699, 2019. 13
[148] P. Chen, S. Liu, H. Zhao, and J. Jia, “Distilling knowledge via
knowledge review,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2021, pp. 5008–5017. 13
[149] C. Li, M. Lin, Z. Ding, N. Lin, Y. Zhuang, Y. Huang, X. Ding,
and L. Cao, “Knowledge condensation distillation,” in European
Conference on Computer Vision. Springer, 2022, pp. 19–35. 13
[150] R. Kumar, W. Wang, J. Kumar, T. Yang, A. Khan, W. Ali, and I. Ali,
“An integration of blockchain and ai for secure data sharing and
detection of ct images for the hospitals,” Computerized Medical
Imaging and Graphics, vol. 87, p. 101812, 2021. 13
[151] E. R. Weitzman, L. Kaci, and K. D. Mandl, “Sharing medical data
for health research: the early personal health record experience,”
Journal of medical Internet research, vol. 12, no. 2, p. e1356, 2010. 13
[152] G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren,
“Secure, privacy-preserving and federated machine learning in
medical imaging,” Nature Machine Intelligence, vol. 2, no. 6, pp.
305–311, 2020. 13
[153] G. Li, R. Togo, T. Ogawa, and M. Haseyama, “Soft-label anony-
mous gastric x-ray image distillation,” in 2020 IEEE International
Conference on Image Processing (ICIP). IEEE, 2020, pp. 305–309.
13
[154] Li, Guang and Togo, Ren and Ogawa, Takahiro and Haseyama,
Miki, “Dataset distillation for medical dataset sharing,” arXiv
preprint arXiv:2209.14603, 2022. 13
[155] Li, Guang and Togo, Ren and Ogawa, Takahiro and Haseyama,
Miki, “Compressed gastric image generation based on soft-label
dataset distillation for medical data sharing,” Computer Methods
and Programs in Biomedicine, vol. 227, p. 107189, 2022. 13
[156] G. Cazenavette, T. Wang, A. Torralba, A. A. Efros, and J.-Y. Zhu,
“Wearable imagenet: Synthesizing tileable textures via dataset
distillation,” in Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, 2022, pp. 2278–2282. 13
[157] Y. Chen, Z. Wu, Z. Shen, and J. Jia, “Learning from designers:
Fashion compatibility analysis via dataset distillation,” in 2022
IEEE International Conference on Image Processing (ICIP). IEEE,
2022, pp. 856–860. 13
[158] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the
IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. 14
[159] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image
dataset for benchmarking machine learning algorithms,” arXiv
preprint arXiv:1708.07747, 2017. 14
[160] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of
features from tiny images,” 2009. 14
[161] Y. Le and X. Yang, “Tiny imagenet visual recognition challenge,”
CS 231N, vol. 7, no. 7, p. 3, 2015. 14
[162] K. Simonyan and A. Zisserman, “Very deep convolutional
networks for large-scale image recognition,” arXiv preprint
arXiv:1409.1556, 2014. 14
APPENDIX A
PERFORMANCE OF EXISTING DD METHODS

Here, we list the evaluation performance of all existing DD methods on networks with the same architecture as the training network. Results are derived from official papers, DC-Benchmark [75], and our experiments.

TABLE 6
The performance results of all existing methods on MNIST. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | 79.5 ± 8.1 | -
DD† [18], [84] | 95.2 ± 0.3 | 98.0 ± 0.1 | 98.8 ± 0.1
AddMem [84] | 98.7 ± 0.7 | 99.3 ± 0.5 | 99.4 ± 0.4
KIP (ConvNet) [21] | 90.1 ± 0.1 | 97.5 ± 0.0 | 98.3 ± 0.1
KIP (∞-FC) [20] | 85.5 ± 0.1 | 97.2 ± 0.2 | 98.4 ± 0.1
KIP (∞-Conv) [21] | 97.3 ± 0.1 | 99.1 ± 0.1 | 99.5 ± 0.1
FRePo (ConvNet) [19] | 93.8 ± 0.6 | 98.4 ± 0.1 | 99.2 ± 0.1
FRePo (∞-Conv) [19] | 93.6 ± 0.5 | 98.5 ± 0.1 | 99.1 ± 0.1
RFAD (ConvNet) [85] | 94.4 ± 1.5 | 98.5 ± 0.1 | 98.8 ± 0.1
RFAD (∞-Conv) [85] | 97.2 ± 0.2 | 99.1 ± 0.0 | 99.1 ± 0.0

TABLE 7
The performance results of all existing methods on Fashion-MNIST. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | - | -
DD† [18], [84] | 83.4 ± 0.3 | 87.6 ± 0.4 | 87.7 ± 0.3
AddMem [84] | 88.5 ± 0.1 | 90.0 ± 0.7 | 91.2 ± 0.3
KIP (ConvNet) [21] | 70.6 ± 0.6 | 84.6 ± 0.3 | 88.7 ± 0.2
KIP (∞-FC) [20] | - | - | -
KIP (∞-Conv) [21] | 82.9 ± 0.2 | 91.0 ± 0.1 | 92.4 ± 0.1
FRePo (ConvNet) [19] | 75.6 ± 0.5 | 86.2 ± 0.3 | 89.6 ± 0.1
FRePo (∞-Conv) [19] | 76.3 ± 0.4 | 86.7 ± 0.3 | 90.0 ± 0.0
RFAD (∞-Conv) [85] | 84.6 ± 0.2 | 90.3 ± 0.2 | 91.4 ± 0.1
DC [22] | 70.5 ± 0.6 | 82.3 ± 0.4 | 83.6 ± 0.4
DSA [23] | 70.6 ± 0.6 | 84.8 ± 0.3 | 88.8 ± 0.2
IDC [82] | - | - | -
EDD [93] | - | - | -
DCC [79] | - | - | -
EGM [80] | 71.4 ± 0.4 | 85.4 ± 0.3 | 87.9 ± 0.2
MTT [81] | 75.1 ± 0.9 | 87.2 ± 0.3 | 88.3 ± 0.1
DDPP [94] | - | - | -
HaBa [98] | - | - | -
FTD [96] | - | - | -
TESLA [95] | - | - | -
DM [24] | 71.5 ± 0.5 | 83.6 ± 0.2 | 88.2 ± 0.1
IT-GAN [67] | - | - | -
KFS [70] | - | - | -
CAFE [97] | 73.7 ± 0.7 | 83.0 ± 0.4 | 88.2 ± 0.3

TABLE 8
The performance results of all existing methods on SVHN. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | - | -
DD† [18], [84] | - | - | -
AddMem [84] | 87.3 ± 0.1 | 89.1 ± 0.2 | 89.5 ± 0.2
KIP (ConvNet) [21] | 57.3 ± 0.1 | 75.0 ± 0.1 | 80.5 ± 0.1
KIP (∞-FC) [20] | - | - | -
KIP (∞-Conv) [21] | 64.3 ± 0.4 | 81.1 ± 0.5 | 84.3 ± 0.1
FRePo (ConvNet) [19] | - | - | -
FRePo (∞-Conv) [19] | - | - | -
RFAD (ConvNet) [85] | 52.2 ± 2.2 | 74.9 ± 0.4 | 80.9 ± 0.3
RFAD (∞-Conv) [85] | 57.4 ± 0.8 | 78.2 ± 0.5 | 82.4 ± 0.1
DC [22] | 31.2 ± 1.4 | 76.1 ± 0.6 | 82.3 ± 0.3
DSA [23] | 27.5 ± 1.4 | 79.2 ± 0.5 | 84.4 ± 0.4
IDC [82] | - | - | -
EDD [93] | - | - | -
DCC [79] | 47.5 ± 2.6 | 80.5 ± 0.6 | 87.2 ± 0.3
EGM [80] | 34.5 ± 1.9 | 76.2 ± 0.7 | 83.8 ± 0.3
MTT [81] | - | - | -
DDPP [94] | - | - | -

TABLE 9
The performance results of all existing methods on CIFAR-10. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | 36.8 ± 1.2 | -
DD† [18], [84] | 46.6 ± 0.6 | 60.2 ± 0.4 | 65.3 ± 0.4
KIP (ConvNet) [21] | 49.9 ± 0.2 | 62.7 ± 0.3 | 68.6 ± 0.2
KIP (∞-FC) [20] | 40.5 ± 0.4 | 53.1 ± 0.5 | 58.6 ± 0.4
KIP (∞-Conv) [21] | 64.3 ± 0.4 | 81.1 ± 0.5 | 84.3 ± 0.1
FRePo (ConvNet) [19] | 46.8 ± 0.7 | 65.5 ± 0.6 | 71.7 ± 0.2
FRePo (∞-Conv) [19] | 47.9 ± 0.6 | 68.0 ± 0.2 | 74.4 ± 0.1
RFAD (ConvNet) [85] | 53.6 ± 1.2 | 66.3 ± 0.5 | 71.1 ± 0.4
RFAD (∞-Conv) [85] | 61.4 ± 0.8 | 73.7 ± 0.2 | 76.6 ± 0.3
DC [22] | 28.3 ± 0.5 | 44.9 ± 0.5 | 53.9 ± 0.5
DSA [23] | 28.8 ± 0.7 | 53.2 ± 0.8 | 60.6 ± 0.5
IDC [82] | 50.0 ± 0.4 | 67.5 ± 0.5 | 74.5 ± 0.1
EDD [93] | 49.2 ± 0.4 | 67.1 ± 0.2 | 73.8 ± 0.1
DCC [79] | 34.0 ± 0.7 | 54.5 ± 0.5 | 64.2 ± 0.4
EGM [80] | 30.0 ± 0.6 | 50.2 ± 0.6 | 60.0 ± 0.4
MTT [81] | 46.3 ± 0.8 | 65.3 ± 0.7 | 71.6 ± 0.2
DDPP [94] | 46.4 ± 0.6 | 65.5 ± 0.3 | 71.9 ± 0.2
HaBa [98] | 48.3 ± 0.8 | 69.9 ± 0.4 | 74.0 ± 0.2
FTD [96] | 46.8 ± 0.3 | 66.6 ± 0.3 | 73.8 ± 0.2
TESLA [95] | 48.5 ± 0.8 | 66.4 ± 0.8 | 72.6 ± 0.7
DM [24] | 26.5 ± 0.4 | 48.9 ± 0.6 | 63.0 ± 0.4
IT-GAN [67] | - | - | -
KFS [70] | 59.8 ± 0.5 | 72.0 ± 0.3 | 75.0 ± 0.2
CAFE [97] | 31.6 ± 0.8 | 50.9 ± 0.5 | 62.3 ± 0.4
TABLE 10
The performance results of all existing methods on CIFAR-100. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | - | -
DD† [18], [84] | 19.6 ± 0.4 | 32.7 ± 0.4 | 35.0 ± 0.3
AddMem [84] | 34.0 ± 0.4 | 42.9 ± 0.7 | -
KIP (ConvNet) [21] | 15.7 ± 0.2 | 28.3 ± 0.1 | -
KIP (∞-FC) [20] | - | - | -
KIP (∞-Conv) [21] | 34.9 ± 0.1 | 49.5 ± 0.3 | -
FRePo (ConvNet) [19] | 27.2 ± 0.4 | 41.3 ± 0.2 | 44.3 ± 0.2
FRePo (∞-Conv) [19] | 30.4 ± 0.3 | 42.1 ± 0.2 | 43.0 ± 0.3
RFAD (ConvNet) [85] | 26.3 ± 1.1 | 33.0 ± 0.3 | -
RFAD (∞-Conv) [85] | 44.1 ± 0.1 | 46.8 ± 0.2 | -

TABLE 11
The performance results of all existing methods on Tiny-ImageNet. † adopt momentum [84].
Method | IPC=1 | IPC=10 | IPC=50
DD [18] | - | - | -
DD† [18], [84] | - | - | -
AddMem [84] | - | - | -
KIP (ConvNet) [21] | - | - | -
KIP (∞-FC) [20] | - | - | -
KIP (∞-Conv) [21] | - | - | -
FRePo (ConvNet) [19] | 15.4 ± 0.3 | 24.9 ± 0.2 | -
FRePo (∞-Conv) [19] | 17.6 ± 0.2 | 25.3 ± 0.2 | -
RFAD (ConvNet) [85] | - | - | -
RFAD (∞-Conv) [85] | - | - | -

TABLE 12
The performance results of all existing methods on ImageNet-1K. † adopt momentum [84].
Method | IPC=1 | IPC=2 | IPC=10 | IPC=50
DD [18] | - | - | - | -
DD† [18], [84] | - | - | - | -
AddMem [84] | - | - | - | -
KIP (ConvNet) [21] | - | - | - | -
KIP (∞-FC) [20] | - | - | - | -
KIP (∞-Conv) [21] | - | - | - | -
FRePo (ConvNet) [19] | 7.5 ± 0.3 | 9.7 ± 0.2 | - | -
FRePo (∞-Conv) [19] | - | - | - | -
RFAD (ConvNet) [85] | - | - | - | -
RFAD (∞-Conv) [85] | - | - | - | -