A Comprehensive Survey of Continual Learning: Theory, Method and Application
(Survey Paper)
and vice versa. Beyond simply balancing the "proportions" of these two aspects, a desirable solution for continual learning should obtain strong generalizability to accommodate distribution differences within and between tasks (Fig. 1(b)). As a naive baseline, reusing all old training samples (if allowed) makes it easy to address the above challenges, but creates huge computational and storage overheads, as well as potential privacy issues. In fact, continual learning is primarily intended to ensure the resource efficiency of model updates, preferably close to learning only the new training samples.

A number of continual learning methods have been proposed in recent years for various aspects of machine learning (Fig. 1(c)), which can be conceptually grouped into five major categories: adding regularization terms with reference to the old model (regularization-based approach); approximating and recovering the old data distributions (replay-based approach); explicitly manipulating the optimization programs (optimization-based approach); learning robust and well-generalized representations (representation-based approach); and constructing task-specific parameters with a properly-designed architecture (architecture-based approach). This taxonomy extends the commonly-used ones with current advances, and provides refined sub-directions for each category. We summarize how these methods achieve the objective of continual learning, with an extensive analysis of their theoretical foundations and specific implementations. In particular, these methods are closely connected, e.g., regularization and replay ultimately act to rectify the gradient directions in optimization, and highly synergistic, e.g., the efficacy of replay can be facilitated by distilling knowledge from the old model.

Realistic applications present particular challenges for continual learning, categorized into scenario complexity and task specificity (Fig. 1(d)). As for the former, for example, the task identity is probably missing in training and testing, and the training samples might be introduced in tiny batches or even one pass. Due to the expense and scarcity of data labeling, continual learning needs to be effective for few-shot, semi-supervised and even unsupervised scenarios. As for the latter, although current advances mainly focus on visual classification, other representative visual domains such as object detection and semantic segmentation, as well as other related fields such as conditional generation, reinforcement learning (RL) and natural language processing (NLP), are receiving increasing attention with their own characteristics. We summarize these particular challenges and analyze how continual learning methods are adapted to them.

Considering the significant growth of interest in continual learning, we believe that such an up-to-date and comprehensive survey can provide a holistic perspective for subsequent work. Although there are some early surveys of continual learning with relatively broad coverage [2], [5], [8], important advances in recent years have not been incorporated. In contrast, the latest surveys typically collate only a local aspect of continual learning, with respect to its biological underpinnings [1], [3], [9], [10], specialized settings for visual classification [11], [12], [13], [14], and specific extensions for NLP [15], [16] or RL [17]. We include a detailed comparison of previous surveys in Appendix Table 1, available online. These surveys typically find it hard to be both up-to-date and comprehensive, which is the primary strength of this work.

Compared to previous surveys, our improvements lie in the following aspects: (1) Setup: we collect and formulate more typical scenarios that have emerged in recent years; (2) Theory: we summarize theoretical efforts on continual learning in terms of stability, plasticity and generalizability; (3) Method: we add optimization-based and representation-based approaches on top of the regularization-based, replay-based and architecture-based approaches, with an extensive analysis of their sub-directions; (4) Application: we summarize practical applications of continual learning and their particular challenges in terms of scenario complexity and task specificity; (5) Linkage: we discuss underlying connections between theory, method and application, as well as promising crossovers with other related fields.

The organization of this paper is described in Fig. 1. We introduce basic setups of continual learning in Section II, summarize theoretical efforts for its general objectives in Section III, present a state-of-the-art and elaborated taxonomy of representative methods in Section IV, describe how these methods are adapted to practical challenges in Section V, and discuss promising directions for subsequent work in Section VI.

II. SETUP

In this section, we first present a basic formulation of continual learning. Then we introduce typical scenarios and evaluation metrics.

A. Basic Formulation

Continual learning is characterized as learning from dynamic data distributions. In practice, training samples of different distributions arrive in sequence. A continual learning model parameterized by $\theta$ needs to learn the corresponding task(s) with no or limited access to old training samples and perform well on their test sets. Formally, an incoming batch of training samples belonging to a task $t$ can be represented as $\mathcal{D}_{t,b} = \{\mathcal{X}_{t,b}, \mathcal{Y}_{t,b}\}$, where $\mathcal{X}_{t,b}$ is the input data, $\mathcal{Y}_{t,b}$ is the data label, $t \in \mathcal{T} = \{1, \ldots, k\}$ is the task identity and $b \in \mathcal{B}_t$ is the batch index ($\mathcal{T}$ and $\mathcal{B}_t$ denote their spaces, respectively). Here we define a "task" by its training samples $\mathcal{D}_t$ following the distribution $\mathbb{D}_t := p(\mathcal{X}_t, \mathcal{Y}_t)$ ($\mathcal{D}_t$ denotes the entire training set by omitting the batch index, likewise for $\mathcal{X}_t$ and $\mathcal{Y}_t$), and assume that there is no difference in distribution between training and testing. Under realistic constraints, the data label $\mathcal{Y}_t$ and the task identity $t$ might not always be available. In continual learning, the training samples of each task can arrive incrementally in batches (i.e., $\{\{\mathcal{D}_{t,b}\}_{b \in \mathcal{B}_t}\}_{t \in \mathcal{T}}$) or simultaneously (i.e., $\{\mathcal{D}_t\}_{t \in \mathcal{T}}$).

B. Typical Scenario

According to the division of incremental batches and the availability of task identities, we detail the typical scenarios as follows (see Table I for a formal comparison):
• Instance-Incremental Learning (IIL): All training samples belong to the same task and arrive in batches.
• Domain-Incremental Learning (DIL): Tasks have the same data label space but different input distributions. Task identities are not required.
TABLE I
A FORMAL COMPARISON OF TYPICAL CONTINUAL LEARNING SCENARIOS
• Task-Incremental Learning (TIL): Tasks have disjoint data label spaces. Task identities are provided in both training and testing.
• Class-Incremental Learning (CIL): Tasks have disjoint data label spaces. Task identities are only provided in training.
• Task-Free Continual Learning (TFCL): Tasks have disjoint data label spaces. Task identities are not provided in either training or testing.
• Online Continual Learning (OCL): Tasks have disjoint data label spaces. Training samples for different tasks arrive as a one-pass data stream.
• Continual Pre-training (CPT): Pre-training data arrives in sequence. The goal is to improve knowledge transfer to downstream tasks.

If not specified, each task is often assumed to have a sufficient number of labeled training samples, i.e., Supervised Continual Learning. According to the provided $\mathcal{X}_t$ and $\mathcal{Y}_t$ in each $\mathcal{D}_t$, continual learning is further extended to zero-shot [23], few-shot [24], semi-supervised [25], open-world [26] and un-/self-supervised [27], [28] scenarios. Besides, other realistic considerations have been incorporated, such as multiple labels [29], noisy labels [30], blurred boundaries [31], [32], hierarchical granularity [33] and sub-populations [34], mixtures of task similarity [35], long-tailed distributions [36], novel class discovery [37], [38], multi-modality [39], etc. Some recent work has focused on combinatorial scenarios [18], [40], [41], [42], [43] in order to simulate real-world complexity.

C. Evaluation Metric

In general, the performance of continual learning can be evaluated from three aspects: overall performance of the tasks learned so far, memory stability of old tasks, and learning plasticity of new tasks. Here we summarize the common evaluation metrics, using classification as an example.

Overall performance is typically evaluated by average accuracy (AA) [44], [45] and average incremental accuracy (AIA) [46], [47]. Let $a_{k,j} \in [0, 1]$ denote the classification accuracy evaluated on the test set of the $j$th task after incremental learning of the $k$th task ($j \leq k$). The output space to compute $a_{k,j}$ consists of the classes in either $\mathcal{Y}_j$ or $\cup_{i=1}^{k} \mathcal{Y}_i$, corresponding to the use of multi-head evaluation (e.g., TIL) or single-head evaluation (e.g., CIL) [44]. The two metrics at the $k$th task are then defined as $\mathrm{AA}_k = \frac{1}{k} \sum_{j=1}^{k} a_{k,j}$ and $\mathrm{AIA}_k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{AA}_i$, where AA represents the overall performance at the current moment and AIA further reflects the historical performance.

Memory stability can be evaluated by the forgetting measure (FM) [44] and backward transfer (BWT) [45]. As for the former, the forgetting of a task is calculated by the difference between its maximum performance obtained in the past and its current performance: $f_{j,k} = \max_{i \in \{1, \ldots, k-1\}} (a_{i,j} - a_{k,j}), \ \forall j < k$. FM at the $k$th task is the average forgetting of all old tasks: $\mathrm{FM}_k = \frac{1}{k-1} \sum_{j=1}^{k-1} f_{j,k}$. As for the latter, BWT evaluates the average influence of learning the $k$th task on all old tasks: $\mathrm{BWT}_k = \frac{1}{k-1} \sum_{j=1}^{k-1} (a_{k,j} - a_{j,j})$, where forgetting is generally reflected as a negative BWT.

Learning plasticity can be evaluated by the intransience measure (IM) [44] and forward transfer (FWT) [45]. IM is defined as the inability of a model to learn new tasks, calculated for each task as the difference between its joint-training performance and its continual learning performance: $\mathrm{IM}_k = a^*_k - a_{k,k}$, where $a^*_k$ is the classification accuracy of a randomly-initialized reference model jointly trained with $\cup_{j=1}^{k} \mathcal{D}_j$ for the $k$th task. In comparison, FWT evaluates the average influence of all old tasks on the current $k$th task: $\mathrm{FWT}_k = \frac{1}{k-1} \sum_{j=2}^{k} (a_{j,j} - \tilde{a}_j)$, where $\tilde{a}_j$ is the classification accuracy of a randomly-initialized reference model trained with $\mathcal{D}_j$ for the $j$th task. Note that $a_{k,j}$ can be adapted to other forms depending on the task type, and the above evaluation metrics should be adjusted accordingly.

III. THEORY

In this section, we summarize the theoretical efforts on continual learning, with respect to both the stability-plasticity trade-off and generalizability analysis, and relate them to the motivations of various continual learning methods.

A. Stability-Plasticity Trade-Off

With the basic formulation in Section II-A, let us consider a general setup for continual learning, where a neural
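As a worked illustration of the evaluation metrics defined in Section II-C, the following minimal NumPy sketch computes them from a matrix of task accuracies. It is not part of the original survey; the array layout, the 0-based indexing and all names are assumptions made for the example.

```python
import numpy as np

# acc[k-1, j-1] stores a_{k,j}: accuracy on the j-th task's test set after
# incrementally learning the k-th task (entries with j > k are unused).
# acc_joint[k-1] stores a*_k (jointly trained reference model);
# acc_random[j-1] stores ~a_j (reference model trained on task j alone).

def average_accuracy(acc, k):
    # AA_k = (1/k) * sum_{j=1..k} a_{k,j}
    return float(np.mean(acc[k - 1, :k]))

def average_incremental_accuracy(acc, k):
    # AIA_k = (1/k) * sum_{i=1..k} AA_i
    return float(np.mean([average_accuracy(acc, i) for i in range(1, k + 1)]))

def forgetting_measure(acc, k):
    # FM_k = (1/(k-1)) * sum_{j<k} [ max_{i<k} a_{i,j} - a_{k,j} ]
    f = [np.max(acc[:k - 1, j]) - acc[k - 1, j] for j in range(k - 1)]
    return float(np.mean(f))

def backward_transfer(acc, k):
    # BWT_k = (1/(k-1)) * sum_{j=1..k-1} (a_{k,j} - a_{j,j})
    return float(np.mean([acc[k - 1, j] - acc[j, j] for j in range(k - 1)]))

def intransience_measure(acc, acc_joint, k):
    # IM_k = a*_k - a_{k,k}
    return float(acc_joint[k - 1] - acc[k - 1, k - 1])

def forward_transfer(acc, acc_random, k):
    # FWT_k = (1/(k-1)) * sum_{j=2..k} (a_{j,j} - ~a_j)
    return float(np.mean([acc[j, j] - acc_random[j] for j in range(1, k)]))
```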
non-convex, there are many local optimal solutions that perform similarly on training sets but have significantly different generalizability on test sets [69], [81]. For continual learning, a desirable solution requires not only intra-task generalizability from training sets to test sets, but also inter-task generalizability to accommodate incremental changes of their distributions. Here we provide a conceptual demonstration with a task-specific loss $\ell_t(\theta; \mathcal{D}_t)$ and its empirical optimal solution $\theta_t^* = \arg\min_\theta \ell_t(\theta; \mathcal{D}_t)$. When task $i$ needs to accommodate another task $j$, the maximum sacrifice of its performance can be estimated by performing a second-order Taylor expansion of $\ell_i(\theta; \mathcal{D}_i)$ around $\theta_i^*$:

$$\ell_i(\theta_j^*; \mathcal{D}_i) \approx \ell_i(\theta_i^*; \mathcal{D}_i) + (\theta_j^* - \theta_i^*)^\top \nabla_\theta \ell_i(\theta; \mathcal{D}_i)\big|_{\theta=\theta_i^*} + \frac{1}{2} (\theta_j^* - \theta_i^*)^\top H_i \, (\theta_j^* - \theta_i^*),$$

where $H_i = \nabla^2_\theta \ell_i(\theta; \mathcal{D}_i)\big|_{\theta=\theta_i^*}$ denotes the Hessian at $\theta_i^*$ (the first-order term vanishes when $\theta_i^*$ is an exact empirical optimum).

self-supervised learning [28], [91], [92], [93]. This motivates the representation-based approach in Section IV-D.

There are many other factors important for continual learning performance. As shown in (13), the upper bound of performance degradation also depends on the difference of the empirical optimal solution $\theta_t^* = \arg\min_\theta \ell_t(\theta; \mathcal{D}_t)$ for each task, i.e., the discrepancy of task distributions (see Fig. 2(d)), which is further validated by a theoretical analysis of the forgetting-generalization trade-off [94] and the PAC-Bayes bound of generalization errors [69], [95]. Therefore, how to exploit task similarity is directly related to the performance of continual learning. The generalization error for each task can improve with synergistic tasks but deteriorate with competing tasks [68], where learning all tasks equally in a shared solution tends to compromise each
Fig. 3. State-of-the-art and elaborated taxonomy of representative continual learning methods. We have summarized five main categories (blue blocks), each of
which is further divided into several sub-directions (red blocks).
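As a concrete illustration of the regularization-based category in this taxonomy, the following is a minimal PyTorch-style sketch in the spirit of EWC [49], which anchors the parameters that are important to old tasks with a quadratic penalty weighted by a diagonal Fisher estimate. It is a hedged sketch rather than a reproduction of any surveyed implementation, and all function and variable names are illustrative.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn, device="cpu"):
    """Estimate a diagonal Fisher matrix from squared parameter gradients
    on the data of the task that was just learned."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    model.eval()
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """(lam/2) * sum_i F_i * (theta_i - theta_old_i)^2, added to the new-task loss."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# After finishing task t:
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher = diagonal_fisher(model, loader_t, torch.nn.functional.cross_entropy)
# While training task t+1:
#   loss = task_loss + ewc_penalty(model, old_params, fisher, lam=100.0)
```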
parameter gradients [21], performance of corresponding tasks with cardinality constraints [132], mini-batch gradient similarity and cross-batch gradient diversity [133], ability of optimizing latent decision boundaries [134], diversity of robustness against perturbations [32], similarity to the gradients of old training samples with respect to the current parameters [135], etc.

To improve storage efficiency, AQM [136] performs online continual compression based on a VQ-VAE framework [137] and saves compressed data for replay. MRDC [138] formulates experience replay with data compression as determinantal point processes (DPPs), and derives a computationally efficient way for online determination of the optimal compression rate. RM [32] adopts conventional and label mixing-based strategies of data augmentation to enhance the diversity of old training samples. RAR [139] synthesizes adversarial samples near the forgetting boundary and performs MixUp for data augmentation. Auxiliary information with low storage cost, such as class statistics [79], [140] and attention maps [141], [142], can be further incorporated to maintain old knowledge. Besides, the old training samples can be continually adjusted to accommodate incremental changes, e.g., making them more representative [143] or more challenging [144] for separation.

As for exploitation, experience replay requires an adequate use of the memory buffer to recover the past information. There are many different implementations, closely related to other continual learning strategies. First, the effect of old training samples in optimization can be constrained to avoid catastrophic forgetting and facilitate knowledge transfer. For example, GEM [45] constructs individual constraints based on the old training samples of each task to ensure non-increase in their losses. A-GEM [63] replaces the individual constraints with a global loss of all tasks to improve training efficiency. LOGD [64] decomposes the gradient of each task into task-sharing and task-specific components to leverage inter-task information. To achieve a good interference-transfer trade-off [94], [131], MER [131] employs meta-learning for gradient alignment in experience replay. BCL [94] explicitly pursues a saddle point of the cost of old and new training samples. To selectively utilize the memory buffer, MIR [145] prioritizes the old training samples that are subject to more forgetting, while HAL [146] uses them as "anchors" and stabilizes their predictions.

On the other hand, experience replay can be naturally combined with knowledge distillation (KD), which additionally incorporates the past information from the old model. iCaRL [122] and EEIL [123] are two early works that perform KD on both old and new training samples. Some subsequent improvements focus on different issues in experience replay. To mitigate the data imbalance caused by the limited old training samples, LUCIR [46] encourages similar feature orientations of the old and new models, while performing cosine normalization of the last layer and mining hard negatives of the current task. BiC [147] and WA [148] attribute this issue to the bias of the last fully-connected layer, and resolve it by either learning a bias correction layer with a balanced validation set [147] or normalizing the output weights [148]. SS-IL [149] adopts a separated softmax in the last layer and task-wise KD to mitigate the bias. DRI [129] trains a generative model to supplement the old training samples with additional generated data. To alleviate dramatic representation shifts, PODNet [47] employs a spatial distillation loss to preserve representations throughout the model. Co2L [92] introduces a self-supervised distillation loss to obtain robust representations against catastrophic forgetting. GeoDL [150] performs KD along a path that connects the low-dimensional projections of the old and new feature spaces for a smooth transition between them. ELI [151] learns an energy manifold with the old and new models to realign the representation shifts for optimizing incremental tasks. To adequately exploit the past information, DDE [152] distills colliding effects from the features of the new training samples, while CSC [153] additionally leverages the structure of the old feature space. To further enhance learning plasticity, D+R [154] performs KD from an additional model dedicated to each new task. FOSTER [155] expands new modules to fit the residuals of the old model and then distills them into a single model. Besides, weight regularization can also be combined with experience replay for better performance and generality [44], [52].

It is worth noting that the merits and limitations of experience replay remain largely open. In addition to the intuitive benefit of staying in the low-loss region of the old tasks [156], theoretical analysis has demonstrated its contribution to resolving the NP-hard problem of optimal continual learning [62]. However, it risks overfitting to the few old training samples retained in the memory buffer, which potentially affects generalizability [156]. To alleviate overfitting, LiDER [157] takes inspiration from adversarial robustness and enforces the Lipschitz continuity of the model with respect to its inputs. MOCA [158] enlarges the variation of representations to prevent the old ones from shrinking in their space. Interestingly, several simple baselines of experience replay can achieve considerable performance. DER [31] stores old training samples together with their logits, and performs logit-matching throughout the optimization trajectory. GDumb [159] greedily collects incoming training samples in a memory buffer and then uses them to train a model from scratch for testing. These efforts can serve as evaluation criteria for subsequent exploration.

The second is generative replay or pseudo-rehearsal, which generally requires training an additional generative model to replay generated data. This is closely related to continual learning of generative models themselves, as they also require incremental updates. DGR [160] provides an initial framework in which learning each generation task is accompanied by replaying generated data sampled from the old generative model, so as to inherit the previously-learned knowledge. MeRGAN [125] further enforces consistency of the generated data sampled with the same random noise between the old and new generative models, similar to the role of function regularization. Besides, other continual learning strategies can be incorporated into generative replay. Weight regularization [25], [55], [161], [162] and experience replay [25], [163] have been shown to be effective in mitigating catastrophic forgetting of generative models. DGMa/DGMw [164] and a follow-up work [162] adopt binary masks to allocate task-specific parameters for overcoming inter-task interference, and an extendable network to ensure scalability. If pre-training is available, it can provide a relatively stable
by inverting a frozen copy of the old classification model [247], [248], which usually further incorporates knowledge distillation to compensate for the information lost in model inversion. Other methods exploit the class-wise statistics of feature representations to obtain a balanced classifier, such as by imposing the representations to be transferable and invariant [173], [176], or by explicitly compensating for the representation shifts [174], [249].

B. Scarcity of Labeled Data

Most of the current continual learning settings assume that incremental tasks have sufficiently large amounts of labeled data, which is often expensive and difficult to obtain in practical applications. For this reason, there is a growing body of work focusing on the scarcity of labeled data in continual learning. A representative scenario is called Few-Shot CIL (FSCIL) [24], where the model first learns some base classes for initialization with a large number of training samples, and then learns a sequence of novel classes with only a few training samples. The extremely limited training samples exacerbate the overfitting of previously-learned representations to subsequent tasks, which can be alleviated by recent work such as preserving the topology of representations [24], constructing an exemplar relation graph for knowledge distillation [250], selectively updating only the unimportant parameters [251] or stabilizing the important parameters [252], updating parameters within the flat region of the loss landscape [84], meta-learning of a good initialization [253], as well as generative replay [254].

Many other efforts keep the initialized backbone fixed in subsequent continual learning, so as to decouple the learning of representation and classifier. Following this idea, representative strategies can be separated into two aspects. The first is to obtain compatible and extensible representations from massive base classes, such as by enforcing the representations to be compatible with simulated incremental tasks [255], reserving the feature space with virtual prototypes for future classes [256], using an angular penalty loss with data augmentation [257], providing extra constraints from margin-based representations [258], etc. The second is to obtain an adaptive classifier from a sequence of novel classes, such as by evolving the classifier weights with a graph attention network [259], performing hyperdimensional computing [260], sampling stochastic classifiers [261], etc. Besides, auxiliary information such as semantic word vectors [262], [263] and sketch exemplars [264] can be incorporated to enrich the limited training samples.

In addition to the few labeled data, there is usually a large amount of unlabeled data available and collected over time. The first practical setting is called Semi-Supervised Continual Learning (SSCL) [25], which considers incremental data as partially labeled. As an initial attempt, ORDisCo [25] learns a semi-supervised classification model together with a conditional GAN for generative replay, and regularizes discriminator consistency to mitigate catastrophic forgetting. Subsequent work includes training an adversarial autoencoder to reconstruct images [265], imposing predictive consistency among augmented and interpolated data [266], and leveraging the nearest-neighbor classifier to distill class-instance relationships [267]. The second scenario assumes that there is an external unlabeled dataset to facilitate supervised continual learning, e.g., by knowledge distillation [124] and data augmentation [268]. The third scenario is to learn representations from incremental unlabeled data [28], [215], [216], which has become an increasingly important topic for updating pre-trained knowledge in foundation models.

C. Generic Learning Paradigm

Potential challenges of the learning paradigm can be summarized in a broad concept called General Continual Learning (GCL) [12], [20], [31], where the model observes incremental data in an online fashion without explicit task boundaries. Correspondingly, GCL consists of two interconnected settings: Task-Free Continual Learning (TFCL) [20], where the task identities are not accessible in either training or testing; and Online Continual Learning (OCL) [21], where the training samples are observed in a one-pass data stream. Since TFCL usually accesses only a small batch of training samples at a time for gradual changes in task distributions, while OCL usually requires only the data label rather than the task identity for each training sample, many methods for TFCL and OCL are compatible, summarized with their applicable scenarios in Appendix Table 3, available online.

Some of them attempt to learn specialized parameters in a growing architecture. CN-DPM [80] adopts Dirichlet process mixture models to construct a growing number of neural experts, while a concurrent work [269] derives such mixture models from a probabilistic meta-learner. VariGrow [270] employs an energy-based novelty score to decide whether to extend a new expert or update an old one. ODDL [271] estimates the discrepancy between the current memory buffer and the previously-learned knowledge as an expansion signal. InstAParam [272] selects and consolidates appropriate network paths for individual training samples.

On the other hand, many efforts are built on experience replay, focusing on construction, management and exploitation of a memory buffer. Since training samples of the same distribution arrive in small batches, the information of task boundaries is less effective, and reservoir sampling usually serves as an effective baseline strategy for sample selection. More advanced strategies prioritize the replay of those training samples that are informative [273], diversified in parameter gradients [21], balanced in class labels [159], [274], and beneficial to latent decision boundaries [134]. Meanwhile, the memory buffer can be dynamically managed, such as by removing less important training samples [42], editing the old training samples to be more likely forgotten [144], [275], and retrieving the old training samples that are susceptible to interference [145], [276]. To better exploit the memory buffer, representative strategies include calibrating features with task-specific parameters [277], performing knowledge distillation [31], [278], improving representations with contrastive learning [276], [279], using asymmetric cross-entropy [280] or constrained gradient directions [45], [63] of the old and new training samples, repeated rehearsal with data augmentation [281], properly adjusting the learning rate [42], etc.
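As a minimal illustration of the reservoir-sampling baseline for buffer construction discussed above, the following sketch maintains a fixed-capacity buffer in which every sample observed so far is retained with equal probability. It is an illustrative sketch, not code from any surveyed method; the class and method names are assumptions.

```python
import random

class ReservoirBuffer:
    def __init__(self, capacity):
        self.capacity = capacity   # memory budget m
        self.n_seen = 0            # number of stream samples observed so far
        self.data = []             # stored (x, y) pairs

    def add(self, x, y):
        """Classical reservoir sampling: keep each sample with prob. m / n_seen."""
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            idx = random.randint(0, self.n_seen - 1)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        """Draw a replay mini-batch to interleave with the incoming batch."""
        k = min(batch_size, len(self.data))
        return random.sample(self.data, k)
```

In a typical online setup, each incoming mini-batch would be interleaved with a mini-batch drawn via sample() for replay before the new samples are passed to add().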
repetitive training to capture a distribution, especially for some hard examples [3]. Recent work has shown that the one-pass performance of visual classification can be largely improved by experience replay of hard examples [297] or orthogonal gradient projection [298]. Similarly, resolving within-task catastrophic forgetting can facilitate reinforcement learning [299], [300] and stabilize the training of GANs [301], [302].

Meanwhile, continual learning is relevant to two important directions of privacy protection. The first is Federated Learning [303], where the server and clients are not allowed to communicate with data. A typical scenario is that the server aggregates the locally trained parameters from multiple clients into a single model and then sends it back. As the incremental data collected by clients is dynamic and variable, federated learning needs to overcome catastrophic forgetting and facilitate knowledge transfer across clients, which can be achieved by continual learning strategies such as model decomposition [304] and knowledge distillation [305], [306]. The second is Machine Unlearning, which aims to forget the influence of specific training samples when their access is lost, without affecting other knowledge. Many efforts in this direction are closely related to continual learning, such as learning separate models with subsets of training samples [307], utilizing historical parameters and gradients [308], removing old knowledge from parameters with the Fisher information matrix [309], adding adaptive parameters on a pre-trained model [310], etc. On the other hand, continual learning may suffer from data leakage and privacy invasion as it retains all old knowledge. Mnemonic Code [311] embeds a class-specific code when learning each class, enabling classes to be selectively forgotten by discarding the corresponding codes. LIRF [312] designs a distillation framework to remove specific old knowledge and store it in a pruned lightweight network for selective recovery.

As a strategy for adapting to variable inputs, continual learning can assist a robust model to eliminate or resist external disturbances [30], [313], [314]. In fact, robustness and continual learning are intrinsically linked, as they correspond to generalizability in the spatial and temporal dimensions, respectively. Many ideas for improving robustness to adversarial examples have been used to improve continual learning, such as flat minima [31], [81], model ensembles [69], Lipschitz continuity [157] and adversarial training [158]. Subsequent work could further interconnect excellent ideas from both fields, e.g., designing particular algorithms to actively "forget" [52] external disturbances.

C. Cross-Directional Prospect

Continual learning demonstrates vigorous vitality, as most of the state-of-the-art AI models require flexible and efficient updates, and their advances have in turn contributed to the development of continual learning. Here, we discuss some attractive intersections of continual learning with other topics of the broad AI community:

Diffusion Model [315] is a rising state-of-the-art generative model, which constructs a Markov chain of discrete steps to progressively add random noise to the input and learns to gradually remove the noise to restore the original data distribution. This provides a new target for continual learning, and its outstanding performance in conditional generation can also facilitate the efficacy of generative replay.

Foundation Model, such as GPT [316] and CLIP [218], demonstrates impressive performance in a variety of downstream tasks with the use of large-scale pre-training. The pre-training data is usually huge in volume and collected incrementally, creating urgent demands for efficient updates. On the other hand, an increasing scale of pre-training would facilitate knowledge transfer and mitigate catastrophic forgetting for downstream continual learning. Transformer-Based Architecture [317] has proven effective for both language and vision domains, and has become the dominant choice for state-of-the-art foundation models. This requires specialized designs to overcome catastrophic forgetting while providing new insights for maintaining task specificity in continual learning. Parameter-efficient fine-tuning techniques originally developed in the field of NLP are being widely adapted to continual learning.

Embodied AI [318] aims to enable AI systems to learn from interactions with the physical environment, rather than from static datasets collected primarily from the Internet. The study of general continual learning helps embodied agents to learn from an egocentric perception similar to humans, and provides a unique opportunity for researchers to pursue the essence of lifelong learning by observing the same person over a long time span. Advances in Neuroscience provide important inspiration for the development of continual learning, as biological learning is naturally on a continual basis [1], [3]. The underlying mechanisms include multiple levels from synaptic plasticity to regional collaboration, detailed in Appendix D, available online. With a deeper understanding of the biological brain, more "natural algorithms" can be explored to facilitate continual learning of AI systems.

VII. CONCLUSION

In this work, we present an up-to-date and comprehensive survey of continual learning, bridging the latest advances in theory, method and application. We summarize both general objectives and particular challenges in this field, with an extensive analysis of how representative strategies address them. Encouragingly, we observe a growing and widespread interest in continual learning from the broad AI community, bringing novel understandings, diversified applications and cross-directional opportunities. Based on such a holistic perspective, we expect the development of continual learning to eventually empower AI systems with human-like adaptability, responding flexibly to real-world dynamics and evolving themselves in a lifelong manner.

REFERENCES

[1] D. Kudithipudi et al., "Biological underpinnings for lifelong learning machines," Nat. Mach. Intell., vol. 4, pp. 196–210, 2022.
[2] G. I. Parisi et al., "Continual lifelong learning with neural networks: A review," Neural Netw., vol. 113, pp. 54–71, 2019.
[3] R. Hadsell et al., "Embracing change: Continual learning in deep neural networks," Trends Cogn. Sci., vol. 24, pp. 1028–1040, 2020.
[4] G. M. Van de Ven and A. S. Tolias, "Three scenarios for continual learning," 2019, arXiv:1904.07734.
[5] Z. Chen and B. Liu, Lifelong Machine Learning. San Rafael, CA, USA: Morgan & Claypool, 2018.
[6] M. McCloskey and N. J. Cohen, "Catastrophic interference in connectionist networks: The sequential learning problem," Psychol. Learn. Motivation, vol. 24, pp. 109–165, 1989.
[7] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly, "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory," Psychol. Rev., vol. 102, pp. 419–457, 1995.
[8] M. Mundt et al., "A wholistic view of continual learning with deep neural networks: Forgotten lessons and the bridge to active and open world learning," Neural Netw., vol. 160, pp. 306–336, 2023.
[9] T. L. Hayes et al., "Replay in deep learning: Current approaches and missing biological elements," Neural Comput., vol. 33, pp. 2908–2950, 2021.
[10] P. Jedlicka et al., "Contributions by metaplasticity to solving the catastrophic forgetting problem," Trends Neurosci., vol. 45, pp. 656–666, 2022.
[11] M. Masana, B. Twardowski, and J. Van de Weijer, "On class orderings for incremental learning," 2020, arXiv:2007.02145.
[12] M. De Lange et al., "A continual learning survey: Defying forgetting in classification tasks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3366–3385, Jul. 2022.
[13] H. Qu et al., "Recent advances of continual learning in computer vision: An overview," 2021, arXiv:2109.11369.
[14] Z. Mai et al., "Online continual learning in image classification: An empirical survey," Neurocomputing, vol. 469, pp. 28–51, 2022.
[15] M. Biesialska, K. Biesialska, and M. R. Costa-jussà, "Continual lifelong learning in natural language processing: A survey," in Proc. Int. Conf. Comput. Linguistics, 2020, pp. 6523–6541.
[16] Z. Ke and B. Liu, "Continual learning of natural language processing tasks: A survey," 2022, arXiv:2211.12701.
[17] K. Khetarpal et al., "Towards continual reinforcement learning: A review and perspectives," J. Artif. Intell. Res., vol. 75, pp. 1401–1476, 2022.
[18] V. Lomonaco and D. Maltoni, "Core50: A new dataset and benchmark for continuous object recognition," in Proc. Conf. Robot Learn., 2017, pp. 17–26.
[19] Y.-C. Hsu et al., "Re-evaluating continual learning scenarios: A categorization and case for strong baselines," 2018, arXiv:1810.12488.
[20] R. Aljundi, K. Kelchtermans, and T. Tuytelaars, "Task-free continual learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 11246–11255.
[21] R. Aljundi et al., "Gradient based sample selection for online continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 1058.
[22] Y. Sun et al., "ERNIE 2.0: A continual pre-training framework for language understanding," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 8968–8975.
[23] P. Singh, P. Mazumder, P. Rai, and V. P. Namboodiri, "Rectification-based knowledge retention for continual learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15277–15286.
[24] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong, "Few-shot class-incremental learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12180–12189.
[25] L. Wang, K. Yang, C. Li, L. Hong, Z. Li, and J. Zhu, "ORDisCo: Effective and efficient usage of incremental unlabeled data for semi-supervised continual learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5379–5388.
[26] K. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, "Towards open world object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5826–5836.
[27] D. Rao et al., "Continual unsupervised representation learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 687.
[28] D. Hu et al., "How well does self-supervised pre-training perform with streaming data?," in Proc. Int. Conf. Learn. Representations, 2021.
[29] C. D. Kim, J. Jeong, and G. Kim, "Imbalanced continual learning with partitioning reservoir sampling," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 411–428.
[30] C. D. Kim, J. Jeong, S. Moon, and G. Kim, "Continual learning on noisy data streams via self-purified replay," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 517–527.
[31] P. Buzzega et al., "Dark experience for general continual learning: A strong, simple baseline," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1335.
[32] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi, "Rainbow memory: Continual learning with a memory of diverse samples," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8214–8223.
[33] M. Abdelsalam, M. Faramarzi, S. Sodhani, and S. Chandar, "IIRC: Incremental implicitly-refined classification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11033–11042.
[34] M. Liang et al., "Balancing between forgetting and acquisition in incremental subpopulation learning," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 364–380.
[35] Z. Ke, B. Liu, and X. Huang, "Continual learning of a mixed sequence of similar and dissimilar tasks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1553.
[36] X. Liu et al., "Long-tailed class incremental learning," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 495–512.
[37] S. Roy et al., "Class-incremental novel class discovery," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 317–333.
[38] K. Joseph et al., "Novel class discovery without forgetting," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 570–586.
[39] T. Srinivasan et al., "CLiMB: A continual learning benchmark for vision-and-language tasks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 29440–29453.
[40] F. Mi, L. Kong, T. Lin, K. Yu, and B. Faltings, "Generalized class incremental learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 970–974.
[41] J. Xie, S. Yan, and X. He, "General incremental learning with domain-aware categorical representations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 14331–14340.
[42] H. Koh et al., "Online continual learning on class incremental blurry task configuration with anytime inference," in Proc. Int. Conf. Learn. Representations, 2021.
[43] M. Caccia et al., "Online fast adaptation and knowledge accumulation (OSAKA): A new approach to continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1387.
[44] A. Chaudhry et al., "Riemannian walk for incremental learning: Understanding forgetting and intransigence," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 556–572.
[45] D. Lopez-Paz and M. Ranzato, "Gradient episodic memory for continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6470–6479.
[46] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, "Learning a unified classifier incrementally via rebalancing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 831–839.
[47] A. Douillard et al., "PODNet: Pooled outputs distillation for small-tasks incremental learning," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–102.
[48] T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin, Germany: Springer, 2009.
[49] J. Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," Proc. Nat. Acad. Sci. USA, vol. 114, pp. 3521–3526, 2017.
[50] H. Ritter, A. Botev, and D. Barber, "Online structured laplace approximations for overcoming catastrophic forgetting," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 3742–3752.
[51] T.-C. Kao et al., "Natural continual learning: Success is a journey, not (just) a destination," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 28067–28079.
[52] L. Wang et al., "AFEC: Active forgetting of negative transfer in continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 22379–22391.
[53] F. Huszár, "On quadratic penalties in elastic weight consolidation," 2017, arXiv:1712.03847.
[54] J. Martens and R. Grosse, "Optimizing neural networks with kronecker-factored approximate curvature," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2408–2417.
[55] C. V. Nguyen et al., "Variational continual learning," in Proc. Int. Conf. Learn. Representations, 2018.
[56] T. Adel, H. Zhao, and R. E. Turner, "Continual learning with adaptive weights (CLAW)," in Proc. Int. Conf. Learn. Representations, 2019.
[57] R. Kurle et al., "Continual learning with Bayesian neural networks for non-stationary data," in Proc. Int. Conf. Learn. Representations, 2019.
[58] N. Loo, S. Swaroop, and R. E. Turner, "Generalized variational continual learning," in Proc. Int. Conf. Learn. Representations, 2020.
[59] S. Kapoor, T. Karaletsos, and T. D. Bui, "Variational auto-regressive Gaussian processes for continual learning," in Proc. Int. Conf. Mach. Learn., 2021, pp. 5290–5300.
[60] T. G. Rudner et al., "Continual learning via sequential function-space variational inference," in Proc. Int. Conf. Mach. Learn., 2022, pp. 18871–18887.
[61] H. Tseran et al., "Natural variational continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst. Workshop, 2018.
[62] J. Knoblauch, H. Husain, and T. Diethe, "Optimal continual learning has perfect memory and is NP-HARD," in Proc. Int. Conf. Mach. Learn., 2020, Art. no. 494.
[63] A. Chaudhry et al., "Efficient lifelong learning with A-GEM," in Proc. Int. Conf. Learn. Representations, 2018.
[64] S. Tang, D. Chen, J. Zhu, S. Yu, and W. Ouyang, "Layerwise optimization by gradient decomposition for continual learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9629–9638.
[65] G. Zeng et al., "Continual learning of context-dependent processing in neural networks," Nat. Mach. Intell., vol. 1, pp. 364–372, 2019.
[66] S. Wang, X. Li, J. Sun, and Z. Xu, "Training networks in null space of feature covariance for continual learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 184–193.
[67] M. Farajtabar et al., "Orthogonal gradient descent for continual learning," in Proc. Int. Conf. Artif. Intell. Statist., 2020, pp. 3762–3773.
[68] R. Ramesh and P. Chaudhari, "Model zoo: A growing brain that learns continually," in Proc. Int. Conf. Learn. Representations, 2021.
[69] L. Wang et al., "CoSCL: Cooperation of small continual learners is stronger than a big one," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 254–271.
[70] A. A. Rusu et al., "Progressive neural networks," 2016, arXiv:1606.04671.
[71] J. Serra et al., "Overcoming catastrophic forgetting with hard attention to the task," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4548–4557.
[72] C. Fernando et al., "PathNet: Evolution channels gradient descent in super neural networks," 2017, arXiv:1701.08734.
[73] S. Ebrahimi et al., "Uncertainty-guided continual learning with Bayesian neural networks," in Proc. Int. Conf. Learn. Representations, 2019.
[74] C. Henning et al., "Posterior meta-replay for continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 14135–14149.
[75] G. Kim et al., "A theoretical study on solving continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 5065–5079.
[76] F. D'Angelo and C. Henning, "Uncertainty-based out-of-distribution detection requires suitable function space priors," 2021, arXiv:2110.06020.
[77] G. Kim et al., "Learnability and algorithm for continual learning," in Proc. Int. Conf. Mach. Learn., 2023, Art. no. 694.
[78] K. J. Joseph and V. N. Balasubramanian, "Meta-consolidation for continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1205.
[79] Z. Gong et al., "Continual pre-training of language models for math problem understanding with syntax-aware memory network," in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics, 2022, pp. 5923–5933.
[80] S. Lee et al., "A neural dirichlet process mixture model for task-free continual learning," in Proc. Int. Conf. Learn. Representations, 2019.
[81] S. I. Mirzadeh et al., "Understanding the role of training regimes in continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 613.
[82] S. Hochreiter and J. Schmidhuber, "Flat minima," Neural Comput., vol. 9, pp. 1–42, 1997.
[83] N. S. Keskar et al., "On large-batch training for deep learning: Generalization gap and sharp minima," in Proc. Int. Conf. Learn. Representations, 2017.
[84] G. Shi et al., "Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 6747–6761.
[85] D. Deng et al., "Flattening sharpness for dynamic gradient projection memory benefits continual learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 18710–18721.
[86] S. I. Mirzadeh et al., "Linear mode connectivity in multitask and continual learning," in Proc. Int. Conf. Learn. Representations, 2020.
[87] S. V. Mehta et al., "An empirical investigation of the role of pre-training in lifelong learning," 2021, arXiv:2112.09153.
[88] S. Cha et al., "CPR: Classifier-projection regularization for continual learning," in Proc. Int. Conf. Learn. Representations, 2020.
[89] V. V. Ramasesh, A. Lewkowycz, and E. Dyer, "Effect of scale on catastrophic forgetting in neural networks," in Proc. Int. Conf. Learn. Representations, 2021.
[90] S. I. Mirzadeh et al., "Wide neural networks forget less catastrophically," in Proc. Int. Conf. Mach. Learn., 2022, pp. 15699–15717.
[91] Q. Pham, C. Liu, and S. Hoi, "DualNet: Continual learning, fast and slow," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 16131–16144.
[92] H. Cha, J. Lee, and J. Shin, "Co2L: Contrastive continual learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9496–9505.
[93] D. Madaan et al., "Representational continuity for unsupervised continual learning," in Proc. Int. Conf. Learn. Representations, 2021.
[94] K. Ramakrishnan, R. Panda, Q. Fan, J. Henning, A. Oliva, and R. Feris, "Relationship matters: Relation guided knowledge transfer for incremental learning of object detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 1009–1018.
[95] A. Pentina and C. Lampert, "A PAC-Bayesian bound for lifelong learning," in Proc. Int. Conf. Mach. Learn., 2014, pp. 991–999.
[96] M. A. Bennani, T. Doan, and M. Sugiyama, "Generalisation guarantees for continual learning with orthogonal gradient descent," in Proc. Int. Conf. Mach. Learn. Workshops, 2020.
[97] T. Doan et al., "A theoretical analysis of catastrophic forgetting through the NTK overlap matrix," in Proc. Int. Conf. Artif. Intell. Statist., 2021, pp. 1072–1080.
[98] R. Karakida and S. Akaho, "Learning curves for continual learning in neural networks: Self-knowledge transfer and forgetting," in Proc. Int. Conf. Learn. Representations, 2022.
[99] S. Lee, S. Goldt, and A. Saxe, "Continual learning in the teacher-student setup: Impact of task similarity," in Proc. Int. Conf. Mach. Learn., 2021, pp. 6109–6119.
[100] M. Wołczyk et al., "Continual learning with guarantees via weight interval constraints," in Proc. Int. Conf. Mach. Learn., 2022, pp. 23897–23911.
[101] T. Doan et al., "Efficient continual learning ensembles in neural network subspaces," 2022, arXiv:2202.09826.
[102] L. Peng, P. Giampouras, and R. Vidal, "The ideal continual learner: An agent that never forgets," in Proc. Int. Conf. Mach. Learn., 2023, pp. 27585–27610.
[103] L. Wang et al., "Incorporating neuro-inspired adaptability for continual learning in artificial intelligence," Nat. Mach. Intell., vol. 5, pp. 1356–1368, 2023.
[104] J. Schwarz et al., "Progress & compress: A scalable framework for continual learning," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4528–4537.
[105] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in Proc. Int. Conf. Mach. Learn., 2017, pp. 3987–3995.
[106] R. Aljundi et al., "Memory aware synapses: Learning what (not) to forget," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 139–154.
[107] F. Benzing, "Unifying importance based regularisation methods for continual learning," in Proc. Int. Conf. Artif. Intell. Statist., 2022, pp. 2372–2396.
[108] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. López, and A. D. Bagdanov, "Rotate your networks: Better weight consolidation and less catastrophic forgetting," in Proc. Int. Conf. Pattern Recognit., 2018, pp. 2262–2268.
[109] J. Lee, H. G. Hong, D. Joo, and J. Kim, "Continual learning with extended kronecker-factored approximate curvature," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8998–9007.
[110] D. Park, S. Hong, B. Han, and K. M. Lee, "Continual learning by asymmetric loss approximation with single-side overestimation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 3334–3343.
[111] S.-W. Lee et al., "Overcoming catastrophic forgetting by incremental moment matching," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4652–4662.
[112] J. Lee et al., "Residual continual learning," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 4553–4560.
[113] G. Lin, H. Chu, and H. Lai, "Towards better plasticity-stability trade-off in incremental learning: A simple linear connector," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 89–98.
[114] I. Paik et al., "Overcoming catastrophic forgetting by neuron-level plasticity control," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5339–5346.
[115] H. Ahn et al., "Uncertainty-based continual learning with adaptive regularization," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 4394–4404.
[116] S. Jung et al., "Continual learning with node-importance based adaptive group sparse regularization," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 3647–3658.
[117] J. Gou et al., "Knowledge distillation: A survey," Int. J. Comput. Vis., vol. 129, pp. 1789–1819, 2021.
[118] Z. Li and D. Hoiem, "Learning without forgetting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2935–2947, Dec. 2018.
Liyuan Wang received the BS and PhD degrees from Tsinghua University. He is currently a postdoc at Tsinghua University, working with Prof. Jun Zhu in the Department of Computer Science and Technology. His research interests include continual learning, incremental learning, lifelong learning, and brain-inspired AI. His work on continual learning has been published in major conferences and journals in related fields, such as Nature Machine Intelligence, NeurIPS, ICLR, CVPR, ICCV, and ECCV.
Xingxing Zhang received the BE and PhD degrees from the Institute of Information Science, Beijing Jiaotong University, in 2015 and 2020, respectively. She was also a visiting student with the Department of Computer Science, University of Rochester, from 2018 to 2019, and a postdoc with the Department of Computer Science and Technology, Tsinghua University, from 2020 to 2022. Her research interests include continual learning and zero/few-shot learning. She received the excellent PhD thesis award from the Chinese Institute of Electronics in 2020.
Hang Su (Member, IEEE) is an associate professor with the Department of Computer Science and Technology, Tsinghua University. His research interests lie in adversarial machine learning and robust computer vision, on which he has published more than 50 papers in venues including CVPR, ECCV, and IEEE Transactions on Medical Imaging. He has served as an area chair for NeurIPS and a workshop co-chair for AAAI 2022. He received the Young Investigator Award at MICCAI 2012, the Best Paper Award at AVSS 2012, and the Platinum Best Paper Award at ICME 2018.
Jun Zhu (Fellow, IEEE) received the BS and PhD degrees from the Department of Computer Science and Technology, Tsinghua University, where he is currently a Bosch AI professor. He was a postdoctoral fellow and adjunct faculty with the Machine Learning Department, Carnegie Mellon University. His research interests primarily lie in developing machine learning methods to understand scientific and engineering data arising from various fields. He regularly serves as a senior area chair and area chair for prestigious conferences, including ICML, NeurIPS, ICLR, IJCAI, and AAAI. He was selected as one of “AI’s 10 to Watch” by IEEE Intelligent Systems. He is a fellow of AAAI and an associate editor-in-chief of IEEE Transactions on Pattern Analysis and Machine Intelligence.