0% found this document useful (0 votes)
54 views

A Comprehensive Survey of Continual Learning Theory Method and Application

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
54 views

A Comprehensive Survey of Continual Learning Theory Method and Application

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

5362 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO.

8, AUGUST 2024

A Comprehensive Survey of Continual Learning:


Theory, Method and Application
Liyuan Wang , Xingxing Zhang , Hang Su , Member, IEEE, and Jun Zhu , Fellow, IEEE

(Survey Paper)

Abstract—To cope with real-world dynamics, an intelligent sys-


tem needs to incrementally acquire, update, accumulate, and ex-
ploit knowledge throughout its lifetime. This ability, known as
continual learning, provides a foundation for AI systems to develop
themselves adaptively. In a general sense, continual learning is
explicitly limited by catastrophic forgetting, where learning a new
task usually results in a dramatic performance drop of the old
tasks. Beyond this, increasingly numerous advances have emerged
in recent years that largely extend the understanding and appli-
cation of continual learning. The growing and widespread interest
in this direction demonstrates its realistic significance as well as
complexity. In this work, we present a comprehensive survey of
continual learning, seeking to bridge the basic settings, theoretical
foundations, representative methods, and practical applications.
Based on existing theoretical and empirical results, we summarize
the general objectives of continual learning as ensuring a proper
stability-plasticity trade-off and an adequate intra/inter-task gen-
eralizability in the context of resource efficiency. Then we provide Fig. 1. Conceptual framework of continual learning. (a), Continual learn-
a state-of-the-art and elaborated taxonomy, extensively analyzing ing requires adapting to incremental tasks with dynamic data distributions
how representative strategies address continual learning, and how (Section II). (b), A desirable solution should ensure an appropriate trade-off
they are adapted to particular challenges in various applications. between stability (red arrow) and plasticity (green arrow), as well as an ade-
Through an in-depth discussion of promising directions, we believe quate generalizability to intra-task (blue arrow) and inter-task (orange arrow)
that such a holistic perspective can greatly facilitate subsequent distribution differences (Section III). (c), To achieve the objective of continual
exploration in this field and beyond. learning, representative methods have targeted various aspects of machine
learning (Section IV). (d), Continual learning is adapted to practical applications
Index Terms—Continual learning, incremental learning, lifelong to address particular challenges such as scenario complexity and task specificity
learning, catastrophic forgetting. (Section V).

changes, evolution has empowered human and other organ-


I. INTRODUCTION isms with strong adaptability to continually acquire, update,
EARNING is the basis for intelligent systems to accom- accumulate and exploit knowledge [1], [2], [3]. Naturally, we
L modate dynamic environments. In response to external expect artificial intelligence (AI) systems to adapt in a similar
way. This motivates the study of continual learning, where a
typical setting is to learn a sequence of contents one by one
Manuscript received 13 February 2023; revised 13 September 2023; accepted and behave as if they were observed simultaneously (Fig. 1(a)).
5 February 2024. Date of publication 26 February 2024; date of current version
2 July 2024. This work was supported in part by the NSFC Projects under Grant Such contents could be new skills, new examples of old skills,
62350080, Grant 62076147, Grant 62106123, and Grant U19A2081, in part by different environments, different contexts, etc., with particular
BNRist under Grant BNR2022RC01006, in part by Tsinghua Institute for Guo realistic challenges incorporated [2], [4]. As the contents are
Qiang, Tsinghua-OPPO Joint Research Center for Future Terminal Technology,
and in part by the High Performance Computing Center, Tsinghua University. provided incrementally over a lifetime, continual learning is also
The work of Liyuan Wang was also supported in part by the Postdoctoral referred to as incremental learning or lifelong learning in much
Fellowship Program of CPSF under Grant GZB20230350 and in part by the of the literature, without a strict distinction [1], [5].
Shuimu Tsinghua Scholar. The work of Jun Zhu was also supported in part
by the New Cornerstone Science Foundation through the XPLORER PRIZE. Unlike conventional machine learning models built on the
Recommended for acceptance by M. Kloft. (Corresponding author: Jun Zhu.) premise of capturing a static data distribution, continual learning
The authors are with the Department of Computer Science & Technology, is characterized by learning from dynamic data distributions. A
Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint Center for ML,
Tsinghua University, Beijing 100190, China (e-mail: [email protected]; major challenge is known as catastrophic forgetting [6], [7],
[email protected]; [email protected]; dcszj@tsinghua. where adaptation to a new distribution generally results in a
edu.cn). largely reduced ability to capture the old ones. This dilemma is
This article has supplementary downloadable material available at
https://doi.org/10.1109/TPAMI.2024.3367329, provided by the authors. a facet of the trade-off between learning plasticity and memory
Digital Object Identifier 10.1109/TPAMI.2024.3367329 stability: an excess of the former interferes with the latter,
0162-8828 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: COMPREHENSIVE SURVEY OF CONTINUAL LEARNING: THEORY, METHOD AND APPLICATION 5363

and vice versa. Beyond simply balancing the “proportions” of up-to-date and comprehensive, which is the primary strength of
these two aspects, a desirable solution for continual learning this work.
should obtain strong generalizability to accommodate distribu- Compared to previous surveys, our improvements lie in the
tion differences within and between tasks (Fig. 1(b)). As a naive following aspects, including (1) Setup: collect and formulate
baseline, reusing all old training samples (if allowed) makes more typical scenarios that have emerged in recent years; (2)
it easy to address the above challenges, but creates huge com- Theory: summarize theoretical efforts on continual learning in
putational and storage overheads, as well as potential privacy terms of stability, plasticity and generalizability; (3) Method: add
issues. In fact, continual learning is primarily intended to ensure optimization-based and representation-based approaches on the
the resource efficiency of model updates, preferably close to top of regularization-based, replay-based and architecture-based
learning only new training samples. approaches, with an extensive analysis of their sub-directions;
A number of continual learning methods have been pro- (4) Application: summarize practical applications of continual
posed in recent years for various aspects of machine learning learning and their particular challenges in terms of scenario
(Fig. 1(c))., which can be conceptually grouped into five ma- complexity and task specificity; (5) Linkage: discuss underlying
jor categories: adding regularization terms with reference to connections between theory, method and application, as well as
the old model (regularization-based approach); approximating promising crossovers with other related fields.
and recovering the old data distributions (replay-based ap- The organization of this paper is described in Fig. 1. We intro-
proach); explicitly manipulating the optimization programs duce basic setups of continual learning in Section II, summarize
(optimization-based approach); learning robust and well- theoretical efforts for its general objectives in Section III, present
generalized representations (representation-based approach); a state-of-the-art and elaborated taxonomy of representative
and constructing task-specific parameters with a properly- methods in Section IV, describe how these methods are adapted
designed architecture (architecture-based approach). This to practical challenges in Section V, and discuss promising
taxonomy extends the commonly-used ones with current ad- directions for subsequent work in Section VI.
vances, and provides refined sub-directions for each category.
We summarize how these methods achieve the objective of II. SETUP
continual learning, with an extensive analysis of their theoretical In this section, we first present a basic formulation of continual
foundations and specific implementations. In particular, these learning. Then we introduce typical scenarios and evaluation
methods are closely connected, e.g., regularization and replay metrics.
ultimately act to rectify the gradient directions in optimization,
and highly synergistic, e.g., the efficacy of replay can be facili- A. Basic Formulation
tated by distilling knowledge from the old model. Continual learning is characterized as learning from dynamic
Realistic applications present particular challenges for con- data distributions. In practice, training samples of different
tinual learning, categorized into scenario complexity and task distributions arrive in sequence. A continual learning model
specificity (Fig. 1(d)). As for the former, for example, the task parameterized by θ needs to learn corresponding task(s) with
identity is probably missing in training and testing, and the no or limited access to old training samples and perform well on
training samples might be introduced in tiny batches or even one their test sets. Formally, an incoming batch of training samples
pass. Due to the expense and scarcity of data labeling, continual belonging to a task t can be represented as Dt,b = {Xt,b , Yt,b },
learning needs to be effective for few-shot, semi-supervised and where Xt,b is the input data, Yt,b is the data label, t ∈ T =
even unsupervised scenarios. As for the latter, although current {1, . . . , k} is the task identity and b ∈ Bt is the batch index
advances mainly focus on visual classification, other represen- (T and Bt denote their space, respectively). Here we define
tative visual domains such as object detection and semantic seg- a “task” by its training samples Dt following the distribution
mentation, as well as other related fields such as conditional gen- Dt := p(Xt , Yt ) (Dt denotes the entire training set by omitting
eration, reinforcement learning (RL) and natural language pro- the batch index, likewise for Xt and Yt ), and assume that there is
cessing (NLP), are receiving increasing attention with their own no difference in distribution between training and testing. Under
characteristics. We summarize these particular challenges and realistic constraints, the data label Yt and the task identity t
analyze how continual learning methods are adapted to them. might not be always available. In continual learning, the training
Considering the significant growth of interest in continual samples of each task can arrive incrementally in batches (i.e.,
learning, we believe that such an up-to-date and comprehensive {{Dt,b }b∈Bt }t∈T ) or simultaneously (i.e., {Dt }t∈T ).
survey can provide a holistic perspective for subsequent work.
Although there are some early surveys of continual learning B. Typical Scenario
with relatively broad coverage [2], [5], [8], important advances According to the division of incremental batches and the
in recent years have not been incorporated. In contrast, the latest availability of task identities, we detail the typical scenarios as
surveys typically collate only a local aspect of continual learning, follows (see Table I for a formal comparison):
with respect to its biological underpinnings [1], [3], [9], [10], r Instance-Incremental Learning (IIL): All training samples
specialized settings for visual classification [11], [12], [13], [14], belong to the same task and arrive in batches.
and specific extensions for NLP [15], [16] or RL [17]. We include r Domain-Incremental Learning (DIL): Tasks have the same
a detailed comparison of previous surveys in Appendix Table 1, data label space but different input distributions. Task
available online. These surveys are typically hard to be both identities are not required.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
5364 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 8, AUGUST 2024

TABLE I
A FORMAL COMPARISON OF TYPICAL CONTINUAL LEARNING SCENARIOS

r Task-Incremental Learning (TIL): Tasks have disjoint data corresponding to the use of multi-head evaluation (e.g., TIL)
label spaces. Task identities are provided in both training or single-head evaluation (e.g., CIL) [44]. The  two metrics
and testing. at the kth task are then defined as AAk = k1 kj=1 ak,j and
r Class-Incremental Learning (CIL): Tasks have disjoint 
AIAk = k1 ki=1 AAi , where AA represents the overall per-
data label spaces. Task identities are only provided in formance at the current moment and AIA further reflects the
training. historical performance.
r Task-Free Continual Learning (TFCL): Tasks have disjoint
Memory stability can be evaluated by forgetting measure
data label spaces. Task identities are not provided in either (FM) [44] and backward transfer (BWT) [45]. As for the former,
training or testing. the forgetting of a task is calculated by the difference between
r Online Continual Learning (OCL): Tasks have disjoint
its maximum performance obtained in the past and its cur-
data label spaces. Training samples for different tasks arrive rent performance: fj,k = maxi∈{1,...,k−1} (ai,j − ak,j ), ∀j < k.
as a one-pass data stream. FM at the kth task is the average forgetting of all old tasks:
r Continual Pre-training (CPT): Pre-training data arrives in 1
k−1
FMk = k−1 j=1 fj,k . As for the latter, BWT evaluates the
sequence. The goal is to improve knowledge transfer to average influence of learning the kth task on all old tasks:
downstream tasks. 1
k−1
BWTk = k−1 j=1 (ak,j − aj,j ), where the forgetting is gen-
If not specified, each task is often assumed to have a sufficient erally reflected as a negative BWT.
number of labeled training samples, i.e., Supervised Continual Learning plasticity can be evaluated by intransience measure
Learning. According to the provided Xt and Yt in each Dt , (IM) [44] and forward transfer (FWT) [45]. IM is defined as the
continual learning is further extended to zero-shot [23], few- inability of a model to learn new tasks, which is calculated by the
shot [24], semi-supervised [25], open-world [26] and un-/self- difference of a task between its joint training performance and
supervised [27], [28] scenarios. Besides, other realistic consider- continual learning performance: IMk = a∗k − ak,k , where a∗k is
ations have been incorporated, such as multiple labels [29], noisy the classification accuracy of a randomly-initialized reference
labels [30], blurred boundary [31], [32], hierarchical granular- model jointly trained with ∪kj=1 Dj for the kth task. In compari-
ity [33] and sub-populations [34], mixture of task similarity [35], son, FWT evaluates the average influence of all old tasks on the
long-tailed distribution [36], novel class discovery [37], [38], 1
k
current kth task: FWTk = k−1 j=2 (aj,j − ãj ), where ãj is
multi-modality [39], etc. Some recent work has focused on the classification accuracy of a randomly-initialized reference
combinatorial scenarios [18], [40], [41], [42], [43] in order to model trained with Dj for the jth task. Note that, ak,j can be
simulate real-world complexity. adapted to other forms depending on the task type, and the above
evaluation metrics should be adjusted accordingly.
C. Evaluation Metric
In general, the performance of continual learning can be evalu-
III. THEORY
ated from three aspects: overall performance of the tasks learned
so far, memory stability of old tasks, and learning plasticity of In this section, we summarize the theoretical efforts on con-
new tasks. Here we summarize the common evaluation metrics, tinual learning, with respect to both stability-plasticity trade-off
using classification as an example. and generalizability analysis, and relate them to the motivations
Overall performance is typically evaluated by average ac- of various continual learning methods.
curacy (AA) [44], [45] and average incremental accuracy
(AIA) [46], [47]. Let ak,j ∈ [0, 1] denote the classification
A. Stability-Plasticity Trade-Off
accuracy evaluated on the test set of the jth task after incre-
mental learning of the kth task (j ≤ k). The output space to With the basic formulation in Section II-A, let’s con-
compute ak,j consists of the classes in either Yj or ∪ki=1 Yi , sider a general setup for continual learning, where a neural

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: COMPREHENSIVE SURVEY OF CONTINUAL LEARNING: THEORY, METHOD AND APPLICATION 5365

also limited by a huge additional resource overhead, as well as


their own catastrophic forgetting and expressiveness.
An alternative choice is to propagate the old data distributions
in updating parameters through formulating continual learning
in a Bayesian framework. Based on a prior p(θ) of the network
parameters, the posterior after observing the kth task can be
computed with Bayes’ rule
k

p(θ|D1:k ) ∝ p(θ) p(Dt |θ) ∝ p(θ|D1:k−1 )p(Dk |θ), (1)
t=1

where the posterior p(θ|D1:k−1 ) of the (k-1)th task becomes


the prior of the kth task, and thus enables the new posterior
p(θ|D1:k ) to be computed with only the current training set
Dk . However, as the posterior is generally intractable (except
very special cases), a common option is to approximate it with
qk−1 (θ) ≈ p(θ|D1:k−1 ), likewise for qk (θ) ≈ p(θ|D1:k ). In the
following, we will introduce two widely-used approximation
strategies:
The first is online Laplace approximation, which approxi-
mates p(θ|D1:k−1 ) as a multivariate Gaussian with local gradient
information [49], [50], [51], [52]. Specifically, we can parame-
terize qk−1 (θ) with φk−1 and construct an approximate Gaussian
Fig. 2. Analysis of critical factors for continual learning. (a), (b), Continual posterior qk−1 (θ) := q(θ; φk−1 ) = N (θ; μk−1 , Λ−1k−1 ) through
learning requires a proper balance between learning plasticity and memory
stability, where excess of either can affect the overall performance. (c), (d), performing a second-order Taylor expansion around the mode
When the converged loss landscape is flatter and the observed data distributions μk−1 ∈ R|θ| of p(θ|D1:k−1 ), where Λk−1 denotes the precision
are more similar, a properly balanced solution can better generalize to the task matrix and φk−1 = {μk−1 , Λk−1 }, likewise for q(θ; φk ), μk and
sequence. (e), The structure of parameter space determines the complexity and
possibility of finding a desirable solution (adapted from [62]). The yellow area Λk . According to (1), the posterior mode for learning the current
indicates the feasible parameter space shared by individual tasks, which tends kth task can be computed as
to be narrow and irregular as more incremental tasks are introduced.
μk = arg maxθ log p(θ|D1:k )
1
network with parameters θ ∈ R|θ| needs to learn k incremen- ≈ arg maxθ log p(Dk |θ) − (θ − μk−1 ) Λk−1 (θ − μk−1 ),
2
tal tasks. The training set and test set of each task are as- (2)
sumed to follow the same distribution Dt , t = 1, . . ., k, where
the training set Dt = {Xt , Yt } = {(xt,n , yt,n )}N t
n=1 includes
which is updated recursively from μk−1 and Λk−1 . Meanwhile,
Nt data-label pairs.  The objective is to learn a probabilistic Λk is updated recursively from Λk−1
model p(D1:k |θ) = kt=1 p(Dt |θ) that can perform well on 
all tasks denoted as D1:k := {D1 , . . ., Dk }. The task-related Λk = −∇2θ log p(θ|D1:k )θ=μk

performance for  discriminative models can be expressed as ≈ −∇2θ log p(Dk |θ)θ=μk + Λk−1 , (3)
log p(Dt |θ) = N t
n=1 log p |x
θ t,n t,n ). The central challenge
(y
of continual learning generally arises from the sequential nature where the first term on the right side is the Hessian of the negative
of learning: when learning the kth task from Dk , the old training log likelihood of Dk at μk , denoted as H(Dk , μk ). In practice,
sets {D1 , . . ., Dk−1 } are inaccessible. Therefore, it is critical H(Dk , μk ) is often computationally inefficient due to the great
but extremely challenging to capture the distributions of both dimensionality of R|θ| , and there is no guarantee that the approx-
old and new tasks in a balanced manner, i.e., ensuring a proper imated Λk is positive semi-definite for the Gaussian assumption.
stability-plasticity trade-off, where excessive learning plasticity To overcome these issues, the Hessian can be approximated by
or memory stability can largely compromise each other (Fig. 2(a) the Fisher information matrix (FIM)
and (b)). 
A straightforward idea is to approximate and recover the old Fk = E[∇θ log p(Dk |θ)∇θ log p(Dk |θ) ]θ=μk ≈ H(Dk , μk ).
data distributions by storing a few old training samples or train- (4)
ing a generative model, known as the replay-based approach For ease of computation, the FIM can be further simplified with
(Section IV-B). According to the learning theory for supervised a diagonal approximation [49], [53] or a Kronecker-factored
learning [48], the performance of an old task is improved with approximation [50], [54]. Then, (2) is implemented by saving
replaying more old training samples that approximate its dis- a frozen copy of the old model μk−1 to regularize parameter
tribution, but resulting in potential privacy issues and a linear changes, known as the regularization-based approach (Sec-
increase in resource overhead. The use of generative models is tion IV-A). Here, we use EWC [49] as an example and present

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
5366 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 8, AUGUST 2024

its loss function following update rule for a learning rate λ:


λ
LEWC (θ) = k (θ) + (θ − μk−1 ) F̂1:k−1 (θ − μk−1 ), (5) θ ← θ + λ[Λ−1
k−1 ∇θ k (θ) − (θ − μk−1 )], (9)
2
where the first term encourages the parameter changes predom-
k−1 k denotes the task-specific loss, the FIM F̂1:k−1 =
where
t=1 diag(Ft ) with a diagonal approximation diag(·) of each
inantly in directions that do not interfere with the old tasks via
Ft , and λ is a hyperparameter to control the strength. a preconditioner Λ−1k−1 , while the second term enforces θ to stay
The second is online variational inference (VI) [55], [56], close to the old task solution μk−1 .
[57], [58], [59], [60]. A representative way for online VI is to Of note, the above analyses are mainly based on finding a
minimize the following KL-divergence over a family Q that shared solution for all tasks, which is subject to severe inter-task
satisfies p(θ|D1:k ) ∈ Q at the current kth task: interference [52], [68], [69]. In contrast, incremental tasks can
also be learned in a (partially) separated way, which is the dom-
1
qk (θ) = arg minq∈Q KL(q(θ) qk−1 (θ)p(Dk |θ)), (6) inant idea of the architecture-based approach (Section IV-E).
Zk
This can be formulated as constructing a continual learning
where Zk is the normalizing constant of qk−1 (θ)p(Dk |θ). In model with parameters θ = ∪kt=1 θ(t) , where θ(t) = {e(t) , ψ},
practice, the above minimization can be achieved by using an ad- e(t) is the task-specific parameters, and ψ is the task-sharing
ditional Monte Carlo approximation, with specifying qk (θ) := parameters. The task-sharing parameters ψ are omitted in some
q(θ; φk ) = N (θ; μk , Λ−1
k ) as a multivariate Gaussian. Here we cases, where the task-specific parameters e(i) and e(j) (i < j)
use VCL [55] as an example, which minimizes the following may overlap to enable parameter reuse and knowledge transfer.
objective (i.e., maximize its negative): The overlapping part e(i) ∩ e(j) is usually frozen when learning
the jth task to avoid catastrophic forgetting [70], [71]. Then,
LVCL (qk (θ)) = Eqk (θ) (k (θ)) + KL(qk (θ) qk−1 (θ)), (7)
each task can be performed as p(Dt |θ(t) ) instead of p(Dt |θ) if
where the KL-divergence can be computed in a closed-form and given the task identity IDt , in which the conflicts between tasks
serves as an implicit regularization term. In particular, although can be explicitly controlled or even completely avoided
the loss functions of (5) and (7) take similar forms, the former k

is a local approximation at a set of deterministic parameters θ, p(Dt |θ) = p(Dt |IDt = i, θ)p(IDt = i|θ)
while the latter is computed by sampling from the variational i=1
distribution qk (θ). This is attributed to the fundamental differ-
ence between the two approximation strategies [55], [61]. = p(Dt |IDt = t, θ)p(IDt = t|Dt , θ)
In essence, the constraint on continual learning for either = p(Dt |θ(t) )p(IDt = t|Dt , θ). (10)
replay or regularization is ultimately reflected in gradient di-
rections. As a result, some recent work directly manipulates However, there are two major challenges. The first is the scal-
the gradient-based optimization process, categorized as the ability of model size due to the progressive allocation of θ(t) ,
optimization-based approach (Section IV-C). Specifically, when which depends on the sparsity of e(t) , reusability of e(i) ∩ e(j)
a few old training samples Mt for task t are maintained in a (i < j), and transferability of ψ. The second is the accuracy of
memory buffer, gradient directions of the new training samples task-identity prediction, denoted as p(IDt = t|Dt , θ). Except for
are encouraged to stay close to that of the Mt [45], [63], [64]. the TIL setting that always provides the task identity IDt [70],
This is formulated as ∇θ Lk (θ; Dk ), ∇θ Lk (θ; Mt ) ≥ 0 for [71], [72], [73], other scenarios generally require the model
t ∈ {1, . . ., k − 1}, so as to essentially enforce non-increase in to determine which θ(t) to use based on the input data, as
the loss of old tasks, i.e., Lk (θ; Mt ) ≤ Lk (θk−1 ; Mt ), where shown in (10). This is closely related to the out-of-distribution
θk−1 is the network parameters at the end of learning the (k-1)th (OOD) detection, where the predictive uncertainty should be
task. low for in-distribution data and high for OOD data [74], [75],
Alternatively, gradient projection can also be performed with- [76], [77]. More importantly, since the function of task-identity
out storing old training samples [51], [65], [66], [67]. Here we prediction as (11) needs to be continually updated, it also suffers
take NCL [51] as an example, which can manipulate gradient from catastrophic forgetting. To address this issue, the ith task’s
directions with μk−1 and Λk−1 in online Laplace approxima- distribution p(Dt |i, θ) could be recovered by replay [74], [78],
tion. As shown in (8), NCL performs continual learning by [79], [80]
minimizing the task-specific loss k (θ) within a region of ra-
p(IDt = i|Dt , θ) ∝ p(Dt |i, θ)p(i), (11)
 r centered around θ with a distance metric d(θ, θ + δ) =
dius
δ Λk−1 δ/2 that takes into account the curvature of the prior where the marginal task distribution p(i) ∝ Ni in general.
via its precision matrix Λk−1
B. Generalizability Analysis
δ ∗ = arg min k (θ + δ) ≈ arg min k (θ) + ∇θ k (θ) δ,
δ δ
 Current theoretical efforts for continual learning have pri-
s.t., d(θ, θ + δ) = δ Λk−1 δ/2 ≤ r. (8) marily been performed on training sets of incremental tasks,
assuming that their test sets follow similar distributions and
The solution to such an optimization problem in (8) is given the candidate solutions have similar generalizability. However,
by δ ∗ ∝ Λ−1
k−1 ∇θ k (θ) − (θ − μk−1 ), which gives rise to the since the objective for learning multiple tasks is typically highly

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: COMPREHENSIVE SURVEY OF CONTINUAL LEARNING: THEORY, METHOD AND APPLICATION 5367

non-convex, there are many local optimal solutions that per- self-supervised learning [28], [91], [92], [93]. This motivates
form similarly on training sets but have significantly different the representation-based approach in Section IV-D.
generalizability on test sets [69], [81]. For continual learning, a There are many other factors important for continual learning
desirable solution requires not only intra-task generalizability performance. As shown in (13), the upper bound of performance
from training sets to test sets, but also inter-task generaliz- degradation also depends on the difference of the empirical opti-
ability to accommodate incremental changes of their distri- mal solution θt∗ = arg minθ t (θ; Dt ) for each task, i.e., the dis-
butions. Here we provide a conceptual demonstration with a crepancy of task distribution (see Fig. 2(d)), which is further val-
task-specific loss t (θ; Dt ) and its empirical optimal solution idated by a theoretical analysis of the forgetting-generalization
θt∗ = arg minθ t (θ; Dt ). When task i needs to accommodate trade-off [94] and the PAC-Bayes bound of generalization er-
another task j, the maximum sacrifice of its performance can rors [69], [95]. Therefore, how to exploit task similarity is
be estimated by performing a second-order Taylor expansion of directly related to the performance of continual learning. The
i (θ; Di ) around θi∗ generalization error for each task can improve with synergistic
 tasks but deteriorate with competing tasks [68], where learning
i (θj∗ ; Di ) ≈ i (θi∗ ; Di ) + (θj∗ − θi∗ ) ∇θ i (θ; Di )θ=θ∗ all tasks equally in a shared solution tends to compromise each
i

 task in performance [68], [69]. On the other hand, when model


1
+ (θj∗ − θi∗ ) ∇2θ i (θ; Di )θ=θ∗ (θj∗ − θi∗ ) parameters are not shared by all tasks (e.g., using a multi-head
2 i
output layer), the impact of task similarity on continual learning
1 
≈ i (θi∗ ; Di ) + Δθ ∇2θ i (θ; Di )θ=θ∗ Δθ, (12) will be complex. Some theoretical studies with the neural tangent
2 i kernel (NTK) [96], [97], [98], [99] suggest that an increase in
 task similarity may lead to more forgetting. Since the output
where Δθ := θj∗ − θi∗ and ∇θ i (θ; Di )θ=θ∗ ≈ 0. Then, the per-
i heads are independent for individual tasks, it becomes much
formance degradation of task i is upper-bounded by more difficult to distinguish between two similar solutions [98],
1 max [99]. Specifically, under the NTK regime from the tth task up
i (θj∗ ; Di ) − i (θi∗ ; Di ) ≤ λ Δθ 2 , (13) until the kth task, the forgetting of old tasks is bounded by
2 i
where λmax is the maximum eigenvalue of the Hessian matrix 2
i p(Dk |θk∗ ) − p(Dk |θt∗ )
∇2θ i (θ; Di )θ=θ∗ . Note that the order of task i and j can be F
i
k

arbitrary, that is, (13) demonstrates both forward and back- 2 2
2 2
ward effects. Therefore, the robustness of an empirical optimal ≤ σt,|rep|+1 Θt→S(i,|rep|) Θi→S(i,|rep|) RESi 2 .
2 2
i=t+1
solution θi∗ to parameter changes is closely related to λmax i ,
which has been a common metric to describe the flatness of (14)
loss landscape [81], [82], [83].
Intuitively, convergence to a local minima with a flatter loss Θt→k is a diagonal matrix where each diagonal element
landscape will be less sensitive to modest parameter changes cos(θt,k )r is the cosine of the rth principal angle between the
and thus benefit both old and new tasks (see Fig. 2(c)). To tth and kth tasks in the feature space. σt,· is the ·th singular
find such a flat minima, there are three widely-used strate- value of the tth task. RESi is the rotated residuals that remain
gies in continual learning. The first is derived from its defi- to be learned, and S(i, ·) represents the residuals subspace of
nition, i.e., the flatness metric. Specifically, the minimization order · until the ith task. |rep| is the sample number of replay
of t (θ; Dt ) can be replaced by a robust task-specific loss data. The complex impact of task similarity suggests the impor-
bt (θ; Dt ) := max δ ≤b t (θ + δ; Dt ), and thus the obtained so- tance of model architectures for coordinating task-sharing and
lution guarantees low error not only at a specific point but also task-specific components.
in its neighborhood with a “radius” of b. However, due to the Moreover, the complexity of finding a desirable solution for
great dimensionality of θ, the calculation of bt (θ; Dt ) cannot continual learning is determined to a large extent by the structure
cover all possible δ but only a few directions [84], similar of parameter space. Learning all incremental tasks with a shared
to the complexity issue of computing the Hessian matrix in solution is equivalent to learning each new task in a constrained
(12). The alternatives include using an approximation of the parameter space that prevents performance degradation of all old
Hessian [81], [85] or calculating δ only along the trajectory tasks. Such a classical continual learning problem has proven
of forward and backward parameter changes [86], [87]. The to be NP-hard in general [62], because the feasible parameter
second is to operate the loss landscape by constructing an space tends to be narrow and irregular as more tasks are intro-
ensemble model under the restriction of mode connectivity, duced, thus difficult to identify (see Fig. 2(e)). This challenging
i.e., integrating multiple minima in parameter or function space issue can be mitigated by replaying representative old training
along the low-error path, as connecting them ensures flatness samples [62], restricting the feasible parameter space to a hyper-
on that path [69], [86], [88]. These two strategies are closely rectangle [100], or alternating the model architecture of using a
related to the optimization-based approach. The third comes single parameter space (e.g., using multiple continual learning
down to obtaining well-distributed representations, which tend models) [68], [69], [101]. To harmonize the important factors
to be more robust to distribution differences in function space, in continual learning, recent work presents a similar form of
such as by using large-scale pre-training [87], [89], [90] and generalized bounds for learning and forgetting. For example,

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
5368 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 8, AUGUST 2024

Fig. 3. State-of-the-art and elaborated taxonomy of representative continual learning methods. We have summarized five main categories (blue blocks), each of
which is further divided into several sub-directions (red blocks).

with probability at least 1 − δ for any δ ∈ (0, 1), an ideal con-


tinual learner [102] under the assumption that all tasks share a
global minimizer with uniform convergence (i.e., λmax i = λ for
∀t = 1, . . . , k in (13)) has the generalization bound
c∗t ≤ EDt ∼Dt t (θ; Dt ) ≤ c∗t + ζ(Nt , δ), ∀t = 1, . . . , k, (15)
where c∗t = t (θt∗ ; Dt ) is the minimum loss of the tth task, and θ
is a global solution of the continually learned
√ 1 : k tasks by
λB
t |θ| log(N ) log(|θ|k/δ)
empirical risk minimization. ζ = O( √ ),
2 Nt
and θ 2 ≤ B. Considering that the shared parameter space
for many different tasks might be an empty set (Fig. 2(e)), i.e., Fig. 4. Regularization-based approach. This direction is characterized by
∪kt=1 θt = ∅, the generalization bounds are further refined by adding explicit regularization terms to mimic the parameters (weight regular-
assuming K parameter spaces (K ≥ 1 in general) to capture all ization) or behaviors (function regularization) of the old model.
tasks [69], [103]. For generalization errors of new and old tasks
t−1

EDt ∼Dt t (θ; Dt ) ≤ c∗t + R bi learning system (see Fig. 1(c)). These categories have their
i=1 own theoretical foundations, as analyzed in Section III, while
t−1
 t−1
 intrinsically connected in objectives. Each category is further
+ Div(Di , Dt ) + ζ Ni , K/δ , divided into several sub-directions depending on their specific
i=1 i=1 implementations.
t−1
 t−1
 t−1

EDi ∼Di i (θ; Di ) ≤ c∗i + R(bt ) + Div(Dt , Di ) A. Regularization-Based Approach
i=1 i=1 i=1
This direction is characterized by adding explicit regular-
+ ζ(Nt , K/δ), (16) ization terms to balance the old and new tasks, which usually
requires storing a frozen copy of the old model for reference (see
where R(·) and Div are the functions of loss flatness and task
Fig. 4). Depending on the target of regularization, such methods
discrepancy, respectively. The definitions of δ, θ and c∗t are the
can be divided into two groups.
same as (15). Therefore, a desirable solution for continual learn-
The first is weight regularization, which selectively regular-
ing should provide an appropriate stability-plasticity trade-off
izes the variation of network parameters. A typical implementa-
and an adequate intra/inter-task generalizability, motivating a
tion is to add a quadratic penalty in loss function that penalizes
variety of representative methods.
the variation of each parameter depending on its “importance” to
perform the old tasks (see (5)). The importance can be calculated
IV. METHOD by the Fisher information matrix (FIM), such as EWC [49] and
In this section, we present an elaborated taxonomy of rep- some more advanced versions [50], [104]. Numerous efforts
resentative continual learning methods (see Fig. 3), analyzing have been devoted to designing a better importance measure-
extensively their main motivations, typical implementations and ment. SI [105] online approximates the importance of each
empirical properties. This taxonomy consists of five broad cate- parameter by its contribution to the total loss variation and its
gories that correspond to the major components of a machine update length over the entire training trajectory. MAS [106]

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: COMPREHENSIVE SURVEY OF CONTINUAL LEARNING: THEORY, METHOD AND APPLICATION 5369

accumulates importance measures based on the sensitivity of


predictive results to parameter changes, which is both online and
unsupervised. RWalk [44] combines the regularization terms of
SI [105] and EWC [49] to integrate their advantages. Interest-
ingly, these importance measurements have been shown to be
all tantamount to an approximation of the FIM [107], although
stemming from different motivations.
There are also some works focusing on refining the imple-
mentation of the quadratic penalty. Since the diagonal approx-
imation of the FIM in EWC [49] might lose information about
the old tasks, R-EWC [108] performs a factorized rotation of
the parameter space to diagonalize the FIM. XK-FAC [109] Fig. 5. Replay-based approach. This direction is characterized by approxi-
mating and recovering the old data distributions. Typical sub-directions include
further considers the inter-example relations in approximating experience replay, which saves a few old training samples in a memory buffer;
the FIM to better accommodate batch normalization. Observ- generative replay, which trains a generative model to provide generated samples;
ing the asymmetric effect of parameter changes on old tasks, and feature replay, which recovers the distribution of old features through saving
prototypes, saving statistical information or training a generative model.
ALASSO [110] designs an asymmetric quadratic penalty with
one of its sides overestimated.
Compared to learning each task within the constraints of
the old model, which typically exacerbates the intransience, output head of the old tasks to compute the distillation loss.
an expansion-renormalization process of obtaining separately LwM [119] exploits the attention maps of new training samples
the new task solution and renormalizing it with the old model for KD. EBLL [121] learns task-specific autoencoders and pre-
can provide a better stability-plasticity trade-off. IMM [111] is vents changes in feature reconstruction. GD [124] further distills
an early attempt that incrementally matches the moment of the knowledge on the large stream of unlabeled data available in
posterior distributions for old and new tasks, i.e., a weighted the wild. When old training samples are faithfully recovered,
average of their solutions. ResCL [112] extends this idea with the potential of function regularization can be largely released.
a learnable combination coefficient. P&C [104] learns each Thus, function regularization often collaborates with replaying
task individually with an additional network, and then distills it a few old training samples, discussed latter in Section IV-B.
back to the old model with a generalized version of EWC [49]. Meanwhile, sequential Bayesian inference over function space
AFEC [52] introduces a forgetting rate to eliminate the potential can be seen as a form of function regularization, which generally
negative transfer from the original posterior p(θ|D1:k−1 ) in requires storing some old training samples (called “coreset” in
(1), and derives quadratic terms to penalize differences of the literature), such as FRCL [127], FROMP [128] and S-FSVI [60].
network parameters θ with both the old and new task solutions. For conditional generation, the generated data of previously-
To reliably average the old and new task solutions, a linear learned conditions and their output values are regularized to
connector [113] is constructed by constraining them on a linear be consistent between the teacher and student models, such as
low-error path. MeRGANs [125], DRI [129] and LifelongGAN [126].
Other forms of regularization that target the network itself
also belong to this sub-direction. As discussed before, online
variational inference of the posterior distribution can serve as B. Replay-Based Approach
an implicit regularization of parameter changes [55], [56], [58], We group the methods for approximating and recovering old
[59]. Instead of consolidating parameters, NPC [114] estimates data distributions into this direction (see Fig. 5), which can
the importance of each neuron and selectively reduces its learn- be further divided into three sub-directions depending on the
ing rate. UCL [115] and AGS-CL [116] freeze the parameters content of replay, each with its own challenges.
connecting the important neurons, equivalent to a hard version The first is experience replay, which typically stores a few
of weight regularization. The second is function regularization, old training samples within a small memory buffer. Due to
which targets the intermediate or final output of the prediction the extremely limited storage space, the key challenges consist
function. This strategy typically uses the previously-learned of how to construct and how to exploit the memory buffer.
model as the teacher and the currently-trained model as the As for construction, the stored training samples should be
student, while implementing knowledge distillation (KD) [117] carefully selected, compressed, augmented, and updated, in
to mitigate catastrophic forgetting. Ideally, the target of KD order to recover adaptively the past information. Earlier work
should be all old training samples, which are unavailable in adopts fixed principles for sample selection. For example, Reser-
continual learning. The alternatives can be new training sam- voir Sampling [130], [131] randomly stores a fixed amount
ples [118], [119], [120], [121], a small fraction of old training of training samples from each input batch. Ring Buffer [45]
samples [46], [47], [122], [123], external unlabeled data [124], further ensures an equal number of old training samples per
generated data [125], [126], etc., suffering from different degrees class. Mean-of-Feature [122] selects an equal number of old
of distribution shift. training samples that are closest to the feature mean of each
As a pioneer work, LwF [118] and LwF.MC [122] learn class. More advanced strategies are typically gradient-based or
new training samples while using their predictions from the optimizable, by maximizing such as the sample diversity of

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
5370 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 8, AUGUST 2024

parameter gradients [21], performance of corresponding tasks with additional generated data. To alleviate dramatic represen-
with cardinality constraints [132], mini-batch gradient similarity tation shifts, PODNet [47] employs a spatial distillation loss
and cross-batch gradient diversity [133], ability of optimizing to preserve representations throughout the model. Co2L [92]
latent decision boundaries [134], diversity of robustness against introduces a self-supervised distillation loss to obtain robust
perturbations [32], similarity to the gradients of old training representations against catastrophic forgetting. GeoDL [150]
samples with respect to the current parameters [135], etc. performs KD along a path that connects the low-dimensional
To improve storage efficiency, AQM [136] performs online projections of the old and new feature spaces for a smooth
continual compression based on a VQ-VAE framework [137] transition between them. ELI [151] learns an energy manifold
and saves compressed data for replay. MRDC [138] formu- with the old and new models to realign the representation shifts
lates experience replay with data compression as determinantal for optimizing incremental tasks. To adequately exploit the
point processes (DPPs), and derives a computationally efficient past information, DDE [152] distills colliding effects from the
way for online determination of the optimal compression rate. features of the new training samples, while CSC [153] addition-
RM [32] adopts conventional and label mixing-based strategies ally leverages the structure of the old feature space. To further
of data augmentation to enhance the diversity of old training enhance learning plasticity, D+R [154] performs KD from an
samples. RAR [139] synthesizes adversarial samples near the additional model dedicated to each new task. FOSTER [155]
forgetting boundary and performs MixUp for data augmentation. expands new modules to fit the residuals of the old model
The auxiliary information with low storage cost, such as class and then distills them into a single model. Besides, weight
statistics [79], [140] and attention maps [141], [142], can be regularization can also be combined with experience replay for
further incorporated to maintain old knowledge. Besides, the old better performance and generality [44], [52].
training samples can be continually adjusted to accommodate in- It is worth noting that the merits and limitations of experience
cremental changes, e.g., making them more representative [143] replay remain largely open. In addition to the intuitive benefits
or challenging [144] for separation. of staying in the low-loss region of the old tasks [156], theoret-
As for exploitation, experience replay requires an adequate ical analysis has demonstrated its contribution to resolving the
use of the memory buffer to recover the past information. There NP-hard problem of optimal continual learning [62]. However, it
are many different implementations, closely related to other risks overfitting to only a few old training samples retained in the
continual learning strategies. First, the effect of old training memory buffer, which potentially affects generalizability [156].
samples in optimization can be constrained to avoid catas- To alleviate overfitting, LiDER [157] takes inspirations from
trophic forgetting and facilitate knowledge transfer. For exam- adversarial robustness and enforces the Lipschitz continuity of
ple, GEM [45] constructs individual constraints based on the old the model to its inputs. MOCA [158] enlarges the variation
training samples for each task to ensure non-increase in their of representations to prevent the old ones from shrinking in
losses. A-GEM [63] replaces the individual constraints with a their space. Interestingly, several simple baselines of experience
global loss of all tasks to improve training efficiency. LOGD [64] replay can achieve considerable performance. DER [31] stores
decomposes the gradient of each task into task-sharing and old training samples together with their logits, and perform logit-
task-specific components to leverage inter-task information. To matching throughout the optimization trajectory. GDumb [159]
achieve a good trade-off in interference-transfer [94], [131], greedily collects incoming training samples in a memory buffer
MER [131] employs meta-learning for gradient alignment in and then uses them to train a model from scratch for testing.
experience replay. BCL [94] explicitly pursues a saddle point of These efforts can serve as evaluation criteria for subsequent
the cost of old and new training samples. To selectively utilize the exploration.
memory buffer, MIR [145] prioritizes the old training samples The second is generative replay or pseudo-rehearsal, which
that subject to more forgetting, while HAL [146] uses them as generally requires training an additional generative model to re-
“anchors” and stabilizes their predictions. play generated data. This is closely related to continual learning
On the other hand, experience replay can be naturally com- of generative models themselves, as they also require incremen-
bined with knowledge distillation (KD), which additionally in- tal updates. DGR [160] provides an initial framework in which
corporates the past information from the old model. iCaRL [122] learning each generation task is accompanied with replaying
and EEIL [123] are two early works that perform KD on both generated data sampled from the old generative model, so as
old and new training samples. Some subsequent improvements to inherit the previously-learned knowledge. MeRGAN [125]
focus on different issues in experience replay. To mitigate data further enforces consistency of the generated data sampled with
imbalance of the limited old training samples, LUCIR [46] the same random noise between the old and new generative
encourages similar feature orientation of the old and new mod- models, similar to the role of function regularization. Besides,
els, while performing cosine normalization of the last layer other continual learning strategies can be incorporated into
and mining hard negatives of the current task. BiC [147] and generative replay. Weight regularization [25], [55], [161], [162]
WA [148] attribute this issue to the bias of the last fully- and experience replay [25], [163] have been shown to be effec-
connected layer, and resolve it by either learning a bias correction tive in mitigating catastrophic forgetting of generative models.
layer with a balanced validation set [147], or normalizing the DGMa/DGMw [164] and a follow-up work [162] adopt binary
output weights [148]. SS-IL [149] adopts separated softmax in masks to allocate task-specific parameters for overcoming inter-
the last layer and task-wise KD to mitigate the bias. DRI [129] task interference, and an extendable network to ensure scalabil-
trains a generative model to supplement the old training samples ity. If pre-training is available, it can provide a relatively stable

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on December 19,2024 at 09:29:09 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: COMPREHENSIVE SURVEY OF CONTINUAL LEARNING: THEORY, METHOD AND APPLICATION 5371

and strong reference model for continual learning. For example,


FearNet [165] and ILCAN [166] additionally preserves statisti-
cal information of the old features acquired from a pre-trained
feature extractor, while GAN-Memory [167] continually adjusts
a pre-trained generative model with task-specific parameters.
The generative models for pseudo-rehearsal can be of various types, such as generative adversarial networks (GANs) and (variational) autoencoders (VAEs). A majority of state-of-the-art approaches have focused on GANs to enjoy their advantages in fine-grained generation, but usually suffer from label inconsistency in continual learning [164], [168]. In contrast, autoencoder-based strategies, such as FearNet [165], SRM [169] and EEC [168], can explicitly control the labels of the generated data, although the generations are relatively blurred. L-VAEGAN [170] instead employs a hybrid model for both high-quality generation and accurate inference. However, since continual learning of generative models is extremely difficult and requires significant resource overhead, generative replay is typically limited to relatively simple datasets [162], [171]. An alternative is to convert the target of generative replay from the data level to the feature level, which can largely reduce the complexity of conditional generation and more adequately exploit semantic information. For example, GFR [172] trains a conditional GAN to replay generated features after the feature extractor. BI-R [171] incorporates context-modulated feedback connections in a standard VAE to replay internal representations.

In fact, maintaining feature-level rather than data-level distributions enjoys significant benefits in efficiency and privacy. We categorize this sub-direction as feature replay. However, a central challenge is the representation shift caused by sequentially updating the feature extractor, which reflects feature-level catastrophic forgetting. To address this issue, GFR [172], FA [120] and DSR [173] perform feature distillation between the old and new models. IL2M [140] and SNCL [79] recover statistics of feature representations (e.g., mean and covariance) on the basis of experience replay. RER [174] explicitly estimates the representation shift to update the preserved old features. REMIND [175] fixes the early layers of the feature extractor and reconstructs the intermediate representations to update the latter layers. FeTrIL [176] employs a fixed feature extractor learned from the initial task and replays generated features afterwards. For continual learning from scratch, the required changes in representation are often dramatic, while stabilizing the feature extractor may interfere with accommodating new representations. In contrast, the use of strong pre-training can provide robust representations that are generalizable to downstream tasks and remain relatively stable in continual learning, alleviating the central challenge of feature replay.

C. Optimization-Based Approach

Continual learning can be achieved not only by adding additional terms to the loss function (e.g., regularization and replay), but also by explicitly designing and manipulating the optimization programs (see Fig. 6).

Fig. 6. Optimization-based approach. This direction is characterized by explicit design and manipulation of the optimization programs, such as gradient projection with reference to the gradient space or input space of the old tasks (adapted from [67]), meta-learning of sequentially arrived tasks within the inner loop, and exploitation of mode connectivity and flat minima in the loss landscape (adapted from [86], [113]). θ*_A, θ*_B and θ*_{A,B} are desirable solutions for task A, task B and both of them, respectively.

A typical idea is to perform gradient projection. Some replay-based approaches such as GEM [45], A-GEM [63], LOGD [64] and MER [131] constrain parameter updates to align with the direction of experience replay, corresponding to preserving the previous input space and gradient space through a few old training samples. In contrast to saving old training samples, OWM [65] modifies parameter updates to the orthogonal direction of the previous input space. OGD [67] preserves the old gradient directions and rectifies the current gradient directions to be orthogonal to them. Orthog-Subspace [177] performs continual learning with orthogonal low-rank vector subspaces and keeps the gradients of different tasks orthogonal to each other. GPM [142] maintains the gradient subspace important to the old tasks (i.e., the bases of the core gradient space) for orthogonal projection in updating parameters. FS-DGPM [85] dynamically releases unimportant bases of GPM [142] to improve learning plasticity and encourages convergence to a flat loss landscape. TRGP [178] defines a "trust region" through the norm of the gradient projection onto the subspace of previous inputs, so as to selectively reuse the frozen weights of old tasks. Adam-NSCL [66] instead projects candidate parameter updates into the current null space approximated by the uncentered feature covariance of the old tasks, while AdNS [179] considers the shared part of the previous and the current null spaces. NCL [51] unifies Bayesian weight regularization and gradient projection, encouraging parameter updates in the null space of the old tasks while converging to a maximum of the approximate Bayesian posterior. Under the upper bound of the quadratic penalty in Bayesian weight regularization, RGO [180] modifies gradient directions with a recursive optimization procedure to obtain the optimal solution. Therefore, as regularization and replay are ultimately achieved by rectifying the current gradient, gradient projection corresponds to a similar modification but operates explicitly on parameter updates.
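To make the gradient-projection idea concrete, the following is a minimal sketch (our illustration, not the released code of any cited method) of an A-GEM-style update in PyTorch: the gradient of the current batch is projected so that it does not conflict with a reference gradient computed on a few stored old samples. The model, data and hyperparameters are toy placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flat_grad(model):
    # Concatenate all parameter gradients into a single vector.
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

def assign_grad(model, new_grad):
    # Write a flat gradient vector back into the parameters' .grad fields.
    idx = 0
    for p in model.parameters():
        n = p.numel()
        p.grad.copy_(new_grad[idx:idx + n].view_as(p))
        idx += n

def agem_step(model, optimizer, new_x, new_y, mem_x, mem_y):
    # Reference gradient on a small batch of old (memory) samples.
    optimizer.zero_grad()
    F.cross_entropy(model(mem_x), mem_y).backward()
    g_ref = flat_grad(model).detach().clone()

    # Gradient on the current batch.
    optimizer.zero_grad()
    F.cross_entropy(model(new_x), new_y).backward()
    g = flat_grad(model)

    # Project out the component that conflicts with the reference gradient.
    dot = torch.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / (g_ref.norm() ** 2 + 1e-12)) * g_ref
        assign_grad(model, g)
    optimizer.step()

# Toy usage with random data standing in for new-task and memory batches.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
new_x, new_y = torch.randn(16, 20), torch.randint(0, 4, (16,))
mem_x, mem_y = torch.randn(16, 20), torch.randint(0, 4, (16,))
agem_step(model, opt, new_x, new_y, mem_x, mem_y)
```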
Another attractive idea is meta-learning, or learning-to-learn, for continual learning, which attempts to obtain a data-driven inductive bias for various scenarios rather than designing it manually [3]. OML [181] provides a meta-training strategy that performs online updates on the sequentially arrived inputs and minimizes their interference, which naturally yields sparse representations suitable for continual learning. ANML [182] extends this idea by meta-learning a context-dependent gating function to selectively activate neurons with respect to incremental tasks. AIM [183] learns a mixture of experts to make predictions with the representations of OML [181] or ANML [182], further sparsifying the representations at the architectural level. Meanwhile, meta-learning can work with experience replay to better utilize both the old and new training samples. For example, MER [131] aligns their gradient directions, while iTAML [184] applies a meta-updating rule to keep them in balance with each other. With the help of experience replay, La-MAML [185] optimizes the OML [181] objective in an online fashion with an adaptively modulated learning rate. OSAKA [43] proposes a hybrid objective of knowledge accumulation and fast adaptation, which can be resolved by obtaining a good initialization with meta-training and then incorporating knowledge of incremental tasks into the initialization. Meta-learning can also be used to optimize specialized architectures. MERLIN [78] consolidates a meta-distribution of model parameters given the representations of each task, which allows sampling task-specific models and ensembling them for inference. Similarly, PR [74] adopts a Bayesian strategy to learn task-specific posteriors with a shared meta-model. MARK [186] maintains a set of shared weights that are incrementally updated with meta-learning and selectively masked to solve specific tasks. ARI [187] combines adversarial attacks with experience replay to obtain task-specific models, which are then fused together through meta-training.

Besides, some other works refine the optimization process from a loss landscape perspective. For example, rather than designing a dedicated algorithm, Stable-SGD [81] enables SGD to find a flat local minimum by adapting factors of the training regime, such as dropout, learning rate decay and batch size. MC-SGD [86] empirically demonstrates that the local minima obtained by multi-task learning and continual learning can be connected by a linear path of low error, and applies experience replay to find a solution along it. Linear Connector [113] adopts Adam-NSCL [66] and feature distillation to obtain respective solutions of the old and new tasks connected by a linear path of low error, followed by linear averaging. Further, un-/self-supervised learning (compared with traditional supervised learning) [28], [93], [188] and large-scale pre-training (compared with random initialization) [87], [89], [189], [190] have been shown to suffer from less catastrophic forgetting. Empirically, both can be attributed to obtaining a more robust representation [28], [89], [93], [191] and converging to a wider loss basin [28], [87], [93], [192], [193], suggesting a potential link among the sensitivity of representations, parameters and task-specific errors. Many efforts seek to leverage these advantages in continual learning, as we discuss next.
D. Representation-Based Approach

Fig. 7. Representation-based approach. This direction is characterized by creating and leveraging the strengths of representations for continual learning, such as by using self-supervised learning and pre-training.

We group the methods that create and exploit the strengths of representations for continual learning into this category (see Fig. 7). In addition to an earlier work on obtaining sparse representations through meta-training [181], recent works have attempted to incorporate the advantages of self-supervised learning (SSL) [91], [93], [188] and large-scale pre-training [87], [189], [191] to improve the representations in initialization and in continual learning. Note that these two strategies are closely related, since the pre-training data is usually huge in amount and without explicit labels, while the performance of SSL itself is mainly evaluated by fine-tuning on (a sequence of) downstream tasks. Below, we discuss representative sub-directions.

The first is to implement self-supervised learning (basically with a contrastive loss) for continual learning. Observing that self-supervised representations are more robust to catastrophic forgetting, LUMP [93] acquires further improvements by interpolating between instances of the old and new tasks. MinRed [194] further promotes the diversity of experience replay by de-correlating the stored old training samples. CaSSLe [195] converts the self-supervised loss into a distillation strategy by mapping the current state of a representation to its previous state. Co2L [92] adopts a supervised contrastive loss to learn each task and a self-supervised loss to distill knowledge between the old and new models. DualNet [91] trains a fast learner with a supervised loss and a slow learner with a self-supervised loss, with the latter helping the former to acquire generalizable representations. CL-SLAM [196] proposes a dual-network architecture, optimized by self-supervised learning for plasticity and stability, respectively.
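A recurring ingredient of the self-supervised strategies above is to keep the current representation predictive of its frozen previous state. The sketch below illustrates such a feature-distillation term with a small predictor head and a cosine objective; it is a simplified, hypothetical variant loosely inspired by the mapping idea described for CaSSLe, and all module names and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """Maps current features toward a frozen old encoder's features and
    penalizes their negative cosine similarity."""
    def __init__(self, dim):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat_new, feat_old):
        pred = self.predictor(feat_new)
        return -F.cosine_similarity(pred, feat_old.detach(), dim=-1).mean()

# Toy usage: old_encoder is a frozen copy saved before learning the new task.
encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
old_encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
old_encoder.load_state_dict(encoder.state_dict())
for p in old_encoder.parameters():
    p.requires_grad_(False)

distiller = FeatureDistiller(dim=64)
x = torch.randn(8, 32)
loss = distiller(encoder(x), old_encoder(x))  # added to the new-task loss
loss.backward()
```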
The second is to use pre-training for downstream continual learning. Several empirical studies suggest that pre-training brings not only strong knowledge transfer but also robustness to catastrophic forgetting in downstream continual learning [87], [89], [188], [197]. In particular, the benefits for downstream continual learning tend to be more apparent when pre-training with larger data size [89], [197], larger model size [89] and contrastive loss [188], [190]. However, a critical challenge is that the pre-trained knowledge needs to be adaptively leveraged for the current task while maintaining generalizability to future tasks. There are various strategies for this problem, depending on whether the pre-trained representations are fixed or not.

As for adapting a fixed backbone, Side-Tuning [198] and DLCFT [199] train a lightweight network in parallel with the backbone and fuse their outputs linearly. TwF [200] also trains a sibling network, but distills knowledge from the backbone in a layer-wise manner. GAN-Memory [167] takes advantage of FiLM [201] and AdaFM [202] to learn task-specific parameters for each layer of a pre-trained generative model, while ADA [203] employs Adapters [204] with knowledge distillation to adjust a pre-trained transformer. Recent prompt-based approaches instruct the representations of a pre-trained transformer with a few prompt parameters. Such methods typically involve construction of task-adaptive prompts and inference of appropriate prompts for testing, by exploring prompt architectures that accommodate task-sharing and task-specific knowledge. Representative strategies include selecting the most relevant prompts from a prompt pool (L2P [205]), optimizing a weighted summation of the prompt pool with attention factors (CODA-Prompt [206]), using explicitly task-sharing and task-specific prompts (DualPrompt [207]) or only task-specific prompts (S-Prompts [208]), progressive expansion of task-specific prompts (Progressive Prompts [209]), etc. Besides, by saving prototypes, appending a nearest class mean (NCM) classifier to the backbone has proved to be a strong baseline [210], [211], which can be further enhanced by transfer learning techniques such as the FiLM adapter [212].
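The prompt-based methods above share a retrieval step: a query derived from the frozen backbone selects a few prompts from a learnable pool, which are then prepended to the token sequence. The sketch below shows only this selection step under assumed tensor shapes; it is an illustrative simplification, not the implementation of L2P or any other cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=10, prompt_len=5, dim=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):
        # query: [B, dim], e.g., a frozen backbone's class-token feature.
        sim = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # [B, pool]
        idx = sim.topk(self.top_k, dim=-1).indices                                     # [B, k]
        selected = self.prompts[idx]                      # [B, k, prompt_len, dim]
        selected = selected.flatten(1, 2)                 # [B, k * prompt_len, dim]
        # A matching loss pulls the chosen keys toward their queries.
        match_loss = (1 - sim.gather(1, idx)).mean()
        return selected, match_loss

pool = PromptPool()
query = torch.randn(4, 768)                # stand-in for frozen [CLS] features
prompt_tokens, match_loss = pool(query)
tokens = torch.randn(4, 196, 768)          # stand-in for patch embeddings
augmented = torch.cat([prompt_tokens, tokens], dim=1)  # fed to the transformer
```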
As for optimizing an updatable backbone, F2M [84] searches for flat local minima in the pre-training stage, and then learns incremental tasks within the flat region. CwD [191] regularizes the initial-phase representations to be uniformly scattered, which can empirically mimic the representations of joint training. SAM [87], [213] encourages finding a wide basin in downstream continual learning by optimizing the flatness metric. SLCA [214] observes that slowly fine-tuning the backbone of a pre-trained transformer can achieve outstanding performance in continual learning, and further preserves prototype statistics to rectify the output layer.

The third is continual pre-training (CPT) or continual meta-training. As the huge amount of data required for pre-training is typically collected in an incremental manner, performing upstream continual learning to improve downstream performance is particularly important. For example, a recent work [215] combines Barlow Twins and EWC to learn representations from incremental unlabeled data. An empirical study finds that self-supervised pre-training is more effective than supervised protocols for continual learning of vision-language models [216], consistent with the results for purely visual tasks [28]. Since texts are generally more efficient than images, IncCLIP [217] replays generated hard negative texts conditioned on images and performs multi-modal knowledge distillation for updating CLIP [218]. Meanwhile, continual meta-training needs to address a similar issue, in that the base knowledge is incrementally enriched and adapted. IDA [219] imposes discriminants of the old and new models to be aligned relative to the old centers, and otherwise leaves the embedding free to accommodate new tasks. ORDER [220] employs unlabeled OOD data with experience replay and feature replay to cope with highly dynamic task distributions.

E. Architecture-Based Approach

Fig. 8. Architecture-based approach. This direction is characterized by constructing task-specific parameters with a properly-designed architecture, such as assigning dedicated parameters to each task (parameter allocation), constructing task-adaptive sub-modules or sub-networks (modular network), and decomposing the model into task-sharing and task-specific components (model decomposition). Here we exhibit two types of model decomposition, corresponding to parameters (low-rank factorization) and representations (masks of intermediate features).

The above strategies basically focus on learning all incremental tasks with a shared set of parameters, which is a major cause of inter-task interference. In contrast, constructing task-specific parameters can explicitly resolve this problem. Previous work generally separates this direction into parameter isolation and dynamic architecture, depending on whether the network architecture is fixed or not. Here, we instead focus on the way of implementing task-specific parameters, extending the above concepts to parameter allocation, model decomposition and modular network (Fig. 8).

Parameter allocation features an isolated parameter subspace dedicated to each task throughout the network, where the architecture can be fixed or dynamic in size. Within a fixed network architecture, Piggyback [221], HAT [71], WSN [222] and H2 [223] explicitly optimize a binary mask to select dedicated neurons or parameters for each task, with the masked regions of the old tasks (almost) frozen to prevent catastrophic forgetting. PackNet [224], UCL [115], CLNP [225] and AGS-CL [116] explicitly identify the important neurons or parameters for the current task and then release the unimportant parts to the following tasks, which can be achieved by iterative pruning [224], activation value [116], [225], [226], uncertainty estimation [115], etc. Since the network capacity is limited, "free" parameters tend to saturate as more incremental tasks are introduced. Therefore, these methods typically require sparsity constraints on parameter usage and selective reuse of the frozen old parameters, which might affect the learning of each task. To alleviate this dilemma, the network architecture can be dynamically expanded if its capacity is not sufficient to learn a new task well [164], [227], [228]. The dynamic architecture can be explicitly optimized to improve parameter efficiency and knowledge transfer, such as by reinforcement learning [229], [230], architecture search [230], [231], variational Bayes [232], etc. As the network expansion should be much slower than the task increase to ensure scalability, constraints on sparsity and reusability remain important.
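As a concrete illustration of parameter allocation with binary masks, the sketch below assigns output units of a linear layer to tasks and zeroes the gradients of units owned by earlier tasks before each optimizer step. It is a simplified, hypothetical example (random masks, a single layer), not the procedure of HAT, PackNet or the other cited methods.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose output units are allocated to tasks via binary masks."""
    def __init__(self, in_f, out_f, n_tasks):
        super().__init__(in_f, out_f)
        # One fixed binary mask over output units per task (here chosen at random).
        self.register_buffer("masks", (torch.rand(n_tasks, out_f) > 0.5).float())

    def forward(self, x, task_id):
        return super().forward(x) * self.masks[task_id]

    def freeze_old_units(self, task_id):
        # Zero gradients of output units owned by any previous task.
        if task_id == 0:
            return
        owned_before = self.masks[:task_id].amax(dim=0).bool()  # [out_f]
        if self.weight.grad is not None:
            self.weight.grad[owned_before] = 0.0
        if self.bias.grad is not None:
            self.bias.grad[owned_before] = 0.0

layer = MaskedLinear(16, 32, n_tasks=3)
x = torch.randn(4, 16)
out = layer(x, task_id=1)
out.sum().backward()
layer.freeze_old_units(task_id=1)  # call before optimizer.step()
```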
Model decomposition explicitly separates a model into task-sharing and task-specific components, where the task-specific components are typically expandable. For a regular network, the task-specific components could be parallel branches [233], [234], adaptive layers [58], [235], or masks of intermediate features [186], [236], [237]. Note that the feature masks for model decomposition do not operate in parameter space and are not binary for each task, distinguishing them from the binary masks for parameter allocation. Besides, the network parameters themselves can be decomposed into task-sharing and task-specific elements, such as by additive decomposition [238], singular value decomposition [239], filter atom decomposition [240] and low-rank factorization [241], [242]. As the number of task-specific components usually grows linearly with incremental tasks, their resource efficiency determines the scalability of this sub-direction.

Modular network leverages parallel sub-networks or sub-modules to learn incremental tasks in a differentiated manner, without pre-defined task-sharing or task-specific components. As an early work, Progressive Networks [70] introduces an identical sub-network for each task and allows knowledge transfer from other sub-networks via adaptor connections. Expert Gate [243] employs a mixture of experts to learn incremental tasks, expanding one expert as each task is introduced. PathNet [72] and RPSNet [244] pre-allocate multiple parallel networks to construct a few candidate paths, i.e., layer-wise compositions of network modules, and select the best path for each task. MNTDP [245] and LMC [246] seek to explicitly find the optimal layout from previous sub-modules and potentially new sub-modules. Similar to parameter allocation, these efforts are intended to construct task-specific models, while the combination of sub-networks or sub-modules allows explicit reuse of knowledge. In addition, the sub-networks can be encouraged to learn incremental tasks in parallel. Model Zoo [68] expands a sub-network to learn each new task with experience replay of the old tasks, and ensembles all sub-networks for prediction. CoSCL [69] and CAF [103] ensemble multiple continual learning models and modulate the predictive similarity between them, proving effective in resolving the discrepancy of task distributions and improving the flatness of the loss landscape.

In contrast to other directions, most architecture-based approaches amount to de-correlating incremental tasks in network parameters, which can almost avoid catastrophic forgetting but affects scalability and inter-task generalizability. In particular, task identities are often required to determine which set of parameters to use. To overcome this limitation, task identities can be inferred from the predictive uncertainty of task-specific models [74], [75], [243]. The function of task-identity prediction can also be learned from incremental tasks, using other continual learning strategies to mitigate catastrophic forgetting [78], [223], [233], [241].
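A simple instance of inferring task identity from predictive uncertainty is sketched below: each task-specific head scores the input, and the head with the lowest predictive entropy determines both the task and the within-task prediction. This is an illustrative baseline with assumed components, not the exact procedure of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def predictive_entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)  # [B]

@torch.no_grad()
def infer_task_and_predict(x, backbone, heads):
    feat = backbone(x)
    logits_per_task = [head(feat) for head in heads]  # list of [B, C_t]
    entropies = torch.stack([predictive_entropy(l) for l in logits_per_task], dim=1)  # [B, T]
    task_id = entropies.argmin(dim=1)                 # lowest uncertainty wins
    preds = torch.stack([l.argmax(dim=-1) for l in logits_per_task], dim=1)
    return task_id, preds.gather(1, task_id.unsqueeze(1)).squeeze(1)

backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
heads = nn.ModuleList([nn.Linear(64, 2) for _ in range(3)])  # one head per task
task_id, within_task_pred = infer_task_and_predict(torch.randn(5, 20), backbone, heads)
```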
V. APPLICATION

The real-world complexity presents a variety of particular challenges for continual learning, categorized into scenario complexity and task specificity (Fig. 1(d)). The former refers to the challenges of continual learning scenarios for each task, such as task-agnostic inference, scarcity of labeled data and generic learning paradigms. We analyze them in Sections V-A, V-B and V-C, respectively, using visual classification as a typical example. The latter refers to the challenges of specific task types. We discuss more complex vision tasks, such as object detection in Section V-D and semantic segmentation in Section V-E, both of which are affected by the co-occurrence of old and new classes in incremental data. The descriptions and task-specific challenges for other domains are left to Appendix A–C, available online, including conditional generation, reinforcement learning, and natural language processing.

A. Task-Agnostic Inference

Continual learning usually has Task-Incremental Learning (TIL) as a basic setup, i.e., task identities are provided in both training and testing. In contrast, task-agnostic inference, which avoids the use of task identities for testing, tends to be more natural but more challenging in practical applications; for classification tasks this setting is known as Class-Incremental Learning (CIL). For example, consider two binary classification tasks: (1) "zebra" and "elephant"; and (2) "hare" and "robin". After learning these two tasks, TIL needs to know which task it is and then classify the two corresponding classes, while CIL directly classifies the four classes at the same time. Therefore, CIL has received great attention and become arguably the most representative scenario for continual learning.

Fig. 9. Representative strategies for class-incremental learning. Catastrophic forgetting can be mitigated with respect to data space (experience replay), feature space (knowledge distillation) and label space (knowledge distillation). This figure is adapted from [152].

The CIL problem can be disentangled into within-task prediction and task-identity prediction [75], [77], where the latter is a particular challenge of task-agnostic inference. To cope with CIL, the behavior of the previous model is imposed onto the current model in terms of the data, feature, and label spaces (Fig. 9). As replay of the old training samples can impose an end-to-end effect [152], many state-of-the-art methods are built on the framework of experience replay and then incorporate knowledge distillation, as discussed in Section IV-B. In Appendix Table 2, available online, we summarize these CIL methods based on their major focuses. To avoid the extra resource overhead and potential privacy issues of retaining old training samples, many efforts attempt to perform CIL without experience replay, i.e., Data-Free CIL.
An intriguing idea is to replay synthetic data produced by inverting a frozen copy of the old classification model [247], [248], which usually further incorporates knowledge distillation to compensate for the information lost in model inversion. Other methods exploit the class-wise statistics of feature representations to obtain a balanced classifier, such as by imposing the representations to be transferable and invariant [173], [176], or by explicitly compensating for the representation shifts [174], [249].
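As a concrete example of exploiting class-wise feature statistics, the sketch below stores a per-class mean and (diagonal) standard deviation of deep features and samples pseudo-features to rebalance the output layer without storing raw data. The diagonal-Gaussian assumption and all names are illustrative simplifications rather than any cited method's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMemory:
    """Per-class Gaussian statistics of deep features (diagonal covariance)."""
    def __init__(self):
        self.stats = {}  # class_id -> (mean, std)

    @torch.no_grad()
    def update(self, feats, labels):
        # Assumes each class appears with more than one sample.
        for c in labels.unique().tolist():
            f = feats[labels == c]
            self.stats[c] = (f.mean(0), f.std(0) + 1e-4)

    def sample(self, n_per_class):
        xs, ys = [], []
        for c, (mu, std) in self.stats.items():
            xs.append(mu + std * torch.randn(n_per_class, mu.numel()))
            ys.append(torch.full((n_per_class,), c, dtype=torch.long))
        return torch.cat(xs), torch.cat(ys)

# Toy usage: rehearse pseudo-features of old classes when tuning the classifier.
memory = FeatureMemory()
old_feats, old_labels = torch.randn(100, 64), torch.randint(0, 5, (100,))
memory.update(old_feats, old_labels)

classifier = nn.Linear(64, 10)
pseudo_x, pseudo_y = memory.sample(n_per_class=8)
loss = F.cross_entropy(classifier(pseudo_x), pseudo_y)  # combined with the new-class loss
loss.backward()
```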
B. Scarcity of Labeled Data

Most of the current continual learning settings assume that incremental tasks have sufficiently large amounts of labeled data, which is often expensive and difficult to obtain in practical applications. For this reason, there is a growing body of work focusing on the scarcity of labeled data in continual learning. A representative scenario is called Few-Shot CIL (FSCIL) [24], where the model first learns some base classes for initialization with a large number of training samples, and then learns a sequence of novel classes with only a few training samples. The extremely limited training samples exacerbate the overfitting of previously-learned representations to subsequent tasks, which can be alleviated by recent work such as preserving the topology of representations [24], constructing an exemplar relation graph for knowledge distillation [250], selectively updating only the unimportant parameters [251] or stabilizing the important parameters [252], updating parameters within the flat region of the loss landscape [84], meta-learning a good initialization [253], as well as generative replay [254].

Many other efforts keep the initialized backbone fixed in subsequent continual learning, so as to decouple the learning of the representation and the classifier. Following this idea, representative strategies can be separated into two aspects. The first is to obtain compatible and extensible representations from massive base classes, such as by enforcing the representations to be compatible with simulated incremental tasks [255], reserving feature space with virtual prototypes for future classes [256], using an angular penalty loss with data augmentation [257], providing extra constraints from margin-based representations [258], etc. The second is to obtain an adaptive classifier from a sequence of novel classes, such as by evolving the classifier weights with a graph attention network [259], performing hyperdimensional computing [260], sampling stochastic classifiers [261], etc. Besides, auxiliary information such as semantic word vectors [262], [263] and sketch exemplars [264] can be incorporated to enrich the limited training samples.

In addition to the few labeled data, there is usually a large amount of unlabeled data available and collected over time. The first practical setting is called Semi-Supervised Continual Learning (SSCL) [25], which considers incremental data as partially labeled. As an initial attempt, ORDisCo [25] learns a semi-supervised classification model together with a conditional GAN for generative replay, and regularizes discriminator consistency to mitigate catastrophic forgetting. Subsequent work includes training an adversarial autoencoder to reconstruct images [265], imposing predictive consistency among augmented and interpolated data [266], and leveraging the nearest-neighbor classifier to distill class-instance relationships [267]. The second scenario assumes that there is an external unlabeled dataset to facilitate supervised continual learning, e.g., by knowledge distillation [124] and data augmentation [268]. The third scenario is to learn representations from incremental unlabeled data [28], [215], [216], which becomes an increasingly important topic for updating pre-trained knowledge in foundation models.

C. Generic Learning Paradigm

Potential challenges of the learning paradigm can be summarized in a broad concept called General Continual Learning (GCL) [12], [20], [31], where the model observes incremental data in an online fashion without explicit task boundaries. Correspondingly, GCL consists of two interconnected settings: Task-Free Continual Learning (TFCL) [20], where the task identities are not accessible in either training or testing; and Online Continual Learning (OCL) [21], where the training samples are observed in a one-pass data stream. Since TFCL usually accesses only a small batch of training samples at a time for gradual changes in task distributions, while OCL usually requires only the data label rather than the task identity for each training sample, many methods for TFCL and OCL are compatible, summarized with their applicable scenarios in Appendix Table 3, available online.

Some of them attempt to learn specialized parameters in a growing architecture. CN-DPM [80] adopts Dirichlet process mixture models to construct a growing number of neural experts, while a concurrent work [269] derives such mixture models from a probabilistic meta-learner. VariGrow [270] employs an energy-based novelty score to decide whether to extend a new expert or update an old one. ODDL [271] estimates the discrepancy between the current memory buffer and the previously-learned knowledge as an expansion signal. InstAParam [272] selects and consolidates appropriate network paths for individual training samples.

On the other hand, many efforts are built on experience replay, focusing on the construction, management and exploitation of a memory buffer. Since training samples of the same distribution arrive in small batches, the information of task boundaries is less effective, and reservoir sampling usually serves as an effective baseline strategy for sample selection. More advanced strategies prioritize the replay of those training samples that are informative [273], diversified in parameter gradients [21], balanced in class labels [159], [274], and beneficial to latent decision boundaries [134]. Meanwhile, the memory buffer can be dynamically managed, such as by removing less important training samples [42], editing the old training samples to be more likely forgotten [144], [275], and retrieving the old training samples that are susceptible to interference [145], [276]. To better exploit the memory buffer, representative strategies include calibrating features with task-specific parameters [277], performing knowledge distillation [31], [278], improving representations with contrastive learning [276], [279], using asymmetric cross-entropy [280] or constrained gradient directions [45], [63] for the old and new training samples, repeated rehearsal with data augmentation [281], properly adjusting the learning rate [42], etc.
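Reservoir sampling, mentioned above as the default selection strategy, maintains a uniform random subset of the stream in a fixed-size buffer and can be written in a few lines; the sketch below is a generic illustration rather than the buffer of any particular cited method.

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random subset of a data stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Each of the `seen` examples is kept with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

buffer = ReservoirBuffer(capacity=200)
for example in range(10000):  # stand-in for a one-pass data stream
    buffer.add(example)
replay_batch = buffer.sample(32)
```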
D. Object Detection

Incremental Object Detection (IOD) is a typical extension of continual learning to object detection, where training samples annotated with different classes are introduced in sequence, and the model needs to correctly locate and identify the objects belonging to the previously-learned classes. Unlike visual classification with only one object appearing in each training sample, object detection usually has multiple objects belonging to the old and new classes appearing together. Such co-occurrence poses an additional challenge for IOD, where the old classes are marked as the background when learning new classes, thus exacerbating catastrophic forgetting. On the other hand, this makes knowledge distillation a naturally powerful strategy for IOD, since old-class objects can be obtained from new training samples to constrain the differences in responses between the old and new models. As an early work, ILOD [282] distills the responses for old classes to prevent catastrophic forgetting on Fast R-CNN. The idea of knowledge distillation is then introduced to other detection frameworks [283], [284], [285]. Some approaches exploit unlabeled in-the-wild data to distill the old and new models into a shared model, in order to bridge potential non-co-occurrence [284] and achieve a better stability-plasticity trade-off [286]. To further improve learning plasticity, IOD-ML [287] adopts meta-learning to reshape parameter gradients into a balanced direction between the old and new classes.

E. Semantic Segmentation

Continual Semantic Segmentation (CSS) aims at pixel-wise prediction of classes and learning new classes in addition to the old ones. Similar to IOD, the old and new classes can appear together with annotations of only the latter, leading to the old classes being treated as the background (known as the background shift) and thus exacerbating catastrophic forgetting. A common strategy is to distill knowledge adaptively from the old model, which can faithfully distinguish unannotated old classes from the background. For example, MiB [288] calibrates the regular cross-entropy (CE) and knowledge distillation (KD) losses of the background pixels with predictions from the old model. ALIFE [289] further improves the calibrated CE and KD with logit regularization, and fine-tunes the classifier with feature replay. RCIL [290] reparameterizes the network into two parallel branches, where the old branch is frozen for KD between intermediate layers. SDR [291] introduces contrastive learning into the distillation of latent representations, where pixels of the same class are clustered and pixels of different classes are separated. PLOP [292], RECALL [293], SSUL [294], Self-Training [295] and WILSON [296] explicitly apply the old model to generate pseudo-labels of the old classes. Auxiliary data resources such as a web crawler [293], a pre-trained generative model [293], unlabeled data [295], and old training samples [294] have been exploited to facilitate KD and prevent catastrophic forgetting. Besides, saliency maps are commonly used to locate unannotated objects in CSS, in response to weak supervision with only image-level annotations [296], as well as to define unknown classes within the background and thereby benefit learning plasticity [294].

VI. DISCUSSION

In this section, we present an in-depth discussion of promising directions for continual learning, including the current trends, essential considerations beyond task performance, and cross-directional prospects.

A. Observation of Current Trend

As continual learning is directly affected by catastrophic forgetting, previous efforts seek to address this critical problem by promoting memory stability of the old knowledge. However, recent work has increasingly focused on facilitating learning plasticity and inter-task generalizability. This essentially advances the understanding of continual learning: a desirable solution requires a proper balance between the old and new tasks, with adequate generalizability to accommodate their distribution differences.

To promote learning plasticity on the basis of memory stability, emergent strategies include renormalization of old and new task solutions [52], [104], [113], [154], [196], balanced exploitation of old and new training samples [46], [129], [147], [148], [149], [152], space reservation for subsequent tasks [115], [116], [256], etc. On the other hand, solution generalizability could be explicitly improved by optimizing the flatness metric [81], [84], [85], [86], [87], constructing an ensemble model at either the spatial scale [69], [88] or the temporal scale [31], and obtaining well-distributed representations [28], [87], [89], [91], [92], [93]. In particular, since self-supervised and pre-trained representations are naturally more robust to catastrophic forgetting [28], [87], [89], [93], creating, improving and exploiting such representational advantages has become a promising direction.

We also observe that the applications of continual learning are becoming more diverse and widespread. In addition to various scenarios of visual classification, current extensions of continual learning have covered many other visual domains, as well as other related fields such as RL and NLP. We have only introduced some representative applications, with other more specialized and cross-cutting extensions to be explored. Notably, existing work on applications has focused on providing basic benchmarks and baseline approaches. Future work could develop more specialized approaches to obtain stronger performance, or evaluate the generality of current approaches in different applications.

B. Beyond Task Performance

Continual learning can benefit many considerations beyond task performance, such as efficiency, privacy and robustness. A major purpose of continual learning is to avoid retraining all old training samples and thus improve the resource efficiency of model updates, which is not only applicable to learning multiple incremental tasks, but also important for regular single-task training. Due to the nature of gradient-based optimization, a network tends to "forget" the observed training samples and thus requires
repetitive training to capture a distribution, especially for some hard examples [3]. Recent work has shown that the one-pass performance of visual classification can be largely improved by experience replay of hard examples [297] or orthogonal gradient projection [298]. Similarly, resolving within-task catastrophic forgetting can facilitate reinforcement learning [299], [300] and stabilize the training of GANs [301], [302].

Meanwhile, continual learning is relevant to two important directions of privacy protection. The first is Federated Learning [303], where the server and clients are not allowed to communicate data. A typical scenario is that the server aggregates the locally trained parameters from multiple clients into a single model and then sends it back. As the incremental data collected by clients is dynamic and variable, federated learning needs to overcome catastrophic forgetting and facilitate knowledge transfer across clients, which can be achieved by continual learning strategies such as model decomposition [304] and knowledge distillation [305], [306]. The second is Machine Unlearning, which aims to forget the influence of specific training samples when their access is lost, without affecting other knowledge. Many efforts in this direction are closely related to continual learning, such as learning separate models with subsets of training samples [307], utilizing historical parameters and gradients [308], removing old knowledge from parameters with the Fisher information matrix [309], adding adaptive parameters on a pre-trained model [310], etc. On the other hand, continual learning may suffer from data leakage and privacy invasion as it retains all old knowledge. Mnemonic Code [311] embeds a class-specific code when learning each class, enabling selective forgetting by discarding the corresponding codes. LIRF [312] designs a distillation framework to remove specific old knowledge and store it in a pruned lightweight network for selective recovery.

As a strategy for adapting to variable inputs, continual learning can assist a robust model to eliminate or resist external disturbances [30], [313], [314]. In fact, robustness and continual learning are intrinsically linked, as they correspond to generalizability in the spatial and temporal dimensions, respectively. Many ideas for improving robustness to adversarial examples have been used to improve continual learning, such as flat minima [31], [81], model ensemble [69], Lipschitz continuity [157] and adversarial training [158]. Subsequent work could further interconnect excellent ideas from both fields, e.g., designing particular algorithms to actively "forget" [52] external disturbances.

C. Cross-Directional Prospect

Continual learning demonstrates vigorous vitality, as most state-of-the-art AI models require flexible and efficient updates, and their advances have in turn contributed to the development of continual learning. Here, we discuss some attractive intersections of continual learning with other topics of the broad AI community:

Diffusion Model [315] is a rising state-of-the-art generative model, which constructs a Markov chain of discrete steps to progressively add random noise to the input and learns to gradually remove the noise to restore the original data distribution. This provides a new target for continual learning, and its outstanding performance in conditional generation can also facilitate the efficacy of generative replay.

Foundation Model, such as GPT [316] and CLIP [218], demonstrates impressive performance in a variety of downstream tasks with the use of large-scale pre-training. The pre-training data is usually huge in volume and collected incrementally, creating urgent demands for efficient updates. On the other hand, an increasing scale of pre-training would facilitate knowledge transfer and mitigate catastrophic forgetting for downstream continual learning. Transformer-Based Architecture [317] has proven effective for both language and vision domains, and has become the dominant choice for state-of-the-art foundation models. This requires specialized designs to overcome catastrophic forgetting, while providing new insights for maintaining task specificity in continual learning. Parameter-efficient fine-tuning techniques originally developed in the field of NLP are being widely adapted to continual learning.

Embodied AI [318] aims to enable AI systems to learn from interactions with the physical environment, rather than static datasets collected primarily from the Internet. The study of general continual learning helps embodied agents to learn from an egocentric perception similar to humans, and provides a unique opportunity for researchers to pursue the essence of lifelong learning by observing the same person over a long time span. Advances in Neuroscience provide important inspirations for the development of continual learning, as biological learning is naturally on a continual basis [1], [3]. The underlying mechanisms include multiple levels from synaptic plasticity to regional collaboration, detailed in Appendix D, available online. With a deeper understanding of the biological brain, more "natural algorithms" can be explored to facilitate continual learning of AI systems.

VII. CONCLUSION

In this work, we present an up-to-date and comprehensive survey of continual learning, bridging the latest advances in theory, method and application. We summarize both the general objectives and particular challenges in this field, with an extensive analysis of how representative strategies address them. Encouragingly, we observe a growing and widespread interest in continual learning from the broad AI community, bringing novel understandings, diversified applications and cross-directional opportunities. Based on such a holistic perspective, we expect the development of continual learning to eventually empower AI systems with human-like adaptability, responding flexibly to real-world dynamics and evolving themselves in a lifelong manner.

REFERENCES

[1] D. Kudithipudi et al., "Biological underpinnings for lifelong learning machines," Nat. Mach. Intell., vol. 4, pp. 196–210, 2022.
[2] G. I. Parisi et al., "Continual lifelong learning with neural networks: A review," Neural Netw., vol. 113, pp. 54–71, 2019.
[3] R. Hadsell et al., "Embracing change: Continual learning in deep neural networks," Trends Cogn. Sci., vol. 24, pp. 1028–1040, 2020.
[4] G. M. Van de Ven and A. S. Tolias, "Three scenarios for continual learning," 2019, arXiv:1904.07734.
[5] Z. Chen and B. Liu, Lifelong Machine Learning. San Rafael, CA, USA: [32] J. Bang, H. Kim, Y. Yoo, J. -W. Ha, and J. Choi, “Rainbow memory: Con-
Morgan & Claypool, 2018. tinual learning with a memory of diverse samples,” in Proc. IEEE/CVF
[6] M. McCloskey and N. J. Cohen, “Catastrophic interference in con- Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8214–8223.
nectionist networks: The sequential learning problem,” Psychol. Learn. [33] M. Abdelsalam, M. Faramarzi, S. Sodhani, and S. Chandar, “IIRC:
Motivation, vol. 24, pp. 109–165, 1989. Incremental implicitly-refined classification,” in Proc. IEEE/CVF Conf.
[7] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly, “Why there Comput. Vis. Pattern Recognit., 2021, pp. 11033–11042.
are complementary learning systems in the hippocampus and neocortex: [34] M. Liang et al., “Balancing between forgetting and acquisition in incre-
Insights from the successes and failures of connectionist models of mental subpopulation learning,” in Proc. Eur. Conf. Comput. Vis., 2022,
learning and memory,” Psychol. Rev., vol. 102, pp. 419–457, 1995. pp. 364–380.
[8] M. Mundt et al., “A wholistic view of continual learning with deep neural [35] Z. Ke, B. Liu, and X. Huang, “Continual learning of a mixed sequence
networks: Forgotten lessons and the bridge to active and open world of similar and dissimilar tasks,” in Proc. Int. Conf. Neural Inf. Process.
learning,” Neural Netw., vol. 160, pp. 306–336, 2023. Syst., 2020, Art. no. 1553.
[9] T. L. Hayes et al., “Replay in deep learning: Current approaches and [36] X. Liu et al., “Long-tailed class incremental learning,” in Proc. Eur. Conf.
missing biological elements,” Neural Comput., vol. 33, pp. 2908–2950, Comput. Vis., 2022, pp. 495–512.
2021. [37] S. Roy et al., “Class-incremental novel class discovery,” in Proc. Eur.
[10] P. Jedlicka et al., “Contributions by metaplasticity to solving the catas- Conf. Comput. Vis., 2022, pp. 317–333.
trophic forgetting problem,” Trends Neurosci., vol. 45, pp. 656–666, [38] K. Joseph et al., “Novel class discovery without forgetting,” in Proc. Eur.
2022. Conf. Comput. Vis., 2022, pp. 570–586.
[11] M. Masana, B. Twardowski, and J. Van de Weijer, “On class orderings [39] T. Srinivasan et al., “CLiMB: A continual learning benchmark for vision-
for incremental learning,” 2020, arXiv:2007.02145. and-language tasks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022,
[12] M. De Lange et al., “A continual learning survey: Defying forgetting in pp. 29440–29453.
classification tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, [40] F. Mi, L. Kong, T. Lin, K. Yu, and B. Faltings, “Generalized class
no. 7, pp. 3366–3385, Jul. 2022. incremental learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
[13] H. Qu et al., “Recent advances of continual learning in computer vision: Recognit. Workshops, 2020, pp. 970–974.
An overview,” 2021, arXiv:2109.11369. [41] J. Xie, S. Yan, and X. He, “General incremental learning with domain-
[14] Z. Mai et al., “Online continual learning in image classification: An aware categorical representations,” in Proc. IEEE/CVF Conf. Comput.
empirical survey,” Neurocomputing, vol. 469, pp. 28–51, 2022. Vis. Pattern Recognit., 2022, pp. 14331–14340.
[15] M. Biesialska, K. Biesialska, and M. R. Costa-juss à, “Continual lifelong [42] H. Koh et al., “Online continual learning on class incremental blurry
learning in natural language processing: A survey,” in Proc. Int. Conf. task configuration with anytime inference,” in Proc. Int. Conf. Learn.
Comput. Linguistics, 2020, pp. 6523–6541. Representations, 2021.
[16] Z. Ke and B. Liu, “Continual learning of natural language processing [43] M. Caccia et al., “Online fast adaptation and knowledge accumulation
tasks: A survey,” 2022, arXiv:2211.12701. (OSAKA): A new approach to continual learning,” in Proc. Int. Conf.
[17] K. Khetarpal et al., “Towards continual reinforcement learning: A review Neural Inf. Process. Syst., 2020, Art. no. 1387.
and perspectives,” J. Artif. Intell. Res., vol. 75, pp. 1401–1476, 2022. [44] A. Chaudhry et al., “Riemannian walk for incremental learning: Under-
[18] V. Lomonaco and D. Maltoni, “Core50: A new dataset and benchmark standing forgetting and intransigence,” in Proc. Eur. Conf. Comput. Vis.,
for continuous object recognition,” in Proc. Conf. Robot Learn., 2017, 2018, pp. 556–572.
pp. 17–26. [45] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual
[19] Y.-C. Hsu et al., “Re-evaluating continual learning scenarios: A catego- learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6470–
rization and case for strong baselines,” 2018, arXiv:1810.12488. 6479.
[20] R. Aljundi, K. Kelchtermans, and T. Tuytelaars, “Task-free continual [46] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning a unified clas-
learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, sifier incrementally via rebalancing,” in Proc. IEEE/CVF Conf. Comput.
pp. 11246–11255. Vis. Pattern Recognit., 2019, pp. 831–839.
[21] R. Aljundi et al., “Gradient based sample selection for online con- [47] A. Douillard et al., “PODNet: Pooled outputs distillation for small-tasks
tinual learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, incremental learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–102.
Art. no. 1058. [48] T. Hastie et al., The Elements of Statistical Learning: Data Mining,
[22] Y. Sun et al., “ERNIE 2.0: A continual pre-training framework for lan- Inference, and Prediction. Berlin, Germany: Springer, 2009.
guage understanding,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 8968– [49] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural net-
8975. works,” Proc. Nat. Acad. Sci. USA, vol. 114, pp. 3521–3526, 2017.
[23] P. Singh, P. Mazumder, P. Rai, and V. P. Namboodiri, “Rectification-based [50] H. Ritter, A. Botev, and D. Barber, “Online structured laplace approxima-
knowledge retention for continual learning,” in Proc. IEEE/CVF Conf. tions for overcoming catastrophic forgetting,” in Proc. Int. Conf. Neural
Comput. Vis. Pattern Recognit., 2021, pp. 15277–15286. Inf. Process. Syst., 2018, pp. 3742–3752.
[24] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong, “Few-shot [51] T.-C. Kao et al., “Natural continual learning: Success is a journey, not
class-incremental learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pat- (just) a destination,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
tern Recognit., 2020, pp. 12180–12189. pp. 28067–28079.
[25] L. Wang, K. Yang, C. Li, L. Hong, Z. Li, and J. Zhu, “ORDisCo: Effective [52] L. Wang et al., “AFEC: Active forgetting of negative transfer in continual
and efficient usage of incremental unlabeled data for semi-supervised learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 22379–
continual learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern 22391.
Recognit., 2021, pp. 5379–5388. [53] F. Huszár, “On quadratic penalties in elastic weight consolidation,”
[26] K. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “Towards 2017, arXiv:1712.03847.
open world object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. [54] J. Martens and R. Grosse, “Optimizing neural networks with kronecker-
Pattern Recognit., 2021, pp. 5826–5836. factored approximate curvature,” in Proc. Int. Conf. Mach. Learn., 2015,
[27] D. Rao et al., “Continual unsupervised representation learning,” in Proc. pp. 2408–2417.
Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 687. [55] C. V. Nguyen et al., “Variational continual learning,” in Proc. Int. Conf.
[28] D. Hu et al., “How well does self-supervised pre-training perform with Learn. Representations, 2018.
streaming data?,” in Proc. Int. Conf. Learn. Representations, 2021. [56] T. Adel, H. Zhao, and R. E. Turner, “Continual learning with adap-
[29] C. D. Kim, J. Jeong, and G. Kim, “Imbalanced continual learning with tive weights (CLAW),” in Proc. Int. Conf. Learn. Representations,
partitioning reservoir sampling,” in Proc. Eur. Conf. Comput. Vis., 2020, 2019.
pp. 411–428. [57] R. Kurle et al., “Continual learning with Bayesian neural networks for
[30] C. D. Kim, J. Jeong, S. Moon, and G. Kim, “Continual learning on non-stationary data,” in Proc. Int. Conf. Learn. Representations, 2019.
noisy data streams via self-purified replay,” in Proc. IEEE/CVF Int. Conf. [58] N. Loo, S. Swaroop, and R. E. Turner, “Generalized variational continual
Comput. Vis., 2021, pp. 517–527. learning,” in Proc. Int. Conf. Learn. Representations, 2020.
[31] P. Buzzega et al., “Dark experience for general continual learning: A [59] S. Kapoor, T. Karaletsos, and T. D. Bui, “Variational auto-regressive
strong, simple baseline,” in Proc. Int. Conf. Neural Inf. Process. Syst., Gaussian processes for continual learning,” in Proc. Int. Conf. Mach.
2020, Art. no. 1335. Learn., 2021, pp. 5290–5300.

[60] T. G. Rudner et al., “Continual learning via sequential function-space [91] Q. Pham, C. Liu, and S. Hoi, “DualNet: Continual learning, fast and slow,”
variational inference,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 18871– in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 16131–16144.
18887. [92] H. Cha, J. Lee, and J. Shin, “Co2L: Contrastive continual learning,” in
[61] H. Tseran et al., “Natural variational continual learning,” in Proc. Int. Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9496–9505.
Conf. Neural Inf. Process. Syst. Workshop, 2018. [93] D. Madaan et al., “Representational continuity for unsupervised continual
[62] J. Knoblauch, H. Husain, and T. Diethe, “Optimal continual learning has learning,” in Proc. Int. Conf. Learn. Representations, 2021.
perfect memory and is NP-HARD,” in Proc. Int. Conf. Mach. Learn., [94] K. Ramakrishnan, R. Panda, Q. Fan, J. Henning, A. Oliva, and R. Feris,
2020, Art. no. 494. “Relationship matters: Relation guided knowledge transfer for incremen-
[63] A. Chaudhry et al., “Efficient lifelong learning with A-GEM,” in Proc. tal learning of object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis.
Int. Conf. Learn. Representations, 2018. Pattern Recognit. Workshops, 2020, pp. 1009–1018.
[64] S. Tang, D. Chen, J. Zhu, S. Yu, and W. Ouyang, “Layerwise optimization [95] A. Pentina and C. Lampert, “A PAC-Bayesian bound for lifelong learn-
by gradient decomposition for continual learning,” in Proc. IEEE/CVF ing,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 991–999.
Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9629–9638. [96] M. A. Bennani, T. Doan, and M. Sugiyama, “Generalisation guarantees
[65] G. Zeng et al., “Continual learning of context-dependent processing in for continual learning with orthogonal gradient descent,” in Proc. Int.
neural networks,” Nat. Mach. Intell., vol. 1, pp. 364–372, 2019. Conf. Mach. Learn. Workshops, 2020.
[66] S. Wang, X. Li, J. Sun, and Z. Xu, “Training networks in null space [97] T. Doan et al., “A theoretical analysis of catastrophic forgetting through
of feature covariance for continual learning,” in Proc. IEEE/CVF Conf. the NTK overlap matrix,” in Proc. Int. Conf. Artif. Intell. Statist., 2021,
Comput. Vis. Pattern Recognit., 2021, pp. 184–193. pp. 1072–1080.
[67] M. Farajtabar et al., “Orthogonal gradient descent for continual learning,” [98] R. Karakida and S. Akaho, “Learning curves for continual learning in
in Proc. Int. Conf. Artif. Intell. Statist., 2020, pp. 3762–3773. neural networks: Self-knowledge transfer and forgetting,” in Proc. Int.
Liyuan Wang received the BS and PhD degrees from Tsinghua University. He is currently a postdoc at Tsinghua University, working with Prof. Jun Zhu in the Department of Computer Science and Technology. His research interests include continual learning, incremental learning, lifelong learning, and brain-inspired AI. His work on continual learning has been published in major conferences and journals in related fields, such as Nature Machine Intelligence, NeurIPS, ICLR, CVPR, ICCV, and ECCV.

Xingxing Zhang received the BE and PhD degrees from the Institute of Information Science, Beijing Jiaotong University, in 2015 and 2020, respectively. She was also a visiting student with the Department of Computer Science, University of Rochester, from 2018 to 2019. She was a postdoc with the Department of Computer Science and Technology, Tsinghua University, from 2020 to 2022. Her research interests include continual learning and zero/few-shot learning. She received the Excellent PhD Thesis Award from the Chinese Institute of Electronics in 2020.

Hang Su (Member, IEEE) is an associate professor with the Department of Computer Science and Technology, Tsinghua University. His research interests lie in adversarial machine learning and robust computer vision, on which he has published more than 50 papers in venues including CVPR, ECCV, and IEEE Transactions on Medical Imaging. He has served as an area chair for NeurIPS and as a workshop co-chair for AAAI 2022. He received the Young Investigator Award at MICCAI 2012, the Best Paper Award at AVSS 2012, and the Platinum Best Paper Award at ICME 2018.

Jun Zhu (Fellow, IEEE) received the BS and PhD degrees from the Department of Computer Science and Technology, Tsinghua University, where he is currently a Bosch AI professor. He was a postdoctoral fellow and adjunct faculty with the Machine Learning Department, Carnegie Mellon University. His research focuses primarily on developing machine learning methods to understand scientific and engineering data arising from various fields. He regularly serves as a senior area chair and area chair for prestigious conferences, including ICML, NeurIPS, ICLR, IJCAI, and AAAI. He was selected as one of “AI’s 10 to Watch” by IEEE Intelligent Systems. He is a fellow of AAAI and an associate editor-in-chief of IEEE Transactions on Pattern Analysis and Machine Intelligence.