0% found this document useful (0 votes)
10 views17 pages

Advances and Challenges in Meta-Learning A Technical Review

This technical review discusses advancements and challenges in meta-learning, emphasizing its significance in scenarios with limited data availability. It explores state-of-the-art meta-learning methods, their relationship with multi-task and transfer learning, and highlights advanced topics such as unsupervised meta-learning and continual learning. The article aims to provide a comprehensive understanding of meta-learning's potential impact on machine learning applications and outlines open problems for future research.

Uploaded by

nguyentai325
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views17 pages

Advances and Challenges in Meta-Learning A Technical Review

This technical review discusses advancements and challenges in meta-learning, emphasizing its significance in scenarios with limited data availability. It explores state-of-the-art meta-learning methods, their relationship with multi-task and transfer learning, and highlights advanced topics such as unsupervised meta-learning and continual learning. The article aims to provide a comprehensive understanding of meta-learning's potential impact on machine learning applications and outlines open problems for future research.

Uploaded by

nguyentai325
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO.

7, JULY 2024 4763

Advances and Challenges in Meta-Learning: A


Technical Review
Anna Vettoruzzo , Mohamed-Rafik Bouguelia , Joaquin Vanschoren ,
Thorsteinn Rögnvaldsson , Senior Member, IEEE, and KC Santosh , Senior Member, IEEE

Abstract—Meta-learning empowers learning systems with the where data is scarce or costly to obtain. Most existing approaches
ability to acquire knowledge from multiple tasks, enabling faster rely on either supervised learning of a representation tailored to
adaptation and generalization to new tasks. This review provides a single task, or unsupervised learning of a representation that
a comprehensive technical overview of meta-learning, emphasiz-
ing its importance in real-world applications where data may be captures general features that may not be well-suited to new
scarce or expensive to obtain. The article covers the state-of-the-art tasks. Furthermore, learning from scratch for each task is often
meta-learning approaches and explores the relationship between not feasible, especially in domains such as medicine, robotics,
meta-learning and multi-task learning, transfer learning, domain and rare language translation where data availability is limited.
adaptation and generalization, self-supervised learning, personal- To overcome these challenges, meta-learning has emerged as
ized federated learning, and continual learning. By highlighting the
synergies between these topics and the field of meta-learning, the a promising approach. Meta-learning enables models to quickly
article demonstrates how advancements in one area can benefit adapt to new tasks, even with few examples, and general-
the field as a whole, while avoiding unnecessary duplication of ize across them. While meta-learning shares similarities with
efforts. Additionally, the article delves into advanced meta-learning transfer learning and multitask learning, it goes beyond these
topics such as learning from complex multi-modal task distribu- approaches by enabling a learning system to learn how to learn.
tions, unsupervised meta-learning, learning to efficiently adapt to
data distribution shifts, and continual meta-learning. Lastly, the This capability is particularly valuable in settings where data is
article highlights open problems and challenges for future research scarce, costly to obtain, or where the environment is constantly
in the field. By synthesizing the latest research developments, this changing. While humans can rapidly acquire new skills by lever-
article provides a thorough understanding of meta-learning and aging prior experience and are therefore considered generalists,
its potential impact on various machine learning applications. We most deep learning models are still specialists and are limited
believe that this technical overview will contribute to the advance-
ment of meta-learning and its practical implications in addressing to performing well on specific tasks. Meta-learning bridges this
real-world problems. gap by enabling models to efficiently adapt to new tasks.
Index Terms—Deep neural networks, few-shot learning, meta-
learning, representation learning, transfer learning.
B. Contribution
This review article primarily discusses the use of meta-
I. INTRODUCTION learning techniques in deep neural networks to learn reusable
A. Context and motivation representations, with an emphasis on few-shot learning; it does
not cover topics such as AutoML and Neural Architecture
EEP representation learning has revolutionized the field
D of machine learning by enabling models to learn effective
features from data. However, it often requires large amounts of
Search [1], which are out of scope. Similarly, even though meta-
learning is often applied in the context of reinforcement learn-
ing [2], [3], it falls outside the scope of this article. Distinct from
data for solving a specific task, making it impractical in scenarios
existing surveys on meta-learning, such as [4], [5], [6], [7], [8],
this review article highlights several key differentiating factors:
Manuscript received 10 July 2023; revised 10 November 2023; accepted 21 r Inclusion of advanced meta-learning topics: In addition
January 2024. Date of publication 24 January 2024; date of current version 5
June 2024. This work was supported by the “Knowledge Foundation” (KK- to covering fundamental aspects of meta-learning, this
stiftelsen). Recommended for acceptance by M. Cho. (Corresponding author: review article delves into advanced topics such as learning
Anna Vettoruzzo.) from multimodal task distributions, meta-learning without
Anna Vettoruzzo, Mohamed-Rafik Bouguelia, and Thorsteinn Rögnvaldsson
are with the Center for Applied Intelligent Systems Research (CAISR), Halm- explicit task information, learning without data sharing
stad University, 301 18 Halmstad, Sweden (e-mail: [email protected]; among clients, adapting to distribution shifts, and contin-
[email protected]; [email protected]). ual learning from a stream of tasks. By including these
Joaquin Vanschoren is with the Automated Machine Learning Group, Eind-
hoven University of Technology, 5612 AZ Eindhoven, The Netherlands (e-mail: advanced topics, our article provides a comprehensive
[email protected]). understanding of the current state-of-the-art and highlights
KC Santosh is with the Applied AI Research Lab, Department of Computer the challenges and opportunities in these areas.
Science, University of South Dakota, Vermillion, SD 57069 USA (e-mail: r Detailed exploration of relationship with other topics:
[email protected]).
Digital Object Identifier 10.1109/TPAMI.2024.3357847 We not only examine meta-learning techniques but also

© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
https://creativecommons.org/licenses/by/4.0/
4764 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 7, JULY 2024

establish clear connections between meta-learning and re- II. BASIC NOTATIONS AND DEFINITIONS
lated areas, including transfer learning, multitask learning,
In this section, we introduce some simple notations which will
self-supervised learning, personalized federated learning, be used throughout the article and provide a formal definition of
and continual learning. This exploration of the relation-
the term “task” within the scope of this article.
ships and synergies between meta-learning and these im-
We use θ (and sometimes also φ) to represent the set of
portant topics provides valuable insights into how meta- parameters (weights) of a deep neural network model. D =
learning can be efficiently integrated into broader machine
{(xj , yj )}nj=1 denotes a dataset, where inputs xj are sampled
learning frameworks.
r Clear and concise exposition: Recognizing the complexity from the distribution p(x) and outputs yj are sampled from
p(y|x). The function L(., .) denotes a loss function, for example,
of meta-learning, this review article provides a clear and
L(θ, D) represents the loss achieved by the model’s parameters
concise explanation of the concepts, techniques and appli- θ on the dataset D. The symbol T refers to a task, which is
cations of meta-learning. It is written with the intention primarily defined by the data-generating distributions p(x) and
of being accessible to a wide range of readers, including
p(y|x) that define the problem.
both researchers and practitioners. Through intuitive ex- In a standard supervised learning scenario, the objective is
planations, illustrative examples, and references to seminal to optimize the parameters θ by minimizing the loss L(θ, D),
works, we facilitate readers’ understanding of the founda-
where the dataset D is derived from a single task T , and the
tion of meta-learning and its practical implications. loss function L depends on that task. Formally, in this setting,
r Consolidation of key information: As a fast-growing field,
a task Ti is a triplet Ti  {pi (x), pi (y|x), Li } that includes
meta-learning has information scattered across various
task-specific data-generating distributions pi (x) and pi (y|x), as
sources. This review article consolidates the most impor- well as a task-specific loss function Li . The goal is to learn a
tant and relevant information about meta-learning, present-
model that performs well on data sampled from task Ti . In a
ing a comprehensive overview in a single resource. By
more challenging setting, we consider learning from multiple
synthesizing the latest research developments, this survey tasks {Ti }Ti=1 , which involves (a dataset of) multiple datasets
becomes an indispensable guide to researchers and practi-
{Di }Ti=1 . In this scenario, a set of training tasks is used to learn
tioners seeking a thorough understanding of meta-learning
a model that performs well on test tasks. Depending on the
and its potential impact on various machine learning specific setting, a test task can either be sampled from the training
applications.
tasks or completely new, never encountered during the training
By highlighting these contributions, this article complements
phase.
existing surveys and offers unique insights into the current state In general, tasks can differ in various ways depending on the
and future directions of meta-learning.
application. For example, in image recognition, different tasks
can involve recognizing handwritten digits or alphabets from
C. Organization different languages [2], [10], while in natural language process-
ing, tasks can include sentiment analysis [11], [12], machine
In this article, we provide the foundations of modern deep translation [13], and chatbot response generation [14], [15],
learning methods for learning across tasks. To do so, we first [16]. Tasks in robotics can involve training robots to achieve
define the key concepts and introduce relevant notations used different goals [17], while in automated feedback generation,
throughout the article in Section II. Then, we cover the basics tasks can include providing feedback to students on different
of multitask learning and transfer learning and their relation exams [18]. It is worth noting that tasks can share structures,
to meta-learning in Section III. In Section IV, we present an even if they appear unrelated. For example, the laws of physics
overview of the current state of meta-learning methods and underlying real data, the language rules underlying text data,
provide a unified view that allows us to categorize them into and the intentions of people all share common structures that
three types: black-box meta-learning methods, optimization- enable models to transfer knowledge across seemingly unrelated
based meta-learning methods, and meta-learning methods that tasks.
are based on distance metric learning [9]. In Section V, we delve
into advanced meta-learning topics, explaining the relationship
between meta-learning and other important machine learning
III. FROM MULTITASK AND TRANSFER TO META-LEARNING
topics, and addressing issues such as learning from multimodal
task distributions, performing meta-learning without provided Meta-learning, multitask learning, and transfer learning en-
tasks, learning without sharing data across clients, learning compass different approaches aimed at learning across multiple
to adapt to distribution shifts, and continual learning from a tasks. Multitask learning aims to improve performance on a
stream of tasks. Finally, the article explores the application of set of tasks by learning them simultaneously. Transfer learning
meta-learning to real-world problems and provides an overview fine-tunes a pre-trained model on a new task with limited data.
of the landscape of promising frontiers and yet-to-be-conquered In contrast, meta-learning acquires useful knowledge from past
challenges that lie ahead. Section VI focuses on these challenges, tasks and leverages it to learn new tasks more efficiently. In this
shedding light on the most pressing questions and future research section, we transition from discussing “multitask learning” and
opportunities. “transfer learning” to introducing the topic of “meta-learning”.
VETTORUZZO et al.: ADVANCES AND CHALLENGES IN META-LEARNING: A TECHNICAL REVIEW 4765

determined by the hyperparameter λ. In the case of L2 regular-


ization, the objective function is given by:
T
 T
 
min wi Li ({θsh , θi }, Di ) + λ θi − θi .
θ sh ,θ 1 ,...,θ T
i=1 i =1

However, soft parameter sharing can be more memory-intensive


as separate sets of parameters are stored for each task, and it
requires additional design decisions and hyperparameters.
Another approach to sharing parameters is to condition a
single model on a task descriptor zi that contains task-specific
information used to modulate the network’s computation. The
task descriptor zi can be a simple one-hot encoding of the task
index or a more complex task specification, such as language
description or user attributes. When a task descriptor is provided,
it is used to modulate the weights of the shared network with
respect to the task at hand. Through this modulation mechanism,
the significance of the shared features is determined based on
the particular task, enabling the learning of both shared and task-
specific features in a flexible manner. Such an approach grants
fine-grained control over the adjustment of the network’s repre-
sentation, tailoring it to each individual task. Various methods
for conditioning the model on the task descriptor are described
in [25]. More complex methods are also provided in [26], [27],
[28].
Fig. 1. Multitask learning vs transfer learning vs meta-learning. Choosing the appropriate approach for parameter sharing,
determining the level of the network architecture at which to
share parameters, and deciding on the degree of parameter
A. Multitask Learning Problem sharing across tasks are all design decisions that depend on the
problem at hand. Currently, these decisions rely on intuition and
As illustrated in Fig. 1(a), multitask learning (MTL) trains a knowledge of the problem, making them more of an art than a
model to perform multiple related tasks simultaneously, lever- science, similar to the process of tuning neural network architec-
aging shared structure across tasks, and improving performance tures. Moreover, multitask learning presents several challenges,
compared to learning each task individually. In this setting, there such as determining which tasks are complementary, particularly
is no distinction between training and test tasks, and we refer to in scenarios with a large number of tasks, as in [29]. Interested
them as {Ti }Ti=1 . readers can find a more comprehensive discussion of multitask
One common approach in MTL is hard parameter sharing, learning in [30], [31].
where the model parameters θ are split into shared θsh and In summary, multitask learning aims to learn a set of T
task-specific θi parameters. These parameters are learned si- tasks {Ti }Ti=1 at once. Even though the model can generalize
multaneously through an objective function that takes the form: to new data from these T tasks, it might not be able to handle a
T
 completely new task that it has not been trained on. This is where
min wi Li ({θsh , θi }, Di ), transfer learning and meta-learning become more relevant.
θ sh ,θ 1 ,...,θ T
i=1
B. Transfer Learning Via Fine-Tuning
where wi can weight tasks differently. This approach is often
implemented using a multi-headed neural network architecture, Transfer learning is a valuable technique that allows a model
where a shared encoder (parameterized by θsh ) is responsible for to leverage representations learned from one or more source
feature extraction. This shared encoder subsequently branches tasks to solve a target task. As illustrated in Fig. 1(b), the main
out into task-specific decoding heads (parameterized by θi ) goal is to use the knowledge learned from the source task(s) Ta
dedicated to individual tasks Ti [19], [20], [21]. to improve the performance of the model on a new task, usually
Soft parameter sharing is another approach in MTL that en- referred to as the target task Tb , especially when the target task
courages parameter similarity across task-specific models using dataset Db is limited. In practice, the source task data Da is often
regularization penalties [22], [23], [24]. In this approach, each inaccessible, either because it is too expensive to obtain or too
task typically has its own model with its own set of parameters θi , large to store.
while the shared parameters set θsh can be empty. The objective One common approach for transfer learning is fine-tuning,
function is similar to that of hard parameter sharing, but with which involves starting with a model that has been pre-trained
an additional regularization term that controls the strength of on the source task dataset Da . The parameters of the pre-trained
parameter sharing across tasks. The strength of regularization is model, denoted as θ, are then fine-tuned on the training data
4766 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 7, JULY 2024

Db from the target task Tb using gradient descent or any other During the meta-training phase, prior knowledge enabling
optimizer for several optimization steps. An example of the efficient learning of new tasks is extracted from a set of training
fine-tuning process for one gradient descent step is expressed tasks {Ti }Ti=1 . This is achieved by using a meta-dataset con-
as follows: sisting of multiple datasets {Di }Ti=1 , each corresponding to a
different training task. At meta-test time, a small training dataset
φ ← θ − α∇θ L(θ, Db ), Dnew is observed from a completely new task Tnew and used in
where φ denotes the parameters fine-tuned for task Tb , and α is conjunction with the prior knowledge to infer the most likely
the learning rate. posterior parameters. As in transfer learning, accessing prior
Models with pre-trained parameters θ are often available tasks at meta-test time is impractical. Although the datasets
online, including models pre-trained on large datasets such as {Di }i come from different data distributions (since they come
ImageNet for image classification [32] and language models like from different tasks {Ti }i ), it is assumed that the tasks them-
BERT [33], PaLM [34], LLaMA [35], and GPT-4 [36], trained selves (both for training and testing) are drawn i.i.d. from an
on large text corpora. Models pre-trained on other large and underlying task distribution p(T ), implying some similarities in
diverse datasets or using unsupervised learning techniques, as the task structure. This assumption ensures the effectiveness of
discussed in Section V-C, can also be used as a starting point for meta-learning frameworks even when faced with limited labeled
fine-tuning. data. Moreover, the more tasks that are available for meta-
However, as discussed in [37], it is crucial to avoid destroying training, the better the model can learn to adapt to new tasks,
initialized features when fine-tuning. Some design choices, such just as having more data improves performance in traditional
as using a smaller learning rate for earlier layers, freezing earlier machine learning.
layers and gradually unfreezing, or re-initializing the last layer, In the next section, we provide a more formal definition of
can help to prevent this issue. Recent studies such as [38] show meta-learning and various approaches to it.
that fine-tuning the first or middle layers can sometimes work
better than fine-tuning the last layers, while others recommend IV. META-LEARNING METHODS
a two-step process of training the last layer first and then fine- To gain a unified understanding of the meta-learning problem,
tuning the entire network [37]. More advanced approaches, such we can draw an analogy to the standard supervised learning
as STILTs [39], propose an intermediate step of further training setting. In the latter, the goal is to learn a set of parameters φ
the model on a labeled task with abundant data to mitigate the for a base model hφ (e.g., a neural network parametrized by φ),
potential degradation of pre-trained features. which maps input data x ∈ X to the corresponding output y ∈ Y
In [40], it was demonstrated that transfer learning via fine- as follows:
tuning may not always be effective, particularly when the target
hφ : X → Y
task dataset is very small or very different from the source tasks. (1)
x → y = hφ (x).
To investigate this, the authors fine-tuned a pre-trained universal
language model on specific text corpora corresponding to new To accomplish this, a typically large training dataset D =
tasks using varying numbers of training examples. Their results {(xj , yj )}nj=1 specific to a particular task T is used to learn
showed that starting with a pre-trained model outperformed φ.
training from scratch on the new task. However, when the In the meta-learning setting, the objective is to learn prior
size of the new task dataset was very small, fine-tuning on knowledge, which consists of a set of meta-parameters θ, for a
such a limited number of examples led to poor generalization procedure Fθ (Ditr , xts ). This procedure uses θ to efficiently learn
performance. To address this issue, meta-learning can be used from (or adapt to) a small training dataset Ditr = {(xk , yk )}K k=1
to learn a model that can effectively adapt to new tasks with from a task Ti , and then make accurate predictions on unlabeled
limited data by leveraging prior knowledge from other tasks. In test data xts from the same task Ti . As we will see in the following
fact, meta-learning is particularly useful for learning new tasks sections, Fθ is typically composed of two functions: (1) a meta-
from very few examples, and we will discuss it in more detail in learner fθ (.) that produces task-specific parameters φi ∈ Φ from
the remainder of this article. Ditr ∈ X K , and (2) a base model hφi (.) that predicts outputs
corresponding to the data in xts :
C. Meta-Learning Problem
fθ : X K → Φ hφi : X → Y
Meta-learning (or learning to learn) is a field that aims to Ditr → φi = fθ (Ditr ), x → y = hφi (x).
surpass the limitations of traditional transfer learning by adopt- (2)
ing a more sophisticated approach that explicitly optimizes for Note that the process of obtaining task-specific parameters φi =
transferability. As discussed in Section III-B, traditional transfer fθ (Ditr ) is often referred to as “adaptation” in the literature,
learning involves pre-training a model on source tasks and as it adapts to the task Ti using a small amount of data while
fine-tuning it for a new task. In contrast, meta-learning trains leveraging the prior knowledge summarized in θ. The objective
a network to efficiently learn or adapt to new tasks with only of meta-training is to learn the set of meta-parameters θ. This is
a few examples. Fig. 1(c) illustrates this approach, where at accomplished by using a meta-dataset {Di }Ti=1 , which consists
meta-training time we learn to learn tasks, and at meta-test time of a dataset of datasets, where each dataset Di = {(xj , yj )}nj=1
we learn a new task efficiently. is specific to a task Ti .
VETTORUZZO et al.: ADVANCES AND CHALLENGES IN META-LEARNING: A TECHNICAL REVIEW 4767

φi . In this case, φi consists of {zi , θh }, where θh denotes the


trainable parameters of the network hφi . The base network hφi
is modulated with task descriptors by using various techniques
for conditioning on task descriptors discussed in Section III-A.
Several black-box meta-learning methods adopt different neu-
ral network architectures to represent fθ . For instance, methods
described in [41], use LSTMs or architectures with augmented
memory capacities, such as Neural Turing Machines, while
others, like Meta Networks [43], employ external memory
mechanisms. SNAIL [42] defines meta-learner architectures that
leverage temporal convolutions to aggregate information from
Fig. 2. Black-box meta-learning. past experience and attention mechanisms to pinpoint specific
pieces of information. Alternatively, some methods, such as
the one proposed in [44], use a feedforward plus averaging
Algorithm 1: Black-Box Meta-Learning. strategy. This latter feeds each data-point in Ditr = {(xj , yj )}K
j=1
1: Randomly initialize θ through a neural network to produce a representation rj for each
2: while not done do data-point, and then averages
3: Sample a task Ti ∼ p(T ) (or a mini-batch of tasks) 1
Kthese representations to create a
task representation zi = K j=1 rj . This strategy may be more
4: Sample disjoint datasets Ditr , Dits from Ti effective than using a recurrent model such as LSTM, as it does
5: Compute φi ← fθ (Ditr ) not rely on the assumption of temporal relationships between
6: Update θ using ∇θ L(φi , Dits ) data-points in Ditr .
7: end while Recent research efforts [45], [46], [47], [48] have explored
8: return θ the connection between in-context learning and black-box
meta-learning methods. This connection reveals that in-context
learning can be viewed as a special instance of the broader
The unified view of meta-learning presented here is bene- meta-learning paradigm [46]. In particular, in-context learning
ficial because it simplifies the meta-learning problem by re- involves training models to perform well in new tasks with
ducing it to the design and optimization of Fθ . Moreover, it minimal examples, achieved by conditioning their response on
facilitates the categorization of the various meta-learning ap- context. Kirsch et al. [47] demonstrate that general-purpose
proaches into three categories: black-box meta-learning meth- in-context learning algorithms can be trained from scratch us-
ods, optimization-based meta-learning methods, and distance ing black-box models with minimal inductive bias (such as
metric-based meta-learning methods (as discussed in [9]). An transformers [49]), highlighting the adaptability and potential of
overview of these categories is provided in the subsequent black-box meta-learning methods in these specialized contexts.
sections. Black-box meta-learning methods are expressive, versatile,
and easy to combine with various learning problems, including
A. Black-Box Meta-Learning Methods classification, regression, and reinforcement learning. However,
Black-box meta-learning methods, also known as model- they require complex architectures for the meta-learner fθ , mak-
based meta-learning [7], [9], represent fθ as a black-box neural ing them computationally demanding and data-inefficient. As an
network that takes the entire training dataset, Ditr , and predicts alternative, one can represent φi = fθ (Ditr ) as an optimization
task-specific-parameters, φi . These parameters are then used to procedure instead of a neural network. The next section explores
parameterize the base network, hφi , and make predictions for methods that utilize this approach.
test data-points, y ts = hφi (xts ). The architecture of this approach
is shown in Fig. 2. The meta-parameters, θ, are optimized
as shown in (3), and a general algorithm for these kinds of B. Optimization-Based Meta-Learning Methods
black-box methods is outlined in Algorithm 1.
 Optimization-based meta-learning offers an alternative to the
min L(fθ (Ditr ), Dits ). (3) black-box approach, where the meta-learner fθ is an optimiza-
θ    tion procedure like gradient descent, rather than a black-box
Ti
φi
neural network. The goal of optimization-based meta-learning is
However, this approach faces a major challenge: outputting to acquire a set of meta-parameters θ that are easy to learn via gra-
all the parameters φi of the base network hφi is not scalable dient descent and to fine-tune on new tasks. Most optimization-
and is impractical for large-scale models. To overcome this based techniques do so by defining meta-learning as a bi-level
issue, black-box meta-learning methods, such as MANN [41] optimization problem. At the inner level, fθ produces task-
and SNAIL [42], only output sufficient statistics instead of the specific parameters φi using Ditr , while at the outer level, the
complete set of parameters of the base network. These methods initial set of meta-parameters θ is updated by optimizing the
allow fθ to output a low-dimensional vector zi that encodes performance of hφi on the test set of the same task. This is
contextual task information, rather than a full set of parameters shown in Fig. 3 and in Algorithm 2 in case fθ is a gradient-based
4768 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 7, JULY 2024

Fig. 3. Optimization-based meta-learning with gradient-based optimization.

optimization. The meta-parameters θ can represent inner opti-


mizers [50], [51], [52], [53], neural network architectures [54],
[55], other network hyperparameters [56], or the initialization
of the base model h(.) [2], [57]. The latter approach is similar to
transfer learning via fine-tuning (cf. Section III-B), but instead
of using a pre-trained θ that may not be transferable to new tasks,
we learn θ to explicitly optimize for transferability.
Model-Agnostic Meta-Learning (MAML) [2] is one of the
earliest and most popular optimization-based meta-learning
methods. The main idea behind MAML is to learn a set of initial
neural network’s parameters θ that can easily be fine-tuned for
any task using gradient descent with only a few steps. During the
meta-training phase, MAML minimizes the objective defined as
follows:
Fig. 4. Visual representation of the computation graph of MAML.

min L(θ − α∇θ L(θ, Ditr ), Dits ). (4)
θ   
Ti
φi Algorithm 2: Optimization-Based Meta-Learning With
Gradient-Based Optimization.
Note that in (4), the task-specific parameters φi are obtained
through a single gradient descent step from θ, although in 1: Randomly initialize θ
practice, a few more gradient steps are usually used for better 2: while not done do
performance. 3: Sample a task Ti ∼ p(T ) (or a mini-batch of tasks)
As a result, MAML produces a model initialization θ that 4: Sample disjoint datasets Ditr , Dits from Ti
can be quickly adapted to new tasks with a small number of 5: Optimize φi ← θ − α∇θ L(θ, Ditr )
training examples. Algorithm 2 can be viewed as a simplified 6: Update θ using ∇θ L(φi , Dits )
illustration of MAML, where θ represents the parameters of 7: end while
a neural network. This is similar to Algorithm 1 but with φi 8: return θ
obtained through optimization.
During meta-test time, a small dataset Dnewtr
is observed from
a new task Tnew ∼ p(T ). The goal is to use the prior knowledge In [58], the authors investigated the effectiveness of
encoded in θ to train a model that generalizes well to new, unseen optimization-based meta-learning in generalizing to similar but
examples from this task. To achieve this, θ is fine-tuned with extrapolated tasks that are outside the original task distribu-
a few adaptation steps using ∇θ L(θ, Dnew tr
), resulting in task- tion p(T ). The study found that, as task variability increases,
specific parameters φ. These parameters are then used to make black-box meta-learning methods such as SNAIL [42] and
accurate predictions on previously unseen input data from Tnew . MetaNet [43] acquire less generalizable learning strategies than
MAML can be thought of as a computation graph (as shown gradient-based meta-learning approaches like MAML.
in Fig. 4) with an embedded gradient operator. Interestingly, the However, despite its success, MAML faces some challenges
components of this graph can be interchanged or replaced with that have motivated the development of other optimization-
components from the black-box approach. For instance, [50] based meta-learning methods. One of these challenges is the
also learned an initialization θ, but adapted θ differently by instability of MAML’s bi-level optimization. Fortunately, there
using a learned network fw (θ, Ditr , ∇θ L) instead of the gradient are enhancements that can significantly improve optimization
∇θ L(θ, Ditr ): process. For instance, Meta-SGD [59] and AlphaMAML [60]
learn a vector of learning rates α automatically, rather than using
φi ← θ − αfw (θ, Ditr , ∇θ L) a manually set scalar value α. Other methods like DEML [61],
VETTORUZZO et al.: ADVANCES AND CHALLENGES IN META-LEARNING: A TECHNICAL REVIEW 4769

ANIL [62] and BOIL [63] suggest optimizing only a subset


of the parameters during adaptation. Additionally, MAML++
[64] proposes various modifications to stabilize the optimization
process and further improve the generalization performance.
Moreover, Bias-transformation [58] and CAVIA [65] introduce
context variables for increased expressive power, while [66] en-
forces a well-conditioned parameter space based on the concepts
of the condition number [67].
Another significant challenge in MAML is the computation-
ally expensive process of backpropagating through multiple
gradient adaptation steps. To overcome this challenge, first-order
alternatives to MAML such as FOMAML and Reptile have
been introduced [68]. For example, Reptile aims to find an
Fig. 5. Meta-learning via distance metric learning using matching net-
initialization θ that is close to each task’s optimal parameters. work [75].
Another approach is to optimize only the parameters of the
last layer. For instance, [69] and [70] perform a closed-form or
convex optimization on top of meta-learned features. Another Algorithm 3: Meta-Learning via Metric Learning (Match-
solution is iMAML [71], which computes the full meta-gradient ing Networks).
without differentiating through the optimization path, using the 1: Randomly initialize θ
implicit function theorem. 2: while not done do
3: Sample a task Ti ∼ p(T ) (or a mini-batch of tasks)
C. Meta-Learning Via Distance Metric Learning 4: datasets Ditr , Dits from Ti
Sample disjoint 
5: Compute ŷ = (xk ,yk )∈Dtr fθ (xts , xk )yk
ts
In the context of low data regimes, such as in few-shot i
6: Update θ using ∇θ L(ŷ ts , y ts )
learning, simple non-parametric methods such as Nearest
7: end while
Neighbors [72] can be effective. However, black-box and
8: return θ
optimization-based meta-learning approaches discussed so far in
Sections IV-A and IV-B have focused on using parametric base
models, such as neural networks. In this section we discuss meta-
learning approaches that employ a non-parametric learning pro-
cedure. The key concept is to use parametric meta-learners
to produce effective non-parametric learners, thus eliminating
the need for second-order optimization, as required by several
methods discussed in Section IV-B.
Suppose we are given a small training dataset Ditr that presents
a 1-shot-N -way classification problem, i.e., N classes with only
one labeled data-point per class, along with a test data-point
Fig. 6. Prototypical networks.
xts . To classify xts , a Nearest Neighbor learner compares it
with each training data-point in Ditr . However, determining an
effective space and distance metric for this comparison can
be challenging. For example, using the L2 distance in pixel Matching Networks. It is similar to Algorithms 1 and 2, except
space for image data may not yield satisfactory results [73]. that the base model is non-parametric, so there is no φi (see lines
To overcome this, a distance metric can be derived by learning 5 and 6).
how to compare instances using meta-training data. To learn an However, Matching Networks are specifically designed for
appropriate distance metric for comparing instances, a Siamese 1-shot classification and cannot be directly applied to K-shot
network [74] can be trained to solve a binary classification prob- classification problems (where there are K labeled samples per
lem that predicts whether two images belong to the same class. class). To address this issue, other methods, such as Prototypical
During meta-test time, each image in Ditr is compared with the Networks [76], have been proposed. Prototypical Networks ag-
test image xts to determine whether they belong to the same class gregate class information to create a prototypical embedding,
or not. However, there is a nuance due to the mismatch between as illustrated in Fig. 6. In Prototypical Networks, line 5 of
the binary classification problem during meta-training and the Algorithm 3 is replaced with:
N -way classification problem during meta-testing. Matching
Networks, introduced in [75], address this by learning an em- exp (−fθ (x) − cl )
pθ (y = l|x) =  ,
l exp (−fθ (x) − cl )
bedding space with a network fθ and using Nearest Neighbors
in the learned space, as shown in Fig. 5. The network is trained
end-to-end to ensure that meta-training is consistent with meta- where cl is the mean embedding of all the samples in the l-th
1
testing. Algorithm 3 outlines the meta-training process used by class, i.e., cl = K (x,y)∈Dtr 1(y = l)fθ (x).
i
4770 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 7, JULY 2024

TABLE I r Generalizability to diverse/variable tasks: Explores the


SUMMARY COMPARISON OF META-LEARNING APPROACHES
approach’s ability to acquire generalizable learning strate-
gies, extending to similar but extrapolated tasks beyond the
original task distribution.
Although black-box, optimization-based, and distance
metric-based meta-learning approaches differ, they are not mu-
tually exclusive and can be combined in various ways. For
instance, in [80], gradient descent is applied while condition-
ing on the data, allowing the model to modulate the feature
representations and capture inter-class dependencies. In [81],
LEO (Latent Embedding Optimization) combines optimization-
While methods such as Siamese networks, Matching Net- based meta-learning with a latent embedding produced by the
works, and Prototypical Networks can perform few-shot classifi- RelationNet embedding proposed in [77]. The parameters of the
cation by embedding data and applying Nearest Neighbors [74], model are first conditioned on the input data and then further
[75], [76], they may not be sufficient to capture complex re- adapted through gradient descent. In [82], the strength of both
lationships between data-points. To address this, alternative MAML and Prototypical Networks are combined to form a
approaches have been proposed. RelationNet [77] introduces hybrid approach called Proto-MAML. This approach exploits
a non-linear relation module that can reason about complex the flexible adaptation of MAML, while initializing the last layer
relationships between embeddings. Garcia et al. [78] propose with ProtoNet to provide a simple inductive bias that is effective
to use graph neural networks to perform message passing on for very-few-shot learning. Similarly, [83] proposes a model
embeddings, allowing for the capture of more complex depen- where the meta-learner operates using an optimization-based
dencies. Finally, Allen et al. [79] extend Prototypical Networks meta-model, while the base learner exploits a metric-based
to learn an infinite mixture of prototypes, which improves the approach (either Matching Network or Prototypical Network).
model’s ability to represent the data distribution. The distance metrics used by the base learner can better adapt
to different tasks thanks to the weight prediction from the
meta-learner.
D. Comparison and Hybrid Approaches In summary, researchers have explored combining black-box,
Meta-learning approaches exhibit distinct strengths and trade- optimization-based, and distance metric-based meta-learning
offs. Black-box, optimization-based, and distance metric-based approaches to take advantage of their individual strengths. These
meta-learning approaches define Fθ (Ditr , xts ) differently, and the combined approaches aim to improve performance, adaptability,
choice of method depends on specific use-case requirements. and generalization in few-shot learning tasks by integrating
To assist readers, we provide a unified comparison in Table I, different methodologies.
summarizing the pros and cons of each approach based on the
following criteria: V. ADVANCED META-LEARNING TOPICS
r Parametric base model: Indicates whether the approach is
The field of meta-learning has seen rapid development in
parametric (yes) or non-parametric (no), highlighting the
recent years, with numerous methods proposed for learning
model’s flexibility.
r Expressive power: Evaluates the ability of Fθ to represent to learn from a few examples. In this section, we delve
into advanced topics in meta-learning that extend the meta-
a wide range of learning procedures.
r Consistency: Reflects the extent to which the learned learn- learning paradigm to more complex scenarios. We explore
meta-learning from multi-modal task distributions, the chal-
ing procedure improves with additional data.
r Versatility: Compatibility with different scenarios and the lenge of out-of-distribution tasks, and unsupervised meta-
learning. Additionally, we examine the relationship between
ease of integration with different learning problems, such
meta-learning and personalized federated learning, domain
as classification, regression, and reinforcement learning.
r Simple optimization: Considers the complexity of opti- adaptation/generalization, as well as the intersection between
meta-learning and continual learning. By delving into these
mization, including optimization challenges arising from
advanced topics, we can gain a deeper understanding of the
complex models or the necessity for sophisticated second-
potential of meta-learning and its applications in more complex
order optimization techniques.
r Data efficiency (in terms of training tasks): The method’s real-world scenarios.
efficiency concerning the number of training tasks required
for effective meta-learning. A. Meta-Learning From Multimodal Task Distributions
r Positive inductive bias at meta-learning onset: Indicates Meta-learning methods have traditionally focused on opti-
the presence of an inherent bias favoring certain learning mizing performance within a unimodal task distribution p(T ),
strategies, influencing initial meta-learning performance. assuming that all tasks are closely related and share similarities
r Resource efficiency: Assesses computational and memory within a single application domain. However, in real-world
requirements, providing insights into the practical feasibil- scenarios, tasks are often diverse and sampled from a more
ity of the approach. complex task distribution with multiple unknown modes. The
VETTORUZZO et al.: ADVANCES AND CHALLENGES IN META-LEARNING: A TECHNICAL REVIEW 4771

performance of most meta-learning approaches tends to deteri- B. Meta-Learning & Personalized Federated Learning
orate as the dissimilarity among tasks increases [84], [85], [86],
Federated learning (FL) is a distributed learning paradigm
[87], indicating that a globally shared set of meta-parameters θ
where multiple clients collaborate to train a shared model while
may not adequately capture the heterogeneity among tasks and
preserving data privacy by keeping their data locally stored.
enable fast adaptation.
FedAvg [106] is a pioneering method that combines local
To address this challenge, MMAML [88] builds upon the
stochastic gradient descent on each client with model averaging
standard MAML approach by estimating the mode of tasks
on a central server. This approach performs well when local
sampled from a multimodal task distribution p(T ) and adjusting
data across clients is independent and identically distributed
the initial model parameters accordingly. Another approach
(IID). However, in scenarios with heterogeneous (non-IID) data
proposed in [89] involves learning a meta-regularization condi-
distributions, regularization techniques [107], [108], [109] have
tioned on additional task-specific information. However, obtain- been proposed to improve local learning.
ing such additional task information may not always be feasible. Personalized federated learning (PFL) is an alternative ap-
Alternatively, some methods propose learning multiple model
proach that aims to develop customized models for individual
initializations θ1 , θ2 , . . . , θM and selecting the most suitable clients while leveraging the collaborative nature of FL. Popular
one for each task, leveraging clustering techniques applied in PFL methods include L2GD [110], which combines local and
either the task-space or parameter-space [90], [91], [92], [93], or
global models, as well as multi-task learning methods like
relying on the output of an additional network, as in MUSE [94]. pFedMe [111], Ditto [112], and FedPAC [113]. Clustered or
CAVIA [65] partitions the initial model parameters into shared group-based FL approaches [114], [115], [116], [117] learn
parameters across all tasks and task-specific context parameters,
multiple group-based global models. In contrast, meta-learning-
while LGM-Net [95] directly generates classifier weights based based methods interpret PFL as a meta-learning algorithm,
on an encoded task representation.
where personalization to a client aligns with adaptation to
A series of related works (but outside of the meta-learning
a task [118]. Notably, various combinations of MAML-type
field) aim to build a “universal representation” that encompasses methods with FL architectures have been explored in [118],
a robust set of features capable of achieving strong perfor-
[119], [120] to find an initial shared point that performs well after
mance across multiple datasets (or modes) [44], [96], [97], [98],
personalization to each client’s local dataset. Additionally, the
[99], [99], [100]. This representation is subsequently adapted to authors of [121] proposed ARUBA, a meta-learning algorithm
individual tasks in various ways. However, these approaches
inspired by online convex optimization, which enhances the
are currently limited to classification problems and do not
performance of FedAvg.
leverage meta-learning techniques to efficiently adapt to new To summarize, there is a growing focus on addressing FL
tasks.
challenges in non-IID data settings. The integration of meta-
A more recent line of research focuses on cross-domain learning has shown promising outcomes, leading to enhanced
meta-learning, where knowledge needs to be transferred from personalization and performance in PFL methods.
tasks sampled from a potentially multimodal distribution p(T )
to target tasks sampled from a different distribution. One notable
study, BOIL [63], reveals that the success of meta-learning C. Unsupervised Meta-Learning With Tasks Construction
methods, such as MAML, can be attributed to large changes in In meta-training, constructing tasks typically relies on labeled
the representation during task learning. The authors emphasize data. However, real-world scenarios often involve mostly, or
the importance of updating only the body (feature extractor) only, unlabeled data, requiring techniques that leverage unla-
of the model and freezing the head (classifier) during the adap- beled data to learn feature representations that can transfer to
tation phase for effective cross-domain adaptation. Building on downstream tasks with limited labeled data. One alternative to
this insight, DAML [101] introduces tasks from both seen and address this is through “self-supervised learning” (also known as
pseudo-unseen domains during meta-training to obtain domain- “unsupervised pre-training”) [122], [123], [124]. This involves
agnostic initial parameters capable of adapting to novel classes training a model on a large unlabeled dataset, as depicted in
in unseen domains. In [102], the authors propose a transfer- Fig. 7, to capture informative features. Contrastive learn-
able meta-learning algorithm with a meta task adaptation to ing [122], [125] is commonly used in this context, aiming to
minimize the domain divergence and thus facilitate knowledge learn features by bringing similar examples closer together while
transfer across domains. To further improve the transferability of pushing differing examples apart. The learned features can then
cross-domain knowledge, [103] and [104] propose to incorpo- be fine-tuned on a target task Tnew with limited labeled data
rate semi-supervised techniques into the meta-learning frame- Dnew
tr
, leading to improved performance compared to training
work. Specifically, [103] combines the representation power of from scratch. Another promising alternative is “unsupervised
large pre-trained language models (e.g., BERT [33]) with the meta-learning,” which aims to automatically construct diverse
generalization capability of prototypical networks enhanced by and structured training tasks from unlabeled data. These tasks
SMLMT [105] to achieve effective generalization and adapta- can then be used with any meta-learning algorithm, such as
tion to tasks from new domains. In contrast, [104] promotes the MAML [2] and ProtoNet [76]. In this section, we explore meth-
idea of task-level self-supervision by leveraging multiple views ods for meta-training without predefined tasks and investigate
or augmentations of tasks. strategies for automatically constructing tasks for meta-learning.
4772 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 7, JULY 2024

the authors in [140] integrate contrastive learning in a two-


stage training paradigm consisting of sequential pre-training
and meta-training stages. Another work [141] interprets a meta-
learning problem as a set-level problem and maximizes the
agreement between augmented sets using SimCLR [142]. Fi-
nally, PsCo [142] builds upon MoCo [123] by progressively
improving pseudo-labeling and constructing diverse tasks in an
online manner. These findings indicate the potential for lever-
aging existing advances in meta-learning to improve contrastive
learning (and vice-versa).
To meta-learn with unlabeled text data, some methods use lan-
guage modeling, as shown in [45] for GPT-3. Here, the support
Fig. 7. Unsupervised pre-training. set Ditr consists of a sequence of characters, and the query set Dits
consists of the subsequent sequence of characters. However, this
approach may not be suitable for text classification tasks, such as
sentiment analysis or identifying political bias. In [105], an al-
The method proposed in [126] constructs tasks based on unsu-
ternative approach (SMLMT) for self-supervised meta-learning
pervised representation learning methods such as BiGAN [127],
for few-shot natural language classification tasks is proposed.
[128] or DeepCluster [129] and clusters the data in the embed-
SMLMT involves masking out words and classifying the masked
ding space to assign pseudo-labels and construct tasks. Other
word to construct tasks. The process involves: 1) sampling a
methods such as UMTRA [130] and LASIUM [131] generate
subset of N unique words and assigning each word a unique ID
synthetic samples using image augmentations or pre-trained
as its class label, 2) sampling K + Q sentences that contain each
generative networks. In particular, the authors in [130] construct
of the N words and masking out the corresponding word in each
a task Ti for a 1-shot N -way classification problem by creating
sentence, and 3) constructing the support set Ditr and the query
a support set Ditr and a query set Dits as follows:
r Randomly sample N images and assign labels 1, . . . , N , set Dits using the masked sentences and their corresponding word
IDs. SMLMT (for unsupervised meta-learning) is compared to
storing them in Ditr .
r Augment1 each image in Ditr , and store the resulting (aug-
BERT [33], a method that uses standard self-supervised learning
and fine-tuning. SMLMT outperforms BERT on some tasks
mented) images in Di . ts
and achieves at least equal performance on others. Further-
Such augmentations can be based on domain knowledge or
more, Hybrid-SMLMT (semi-supervised meta-learning, which
learned augmentation strategies like those proposed in [132].
involves meta-learning on constructed tasks and supervised
In principle, task construction techniques can be applied be-
tasks), is compared to MT-BERT [143] (multi-task learning
yond image-based augmentation. For instance, temporal aspects
on supervised tasks) and LEOPARD [144] (an optimization-
can be leveraged by incorporating time-contrastive learning on
based meta-learner that uses only supervised tasks). The results
videos, as demonstrated in [133]. Another approach is offered
show that Hybrid-SMLMT significantly outperforms these other
by Viewmaker Networks [134], which learn augmentations that
methods.
yield favorable outcomes not only for images but also for speech
and sensor data. Contrary to these works focusing on generating
pseudo tasks, Meta-GMVAE [135] and Meta-SVEBM [136] D. Meta-Learning & Domain adaptation/generalization
address the problem by using variational autoencoders [137] Domain shift is a fundamental challenge, where the distri-
and energy-based models [138], respectively. However, these bution of the input data changes between the training and test
methods are limited to the pseudo-labeling strategies used to domains. To address this problem, there is a growing interest
create tasks, they rely on the quality of generated samples and in utilizing meta-learning techniques for more effective domain
they cannot scale to large-scale datasets. adaptation and domain generalization. These approaches aim to
To overcome this limitation, recent approaches have inves- enable models to quickly adapt to new domains with limited
tigated the possibility of using self-supervised learning tech- data or to train robust models that achieve better generalization
niques to improve unsupervised meta-learning methods. In par- on domains they have not been explicitly trained on.
ticular, in [139], the relationship between contrastive learning Effective domain adaptation via meta-learning: Domain
and meta-learning is explored, demonstrating that established adaptation is a form of transductive transfer learning that lever-
meta-learning methods can achieve comparable performance to ages source domain(s) pS (x, y) to achieve high performance on
contrastive learning methods, and that representations transfer test data from a target domain pT (x, y). It assumes pS (y|x) =
similarly well to downstream tasks. Inspired by these findings, pT (y|x) but pS (x) = pT (x), treating domains as a particular
kind of tasks, with a task Ti  {pi (x), pi (y|x), Li } and a domain
di  {pi (x), p(y|x), L}. For example, healthcare data from dif-
1 Various augmentation techniques, like flipping, cropping, or reflecting an
ferent hospitals with varying imaging techniques or patient
image, typically preserve its label. Likewise, nearby image patches or adjacent demographics can correspond to different domains. Domain
video frames share similar characteristics and are therefore assigned the same
label. adaptation is most commonly achieved via feature alignment
as in [145], [146], or via translation between domains using CycleGAN [147], as in [148], [149], [150]. Other approaches focus on aligning the feature distribution of multiple source domains with the target domain [151], or they address the multi-target domain adaptation scenario [152], [153], [154] with models capable of adapting to multiple target domains. However, these methods face limitations when dealing with insufficient labeled data in the source domain or when quick adaptation to new target domains is required. Additionally, they assume the input-output relationship (i.e., p(y|x)) to be the same across domains. To solve these problems, some methods [153], [155], [156], [157] combine meta-learning with domain adaptation. In particular, ARM [155] leverages contextual information extracted from batches of unlabeled data to learn a model capable of adapting to distribution shifts.
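As an illustration of how a batch of unlabeled data can drive adaptation, the sketch below conditions a predictor on a summary of the incoming batch, in the spirit of ARM [155]. It assumes PyTorch; the layer sizes, the mean-pooled context, and the class names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BatchContextModel(nn.Module):
    """Sketch: a context network summarizes an unlabeled batch from the current
    domain, and the predictor conditions on that summary. During meta-training,
    the task loss is backpropagated through both networks, so the batch-level
    context becomes useful for adapting to distribution shift at test time."""

    def __init__(self, in_dim, ctx_dim, n_classes):
        super().__init__()
        self.context_net = nn.Sequential(nn.Linear(in_dim, ctx_dim), nn.ReLU())
        self.predictor = nn.Sequential(
            nn.Linear(in_dim + ctx_dim, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, x_batch):
        # One context vector per batch, i.e., per domain at deployment time.
        ctx = self.context_net(x_batch).mean(dim=0, keepdim=True)
        ctx = ctx.expand(x_batch.size(0), -1)
        return self.predictor(torch.cat([x_batch, ctx], dim=1))
```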
Fig. 8. Domain generalization problem.
Effective domain generalization via meta-learning: Domain generalization enables models to perform well on new and unseen domains without requiring access to their data, as illustrated in Fig. 8. This is particularly useful in scenarios where access to data is restricted due to real-time deployment requirements or privacy policies. For instance, an object detection model for self-driving cars trained on three types of roads may need to be deployed to a new road without any data from that domain. In contrast to domain adaptation, which requires access to (unlabeled) data from a specific target domain during training to specialize the model, domain generalization belongs to the inductive setting. Most domain generalization methods aim to train neural networks to learn domain-invariant representations that are consistent across domains. For instance, domain adversarial training [158] trains the network to make predictions based on features that cannot be distinguished between domains. Another approach is to directly align the representations between domains using similarity metrics, as in [159]. Data augmentation techniques are also used to enhance the diversity of the training data and improve generalization across domains [160], [161], [162]. Another way to improve generalization to various domains is to use meta-learning and apply the episodic training paradigm typical of MAML [58], as in [163], [164], [165], [166], [167], [168], [169]. For instance, MLDG [166] optimizes a model by simulating the train-test domain shift during the meta-training phase. MetaReg [167] proposes to meta-learn a regularization function that improves domain generalization. DADG [169] contains a discriminative adversarial learning component to learn a set of general features and a meta-learning-based cross-domain validation component to further enhance the robustness of the classifier.
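The sketch below shows how a train-test domain shift can be simulated inside each meta-training step, in the spirit of MLDG [166]. It is a first-order simplification written for PyTorch: the random domain split, the learning rates, the loss weighting, and the assumption of at least two source domains are illustrative choices, and the second-order terms of the original formulation are dropped.

```python
import copy
import random
import torch

def mldg_step_first_order(model, domain_batches, loss_fn, inner_lr=0.01, beta=1.0):
    """One meta-training step that simulates a train-test domain shift.

    `domain_batches` maps each source domain to one labeled batch (x, y);
    at least two source domains are assumed.
    """
    domains = list(domain_batches.values())
    random.shuffle(domains)
    meta_train, meta_test = domains[:-1], domains[-1]

    # 1) Loss and gradients on the meta-train domains with the current parameters.
    train_loss = sum(loss_fn(model(x), y) for x, y in meta_train)
    train_grads = torch.autograd.grad(train_loss, list(model.parameters()))

    # 2) Virtual update on a copy, simulating deployment on a shifted domain.
    adapted = copy.deepcopy(model)
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), train_grads):
            p -= inner_lr * g

    # 3) Loss and gradients on the held-out meta-test domain with the adapted copy.
    x_te, y_te = meta_test
    test_loss = loss_fn(adapted(x_te), y_te)
    test_grads = torch.autograd.grad(test_loss, list(adapted.parameters()))

    # 4) Combine both signals and write them into the model's .grad fields,
    #    so a standard optimizer step makes it robust to the simulated shift.
    for p, g_tr, g_te in zip(model.parameters(), train_grads, test_grads):
        p.grad = g_tr + beta * g_te
    return train_loss.item(), test_loss.item()
```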
Fig. 9. Continual learning.
E. Meta-Learning & Continual Learning
This section explores the application of meta-learning to continual learning, where learners continually accumulate experience over time in order to acquire new knowledge or skills more rapidly. Continual learning scenarios can be divided into task-incremental learning, domain-incremental learning, and class-incremental learning, depending on whether the task identity is provided at test time or must be inferred by the algorithm [170]. In this section, we focus on approaches that specifically address task/class-incremental learning.
Traditionally, meta-learning has primarily focused on scenarios where a batch of training tasks is available. However, real-world situations often involve tasks presented sequentially, allowing past experience to be leveraged progressively. This is illustrated in Fig. 9; examples include tasks that progressively increase in difficulty or build upon previous knowledge, or robots learning diverse skills in changing environments.
Standard online learning observes tasks in a sequential manner, without any task-specific adaptation or use of past experience to accelerate adaptation. To tackle this issue, researchers have proposed various approaches, including memory-based methods [171], [172], [173], regularization-based methods [174], [175], [176], and dynamic architectural methods [177], [178], [179]. However, each of these methods has its own limitations, such as scalability issues, memory inefficiency, time complexity, or the need for task-specific parameters. Meta-learning has emerged as a promising approach for addressing continual learning. In [180], the authors introduce ANML, a framework that meta-learns an activation-gating function enabling context-dependent selective activation within a deep neural network. This selective activation allows the model to focus on relevant knowledge and avoid catastrophic forgetting. Other approaches, such as MER [181], OML [182], and LA-MAML [183], use gradient-based meta-learning algorithms to optimize objectives such as gradient alignment, inner representations, or task-specific learning rates, and learn update rules that avoid negative transfer. These algorithms enable faster learning over time and enhanced proficiency in each new task.
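To illustrate the gradient-based flavor of these methods, the sketch below shows an OML-style inner loop in which only a linear head is updated online over a stream of tasks while a meta-learned representation stays fixed. It is a schematic with assumed tensor shapes and learning rates, not the implementation of [181], [182], or [183].

```python
import torch

def online_inner_updates(representation, w, b, task_stream, loss_fn, lr_inner=0.01):
    """Inner loop of an OML-style continual meta-learner (sketch).

    `representation` is a differentiable feature extractor, (w, b) is a linear
    head with w of shape (n_classes, feat_dim) and b of shape (n_classes,),
    and `task_stream` yields (x, y) batches from tasks arriving sequentially.
    """
    fast = [w, b]
    for x, y in task_stream:                                  # tasks arrive one after another
        logits = representation(x) @ fast[0].t() + fast[1]
        loss = loss_fn(logits, y)
        grads = torch.autograd.grad(loss, fast, create_graph=True)
        fast = [p - lr_inner * g for p, g in zip(fast, grads)]
    return fast  # adapted head after the whole online trajectory

# Meta-update (outer loop, schematic): evaluate the adapted head on held-out data
# from both earlier and current tasks, and backpropagate that loss into the
# representation and into the initial (w, b). Repeating this over many sampled
# task sequences meta-learns a representation whose online updates interfere
# less with previously acquired knowledge.
```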
VI. OPEN CHALLENGES & OPPORTUNITIES
Meta-learning has been a promising area of research that has shown impressive results in various machine learning domains. However, there are still open challenges that need to be addressed in order to further advance the field. In this section, we discuss some of these challenges and categorize them into three groups. Addressing these challenges can lead to significant advances in meta-learning and, ultimately, to more generalizable and robust machine learning models.

A. Addressing Fundamental Problem Assumptions
The first category of challenges pertains to the fundamental assumptions made in meta-learning problems.
One such challenge is related to generalization to out-of-distribution tasks and long-tailed task distributions. Indeed, adaptation becomes difficult when the few-shot tasks observed at meta-test time come from a different task distribution than the ones seen during meta-training. While there have been some attempts to address this challenge, such as [102], [184], it remains unclear how to solve it in general. Ideas from the domain generalization and robustness literature could provide some hints and could potentially be combined with meta-learning to tackle these long-tailed task distributions and out-of-distribution tasks. For example, possible directions are to define regularization techniques that prevent the meta-parameters from becoming overly specific to the distribution of the training tasks, or to use task augmentation techniques that generate synthetic tasks covering a wider range of task variations.
Another challenge in this category involves dealing with the multimodality of data. While the focus has been on meta-training over tasks from a single modality, in reality we may have multiple modalities of data to work with. Human beings have the advantage of being able to draw upon multiple modalities, such as visual imagery, tactile feedback, language, and social cues, to create a rich repository of knowledge and make more informed decisions. For instance, we often use language cues to aid our visual decision-making processes. Rather than developing a prior that only works for a single modality, exploring the concept of learning priors across multiple modalities of data is a fascinating area to pursue. Different modalities have different dimensionalities or units, but they can provide complementary forms of information. While some initial works in this direction have been reported, including [185], [186], [187], there is still a long way to go in terms of capturing all of this rich prior information when learning new tasks.

B. Providing Benchmarks and Real-World Problems
The second category of challenges is related to providing and improving benchmarks that better reflect real-world problems and challenges.
Meta-learning has shown promise in a diverse set of applications, including few-shot land cover classification [188], few-shot dermatological disease diagnosis [184], automatically providing feedback on student code [18], one-shot imitation learning [189], drug discovery [190], motion prediction [191], and language generation [16], to mention but a few. However, the lack of benchmark datasets that accurately reflect real-world problems, with appropriate levels of difficulty and ease of use, is a significant challenge for the field. Several efforts have been made towards creating useful benchmark datasets, including Meta-Dataset [82], Meta-Album Dataset [192], NEVIS'22 [193], Meta-World Benchmark [194], Visual Task Adaptation Benchmark [195], Taskonomy Dataset [196], VALUE Benchmark [197], and BIG-Bench [198]. However, further work is needed to ensure that the datasets are comprehensive and representative of the diversity of real-world problems that meta-learning aims to address.
Some ways in which existing benchmarks can be improved to better reflect real-world problems and challenges in meta-learning are: 1) to increase the diversity and complexity of the tasks that are included; 2) to consider more realistic task distributions that can change over time; and 3) to include real-world data that is representative of the challenges faced in real-world applications of meta-learning. For example, including medical data, financial data, time-series data, or other challenging types of data (besides images and text) can help improve the realism and relevance of benchmarks.
Furthermore, developing benchmarks that reflect these more realistic scenarios can help improve the generalization and robustness of algorithms. This ensures that algorithms are tested on a range of scenarios and that they are robust and generalizable across a wide range of tasks. Better benchmarks are essential for progress in machine learning and AI, as they challenge current algorithms to find common structures, reflect real-world problems, and have a significant impact in the real world.

C. Improving Core Algorithms
The last category of challenges in meta-learning is centered around improving the core algorithms.
A major obstacle is the large-scale bi-level optimization problem encountered in popular meta-learning methods such as MAML. The computational and memory costs of such approaches can be significant, and there is a need to make them more practical, particularly for very large-scale problems like learning effective optimizers [199].
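To see where this cost comes from, the schematic below contrasts keeping the inner-update graph (needed for the exact bi-level gradient, as in MAML-style methods) with dropping it (a cheaper first-order approximation in the spirit of [68]). The functional `forward(params, x)` interface and the single inner step are simplifying assumptions.

```python
import torch

def adapt(params, x, y, loss_fn, forward, lr_inner=0.01, second_order=True):
    """One inner-loop adaptation step of a MAML-style bi-level objective.

    With second_order=True, the computation graph of the inner update is kept
    (create_graph=True), so the outer gradient can flow through it; this is
    what makes memory and compute grow with the number of inner steps.
    With second_order=False, the inner gradient is treated as a constant,
    which is exactly the first-order approximation.
    """
    loss = loss_fn(forward(params, x), y)
    grads = torch.autograd.grad(loss, params, create_graph=second_order)
    return [p - lr_inner * g for p, g in zip(params, grads)]

# Outer step (schematic):
#   adapted = adapt(meta_params, x_support, y_support, loss_fn, forward)
#   outer_loss = loss_fn(forward(adapted, x_query), y_query)
#   outer_loss.backward()  # flows back to meta_params; through the inner
#                          # update if second_order=True, directly otherwise.
```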
In addition, a deeper theoretical understanding of various meta-learning methods and their performance is critical to driving progress and pushing the boundaries of the field. Such insights can inform and inspire further advancements and lead to more effective and efficient algorithms. To achieve these goals, several fundamental questions can be explored, including:
1) Can we develop theoretical guarantees on the sample complexity and generalization performance of meta-learning algorithms? Understanding these aspects can help us design more efficient and effective meta-learning algorithms that require less data or fewer tasks. While recent investigations [200], [201], [202] have made notable strides in this domain, they represent just the initial steps toward a more extensive theoretical comprehension. Further research is imperative to fully harness the potential of meta-learning.
2) Can we gain a better understanding of the optimization landscape of meta-learning algorithms? For instance, can we identify the properties of the objective function that
make it easier or harder to optimize? Can we design optimization algorithms that are better suited to the bi-level optimization problem inherent in various meta-learning approaches?
3) Can we design meta-learning algorithms that better incorporate task-specific or domain-specific expert knowledge, in a principled way, to learn more effective meta-parameters?
Addressing such questions could enhance the design and performance of meta-learning algorithms and help us tackle increasingly complex and challenging learning problems.

VII. CONCLUSION
In conclusion, the field of artificial intelligence (AI) has witnessed significant advancements in developing specialized systems for specific tasks. However, the pursuit of generality and adaptability in AI across multiple tasks remains a fundamental challenge.
Meta-learning emerges as a promising research area that seeks to bridge this gap by enabling algorithms to learn how to learn. Meta-learning algorithms offer the ability to learn from limited data, transfer knowledge across tasks and domains, and rapidly adapt to new environments. This review article has explored various meta-learning approaches that have demonstrated promising results in applications with scarce data. Nonetheless, numerous challenges and unanswered questions persist, calling for further investigation.
A key area of focus lies in unifying various fields such as meta-learning, self-supervised learning, domain generalization, and continual learning. Integrating and collaborating across these domains can generate synergistic advancements and foster a more comprehensive approach to developing AI systems. By leveraging insights and techniques from these different areas, we can construct more versatile and adaptive algorithms capable of learning from multiple tasks, generalizing across domains, and continuously accumulating knowledge.
This review article serves as a starting point for encouraging research in this direction. By examining the current state of meta-learning and illuminating the challenges and opportunities, we aim to inspire researchers to explore interdisciplinary connections and contribute to the progress of meta-learning while integrating it with other AI research fields. Through collective efforts and collaboration, we can surmount existing challenges and unlock the full potential of meta-learning to address a broad spectrum of complex problems.

REFERENCES
[1] S. K. Karmaker, M. M. Hassan, M. J. Smith, L. Xu, C. Zhai, and K. Veeramachaneni, “AutoML to date and beyond: Challenges and opportunities,” ACM Comput. Surv., vol. 54, no. 8, pp. 1–36, 2021.
[2] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 1126–1135.
[3] J. Beck et al., “A survey of meta-reinforcement learning,” 2023, arXiv:2301.08028.
[4] J. Vanschoren, Meta-Learning: A Survey. Berlin, Germany: Springer, 2019, pp. 35–61.
[5] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5149–5169, Sep. 2022.
[6] R. Vilalta and Y. Drissi, “A perspective view and survey of meta-learning,” Artif. Intell. Rev., vol. 18, pp. 77–95, 2002.
[7] M. Huisman, J. N. Van Rijn, and A. Plaat, “A survey of deep meta-learning,” Artif. Intell. Rev., vol. 54, no. 6, pp. 4483–4541, 2021.
[8] L. Zou, “Chapter 1 - Meta-learning basics and background,” in Meta-Learning, L. Zou, Ed. Cambridge, MA, USA: Academic Press, 2023, pp. 1–22.
[9] O. Vinyals, “Talk: Model vs optimization meta learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017. [Online]. Available: https://evolution.ml/pdf/vinyals.pdf
[10] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4080–4090.
[11] R. Geng, B. Li, Y. Li, X. Zhu, P. Jian, and J. Sun, “Induction networks for few-shot text classification,” in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 3904–3913.
[12] B. Liang et al., “Few-shot aspect category sentiment analysis via meta-learning,” ACM Trans. Inf. Syst., vol. 41, no. 1, pp. 1–31, 2023.
[13] J. Gu, Y. Wang, Y. Chen, K. Cho, and V. O. Li, “Meta-learning for low-resource neural machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, pp. 3622–3631.
[14] A. Madotto, Z. Lin, C.-S. Wu, and P. Fung, “Personalizing dialogue agents via meta-learning,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 5454–5459.
[15] K. Qian and Z. Yu, “Domain adaptive dialog generation via meta learning,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2639–2649.
[16] F. Mi, M. Huang, J. Zhang, and B. Faltings, “Meta-learning for low-resource natural language generation in task-oriented dialogue systems,” in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 3151–3157.
[17] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visual imitation learning via meta-learning,” in Proc. Conf. Robot Learn., 2017, pp. 357–368.
[18] M. Wu, N. Goodman, C. Piech, and C. Finn, “ProtoTransformer: A meta-learning approach to providing student feedback,” 2021, arXiv:2107.14035.
[19] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2018.
[20] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 794–803.
[21] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7482–7491.
[22] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks for multi-task learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3994–4003.
[23] S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard, “Latent multi-task architecture learning,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4822–4829.
[24] Y. Gao, J. Ma, M. Zhao, W. Liu, and A. L. Yuille, “NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3205–3214.
[25] V. Dumoulin et al., “Feature-wise transformations,” Distill, 2018. [Online]. Available: https://distill.pub/2018/feature-wise-transformations
[26] S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task learning with attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1871–1880.
[27] M. Long, Z. Cao, J. Wang, and P. S. Yu, “Learning multiple tasks with multilinear relationship networks,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1593–1602.
[28] A. Jaegle et al., “Perceiver IO: A general architecture for structured inputs & outputs,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 1–16.
[29] C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, and C. Finn, “Efficiently identifying task groupings for multi-task learning,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 27503–27516.
[30] Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 12, pp. 5586–5609, Dec. 2022.
[31] M. Crawshaw, “Multi-task learning with deep neural networks: A survey,” 2020, arXiv:2009.09796.
[32] M. Huh, P. Agrawal, and A. A. Efros, “What makes imagenet good for transfer learning?,” 2016, arXiv:1608.08614.
[33] J. D.M.-W. C. Kenton and L. K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics - Hum. Lang. Technol., 2019, pp. 4171–4186.
[34] A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” 2022, arXiv:2204.02311.
[35] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” 2023, arXiv:2302.13971.
[36] OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774.
[37] A. Kumar, A. Raghunathan, R. M. Jones, T. Ma, and P. Liang, “Fine-tuning can distort pretrained features and underperform out-of-distribution,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 1–15.
[38] Y. Lee et al., “Surgical fine-tuning improves adaptation to distribution shifts,” in Proc. Workshop Distrib. Shifts: Connecting Methods Appl., 2023, pp. 1–14.
[39] J. Phang, T. Févry, and S. R. Bowman, “Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks,” 2018, arXiv:1811.01088.
[40] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 328–339.
[41] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1842–1850.
[42] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–12.
[43] T. Munkhdalai and H. Yu, “Meta networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 2554–2563.
[44] M. Garnelo et al., “Conditional neural processes,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1704–1713.
[45] T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[46] S. Garg, D. Tsipras, P. S. Liang, and G. Valiant, “What can transformers learn in-context? A case study of simple function classes,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 30583–30598.
[47] L. Kirsch, J. Harrison, J. Sohl-Dickstein, and L. Metz, “General-purpose in-context learning by meta-learning transformers,” in Proc. Workshop Distrib. Shifts: Connecting Methods Appl., 2022, pp. 1–14.
[48] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou, “What learning algorithm is in-context learning? Investigations with linear models,” in Proc. 11th Int. Conf. Learn. Representations, 2022, pp. 1–12.
[49] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[50] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–11.
[51] M. Andrychowicz et al., “Learning to learn by gradient descent by gradient descent,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3988–3996.
[52] K. Li and J. Malik, “Learning to optimize,” in Proc. Int. Conf. Learn. Representations, 2017.
[53] O. Wichrowska et al., “Learned optimizers that scale and generalize,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3751–3760.
[54] A. Shaw, W. Wei, W. Liu, L. Song, and B. Dai, “Meta architecture search,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 11227–11237.
[55] D. Lian et al., “Towards fast adaptation of neural architectures with meta learning,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–13.
[56] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1568–1577.
[57] T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn, “Bayesian model-agnostic meta-learning,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1–11.
[58] C. Finn and S. Levine, “Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm,” in Proc. Int. Conf. Learn. Representations, 2018.
[59] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to learn quickly for few-shot learning,” 2017, arXiv:1707.09835.
[60] H. S. Behl, A. G. Baydin, and P. H. Torr, “Alpha MAML: Adaptive model-agnostic meta-learning,” in Proc. 6th ICML Workshop Automated Mach. Learn., 2019, pp. 1–10.
[61] F. Zhou, B. Wu, and Z. Li, “Deep meta-learning: Learning to learn in the concept space,” 2018, arXiv:1802.03596.
[62] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals, “Rapid learning or feature reuse? Towards understanding the effectiveness of MAML,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–12.
[63] J. Oh, H. Yoo, C. Kim, and S.-Y. Yun, “BOIL: Towards representation change for few-shot learning,” in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–12.
[64] A. Antoniou, H. Edwards, and A. Storkey, “How to train your MAML,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–10.
[65] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson, “Fast context adaptation via meta-learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7693–7702.
[66] M. Hiller, M. Harandi, and T. Drummond, “On enforcing better conditioned meta-learning for rapid few-shot adaptation,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 4059–4071.
[67] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[68] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” 2018, arXiv:1803.02999.
[69] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–13.
[70] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10657–10665.
[71] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in Proc. Adv. Neural Inf. Process. Syst., 2019.
[72] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[73] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[74] G. Koch et al., “Siamese neural networks for one-shot image recognition,” in Proc. ICML Deep Learn. Workshop, 2015, pp. 1–8.
[75] O. Vinyals et al., “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3637–3645.
[76] S. Laenen and L. Bertinetto, “On episodes, prototypical networks, and few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 24581–24592.
[77] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1199–1208.
[78] V. Garcia and J. Bruna, “Few-shot learning with graph neural networks,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–12.
[79] K. Allen, E. Shelhamer, H. Shin, and J. Tenenbaum, “Infinite mixture prototypes for few-shot learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 232–241.
[80] X. Jiang, M. Havaei, F. Varno, G. Chartrand, N. Chapados, and S. Matwin, “Learning to learn with conditional class dependencies,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–11.
[81] A. A. Rusu et al., “Meta-learning with latent embedding optimization,” in Proc. Int. Conf. Learn. Representations, 2019.
[82] E. Triantafillou et al., “Meta-dataset: A dataset of datasets for learning to learn from few examples,” in Proc. Int. Conf. Learn. Representations, 2019.
[83] D. Wang, Y. Cheng, M. Yu, X. Guo, and T. Zhang, “A hybrid approach with optimization-based and metric-based meta-learner for few-shot learning,” Neurocomputing, vol. 349, pp. 202–211, 2019.
[84] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, “Rethinking few-shot image classification: A good embedding is all you need?,” in Proc. 16th Eur. Conf. Comput. Vis., Glasgow, U.K., 2020, pp. 266–282.
[85] Y. Chen, Z. Liu, H. Xu, T. Darrell, and X. Wang, “Meta-baseline: Exploring simple meta-learning for few-shot learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9062–9071.
[86] Y. Guo et al., “A broader study of cross-domain few-shot learning,” in Proc. 16th Eur. Conf. Comput. Vis., Glasgow, U.K., 2020, pp. 124–141.
[87] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–11.
[88] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, “Multimodal model-agnostic meta-learning via task-aware modulation,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1–12.
[89] G. Denevi, M. Pontil, and C. Ciliberto, “The advantage of conditional meta-learning for biased regularization and fine tuning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 964–974.
[90] H. Yao, Y. Wei, J. Huang, and Z. Li, “Hierarchically structured meta-learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7045–7054.
[91] W. Jiang, J. Kwok, and Y. Zhang, “Subspace learning for effective meta-learning,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 10177–10194.
[92] G. Jerfel, E. Grant, T. Griffiths, and K. A. Heller, “Reconciling meta-learning and continual learning with online mixtures of tasks,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 9122–9133.
[93] P. Zhou, Y. Zou, X.-T. Yuan, J. Feng, C. Xiong, and S. Hoi, “Task similarity aware meta learning: Theory-inspired improvement on MAML,” in Proc. 37th Conf. Uncertainty Artif. Intell., 2021, pp. 23–33.
[94] A. Vettoruzzo, M.-R. Bouguelia, and T. Rögnvaldsson, “Meta-learning from multimodal task distributions using multiple sets of meta-parameters,” in Proc. Int. Joint Conf. Neural Netw., 2023, pp. 1–8.
[95] H. Li, W. Dong, X. Mei, C. Ma, F. Huang, and B.-G. Hu, “LGM-Net: Learning to generate matching networks for few-shot learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 3825–3834.
[96] L. Liu, W. L. Hamilton, G. Long, J. Jiang, and H. Larochelle, “A universal representation transformer layer for few-shot image classification,” in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–11.
[97] W.-H. Li, X. Liu, and H. Bilen, “Universal representation learning from multiple domains for few-shot classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9526–9535.
[98] J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and R. E. Turner, “Fast and flexible multi-task classification using conditional neural adaptive processes,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 7959–7970.
[99] N. Dvornik, C. Schmid, and J. Mairal, “Selecting relevant features from a multi-domain representation for few-shot classification,” in Proc. 16th Eur. Conf. Comput. Vis., Glasgow, U.K., 2020, pp. 769–786.
[100] E. Triantafillou, H. Larochelle, R. Zemel, and V. Dumoulin, “Learning a universal template for few-shot dataset generalization,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10424–10433.
[101] W.-Y. Lee, J.-Y. Wang, and Y.-C. F. Wang, “Domain-agnostic meta-learning for cross-domain few-shot classification,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 1715–1719.
[102] B. Kang and J. Feng, “Transferable meta learning across domains,” in Proc. Conf. Uncertainty Artif. Intell., 2018, pp. 177–187.
[103] Y. Li and J. Zhang, “Semi-supervised meta-learning for cross-domain few-shot intent classification,” in Proc. 1st Workshop Meta Learn. Its Appl. Natural Lang. Process., 2021, pp. 67–75.
[104] W. Yuan, Z. Zhang, C. Wang, H. Song, Y. Xie, and L. Ma, “Task-level self-supervision for cross-domain few-shot learning,” in Proc. AAAI Conf. Artif. Intell., 2022, pp. 3215–3223.
[105] T. Bansal, R. Jha, T. Munkhdalai, and A. McCallum, “Self-supervised meta-learning for few-shot natural language classification tasks,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, pp. 522–534.
[106] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Statist., 2017, pp. 1273–1282.
[107] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proc. Mach. Learn. Syst., vol. 2, pp. 429–450, 2020.
[108] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10713–10722.
[109] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 5132–5143.
[110] F. Hanzely and P. Richtárik, “Federated learning of a mixture of global and local models,” 2020, arXiv:2002.05516.
[111] C. T. Dinh, N. Tran, and J. Nguyen, “Personalized federated learning with Moreau envelopes,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 21394–21405.
[112] T. Li, S. Hu, A. Beirami, and V. Smith, “Ditto: Fair and robust federated learning through personalization,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 6357–6368.
[113] J. Xu, X. Tong, and H. Shao-Lun, “Personalized federated learning with feature alignment and classifier collaboration,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–14.
[114] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 19586–19597.
[115] M. Duan et al., “Flexible clustered federated learning for client-level data distribution shift,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 2661–2674, Nov. 2022.
[116] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3710–3722, Aug. 2021.
[117] L. Yang, J. Huang, W. Lin, and J. Cao, “Personalized federated learning on non-IID data via group-based meta-learning,” ACM Trans. Knowl. Discov. Data, vol. 17, no. 4, pp. 1–20, 2023.
[118] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” 2019, arXiv:1909.12488.
[119] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 3557–3568.
[120] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” 2018, arXiv:1802.07876.
[121] M. Khodak, M.-F. F. Balcan, and A. S. Talwalkar, “Adaptive gradient-based meta-learning methods,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5917–5928.
[122] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 1597–1607.
[123] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9729–9738.
[124] J.-B. Grill et al., “Bootstrap your own latent - A new approach to self-supervised learning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 21271–21284.
[125] A. V. D. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” 2018, arXiv:1807.03748.
[126] K. Hsu, S. Levine, and C. Finn, “Unsupervised learning via meta-learning,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–14.
[127] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–12.
[128] J. Donahue and K. Simonyan, “Large scale adversarial representation learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 10542–10552.
[129] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 132–149.
[130] S. Khodadadeh, L. Boloni, and M. Shah, “Unsupervised meta-learning for few-shot image classification,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 10132–10142.
[131] S. Khodadadeh, S. Zehtabian, S. Vahidian, W. Wang, B. Lin, and L. Bölöni, “Unsupervised meta-learning through latent-space interpolation in generative models,” in Proc. Int. Conf. Learn. Representations, 2021.
[132] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 113–123.
[133] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in Proc. Conf. Robot Learn., 2023, pp. 892–909.
[134] A. Tamkin, M. Wu, and N. Goodman, “Viewmaker networks: Learning views for unsupervised representation learning,” in Proc. Int. Conf. Learn. Representations, 2020.
[135] D. B. Lee, D. Min, S. Lee, and S. J. Hwang, “Meta-GMVAE: Mixture of Gaussian VAE for unsupervised meta-learning,” in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–13.
[136] D. Kong, B. Pang, and Y. N. Wu, “Unsupervised meta-learning via latent space energy-based model of symbol vector coupling,” in Proc. 5th Workshop Meta-Learn. Conf. Neural Inf. Process. Syst., 2021, pp. 1–9.
[137] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learn. Representations, 2014, pp. 1–9.
[138] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton, “Energy-based models for sparse overcomplete representations,” J. Mach. Learn. Res., vol. 4, no. Dec., pp. 1235–1260, 2003.
[139] R. Ni, M. Shu, H. Souri, M. Goldblum, and T. Goldstein, “The close relationship between contrastive learning and meta-learning,” in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–12.
[140] Z. Yang, J. Wang, and Y. Zhu, “Few-shot classification with contrastive learning,” in Proc. 17th Eur. Conf. Comput. Vis., 2022, pp. 293–309.
[141] D. B. Lee et al., “Self-supervised set representation learning for unsupervised meta-learning,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–13.
[142] H. Jang, H. Lee, and J. Shin, “Unsupervised meta-learning via few-shot pseudo-supervised contrastive learning,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–13.
[143] C. Wu, F. Wu, and Y. Huang, “One teacher is enough? Pre-trained language model distillation from multiple teachers,” in Proc. Findings Annu. Meeting Assoc. Comput. Linguistics, 2021, pp. 4408–4413.
[144] T. Bansal, R. Jha, and A. McCallum, “Learning to few-shot learn across diverse natural language classification tasks,” in Proc. Int. Conf. Comput. Linguistics, 2020, pp. 5108–5123.
[145] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” 2014, arXiv:1412.3474.
[146] Y. Ganin et al., “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096–2030, 2016.
[147] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2223–2232.
[148] K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari, “RL-CycleGAN: Reinforcement learning aware simulation-to-real,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11157–11166.
[149] L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine, “Avid: Learning multi-stage tasks via pixel-level translation of human videos,” 2019, arXiv:1912.04443.
[150] J. Hoffman et al., “Cycada: Cycle-consistent adversarial domain adaptation,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1989–1998.
[151] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8568–8579.
[152] L. T. Nguyen-Meidine, A. Belal, M. Kiran, J. Dolz, L.-A. Blais-Morin, and E. Granger, “Unsupervised multi-target domain adaptation through knowledge distillation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 1339–1347.
[153] Z. Chen, J. Zhuang, X. Liang, and L. Lin, “Blending-target domain adaptation by adversarial meta-adaptation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2248–2257.
[154] B. Gholami, P. Sahu, O. Rudovic, K. Bousmalis, and V. Pavlovic, “Unsupervised multi-target domain adaptation: An information theoretic approach,” IEEE Trans. Image Process., vol. 29, pp. 3993–4002, 2020.
[155] M. Zhang, H. Marklund, N. Dhawan, A. Gupta, S. Levine, and C. Finn, “Adaptive risk minimization: Learning to adapt to domain shift,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 23664–23678.
[156] W. Yang, C. Yang, S. Huang, L. Wang, and M. Yang, “Few-shot unsupervised domain adaptation via meta learning,” in Proc. IEEE Int. Conf. Multimedia Expo, 2022, pp. 1–6.
[157] Y. Feng et al., “Similarity-based meta-learning network with adversarial domain adaptation for cross-domain fault identification,” Knowl.-Based Syst., vol. 217, 2021, Art. no. 106829.
[158] A. Sicilia, X. Zhao, and S. J. Hwang, “Domain adversarial neural networks for domain generalization: When it works and how to improve,” Mach. Learn., vol. 112, pp. 2685–2721, 2023.
[159] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in Proc. Eur. Conf. Comput. Vis., Amsterdam, The Netherlands, 2016, pp. 443–450.
[160] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–13.
[161] V. Verma et al., “Manifold mixup: Better representations by interpolating hidden states,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6438–6447.
[162] H. Yao et al., “Improving out-of-distribution robustness via selective augmentation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 25407–25437.
[163] Q. Dou, D. Coelho de Castro, K. Kamnitsas, and B. Glocker, “Domain generalization via model-agnostic learning of semantic features,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 6450–6461.
[164] D. Li, J. Zhang, Y. Yang, C. Liu, Y.-Z. Song, and T. M. Hospedales, “Episodic training for domain generalization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1446–1455.
[165] Y. Li, Y. Yang, W. Zhou, and T. Hospedales, “Feature-critic networks for heterogeneous domain generalization,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 3915–3924.
[166] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales, “Learning to generalize: Meta-learning for domain generalization,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 3490–3497.
[167] Y. Balaji, S. Sankaranarayanan, and R. Chellappa, “MetaReg: Towards domain generalization using meta-regularization,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1006–1016.
[168] Y. Shu, Z. Cao, C. Wang, J. Wang, and M. Long, “Open domain generalization with domain-augmented meta-learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9624–9633.
[169] K. Chen, D. Zhuang, and J. M. Chang, “Discriminative adversarial domain generalization with meta-learning based cross-domain validation,” Neurocomputing, vol. 467, pp. 418–426, 2022.
[170] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias, “Three types of incremental learning,” Nature Mach. Intell., vol. 4, no. 12, pp. 1185–1197, 2022.
[171] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6470–6479.
[172] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny, “Efficient lifelong learning with A-GEM,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–12.
[173] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL: Incremental classifier and representation learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2001–2010.
[174] R. Aljundi, M. Rohrbach, and T. Tuytelaars, “Selfless sequential learning,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–12.
[175] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proc. Nat. Acad. Sci., vol. 114, no. 13, pp. 3521–3526, 2017.
[176] J. Serra, D. Suris, M. Miron, and A. Karatzoglou, “Overcoming catastrophic forgetting with hard attention to the task,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4548–4557.
[177] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong, “Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 3925–3934.
[178] Q. Pham, C. Liu, D. Sahoo, and H. Steven, “Contextual transformation networks for online continual learning,” in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–13.
[179] A. A. Rusu et al., “Progressive neural networks,” 2016, arXiv:1606.04671.
[180] S. Beaulieu et al., “Learning to continually learn,” in Proc. 24th Eur. Conf. Artif. Intell., 2020, pp. 992–1001.
[181] M. Riemer et al., “Learning to learn without forgetting by maximizing transfer and minimizing interference,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–14.
[182] K. Javed and M. White, “Meta-learning representations for continual learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1820–1830.
[183] G. Gupta, K. Yadav, and L. Paull, “Look-ahead meta learning for continual learning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 11588–11598.
[184] V. Prabhu, A. Kannan, M. Ravuri, M. Chaplain, D. Sontag, and X. Amatriain, “Few-shot learning for dermatological disease diagnosis,” in Proc. Mach. Learn. Healthcare Conf., 2019, pp. 532–552.
[185] P. P. Liang, P. Wu, L. Ziyin, L.-P. Morency, and R. Salakhutdinov, “Cross-modal generalization: Learning in low resource modalities via meta-alignment,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 2680–2689.
[186] J.-B. Alayrac et al., “Flamingo: A visual language model for few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 23716–23736.
[187] S. Reed et al., “A generalist agent,” Trans. Mach. Learn. Res., 2022, pp. 1–42.
[188] M. Rußwurm, S. Wang, M. Korner, and D. Lobell, “Meta-learning for few-shot land cover classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 200–201.
[189] T. Yu et al., “One-shot imitation from observing humans via domain-adaptive meta-learning,” in Robotics: Science and Systems XIV, 2018, pp. 1–10.
[190] C. Q. Nguyen, C. Kreatsoulas, and K. M. Branson, “Meta-learning GNN initializations for low-resource molecular property prediction,” in Proc. 4th Lifelong Mach. Learn. Workshop, 2020, pp. 1–6.
[191] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. Moura, “Few-shot human motion prediction via meta-learning,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 432–450.
[192] I. Ullah et al., “Meta-album: Multi-domain meta-dataset for few-shot image classification,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 3232–3247.
[193] J. Bornschein et al., “Nevis’22: A stream of 100 tasks sampled from 30 years of computer vision research,” 2022, arXiv:2211.11747.
[194] T. Yu et al., “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” in Proc. Conf. Robot Learn., 2020, pp. 1094–1100.
[195] X. Zhai et al., “A large-scale study of representation learning with the visual task adaptation benchmark,” 2019, arXiv:1910.04867.
[196] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3712–3722.
[197] L. Li et al., “Value: A multi-task benchmark for video-and-language understanding evaluation,” in Proc. 35th Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021, pp. 1–13.
[198] A. Srivastava et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Trans. Mach. Learn. Res., 2023, pp. 1–42.
[199] L. Metz, N. Maheswaranathan, C. D. Freeman, B. Poole, and J. Sohl-Dickstein, “Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves,” 2020, arXiv:2009.11243.
[200] J. Chen, X.-M. Wu, Y. Li, Q. Li, L.-M. Zhan, and F.-L. Chung, “A closer look at the training strategy for modern meta-learning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 396–406.
[201] J. Lucas, M. Ren, I. R. K. Kameni, T. Pitassi, and R. Zemel, “Theoretical bounds on estimation error for meta-learning,” in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–12.
[202] J. Guan and Z. Lu, “Fast-rate PAC-Bayesian generalization bounds for meta-learning,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 7930–7948.

Anna Vettoruzzo received the MSc degree in ICT for Internet and Multimedia from the University of Padova, in 2021, with a focus on machine learning for healthcare. She is currently working toward the PhD degree with the Center for Applied Intelligent Systems Research (CAISR), Halmstad University, Sweden. Her research interests include meta-learning, few-shot learning, and continual learning.

Mohamed-Rafik Bouguelia received the MSc degree in computer science from USTHB University, Algeria, and the PhD degree in computer science, with a focus on machine learning, from the University of Lorraine, France. He is an associate professor and docent in machine learning with Halmstad University, Sweden, where he is also the program manager for the Applied AI program. Previously, he conducted research with the University of Lorraine and the INRIA research center in France. His current research interests include interactive machine learning, representation learning with deep neural networks, transfer learning, and meta-learning.

Joaquin Vanschoren is an associate professor of machine learning with the Eindhoven University of Technology (TU/e). His research focuses on understanding and automating machine learning, meta-learning, and continual learning. He founded and leads OpenML.org, an open science platform for machine learning research used all over the world. He has received several demonstration and application awards and the Dutch Data Prize, and has been an invited speaker at ECDA, StatComp, AutoML@ICML, CiML@NIPS, Reproducibility@ICML, DEEM@SIGMOD, and many other venues. He has also co-organized machine learning conferences (e.g., ECMLPKDD 2013, LION 2016) and many workshops, including the AutoML workshop series at ICML.

Thorsteinn Rögnvaldsson (Senior Member, IEEE) received the PhD degree in theoretical physics from Lund University, in 1994. He is a professor of computer science with Halmstad University, Sweden. In 2012, he started and has since directed the Center for Applied Intelligent Systems Research (CAISR), Halmstad University. He did his postdoc with the Oregon Graduate Institute. His research interests include autonomous knowledge creation, machine learning, and self-organization.

KC Santosh (Senior Member, IEEE) received the PhD degree in computer science - AI from the INRIA Nancy Grand Est Research Centre, France. He is a highly accomplished AI expert and the chair of the Department of Computer Science, University of South Dakota (USD). Immediately after his postdoc with the LORIA research centre, University of Lorraine, he joined the National Institutes of Health as a research fellow. With funding of more than $1.3 million, including a $1 million grant from DEPSCOR (2023) for AI/ML capacity building at USD, he has authored ten books and published more than 240 peer-reviewed research articles. He is also an editor of multiple prestigious journals, such as IEEE Transactions on AI, International Journal of Machine Learning & Cybernetics, and International Journal of Pattern Recognition & AI. As founder of USD's AI programs, he boosted graduate program enrollment by over 2,000% in three years, establishing interdisciplinary AI/Data Science programs with various departments.