Embodied AI-paper - 2

The document proposes embodied AI (E-AI) as the next step in artificial general intelligence. It discusses the evolution of embodiment across fields and introduces a theoretical framework for E-AI based on cognitive architectures emphasizing perception, action, memory, and learning for embodied agents. Despite progress, challenges remain in developing novel learning theory and advanced hardware for E-AI.

A call for embodied AI

Giuseppe Paolo 1 Jonas Gonzalez-Billandon 2 Balázs Kégl 1

1 Noah’s Ark Lab, Huawei Technologies France, Paris, France. 2 London Research Center, London, UK. Correspondence to: Giuseppe Paolo <[email protected]>.

arXiv:2402.03824v1 [cs.AI] 6 Feb 2024

Abstract

We propose Embodied AI (E-AI) as the next fundamental step in the pursuit of Artificial General Intelligence (AGI), juxtaposing it against current AI advancements, particularly Large Language Models (LLMs). We traverse the evolution of the embodiment concept across diverse fields (philosophy, psychology, neuroscience, and robotics) to highlight how E-AI distinguishes itself from the classical paradigm of static learning. By broadening the scope of E-AI, we introduce a theoretical framework based on cognitive architectures, emphasizing perception, action, memory, and learning as essential components of an embodied agent. This framework is aligned with Friston’s active inference principle, offering a comprehensive approach to E-AI development. Despite the progress made in the field of AI, substantial challenges, such as the formulation of a novel AI learning theory and the innovation of advanced hardware, persist. Our discussion lays down a foundational guideline for future E-AI research. Highlighting the importance of creating E-AI agents capable of seamless communication, collaboration, and coexistence with humans and other intelligent entities within real-world environments, we aim to steer the AI community towards addressing the multifaceted challenges and seizing the opportunities that lie ahead in the quest for AGI.

1. Introduction

Over recent years, the field of artificial intelligence (AI) has experienced a significant surge, leading to substantial breakthroughs in areas ranging from computer vision (CV) and natural language processing (NLP) to neuroscience. This journey through AI’s development has been marked by a series of significant triumphs interspersed with setbacks, including the well-documented AI winter of the mid-1980s. The ambitious goal that has propelled AI research forward from the beginning was to create intelligence that either parallels or exceeds human abilities. This quest for superhuman intelligence, commonly termed Artificial General Intelligence (AGI), has been seen differently by experts across different disciplines, yet it broadly refers to the ability of a system to understand, learn, and apply knowledge in a wide array of tasks and contexts, mirroring the cognitive flexibility of humans and animals.

The remarkable progress in AI over the past decade can largely be attributed to three pivotal developments: i) advancements in deep learning algorithms, ii) the advent of powerful new hardware, and iii) the availability of extensive datasets for training. A prime illustration of this advancement is the creation of Large Language Models (LLMs) like OpenAI’s GPT-4 (Achiam et al., 2023) and Google’s Gemini (Team et al., 2023). The surprising abilities of these LLMs have sparked discussions within the AI community, with some pondering whether these models have already achieved nascent forms of AGI. Foundation models (large networks with billions of parameters trained on massive datasets) have found success in varied fields, ranging from predicting 3D protein structures (Cramer, 2021) and robotic control (Brohan et al., 2023), to generating images and audio (Ramesh et al., 2022; Radford et al., 2022). This breadth of achievement supports the hypothesis that continued scaling and refinement of foundation models could be a viable path toward realizing AGI.

In our paper, we argue that despite the significant advances made by current AI technologies, they represent only the initial steps towards truly intelligent agents. Despite their impressive capabilities, these large networks are static and unable to evolve with time and experience. They leverage large datasets and cutting-edge hardware for scaling, but they lack the ability to properly care about the truth (Vervaeke & Coyne, 2024), which in turn makes it impossible to dynamically adjust their knowledge and actively search for valuable new information. The two primary manifestations of this fundamental shortfall are i) the difficulty in effectively aligning LLMs (Ouyang et al., 2022), and ii) their propensity to generate plausible but inaccurate information, a phenomenon known as confabulation (Huang et al., 2023; Wei et al., 2023). Current strategies to mitigate these issues, such as post-processing, fine-tuning, prompt engineering,
and incorporating human feedback, are undeniably valuable. However, we argue that these methods address only the superficial aspects of the problem and fall short in dealing with the core issue at play: the inherent lack of a deeper, grounded sense of care in LLMs.

Pursuing the development of AGI, we draw upon the insights of Vervaeke & Coyne (2024) to advocate for designing AI agents that are bound to, observe, interact with, and learn from the real world (including humans) in a continuous and dynamic manner. These Embodied AI (E-AI) agents ought to prioritize their continued existence and our bindings to them, thereby learning the value of truth. They should also be capable of adapting to environmental changes and evolving without human intervention.

While Large Language Models play a significant role in the development of AI systems, they fall short of capturing the essence of what constitutes an intelligent agent. Notably, intelligent beings, whether humans or animals, are characterized by three fundamental components: the mind, perception, and action capabilities (Kirchhoff et al., 2018). LLMs, or more broadly, foundation models, may be likened to an aspect of the mind’s reasoning function (Xi et al., 2023). Yet, the perceptive and action-oriented dimensions of intelligence, along with the pivotal ability to dynamically revise beliefs and knowledge based on experiences, remain unaddressed. Autoregressive LLMs are not designed to understand the causal relationships between events, but rather to identify proximate context and correlations within sequences (Bariah & Debbah, 2023). In contrast, a fully embodied agent should have the ability to grasp the causality underlying events and actions within its environment, be it digital or physical. By comprehending these causal relationships, such an agent can make informed decisions that consider both the anticipated outcomes and the reasons behind those outcomes. We propose that current LLM-based foundation models could lay the groundwork for designing these agents, but are just one component of a truly embodied agent. This approach is akin to how neonates come into the world equipped with inherent priors to successfully adapt to the world (Reynolds & Roth, 2018).

In Sec. 2, we define the concept of embodiment and what we mean by E-AI, analyzing the literature and various scientific and philosophical currents. In Sec. 3, we discuss why we believe this is a necessary step towards Artificial General Intelligence (AGI). In Sec. 4 we analyze the main components of a truly embodied agent, and we discuss the major challenges to achieving this ambitious goal in Sec. 5. Our motivation behind the need to develop E-AI and its fundamental role in our path towards AGI is proposed throughout. Finally, Sec. 6 provides a short recap of our proposition.

2. What is embodied AI

E-AI is a sub-field of AI, focusing on agents that interact with their physical environment, emphasizing sensorimotor coupling and situated intelligence. As opposed to mere passive observing, E-AI agents act on their environment and learn from the reaction. E-AI is deeply rooted in embodied cognition (Shapiro, 2011; McNearney, 2011), a perspective in philosophy and cognitive science that posits a profound coupling between the mind and the body. This idea challenges Cartesian dualism, the historically dominant view that distinctly separates the mind from the body (Descartes, 2012), and emerged in the early 20th century. Pioneers like Lakoff & Johnson (1979; 1999) have significantly contributed to this paradigm by proposing that reason is not based on abstract laws but is grounded in bodily experiences. Embodied cognition forms a critical part of the 4E cognitive science framework (Varela et al., 1991; Clark, 1997; Clark & Chalmers, 1998), encompassing embodied, enactive, embedded, and extended aspects of cognition. Within E-AI, the focus is predominantly on implementing the ‘embodied’ and ‘enactive’ aspects, while the ‘embedded’ and ‘extended’ components are more pertinent to situating AI in a social context and as an augmentation of human (individual or collective) cognition.

In AI, initial explorations into embodiment emerged in the 1980s, driven by a growing recognition of the inherent limitations in disembodied agents. These limitations were primarily attributed to the absence of rich, high-bandwidth interactions with the environment (Pfeifer & Iida, 2004; Pfeifer & Bongard, 2006). An early advocate for this paradigm shift was Brooks (1991), who built walking robots simulating insect-like locomotion. Simultaneously, the field of computer vision was undergoing its own transformation. Researchers and practitioners were increasingly focusing on enabling agents to interact with their surroundings. This emphasis on interaction led to a concentration on the perceptual elements of embodiment, particularly from a first-person point of view (POV) (Shapiro, 2021). This approach aligns with the concept of visual exploration and navigation (Ramakrishnan et al., 2021), where an agent acquires information about a 3D environment through movement and sensory perception, thereby continuously refining its model of the environment (Anderson et al., 2018; Chen et al., 2019). Such exploration techniques empower an agent to discover objects and understand their permanence. As a result of these developments, many contemporary benchmarks in E-AI have emerged predominantly from the domains of vision and robotics (Duan et al., 2022), reflecting the integral role these disciplines have played in advancing the field.

That said, the broader definition of E-AI does not require vision. Sensorimotor coupling may be implemented using any physical sense (Pfeifer & Bongard, 2006). In the
living world, many organisms survive and thrive without vision, using, for example, chemical or electric sensing (Bargmann, 2006). Levin (2022)’s Technological Approach to Mind Everywhere (TAME) framework further explores this idea, suggesting that cognition emerges from the collective intelligence of cell groups that are themselves deeply embodied within their environment (the body they comprise). This framework challenges traditional Cartesian dualism, embedding cognition within the physical and biological makeup of an organism. In the TAME perspective, cognition is not just an attribute of higher-order organisms; it extends throughout the ontological hierarchy of living beings, from individual cells, through tissues and organs, to complex organisms. Each agent demonstrates cognitive capabilities that are inherently connected to its physical structure and the environmental interactions at its proper level. This broadened view of cognition and embodiment goes beyond the conventional focus on vision in robotics and computer vision. It posits that any entity capable of perceiving, interacting with, and learning from its environment, thereby adapting to it and influencing it, qualifies as embodied. A technological instantiation of this concept is an intelligent router in a telecommunication network. This device ‘lives’ in a realm dominated by electromagnetic sensing. It continuously learns from and adapts to the network traffic, effectively mapping and managing the flow of information. This example underscores the potential of applying the principles of E-AI beyond the traditional domains, embracing a more inclusive and diverse understanding of intelligence and embodiment.

This broadening of the notion of E-AI raises the question: how close are current commercial AI tools to embodiment? Here we examine two such tools: Large Language Models (Brown et al., 2020; Devlin et al., 2019) and Social Media Content AI Recommendation Systems (SMAI) (Bakshy et al., 2015; Covington et al., 2016; Eirinaki et al., 2018).

LLMs operate within a linguistic-symbolic domain, representing textual information and generating new text by completing prompts. Their foundational training is essentially static, relying on datasets meticulously compiled and curated by teams of AI engineers. Their goal is supervised: to generate likely tokens following a context. Their secondary training (fine-tuning) may involve both interactions with their symbolic environment (human users) and goals to reach (satisfy their human users), but these interactions are limited for both technical (e.g., catastrophic forgetting (Kirkpatrick et al., 2017; Parisi et al., 2019)) and business (e.g., managing individuated LLMs (Strubell et al., 2019; Kaplan et al., 2020)) reasons. Looking ahead, we anticipate advancements that might address these limitations, potentially leading to the emergence of “personal assistant” LLMs. These would represent a form of embodied agents within a symbolic realm. However, at present, LLMs largely resemble static Internet AI (I-AI) (Duan et al., 2022), differing significantly from the dynamic, interactive nature characteristic of E-AI.

It is intriguing that despite the growing concerns about the risks and alignment challenges of LLMs highlighted in recent research (Bender et al., 2021), SMAIs have attracted comparatively less scrutiny (Huszár et al., 2022; Ribeiro et al., 2020). This is noteworthy considering SMAIs have been around for a longer time and their influence on society is both wider and more profound. We propose that their widespread acceptance and their more integrated, less intrusive presence in our lives are due to their closer alignment with the principles of embodiment, in contrast to LLMs.

What do we mean by SMAIs being closer to embodiment? Firstly, SMAIs are driven by clear objectives: to captivate our attention and maximize our engagement with their respective platforms (Bozdag, 2013; Bodó, 2021). These goals are fundamentally linked to the business models of these platforms, which revolve around advertising. The specifics of these “engagement” objectives are typically proprietary, forming the core of the competitive advantage of these platforms. Although these goals are initially human-designed and not intrinsically generated (Covington et al., 2016), they are subject to evolutionary pressure and adaptation, and thus they are tied to the existence and the survival of the SMAI. Secondly, SMAIs learn almost entirely from the data they collect by interacting with us. This leads to a high level of individuation (adapting to our individual preferences (Nguyen et al., 2014)), and notions of exploration (offering us content not so much to satisfy us but for the sake of learning what we like). This creates a user experience that, when well-executed, resembles interaction with a considerate friend, who wants our best, who connects us to things we like, and who wants to understand us better. The flip side, however, is the potential for these systems to morph into mechanisms that perpetuate addictive behaviors or harmful content (Schüll, 2012). Nevertheless, since SMAIs connect and adapt to us in a more intuitive and deeper manner than LLMs, we often feel a greater sense of control over our interactions with these systems (by, for example, consciously not clicking on content that we know we do not want to see in the long run). This control, albeit limited, is reminiscent of persuasion more than mechanical manipulation, aligning with how we interact with other sentient beings rather than machines. This type of relationship with AI systems is a fundamental aspect of Levin (2022)’s TAME proposal. Our stance on E-AI suggests that, while systems akin to SMAIs pose greater risks due to their seamless integration into our social fabric, they also present more natural opportunities for alignment with our values. This alignment process is procedural, perspectival, and evolutionary in nature (Vervaeke et al., 2012; Vervaeke & Coyne, 2024), contrasting with the
primarily propositional approaches being applied to LLMs (Shen et al., 2023).

We posit that the potential for more effective and naturally aligned AI systems is, alone, a compelling reason to prioritize E-AI in the broader AI research agenda.

In the forthcoming section, we further explore the pivotal role that well-executed implementations of E-AI could play in the quest for AGI.

3. Why embodiment?

In the previous section, we examined how contemporary theories of embodiment, particularly the TAME framework (Levin, 2022), challenge the long-standing Cartesian dualism which posits a distinct separation between mind and body (Descartes, 2012). This philosophical stance has significantly influenced the development of current generative AI models, such as LLMs, which primarily rely on static data and lack interaction with the physical or even the symbolic world. It is a prevalent belief that simply scaling up such models, in terms of data volume and computational power, could lead to AGI. We contest this view. We propose that true understanding, not only propositional truth but also the value of propositions that guide us on how to act, is achievable only through E-AI agents that live in the world and learn of it by interacting with it.

The significance of embodiment in cognitive development was demonstrated by Held & Hein (1963)’s carousel experiment with kittens. In this study, one kitten could actively interact with and control a carousel, while the other could only observe it passively. Despite both kittens receiving identical visual input, the one engaged in active interaction exhibited normal visual development, unlike its passively observing counterpart. This seminal experiment underscores the vital role of embodied interaction in shaping cognitive abilities (Shenavarmasouleh et al., 2022). It also reinforces the observation that all known forms of intelligence, including human intelligence, are inherently embodied (Smith & Gasser, 2005), suggesting that embodiment serves as a solid foundation for cognitive learning and development. Current AI learns in a very different way from humans. We humans learn by seeing, moving, interacting with the world and speaking with others. We also learn by collecting sequential experiences, not by passive observation of shuffled and randomized, even if carefully selected, data (Smith & Gasser, 2005; Westho et al., 2020). We advocate for an approach where insights from cognitive science and developmental psychology inform the design of AI systems. Such systems should be designed to learn through active interaction with their surroundings, mirroring the embodied learning processes fundamental to human cognition.

Even advocates of static learning concede that multimodal learning is the next milestone towards AGI (Fei et al., 2022; Parcalabescu et al., 2021). In I-AI, multimodal data needs to be collected and connected painstakingly. In contrast, E-AI agents, when equipped with multimodal sensors, will inherently collect and correlate multimodal data by mere co-occurrence. For instance, robots will see (CV), communicate (NLP), reason (general intelligence), navigate and interact with their environment (planning and RL), all simultaneously (Shenavarmasouleh et al., 2022). Intelligent routers will observe requests and traffic (sensing), communicate with other routers and human engineers and absorb news about their surroundings (NLP), reason (general intelligence), and control the traffic (control and RL). Despite the impressive progress in these domains, much of it has relied on the external collection and curation of vast datasets for algorithmic training. This approach has significant drawbacks: i) the collection and preparation of data demands substantial investments; ii) this data can contain biases that are hard to detect and rectify (Li & Deng, 2020; Balayn et al., 2021; Verma et al., 2021). The issue of biases is particularly pertinent in discussions on AI alignment (Shen et al., 2023; Ji et al., 2023). Efforts to align AI through rule-based and procedural methods (such as RLHF (Lambert et al., 2022)) often struggle, producing systems that feel mechanistic and “dumb”, rather than an agent which seamlessly acts according to values compatible with our society.

An embodied agent, designed to interact with and learn from its environment, fundamentally changes the traditional approach to data collection and curation in AI development. By being inherently integrated with its physical and social contexts, such an agent bypasses the labor-intensive processes previously required. This shift not only simplifies the challenge of aligning AI with human values but also enhances the agent’s learning efficiency by utilizing the unique features of its environment. As a result, the focus in AI development transitions from data to simulators. These simulators serve a dual purpose: they are both training grounds for E-AI and platforms for testing and refining concepts and algorithms (Duan et al., 2022). Moreover, the process of aligning these agents with human values becomes more intuitive as it involves defining goals reflective of those values. This approach does not claim to fully resolve the alignment challenge, as E-AI systems will still necessitate oversight and guidelines to avert unwanted behaviors. However, the alignment process becomes inherently more natural. Adjusting and defining goals is a more straightforward task than the extensive editing and curating of data. This methodology draws upon our inherent, non-propositional understanding and instincts about aligning embodied intelligences, whether it is guiding our own actions, nurturing children, or training pets.

Another important characteristic of E-AI, stemming from
the coupling between the agent and its environment, is the agent’s capacity for ongoing evolution and adaptation. This adaptability is vital for any agent destined to navigate a world in perpetual change. It underscores the importance of continual learning: the process of assimilating new experiences while retaining previously acquired knowledge (Wang et al., 2023a).

Moreover, Ishiguro & Kawakatsu (2004) have shown, both through theory and practical application in robotics, that a close and effective integration of control mechanisms with body dynamics significantly enhances energy efficiency. Coupled systems also lead to the emergence of intriguing behaviors that can be hard to explicitly program or learn from disembodied datasets (Rosas et al., 2020), an observation aligning with the principles of the TAME framework.

Embodiment is also a prerequisite for learning about affordances (Gibson, 1979). Learning, or more precisely realizing affordances, according to Vervaeke et al. (2012)’s perspectival learning, is a fundamental capacity of AGI, as affordances are what “fill our world with meaning” (Roli et al., 2022), and are thus necessary for agents that give meaning to their own world. Affordances emerge from the dynamic interplay between an agent’s perception, objectives, abilities, and the characteristics of objects and contexts within the environment; for example, a chair affords sitting, a glass affords drinking, and a hand affords grasping and picking up objects. Roli et al. (2022) argue that the capacity to comprehend, utilize, and be influenced by environmental affordances distinguishes biological intelligence from current artificial systems. Besides affordances, E-AI is also indispensable for investigating emergent phenomena such as qualia (Locke, 1847; Korth, 2022), consciousness (Solms, 2019), as well as creativity, empathy (Perez, 2023), and ethical understanding (Lake et al., 2017; Russell, 2021).

Finally, there is the important question of why an intelligent agent would do anything in the first place (Pfeifer & Iida, 2004). What drives it to engage and acquire new knowledge without external prompts? Within well-framed small worlds, such as a chess game, an agent’s purpose is straightforward: deciding the next move. However, when navigating large, open worlds, the motivations guiding an agent’s decisions grow increasingly ambiguous. The concepts of active inference and the free energy principle (Friston, 2010; Friston et al., 2023) provide a compelling framework for understanding the behaviors of intelligent agents. This principle posits that minimizing surprise and uncertainty is the core objective of the agents. They achieve this through the use of internal models to forecast outcomes, continually updating these models with sensory input, and proactively modifying their surroundings to better match their expectations. This concept resonates within the AI community, particularly in the design of agents equipped with mechanisms for intrinsic motivation (Oudeyer & Kaplan, 2007; Pathak et al., 2017), which incentivize agents to explore and acquire new knowledge to reduce uncertainty.

However, what propels an intelligent agent to act, especially beyond mere survival instincts, continues to be a matter of debate. We argue that exploring and developing embodied agents will illuminate this question. Thus, E-AI not only shows potential for significant breakthroughs toward achieving AGI, but also has deep implications for our understanding of cognition in general.

4. Theoretical framework

In previous sections, we have underscored the pivotal role of E-AI in advancing toward AGI. Shifting focus, we now delve into the essential components that, we believe, will comprise E-AIs. We draw heavily on the concept of cognitive architectures designed by cognitive scientists aiming to model the human mind (Thagard, 2012). Despite the promise these architectures hold for enhancing modern machine learning methods, progress on this has been notably limited (Kotseruba & Tsotsos, 2020). The slow advancement is largely due to cognitive architectures being the domain of neuroscientists and cognitive scientists, with only a select few within the machine learning community exploring their potential for AGI. We advocate for a synergistic strategy that marries cognitive architectures with machine learning within the E-AI paradigm, proposing it as a viable path toward AGI. The emergence of agent-based LLMs, such as AutoGPT (FIRAT & Kuleli, 2023), which pioneers the generation of autonomous agents, and PanguAgent (Christianos et al., 2023), an agent-focused language model, indicate the potential of this approach.

We identify four essential components of an E-AI system: (i) perception: the ability of the agent to sense its environment; (ii) action: the ability to interact with and change its environment; (iii) memory: the capacity to retain past experiences; and (iv) learning: integrating experiences to form new knowledge and abilities. These components are notably aligned with the active inference framework of Friston (2010). In this framework, the agent models its world through a probabilistic generative model that infers the causes of its sensory observations (perception). This model is hierarchical, forecasting future states in a top-down manner and reconciling these predictions with bottom-up sensory data, with discrepancies or errors being escalated upwards only when they cannot be reconciled at the initial level. The agent acts to minimize the divergence between its anticipations and reality, thus moving towards states of reduced uncertainty (action). Concurrently, it collects and stores new information about its environment (memory) and refines its internal model to minimize predictive errors (learning). In the sections that follow, we will describe in detail these four components and how they comprise the E-AI agent.
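
As a rough illustration of how perception, action, memory, and learning might fit together in a single loop, consider the toy sketch below. It is not the authors' implementation and is far simpler than active inference proper: there is no probabilistic generative model and no hierarchy, only a scalar belief corrected by prediction errors. The class and function names (`ToyAgent`, `perceive`, `learn`, `act`, `remember`, `run`), the one-dimensional noiseless world, and all parameter values are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent with the four components; all names are illustrative."""
    belief: float = 0.0         # internal estimate of the world state
    preferred: float = 1.0      # observation the agent expects to receive
    learning_rate: float = 0.5  # weight given to each prediction error
    gain: float = 0.3           # how strongly the agent acts on a mismatch
    memory: list = field(default_factory=list)

    def perceive(self, observation: float) -> float:
        """Perception: compare the observation with the current prediction."""
        return observation - self.belief

    def learn(self, error: float) -> None:
        """Learning: refine the internal estimate to shrink future errors."""
        self.belief += self.learning_rate * error

    def act(self) -> float:
        """Action: push the world towards the expected (preferred) state."""
        return self.gain * (self.preferred - self.belief)

    def remember(self, observation: float, error: float) -> None:
        """Memory: store the experience as an (observation, error) pair."""
        self.memory.append((observation, error))


def run(steps: int = 50):
    """Run the perceive / remember / learn / act loop in a scalar world."""
    world = 0.0  # hidden environment state, observed without noise here
    agent = ToyAgent()
    for _ in range(steps):
        observation = world
        error = agent.perceive(observation)
        agent.remember(observation, error)
        agent.learn(error)
        world += agent.act()  # the action changes the environment
    return agent, world


if __name__ == "__main__":
    agent, world = run()
    print(f"belief={agent.belief:.4f} world={world:.4f}")
```

With these assumed parameters, both the belief and the world state settle near the preferred value and the stored prediction errors shrink over time, which is the qualitative behavior the framework describes: the agent reduces the gap between expectation and observation by updating its model and by changing its surroundings.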


4.1. Perception

At the heart of an embodied agent lies the ability to perceive the world in which it exists. Perception is a process by which raw sensory data is transformed into a structured internal representation, enabling the agent to engage in cognitive tasks. The range of inputs that inform perception is vast, encompassing familiar human senses such as vision, hearing, smell, touch, and taste. It extends to any form of stimuli an agent might encounter, be it force sensors in robotics or signal strength indicators in wireless technology. The challenge with sensory data is that it is often not immediately actionable. It typically undergoes a process of transformation, a task where recent advances in machine learning can prove invaluable. The field has seen the development of sophisticated methods for learning feature and embedding spaces, facilitating the conversion of raw data into meaningful information (Golinko & Zhu, 2019; Sivaraman et al., 2022). A particularly effective strategy has been self-supervised learning to learn such representations. Although much of the research has concentrated on single modalities, such as vision (Oquab et al., 2023), the principles underlying these techniques are universally applicable across different sensory inputs (Orhan et al., 2022; Lee et al., 2019).

4.2. Action

Embodied agents navigate the world by taking actions and observing the outcomes. Acting can be broken down into two steps: (i) choosing what action to undertake next, like deciding to relocate to a specific spot, and (ii) determining how to execute this action, such as plotting the course to that location. Actions can further be categorized into reactive and goal-directed types. Reactive actions, akin to human reflexes, occur almost instantaneously in response to stimuli and play a crucial role in an agent’s immediate self-preservation by maintaining stability. Goal-directed actions, on the other hand, involve strategic planning and are motivated by high-level objectives. Reactive actions are

4.3. Memory

Embodied agents learn from their experiences, which are stored in memory. Memory encompasses various dimensions, including its duration (short-term or long-term) and its nature (procedural, declarative, semantic, and episodic). Importantly, memory is not necessarily represented as explicit propositional knowledge; it can be implicitly encoded into the weights of a neural network (NN). To navigate cognitive tasks, agents require diverse types of memory systems, each playing a distinct role. Working and short-term memory offer temporary storage to support the agent’s immediate objectives. Long-term and episodic memories provide a reservoir for information over longer time spans. Episodic memory captures and stores unique, perspectival experiences, ready to be accessed when familiar scenarios unfold. Long-term memory, conversely, is the repository for broader propositional knowledge. LLMs, for example, implement long-term memory using Retrieval-Augmented Generation (RAG) (Gao et al., 2024), a technique that reduces hallucinations using an external database. This technique showcases how sophisticated machine learning methods can be synergized with cognitive architectures.

4.4. Learning

A defining trait of intelligent agents is their ability to learn. Yet, how to learn, especially in a continuous and dynamic way, remains a subject of ongoing research and debate (Wang et al., 2023a; Yifan et al., 2023). While recent strides in AI have largely been powered by training on static datasets, the concept of continual learning, essential for adapting over time, faces challenges. These challenges stem primarily from the inherent limitations of deep NNs, such as catastrophic forgetting (Kemker et al., 2018), and the complexities associated with learning from non-stationary data that result from an agent’s interaction with its environment (Fahrbach et al., 2023). The embodiment hypothesis suggests that true intelligence is born from such interactions (Smith & Gasser, 2005), underscoring the need for dynamic learning methodologies. In this context, simulators emerge as a vital tool, offering a shift away from the
important for self-preservation, with model-free reinforce- static learning typical of traditional AI. Instead, they enable
ment learning methods playing an important role for devel- agents to evolve through ongoing, interactive experiences
oping reactive control policies in tasks like robot walking within simulated environments (Duan et al., 2022).
(Rudin et al., 2022). On the other hand, for an agent to
achieve more complex, high-level objectives, planning is 5. Challenges
indispensable, even if efficient planning remains an open
E-AIs agents will adopt an egocentric perspective, experi-
area of research (Lin et al., 2022; Shi et al., 2022). Cen-
encing their environment from a first-person viewpoint, in
tral to the concept of planning is the presence of a “world
contrast to the allocentric perspective prevalent in current
model” within the agent, which it can use to predict the con-
AI systems. This shift is not only essential for meaning-
sequences of its own actions. Model-based RL has made
ful interaction with the world but also offers an advantage
significant strides in developing algorithms that learn these
by allowing the agents to focus on modeling their immedi-
world models and use them for planning (Silver et al., 2016;
ate surroundings rather than the entirety of the world. On
Kégl et al., 2021; Paolo et al., 2022).
the other hand, E-AIs introduces several challenges, includ-

6
A call for embodied AI

ing extending current learning theories, managing noise in Reinforcement Learning (Sutton & Barto, 2018) and re-
perception and action effectively and safely, and ensuring lated paradigms (Bayesian optimization (Mockus, 1989)
meaningful communication with humans that adheres to or contextual bandits (Langford & Zhang, 2008)) offer a
ethical standards. The remainder of this section will cover closer fit for embodied AI, when the prediction is not the
these challenges, exploring potential pathways and solu- end-product, rather part of a predictive pipeline that also
tions. includes data collection. RL affords the data scientist to
design a higher-level objective, letting the algorithm opti-
5.1. New learning theory mize both the predictor and the data it is trained on. Here,
The principles of E-AI challenge us to reevaluate tradi- the mismatch between theory and practice is different from
tional learning theories (Devroye et al., 1996; Vapnik, supervised learning. The analysis in RL or bandit theory
1998), bridging a gap between supervised and reinforce- often focuses on the convergence of the agent to a theo-
ment learning. Supervised learning, while foundational retical optimum, given a fixed but often unknown environ-
in AI, assumes that the data is drawn from an unknown ment. RL theory usually does not offer tools to analyze
but fixed distribution, collected independently of the learn- the data collected during the learning process, especially
ing process. This theory gives rise to the classical no- when the collection is semi-automatic (includes a human
tions of generalization, over- and underfitting, bias and curator in the loop). RL agents, in practice, usually do not
variance, and asymptotic or finite-sample statistical consis- converge even in a stationary environment, they rather indi-
tency. This framing is obviously highly useful: even those viduate, making, for example, quite perversely, the random
who are not explicitly doing theory use it transparently as seed part of the algorithm (Henderson et al., 2018). This
their lingua technica and cognitive scaffolding when work- is even more pronounced in non-stationary environments
ing with algorithms and analyzing results. where the agent’s actions alter the environment; a situation
When embodied agents interact dynamically with their en- which AGI will definitely find itself (da Silva et al., 2006;
vironment, data collection becomes part of the data science Zhou et al., 2024).
pipeline (Pfeifer & Iida, 2004; Thrun et al., 2005). Clas- A new learning theory for embodied AI must transcend
sical supervised learning theory is insufficient to analyze these limitations. It should account for the dynamic, in-
these cases and to guide algorithm building. Extensions, teractive nature of data in E-AI, where the agent’s actions
like transfer learning (Pan & Yang, 2010), multitask learn- continuously reshape its learning environment. This theory
ing (Caruana, 1997), distribution shift (Quiñonero-Candela should not just aim for optimal performance in a fixed set-
et al., 2009), domain adaptation (Csurka, 2017) or out-of- ting but should embrace a spectrum of behaviors suitable
distribution generalization, have been proposed to patch ba- for evolving environments. Moreover, it should provide di-
sic supervised learning theory, but most of these cling to the agnostics to assess the quality and relevance of data gener-
original framing, pretending that the data is coming from ated through these interactions.
outside the learning process, encapsulating the value (busi-
ness or otherwise) of the predictive pipeline. Practically, 5.2. Noise and uncertainty
this is obviously not the case: the data on which we learn E-AI agents are tasked with navigating the real world, rife
a predictor is often collected by the data scientist, responsi- with noise and uncertainty. These elements can drastically
ble for the quality of the pipeline (O’Neil & Schutt, 2013; affect both the agent’s perception of its surroundings and
Provost & Fawcett, 2013). Furthermore, most of the de- the quality of its decision-making. For example, elevated
bates around responsible AI turn around the data, not the noise levels may distort the agent’s interpretation of envi-
learning algorithm (O’Neil, 2016; Selbst et al., 2019). Col- ronmental cues, leading to suboptimal decisions. This chal-
lecting, selecting, and curating data is obviously part of the lenge is accentuated in an egocentric perspective, where
pipeline. The text we use to train LLMs is created by its agents frequently encounter continuous streams of fluctuat-
writers, rather than drawn from a distribution. In some ing and imprecise data. Sources of noise include the natural
cases, when collection and model-retraining are automated, imprecision of sensors and actuators, which might lack ac-
the situation may be even worse. For example, in click- curacy due to manufacturing inconsistencies, degradation
through-rate prediction (Bottou et al., 2013; Perlich et al., over time, or external disturbances. Additionally, quanti-
2014) or recommendation systems (Deldjoo et al., 2020), zation error, a byproduct of converting analog signals into
the deployed predictor affects the data for the next round of digital form (Widrow & Kollár, 2008), can further compro-
training, generating an often adversarial feedback. A simi- mise data integrity.
lar phenomenon is happening in the LLM world: as these
AIs become the go-to tools for creative and business writ- As these agents learn and adapt to their environment, they
ing, the data collected for the next round of training will, must also grapple with uncertainty. This uncertainty can
in large part, be coming from the previous generation of obscure the agent’s understanding of its environment, in-
LLMs. fluencing its performance. This dilemma is especially

7
A call for embodied AI

prevalent in RL scenarios dealing with partial observability, where decisions must be made with incomplete information, leading to uncertainty in predicting the outcomes of the agent's actions (Dulac-Arnold et al., 2021; Hess et al., 2023; Pattanaik et al., 2017). Therefore, managing noise and uncertainty effectively is paramount for the progress of E-AI.

5.3. Simulators

As we pivot towards E-AI, simulators will assume a fundamental role as a key driver of progress, similar to the role data sets play in the training of traditional I-AI models. These simulators offer a controlled, replicable environment where AI systems can be rigorously trained and tested. This setup allows for learning and adapting to diverse scenarios prior to deployment, ensuring both safety and cost-efficiency. A notable advantage of simulators, and a requirement for them, is their speed and ease of parallelization, which significantly shortens training time and makes it more feasible to train sophisticated AI models on multiple scenarios simultaneously.

Many advanced simulators have been introduced recently, yet they often demand significant computational resources and are predominantly geared towards robotics applications (Li et al., 2021; Gan et al., 2020; Yan et al., 2018; Puig et al., 2018; Gao et al., 2019). For these simulators to truly serve the needs of E-AI, they must expand their scope to a broader spectrum of environments. A major challenge in the use of simulators is bridging the "reality gap" (Bousmalis & Levine, 2017): the difference between simulated conditions and the agent's eventual real-world or virtual deployment context (Ligot & Birattari, 2020). This gap can lead to a situation where models that excel in simulations fail in actual application, undermining the effectiveness of the training process. Despite numerous strategies being put forward to mitigate the reality gap (Salvato et al., 2021; Daza et al., 2023; Daoudi et al., 2023; Koos et al., 2012; Tobin et al., 2017), it remains an unresolved issue in the field, challenging the applicability of simulated training environments.

5.4. Interaction with humans

A key ambition of E-AI is to seamlessly interact with and learn from humans, enhancing AI's ability to offer personalized and impactful solutions. By improving these interactions, E-AI will also diminish fear and mistrust towards AI technologies, leading to broader acceptance and integration. In this endeavor, LLMs stand out as particularly beneficial, with their ability to comprehend and produce human-like text, facilitating communication in natural language and making engagements with AI more natural and accessible. The domain of Human-Robot Interaction (HRI) offers valuable lessons for enhancing AI-human communication, as researchers in this domain have dedicated efforts to explore innovative methods for robots to better communicate with us (Amirova et al., 2021; Bonarini, 2020). Yet, the challenge of ensuring proper and ethical communication with AI systems persists. The effectiveness of LLMs, for instance, hinges significantly on their training and how well they are aligned with human intentions and values (Wang et al., 2023b). Integrating human oversight directly into the AI development process and establishing comprehensive guidelines and protocols for AI communication are among the proposed strategies to address these challenges, aiming to make AI interactions more meaningful and ethically sound.

5.5. Hardware limitations

A significant challenge to the broad-scale development and integration of E-AI lies in the hardware requirements of these AI systems. Presently, AI technologies largely depend on GPU clusters, which, while powerful, are not ideally suited for embodied agents due to their high cost, energy consumption, and extensive heat output. Additionally, the physical bulk and heft of GPUs pose logistical challenges for mobile agents or those operating within spatial limitations. Addressing these constraints necessitates the innovation of new, energy-efficient hardware solutions that can be embedded within the agents. Promising developments are on the horizon, with Google's Tensor Processing Unit (TPU) (Norrie et al., 2021; Cass, 2019) and Huawei's Ascend chip (Liao et al., 2021) leading the charge. These advancements, coupled with the potential of neuromorphic computing and the strategic synergy of hardware-software co-design, signal a new era of hardware capability. Moreover, the development of energy- and data-efficient algorithms is critical. Such breakthroughs in hardware and algorithm efficiency will have a direct and profound effect on an AI's ability to understand, decide, and interact within its environment, enabling E-AI agents to operate more autonomously and effectively in a diverse array of settings.

6. Conclusion

In this paper, we have articulated the critical role Embodied AI plays on the path toward achieving AGI, setting it apart from prevailing AI methodologies, notably LLMs. By integrating insights from a spectrum of research fields, we underscored how E-AI's development benefits from existing knowledge, with LLMs enhancing the potential for intuitive interactions between humans and emerging AI entities. We introduced a comprehensive theoretical framework for the development of E-AI, grounded in the principles of cognitive science and highlighting perception, action, memory, and learning. We situated E-AI within the context of Friston's active inference framework, thereby offering a wide-ranging theoretical backdrop for our discussion. Despite the promising outlook, the journey ahead is fraught with challenges, not least the formulation of a novel learning theory tailored for AI and the creation of sophisticated hardware
solutions. This paper aims to serve as a roadmap for ongoing and future research into E-AI, proposing directions that could lead to significant advancements in the field.

Impact Statement

While the development of Embodied AI introduces complexities and challenges, particularly in hardware requirements, ethical considerations, and safety protocols, the potential benefits significantly outweigh these drawbacks. E-AI stands to evolve our interaction with technology, imbuing AI with a deeper understanding of and engagement with both the physical world and human society. This not only paves the way for more natural and effective human-AI interactions but also enhances AI's adaptability and application across a broad spectrum of fields.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Amirova, A., Rakhymbayeva, N., Yadollahi, E., Sandygulova, A., and Johal, W. 10 years of human-NAO interaction research: A scoping review. Frontiers in Robotics and AI, 8:744526, 2021. URL https://www.frontiersin.org/articles/10.3389/frobt.2021.744526/full.

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and Van Den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3674–3683, 2018. URL https://ieeexplore.ieee.org/document/8578485.

Bakshy, E., Messing, S., and Adamic, L. A. Exposure to ideologically diverse news and opinion on Facebook. In Proceedings of the National Academy of Sciences, volume 112, pp. 5791–5796. National Acad Sciences, 2015. URL https://www.science.org/doi/10.1126/science.aaa1160.

Balayn, A., Lofi, C., and Houben, G.-J. Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems. The VLDB Journal, 30(5):739–768, 2021. URL https://link.springer.com/article/10.1007/s00778-021-00671-8.

Bargmann, C. I. Comparative chemosensation from receptors to ecology. Nature, 444(7117):295–301, 2006. URL https://www.nature.com/articles/nature05402.

Bariah, L. and Debbah, M. AI embodiment through 6G: Shaping the future of AGI. 2023.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. ACM, 2021. URL https://dl.acm.org/doi/10.1145/3442188.3445922.

Bodó, B. Selling news to audiences–a qualitative inquiry into the emerging logics of algorithmic news personalization in European quality news media. In Algorithms, Automation, and News, pp. 75–96. Routledge, 2021. URL https://www.tandfonline.com/doi/full/10.1080/21670

Bonarini, A. Communication in human-robot interaction. Current Robotics Reports, 1:279–285, 2020. URL https://link.springer.com/article/10.1007/s43154-0

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. In Journal of Machine Learning Research, volume 14, pp. 3207–3260, 2013. URL https://jmlr.org/papers/v14/bottou13a.html.

Bousmalis, K. and Levine, S. Closing the simulation-to-reality gap for deep robotic learning. Google Research Blog, 1, 2017. URL https://blog.research.google/2017/10/closing-simul

Bozdag, E. Bias in algorithmic filtering and personalization. Ethics and information technology, 15:209–227, 2013. URL https://link.springer.com/article/10.1007/s10676-0

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W. E., Levine, S., Lu, Y., Michalewski, H., Mordatch, I., Pertsch, K., Rao, K., Reymann, K., Ryoo, M., Salazar, G., Sanketi, P., Sermanet, P., Singh, J., Singh, A., Soricut, R., Tran, H., Vanhoucke, V., Vuong, Q., Wahid, A., Welker, S., Wohlhart, P., Wu, J., Xia, F., Xiao, T., Xu, P., Xu, S., Yu, T., and Zitkovich, B. RT-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818, 2023. URL https://arxiv.org/abs/2307.15818.

Brooks, R. A. Intelligence without representation. Artificial intelligence, 47(1-3):139–159, 1991. URL https://www.sciencedirect.com/science/article/abs/pii/000437029190053M.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. URL https://arxiv.org/abs/2005.14165.

Caruana, R. Multitask learning. Machine Learning, 28(1):41–75, 1997. URL https://link.springer.com/article/10.1023/A:1007379606734.

Cass, S. Taking AI to the edge: Google's TPU now comes in a maker-friendly package. IEEE Spectrum, 56(5):16–17, 2019. URL https://ieeexplore.ieee.org/document/8701189.

Chen, T., Gupta, S., and Gupta, A. Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959, 2019. URL https://arxiv.org/abs/1903.01959.

Christianos, F., Papoudakis, G., Zimmer, M., Coste, T., Wu, Z., Chen, J., Khandelwal, K., Doran, J., Feng, X., Liu, J., et al. Pangu-Agent: A fine-tunable generalist agent with structured reasoning. arXiv preprint arXiv:2312.14878, 2023. URL https://arxiv.org/2312.14878.

Clark, A. Being There: Putting Brain, Body, and World Together Again. MIT Press, 1997. URL https://mitpress.mit.edu/9780262531566/being-there/.

Clark, A. and Chalmers, D. The extended mind. Analysis, 58(1):7–19, 1998. URL https://era.ed.ac.uk/bitstream/handle/1842/1312/TheExtendedMind.pdf?sequence=1&isAllowed=y.

Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198, 2016. URL https://dl.acm.org/doi/10.1145/2959100.2959190.

Cramer, P. AlphaFold2 and the future of structural biology. Nature structural & molecular biology, 28(9):704–705, 2021. URL https://www.nature.com/articles/s41594-021-00650-1.

Csurka, G. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017. URL https://link.springer.com/chapter/10.1007/978-3-319-58347-1_1.

da Silva, B. C., Basso, E. W., Bazzan, A. L. C., and Engel, P. M. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pp. 217–224, 2006. URL https://dl.acm.org/doi/10.1145/1143844.1143872.

Daoudi, P., Prieur, C., Robu, B., Barlier, M., and Santos, L. D. A trust region approach for few-shot sim-to-real reinforcement learning, 2023. URL https://arxiv.org/abs/2312.15474.

Daza, I. G., Izquierdo, R., Martínez, L. M., Benderius, O., and Llorca, D. F. Sim-to-real transfer and reality gap modeling in model predictive control for autonomous driving. Applied Intelligence, 53(10):12719–12735, 2023. URL https://link.springer.com/article/10.1007/s10489-0

Deldjoo, Y., Di Noia, T., and Merra, F. A. Adversarial machine learning in recommender systems (AML-RecSys). In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 869–872, 2020. URL https://dl.acm.org/doi/abs/10.1145/3336191.3371877

Descartes, R. Discourse on method. Hackett Publishing, 2012. URL https://hackettpublishing.com/discourse-on-method.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.

Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer, 1996. URL https://link.springer.com/book/10.1007/978-1-4612

Duan, J., Yu, S., Tan, H. L., Zhu, H., and Tan, C. A survey of embodied AI: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. URL https://arxiv.org/abs/2103.04918.

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021. URL https://link.springer.com/article/10.1007/s10994-0

Eirinaki, M., Gao, J., Varlamis, I., and Tserpes, K. Recommender systems for large-scale social networks: A review of challenges and solutions. Future Generation Computer Systems, 78:413–418, 2018. URL https://www.sciencedirect.com/science/article/pii/

Fahrbach, M., Javanmard, A., Mirrokni, V., and Worah, P. Learning rate schedules in the presence of distribution shift. arXiv preprint arXiv:2303.15634, 2023. URL https://arxiv.org/abs/2303.15634.

Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao, X., Xiang, T., et al. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, 13(1):3094, 2022. URL https://arxiv.org/abs/2110.14378.

FIRAT, M. and Kuleli, S. What if GPT4 became autonomous: The Auto-GPT project and use cases. Journal of Emerging Computer Technologies, 3(1):1–6, 2023. URL https://github.com/Significant-Gravitas/AutoGPT.

Friston, K. The free energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010. URL https://www.nature.com/articles/nrn2787.

Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G. A., and Parr, T. The free energy principle made simpler but not too simple. Physics Reports, 1024:1–29, 2023. URL https://arxiv.org/abs/2201.06387.

Gan, C., Schwartz, J., Alter, S., Mrowca, D., Schrimpf, M., Traer, J., De Freitas, J., Kubilius, J., Bhandwaldar, A., Haber, N., et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. arXiv preprint arXiv:2007.04954, 2020. URL https://arxiv.org/abs/2007.04954.

Gao, X., Gong, R., Shu, T., Xie, X., Wang, S., and Zhu, S.-C. VRKitchen: an interactive 3D virtual environment for task-oriented learning. arXiv preprint arXiv:1903.05757, 2019. URL https://arxiv.org/abs/1903.05757.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., and Wang, H. Retrieval-augmented generation for large language models: A survey, 2024. URL https://arxiv.org/abs/2312.10997.

Gibson, J. J. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.

Golinko, E. and Zhu, X. Generalized feature embedding for supervised, unsupervised, and online learning tasks. Information Systems Frontiers, 21:125–142, 2019. URL https://link.springer.com/article/10.1007/s10796-018-9850-y.

Held, R. and Hein, A. Movement-produced stimulation in the development of visually guided behavior. Journal of comparative and physiological psychology, 56(5):872, 1963. URL https://psycnet.apa.org/record/1964-03855-001.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. URL https://arxiv.org/abs/1709.06560.

Hess, F., Monfared, Z., Brenner, M., and Durstewitz, D. Generalized Teacher Forcing for Learning Chaotic Dynamics, October 2023. URL http://arxiv.org/abs/2306.04406. arXiv:2306.04406 [nlin].

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. URL https://arxiv.org/abs/2311.05232.

Huszár, F., Ktena, S. I., O'Brien, C., Belli, L., Schlaikjer, A., and Hardt, M. Algorithmic amplification of politics on Twitter. Proceedings of the National Academy of Sciences, 119(1):e2025334119, 2022. doi: 10.1073/pnas.2025334119. URL https://www.pnas.org/doi/abs/10.1073/pnas.20253341

Ishiguro, A. and Kawakatsu, T. How should control and body systems be coupled? A robotic case study. In Embodied Artificial Intelligence: International Seminar, Dagstuhl Castle, Germany, July 7-11, 2003. Revised Papers, pp. 107–118. Springer, 2004. URL https://link.springer.com/chapter/10.1007/978-3-5

Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023. URL https://arxiv.org/abs/2310.19852.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.

Kégl, B., Hurtado, G., and Thomas, A. Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? arXiv preprint arXiv:2107.11587, 2021. URL https://arxiv.org/2107.11587.

Kemker, R., McClure, M., Abitino, A., Hayes, T., and Kanan, C. Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. URL https://arxiv.org/abs/1708.02072.

Kirchhoff, M., Parr, T., Palacios, E., Friston, K., and Kiverstein, J. The Markov blankets of life: autonomy, active inference and the free energy principle. Journal of The Royal Society interface, 15(138):20170792, 2018. URL https://royalsocietypublishing.org/doi/10.1098/rsif.2017.0792.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. URL https://arxiv.org/abs/1612.00796.

Koos, S., Mouret, J.-B., and Doncieux, S. The transferability approach: Crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation, 17(1):122–145, 2012. URL https://ieeexplore.ieee.org/document/6151107.

Korth, M. The purpose of qualia: What if human thinking is not (only) information processing? arXiv preprint arXiv:2212.00800, 2022. URL https://arxiv.org/abs/2212.00800.

Lee, M. A., Zhu, Y., Srinivasan, K., Shah, P., Savarese, S., Fei-Fei, L., Garg, A., and Bohg, J. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. IEEE, 2019. URL https://arxiv.org/1810.10191v2.

Levin, M. Technological approach to mind everywhere: an experimentally-grounded framework for understanding diverse bodies and minds. Frontiers in systems neuroscience, 16:768201, 2022. URL https://www.frontiersin.org/articles/10.3389/fnsys

Li, C., Xia, F., Martín-Martín, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K., Gokmen, C., Dharan, G., Jain, T., et al. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021. URL https://arxiv.org/abs/2108.03272.

Li, S. and Deng, W. A deeper look at facial expression dataset bias. IEEE Transactions on Affective Computing, 13(2):881–893, 2020. URL https://arxiv.org/1904.11150.

Kotseruba, I. and Tsotsos, J. K. 40 Years of Cognitive Archi-
Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H.,
tectures: Core Cognitive Abilities and Practical Appli-
and Hu, Y. Ascend: a scalable and unified architec-
cations, volume 53. Springer Netherlands, 2020. ISBN
ture for ubiquitous deep neural network computing:
9550141039. doi: 10.1007/s10462-018-9646-y. URL
Industry track paper. In 2021 IEEE International
https://doi.org/10.1007/s10462-018-9646-y.
Symposium on High-Performance Computer Ar-
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., chitecture (HPCA), pp. 789–801. IEEE, 2021. URL
and Gershman, S. J. Building machines https://ieeexplore.ieee.org/document/9407221.
that learn and think like people. Behavioral
Ligot, A. and Birattari, M. Simulation-only ex-
and brain sciences, 40:e253, 2017. URL
periments to mimic the effects of the reality
https://pubmed.ncbi.nlm.nih.gov/27881212/.
gap in the automatic design of robot swarms.
Swarm Intelligence, 14(1):1–24, 2020. URL
Lakoff, G. and Johnson, M. Metaphors we live
https://link.springer.com/article/10.1007/s11721-0
by. University of Chicago press, 1979. URL
https://press.uchicago.edu/ucp/books/book/chicago/M/bo3637992.html.
Lin, H., Sun, Y., Zhang, J., and Yu, Y. Model-based re-
inforcement learning with multi-step plan value estima-
Lakoff, G. and Johnson, M. L. Philosophy tion. arXiv preprint arXiv:2209.05530, 2022. URL
in the flesh : the embodied mind and its https://arxiv.org/abs/2209.05530.
challenge to western thought. 1999. URL
https://api.semanticscholar.org/CorpusID:16103621. Locke, J. An essay concerning human under-
standing. Kay & Troutman, 1847. URL
Lambert, N., Castricato, L., von Werra, L., and Havrilla, https://www.gutenberg.org/files/10615/10615-h/1061
A. Illustrating reinforcement learning from human
feedback (rlhf). Hugging Face Blog, 2022. URL McNearney, S. A Brief Guide to Embodied Cognition:
https://huggingface.co/blog/rlhf. Why You Are Not Your Brain, 2011.

Langford, J. and Zhang, T. The epoch-greedy algorithm for Mockus, J. Bayesian Approach to Global
contextual multi-armed bandits. In Advances in Neural Optimization: Theory and Applications.
Information Processing Systems, volume 20, 2008. URL Kluwer Academic Publishers, 1989. URL
https://proceedings.neurips.cc/paper/2007/file/4b04a686b0ad13dce35fa99fa4161c65-Paper.pdf.
https://link.springer.com/book/10.1007/978-94-009

Nguyen, T. T., Hui, P.-M., Harper, F. M., Terveen, L., and Konstan, J. A. Exploring the filter bubble: the effect of using recommender systems on content diversity. In Proceedings of the 23rd international conference on World wide web, pp. 677–686, 2014. URL https://dl.acm.org/doi/10.1145/2566486.2568012.

Norrie, T., Patil, N., Yoon, D. H., Kurian, G., Li, S., Laudon, J., Young, C., Jouppi, N., and Patterson, D. The design process for google’s training chips: Tpuv2 and tpuv3. IEEE Micro, 41(2):56–63, 2021. URL https://ieeexplore.ieee.org/document/9351692.

O’Neil, C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, 2016.

O’Neil, C. and Schutt, R. Doing Data Science. O’Reilly Media, Inc., 2013.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.-Y., Xu, H., Sharma, V., Li, S.-W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. Dinov2: Learning robust visual features without supervision, 2023.

Orhan, P., Boubenec, Y., and King, J.-R. Don’t stop the training: continuously-updating self-supervised algorithms best account for auditory responses in the cortex. arXiv preprint arXiv:2202.07290, 2022. URL https://arxiv.org/abs/2202.07290.

Oudeyer, P.-Y. and Kaplan, F. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2007. URL https://www.frontiersin.org/articles/10.3389/neuro.12.006.2007.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

Paolo, G., Gonzalez-Billandon, J., Thomas, A., and Kégl, B. Guided safe shooting: model based reinforcement learning with safety constraints. arXiv preprint arXiv:2206.09743, 2022. URL https://arxiv.org/2206.09743.

Parcalabescu, L., Trost, N., and Frank, A. What is multimodality? arXiv preprint arXiv:2103.06304, 2021. URL https://arxiv.org/abs/2103.06304.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pp. 2778–2787. PMLR, 2017. URL https://arxiv.org/1705.05363.

Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., and Chowdhary, G. Robust Deep Reinforcement Learning with Adversarial Attacks, December 2017. URL http://arxiv.org/abs/1712.03632. arXiv:1712.03632 [cs].

Perez, C. Artificial Empathy - A Roadmap for Human Aligned Artificial General Intelligence. Publisher, 2023.

Perlich, C., Dalessandro, B., Raeder, T., Stitelman, O., and Provost, F. Machine learning for targeted display advertising: transfer learning in action. In Machine learning, volume 95, pp. 103–127. Springer, 2014.

Pfeifer, R. and Bongard, J. How the Body Shapes the Way We Think: A New View of Intelligence. MIT Press, 2006.

Pfeifer, R. and Iida, F. Embodied artificial intelligence: Trends and challenges. Lecture notes in computer science, pp. 1–26, 2004. URL https://link.springer.com/chapter/10.1007/978-3-5

Provost, F. and Fawcett, T. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O’Reilly Media, Inc., 2013.

Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., and Torralba, A. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8494–8502, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/ht

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.). Dataset Shift in Machine Learning. The MIT Press, 2009.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356.

Ramakrishnan, S. K., Jayaraman, D., and Grauman, K. An exploration of embodied visual exploration. International Journal of Computer Vision, 129:1616–1649, 2021. URL https://arxiv.org/abs/2001.02192.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204.06125.

Reynolds, G. D. and Roth, K. C. The development of attentional biases for faces in infancy: A developmental systems perspective. Frontiers in psychology, 9:222, 2018. URL https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00222.

Ribeiro, M. H., Ottoni, R., West, R., Almeida, V. A., and Meira Jr, W. Auditing radicalization pathways on youtube. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 131–141, 2020.

Roli, A., Jaeger, J., and Kauffman, S. A. How organisms come to know the world: fundamental limits on artificial general intelligence. Frontiers in Ecology and Evolution, 9:1035, 2022. URL https://www.frontiersin.org/articles/10.3389/fevo.2021.806283.

Rosas, F. E., Mediano, P. A., Jensen, H. J., Seth, A. K., Barrett, A. B., Carhart-Harris, R. L., and Bor, D. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. PLoS computational biology, 16(12):e1008289, 2020. URL https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008289.

Rudin, N., Hoeller, D., Reist, P., and Hutter, M. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pp. 91–100. PMLR, 2022. URL https://arxiv.org/abs/2109.11978.

Russell, S. Human-compatible artificial intelligence. Human-like machine intelligence, pp. 3–23, 2021. URL http://aima.cs.berkeley.edu/~russell/papers/mi19book-hcai.pdf.

Salvato, E., Fenu, G., Medvet, E., and Pellegrino, F. A. Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access, 9:153171–153187, 2021.

Schüll, N. D. Addiction by Design: Machine Gambling in Las Vegas. Princeton University Press, 2012.

Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., and Vertesi, J. Fairness and abstraction in sociotechnical systems. ACM Conference on Fairness, Accountability, and Transparency (FAT*), pp. 59–68, 2019.

Shapiro, L. Embodied Cognition. Routledge, 2011.

Shapiro, L. G. Computer vision: the last fifty years. University of Washington, Last access, 7(05), 2021.

Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y., and Xiong, D. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023. URL https://arxiv.org/abs/2309.15025.

Shenavarmasouleh, F., Mohammadi, F. G., Amini, M. H., and Reza Arabnia, H. Embodied ai-driven operation of smart cities: A concise review. Cyberphysical Smart Cities Infrastructures: Optimal Operation and Intelligent Decision Making, pp. 29–45, 2022. URL https://arxiv.org/abs/2108.09823.

Shi, L. X., Lim, J. J., and Lee, Y. Skill-based model-based reinforcement learning. arXiv preprint arXiv:2207.07560, 2022. URL https://arxiv.org/abs/2207.07560.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

Sivaraman, V., Wu, Y., and Perer, A. Emblaze: Illuminating machine learning representations through interactive comparison of embedding spaces. In 27th International Conference on Intelligent User Interfaces, pp. 418–432, 2022. URL https://arxiv.org/abs/2202.02641.

Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005. URL https://pubmed.ncbi.nlm.nih.gov/15811218/.

Solms, M. The hard problem of consciousness and the free energy principle. Frontiers in Psychology, 2019.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT press, 2018.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Thagard, P. Cognitive architectures. The Cambridge handbook of cognitive science, 3:50–70, 2012.

Thrun, S., Burgard, W., and Fox, D. Probabilistic Robotics. MIT Press, 2005.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. IEEE, 2017.

Vapnik, V. Statistical Learning Theory. Wiley, 1998.

Varela, F. J., Thompson, E., and Rosch, E. The Embodied Mind: Cognitive Science and Human Experience. MIT Press, 1991.

Verma, S., Ernst, M., and Just, R. Removing biased data to improve fairness and accuracy. arXiv preprint arXiv:2102.03054, 2021. URL https://arxiv.org/2102.03054.

Vervaeke, J. and Coyne, S. Mentoring the Machines. Hackett Publishing, 2024. URL https://www.mentoringthemachines.com.

Vervaeke, J., Lillicrap, T. P., and Richards, B. A. Relevance realization and the emerging framework in cognitive science. Journal of Logic and Computation, 22(1):79–99, 2012.

Wang, L., Zhang, X., Su, H., and Zhu, J. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023a. URL https://arxiv.org/abs/2302.00487.

Wang, Y., Zhong, W., Li, L., Mi, F., Zeng, X., Huang, W., Shang, L., Jiang, X., and Liu, Q. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.

Westhoff, B., Koele, I. J., and van de Groep, I. H. Social learning and the brain: How do we learn from and about other people? Everything You and Your Teachers Need to Know About the Learning Brain, pp. 42, 2020. URL https://kids.frontiersin.org/articles/10.3389/frym.2020.00095.

Widrow, B. and Kollár, I. Quantization noise: roundoff error in digital computation, signal processing, control, and communications. Cambridge University Press, 2008.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huang, X., and Gui, T. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864.

Yan, C., Misra, D., Bennett, A., Walsman, A., Bisk, Y., and Artzi, Y. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018. URL https://arxiv.org/abs/1801.07357.

Yifan, C., Yulu, C., Yadan, Z., and Wenbo, L. Continual learning in an easy-to-hard manner. Applied Intelligence, pp. 1–21, 2023. URL https://link.springer.com/article/10.1007/s10489-0

Zhou, Q., Chen, S., Wang, Y., Xu, H., Du, W., Zhang, H., Du, Y., Tenenbaum, J. B., and Gan, C. HAZARD challenge: Embodied decision making in dynamically changing environments, 2024.
