An Algorithmic Perspective on Imitation Learning
1 Introduction
  1.1 Key successes in Imitation Learning
  1.2 Imitation Learning from the Point of View of Robotics
  1.3 Differences between Imitation Learning and Supervised Learning
  1.4 Insights for Machine Learning and Robotics Research
  1.5 Statistical Machine Learning Background
    1.5.1 Notation and Mathematical Formalization
    1.5.2 Markov Property
    1.5.3 Markov Decision Process
    1.5.4 Entropy
    1.5.5 Kullback-Leibler (KL) Divergence
    1.5.6 Information and Moment Projections
    1.5.7 The Maximum Entropy Principle
    1.5.8 Background: Reinforcement Learning
  1.6 Formulation of the Imitation Learning Problem
3 Behavioral Cloning
  3.1 Problem Statement
  3.2 Design Choices for Behavioral Cloning
    3.2.1 Choice of Surrogate Loss Functions for Behavioral Cloning
      3.2.1.1 Quadratic Loss Function
      3.2.1.2 ℓ1-Loss Function
      3.2.1.3 Log Loss Function
      3.2.1.4 Hinge Loss Function
      3.2.1.5 Kullback-Leibler Divergence
Acknowledgements
References
Abstract
DOI: 10.1561/2300000053.
1 Introduction
are ideal for applications where robots work alongside people, such as collaborating with human operators and reducing the physical workload of caregivers. These applications require efficient, intuitive ways for domain experts, who may not possess special skills or knowledge about robotics, to teach robots the motions they need to perform.
In recent years, imitation learning has been investigated as a way to efficiently and intuitively program autonomous behavior [Schaal, 1999, Argall et al., 2009, Billard et al., 2008, Billard and Grollman, 2013, Bagnell, 2015, Billard et al., 2016]. In imitation learning, a human demonstrates how to perform a task, and a robotic system learns a policy to execute the given task by imitating the demonstrated motions. Numerous imitation learning methods have been developed, and imitation learning has become a vast field of research; capturing the entire field is therefore not a trivial task. The purpose of this survey is to provide a structural understanding of existing imitation learning methods and their relationship with other fields, from supervised learning to control theory. We describe what has been developed in the field of imitation learning and what should be developed in the future.
1.1 Key successes in Imitation Learning

One of the earliest and most well-known imitation learning success sto-
ries was the autonomous driving project Autonomous Land Vehicle In
a Neural Network (ALVINN) at Carnegie Mellon University [Pomer-
leau, 1988]. In ALVINN, a neural network learned how to map input
images to discrete actions in order to drive a vehicle. ALVINN’s neu-
ral network had one hidden layer with five units. Its input layer had
30 by 32 units; its output layer had 30 units. Although the structure
of this network was simple compared to modern neural networks with
millions of parameters, the system succeeded at driving autonomously
across the North American continent.
The Kendama robot developed by Miyamoto et al. [1996] is an-
other successful application of imitation learning. In the early days
of imitation learning, roboticists were mainly interested in teaching
1.2 Imitation Learning from the Point of View of Robotics
General Aspects:
1. Why and when should imitation learning be used? This
question clarifies the motivation for using imitation learning and
what we should do with it.
Figure 1.1: Observations y and control inputs u for imitation learning in (a) helicopter flight, (b) surgery, and (c) locomotion. Motion planning is formulated in different ways in these examples. (a) Learning of acrobatic RC helicopter maneuvers [Abbeel et al., 2010]: the trajectories for acrobatic flights are learned from a human expert's demonstrations, and iterative learning control is used to control the system with highly nonlinear dynamics. (b) Learning with a teleoperated system [Osa et al., 2014] where a position/velocity controller is available: to generalize the trajectory to different situations, a mapping from task situations to trajectories is learned from demonstrations under various situations. (c) Learning quadruped robot locomotion [Zucker et al., 2011]: footstep planning is addressed as optimization of a reward/cost function recovered from the expert demonstrations; learning the reward/cost function allows the footstep planning strategy to be generalized to different terrains.
Robotics researchers have developed many imitation learning methods for motion planning and robot
control. When planning a trajectory for a robotic system, it is often
necessary to make sure that a planned trajectory satisfies some con-
straints such as smooth convergence to a new goal state. For this rea-
son, robotics researchers have developed “custom” trajectory represen-
tations that explicitly satisfy constraints necessary for robotic appli-
cations. Machine learning techniques are often used as a part of such
frameworks. However, robotics researchers need to be aware that a rich set of algorithms has been developed by the machine learning community and that some of these new algorithms might eliminate the need for customizing the policy or trajectory representation.
For machine learning researchers, imitation learning offers interest-
ing practical and theoretical problems, which differ from standard su-
pervised and reinforcement learning settings. Although imitation learn-
ing is closely related to structured prediction, it is often challenging to
apply existing machine learning methods to imitation learning, especially in robotic applications. In imitation learning, collecting demonstra-
tions and performing rollouts are often expensive and time-consuming.
Therefore, it is necessary to consider how to minimize these costs and
perform learning efficiently. In addition, embodiments and observabil-
ity of the learner and the expert are different in many applications. In
such cases, the demonstrated motion needs to be adapted based on the
learner’s embodiment and observability. These difficulties in imitation
learning present new challenges to machine learning researchers.
Table 1.1: Table of Notation. We use a notation common in the control literature for states and controls.

x       system state
s       context
φ       feature vector
u       control input/action
τ       trajectory
π       policy
D       dataset of demonstrations
q       probability distribution induced by an expert's policy
p       probability distribution induced by a learner's policy
t       time
T       finite horizon
N       number of demonstrations
E       superscript representing an expert, e.g. π^E denotes an expert's policy
L       superscript representing a learner, e.g. π^L denotes a learner's policy
demo    superscript representing a demonstration by an expert, e.g. τ^demo denotes a trajectory demonstrated by an expert
1.5.4 Entropy
Given a random variable x and its probability distribution p(x), the entropy is defined as

H(p) = −∫ p(x) ln p(x) dx.    (1.1)
E_p[φ(x)] = E_q[φ(x)], which holds true for typical distributions from the exponential family such as the
Gaussian distribution, which is the maximum entropy distribution that
matches first and second order moments. The notion of Maximum En-
tropy generalizes to Maximum Causal Entropy, which turns out to be
a natural notion of uncertainty for dynamical systems [Ziebart et al.,
2013].
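As a quick numerical illustration of the entropy definition in (1.1), the sketch below integrates −p(x) ln p(x) on a grid for a one-dimensional Gaussian and compares the result with the standard closed form 0.5 ln(2πeσ²); the integration limits and the value of σ are arbitrary choices made for illustration.

```python
# A minimal numerical check of the entropy definition in (1.1) for a
# one-dimensional Gaussian; the closed form 0.5*ln(2*pi*e*sigma^2) is the
# standard reference value. The grid limits and sigma are arbitrary choices.
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def differential_entropy(pdf, lo=-10.0, hi=10.0, n=20001):
    """Approximate H(p) = -integral p(x) ln p(x) dx on a grid."""
    x = np.linspace(lo, hi, n)
    p = pdf(x)
    integrand = np.where(p > 0, p * np.log(p), 0.0)  # p ln p, with 0 ln 0 := 0
    return -np.trapz(integrand, x)

sigma = 1.5
numeric = differential_entropy(lambda x: gaussian_pdf(x, sigma=sigma))
closed_form = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)
print(numeric, closed_form)  # the two values agree to several decimals
```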
V π (xt ) is often called the value function [Sutton and Barto, 1998].
Likewise, the value of taking action u in state x under a policy π can
be computed as the expected reward when starting from the action u
in a state x and thereafter following policy π
Q^π(x, u) = E[ ∑_{t=0}^{∞} γ^t r_t | x_0 = x, u_0 = u, π ].    (1.10)
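To make (1.10) concrete, the following sketch estimates Q^π(x, u) by Monte Carlo rollouts in a small randomly generated MDP; the dynamics, reward, and policy used here are illustrative assumptions rather than anything defined in the text.

```python
# A small sketch that estimates Q^pi(x, u) in (1.10) by Monte Carlo rollouts
# in a toy, randomly generated MDP; the MDP and the policy are illustrative
# assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random dynamics p(x'|x,u), reward r(x,u), and a fixed stochastic policy pi(u|x).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # (x, u, x')
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # r(x, u)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi(u|x)

def q_monte_carlo(x0, u0, n_rollouts=500, horizon=100):
    """Estimate Q^pi(x0, u0) = E[sum_t gamma^t r_t | x0, u0, pi]."""
    returns = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        x, u, discount = x0, u0, 1.0
        for _ in range(horizon):
            returns[i] += discount * R[x, u]
            x = rng.choice(n_states, p=P[x, u])   # sample next state
            u = rng.choice(n_actions, p=pi[x])    # sample next action from pi
            discount *= gamma
    return returns.mean()

print(q_monte_carlo(x0=0, u0=1))
```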
2 Design of Imitation Learning Algorithms

2.1 Design Choices for Imitation Learning Algorithms
Figure 2.1: A ski jumper flies through the air using the highly aerodynamic “V-
style”. “V-style” was adopted by most ski jumpers in the 1990s after some jumpers
demonstrated impressive results with the style (public domain picture from Wiki-
media Commons).
As one can see above, these design choices are not independent, and the order of these design choices is flexible. For example, the choice of similarity measure between policies is related to the choice of policy representation. In the following sections, we present possible options for some of these design choices.
where J(π̂) is the expectation of the accumulated reward given the pol-
icy π as in (1.7). However, the reward function is considered unknown
and needs to be recovered from expert demonstrations under the as-
sumption that the demonstrations are (approximately) optimal w.r.t.
this reward function. Recovering the reward function from demonstra-
tions is often referred to as Inverse Reinforcement Learning (IRL) [Rus-
sell, 1998] or Inverse Optimal Control (IOC) [Moylan and Anderson,
1973].
BC and IRL form two major classes of imitation learning methods.
In order to select one of BC and IRL, it is essential to consider what is
the most parsimonious description of the desired behavior? The policy
perts are available. For this reason, behavioral cloning methods which
learn a direct mapping from states/contexts to actions have focused on
model-free methods until recent years.
For motion planning of underactuated systems, it is often neces-
sary to plan a feasible trajectory by considering the system dynamics.
It can be challenging to use model-free BC methods to learn trajec-
tories in such underactuated systems where the reachable states are
limited. However, recent IRL work by Boularias et al. [2011], Finn
et al. [2016b], Ho and Ermon [2016] shows how one can learn skills
in underactuated systems through iterative rollouts without explicitly
learning a dynamics model.
Model-based imitation learning methods attempt to learn a policy
that reproduces the demonstrated behavior by learning/using the sys-
tem dynamics, e.g. a forward model of the system. This property can
be critical especially for underactuated robots. Since underactuation
limits the number of reachable states, it is essential to take into ac-
count the dynamics of the system when planning feasible trajectories.
Moreover, the prior knowledge of the system dynamics makes inverse
reinforcement learning easier since the learner’s performance can be
easily predicted when the system dynamics is known. However, in a

Model-free:
  Advantages: A policy can be learned without learning/estimating the system dynamics.
  Disadvantages: The prediction of future states is difficult. The system dynamics is only implicitly considered in the resulting policy.
Model-based:
  Advantages: The learning process can be data-efficient. A learned policy satisfies the system dynamics.
  Disadvantages: Model learning can be difficult. Computationally expensive.
Figure 2.2: Diagram of general imitation learning. The learner cannot directly
observe the expert’s policy in many problems. Instead, a set of trajectories induced
by the expert’s policy is available in imitation learning. The learner estimates the
policy that reproduces the expert’s behavior using the given demonstrations. Please
note that the process of querying the demonstration and updating the learner’s
policy can be interactive.
2.4 Observability
When the state of the system is fully observable, we can obtain a tra-
jectory as a sequence of the state and the control input as
τ = [x0 , u0 , x1 , u1 , . . . , xT , uT ]. (2.3)
For instance, both the state and the control inputs are observable in a
teleoperated system in [Abbeel et al., 2010, van den Berg et al., 2010,
Osa et al., 2014, Ross et al., 2011], although the observations can be noisy.
When the control inputs are not observable, a trajectory is given as a sequence of states

τ = [x_0, x_1, . . . , x_T].    (2.4)

When the system state itself cannot be observed directly, a trajectory is given as a sequence of observations

τ = [y_0, y_1, . . . , y_T].    (2.5)
These cases need to be taken into account when deciding on the im-
itation learning approach for a specific application. When the expert
observes the system state partially, the expert demonstrations can be-
come sub-optimal requiring careful consideration. Moreover, when the
expert observes the learner, the learner may have more information
about its own embodiment. For example, if a human expert uses kines-
thetic teaching to show how to grasp an object, the demonstration may
be sub-optimal for a robot learner if the expert does not see what the
robot observes.
In imitation learning, the expert is often assumed to behave opti-
mally. However, this optimality is often based on partial observations
which may differ significantly from the observations of the learner. For
example, if the human expert performs a motion which goes around
an obstacle which the robot learner does not observe, a robot learner
learns to perform a similar circumnavigation motion even when there
are no obstacles. Moreover, when the learner only partially observes what the expert observes, the learner can make wrong predictions about the policy underlying the expert's behavior.
At the task level, a policy maps the state x_t and context s to a sequence of options

π : x_t, s ↦ [o_1, . . . , o_T],    (2.6)

and at the trajectory level, a policy maps a context s to a trajectory τ as

π : s ↦ τ.    (2.7)
BC methods such as DMP [Schaal et al., 2004, Ijspeert et al., 2013]
and ProMP [Paraschos et al., 2013, Maeda et al., 2016] learn such
trajectory-based policies.
At the action-state space level, a policy maps states of the system x_t and contexts s to control inputs u_t as

π : x_t, s ↦ u_t.    (2.8)
BC methods such as [Chambers and Michie, 1969, Pomerleau, 1988,
Khansari-Zadeh and Billard, 2011, Ross et al., 2011] and IRL methods
such as [Abbeel and Ng, 2004, Ziebart et al., 2008, Boularias et al.,
2011, Finn et al., 2016b] learn policies in action-state space. These
abstractions are summarized in Table 2.3.
Existing imitation learning methods can be categorized based on task abstractions as shown in Table 2.4. The table displays an abundance of model-free methods for trajectory learning. On the contrary, many model-based IRL methods have been developed with action-state space abstractions. Since commercially available robotic manipulators
often have a position/velocity controller, model-free methods are pre-
ferred for trajectory planning in such systems. This is especially pro-
nounced in motion planning methods designed for robotic manipulators
Table 2.3: Abstraction and the related policy in imitation learning. In a task-level abstraction, the policy maps from the initial state x_0 to a sequence of discrete options, where an option at time step t is denoted with o_t. In a trajectory-level abstraction, the policy maps from an initial state x_0 to a trajectory τ. In an action-state space abstraction, the policy maps from the current state x_t to a control u_t.

Task-level abstraction            π : x_0, s ↦ [o_1, . . . , o_T]
Trajectory-based abstraction      π : x_0, s ↦ τ
Action-state space abstraction    π : x_t, s ↦ u_t
Figure 2.3: Illustration of the relationships between basic policy classes. Stationar-
ity is a special case of non-stationarity and determinism is a special case of stochas-
ticity. We use the terms “stationary” and “time-invariant” interchangeably. Likewise,
“non-stationary” and “time-variant” are used interchangeably. Please see § 2.5.4 for
more details.
p(τ) = p(x_0) ∏_{t=1}^{T} p(x_{t+1} | x_t, u_t) π(u_t | x_t).    (2.13)
The expectation of the feature vector can be approximated using the N demonstrations as

E_{p(τ)}[φ(τ)] ≃ (1/N) ∑_{i=1}^{N} φ(τ_i^demo).    (2.15)
Figure 2.4: Illustration of M- and I- projections from the data manifold onto the
policy model manifold. The solutions of M- and I- projections are different since the
KL divergence is not symmetric.
p(τ | w) = exp(w^⊤ φ(τ)) / Z.    (2.22)
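The following sketch evaluates the maximum entropy trajectory distribution (2.22) on a small discrete set of candidate trajectories with an explicit partition function Z; the feature map and the trajectory set below are illustrative assumptions.

```python
# A minimal sketch of the maximum entropy trajectory distribution in (2.22):
# p(tau|w) = exp(w^T phi(tau)) / Z, evaluated on a small discrete set of
# candidate trajectories. The feature map and the trajectory set are
# illustrative assumptions.
import numpy as np

def features(tau):
    """Toy trajectory features: total path length and final position."""
    tau = np.asarray(tau, dtype=float)
    return np.array([np.abs(np.diff(tau)).sum(), tau[-1]])

candidate_trajectories = [
    [0.0, 0.5, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, -0.5, 1.0],
    [0.0, 0.0, 0.0],
]

w = np.array([-1.0, 2.0])   # prefers short paths ending near 1.0
scores = np.array([w @ features(tau) for tau in candidate_trajectories])
Z = np.exp(scores).sum()    # partition function
p = np.exp(scores) / Z      # p(tau|w) from (2.22)
print(p, p.sum())           # a valid distribution over the candidates
```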
Substituting the form of p(τ|w) from (2.22) into the original maximum entropy problem and ignoring terms which do not depend on the parameters w, the resulting dual objective function (or equivalently
Here, the data is induced via the distribution q(τ ) on the right-hand
side of the KL, while in the maximum entropy principle, the data is
induced by the feature averages and p0 (τ ) on the right-hand side of
the KL is just a prior. The I-projection does not match features of
the demonstrator. Whenever an algorithm matches average features,
it is an instance of an M-projection based algorithm. Since ln q(τ ) is
unknown and hard to evaluate in practice, it is challenging to perform
the I-projection in the context of imitation learning. To the best of our
knowledge, there is no existing imitation learning method that performs
the I-projection exactly.
As we have seen from our discussion above, many imitation learning
methods can be seen as related to the M-projection and to the principle
of maximum entropy. This is true for most model-free and model-based
methods. Model-free methods based on standard supervised learning
[Ijspeert et al., 2013, Khansari-Zadeh and Billard, 2011] do not require
access to the system dynamics or iterative data acquisition.
In contrast, model-based imitation learning methods often try to
match features of the state distribution so as to satisfy Ep [φ(τ )] =
Eq [φ(τ )]. In order to do so, we either need access to the system dy-
namics [Ziebart et al., 2008, Ziebart, 2010] or require iterative data
acquisition [Boularias et al., 2011].
where the policy π(ut |xt ) maps from the states of the system to the
control inputs. Let us consider the trajectory distribution p(τ ) induced
by the learner’s policy and the trajectory distribution q(τ ) induced by
the expert’s policy. If the embodiments of the learner and the expert
are equivalent and stationary, that is, q(xt+1 |xt , ut ) = p(xt+1 |xt , ut ) =
p(xt |xt−1 , ut−1 ), the relation of p(τ ) and q(τ ) is given by
p(τ) / q(τ) = ∏_{t=0}^{T} π^L(u_t | x_t) / ∏_{t=0}^{T} π^E(u_t | x_t),    (2.31)

D_KL(p(τ) || q(τ)) = ∫ p(x, u) ln( π^L(u|x) / π^E(u|x) ) dx du    (2.35)
                  = E_p[ ln π^L(u|x) − ln π^E(u|x) ].    (2.36)
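Equation (2.36) suggests a simple Monte Carlo estimator of the divergence between learner and expert: sample state-action pairs from the learner and average the log ratio of the two policies. The tabular policies and state distribution in the sketch below are illustrative assumptions.

```python
# A sketch of the Monte Carlo estimator suggested by (2.36): sample
# state-action pairs from the learner and average ln pi^L(u|x) - ln pi^E(u|x).
# The tabular policies and the state distribution are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
pi_L = rng.dirichlet(np.ones(n_actions), size=n_states)   # learner policy pi^L(u|x)
pi_E = rng.dirichlet(np.ones(n_actions), size=n_states)   # expert policy pi^E(u|x)
state_dist = np.array([0.5, 0.3, 0.2])                     # assumed state distribution under the learner

# Monte Carlo estimate of E_p[ln pi^L(u|x) - ln pi^E(u|x)]
n_samples = 20000
xs = rng.choice(n_states, size=n_samples, p=state_dist)
us = np.array([rng.choice(n_actions, p=pi_L[x]) for x in xs])
kl_mc = np.mean(np.log(pi_L[xs, us]) - np.log(pi_E[xs, us]))

# Exact value for this tabular case, for comparison
kl_exact = sum(state_dist[x] * pi_L[x, u] * (np.log(pi_L[x, u]) - np.log(pi_E[x, u]))
               for x in range(n_states) for u in range(n_actions))
print(kl_mc, kl_exact)
```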
3 Behavioral Cloning

3.1 Problem Statement
Figure 3.1: Control diagram of a robotic system with imitation learning. An expert demonstrates the desired behavior, generating a dataset D. Based on D and observations of the current context and system state, an upper-level controller generates the desired trajectory τ_d. A lower-level feedback controller tries to follow τ_d using observation feedback to generate a control u, which causes a change to the system state x and a new observation. In imitation learning, the controllers are tuned to imitate the expert demonstrations.
The quadratic loss function is the most common choice for the loss function. Given two vectors x_1 and x_2, a quadratic loss function is given by

ℓ_2(x_1, x_2) = ∥x_1 − x_2∥².

The quadratic loss function is also called the ℓ2-loss function, and regression that minimizes the quadratic loss function is often called least squares (LS) regression or ℓ2-loss minimization [Sugiyama, 2015].
Minimizing the quadratic loss function is closely related to maxi-
mizing the expected log likelihood under the Gaussian distribution as-
sumption. Let us consider the regression function fθ (x) parameterized
by θ. Suppose that the target variable y follows the model

y = f_θ(x) + ε,    (3.5)

where ε is zero-mean Gaussian noise. The conditional distribution of y is then given by

p(y | x, θ) = (1/√(2πσ)) exp( −(y − f_θ(x))² / (2σ) ).    (3.6)
Finding the model fθ (x) that maximizes the expected log likelihood
can be formulated as
argmax_θ E[log p] = argmax_θ E[ log exp( −(y − f_θ(x))² / (2σ) ) ]    (3.7)
                 = argmin_θ E[(y − f_θ(x))²]    (3.8)
                 ≈ argmin_θ (1/N) ∑_i (y_i − f_θ(x_i))².    (3.9)
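A minimal behavioral cloning sketch under the quadratic loss (3.9) is given below: a linear policy u = θ^⊤x is fit to demonstrated state-action pairs by ordinary least squares. The linear expert and the synthetic demonstrations are illustrative assumptions.

```python
# A minimal behavioral cloning sketch under the quadratic loss (3.9): fit a
# linear policy u = theta^T x to demonstrated state-action pairs by least
# squares. The expert and the data generation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_demos, state_dim = 200, 4
theta_expert = rng.normal(size=state_dim)                 # hypothetical expert parameters

X = rng.normal(size=(n_demos, state_dim))                 # demonstrated states
U = X @ theta_expert + 0.05 * rng.normal(size=n_demos)    # noisy demonstrated actions

# argmin_theta (1/N) sum_i (u_i - theta^T x_i)^2 has the closed-form
# least-squares solution computed below.
theta_bc, *_ = np.linalg.lstsq(X, U, rcond=None)
print(np.allclose(theta_bc, theta_expert, atol=0.05))     # recovered up to noise
```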
The ℓ1-loss function is often employed for regression. The ℓ1-loss function is given by

ℓ_abs(x_1, x_2) = ∑_i |x_{1,i} − x_{2,i}|,    (3.12)

where x_{1,i} and x_{2,i} are the ith elements of the vectors x_1 and x_2, respectively. The ℓ1-loss function is also called the absolute loss function, and regression that minimizes the ℓ1-loss is called least absolute deviations regression or ℓ1-loss minimization [Sugiyama, 2015]. Usually,
Since the log loss is equivalent to the cross entropy, the log loss is also
called the cross-entropy loss [Sugiyama, 2015].
In binary classification (in imitation learning, classification can be used to learn a discrete control policy from expert demonstrations), minimizing the log loss function is equivalent to maximizing the log likelihood in logistic regression. In more detail, suppose that we want to learn a binary classifier where the probability follows the Bernoulli distribution
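The sketch below trains such a discrete-action behavioral cloning policy by minimizing the log loss, i.e., logistic regression on demonstrated state-action pairs; the expert policy used to generate the data is an illustrative assumption.

```python
# A sketch of behavioral cloning with the log loss: logistic regression that
# maps states to a binary action, trained by gradient descent on the
# cross-entropy. The expert policy used to generate data is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n_demos, state_dim = 500, 3
w_expert = np.array([1.5, -2.0, 0.5])

X = rng.normal(size=(n_demos, state_dim))
p_expert = 1.0 / (1.0 + np.exp(-X @ w_expert))
U = (rng.uniform(size=n_demos) < p_expert).astype(float)   # demonstrated discrete actions

w = np.zeros(state_dim)
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - U) / n_demos        # gradient of the mean log loss
    w -= lr * grad

print(w, w_expert)   # w roughly recovers the expert parameters
```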
Table 3.1: Regression methods in model-free behavioral cloning for both trajectory
and action-state space learning. The output trajectory in trajectory learning consists
of a long high dimensional sequence of variables while in action-state space learning
the output is a single action. Therefore, some methods such as look-up tables have
not been applied to trajectory learning. For modeling uncertainty in demonstrations,
regression methods need to have explicit support for variance. Gaussian model,
GMM and GPR methods model uncertainty explicitly.
Table 3.2: A main choice when doing behavioral cloning is whether to use a model-based or a model-free method. Model-free methods can directly learn a policy from data without learning a dynamics model. Direct learning also usually means that the learning algorithm does not need to iterate between trajectory and behavior generation. However, model-free methods are hard to apply to underactuated systems, since without a model, predicting the desired behavior is hard. Model-based methods may work in underactuated systems, but learning the actual model can in many cases be difficult.

Model-free:
  Advantages: A policy can usually be learned without iterative learning.
  Disadvantages: Hard to apply to underactuated systems. Hard to predict future states.
Model-based:
  Advantages: Applicable to underactuated systems.
  Disadvantages: Model learning can be very difficult. An iterative learning process is often required.
Using neural networks for learning has attracted great interest in various fields. Supervised learning of neural networks can also be used for imitation learning: the desired neural network policy can be learned from the dataset generated/demonstrated by the expert. In this section, we briefly highlight some recent imitation learning successes with neural networks.
Recently, using neural networks for imitation learning has shown im-
pressive results in certain applications such as learning to play Go [Sil-
ver et al., 2016], generating handwriting [Chung et al., 2015], gener-
ating natural language [Wen et al., 2015], or generating image cap-
tions [Karpathy and Fei-Fei, 2015]. Moreover, supervised learning of
neural networks has been used as a building block for example for
learning the policy or the cost function in inverse reinforcement learn-
ing (please see §4.4.6 for more details).
Figure 3.2: The game of Go is played on a 19×19 board. Even though the total number of possible board configurations exceeds 10^170, and thus the training data cannot cover all possible plays, the simple imitation learning approach in [Silver et al., 2016] was able to learn a competitive policy from demonstrations and improve the policy using self-play. [Figure from https://commons.wikimedia.org/wiki/File:Tuchola_026.jpg. CC license.]
Table 3.3: Natural language generated by the semantically controlled LSTM (SC-
LSTM) cell neural network proposed in [Wen et al., 2015]. The table shows an
example dialogue act and related natural language samples from [Wen et al., 2015].
The neural network generates natural language learned from human demonstrations.
The neural network is conditioned on the dialogue act which limits the generated
sentences to specific meanings.
Dialogue act:
inform(name=”red door cafe”, goodformeal=”breakfast”,
area=”cathedral hill”, kidsallowed=”no”)
Generated samples:
red door cafe is a good restaurant for breakfast in the area
of cathedral hill and does not allow children .
red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow children .
red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow kids .
red door cafe is good for breakfast and is in the area of cathedral hill
and does not allow children .
red door cafe does not allow kids and is in the cathedral hill area
and is good for breakfast .
The SC-LSTM cell is based on a long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] network. Wen et al. [2015] train their system
using data collected from a spoken dialogue system. Table 3.3 shows an
example of natural language generated by the trained neural network.
As is common when designing neural network based systems, the
neural network architecture in [Wen et al., 2015] is adapted to the
task at hand. Moreover, neural network approaches need to take prob-
lems such as vanishing gradients, co-adaptation, and overfitting into
account. Vanishing gradients can be a problem especially for recurrent
neural networks due to the high optimization depth. The neural net-
work architecture in [Wen et al., 2015] includes skip connections [Graves
et al., 2013] to soften vanishing gradients and Wen et al. [2015] utilize
dropout [Srivastava et al., 2014], a technique which randomly deac-
tivates connections in the neural network during training, to reduce
co-adaptation and overfitting.
Learning recurrent neural networks from demonstrations has been
shown to work also for other kinds of data. Karpathy and Fei-Fei [2015]
show how to learn to generate annotations for image regions from
demonstrations. The approach of [Karpathy and Fei-Fei, 2015] learns
from a combination of image and language data to generate natural lan-
guage descriptions of images. Chung et al. [2015] show how to learn to
generate handwriting and natural speech from demonstrations. Chung
et al. [2015] propose a new type of recurrent neural network with hid-
den random variables and argue that random variables are needed to
model variability in data with complex correlations between different
time steps, for example, in natural speech.
Figure 3.3: An overview of DAGGER from [Bagnell, 2015]. In each iteration, DAGGER generates new examples using the current policy with corrections (labels) provided by the experts, adds the new demonstrations to a demonstration dataset, and computes a new policy to optimize performance in aggregate over that dataset. The figure illustrates a single iteration of DAGGER. The basic version of DAGGER initializes the demonstration dataset from a single set of expert demonstrations and then interleaves policy optimization and data generation to grow the dataset. More generally, there is nothing special about aggregating data: any method, like gradient descent or weighted majority, that is sufficiently stable in its policy generation and does well on average over the iterations (or, more broadly, any no-regret algorithm run over each iteration's dataset) will achieve the same guarantees, and may be strongly preferred for computational reasons.
Algorithm: DAGGER
Initialize π_1^L.
for i = 1 to N do
    Let π_i = β_i π^E + (1 − β_i) π_i^L.
    Sample trajectories τ = [x_0, u_0, . . . , x_T, u_T] using π_i.
    Get dataset D_i of states visited by π_i and actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train the policy π_{i+1}^L on D.
end for
return the best π_i^L on validation.
No-regret online learning algorithms are asymptotically good on average over the datasets they are presented with, and are sufficiently stable between iterations [Hazan, 2016].
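To make the loop above concrete, the following sketch runs DAGGER-style data aggregation with a linear least-squares learner in a toy linear system; the dynamics, the expert feedback gains, and the β_i schedule are illustrative assumptions rather than anything prescribed by the text.

```python
# A compact sketch of the DAGGER loop from the pseudocode above, using a
# linear least-squares learner in a toy linear system. The dynamics, the
# expert controller, and the beta schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
state_dim, horizon, n_iters, n_rollouts = 2, 30, 5, 10
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K_expert = np.array([[4.0, 2.5]])                 # hypothetical expert feedback gains

def expert(x):
    return -(K_expert @ x)

def rollout(policy, beta):
    """Run the mixture policy beta*expert + (1-beta)*learner, label with the expert."""
    states, labels = [], []
    x = rng.normal(size=(state_dim, 1))
    for _ in range(horizon):
        u = beta * expert(x) + (1 - beta) * policy(x)
        states.append(x.ravel())
        labels.append(expert(x).ravel())          # expert action for the visited state
        x = A @ x + B @ u + 0.01 * rng.normal(size=(state_dim, 1))
    return states, labels

D_states, D_labels = [], []
K_learner = np.zeros((1, state_dim))
for i in range(n_iters):
    beta = 1.0 if i == 0 else 0.0                 # simple beta_i schedule
    learner = lambda x: -(K_learner @ x)
    for _ in range(n_rollouts):
        s, l = rollout(learner, beta)
        D_states += s
        D_labels += l
    # Aggregate the dataset and refit the learner by least squares: u ~ -K x.
    S, L = np.array(D_states), np.array(D_labels)
    K_learner = -np.linalg.lstsq(S, L, rcond=None)[0].T

print(K_learner, K_expert)
```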
Data as Demonstrator: Venkatraman et al. [2015] extended
DAGGER and proposed a framework called Data as Demonstrator
(DaD) where the problem of multi-step prediction is formulated as im-
itation learning. Prediction errors will cascade over time in multi-step
prediction as in the case of learning a policy, and this prediction error
can also be improved by a data aggregation approach. Recent work
shows the efficacy of DaD in control problems [Venkatraman et al.,
2016].
π : S ↦ T.    (3.20)

λ* = argmax_λ p(Y′ | λ).    (3.21)
where x_0 denotes the initial position and M the number of the basis functions. The Gaussian basis function ψ_i(z) is given by

ψ_i(z) = exp(−h_i (z − c_i)²) / ∑_{j=1}^{N} exp(−h_j (z − c_j)²),    (3.26)
where hi and ci are constants that determine the width and centers of
the basis functions, respectively. This system represents stable attractor
where xdemo (t), ẋdemo (t), ẍdemo (t) are the position, velocity and accel-
eration at the time t, respectively. Subsequently, we can find the weight
vector w that minimizes the sum of the squared error
L_DMP = ∑_{t=0}^{T} ( f_target(t) − ξ(t) Ψ w )²,    (3.28)

where ξ(t) = (g − x_0) z(t) for the discrete system and ξ(t) = 1 for the rhythmic system, and the entries of Ψ are computed as Ψ_ij = ψ_i(t_j) with (3.25). The weight vector w can be obtained as the least-squares solution

w = (Ψ^⊤ Ψ)^{-1} Ψ^⊤ F.    (3.29)
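The weight fit in (3.26)-(3.29) amounts to ordinary least squares on a basis matrix; a minimal sketch is given below, where the target forcing values F come from a synthetic signal rather than from demonstrated positions, velocities, and accelerations, which is an illustrative simplification.

```python
# A sketch of the DMP weight fit in (3.26)-(3.29): build the basis matrix
# with rows indexed by time, Psi[j, i] = psi_i(t_j), and solve the
# least-squares problem for w. The target forcing values F here are a
# synthetic assumption; in practice they are computed from the demonstrated
# positions, velocities, and accelerations as described in the text.
import numpy as np

n_basis, n_steps = 15, 200
t = np.linspace(0.0, 1.0, n_steps)               # phase / time samples t_j
c = np.linspace(0.0, 1.0, n_basis)               # basis centers c_i
h = np.full(n_basis, 0.5 * n_basis ** 2)         # basis widths h_i

def psi(z):
    """Normalized Gaussian basis functions, eq. (3.26)."""
    act = np.exp(-h * (z - c) ** 2)              # shape (n_basis,)
    return act / act.sum()

Psi = np.array([psi(tj) for tj in t])            # shape (n_steps, n_basis)

F = np.sin(2.0 * np.pi * t) + 0.3 * t            # assumed target forcing term f_target(t_j)

# Least-squares solution (3.29): w = (Psi^T Psi)^{-1} Psi^T F
w = np.linalg.solve(Psi.T @ Psi, Psi.T @ F)
print(np.max(np.abs(Psi @ w - F)))               # reconstruction error of the fitted forcing term
```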
where x_s^new and x_g^new are the new start and goal states, and M is a linear operator that defines the inner product in the Hilbert space. When time
ẋ = f (x), (3.44)
where x is the system state, and f is a function that governs the be-
havior of the system. Khansari-Zadeh and Billard [2011], Gribovskaya
et al. [2011] learn the function f as a GMM.
Let us define x as the state vector of the system. When a set of
demonstrated trajectories is given, the joint distribution of x and ẋ can
be estimated from the observations using a GMM. The kth component
of the GMM models the distribution p(x, ẋ|k) as
p(x, ẋ | k) ∼ N( [x; ẋ] | [μ_{x,k}; μ_{ẋ,k}], [Σ_{x,k}, Σ_{xẋ,k}; Σ_{ẋx,k}, Σ_{ẋ,k}] ),    (3.45)

where semicolons separate the rows of the stacked mean vector and block covariance matrix.
where

h_k(x) = p(k) p(x|k) / ∑_{i=1}^{K} p(i) p(x|i) = π_k N(x | μ_{x,k}, Σ_{x,k}) / ∑_{i=1}^{K} π_i N(x | μ_{x,i}, Σ_{x,i}),    (3.47)
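Gaussian mixture regression then predicts ẋ for a query state x by weighting each component's conditional mean with the responsibilities in (3.47). The sketch below uses a hand-specified two-component mixture as an illustrative assumption; in practice the parameters are fitted to the demonstrations with EM.

```python
# A sketch of Gaussian mixture regression for xdot = f(x): the responsibilities
# h_k(x) follow (3.47), and each component contributes the standard Gaussian
# conditional mean mu_xdot,k + Sigma_xdotx,k Sigma_x,k^{-1} (x - mu_x,k).
# The mixture parameters below are illustrative assumptions.
import numpy as np

K, dim = 2, 1   # two components, one-dimensional state for readability
priors = np.array([0.5, 0.5])
mu_x         = [np.array([-1.0]), np.array([1.0])]
mu_xdot      = [np.array([0.8]),  np.array([-0.8])]
Sigma_x      = [np.array([[0.3]]), np.array([[0.3]])]
Sigma_xdot_x = [np.array([[0.1]]), np.array([[-0.1]])]

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = x - mu
    k = len(mu)
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(Sigma))

def predict_velocity(x):
    x = np.atleast_1d(x)
    weights = np.array([priors[k] * gauss_pdf(x, mu_x[k], Sigma_x[k]) for k in range(K)])
    h = weights / weights.sum()                   # responsibilities h_k(x), eq. (3.47)
    xdot = np.zeros(dim)
    for k in range(K):
        cond_mean = mu_xdot[k] + Sigma_xdot_x[k] @ np.linalg.solve(Sigma_x[k], x - mu_x[k])
        xdot += h[k] * cond_mean
    return xdot

print(predict_velocity(-0.5), predict_velocity(0.5))
```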
Figure 3.6: Trajectory transfer using non-rigid registration [Schulman et al., 2013].
Table 3.5: Generalization of skills using existing methods. DMPs enable stable con-
vergence to arbitrary goal positions. ProMPs can generalize trajectories by Gaussian
conditioning, but there is no guarantee of stable behavior. SEDS can generalize tra-
jectories while guaranteeing stable behavior, but cannot model time dependence
of movements. Trajectory transfer using non-rigid registration can achieve complex
generalization, but does not incorporate stochasticity of demonstrations and there
is no guarantee of stable behavior.
DMP [Schaal et al., 2004, Ijspeert et al., 2013]
  Generalizable context: start and goal positions
  Advantages: guarantee of stable behavior
  Disadvantages: limited generalization capabilities

ProMP [Paraschos et al., 2013, Maeda et al., 2016]
  Generalizable context: any subset of the observations of the system
  Advantages: generalization based on stochasticity of demonstrations
  Disadvantages: no guarantee of stable behavior

SEDS [Khansari-Zadeh and Billard, 2011, 2014]
  Generalizable context: state of the system with fixed dimensionality
  Advantages: generalization with guarantee of stable behavior
  Disadvantages: no time-dependence

Way points with non-rigid registration [Schulman et al., 2013]
  Generalizable context: a point cloud of the given scene
  Advantages: generalization based on point clouds of a given scene
  Disadvantages: stochasticity of demonstrations is not considered
Originally developed for speech recognition, DTW is frequently used to deal with the
time alignment of trajectories in robotics. The original formulation of
DTW finds the best time alignment of two data sequences. However,
we often obtain more than two demonstrations, and we need to align
all of them appropriately in the time domain.
In the field of imitation learning, Coates et al. [2008] proposed
a method to normalize the time alignment of multiple demonstrated
trajectories. Similar approaches appear in applications such as au-
tonomous helicopter flight [Abbeel et al., 2010] and automation of
robotic surgery [van den Berg et al., 2010, Osa et al., 2014]. Here,
we review the method employed by van den Berg et al. [2010].
van den Berg et al. [2010] regarded the demonstrated trajecto-
ries as noisy ’observations’ of the ’reference’ trajectories. The refer-
ence trajectory and the time mapping from the reference trajectory to
the demonstrated trajectory are computed using the EM (Expectation
Maximization)-algorithm.
The linear system is described as

ξ(t + 1) = [A, B; 0, I] ξ(t) + w(t),    w(t) ∼ N( 0, [P, 0; 0, Q] ),    (3.50)
where ξ(t) = [x⊤ (t), u⊤ (t)]⊤ is the state and the control input of the
system at time t, A and B are the state matrix and the input matrix,
respectively. w(t) is the noise that follows the zero-mean Gaussian dis-
tribution. P and Q are the covariance matrices of process noise and
observation noise, respectively. If we assume that the jth demonstrated trajectory τ_j is given by τ_j = [x_j(0), u_j(0), · · · , x_j(T_j), u_j(T_j)], the
the motions of two agents using DMPs and learned the correlations of
the distribution of the motion parameters. When one agent’s motion
is observed, the motion of the other agent can be predicted based on
Gaussian conditioning.
Likewise, ProMPs have also been used to learn the correlation of
multiple agents’ motion. Maeda et al. [2016] developed an imitation
learning framework called Interaction ProMP to learn coupled motions
in human-robot collaboration. In the framework of Interaction ProMP,
correlated movements are learned as a distribution of the correlated
weight vectors of ProMPs. Using a partial observation of the movement,
unobserved movements are estimated as a conditional distribution of
the weight vectors on the given partial observation.
Here, we describe details of Interaction ProMP. Suppose demon-
strations of human robot collaborative movements are given. Here, we
define the state vector as a concatenation of the P DoFs executed by
the human, followed by the Q DoFs executed by the robot
x(t) = [x_h(t); x_r(t)],    (3.53)

where

H^⊤(t) = diag(Ψ^⊤(t), . . . , Ψ^⊤(t)),    (3.55)

Ψ^⊤(t) is an M × 2 matrix defined as in (3.35), and M is the number of basis functions. When a trajectory of a human-robot collaborative movement is demonstrated, the weight vector ω̄ can be learned as

ω̄ = [(ω_1^h)^⊤, . . . , (ω_P^h)^⊤, (ω_1^r)^⊤, . . . , (ω_Q^r)^⊤]^⊤.    (3.56)
Figure 3.8: Overview of Interaction ProMPs in [Maeda et al., 2016]. In the interac-
tion ProMP framework, correlated movements are learned as the joint distribution
of weight vectors of ProMPs. Thanks to the probabilistic modeling of the trajectory
distribution, the interaction ProMP framework works with noisy observations of
trajectories [Maeda et al., 2016]. In this figure, ω̄ represents the weight vector that
contains movements of all DoFs controlled by the robot and the human operator as
defined in (3.56).
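The conditioning step can be sketched as follows: given a joint Gaussian over the stacked human and robot weight vectors ω̄, the robot part is predicted by standard Gaussian conditioning on the observed human part. The joint mean and covariance below are illustrative assumptions; in the actual framework they are estimated from the demonstrated collaborative trajectories.

```python
# A sketch of the conditioning step behind Interaction ProMPs: given a joint
# Gaussian over stacked human and robot weight vectors, the robot part is
# predicted by conditioning on the observed human part. The joint mean and
# covariance below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_r = 3, 3                                    # human / robot weight dimensions

# Assumed joint distribution N(mu, Sigma) over omega_bar = [omega_h; omega_r].
A_ = rng.normal(size=(d_h + d_r, d_h + d_r))
Sigma = A_ @ A_.T + 0.1 * np.eye(d_h + d_r)        # a valid covariance matrix
mu = rng.normal(size=d_h + d_r)

mu_h, mu_r = mu[:d_h], mu[d_h:]
S_hh, S_hr = Sigma[:d_h, :d_h], Sigma[:d_h, d_h:]
S_rh, S_rr = Sigma[d_h:, :d_h], Sigma[d_h:, d_h:]

omega_h_obs = rng.normal(size=d_h)                 # weights fitted to the observed human motion

# Gaussian conditioning: p(omega_r | omega_h_obs)
mu_r_cond = mu_r + S_rh @ np.linalg.solve(S_hh, omega_h_obs - mu_h)
S_r_cond = S_rr - S_rh @ np.linalg.solve(S_hh, S_hr)
print(mu_r_cond)           # mean robot weights used to generate the robot trajectory
print(np.diag(S_r_cond))   # remaining uncertainty
```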
ments [Shukla and Billard, 2012, Lukic et al., 2014, Kim et al., 2014].
Shukla and Billard [2012] developed a framework for learning coupled
movement based on DS, which they call the Coupled Dynamical Sys-
tem (CDS) model. The idea of CDS is to model the correlation between
two agents using statistical models.
Let us assume that two agents, which we call the master and the slave, perform a coupled motion. The correlation of the movement of the master x_m and the movement of the slave x_s can be modeled with CDS.
In CDS, three GMMs are trained to model three joint distributions:
1) the joint distribution of the master movement p(x_m, ẋ_m),
2) the joint distribution of the states of the master and the desired state of the slave p(Φ(x_m), x_s^d),
3) the joint distribution of the slave movement p(x̃_s, ẋ_s),
where x̃_s = x_s − x_s^d and x_s^d is the desired state of the slave. To ensure the stability of the system, SEDS is used to model these three joint distributions [Khansari-Zadeh and Billard, 2011]. The function Φ(·) maps x_m to the same dimensionality as x_s. This mapping is necessary because SEDS can handle only models in which the inputs and outputs have the same dimensionality [Shukla and Billard, 2012].
The reproduction of learned motions is performed by repeating three steps: First, the movement of the master is planned using p(x_m, ẋ_m). Subsequently, the state of the slave is estimated based on p(x_s^d | Φ(x_m)). Third, the motion of the slave is planned based on p(x_s, ẋ_s). These steps are repeated until the system converges to the goal position. The CDS approach has been applied to learn the correlation between the arm and fingers [Shukla and Billard, 2012, Kim et al., 2014], or the eye and arm [Lukic et al., 2014].
Several BC methods support incremental learning. In [Calinon and Billard, 2007], GMMs are initialized with
trajectories demonstrated by a human wearing a motion sensor. Subse-
quently, the motion of the humanoid robot is modified through kines-
thetic teaching by a human coach. Through this iterative process, the
model of the trajectory distribution is improved incrementally. The
method in [Calinon and Billard, 2007] is summarized in Algorithm
7. The method in [Lee and Ott, 2011] used a similar representation
by combining GMMs with HMMs. In the framework of [Lee and Ott,
2011], the compliance of a robot manipulator is controlled in order to
represent an area where motion refinement is allowed. However, the
method in [Calinon and Billard, 2007] does not address the context of the task. Therefore, the generalization of the demonstrated trajectories to new situations is not considered. Recent follow-up work [Havoutis
and Calinon, 2017] addressed the online learning and the adaptation
of the skill to new contexts by combining an optimal control approach
and TP-GMM in [Calinon, 2015].
Ewerton et al. [2016] used ProMPs for incremental imitation with
generalization to different contexts. Ewerton et al. [2016] parameterize trajectories with ProMPs as p(τ|w). To generalize the demonstrated
trajectories to new contexts, the joint distribution of trajectory param-
eters and the Gaussian context p(w, s) is incrementally learned under
the supervision of a human. Given a new context snew , the trajec-
tory is planned as a conditional distribution p(τ |snew ). The method
in [Ewerton et al., 2016] which is suitable for incremental learning of
where ẋmod is the velocity with the local modulation and ẋini is the
velocity given by the initial dynamical system. The local modulation
is represented by scaling and rotation of the original dynamics in the
framework of [Kronander et al., 2015]. Therefore, the modulation function is given by
Figure 3.10: Learning a hierarchical skill in [Kroemer et al., 2015]. Left: A sequence of skills is modeled using a variant of an HMM. Right: The learned DMPs can be adapted to different objects.
plan that switches from one DMP to another based on the observations.
Kroemer et al. [2015] learn DMPs using imitation learning and optimize
high-level policies using reinforcement learning. Kroemer et al. [2015]
demonstrate the approach in robotic manipulation tasks as shown in
Figure 3.10.
Although it is often assumed that a sufficient amount of demonstra-
tion data is available, this may not be the case in many applications.
Incremental imitation learning for task-level planning proposed by
Niekum et al. [2014] can address this issue. The framework in [Niekum
et al., 2014] leverages unstructured demonstrations and corrective ac-
Figure 3.11: Mutual language model between motion and sequence in [Takano
and Nakamura, 2015](Figure used with permission of Wataru Takano). Relevance
between words and motion is learned using a probabilistic model. The approach
can work in two directions: generating sentences from motion or generating motion
from sentences. When motion is observed, a motion language semantic graph model
generates words for the observed motion. A natural language model arranges the
words then into sentences. When observing language, it is segmented into words using a natural language model, and the words are then transformed into motion using a semantic graph.
and then plan trajectories based on the learned forward model. Forward
dynamics model learning can be framed as a regression problem. Ta-
ble 3.6 lists different regression methods which have been utilized in
model-based BC. Although locally weighted regression and Gaussian
mixture regression were used in early studies of model-based methods,
recent studies often employ Gaussian Processes. As we will review in
§3.7.1.2, Gaussian Processes can incorporate inputs with uncertainty.
This property is important for multi-step forward prediction since the
uncertainty is propagated over time. However, due to the computational
cost, Gaussian Process regression is not suitable for high-dimensional
data. To deal with high-dimensional data such as raw images, a deep learning approach is employed for modeling the forward dynamics in the most recent studies [Oh et al., 2015, Finn et al., 2017a, Baram et al.,
2017, Nair et al., 2017]. In the following sections, we review some of
the model-based methods with explicit learning of a forward model.
Table 3.6: Model-based behavioral cloning methods using different regression meth-
ods. Early studies on model-based behavior cloning focused on locally weighted
regression but later studies have moved to Gaussian mixture regression and even
more recently to Gaussian processes. We expect that studies based on deep neural
networks will be popular in the near future.
where p(k) is the prior and the kth Gaussian component is given by
p(x_{t+1}, z_t | k) = N( [z_t; x_{t+1}] | [μ_{z,k}; μ_{x,k}], [Σ_{z,k}, Σ_{zx,k}; Σ_{xz,k}, Σ_{x,k}] ),    (3.65)
where

μ_{x|z,k} = μ_{x,k} + Σ_{xz,k} (Σ_{z,k} + Σ^in)^{-1} (z*_t − μ_{z,k}),
Σ_{k,t+1} = Σ_{x,k} − Σ_{xz,k} (Σ_{z,k} + Σ^in)^{-1} Σ_{zx,k},    (3.69)
w_k = p(k) N(z*_t | μ_{z,k}, Σ_{z,k} + Σ^in) / ∑_{k=1}^{K} p(k) N(z*_t | μ_{z,k}, Σ_{z,k} + Σ^in).
Grimes and Rao [2009] used this GMR for one-step prediction and
recursively predicted learner’s trajectories. Using the learned forward
model, the action is selected so as to maximize the posterior likelihood
as
f(z_t) ∼ GP( m(z_t), k(z_t, z′_t) ),    (3.71)
For two given Gaussian distributions p(x(t)) ∼ N (x|µp (t), Σp (t)) and
q(x(t)) ∼ N (x|µq (t), Σq (t)), the KL divergence of q and p can be com-
puted in closed form. Using the factorization in (3.76), the KL diver-
gence between the trajectory distribution induced by the expert policy
q(τ ) and the trajectory distribution induced by the learned policy p(τ )
can be computed as
D_KL( q(τ) || p(τ) ) = ∑_{t=1}^{T} D_KL( q(x(t)) || p(x(t)) ),    (3.77)
where q(τ ) is the expert trajectory distribution and p(τ ) is the trajec-
tory distribution induced by the learner’s policy. The learning process
of BC methods with forward dynamics can be illustrated as Figure 3.12.
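The per-time-step terms in (3.77) are KL divergences between Gaussians, which are available in closed form. A minimal self-contained sketch with assumed means and covariances:

```python
# The closed-form KL divergence between two multivariate Gaussians used in
# (3.77); the means and covariances below are illustrative assumptions.
import numpy as np

def kl_gaussians(mu_q, Sigma_q, mu_p, Sigma_p):
    """D_KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) )."""
    d = len(mu_q)
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(Sigma_p_inv @ Sigma_q)
                  + diff @ Sigma_p_inv @ diff
                  - d
                  + np.log(np.linalg.det(Sigma_p) / np.linalg.det(Sigma_q)))

mu_q, Sigma_q = np.zeros(2), np.eye(2)
mu_p, Sigma_p = np.array([0.5, -0.2]), np.array([[1.2, 0.1], [0.1, 0.8]])
print(kl_gaussians(mu_q, Sigma_q, mu_p, Sigma_p))   # one per-time-step term in (3.77)
print(kl_gaussians(mu_q, Sigma_q, mu_q, Sigma_q))   # zero for identical Gaussians
```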
In addition, the method in [Englert et al., 2013] assumes that the tra-
3.8 Robot Applications with Model-Free BC Methods
Figure 3.13: Learning rhythmic motions for the Ball-Paddling task in [Kober and
Peters, 2009]. Kober and Peters [2009] used kinesthetic teaching to demonstrate
periodic hitting motions in Ball-Paddling and trained rhythmic DMPs to reproduce
the demonstrated periodic movements.
Figure 3.14: Human-robot collaboration tasks from [Maeda et al., 2016]: (a) handing over a plate, (b) handing over a screw, (c) holding the screw driver.
in Figure 3.14. The correlation of the robot’s motion and the human
operator’s motion was learned with interaction ProMPs, which is an
extension of ProMPs proposed by Paraschos et al. [2013]. To achieve
the human-robot collaborative task, the robot motion was planned by
conditioning the learned distribution on the observed motion of the
human operator. Maeda et al. [2016] applied interaction ProMPs to
several tasks as shown in Figure 3.14. The study by Maeda et al. [2016]
showed that the reactive motions of the robot were successfully planned
based on the observed motions of the human operator.
Recent work by Lioutikov et al. [2017] proposed a method for segmenting demonstrated trajectories in a probabilistic manner and learn-
ing a sequence of movement primitives represented by ProMPs. Tasks
that emulate table tennis, writing and chair assembly are reported in
[Lioutikov et al., 2017].
Figure 3.15: Autonomous knot-tying with a surgical robot [Osa et al., 2017b]. Left:
Bimanual manipulation tasks were learned using a model-free BC method. Right:
The trajectories can be updated in real time when the context is changing during
task execution. The demonstration was performed under various contexts, and the
trajectory distribution was modeled using a Gaussian Process. A force controller was built as an outer loop of the standard PD position controller.
Figure 3.18: Applications of DAGGER [Ross et al., 2011]. Left: Learning to play
a video game [Ross et al., 2011]. Right: Learning autonomous UAV flight [Ross
et al., 2013]. The UAV flew autonomously in real forest environments. In DAGGER, the learner complements initial demonstrations by querying an expert online for demonstrations specifically for states induced by the learner's policy.
4 Inverse Reinforcement Learning
2006a], recent policy search methods can also be used. For example,
Finn et al. [2016b] employed guided policy search [Levine and Abbeel,
2014], and Ho and Ermon [2016] and Ho et al. [2016] employed trust
region policy optimization [Schulman et al., 2015].
Model-free:
  Advantages: Applicable to systems with nonlinear and unknown dynamics.
  Disadvantages: It is necessary to sample many trajectories to estimate the trajectory distribution.
Model-based:
  Advantages: Estimation of the trajectory distribution is data-efficient.
  Disadvantages: Model learning can be very difficult. It is hard to apply to underactuated systems.
Table 4.2: Objectives to obtain the unique solution in inverse reinforcement learn-
ing. The concept of maximizing the margin between the optimal policy and others
was popular in the early studies on IRL. The maximum entropy principle is a dom-
inant choice for recent IRL methods.
Objectives Employed by
Maximum margin [Ng and Russell, 2000, Abbeel and Ng, 2004,
Ratliff et al., 2006b,a, 2009, Silver et al., 2010,
Zucker et al., 2011]
Maximum entropy [Ziebart et al., 2008, Ramachandran and Amir,
2007, Choi and Kim, 2011b, Ziebart, 2010,
Boularias et al., 2011, Kitani et al., 2012,
Shiarlis et al., 2016, Ho and Ermon, 2016, Finn
et al., 2016b]
Other [Doerr et al., 2015, Arenz et al., 2016]
Linear reward:
  Model-free: [Boularias et al., 2011, Kalakrishnan et al., 2013]
  Model-based: [Abbeel and Ng, 2004, Ratliff et al., 2006b, Silver et al., 2010, Ramachandran and Amir, 2007, Choi and Kim, 2011b, Ziebart et al., 2008, Ziebart, 2010, Levine and Koltun, 2012, Hadfield-Menell et al., 2016]
Nonlinear reward:
  Model-free: [Finn et al., 2016b, Ho and Ermon, 2016]
  Model-based: [Ratliff et al., 2006a, 2009, Silver et al., 2010, Grubb and Bagnell, 2010, Levine et al., 2011]
4.4 Model-Based Inverse Reinforcement Learning Methods
Abbeel and Ng [2004] defined the feature expectation of a policy π as
μ(π) = E[ ∑_{t=0}^{T} γ^t φ(x_t) | π ] ∈ R^k.    (4.3)
Using this notation, the value of a policy can be rewritten as

E[R | π] = w^⊤ μ(π),    (4.4)
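For illustration, the sketch below approximates the feature expectation in (4.3) by averaging discounted feature sums over sampled trajectories; the feature map and the trajectories are illustrative assumptions.

```python
# A sketch of the feature expectation in (4.3): mu(pi) is approximated by the
# discounted sum of state features averaged over sampled trajectories. The
# feature map and the trajectories are illustrative assumptions.
import numpy as np

gamma, k = 0.95, 2

def phi(x):
    """Toy state features: the position and its square."""
    return np.array([x, x ** 2])

def feature_expectation(trajectories):
    """mu(pi) ~ (1/N) sum_n sum_t gamma^t phi(x_t^n)."""
    mu = np.zeros(k)
    for traj in trajectories:
        discount = 1.0
        for x in traj:
            mu += discount * phi(x)
            discount *= gamma
    return mu / len(trajectories)

rng = np.random.default_rng(0)
demo_trajectories = [np.cumsum(rng.normal(0.1, 0.05, size=20)) for _ in range(5)]
mu_expert = feature_expectation(demo_trajectories)
print(mu_expert)   # compared against mu(pi^L) of the learner in apprenticeship learning
```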
where L(τ ) is the loss function. If the loss function L(τ ) is large, the
cost difference between the demonstrated trajectory and other trajec-
tories is large. Since we need to consider only the minimizer of the
right-hand side of (4.5), (4.5) can be rewritten as
Likewise, if the loss function L(τ) is linear in μ, the loss of a trajectory is given by L(τ) = l^⊤ μ, where l ∈ R^{|X||U|} is the loss vector. Given a training set D = {F_i, τ_i, l_i}_{i=1}^{N}, the problem of finding w can be formalized as a quadratic program:

min_{w, ζ_i}  (1/2) ∥w∥² + (1/N) ∑_{i=1}^{N} ζ_i    (4.7)
s.t. ∀i,  w^⊤ φ_i(τ_i) ≤ min{ w^⊤ φ_i(τ) − l_i^⊤ μ } + ζ_i,    (4.8)
which Ratliff et al. [2009] call the maximum margin objective where
λ > 0 is the regularization parameter.
For solving this problem, a method based on subgradients is used
in Ratliff et al. [2006b]. MMP assumes access to a MDP solver that
returns the optimal trajectory by solving the problem
where C(τ) is the cumulative cost of the trajectory τ. MMP uses the loss-augmented cost map C̃(τ) = C(τ) − L(τ) to plan the trajectory. Algorithm 16 summarizes the procedure of MMP.
The MMP framework was extended to LEARCH (LEArning to
seaRCH), which is a framework for learning nonlinear cost functions
efficiently [Ratliff et al., 2009, Silver et al., 2010, Zucker et al., 2011].
In LEARCH, exponential functional gradient descent was used for op-
timizing the maximum margin planning objective.
The policy obtained in MMP is based on efficient MDP solvers,
which generate deterministic optimal policies. However, robotic sys-
tems with large configuration space dimensionality often require a
∇L_ME(w) = E_{π^E}[φ(τ)] − ∑_τ p(τ | w) φ(τ) = E_{π^E}[φ(τ)] − ∑_{x_i} D_{x_i} φ(x_i),    (4.18)

where D_{x_i} is the state visitation frequency obtained by summing the frequencies over time steps, D_{x_i} = ∑_t D_{x_i,t}.
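The following sketch illustrates (4.18) end to end for a small tabular MDP: a soft (maximum entropy) policy is computed for the current reward w^⊤φ by a backward pass, the state visitation frequencies D_x are obtained by a forward pass, and w is updated with the gradient. The MDP, the features, and the demonstrations are illustrative assumptions.

```python
# A sketch of the MaxEnt IRL gradient (4.18) for a small tabular MDP: a soft
# policy is computed for the current reward w^T phi, state visitation
# frequencies D_x are obtained by a forward pass, and the gradient is the
# difference between empirical and expected feature counts. The MDP,
# features, and demonstrations are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 2, 15
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(x'|x,u)
Phi = rng.normal(size=(n_states, 3))                                # state features phi(x)
p0 = np.full(n_states, 1.0 / n_states)                              # initial state distribution

def soft_policy(w):
    """Backward pass (soft value iteration) for reward r(x) = w^T phi(x)."""
    r = Phi @ w
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + P @ V                    # Q[x,u] = r(x) + sum_x' p(x'|x,u) V(x')
        V = np.logaddexp.reduce(Q, axis=1)        # soft maximum over actions
    return np.exp(Q - V[:, None])                 # pi(u|x) proportional to exp(Q)

def visitation_frequencies(pi):
    """Forward pass: D_x = sum_t D_{x,t} under policy pi."""
    D_t = p0.copy()
    D = np.zeros(n_states)
    for _ in range(horizon):
        D += D_t
        # next-state distribution: sum_x sum_u D_t(x) pi(u|x) p(x'|x,u)
        D_t = np.einsum('x,xu,xuy->y', D_t, pi, P)
    return D

# Empirical feature counts from (assumed) demonstrated state sequences.
demos = [rng.choice(n_states, size=horizon) for _ in range(10)]
empirical = np.mean([Phi[d].sum(axis=0) for d in demos], axis=0)

w = np.zeros(3)
for _ in range(100):                              # plain gradient ascent on the log-likelihood
    D = visitation_frequencies(soft_policy(w))
    grad = empirical - D @ Phi                    # eq. (4.18)
    w += 0.05 * grad
print(w)
```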
where H(ut |u1:t−1 , x1:t ) is the conditional entropy and p(u1:t , x1:t ) is
the joint distribution over all states and actions until time step t. Con-
trary to the conditional entropy H(u1:T |x1:T ), that is implicitly used
in standard max-ent IRL, the causal entropy H(u1:T ||x1:T ) conditions
action choices at time step t only on states until time step t, while the
conditional entropy would make the action choice also dependent on
future states (i.e., it ignores the causality).
Under the assumption that the system is Markovian,
p(xt |x1:t−1 , u1:t−1 ) reduces to p(xt |xt−1 , ut−1 ), and π(ut |x1:t , u1:t−1 )
reduces to π(ut |xt ). Causal entropy can be maximized using dynamic
programming [Ziebart, 2010] resulting in equations similar to those
found in soft value-iteration methods.
The approach of Shiarlis et al. [2016] modifies the maximum causal entropy IRL [Ziebart, 2010] opti-
mization problem so that the optimized policy favors trajectories with
features which are dissimilar to the features found in failed demonstra-
tions
max_{π^L(u|x), w, z}  H(u_{1:T} ∥ x_{1:T}) + ∑_{k=1}^{K} w_k z_k − (λ/2) ∥w∥²    (4.22)

subject to
E_{π^L(u|x)}[φ(τ_S)] = E_{π^E}[φ(τ_S^demo)],
E_{π^L(u|x)}[φ(τ_F)] − E_{π^E}[φ(τ_F^demo)] = z_k,
∑_u π^L(u|x) = 1,   π^L(u|x) ≥ 0,
where λ is a constant, K is the number of features, and w are fea-
ture weights to optimize. While the original maximum causal entropy
approach used only features of successful demonstrations φ(τ_S^demo), the approach of Shiarlis et al. [2016] also uses failed demonstration features φ(τ_F). The term ∑_{k=1}^{K} w_k z_k favors large distances between policy-generated features and features in failed demonstrations. The term (λ/2)∥w∥² is a reg-
ularization term to keep w small enough. In order to find a solution to
the program in Equation 4.22, Shiarlis et al. [2016] performs gradient
ascent to find the feature weights while incrementally decreasing λ until
hitting a λ threshold. The idea in this procedure is to first emphasize
finding good weights for successful demonstrations and then focus on
finding weights for failed demonstrations.
The IRL problem with MAP inference can be formulated as finding the reward function R_MAP that maximizes the posterior

R_MAP = argmax_R p(R | D) = argmax_R [ ln p(D | R) + ln p(R) ],    (4.26)
inversion where the size of the matrix depends on input space size.
In robotics and other application fields, exact dynamics models are of-
ten difficult to come by. Model-free IRL methods sidestep this problem by not requiring such prior knowledge. Model-free IRL methods often
employ sampling-based approaches to estimate the trajectory distribu-
tion. Although this approach requires many samples of trajectories in
the learning process, it avoids the explicit learning of system dynamics.
where E_{π^E}[φ_i(τ)] is the empirical expectation of the ith feature vector calculated from demonstrations, E_{π^L}[φ_i(τ)] = ∑_τ p(τ) φ_i(τ) is the expectation of the feature vector with respect to the learner's policy, k is the number of features, T is a set of feasible trajectories, and the threshold ε_i is calculated by using Hoeffding's bound. The Lagrangian
of this problem is given by

L_RE(p, w, η) = ∑_τ p(τ) ln( p(τ) / q_0(τ) ) − w^⊤( ∑_τ p(τ) φ(τ) − E_{π^E}[φ(τ)] ) − ∑_{i=1}^{k} |w_i| ε_i + η( ∑_{τ∈T} p(τ) − 1 ).    (4.34)
¹GAIL [Ho and Ermon, 2016] cannot be fully classified as an IRL approach since GAIL does not recover the reward function. However, we introduce the study [Ho and Ermon, 2016] in the IRL section since it is relevant to the concept of IRL.
L_GA = E_{π_θ^L}[ ln(D_w(x, u)) ] − E_{π^E}[ ln(1 − D_w(x, u)) ] − λ H(π_θ^L).    (4.36)
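The discriminator update behind (4.36) can be sketched with a simple logistic discriminator trained to separate learner samples from expert samples; the policy optimization step used in GAIL (e.g., trust region policy optimization) is omitted here, and the sample distributions are illustrative assumptions.

```python
# A sketch of the discriminator step underlying (4.36): a logistic
# discriminator D_w(x, u) is trained to separate learner samples from expert
# samples; the policy update used in GAIL is omitted. The sample
# distributions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 4                                    # samples per policy, dimension of (x, u) features

expert_data = rng.normal(loc=0.5, scale=1.0, size=(n, dim))    # (x, u) pairs from pi^E
learner_data = rng.normal(loc=-0.5, scale=1.0, size=(n, dim))  # (x, u) pairs from pi^L

X = np.vstack([learner_data, expert_data])
y = np.concatenate([np.ones(n), np.zeros(n)])       # D_w should output ~1 on learner samples

w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(500):                                # logistic regression by gradient descent
    D = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (D - y) / len(y)
    grad_b = np.mean(D - y)
    w -= lr * grad_w
    b -= lr * grad_b

D_learner = 1.0 / (1.0 + np.exp(-(learner_data @ w + b)))
D_expert = 1.0 / (1.0 + np.exp(-(expert_data @ w + b)))
# The two expectation terms appearing in the objective (4.36):
print(np.mean(np.log(D_learner)), np.mean(np.log(1.0 - D_expert)))
```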
Figure 4.1: Illustration of many IRL approaches. Such IRL methods iteratively
estimate the reward function to make the demonstrations appear more optimal
than the current policy, then update the policy under the new reward function, and
execute the policy virtually or physically to get more samples which the reward
function attempts to distinguish.
in the system dynamics. For this reason, Dvijotham and Todorov [2010]
proposed to use the trajectory distribution induced by the passive dy-
namics p(xt+1 |xt ) of the system as the KL divergence term p0 (τ ) of
the cost function. Kalakrishnan et al. [2013] also approximated a trajec-
tory distribution using trajectories sampled from the system dynamics.
These methods consider the passive dynamics of the system in their
problem formulation.
The relative entropy IRL approach by Boularias et al. [2011] at-
tempts to minimize the KL divergence DKL (p(τ )||p0 (τ )), with feature
matching constraints. By using importance sampling, the expected fea-
ture counts are approximated without prior knowledge of the system
dynamics. Since the trajectories sampled from the actual system fol-
low the system dynamics, we can consider that the expected feature
counts approximated using importance sampling implicitly encode the
system dynamics. Arenz et al. [2016] use the M-projection to obtain
the data state distribution analytically, and then use the I-projection
to obtain the policy given the analytic model of the data distribution.
Methods that directly try to minimize the KL to the data distribution
DKL (p(τ )||q demo (τ )), where q demo (τ ) is the trajectory distribution in-
duced by the expert policy, have not been widely researched in imitation
learning to our knowledge. However, some recent research shows that
any f -divergence can be minimized [Nowozin et al., 2016] in GANs and
given the close connection to IOC methods we expect that investiga-
tions into this area may be profitable.
the demonstrations, Section 4.7.2 then discusses the case when the ex-
pert makes partial observations when performing demonstrations, Sec-
tion 4.7.3 describes how IRL can be framed as a partially observable
Markov decision process, and Section 4.7.4 discusses a model for opti-
mizing the behavior of both the expert and learner when the reward
function is partially observable.
Usually the basic premise in IRL is that the expert observes the world
state fully. However, similarly to the learner, the expert may only
partially observe the world when demonstrating the task. Thus in-
stead of an MDP model a partially observable Markov decision process
(POMDP) model is needed for the expert. The formal POMDP model
is identical to the MDP model except that a POMDP additionally
includes observation probabilities conditioned on the next state and
current action. Policy computation for POMDPs is challenging com-
pared to MDPs. The same applies to IRL in POMDPs [Choi and Kim,
2011a]. Choi and Kim [2011a] extend classical IRL algorithms [Ng and
Russell, 2000, Abbeel and Ng, 2004] to two different POMDP settings:
1) learning from a given expert’s policy and 2) learning from expert
Inverse reinforcement learning has been used for tasks such as parsing sentences [Neu and Szepesvári, 2009], car driving [Abbeel and Ng, 2004], path planning [Ratliff et al., 2006b, Silver et al., 2010, Zucker et al., 2011], and robot motions [Boularias et al., 2011, Finn et al., 2016b].
First, we review applications of model-based inverse reinforcement
learning methods. Since model-based IRL methods assume that the
dynamics of the system is available, they have been applied to prob-
lems where the system dynamics is completely known such as a driv-
ing simulator. Thereafter, we review applications of model-free inverse
reinforcement learning methods. Since model-free IRL methods do not
require prior knowledge of the system dynamics, they can be applied to
robotic tasks where the dynamics of a manipulator is hard to obtain.
Figure 4.2: Screen shot of the driving simulator used in [Abbeel and Ng, 2004]. A
time-invariant policy was learned using a model-based IRL method. Experimental
results show that a different driving style can be learned using different demonstration data.
Ratliff et al. [2006b], Silver et al. [2010] apply maximum margin plan-
ning (MMP) and LEARCH for finding a path with minimum accu-
mulated cost (see Figure 4.3). Interestingly, from raw perceptual data,
lattice planners can be taught human-like rough terrain driving more efficiently compared to manually programmed behavior [Silver et al., 2010]. LEARCH learns the cost as a function of features and the op-
timal path can be found by using classic motion planning methods on
the recovered cost function. The features of the MDP are based on
visual (images/lidar) input as shown in Figure 4.4. The learned cost
Figure 4.3: The learning to search (LEARCH) approach for identifying a cost func-
tion has been applied to various robotic applications including learning rough terrain
navigation from sensor data. The approach iterates between building a discrimina-
tive classifier between states visited by the learner and the demonstrator, updating
the cost function with the discriminative classifier, and then using classical path
planning methods to identify a new proposed optimal plan.
Figure 4.4: Examples of path planning with LEARCH [Silver et al., 2010]. Top
figures show the satellite images and the bottom figures show the costs. The cost
function evolves from left to right in the learning process. The red line represents the
example path and the green represents the current plan. The learned cost function
reproduces paths more similar to the example path as the learning evolves. The
upper set of images shows the raw visual (camera) data being interpreted by the
learner, the lower images show the interpretation in terms of costs (white expensive,
dark low-cost).
Figure 4.5: Learning house-keeping tasks in [Finn et al., 2016b]. Tasks that require
a nonlinear reward function and a complex policy were learned using guided cost
learning.
We have surveyed the state of the art in imitation learning for robotics.
Although imitation learning has progressed rapidly, it is clear that there
are still many problems and challenges which need to be investigated.
In this section, we highlight open questions and technical challenges in
imitation learning.
Since the purpose and target applications of imitation learning are very
broad, benchmarking imitation learning methods can be challenging.
The following open questions are related to performance evaluation in
imitation learning.