An Algorithmic Perspective on Imitation Learning
1 Introduction
  1.1 Key successes in Imitation Learning
  1.2 Imitation Learning from the Point of View of Robotics
  1.3 Differences between Imitation Learning and Supervised Learning
  1.4 Insights for Machine Learning and Robotics Research
  1.5 Statistical Machine Learning Background
    1.5.1 Notation and Mathematical Formalization
    1.5.2 Markov Property
    1.5.3 Markov Decision Process
    1.5.4 Entropy
    1.5.5 Kullback-Leibler (KL) Divergence
    1.5.6 Information and Moment Projections
    1.5.7 The Maximum Entropy Principle
    1.5.8 Background: Reinforcement Learning
  1.6 Formulation of the Imitation Learning Problem
3 Behavioral Cloning
  3.1 Problem Statement
  3.2 Design Choices for Behavioral Cloning
    3.2.1 Choice of Surrogate Loss Functions for Behavioral Cloning
      3.2.1.1 Quadratic Loss Function
      3.2.1.2 ℓ1-Loss Function
      3.2.1.3 Log Loss Function
      3.2.1.4 Hinge Loss Function
      3.2.1.5 Kullback-Leibler Divergence
Acknowledgements
References
Abstract
DOI: 10.1561/2300000053.
1 Introduction
are ideal for applications where robots work alongside people, such as collaborating with human operators and reducing the physical workload of caregivers. These applications require efficient, intuitive ways for domain experts, who may not possess special skills or knowledge about robotics, to teach robots the motions they need to perform.
In recent years, imitation learning has been investigated as a way to efficiently and intuitively program autonomous behavior [Schaal, 1999, Argall et al., 2009, Billard et al., 2008, Billard and Grollman, 2013, Bagnell, 2015, Billard et al., 2016]. In imitation learning, a human demonstrates how to perform a task, and a robotic system learns a policy to execute the given task by imitating the demonstrated motions. Numerous imitation learning methods have been developed, and imitation learning has become a vast field of research; capturing the entire field is therefore not a trivial task. The purpose of this survey is to provide a structural understanding of existing imitation learning methods and their relationship with other fields, from supervised learning to control theory. We describe what has been developed in the field of imitation learning and what should be developed in the future.
1.1 Key successes in Imitation Learning

One of the earliest and most well-known imitation learning success sto-
ries was the autonomous driving project Autonomous Land Vehicle In
a Neural Network (ALVINN) at Carnegie Mellon University [Pomer-
leau, 1988]. In ALVINN, a neural network learned how to map input
images to discrete actions in order to drive a vehicle. ALVINN’s neu-
ral network had one hidden layer with five units. Its input layer had
30 by 32 units; its output layer had 30 units. Although the structure
of this network was simple compared to modern neural networks with
millions of parameters, the system succeeded at driving autonomously
across the North American continent.
The Kendama robot developed by Miyamoto et al. [1996] is an-
other successful application of imitation learning. In the early days
of imitation learning, roboticists were mainly interested in teaching
1.2 Imitation Learning from the Point of View of Robotics
General Aspects:
1. Why and when should imitation learning be used? This
question clarifies the motivation for using imitation learning and
what we should do with it.
Figure 1.1: Observations y and control inputs u for imitation learning in (a) helicopter flight, (b) surgery, and (c) locomotion. Motion planning is formulated in different ways in these examples. (a) Learning of acrobatic RC helicopter maneuvers [Abbeel et al., 2010]: the trajectories for acrobatic flights are learned from a human expert's demonstrations, and iterative learning control is used to control the system with highly nonlinear dynamics. (b) Learning with a teleoperated system [Osa et al., 2014] where a position/velocity controller is available: to generalize the trajectory to different situations, a mapping from task situations to trajectories is learned from demonstrations under various situations. (c) Learning quadruped robot locomotion [Zucker et al., 2011]: footstep planning is addressed as optimization of a reward/cost function recovered from the expert demonstrations; learning the reward/cost function allows the footstep planning strategy to be generalized to different terrains.
Robotics researchers have developed many imitation learning methods for motion planning and robot
control. When planning a trajectory for a robotic system, it is often
necessary to make sure that a planned trajectory satisfies some con-
straints such as smooth convergence to a new goal state. For this rea-
son, robotics researchers have developed “custom” trajectory represen-
tations that explicitly satisfy constraints necessary for robotic appli-
cations. Machine learning techniques are often used as a part of such
frameworks. However, robotics researchers need to be aware that a rich set of algorithms has been developed by the machine learning community and that some of these new algorithms might eliminate the need for customizing the policy or trajectory representation.
For machine learning researchers, imitation learning offers interest-
ing practical and theoretical problems, which differ from standard su-
pervised and reinforcement learning settings. Although imitation learn-
ing is closely related to structured prediction, it is often challenging to
apply existing machine learning methods to imitation learning, especially in robotic applications. In imitation learning, collecting demonstra-
tions and performing rollouts are often expensive and time-consuming.
Therefore, it is necessary to consider how to minimize these costs and
perform learning efficiently. In addition, embodiments and observabil-
ity of the learner and the expert are different in many applications. In
such cases, the demonstrated motion needs to be adapted based on the
learner’s embodiment and observability. These difficulties in imitation
learning present new challenges to machine learning researchers.
Table 1.1: Table of Notation. We use a notation common in the control literature for states and controls.

x       system state
s       context
φ       feature vector
u       control input/action
τ       trajectory
π       policy
D       dataset of demonstrations
q       probability distribution induced by an expert's policy
p       probability distribution induced by a learner's policy
t       time
T       finite horizon
N       number of demonstrations
E       superscript representing an expert, e.g. π^E denotes an expert's policy
L       superscript representing a learner, e.g. π^L denotes a learner's policy
demo    superscript representing a demonstration by an expert, e.g. τ^demo denotes a trajectory demonstrated by an expert
1.5.4 Entropy
Given a random variable x and its probability distribution p(x), the entropy is defined as

H(p) = −∫ p(x) ln p(x) dx.    (1.1)
E_p[φ(x)] = E_q[φ(x)], which holds true for typical distributions from the exponential family such as the
Gaussian distribution, which is the maximum entropy distribution that
matches first and second order moments. The notion of Maximum En-
tropy generalizes to Maximum Causal Entropy, which turns out to be
a natural notion of uncertainty for dynamical systems [Ziebart et al.,
2013].
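As a quick numerical illustration of the entropy definition in (1.1), the sketch below integrates −p(x) ln p(x) on a grid for a one-dimensional Gaussian and compares the result with the standard closed form 0.5 ln(2πeσ²); the integration limits and the value of σ are arbitrary choices made for illustration.

```python
# A minimal numerical check of the entropy definition in (1.1) for a
# one-dimensional Gaussian; the closed form 0.5*ln(2*pi*e*sigma^2) is the
# standard reference value. The grid limits and sigma are arbitrary choices.
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def differential_entropy(pdf, lo=-10.0, hi=10.0, n=20001):
    """Approximate H(p) = -integral p(x) ln p(x) dx on a grid."""
    x = np.linspace(lo, hi, n)
    p = pdf(x)
    integrand = np.where(p > 0, p * np.log(p), 0.0)  # p ln p, with 0 ln 0 := 0
    return -np.trapz(integrand, x)

sigma = 1.5
numeric = differential_entropy(lambda x: gaussian_pdf(x, sigma=sigma))
closed_form = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)
print(numeric, closed_form)  # the two values agree to several decimals
```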
V π (xt ) is often called the value function [Sutton and Barto, 1998].
Likewise, the value of taking action u in state x under a policy π can
be computed as the expected reward when starting from the action u
in a state x and thereafter following policy π
Q^π(x, u) = E[ ∑_{t=0}^{∞} γ^t r_t | x_0 = x, u_0 = u, π ].    (1.10)
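To make (1.10) concrete, the following sketch estimates Q^π(x, u) by Monte Carlo rollouts in a small randomly generated MDP; the dynamics, reward, and policy used here are illustrative assumptions rather than anything defined in the text.

```python
# A small sketch that estimates Q^pi(x, u) in (1.10) by Monte Carlo rollouts
# in a toy, randomly generated MDP; the MDP and the policy are illustrative
# assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random dynamics p(x'|x,u), reward r(x,u), and a fixed stochastic policy pi(u|x).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # (x, u, x')
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # r(x, u)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi(u|x)

def q_monte_carlo(x0, u0, n_rollouts=500, horizon=100):
    """Estimate Q^pi(x0, u0) = E[sum_t gamma^t r_t | x0, u0, pi]."""
    returns = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        x, u, discount = x0, u0, 1.0
        for _ in range(horizon):
            returns[i] += discount * R[x, u]
            x = rng.choice(n_states, p=P[x, u])   # sample next state
            u = rng.choice(n_actions, p=pi[x])    # sample next action from pi
            discount *= gamma
    return returns.mean()

print(q_monte_carlo(x0=0, u0=1))
```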
2 Design of Imitation Learning Algorithms

2.1 Design Choices for Imitation Learning Algorithms
Figure 2.1: A ski jumper flies through the air using the highly aerodynamic “V-
style”. “V-style” was adopted by most ski jumpers in the 1990s after some jumpers
demonstrated impressive results with the style (public domain picture from Wiki-
media Commons).
As one can see above, these design choices are not independent, and the order of these design choices is flexible. For example, the choice of similarity measure between policies is related to the choice of policy representation. In the following sections, we present possible options for some of these design choices.
where J(π̂) is the expectation of the accumulated reward given the pol-
icy π as in (1.7). However, the reward function is considered unknown
and needs to be recovered from expert demonstrations under the as-
sumption that the demonstrations are (approximately) optimal w.r.t.
this reward function. Recovering the reward function from demonstra-
tions is often referred to as Inverse Reinforcement Learning (IRL) [Rus-
sell, 1998] or Inverse Optimal Control (IOC) [Moylan and Anderson,
1973].
BC and IRL form two major classes of imitation learning methods.
In order to select one of BC and IRL, it is essential to consider what is
the most parsimonious description of the desired behavior? The policy
perts are available. For this reason, behavioral cloning methods which
learn a direct mapping from states/contexts to actions have focused on
model-free methods until recent years.
For motion planning of underactuated systems, it is often neces-
sary to plan a feasible trajectory by considering the system dynamics.
It can be challenging to use model-free BC methods to learn trajec-
tories in such underactuated systems where the reachable states are
limited. However, recent IRL work by Boularias et al. [2011], Finn
et al. [2016b], Ho and Ermon [2016] shows how one can learn skills
in underactuated systems through iterative rollouts without explicitly
learning a dynamics model.
Model-based imitation learning methods attempt to learn a policy
that reproduces the demonstrated behavior by learning/using the sys-
tem dynamics, e.g. a forward model of the system. This property can
be critical especially for underactuated robots. Since underactuation
limits the number of reachable states, it is essential to take into ac-
count the dynamics of the system when planning feasible trajectories.
Moreover, the prior knowledge of the system dynamics makes inverse
reinforcement learning easier since the learner’s performance can be
easily predicted when the system dynamics is known. However, in a

Model-free:
  Advantages: A policy can be learned without learning/estimating the system dynamics.
  Disadvantages: The prediction of future states is difficult. The system dynamics is only implicitly considered in the resulting policy.
Model-based:
  Advantages: The learning process can be data-efficient. A learned policy satisfies the system dynamics.
  Disadvantages: Model learning can be difficult. Computationally expensive.
Figure 2.2: Diagram of general imitation learning. The learner cannot directly
observe the expert’s policy in many problems. Instead, a set of trajectories induced
by the expert’s policy is available in imitation learning. The learner estimates the
policy that reproduces the expert’s behavior using the given demonstrations. Please
note that the process of querying the demonstration and updating the learner’s
policy can be interactive.
2.4 Observability
When the state of the system is fully observable, we can obtain a tra-
jectory as a sequence of the state and the control input as
τ = [x0 , u0 , x1 , u1 , . . . , xT , uT ]. (2.3)
For instance, both the state and the control inputs are observable in a
teleoperated system in [Abbeel et al., 2010, van den Berg et al., 2010,
Osa et al., 2014, Ross et al., 2011], although the observations can be noisy.
When the control inputs are not observable, a trajectory is given as a sequence of states

τ = [x_0, x_1, . . . , x_T].    (2.4)

When the system state itself cannot be observed directly, a trajectory is given as a sequence of observations

τ = [y_0, y_1, . . . , y_T].    (2.5)
These cases need to be taken into account when deciding on the im-
itation learning approach for a specific application. When the expert
observes the system state partially, the expert demonstrations can be-
come sub-optimal requiring careful consideration. Moreover, when the
expert observes the learner, the learner may have more information
about its own embodiment. For example, if a human expert uses kines-
thetic teaching to show how to grasp an object, the demonstration may
be sub-optimal for a robot learner if the expert does not see what the
robot observes.
In imitation learning, the expert is often assumed to behave opti-
mally. However, this optimality is often based on partial observations
which may differ significantly from the observations of the learner. For
example, if the human expert performs a motion which goes around
an obstacle which the robot learner does not observe, a robot learner
learns to perform a similar circumnavigation motion even when there
are no obstacles. Moreover, when the learner only partially observes what the expert observes, the learner can make wrong predictions about the policy underlying the expert's behavior.
At the task level, a policy maps the state x_t and context s to a sequence of options

π : x_t, s ↦ [o_1, . . . , o_T],    (2.6)

and at the trajectory level, a policy maps a context s to a trajectory τ as

π : s ↦ τ.    (2.7)
BC methods such as DMP [Schaal et al., 2004, Ijspeert et al., 2013]
and ProMP [Paraschos et al., 2013, Maeda et al., 2016] learn such
trajectory-based policies.
At the action-state space level, a policy maps states of the system x_t and contexts s to control inputs u_t as

π : x_t, s ↦ u_t.    (2.8)
BC methods such as [Chambers and Michie, 1969, Pomerleau, 1988,
Khansari-Zadeh and Billard, 2011, Ross et al., 2011] and IRL methods
such as [Abbeel and Ng, 2004, Ziebart et al., 2008, Boularias et al.,
2011, Finn et al., 2016b] learn policies in action-state space. These
abstractions are summarized in Table 2.3.
Existing imitation learning methods can be categorized based on task abstractions as shown in Table 2.4. The table displays an abundance of model-free methods for trajectory learning. On the contrary, many model-based IRL methods have been developed with action-state space abstractions. Since commercially available robotic manipulators
often have a position/velocity controller, model-free methods are pre-
ferred for trajectory planning in such systems. This is especially pro-
nounced in motion planning methods designed for robotic manipulators
Table 2.3: Abstraction and the related policy in imitation learning. In a task-level abstraction, the policy maps from the initial state x_0 to a sequence of discrete options, where an option at time step t is denoted with o_t. In a trajectory-level abstraction, the policy maps from an initial state x_0 to a trajectory τ. In an action-state space abstraction, the policy maps from the current state x_t to a control u_t.

Task-level abstraction            π : x_0, s ↦ [o_1, . . . , o_T]
Trajectory-based abstraction      π : x_0, s ↦ τ
Action-state space abstraction    π : x_t, s ↦ u_t
Figure 2.3: Illustration of the relationships between basic policy classes. Stationar-
ity is a special case of non-stationarity and determinism is a special case of stochas-
ticity. We use the terms “stationary” and “time-invariant” interchangeably. Likewise,
“non-stationary” and “time-variant” are used interchangeably. Please see § 2.5.4 for
more details.
p(τ) = p(x_0) ∏_{t=1}^{T} p(x_{t+1} | x_t, u_t) π(u_t | x_t).    (2.13)
The expectation of the feature vector can be approximated using the N demonstrations as

E_{p(τ)}[φ(τ)] ≃ (1/N) ∑_{i=1}^{N} φ(τ_i^demo).    (2.15)
Figure 2.4: Illustration of M- and I- projections from the data manifold onto the
policy model manifold. The solutions of M- and I- projections are different since the
KL divergence is not symmetric.
p(τ | w) = exp(w^⊤ φ(τ)) / Z.    (2.22)
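The following sketch evaluates the maximum entropy trajectory distribution (2.22) on a small discrete set of candidate trajectories with an explicit partition function Z; the feature map and the trajectory set below are illustrative assumptions.

```python
# A minimal sketch of the maximum entropy trajectory distribution in (2.22):
# p(tau|w) = exp(w^T phi(tau)) / Z, evaluated on a small discrete set of
# candidate trajectories. The feature map and the trajectory set are
# illustrative assumptions.
import numpy as np

def features(tau):
    """Toy trajectory features: total path length and final position."""
    tau = np.asarray(tau, dtype=float)
    return np.array([np.abs(np.diff(tau)).sum(), tau[-1]])

candidate_trajectories = [
    [0.0, 0.5, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, -0.5, 1.0],
    [0.0, 0.0, 0.0],
]

w = np.array([-1.0, 2.0])   # prefers short paths ending near 1.0
scores = np.array([w @ features(tau) for tau in candidate_trajectories])
Z = np.exp(scores).sum()    # partition function
p = np.exp(scores) / Z      # p(tau|w) from (2.22)
print(p, p.sum())           # a valid distribution over the candidates
```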
Substituting the form of p(τ|w) from (2.22) into the original maximum entropy problem and ignoring terms which do not depend on the parameters w, the resulting dual objective function (or equivalently
Here, the data is induced via the distribution q(τ ) on the right-hand
side of the KL, while in the maximum entropy principle, the data is
induced by the feature averages and p0 (τ ) on the right-hand side of
the KL is just a prior. The I-projection does not match features of
the demonstrator. Whenever an algorithm matches average features,
it is an instance of an M-projection based algorithm. Since ln q(τ ) is
unknown and hard to evaluate in practice, it is challenging to perform
the I-projection in the context of imitation learning. To the best of our
knowledge, there is no existing imitation learning method that performs
the I-projection exactly.
As we have seen from our discussion above, many imitation learning
methods can be seen as related to the M-projection and to the principle
of maximum entropy. This is true for most model-free and model-based
methods. Model-free methods based on standard supervised learning
[Ijspeert et al., 2013, Khansari-Zadeh and Billard, 2011] do not require
access to the system dynamics or iterative data acquisition.
In contrast, model-based imitation learning methods often try to
match features of the state distribution so as to satisfy Ep [φ(τ )] =
Eq [φ(τ )]. In order to do so, we either need access to the system dy-
namics [Ziebart et al., 2008, Ziebart, 2010] or require iterative data
acquisition [Boularias et al., 2011].
where the policy π(ut |xt ) maps from the states of the system to the
control inputs. Let us consider the trajectory distribution p(τ ) induced
by the learner’s policy and the trajectory distribution q(τ ) induced by
the expert’s policy. If the embodiments of the learner and the expert
are equivalent and stationary, that is, q(xt+1 |xt , ut ) = p(xt+1 |xt , ut ) =
p(xt |xt−1 , ut−1 ), the relation of p(τ ) and q(τ ) is given by
p(τ) / q(τ) = ∏_{t=0}^{T} π^L(u_t | x_t) / ∏_{t=0}^{T} π^E(u_t | x_t),    (2.31)

D_KL(p(τ) || q(τ)) = ∫ p(x, u) ln( π^L(u|x) / π^E(u|x) ) dx du    (2.35)
                  = E_p[ ln π^L(u|x) − ln π^E(u|x) ].    (2.36)
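Equation (2.36) suggests a simple Monte Carlo estimator of the divergence between learner and expert: sample state-action pairs from the learner and average the log ratio of the two policies. The tabular policies and state distribution in the sketch below are illustrative assumptions.

```python
# A sketch of the Monte Carlo estimator suggested by (2.36): sample
# state-action pairs from the learner and average ln pi^L(u|x) - ln pi^E(u|x).
# The tabular policies and the state distribution are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
pi_L = rng.dirichlet(np.ones(n_actions), size=n_states)   # learner policy pi^L(u|x)
pi_E = rng.dirichlet(np.ones(n_actions), size=n_states)   # expert policy pi^E(u|x)
state_dist = np.array([0.5, 0.3, 0.2])                     # assumed state distribution under the learner

# Monte Carlo estimate of E_p[ln pi^L(u|x) - ln pi^E(u|x)]
n_samples = 20000
xs = rng.choice(n_states, size=n_samples, p=state_dist)
us = np.array([rng.choice(n_actions, p=pi_L[x]) for x in xs])
kl_mc = np.mean(np.log(pi_L[xs, us]) - np.log(pi_E[xs, us]))

# Exact value for this tabular case, for comparison
kl_exact = sum(state_dist[x] * pi_L[x, u] * (np.log(pi_L[x, u]) - np.log(pi_E[x, u]))
               for x in range(n_states) for u in range(n_actions))
print(kl_mc, kl_exact)
```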
3 Behavioral Cloning

3.1 Problem Statement
Figure 3.1: Control diagram of a robotic system with imitation learning. An expert demonstrates the desired behavior, generating a dataset D. Based on D and observations of the current context and system state, an upper-level controller generates the desired trajectory τ_d. A lower-level feedback controller tries to follow τ_d using observation feedback to generate a control u, which causes a change to the system state x and a new observation. In imitation learning, the controllers are tuned to imitate the expert demonstrations.
The quadratic loss function is the most common choice for the loss function. Given two vectors x_1 and x_2, a quadratic loss function is given by

ℓ_2(x_1, x_2) = ∥x_1 − x_2∥².

The quadratic loss function is also called the ℓ2-loss function, and regression that minimizes the quadratic loss function is often called least squares (LS) regression or ℓ2-loss minimization [Sugiyama, 2015].
Minimizing the quadratic loss function is closely related to maxi-
mizing the expected log likelihood under the Gaussian distribution as-
sumption. Let us consider the regression function fθ (x) parameterized
by θ. Suppose that the target variable y follows the model

y = f_θ(x) + ε,    (3.5)

where ε is zero-mean Gaussian noise. The conditional distribution of y is then given by

p(y | x, θ) = (1/√(2πσ)) exp( −(y − f_θ(x))² / (2σ) ).    (3.6)
Finding the model fθ (x) that maximizes the expected log likelihood
can be formulated as
argmax_θ E[log p] = argmax_θ E[ log exp( −(y − f_θ(x))² / (2σ) ) ]    (3.7)
                 = argmin_θ E[(y − f_θ(x))²]    (3.8)
                 ≈ argmin_θ (1/N) ∑_i (y_i − f_θ(x_i))².    (3.9)
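A minimal behavioral cloning sketch under the quadratic loss (3.9) is given below: a linear policy u = θ^⊤x is fit to demonstrated state-action pairs by ordinary least squares. The linear expert and the synthetic demonstrations are illustrative assumptions.

```python
# A minimal behavioral cloning sketch under the quadratic loss (3.9): fit a
# linear policy u = theta^T x to demonstrated state-action pairs by least
# squares. The expert and the data generation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_demos, state_dim = 200, 4
theta_expert = rng.normal(size=state_dim)                 # hypothetical expert parameters

X = rng.normal(size=(n_demos, state_dim))                 # demonstrated states
U = X @ theta_expert + 0.05 * rng.normal(size=n_demos)    # noisy demonstrated actions

# argmin_theta (1/N) sum_i (u_i - theta^T x_i)^2 has the closed-form
# least-squares solution computed below.
theta_bc, *_ = np.linalg.lstsq(X, U, rcond=None)
print(np.allclose(theta_bc, theta_expert, atol=0.05))     # recovered up to noise
```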
The ℓ1-loss function is often employed for regression. The ℓ1-loss function is given by

ℓ_abs(x_1, x_2) = ∑_i |x_{1,i} − x_{2,i}|,    (3.12)

where x_{1,i} and x_{2,i} are the ith elements of the vectors x_1 and x_2, respectively. The ℓ1-loss function is also called the absolute loss function, and regression that minimizes the ℓ1-loss is called least absolute deviations regression or ℓ1-loss minimization [Sugiyama, 2015]. Usually,
Since the log loss is equivalent to the cross entropy, the log loss is also
called the cross-entropy loss [Sugiyama, 2015].
In binary classification (in imitation learning, classification can be used to learn a discrete control policy from expert demonstrations), minimizing the log loss function is equivalent to maximizing the log likelihood in logistic regression. In more detail, suppose that we want to learn a binary classifier where the probability follows the Bernoulli distribution
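The sketch below trains such a discrete-action behavioral cloning policy by minimizing the log loss, i.e., logistic regression on demonstrated state-action pairs; the expert policy used to generate the data is an illustrative assumption.

```python
# A sketch of behavioral cloning with the log loss: logistic regression that
# maps states to a binary action, trained by gradient descent on the
# cross-entropy. The expert policy used to generate data is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n_demos, state_dim = 500, 3
w_expert = np.array([1.5, -2.0, 0.5])

X = rng.normal(size=(n_demos, state_dim))
p_expert = 1.0 / (1.0 + np.exp(-X @ w_expert))
U = (rng.uniform(size=n_demos) < p_expert).astype(float)   # demonstrated discrete actions

w = np.zeros(state_dim)
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - U) / n_demos        # gradient of the mean log loss
    w -= lr * grad

print(w, w_expert)   # w roughly recovers the expert parameters
```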
Table 3.1: Regression methods in model-free behavioral cloning for both trajectory
and action-state space learning. The output trajectory in trajectory learning consists
of a long high dimensional sequence of variables while in action-state space learning
the output is a single action. Therefore, some methods such as look-up tables have
not been applied to trajectory learning. For modeling uncertainty in demonstrations,
regression methods need to have explicit support for variance. Gaussian model,
GMM and GPR methods model uncertainty explicitly.
Table 3.2: A main choice when doing behavioral cloning is whether to use a model-based or a model-free method. Model-free methods can directly learn a policy from data without learning a dynamics model. Direct learning also usually means that the learning algorithm does not need to iterate between trajectory and behavior generation. However, model-free methods are hard to apply to underactuated systems, since without a model, predicting the desired behavior is hard. Model-based methods may work in underactuated systems, but learning the actual model can in many cases be difficult.

Model-free:
  Advantages: A policy can usually be learned without iterative learning.
  Disadvantages: Hard to apply to underactuated systems. Hard to predict future states.
Model-based:
  Advantages: Applicable to underactuated systems.
  Disadvantages: Model learning can be very difficult. An iterative learning process is often required.
Using neural networks for learning has attracted great interest in various fields. Supervised learning of neural networks can also be used for imitation learning: the desired neural network policy can be learned from the dataset generated/demonstrated by the expert. In this section, we briefly highlight some recent imitation learning successes with neural networks.
Recently, using neural networks for imitation learning has shown im-
pressive results in certain applications such as learning to play Go [Sil-
ver et al., 2016], generating handwriting [Chung et al., 2015], gener-
ating natural language [Wen et al., 2015], or generating image cap-
tions [Karpathy and Fei-Fei, 2015]. Moreover, supervised learning of
neural networks has been used as a building block for example for
learning the policy or the cost function in inverse reinforcement learn-
ing (please see §4.4.6 for more details).
Figure 3.2: The game of Go is played on a 19×19 board. Even though the total number of possible board configurations exceeds 10^170, and thus the training data cannot cover all possible plays, the simple imitation learning approach in [Silver et al., 2016] was able to learn a competitive policy from demonstrations and improve the policy using self-play. [Figure from https://commons.wikimedia.org/wiki/File:Tuchola_026.jpg. CC license.]
Table 3.3: Natural language generated by the semantically controlled LSTM (SC-
LSTM) cell neural network proposed in [Wen et al., 2015]. The table shows an
example dialogue act and related natural language samples from [Wen et al., 2015].
The neural network generates natural language learned from human demonstrations.
The neural network is conditioned on the dialogue act which limits the generated
sentences to specific meanings.
Dialogue act:
inform(name=”red door cafe”, goodformeal=”breakfast”,
area=”cathedral hill”, kidsallowed=”no”)
Generated samples:
red door cafe is a good restaurant for breakfast in the area
of cathedral hill and does not allow children .
red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow children .
red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow kids .
red door cafe is good for breakfast and is in the area of cathedral hill
and does not allow children .
red door cafe does not allow kids and is in the cathedral hill area
and is good for breakfast .
The SC-LSTM cell is based on a long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] network. Wen et al. [2015] train their system
using data collected from a spoken dialogue system. Table 3.3 shows an
example of natural language generated by the trained neural network.
As is common when designing neural network based systems, the
neural network architecture in [Wen et al., 2015] is adapted to the
task at hand. Moreover, neural network approaches need to take prob-
lems such as vanishing gradients, co-adaptation, and overfitting into
account. Vanishing gradients can be a problem especially for recurrent
neural networks due to the high optimization depth. The neural net-
work architecture in [Wen et al., 2015] includes skip connections [Graves
et al., 2013] to soften vanishing gradients and Wen et al. [2015] utilize
dropout [Srivastava et al., 2014], a technique which randomly deac-
tivates connections in the neural network during training, to reduce
co-adaptation and overfitting.
Learning recurrent neural networks from demonstrations has been
shown to work also for other kinds of data. Karpathy and Fei-Fei [2015]
show how to learn to generate annotations for image regions from
demonstrations. The approach of [Karpathy and Fei-Fei, 2015] learns
from a combination of image and language data to generate natural lan-
guage descriptions of images. Chung et al. [2015] show how to learn to
generate handwriting and natural speech from demonstrations. Chung
et al. [2015] propose a new type of recurrent neural network with hid-
den random variables and argue that random variables are needed to
model variability in data with complex correlations between different
time steps, for example, in natural speech.
Figure 3.3: An overview of DAGGER from [Bagnell, 2015]. In each iteration, DAGGER generates new examples using the current policy with corrections (labels) provided by the experts, adds the new demonstrations to a demonstration dataset, and computes a new policy to optimize performance in aggregate over that dataset. The figure illustrates a single iteration of DAGGER. The basic version of DAGGER initializes the demonstration dataset from a single set of expert demonstrations and then interleaves policy optimization and data generation to grow the dataset. More generally, there is nothing special about aggregating data: any method, like gradient descent or weighted majority, that is sufficiently stable in its policy generation and does well on average over the iterations (or, more broadly, any no-regret algorithm run over each iteration's dataset) will achieve the same guarantees, and may be strongly preferred for computational reasons.
Algorithm: DAGGER
Initialize π_1^L.
for i = 1 to N do
    Let π_i = β_i π^E + (1 − β_i) π_i^L.
    Sample trajectories τ = [x_0, u_0, . . . , x_T, u_T] using π_i.
    Get dataset D_i of states visited by π_i and actions given by the expert.
    Aggregate datasets: D ← D ∪ D_i.
    Train the policy π_{i+1}^L on D.
end for
return the best π_i^L on validation.
No-regret online learning algorithms are asymptotically good on average over the datasets they are presented with, and are sufficiently stable between iterations [Hazan, 2016].
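To make the loop above concrete, the following sketch runs DAGGER-style data aggregation with a linear least-squares learner in a toy linear system; the dynamics, the expert feedback gains, and the β_i schedule are illustrative assumptions rather than anything prescribed by the text.

```python
# A compact sketch of the DAGGER loop from the pseudocode above, using a
# linear least-squares learner in a toy linear system. The dynamics, the
# expert controller, and the beta schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
state_dim, horizon, n_iters, n_rollouts = 2, 30, 5, 10
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K_expert = np.array([[4.0, 2.5]])                 # hypothetical expert feedback gains

def expert(x):
    return -(K_expert @ x)

def rollout(policy, beta):
    """Run the mixture policy beta*expert + (1-beta)*learner, label with the expert."""
    states, labels = [], []
    x = rng.normal(size=(state_dim, 1))
    for _ in range(horizon):
        u = beta * expert(x) + (1 - beta) * policy(x)
        states.append(x.ravel())
        labels.append(expert(x).ravel())          # expert action for the visited state
        x = A @ x + B @ u + 0.01 * rng.normal(size=(state_dim, 1))
    return states, labels

D_states, D_labels = [], []
K_learner = np.zeros((1, state_dim))
for i in range(n_iters):
    beta = 1.0 if i == 0 else 0.0                 # simple beta_i schedule
    learner = lambda x: -(K_learner @ x)
    for _ in range(n_rollouts):
        s, l = rollout(learner, beta)
        D_states += s
        D_labels += l
    # Aggregate the dataset and refit the learner by least squares: u ~ -K x.
    S, L = np.array(D_states), np.array(D_labels)
    K_learner = -np.linalg.lstsq(S, L, rcond=None)[0].T

print(K_learner, K_expert)
```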
Data as Demonstrator: Venkatraman et al. [2015] extended
DAGGER and proposed a framework called Data as Demonstrator
(DaD) where the problem of multi-step prediction is formulated as im-
itation learning. Prediction errors will cascade over time in multi-step
prediction as in the case of learning a policy, and this prediction error
can also be improved by a data aggregation approach. Recent work
shows the efficacy of DaD in control problems [Venkatraman et al.,
2016].
π : S ↦ T.    (3.20)

λ* = argmax_λ p(Y′ | λ).    (3.21)
where x_0 denotes the initial position and M the number of the basis functions. The Gaussian basis function ψ_i(z) is given by

ψ_i(z) = exp(−h_i (z − c_i)²) / ∑_{j=1}^{N} exp(−h_j (z − c_j)²),    (3.26)
where hi and ci are constants that determine the width and centers of
the basis functions, respectively. This system represents stable attractor
where xdemo (t), ẋdemo (t), ẍdemo (t) are the position, velocity and accel-
eration at the time t, respectively. Subsequently, we can find the weight
vector w that minimizes the sum of the squared error
L_DMP = ∑_{t=0}^{T} ( f_target(t) − ξ(t) Ψ w )²,    (3.28)

where ξ(t) = (g − x_0) z(t) for the discrete system and ξ(t) = 1 for the rhythmic system, and the entries of Ψ are computed as Ψ_ij = ψ_i(t_j) with (3.25). The weight vector w can be obtained as the least-squares solution

w = (Ψ^⊤ Ψ)^{-1} Ψ^⊤ F.    (3.29)
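The weight fit in (3.26)-(3.29) amounts to ordinary least squares on a basis matrix; a minimal sketch is given below, where the target forcing values F come from a synthetic signal rather than from demonstrated positions, velocities, and accelerations, which is an illustrative simplification.

```python
# A sketch of the DMP weight fit in (3.26)-(3.29): build the basis matrix
# with rows indexed by time, Psi[j, i] = psi_i(t_j), and solve the
# least-squares problem for w. The target forcing values F here are a
# synthetic assumption; in practice they are computed from the demonstrated
# positions, velocities, and accelerations as described in the text.
import numpy as np

n_basis, n_steps = 15, 200
t = np.linspace(0.0, 1.0, n_steps)               # phase / time samples t_j
c = np.linspace(0.0, 1.0, n_basis)               # basis centers c_i
h = np.full(n_basis, 0.5 * n_basis ** 2)         # basis widths h_i

def psi(z):
    """Normalized Gaussian basis functions, eq. (3.26)."""
    act = np.exp(-h * (z - c) ** 2)              # shape (n_basis,)
    return act / act.sum()

Psi = np.array([psi(tj) for tj in t])            # shape (n_steps, n_basis)

F = np.sin(2.0 * np.pi * t) + 0.3 * t            # assumed target forcing term f_target(t_j)

# Least-squares solution (3.29): w = (Psi^T Psi)^{-1} Psi^T F
w = np.linalg.solve(Psi.T @ Psi, Psi.T @ F)
print(np.max(np.abs(Psi @ w - F)))               # reconstruction error of the fitted forcing term
```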
where x_s^new and x_g^new are the new start and goal states, and M is a linear operator that defines the inner product in the Hilbert space. When time
ẋ = f (x), (3.44)
where x is the system state, and f is a function that governs the be-
havior of the system. Khansari-Zadeh and Billard [2011], Gribovskaya
et al. [2011] learn the function f as a GMM.
Let us define x as the state vector of the system. When a set of
demonstrated trajectories is given, the joint distribution of x and ẋ can
be estimated from the observations using a GMM. The kth component
of the GMM models the distribution p(x, ẋ|k) as
p(x, ẋ | k) ∼ N( [x; ẋ] | [μ_{x,k}; μ_{ẋ,k}], [Σ_{x,k}, Σ_{xẋ,k}; Σ_{ẋx,k}, Σ_{ẋ,k}] ),    (3.45)

where semicolons separate the rows of the stacked mean vector and block covariance matrix.
where

h_k(x) = p(k) p(x|k) / ∑_{i=1}^{K} p(i) p(x|i) = π_k N(x | μ_{x,k}, Σ_{x,k}) / ∑_{i=1}^{K} π_i N(x | μ_{x,i}, Σ_{x,i}),    (3.47)
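Gaussian mixture regression then predicts ẋ for a query state x by weighting each component's conditional mean with the responsibilities in (3.47). The sketch below uses a hand-specified two-component mixture as an illustrative assumption; in practice the parameters are fitted to the demonstrations with EM.

```python
# A sketch of Gaussian mixture regression for xdot = f(x): the responsibilities
# h_k(x) follow (3.47), and each component contributes the standard Gaussian
# conditional mean mu_xdot,k + Sigma_xdotx,k Sigma_x,k^{-1} (x - mu_x,k).
# The mixture parameters below are illustrative assumptions.
import numpy as np

K, dim = 2, 1   # two components, one-dimensional state for readability
priors = np.array([0.5, 0.5])
mu_x         = [np.array([-1.0]), np.array([1.0])]
mu_xdot      = [np.array([0.8]),  np.array([-0.8])]
Sigma_x      = [np.array([[0.3]]), np.array([[0.3]])]
Sigma_xdot_x = [np.array([[0.1]]), np.array([[-0.1]])]

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = x - mu
    k = len(mu)
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(Sigma))

def predict_velocity(x):
    x = np.atleast_1d(x)
    weights = np.array([priors[k] * gauss_pdf(x, mu_x[k], Sigma_x[k]) for k in range(K)])
    h = weights / weights.sum()                   # responsibilities h_k(x), eq. (3.47)
    xdot = np.zeros(dim)
    for k in range(K):
        cond_mean = mu_xdot[k] + Sigma_xdot_x[k] @ np.linalg.solve(Sigma_x[k], x - mu_x[k])
        xdot += h[k] * cond_mean
    return xdot

print(predict_velocity(-0.5), predict_velocity(0.5))
```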
Figure 3.6: Trajectory transfer using non-rigid registration [Schulman et al., 2013].
Table 3.5: Generalization of skills using existing methods. DMPs enable stable con-
vergence to arbitrary goal positions. ProMPs can generalize trajectories by Gaussian
conditioning, but there is no guarantee of stable behavior. SEDS can generalize tra-
jectories while guaranteeing stable behavior, but cannot model time dependence
of movements. Trajectory transfer using non-rigid registration can achieve complex
generalization, but does not incorporate stochasticity of demonstrations and there
is no guarantee of stable behavior.
DMP [Schaal et al., 2004, Ijspeert et al., 2013]
  Generalizable context: start and goal positions
  Advantages: guarantee of stable behavior
  Disadvantages: limited generalization capabilities

ProMP [Paraschos et al., 2013, Maeda et al., 2016]
  Generalizable context: any subset of the observations of the system
  Advantages: generalization based on stochasticity of demonstrations
  Disadvantages: no guarantee of stable behavior

SEDS [Khansari-Zadeh and Billard, 2011, 2014]
  Generalizable context: state of the system with fixed dimensionality
  Advantages: generalization with guarantee of stable behavior
  Disadvantages: no time-dependence

Way points with non-rigid registration [Schulman et al., 2013]
  Generalizable context: a point cloud of the given scene
  Advantages: generalization based on point clouds of a given scene
  Disadvantages: stochasticity of demonstrations is not considered
Originally developed for speech recognition, DTW is frequently used to deal with the
time alignment of trajectories in robotics. The original formulation of
DTW finds the best time alignment of two data sequences. However,
we often obtain more than two demonstrations, and we need to align
all of them appropriately in the time domain.
In the field of imitation learning, Coates et al. [2008] proposed
a method to normalize the time alignment of multiple demonstrated
trajectories. Similar approaches appear in applications such as au-
tonomous helicopter flight [Abbeel et al., 2010] and automation of
robotic surgery [van den Berg et al., 2010, Osa et al., 2014]. Here,
we review the method employed by van den Berg et al. [2010].
van den Berg et al. [2010] regarded the demonstrated trajecto-
ries as noisy ’observations’ of the ’reference’ trajectories. The refer-
ence trajectory and the time mapping from the reference trajectory to
the demonstrated trajectory are computed using the EM (Expectation
Maximization)-algorithm.
The linear system is described as

ξ(t + 1) = [A, B; 0, I] ξ(t) + w(t),    w(t) ∼ N( 0, [P, 0; 0, Q] ),    (3.50)
where ξ(t) = [x⊤ (t), u⊤ (t)]⊤ is the state and the control input of the
system at time t, A and B are the state matrix and the input matrix,
respectively. w(t) is the noise that follows the zero-mean Gaussian dis-
tribution. P and Q are the covariance matrices of process noise and
observation noise, respectively. If we assume that the jth demonstrated trajectory τ_j is given by τ_j = [x_j(0), u_j(0), · · · , x_j(T_j), u_j(T_j)], the
the motions of two agents using DMPs and learned the correlations of
the distribution of the motion parameters. When one agent’s motion
is observed, the motion of the other agent can be predicted based on
Gaussian conditioning.
Likewise, ProMPs have also been used to learn the correlation of
multiple agents’ motion. Maeda et al. [2016] developed an imitation
learning framework called Interaction ProMP to learn coupled motions
in human-robot collaboration. In the framework of Interaction ProMP,
correlated movements are learned as a distribution of the correlated
weight vectors of ProMPs. Using a partial observation of the movement,
unobserved movements are estimated as a conditional distribution of
the weight vectors on the given partial observation.
Here, we describe details of Interaction ProMP. Suppose demon-
strations of human robot collaborative movements are given. Here, we
define the state vector as a concatenation of the P DoFs executed by
the human, followed by the Q DoFs executed by the robot
x(t) = [x_h(t); x_r(t)],    (3.53)

where

H^⊤(t) = diag(Ψ^⊤(t), . . . , Ψ^⊤(t)),    (3.55)

Ψ^⊤(t) is an M × 2 matrix defined as in (3.35), and M is the number of basis functions. When a trajectory of a human-robot collaborative movement is demonstrated, the weight vector ω̄ can be learned as

ω̄ = [(ω_1^h)^⊤, . . . , (ω_P^h)^⊤, (ω_1^r)^⊤, . . . , (ω_Q^r)^⊤]^⊤.    (3.56)
Figure 3.8: Overview of Interaction ProMPs in [Maeda et al., 2016]. In the interac-
tion ProMP framework, correlated movements are learned as the joint distribution
of weight vectors of ProMPs. Thanks to the probabilistic modeling of the trajectory
distribution, the interaction ProMP framework works with noisy observations of
trajectories [Maeda et al., 2016]. In this figure, ω̄ represents the weight vector that
contains movements of all DoFs controlled by the robot and the human operator as
defined in (3.56).
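The conditioning step can be sketched as follows: given a joint Gaussian over the stacked human and robot weight vectors ω̄, the robot part is predicted by standard Gaussian conditioning on the observed human part. The joint mean and covariance below are illustrative assumptions; in the actual framework they are estimated from the demonstrated collaborative trajectories.

```python
# A sketch of the conditioning step behind Interaction ProMPs: given a joint
# Gaussian over stacked human and robot weight vectors, the robot part is
# predicted by conditioning on the observed human part. The joint mean and
# covariance below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_r = 3, 3                                    # human / robot weight dimensions

# Assumed joint distribution N(mu, Sigma) over omega_bar = [omega_h; omega_r].
A_ = rng.normal(size=(d_h + d_r, d_h + d_r))
Sigma = A_ @ A_.T + 0.1 * np.eye(d_h + d_r)        # a valid covariance matrix
mu = rng.normal(size=d_h + d_r)

mu_h, mu_r = mu[:d_h], mu[d_h:]
S_hh, S_hr = Sigma[:d_h, :d_h], Sigma[:d_h, d_h:]
S_rh, S_rr = Sigma[d_h:, :d_h], Sigma[d_h:, d_h:]

omega_h_obs = rng.normal(size=d_h)                 # weights fitted to the observed human motion

# Gaussian conditioning: p(omega_r | omega_h_obs)
mu_r_cond = mu_r + S_rh @ np.linalg.solve(S_hh, omega_h_obs - mu_h)
S_r_cond = S_rr - S_rh @ np.linalg.solve(S_hh, S_hr)
print(mu_r_cond)           # mean robot weights used to generate the robot trajectory
print(np.diag(S_r_cond))   # remaining uncertainty
```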
ments [Shukla and Billard, 2012, Lukic et al., 2014, Kim et al., 2014].
Shukla and Billard [2012] developed a framework for learning coupled
movement based on DS, which they call the Coupled Dynamical Sys-
tem (CDS) model. The idea of CDS is to model the correlation between
two agents using statistical models.
Let us assume that two agents, which we call the master and the slave, perform a coupled motion. The correlation of the movement of the master x_m and the movement of the slave x_s can be modeled with CDS.
In CDS, three GMMs are trained to model three joint distributions:
1) the joint distribution of the master movement p(x_m, ẋ_m),
2) the joint distribution of the states of the master and the desired state of the slave p(Φ(x_m), x_s^d),
3) the joint distribution of the slave movement p(x̃_s, ẋ_s),
where x̃_s = x_s − x_s^d and x_s^d is the desired state of the slave. To ensure the stability of the system, SEDS is used to model these three joint distributions [Khansari-Zadeh and Billard, 2011]. The function Φ(·) maps x_m to the same dimensionality as x_s. This mapping is necessary because SEDS can handle only models in which the inputs and outputs have the same dimensionality [Shukla and Billard, 2012].
The reproduction of learned motions is performed by repeating three steps: First, the movement of the master is planned using p(x_m, ẋ_m). Subsequently, the state of the slave is estimated based on p(x_s^d | Φ(x_m)). Third, the motion of the slave is planned based on p(x_s, ẋ_s). These steps are repeated until the system converges to the goal position. The CDS approach has been applied to learn the correlation between the arm and fingers [Shukla and Billard, 2012, Kim et al., 2014], or the eye and arm [Lukic et al., 2014].
Several BC methods support incremental learning. In [Calinon and Billard, 2007], GMMs are initialized with
trajectories demonstrated by a human wearing a motion sensor. Subse-
quently, the motion of the humanoid robot is modified through kines-
thetic teaching by a human coach. Through this iterative process, the
model of the trajectory distribution is improved incrementally. The
method in [Calinon and Billard, 2007] is summarized in Algorithm
7. The method in [Lee and Ott, 2011] used a similar representation
by combining GMMs with HMMs. In the framework of [Lee and Ott,
2011], the compliance of a robot manipulator is controlled in order to
represent an area where motion refinement is allowed. However, the
method in [Calinon and Billard, 2007] does not address the context of the task. Therefore, the generalization of the demonstrated trajectories to new situations is not considered. Recent follow-up work [Havoutis
and Calinon, 2017] addressed the online learning and the adaptation
of the skill to new contexts by combining an optimal control approach
and TP-GMM in [Calinon, 2015].
Ewerton et al. [2016] used ProMPs for incremental imitation with
generalization to different contexts. Ewerton et al. [2016] parameterize trajectories with ProMPs as p(τ|w). To generalize the demonstrated
trajectories to new contexts, the joint distribution of trajectory param-
eters and the Gaussian context p(w, s) is incrementally learned under
the supervision of a human. Given a new context snew , the trajec-
tory is planned as a conditional distribution p(τ |snew ). The method
in [Ewerton et al., 2016] which is suitable for incremental learning of
where ẋmod is the velocity with the local modulation and ẋini is the
velocity given by the initial dynamical system. The local modulation
is represented by scaling and rotation of the original dynamics in the
framework of [Kronander et al., 2015]. Therefore, the modulation function is given by
Figure 3.10: Learning a hierarchical skill in [Kroemer et al., 2015]. Left: A sequence of skills is modeled using a variant of an HMM. Right: The learned DMPs can be adapted to different objects.
plan that switches from one DMP to another based on the observations.
Kroemer et al. [2015] learn DMPs using imitation learning and optimize
high-level policies using reinforcement learning. Kroemer et al. [2015]
demonstrate the approach in robotic manipulation tasks as shown in
Figure 3.10.
Although it is often assumed that a sufficient amount of demonstra-
tion data is available, this may not be the case in many applications.
Incremental imitation learning for task-level planning proposed by
Niekum et al. [2014] can address this issue. The framework in [Niekum
et al., 2014] leverages unstructured demonstrations and corrective ac-
Figure 3.11: Mutual language model between motion and sequence in [Takano
and Nakamura, 2015](Figure used with permission of Wataru Takano). Relevance
between words and motion is learned using a probabilistic model. The approach
can work in two directions: generating sentences from motion or generating motion
from sentences. When motion is observed, a motion language semantic graph model
generates words for the observed motion. A natural language model arranges the
words then into sentences. When observing language, it is segmented into words using a natural language model, and the words are then transformed into motion using a semantic graph.
and then plan trajectories based on the learned forward model. Forward
dynamics model learning can be framed as a regression problem. Ta-
ble 3.6 lists different regression methods which have been utilized in
model-based BC. Although locally weighted regression and Gaussian
mixture regression were used in early studies of model-based methods,
recent studies often employ Gaussian Processes. As we will review in
§3.7.1.2, Gaussian Processes can incorporate inputs with uncertainty.
This property is important for multi-step forward prediction since the
uncertainty is propagated over time. However, due to the computational
cost, Gaussian Process regression is not suitable for high-dimensional
data. To deal with high-dimensional data such as raw images, a deep learning approach is employed for modeling the forward dynamics in the most recent studies [Oh et al., 2015, Finn et al., 2017a, Baram et al.,
2017, Nair et al., 2017]. In the following sections, we review some of
the model-based methods with explicit learning of a forward model.
Table 3.6: Model-based behavioral cloning methods using different regression meth-
ods. Early studies on model-based behavior cloning focused on locally weighted
regression but later studies have moved to Gaussian mixture regression and even
more recently to Gaussian processes. We expect that studies based on deep neural
networks will be popular in the near future.
where p(k) is the prior and the kth Gaussian component is given by
p(x_{t+1}, z_t | k) = N( [z_t; x_{t+1}] | [μ_{z,k}; μ_{x,k}], [Σ_{z,k}, Σ_{zx,k}; Σ_{xz,k}, Σ_{x,k}] ),    (3.65)
where

μ_{x|z,k} = μ_{x,k} + Σ_{xz,k} (Σ_{z,k} + Σ^in)^{-1} (z*_t − μ_{z,k}),
Σ_{k,t+1} = Σ_{x,k} − Σ_{xz,k} (Σ_{z,k} + Σ^in)^{-1} Σ_{zx,k},    (3.69)
w_k = p(k) N(z*_t | μ_{z,k}, Σ_{z,k} + Σ^in) / ∑_{k=1}^{K} p(k) N(z*_t | μ_{z,k}, Σ_{z,k} + Σ^in).
Grimes and Rao [2009] used this GMR for one-step prediction and
recursively predicted learner’s trajectories. Using the learned forward
model, the action is selected so as to maximize the posterior likelihood
as
f(z_t) ∼ GP( m(z_t), k(z_t, z′_t) ),    (3.71)
For two given Gaussian distributions p(x(t)) ∼ N (x|µp (t), Σp (t)) and
q(x(t)) ∼ N (x|µq (t), Σq (t)), the KL divergence of q and p can be com-
puted in closed form. Using the factorization in (3.76), the KL diver-
gence between the trajectory distribution induced by the expert policy
q(τ ) and the trajectory distribution induced by the learned policy p(τ )
can be computed as
D_KL( q(τ) || p(τ) ) = ∑_{t=1}^{T} D_KL( q(x(t)) || p(x(t)) ),    (3.77)
where q(τ ) is the expert trajectory distribution and p(τ ) is the trajec-
tory distribution induced by the learner’s policy. The learning process
of BC methods with forward dynamics can be illustrated as Figure 3.12.
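The per-time-step terms in (3.77) are KL divergences between Gaussians, which are available in closed form. A minimal self-contained sketch with assumed means and covariances:

```python
# The closed-form KL divergence between two multivariate Gaussians used in
# (3.77); the means and covariances below are illustrative assumptions.
import numpy as np

def kl_gaussians(mu_q, Sigma_q, mu_p, Sigma_p):
    """D_KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) )."""
    d = len(mu_q)
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(Sigma_p_inv @ Sigma_q)
                  + diff @ Sigma_p_inv @ diff
                  - d
                  + np.log(np.linalg.det(Sigma_p) / np.linalg.det(Sigma_q)))

mu_q, Sigma_q = np.zeros(2), np.eye(2)
mu_p, Sigma_p = np.array([0.5, -0.2]), np.array([[1.2, 0.1], [0.1, 0.8]])
print(kl_gaussians(mu_q, Sigma_q, mu_p, Sigma_p))   # one per-time-step term in (3.77)
print(kl_gaussians(mu_q, Sigma_q, mu_q, Sigma_q))   # zero for identical Gaussians
```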
In addition, the method in [Englert et al., 2013] assumes that the tra-
3.8 Robot Applications with Model-Free BC Methods
Figure 3.13: Learning rhythmic motions for the Ball-Paddling task in [Kober and
Peters, 2009]. Kober and Peters [2009] used kinesthetic teaching to demonstrate
periodic hitting motions in Ball-Paddling and trained rhythmic DMPs to reproduce
the demonstrated periodic movements.
Figure 3.14: Human-robot collaboration tasks from [Maeda et al., 2016]: (a) handing over a plate, (b) handing over a screw, (c) holding the screw driver.
in Figure 3.14. The correlation of the robot’s motion and the human
operator’s motion was learned with interaction ProMPs, which is an
extension of ProMPs proposed by Paraschos et al. [2013]. To achieve
the human-robot collaborative task, the robot motion was planned by
conditioning the learned distribution on the observed motion of the
human operator. Maeda et al. [2016] applied interaction ProMPs to
several tasks as shown in Figure 3.14. The study by Maeda et al. [2016]
showed that the reactive motions of the robot were successfully planned
based on the observed motions of the human operator.
Recent work by Lioutikov et al. [2017] proposed a method for segmenting demonstrated trajectories in a probabilistic manner and learn-
ing a sequence of movement primitives represented by ProMPs. Tasks
that emulate table tennis, writing and chair assembly are reported in
[Lioutikov et al., 2017].
Figure 3.15: Autonomous knot-tying with a surgical robot [Osa et al., 2017b]. Left:
Bimanual manipulation tasks were learned using a model-free BC method. Right:
The trajectories can be updated in real time when the context is changing during
task execution. The demonstration was performed under various contexts, and the
trajectory distribution was modeled using a Gaussian Process. A force controller was built as an outer loop of the standard PD position controller.
Figure 3.18: Applications of DAGGER [Ross et al., 2011]. Left: Learning to play
a video game [Ross et al., 2011]. Right: Learning autonomous UAV flight [Ross
et al., 2013]. The UAV flew autonomously in real forest environments. In DAGGER, the learner complements initial demonstrations by querying an expert online for demonstrations specifically for states induced by the learner's policy.
4 Inverse Reinforcement Learning
2006a], recent policy search methods can also be used. For example,
Finn et al. [2016b] employed guided policy search [Levine and Abbeel,
2014], and Ho and Ermon [2016] and Ho et al. [2016] employed trust
region policy optimization [Schulman et al., 2015].
Model-free:
  Advantages: Applicable to systems with nonlinear and unknown dynamics.
  Disadvantages: It is necessary to sample many trajectories to estimate the trajectory distribution.
Model-based:
  Advantages: Estimation of the trajectory distribution is data-efficient.
  Disadvantages: Model learning can be very difficult. It is hard to apply to underactuated systems.
Table 4.2: Objectives to obtain the unique solution in inverse reinforcement learn-
ing. The concept of maximizing the margin between the optimal policy and others
was popular in the early studies on IRL. The maximum entropy principle is a dom-
inant choice for recent IRL methods.
Objectives Employed by
Maximum margin [Ng and Russell, 2000, Abbeel and Ng, 2004,
Ratliff et al., 2006b,a, 2009, Silver et al., 2010,
Zucker et al., 2011]
Maximum entropy [Ziebart et al., 2008, Ramachandran and Amir,
2007, Choi and Kim, 2011b, Ziebart, 2010,
Boularias et al., 2011, Kitani et al., 2012,
Shiarlis et al., 2016, Ho and Ermon, 2016, Finn
et al., 2016b]
Other [Doerr et al., 2015, Arenz et al., 2016]
Linear reward:
  Model-free: [Boularias et al., 2011, Kalakrishnan et al., 2013]
  Model-based: [Abbeel and Ng, 2004, Ratliff et al., 2006b, Silver et al., 2010, Ramachandran and Amir, 2007, Choi and Kim, 2011b, Ziebart et al., 2008, Ziebart, 2010, Levine and Koltun, 2012, Hadfield-Menell et al., 2016]
Nonlinear reward:
  Model-free: [Finn et al., 2016b, Ho and Ermon, 2016]
  Model-based: [Ratliff et al., 2006a, 2009, Silver et al., 2010, Grubb and Bagnell, 2010, Levine et al., 2011]
4.4 Model-Based Inverse Reinforcement Learning Methods
Abbeel and Ng [2004] defined the feature expectation of a policy π as
μ(π) = E[ ∑_{t=0}^{T} γ^t φ(x_t) | π ] ∈ R^k.    (4.3)
Using this notation, the value of a policy can be rewritten as

E[R | π] = w^⊤ μ(π),    (4.4)
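For illustration, the sketch below approximates the feature expectation in (4.3) by averaging discounted feature sums over sampled trajectories; the feature map and the trajectories are illustrative assumptions.

```python
# A sketch of the feature expectation in (4.3): mu(pi) is approximated by the
# discounted sum of state features averaged over sampled trajectories. The
# feature map and the trajectories are illustrative assumptions.
import numpy as np

gamma, k = 0.95, 2

def phi(x):
    """Toy state features: the position and its square."""
    return np.array([x, x ** 2])

def feature_expectation(trajectories):
    """mu(pi) ~ (1/N) sum_n sum_t gamma^t phi(x_t^n)."""
    mu = np.zeros(k)
    for traj in trajectories:
        discount = 1.0
        for x in traj:
            mu += discount * phi(x)
            discount *= gamma
    return mu / len(trajectories)

rng = np.random.default_rng(0)
demo_trajectories = [np.cumsum(rng.normal(0.1, 0.05, size=20)) for _ in range(5)]
mu_expert = feature_expectation(demo_trajectories)
print(mu_expert)   # compared against mu(pi^L) of the learner in apprenticeship learning
```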
where L(τ ) is the loss function. If the loss function L(τ ) is large, the
cost difference between the demonstrated trajectory and other trajec-
tories is large. Since we need to consider only the minimizer of the
right-hand side of (4.5), (4.5) can be rewritten as
Likewise, if the loss function L(τ) is linear in μ, the loss of a trajectory is given by L(τ) = l^⊤ μ, where l ∈ R^{|X||U|} is the loss vector. Given a training set D = {F_i, τ_i, l_i}_{i=1}^{N}, the problem of finding w can be formalized as a quadratic program:

min_{w, ζ_i}  (1/2) ∥w∥² + (1/N) ∑_{i=1}^{N} ζ_i    (4.7)
s.t. ∀i,  w^⊤ φ_i(τ_i) ≤ min{ w^⊤ φ_i(τ) − l_i^⊤ μ } + ζ_i,    (4.8)
which Ratliff et al. [2009] call the maximum margin objective where
λ > 0 is the regularization parameter.
For solving this problem, a method based on subgradients is used
in Ratliff et al. [2006b]. MMP assumes access to a MDP solver that
returns the optimal trajectory by solving the problem
where C(τ) is the cumulative cost of the trajectory τ. MMP uses the loss-augmented cost map C̃(τ) = C(τ) − L(τ) to plan the trajectory. Algorithm 16 summarizes the procedure of MMP.
The MMP framework was extended to LEARCH (LEArning to
seaRCH), which is a framework for learning nonlinear cost functions
efficiently [Ratliff et al., 2009, Silver et al., 2010, Zucker et al., 2011].
In LEARCH, exponential functional gradient descent was used for op-
timizing the maximum margin planning objective.
The policy obtained in MMP is based on efficient MDP solvers,
which generate deterministic optimal policies. However, robotic sys-
tems with large configuration space dimensionality often require a
∇L_ME(w) = E_{π^E}[φ(τ)] − ∑_τ p(τ | w) φ(τ) = E_{π^E}[φ(τ)] − ∑_{x_i} D_{x_i} φ(x_i),    (4.18)

where D_{x_i} is the state visitation frequency obtained by summing the frequencies over time steps, D_{x_i} = ∑_t D_{x_i,t}.
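The following sketch illustrates (4.18) end to end for a small tabular MDP: a soft (maximum entropy) policy is computed for the current reward w^⊤φ by a backward pass, the state visitation frequencies D_x are obtained by a forward pass, and w is updated with the gradient. The MDP, the features, and the demonstrations are illustrative assumptions.

```python
# A sketch of the MaxEnt IRL gradient (4.18) for a small tabular MDP: a soft
# policy is computed for the current reward w^T phi, state visitation
# frequencies D_x are obtained by a forward pass, and the gradient is the
# difference between empirical and expected feature counts. The MDP,
# features, and demonstrations are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 5, 2, 15
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(x'|x,u)
Phi = rng.normal(size=(n_states, 3))                                # state features phi(x)
p0 = np.full(n_states, 1.0 / n_states)                              # initial state distribution

def soft_policy(w):
    """Backward pass (soft value iteration) for reward r(x) = w^T phi(x)."""
    r = Phi @ w
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + P @ V                    # Q[x,u] = r(x) + sum_x' p(x'|x,u) V(x')
        V = np.logaddexp.reduce(Q, axis=1)        # soft maximum over actions
    return np.exp(Q - V[:, None])                 # pi(u|x) proportional to exp(Q)

def visitation_frequencies(pi):
    """Forward pass: D_x = sum_t D_{x,t} under policy pi."""
    D_t = p0.copy()
    D = np.zeros(n_states)
    for _ in range(horizon):
        D += D_t
        # next-state distribution: sum_x sum_u D_t(x) pi(u|x) p(x'|x,u)
        D_t = np.einsum('x,xu,xuy->y', D_t, pi, P)
    return D

# Empirical feature counts from (assumed) demonstrated state sequences.
demos = [rng.choice(n_states, size=horizon) for _ in range(10)]
empirical = np.mean([Phi[d].sum(axis=0) for d in demos], axis=0)

w = np.zeros(3)
for _ in range(100):                              # plain gradient ascent on the log-likelihood
    D = visitation_frequencies(soft_policy(w))
    grad = empirical - D @ Phi                    # eq. (4.18)
    w += 0.05 * grad
print(w)
```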
where H(ut |u1:t−1 , x1:t ) is the conditional entropy and p(u1:t , x1:t ) is
the joint distribution over all states and actions until time step t. Con-
trary to the conditional entropy H(u1:T |x1:T ), that is implicitly used
in standard max-ent IRL, the causal entropy H(u1:T ||x1:T ) conditions
action choices at time step t only on states until time step t, while the
conditional entropy would make the action choice also dependent on
future states (i.e., it ignores the causality).
Under the assumption that the system is Markovian,
p(xt |x1:t−1 , u1:t−1 ) reduces to p(xt |xt−1 , ut−1 ), and π(ut |x1:t , u1:t−1 )
reduces to π(ut |xt ). Causal entropy can be maximized using dynamic
programming [Ziebart, 2010] resulting in equations similar to those
found in soft value-iteration methods.
The approach of Shiarlis et al. [2016] modifies the maximum causal entropy IRL [Ziebart, 2010] opti-
mization problem so that the optimized policy favors trajectories with
features which are dissimilar to the features found in failed demonstra-
tions
max_{π^L(u|x), w, z}  H(u_{1:T} ∥ x_{1:T}) + ∑_{k=1}^{K} w_k z_k − (λ/2) ∥w∥²    (4.22)

subject to
E_{π^L(u|x)}[φ(τ_S)] = E_{π^E}[φ(τ_S^demo)],
E_{π^L(u|x)}[φ(τ_F)] − E_{π^E}[φ(τ_F^demo)] = z_k,
∑_u π^L(u|x) = 1,   π^L(u|x) ≥ 0,
where λ is a constant, K is the number of features, and w are fea-
ture weights to optimize. While the original maximum causal entropy
approach used only features of successful demonstrations φ(τ_S^demo), the approach of Shiarlis et al. [2016] also uses failed demonstration features φ(τ_F). The term ∑_{k=1}^{K} w_k z_k favors large distances between policy-generated features and features in failed demonstrations. The term (λ/2)∥w∥² is a reg-
ularization term to keep w small enough. In order to find a solution to
the program in Equation 4.22, Shiarlis et al. [2016] performs gradient
ascent to find the feature weights while incrementally decreasing λ until
hitting a λ threshold. The idea in this procedure is to first emphasize
finding good weights for successful demonstrations and then focus on
finding weights for failed demonstrations.
The IRL problem with MAP inference can be formulated as finding the reward function R_MAP that maximizes the posterior

R_MAP = argmax_R p(R | D) = argmax_R [ ln p(D | R) + ln p(R) ],    (4.26)
inversion where the size of the matrix depends on input space size.
In robotics and other application fields, exact dynamics models are of-
ten difficult to come by. Model-free IRL methods sidestep this problem by not requiring such prior knowledge. Model-free IRL methods often
employ sampling-based approaches to estimate the trajectory distribu-
tion. Although this approach requires many samples of trajectories in
the learning process, it avoids the explicit learning of system dynamics.
where E_{π^E}[φ_i(τ)] is the empirical expectation of the ith feature vector calculated from demonstrations, E_{π^L}[φ_i(τ)] = ∑_τ p(τ) φ_i(τ) is the expectation of the feature vector with respect to the learner's policy, k is the number of features, T is a set of feasible trajectories, and the threshold ε_i is calculated by using Hoeffding's bound. The Lagrangian
of this problem is given by

L_RE(p, w, η) = ∑_τ p(τ) ln( p(τ) / q_0(τ) ) − w^⊤( ∑_τ p(τ) φ(τ) − E_{π^E}[φ(τ)] ) − ∑_{i=1}^{k} |w_i| ε_i + η( ∑_{τ∈T} p(τ) − 1 ).    (4.34)
¹GAIL [Ho and Ermon, 2016] cannot be fully classified as an IRL approach since GAIL does not recover the reward function. However, we introduce the study [Ho and Ermon, 2016] in the IRL section since it is relevant to the concept of IRL.
L_GA = E_{π_θ^L}[ ln(D_w(x, u)) ] − E_{π^E}[ ln(1 − D_w(x, u)) ] − λ H(π_θ^L).    (4.36)
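The discriminator update behind (4.36) can be sketched with a simple logistic discriminator trained to separate learner samples from expert samples; the policy optimization step used in GAIL (e.g., trust region policy optimization) is omitted here, and the sample distributions are illustrative assumptions.

```python
# A sketch of the discriminator step underlying (4.36): a logistic
# discriminator D_w(x, u) is trained to separate learner samples from expert
# samples; the policy update used in GAIL is omitted. The sample
# distributions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 4                                    # samples per policy, dimension of (x, u) features

expert_data = rng.normal(loc=0.5, scale=1.0, size=(n, dim))    # (x, u) pairs from pi^E
learner_data = rng.normal(loc=-0.5, scale=1.0, size=(n, dim))  # (x, u) pairs from pi^L

X = np.vstack([learner_data, expert_data])
y = np.concatenate([np.ones(n), np.zeros(n)])       # D_w should output ~1 on learner samples

w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(500):                                # logistic regression by gradient descent
    D = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (D - y) / len(y)
    grad_b = np.mean(D - y)
    w -= lr * grad_w
    b -= lr * grad_b

D_learner = 1.0 / (1.0 + np.exp(-(learner_data @ w + b)))
D_expert = 1.0 / (1.0 + np.exp(-(expert_data @ w + b)))
# The two expectation terms appearing in the objective (4.36):
print(np.mean(np.log(D_learner)), np.mean(np.log(1.0 - D_expert)))
```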
Figure 4.1: Illustration of many IRL approaches. Such IRL methods iteratively
estimate the reward function to make the demonstrations appear more optimal
than the current policy, then update the policy under the new reward function, and
execute the policy virtually or physically to get more samples which the reward
function attempts to distinguish.
in the system dynamics. For this reason, Dvijotham and Todorov [2010]
proposed to use the trajectory distribution induced by the passive dy-
namics p(xt+1 |xt ) of the system as the KL divergence term p0 (τ ) of
the cost function. Kalakrishnan et al. [2013] also approximated a trajec-
tory distribution using trajectories sampled from the system dynamics.
These methods consider the passive dynamics of the system in their
problem formulation.
The relative entropy IRL approach by Boularias et al. [2011] at-
tempts to minimize the KL divergence DKL (p(τ )||p0 (τ )), with feature
matching constraints. By using importance sampling, the expected fea-
ture counts are approximated without prior knowledge of the system
dynamics. Since the trajectories sampled from the actual system fol-
low the system dynamics, we can consider that the expected feature
counts approximated using importance sampling implicitly encode the
system dynamics. Arenz et al. [2016] use the M-projection to obtain
the data state distribution analytically, and then use the I-projection
to obtain the policy given the analytic model of the data distribution.
Methods that directly try to minimize the KL to the data distribution
DKL (p(τ )||q demo (τ )), where q demo (τ ) is the trajectory distribution in-
duced by the expert policy, have not been widely researched in imitation
learning to our knowledge. However, some recent research shows that
any f -divergence can be minimized [Nowozin et al., 2016] in GANs and
given the close connection to IOC methods we expect that investiga-
tions into this area may be profitable.
the demonstrations, Section 4.7.2 then discusses the case when the ex-
pert makes partial observations when performing demonstrations, Sec-
tion 4.7.3 describes how IRL can be framed as a partially observable
Markov decision process, and Section 4.7.4 discusses a model for opti-
mizing the behavior of both the expert and learner when the reward
function is partially observable.
Usually the basic premise in IRL is that the expert observes the world
state fully. However, similarly to the learner, the expert may only
partially observe the world when demonstrating the task. Thus in-
stead of an MDP model a partially observable Markov decision process
(POMDP) model is needed for the expert. The formal POMDP model
is identical to the MDP model except that a POMDP additionally
includes observation probabilities conditioned on the next state and
current action. Policy computation for POMDPs is challenging com-
pared to MDPs. The same applies to IRL in POMDPs [Choi and Kim,
2011a]. Choi and Kim [2011a] extend classical IRL algorithms [Ng and
Russell, 2000, Abbeel and Ng, 2004] to two different POMDP settings:
1) learning from a given expert’s policy and 2) learning from expert
Inverse reinforcement learning has been used for tasks such as parsing sentences [Neu and Szepesvári, 2009], car driving [Abbeel and Ng, 2004], path planning [Ratliff et al., 2006b, Silver et al., 2010, Zucker et al., 2011], and robot motions [Boularias et al., 2011, Finn et al., 2016b].
First, we review applications of model-based inverse reinforcement
learning methods. Since model-based IRL methods assume that the
dynamics of the system is available, they have been applied to prob-
lems where the system dynamics is completely known such as a driv-
ing simulator. Thereafter, we review applications of model-free inverse
reinforcement learning methods. Since model-free IRL methods do not
require prior knowledge of the system dynamics, they can be applied to
robotic tasks where the dynamics of a manipulator is hard to obtain.
Figure 4.2: Screen shot of the driving simulator used in [Abbeel and Ng, 2004]. A
time-invariant policy was learned using a model-based IRL method. Experimental
results show that a different driving style can be learned using different demonstration data.
Ratliff et al. [2006b], Silver et al. [2010] apply maximum margin plan-
ning (MMP) and LEARCH for finding a path with minimum accu-
mulated cost (see Figure 4.3). Interestingly, from raw perceptual data,
lattice planners can be taught human-like rough terrain driving more efficiently compared to manually programmed behavior [Silver et al., 2010]. LEARCH learns the cost as a function of features and the op-
timal path can be found by using classic motion planning methods on
the recovered cost function. The features of the MDP are based on
visual (images/lidar) input as shown in Figure 4.4. The learned cost
Figure 4.3: The learning to search (LEARCH) approach for identifying a cost func-
tion has been applied to various robotic applications including learning rough terrain
navigation from sensor data. The approach iterates between building a discrimina-
tive classifier between states visited by the learner and the demonstrator, updating
the cost function with the discriminative classifier, and then using classical path
planning methods to identify a new proposed optimal plan.
Figure 4.4: Examples of path planning with LEARCH [Silver et al., 2010]. Top
figures show the satellite images and the bottom figures show the costs. The cost
function evolves from left to right in the learning process. The red line represents the
example path and the green represents the current plan. The learned cost function
reproduces paths more similar to the example path as the learning evolves. The
upper set of images shows the raw visual (camera) data being interpreted by the
learner, the lower images show the interpretation in terms of costs (white expensive,
dark low-cost).
Figure 4.5: Learning house-keeping tasks in [Finn et al., 2016b]. Tasks that require
a nonlinear reward function and a complex policy were learned using guided cost
learning.
We have surveyed the state of the art in imitation learning for robotics.
Although imitation learning has progressed rapidly, it is clear that there
are still many problems and challenges which need to be investigated.
In this section, we highlight open questions and technical challenges in
imitation learning.
Since the purpose and target applications of imitation learning are very
broad, benchmarking imitation learning methods can be challenging.
The following open questions are related to performance evaluation in
imitation learning.