Apprenticeship Learning About Multiple Intentions

Monica Babeş-Vroman [email protected]


Vukosi Marivate [email protected]
Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ 08854 USA
Kaushik Subramanian [email protected]
College of Computing, Georgia Institute of Technology, 801 Atlantic Dr., Atlanta, GA 30332 USA
Michael Littman [email protected]
Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ 08854 USA

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

In this paper, we apply tools from inverse reinforcement learning (IRL) to the problem of learning from (unlabeled) demonstration trajectories of behavior generated by varying “intentions” or objectives. We derive an EM approach that clusters observed trajectories by inferring the objectives for each cluster using any of several possible IRL methods, and then uses the constructed clusters to quickly identify the intent of a trajectory. We show that a natural approach to IRL—a gradient ascent method that modifies reward parameters to maximize the likelihood of the observed trajectories—is successful at quickly identifying unknown reward functions. We demonstrate these ideas in the context of apprenticeship learning by acquiring the preferences of a human driver in a simple highway car simulator.

1. Introduction

Apprenticeship learning (Abbeel & Ng, 2004), or AL, addresses the task of learning a policy from expert demonstrations. In one well studied formulation, the expert is assumed to be acting to maximize a reward function, but the reward function is unknown to the apprentice. The only information available concerning the expert’s intent is a set of trajectories from the expert’s interaction with the environment. From this information, the apprentice strives to derive a policy that performs well with respect to this unknown reward function. A basic assumption is that the expert’s intent can be expressed as a reward function that is a linear combination of a known set of features. If the apprentice’s goal is also to learn an explicit representation of the expert’s reward function, the problem is often called inverse reinforcement learning (IRL) or inverse optimal control.

In many natural scenarios, the apprentice observes the expert acting with different intents at different times. For example, a driver might be trying to get to the store safely one day or rushing to work for a meeting on another. If trajectories are labeled by the expert to identify their underlying objectives, the problem can be decomposed into a set of separate IRL problems. However, more often than not, the apprentice is left to infer the expert’s intention for each trajectory.

In this paper, we formalize the problem of apprenticeship learning about multiple intentions. We adopt a clustering approach in which observed trajectories are grouped so their inferred reward functions are consistent with observed behavior. We report results using seven IRL/AL approaches, including a simple but effective novel approach that chooses rewards to maximize the likelihoods of the observed trajectories under a (near) optimal policy.

2. Background and Definitions

In this section, we define apprenticeship learning (AL) and the closely related problem of inverse reinforcement learning (IRL). Algorithms for these problems take as input a Markov decision process (MDP) without a reward function and the observed behavior of the expert in the form of a sequence of state–action pairs.
This behavior is assumed to be (nearly) optimal in the MDP with respect to an unknown reward function. The goal in IRL is to find a proxy for the expert’s reward function. The goal in AL is to find a policy that performs well with respect to the expert’s reward function. As is common in earlier work, we focus on IRL as a means to solving AL problems. IRL is finding application in a broad range of problems from inferring people’s moral values (Moore et al., 2009) to interpreting verbal instructions (Branavan et al., 2009).

We use the following notation: MDP\r (or MDP) is a tuple (S, A, T, γ), where S is the state space, A is the action space, the transition function T : S × A × S → [0, 1] gives the transition probabilities between states when actions are taken, and γ ∈ [0, 1) is a discount factor that weights the outcome of future actions versus present actions. We will assume the availability of a set of trajectories coming from expert agents taking actions in the MDP in the form D = {ξ1, ..., ξN}. A trajectory consists of a sequence of state–action pairs ξi = {(s1, a1), . . .}.

Reward functions are parameterized by a vector of reward weights θ applied to a feature vector for each state–action pair φ(s, a). Thus, a reward function is written rθ(s, a) = θ^T φ(s, a). If the expert’s reward function is given by θE, the apprentice’s objective is to behave in a way that maximizes the discounted sum of expected future rewards with respect to rθE. However, the apprentice does not know θE and must use information from the observed trajectories to decide how to behave. It can, for example, hypothesize its own reward weights θA and behave accordingly.
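To make the notation concrete, the sketch below shows one way the linear reward parameterization rθ(s, a) = θ^T φ(s, a) and a demonstration set D might be represented in code. It is a minimal illustration only; the container names, state and action encodings, and array shapes are our own assumptions, not part of the paper.

```python
import numpy as np

# Minimal containers for an MDP without a reward function (MDP\r) and a
# demonstration set D = {xi_1, ..., xi_N}. Shapes and names are assumptions.
class MDPNoReward:
    def __init__(self, n_states, n_actions, T, gamma):
        self.n_states = n_states     # |S|
        self.n_actions = n_actions   # |A|
        self.T = T                   # T[s, a, s'] = Pr(s' | s, a)
        self.gamma = gamma           # discount factor in [0, 1)

def linear_reward(theta, phi):
    """r_theta(s, a) = theta^T phi(s, a) for every state-action pair.

    phi has shape (n_states, n_actions, n_features); the result has
    shape (n_states, n_actions)."""
    return phi @ theta

# A trajectory xi_i is simply a sequence of (state, action) index pairs.
example_trajectory = [(0, 1), (1, 1), (2, 3)]
```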
IRL algorithms differ not just in their algorithmic approach but also in the objective function they seek to optimize (Neu & Szepesvári, 2009). In this work, we examined several existing algorithms for IRL/AL. In Projection (Abbeel & Ng, 2004), the objective is to make the features encountered by the apprentice’s policy match those of the expert. LPAL and MWAL (Syed et al., 2008) behave in such a way that they outperform the expert according to θA. Policy matching (Neu & Szepesvári, 2007) tries to make the actions taken by its policy as close as possible to those observed from the expert. Maximum Entropy IRL (Ziebart et al., 2008) defines a probability distribution over complete trajectories as a function of θA and produces the θA that maximizes the likelihood of the observed trajectories.

It is worth noting several approaches that we were not able to include in our comparisons. Bayesian IRL (Ramachandran & Amir, 2007) is a framework for estimating posterior probabilities over possible reward functions given the observed trajectories. It assumes that randomness is introduced into each decision made by the expert. In Active Learning (Lopes et al., 2009), transitions are provided dynamically. The apprentice queries the expert for additional examples in states where needed.

We devised two new IRL algorithms for our comparisons. The linear program that constitutes the optimization core of LPAL (Linear Programming Apprenticeship Learning) is a modified version of the standard LP dual for solving MDPs (Puterman, 1994). It has as its variables the “policy flow” and a minimum per-feature reward component. We note that taking the dual of this LP results in a modified version of the standard LP primal for solving MDPs. It has as its variables the value function and θA. Because it produces explicit reward weights instead of just behavior, we call this algorithm Linear Programming Inverse Reinforcement Learning (LPIRL). Because its behavior is defined indirectly by θA, it can produce slightly different answers from LPAL. Our second algorithm seeks to maximize the likelihood of the observed trajectories, as described in the next section.

3. Maximum Likelihood Inverse Reinforcement Learning (MLIRL)

We present a simple IRL algorithm we call Maximum Likelihood Inverse Reinforcement Learning (MLIRL). Like Bayesian IRL, it adopts a probability model that uses θA to create a value function and then assumes the expert randomizes at the level of individual action choices. Like Maximum Entropy IRL, it seeks a maximum likelihood model. Like Policy matching, it uses a gradient method to find optimal behavior. The resulting algorithm is quite simple and natural, but we have not seen it described explicitly.

To define the algorithm more formally, we start by detailing the process by which a hypothesized θA induces a probability distribution over action choices and thereby assigns a likelihood to the trajectories in D. First, θA provides the rewards from which discounted expected values are derived:

QθA(s, a) = θA^T φ(s, a) + γ Σ_{s'} T(s, a, s') ⊞_{a'} QθA(s', a').

Here, the “max” in the standard Bellman equation is replaced with an operator, written ⊞, that blends values via Boltzmann exploration (John, 1994): ⊞_a Q(s, a) = Σ_a Q(s, a) e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}. This approach makes the likelihood (infinitely) differentiable, although, in practice, other mappings could be used. In our work, we calculate these values via 100 iterations of value iteration and use β = 0.5.
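The Boltzmann-weighted backup just described can be sketched directly. The code below is a minimal illustration, under our own assumptions about array shapes and helper names, of computing Qθ with the softened operator and the corresponding Boltzmann policy (the paper reports 100 iterations of value iteration and β = 0.5); it is not the authors' implementation.

```python
import numpy as np

def boltzmann_weighted_value(q_row, beta):
    """The blended operator: sum_a Q(s,a) e^{beta Q(s,a)} / sum_a' e^{beta Q(s,a')}."""
    w = np.exp(beta * (q_row - q_row.max()))   # shift exponents for numerical stability
    return np.dot(q_row, w / w.sum())

def soft_value_iteration(T, reward, gamma, beta=0.5, iters=100):
    """Q(s,a) = r(s,a) + gamma * sum_s' T(s,a,s') * blend_a' Q(s',a').

    T has shape (S, A, S); reward has shape (S, A)."""
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.array([boltzmann_weighted_value(Q[s], beta) for s in range(n_states)])
        Q = reward + gamma * T @ V             # (S, A, S) @ (S,) -> (S, A)
    return Q

def boltzmann_policy(Q, beta=0.5):
    """pi(s, a) proportional to e^{beta Q(s, a)}."""
    w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)
```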
The Boltzmann exploration policy is πθA(s, a) = e^{βQθA(s,a)} / Σ_{a'} e^{βQθA(s,a')}. Under this policy, the log likelihood of the trajectories in D is

L(D|θ) = Σ_{i=1}^{N} wi log Π_{(s,a)∈ξi} πθ(s, a) = Σ_{i=1}^{N} wi Σ_{(s,a)∈ξi} log πθ(s, a).   (1)

Here, wi is a trajectory-specific weight encoding the frequency of trajectory i. MLIRL seeks θA = argmaxθ L(D|θ)—the maximum likelihood solution. In our work, we optimized this function via gradient ascent (although we experimented with several other optimization approaches). These pieces come together in Algorithm 1.

Algorithm 1 Maximum Likelihood IRL
Input: MDP\r, features φ, trajectories {ξ1, . . . , ξN}, trajectory weights {w1, . . . , wN}, number of iterations M, step size αt for each iteration t, 1 ≤ t < M.
Initialize: Choose random set of reward weights θ1.
for t = 1 to M do
  Compute Qθt, πθt.
  L = Σ_i Σ_{(s,a)∈ξi} wi log(πθt(s, a)).
  θt+1 ← θt + αt ∇L.
end for
Output: Return θA = θM.

It is open whether infinite-horizon value iteration with the Boltzmann operator will converge. In our finite-horizon setting, it is well-behaved and produces a well-defined answer, as illustrated later in this section and in our experiments (Section 5).

We illustrate the functioning of the MLIRL algorithm using the example shown in Figure 1. It depicts a 5 × 5 grid with puddles (indicated by wavy lines), a start state (S) and an absorbing goal state (G). The dashed line shows the path taken by an expert from S to G. The algorithm is now faced with the task of inferring the parameters of the expert’s reward function θE using this trajectory. It appears that the expert is trying to reach the goal by taking the shortest path while at the same time avoiding any intermediate puddles. The assignment of reward weights to the three features—ground, puddle, and goal—that makes this trajectory maximally likely is one that assigns the highest reward to the goal. (Otherwise, the expert would have preferred to travel somewhere else in the grid.) The probability of the observed path is further enhanced by assigning lower reward weights to puddles than to ground. Thus, although one explanation for the path is that it is one of a large number of possible shortest paths to the goal, the trajectory’s probability is maximized by assuming the expert intentionally missed the puddles. The MLIRL-computed reward function is shown in Figure 2; it assigns high likelihood (0.1662) to the single demonstration trajectory.

Figure 1. A single trajectory from start to goal.

Figure 2. Reward function computed using MLIRL.

One of the challenges of IRL is that, given an expert policy, there are an infinite number of reward functions for which that policy is optimal in the given MDP. Like several other IRL approaches, MLIRL addresses this issue by searching for a solution that not only explains why the observed behavior is optimal, but also why the other possible behaviors are suboptimal. In particular, by striving to assign high probability to the observed behavior, it implicitly assigns low probability to unobserved behavior.
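The following sketch mirrors Algorithm 1 in code. To stay self-contained it estimates the gradient of the log likelihood in Equation (1) by finite differences rather than deriving the analytic gradient; the soft value iteration and Boltzmann policy follow the definitions in Section 3, and all names, step sizes, and shapes are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def boltzmann_q(T, reward, gamma, beta=0.5, iters=100):
    # Soft value iteration with the Boltzmann-weighted operator from Section 3.
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
        V = (Q * w).sum(axis=1) / w.sum(axis=1)          # blended "max" over actions
        Q = reward + gamma * T @ V
    return Q

def boltzmann_policy(Q, beta=0.5):
    w = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

def log_likelihood(theta, T, phi, trajectories, weights, gamma, beta=0.5):
    # L(D | theta) = sum_i w_i sum_{(s,a) in xi_i} log pi_theta(s, a)  (Equation 1)
    pi = boltzmann_policy(boltzmann_q(T, phi @ theta, gamma, beta), beta)
    return sum(w * sum(np.log(pi[s, a]) for (s, a) in xi)
               for w, xi in zip(weights, trajectories))

def mlirl(T, phi, trajectories, weights, gamma, iters=20, step=0.1, eps=1e-4):
    """Algorithm 1 (sketch): gradient ascent on L(D | theta), gradient by finite differences."""
    theta = np.random.randn(phi.shape[-1])                # random initial reward weights
    for _ in range(iters):
        base = log_likelihood(theta, T, phi, trajectories, weights, gamma)
        grad = np.zeros_like(theta)
        for k in range(theta.size):                       # numerical gradient of L
            bumped = theta.copy()
            bumped[k] += eps
            grad[k] = (log_likelihood(bumped, T, phi, trajectories, weights, gamma) - base) / eps
        theta = theta + step * grad
    return theta
```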
4. Apprenticeship Learning about Multiple Intentions

The motivation for our work comes from settings like surveillance, in which observed actors are classified as “normal” or “threatening” depending on their behavior. We contend that a parsimonious classifier results by adopting a generative model of behavior—assume actors select actions that reflect their intentions and then categorize them based on their inferred intentions. For example, the behavior of people in a train station might differ according to their individual goals: some have the goal of traveling, causing them to buy tickets and then go to their trains, while others may be picking up passengers, causing them to wait in a visible area. We adopt the approach of using unsupervised clustering to identify the space of common intentions from a collection of examples, then mapping later examples to this set using Bayes rule.

Similar scenarios include decision making by automatic doors that infer when people intend to go through them, or a home climate control system that sets temperature controls appropriately by reasoning about the home owner’s likely destinations when driving. A common theme in these applications is that unlabeled data—observations of experts with varying intentions—are much easier to come by than trajectories labeled with their underlying goal. We define our formal problem accordingly.

In the problem of apprenticeship learning about multiple intentions, we assume there exists a finite set of K or fewer intentions, each represented by reward weights θk. The apprentice is provided with a set of N > K trajectories D = {ξ1, ..., ξN}. Each intention is represented by at least one element in this set and each trajectory is generated by an expert with one of the intentions. An additional trajectory ξE is the test trajectory—the apprentice’s objective is to produce behavior πA that obtains high reward with respect to θE, the reward weights that generated ξE. Many possible clustering algorithms could be applied to attack this problem. We show that Expectation-Maximization (EM) is a viable approach.

4.1. A Clustering Algorithm for Intentions

We adopt EM (Dempster et al., 1977) as a straightforward approach to computing a maximum likelihood model in a probabilistic setting in the face of missing data. The missing data in this case are the cluster labels—the mapping from trajectories to one of the intentions. We next derive an EM algorithm.

Define zij to be the probability that trajectory i belongs in cluster j. Let θj be the estimate of the reward weights for cluster j, and ρj be the estimate for the prior probability of cluster j. Following the development in Bilmes (1997), we define Θ = (ρ1, . . . , ρK, θ1, . . . , θK) as the parameter vector we are searching for and Θt as the parameter vector at iteration t. Let yi = j if trajectory i came from following intention j and y = (y1, . . . , yN). We write z^t_ij = Pr(ξi | θ^t_j), the probability, according to the parameters at iteration t, that trajectory i was generated by intention j.

The E step of EM simply computes

z^t_ij = Π_{(s,a)∈ξi} π_{θ^t_j}(s, a) ρ^t_j / Z,   (2)

where Z is the normalization factor.

To carry out the M step, we define the EM Q function (distinct from the MDP Q function):

Q(Θ, Θt) = Σ_y L(Θ|D, y) Pr(y|D, Θt)
= Σ_y Σ_{i=1}^{N} log(ρ_{yi} Pr(ξi|θ_{yi})) Π_{i'=1}^{N} Pr(y_{i'}|ξ_{i'}, Θt)
= Σ_{y1} · · · Σ_{yN} Σ_{i=1}^{N} Σ_{l=1}^{K} δ_{l=yi} log(ρl Pr(ξi|θl)) × Π_{i'=1}^{N} Pr(y_{i'}|ξ_{i'}, Θt)
= Σ_{l=1}^{K} Σ_{i=1}^{N} log(ρl Pr(ξi|θl)) Σ_{y1} · · · Σ_{yN} δ_{l=yi} Π_{i'=1}^{N} Pr(y_{i'}|ξ_{i'}, Θt)
= Σ_{l=1}^{K} Σ_{i=1}^{N} log(ρl Pr(ξi|θl)) z^t_{il}
= Σ_{l=1}^{K} Σ_{i=1}^{N} log(ρl) z^t_{il} + Σ_{l=1}^{K} Σ_{i=1}^{N} log(Pr(ξi|θl)) z^t_{il}.   (3)

In the M step, we need to pick Θ (ρl and θl) to maximize Equation 3. Since they are not interdependent, we can optimize them separately. Thus, we can set ρ^{t+1}_l = Σ_i z^t_{il}/N and θ^{t+1}_l = argmaxθ Σ_{i=1}^{N} z^t_{il} log(Pr(ξi|θl)). The key observation is that this second quantity is precisely the IRL log likelihood, as seen in Equation 1. That is, the M step demands that we find reward weights that make the observed data as likely as possible, which is precisely what MLIRL seeks to do. As a result, EM for learning about multiple intentions alternates between calculating probabilities via the E step (Equation 2) and performing IRL on the current clusters.

Algorithm 2 EM Trajectory Clustering
Input: Trajectories {ξ1, ..., ξN} (with varying intentions), number of clusters K.
Initialize: ρ1, . . . , ρK, θ1, . . . , θK randomly.
repeat
  E Step: Compute z_ij = Π_{(s,a)∈ξi} π_{θj}(s, a) ρj / Z, where Z is the normalization factor.
  M step: For all l, ρl = Σ_i z_il / N. Compute θl via MLIRL on D with weight z_il on trajectory ξi.
until target number of iterations completed.

Algorithm 2 pulls these pieces together. This EM approach is a fairly direct interpretation of the clustering problem we defined. It differs from much of the published work on learning from multiple experts, however, which starts with the assumption that all the experts have the same intentions (same reward function), but perhaps differ in their reliability (Argall et al., 2009; Richardson & Domingos, 2003).
4.2. Using Clusters for AL

The input of the EM method of the previous section is a set of trajectories D and a number of clusters K. The output is a set of K clusters. Associated with each cluster i are the reward weights θi, which induce a reward function rθi, and a cluster prior ρi. Next, we consider how to carry out AL on a new trajectory ξE under the assumption that it comes from the same population as the trajectories in D.

By Bayes rule, Pr(θi|ξE) = Pr(ξE|θi) Pr(θi)/Pr(ξE). Here, Pr(θi) = ρi and Pr(ξE|θi) is easily computable (z in Section 4.1). The quantity Pr(ξE) is a simple normalization factor. Thus, the apprentice can derive a probability distribution over reward functions given a trajectory (Ziebart et al., 2008). How should it behave? Let f^π(s, a) be the (weighted) fraction of the time policy π spends taking action a in state s. Then, with respect to reward function r, the value of policy π can be written Σ_{s,a} f^π(s, a) r(s, a). We should choose the policy with the highest expected reward:

argmax_π Σ_i Pr(θi|ξE) Σ_{s,a} f^π(s, a) rθi(s, a)
= argmax_π Σ_{s,a} f^π(s, a) Σ_i Pr(θi|ξE) rθi(s, a)
= argmax_π Σ_{s,a} f^π(s, a) r'(s, a),

where r'(s, a) = Σ_i Pr(θi|ξE) rθi(s, a). That is, the optimal policy for the apprentice is the one that maximizes the sum of the reward functions for the possible intentions, weighted by their likelihoods. This problem can be solved by computing the optimal policy of the MDP with this averaged reward function. Thus, to figure out how to act given an initial trajectory and a collection of example trajectories, our approach is to cluster the examples, use Bayes rule to figure out the probability that the current trajectory belongs in each cluster, create a merged reward function by combining the cluster reward functions using the derived probabilities, and finally compute a policy for the merged reward function to decide how to behave.
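The Bayes-rule step and the merged reward r'(s, a) described above can be sketched as follows. The routine assumes a Boltzmann policy table per cluster (as in Section 3) for scoring the new trajectory; names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cluster_posterior(xi_E, rho, policies):
    """Pr(theta_i | xi_E) proportional to rho_i * prod_{(s,a) in xi_E} pi_theta_i(s, a)."""
    log_post = np.array([np.log(rho_i) + sum(np.log(pi[s, a]) for (s, a) in xi_E)
                         for rho_i, pi in zip(rho, policies)])
    log_post -= log_post.max()                            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()                              # the Pr(xi_E) normalization

def merged_reward(posterior, thetas, phi):
    """r'(s, a) = sum_i Pr(theta_i | xi_E) * r_theta_i(s, a), with r linear in features phi."""
    rewards = np.stack([phi @ theta for theta in thetas])  # (K, S, A)
    return np.tensordot(posterior, rewards, axes=1)        # (S, A)

# The apprentice would then compute an optimal policy of the MDP under this merged
# reward (e.g., standard value iteration) to decide how to behave.
```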
5. Experiments

Our experiments were designed to compare the performance of the MLIRL (Section 3) and LPIRL (Section 2) algorithms with five existing IRL/AL approaches summarized in Section 2. We compare these seven approaches in several ways to assess (a) how well they perform apprenticeship learning and (b) how well they function in the setting of learning about multiple intentions. We first look at their performance in a grid world with a single expert (single intention), a domain where a few existing approaches (Abbeel & Ng, 2004; Syed et al., 2008) have already been tested. Our second experiment, in a grid world with puddles, demonstrates the MLIRL algorithm as part of our EM approach (Section 4) to cluster trajectories from multiple intentions—each corresponding to a different reward function. Thirdly, we compare the performance of all the IRL/AL algorithms as part of the EM clustering approach in the simulated Highway Car domain (Abbeel & Ng, 2004; Syed et al., 2008), an infinite-horizon domain with stochastic transitions. Our experiments used implementations of the MLIRL, LPIRL, Maximum Entropy IRL, LPAL, MWAL, Projection, and Policy Matching algorithms. We obtained implementations from the original authors wherever possible.

5.1. Learning from a Single Expert

Figure 3. A plot of the average reward computed with increasing number of sample trajectories.

Figure 4. A plot of the average trajectory likelihood computed with increasing number of sample trajectories.
In this experiment, we tested the performance of each IRL/AL algorithm in a grid world environment similar to one used by Abbeel & Ng (2004) and Syed et al. (2008). We use a grid of size 16 × 16. Movement of the agent is possible in the four compass directions, with each action having a 30% chance of causing a random transition. The grid is further subdivided into non-overlapping square regions, each of size 4 × 4. Using the same terminology as Abbeel & Ng (2004), we refer to the square regions as “macrocells”. The partitioning of the grid results in a total of 16 macrocells. Every cell in the gridworld is characterized by a 16-dimensional feature vector φ indicating, using a 0 or 1, which macrocell it belongs to. A random weight vector is chosen such that the true reward function just encodes that some macrocells are more desirable than others. The optimal policy π∗ is computed for the true reward function and the single expert trajectories are acquired by sampling π∗. To maintain consistency across the algorithms, the start state is drawn from a fixed distribution and the lengths of the trajectories are truncated to 60 steps.

Of particular interest is the ability of the seven IRL/AL algorithms to learn from a small amount of data. Thus, we illustrate the performance of the algorithms by varying the number of sample trajectories available for learning. Results are averaged over 5 repetitions and standard error bars are given. Note that in this and the following experiments, we use Boltzmann exploration policies to transform the reward functions computed by the IRL algorithms into policies when required.

Figure 3 shows the average reward accumulated by the policy computed by each algorithm as more trajectories are available for training. With 30 or more trajectories, MLIRL outperforms the other six. LPAL and LPIRL also perform well. An advantage of LPIRL over LPAL is that it returns a reward function, which makes it able to generalize over states that the expert has not visited during the demonstration trajectories. However, we observed that designing a policy indirectly through the reward function was less stable than optimizing the policy directly. It is interesting to note that MaxEnt lags behind in this setting. MaxEnt appears best suited for settings with very long demonstration trajectories, as opposed to the relatively short trajectories we used in this experiment.

Figure 4 shows that for the most part, in this dataset, the better an algorithm does at assigning high probability to the observed trajectories, the more likely it is to obtain higher rewards.

5.2. Learning about Multiple Intentions—Grid World with Puddles

In our second experiment, we test the ability of our proposed EM approach, described in Section 4, to accurately cluster trajectories associated with multiple intentions.

We make use of a 5 × 5 discrete grid world shown in Figure 5 (Left). The world contains a start state, a goal state and patches in the middle indicating puddles. Furthermore, the world is characterized by three feature vectors, one for the goal, one for the puddles and another for the remaining states. For added expressive power, we also included the negations of the features in the set, thereby doubling the number of features to six.

We imagine data comes from two experts with different intentions. Expert 1 goes to the goal avoiding the puddles at all times and Expert 2 goes to the goal completely ignoring the puddles. Sample trajectories from these experts are shown in Figure 5 (Left). Trajectory T1 was generated by Expert 1; T2 and T3, by Expert 2. This experiment used a total of N = 12 sample trajectories of varying lengths, 5 from Expert 1, 7 from Expert 2. We initiated the EM algorithm by setting the value of K, the number of clusters, to 5 to allow some flexibility in clustering. We ran the clustering, then hand-identified the two experts. Figure 5 (Right) shows the algorithm’s estimates that the three trajectories, T1, T2 and T3, belong to Expert 1. The EM approach was able to successfully cluster all of the 12 trajectories in the manner described above: the unambiguous trajectories were accurately assigned to their clusters and the ambiguous ones were “properly” assigned to multiple clusters. Since we set the value of K = 5, EM produced 5 clusters. On analyzing these clusters, we found that the algorithm produced 2 unique policies along with 3 copies. Thus, EM correctly extracted the preferences of the experts using the input sample trajectories.

Figure 5. Left: Grid world showing the start states (grey), goal state (G), puddles and three sample trajectories. Right: Posterior probabilities of the three trajectories belonging to Expert 1.
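As a concrete illustration of the feature encodings used in these experiments, the snippet below builds the indicator features for the puddle grid world (goal, puddle, ground) together with their negations, giving the six features mentioned above. The state layout, the particular cell indices, and the helper names are our own assumptions for the sketch.

```python
import numpy as np

def puddle_world_features(n_states, goal_states, puddle_states):
    """Per-state indicator features: [goal, puddle, ground] plus their negations (6 total)."""
    goal = np.zeros(n_states);   goal[list(goal_states)] = 1.0
    puddle = np.zeros(n_states); puddle[list(puddle_states)] = 1.0
    ground = 1.0 - np.clip(goal + puddle, 0.0, 1.0)        # every remaining state
    base = np.stack([goal, puddle, ground], axis=1)        # (n_states, 3)
    return np.concatenate([base, 1.0 - base], axis=1)      # add negations -> (n_states, 6)

# Example: a 5 x 5 grid flattened to 25 states, with an assumed goal cell and puddle cells.
phi_states = puddle_world_features(25, goal_states={24}, puddle_states={7, 8, 12})
# For state-action features phi(s, a), each state's features can be repeated across actions:
phi = np.repeat(phi_states[:, None, :], 4, axis=1)         # (25 states, 4 actions, 6 features)
```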
The probability values were computed at intermediate steps during the 10 iterations of the EM algorithm. After the 1st iteration, EM estimated that T1 belongs to Expert 1 with high probability and T2 belongs to Expert 1 with very low probability (implying that it therefore belongs to Expert 2). It is interesting to note here that EM estimated that trajectory T3 belongs to Expert 1 with probability 0.3. The uncertainty indicates that T3 could belong to either Expert 1 or Expert 2.

5.3. Learning about Multiple Intentions—Highway Car Domain

In our third experiment, we instantiated the EM algorithm in an infinite horizon domain with stochastic transitions, the simulated Highway Car domain (Abbeel & Ng, 2004; Syed et al., 2008). This domain consists of a three-lane highway with an extra off-road lane on either side, a car driving at constant speed and a set of oncoming cars. Figure 7 shows a snapshot of the simulated highway car domain. The task is for the car to navigate through the busy highway using three actions: left, right and stay. The domain consists of three features: speed, number of collisions, number of off-road visits. Our experiment uses these three features along with their negations, making a total of six features. The transition dynamics are stochastic. Four different experts were used for this experiment: Safe: Avoids collisions and avoids going off-road. Student: Avoids collisions and does not mind going off-road. Demolition: Collides with every car and avoids going off-road. Nasty: Collides with every car and does not mind going off-road. Sample trajectories were collected from between ten seconds and two minutes of driving time from a human subject emulating each of the four experts. Using these sample trajectories, the EM approach performed clustering (K = 6) for 10 iterations. The trajectory used for evaluation ξE was generated by Student. The actions selected by the approach outlined in the previous section were evaluated according to the reward function from Student and plotted in Figure 6. Although our MLIRL algorithm is best suited to carry out the M step in the EM algorithm, any IRL can be used to approximately optimize the likelihood. Indeed, even AL algorithms can be used in the EM framework, where a probabilistic policy takes the place of the reward weights as the hidden parameters. Thus, we instantiated each of the 7 AL/IRL approaches within the EM algorithm. It is interesting to note that maximum likelihood algorithms (MLIRL and MaxEnt) are the most effective for this task. This time, MaxEnt was provided with longer trajectories, leading to an improvement in its performance compared to Section 5.1 and Figure 3.

Figure 6. Average reward for Student trajectory for EM approach with varying IRL/AL components.

Figure 7. Simulated Highway Car.

Other approaches to learning about multiple intentions are possible. We compared the EM approach to AL (Section 4.2) with two other possibilities: (1) an AL learner that ignores all previous data in D and only learns from the current trajectory ξE (online AL), and (2) an AL learner for which all the prior data D and the current trajectory ξE are treated as a single input, commingling data generated from different intentions (single expert AL). Figure 8 shows that the EM approach (with either MLIRL or Maximum Entropy) makes much better use of the available data and that mixing data from multiple experts is undesirable.

Figure 8. Value of the computed policy as a function of length of driving trajectories for three approaches to learning about multiple intentions.
6. Conclusion and Future Work

We defined an extension of inverse reinforcement learning and apprenticeship learning in which the learner is provided with unlabeled example trajectories generated from a number of possible reward functions. Using these examples as a kind of background knowledge, a learner can more quickly infer and optimize reward functions for novel trajectories.

Having shown that an EM clustering approach can successfully infer individual intentions from a collection of unlabeled trajectories, we next intend to pursue using these learned intentions to predict the behavior of and better interact with other agents in multiagent environments.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.

Argall, Brenna, Browning, Brett, and Veloso, Manuela M. Automatic weight learning for multiple data sources when learning from demonstration. In Proceedings of the International Conference on Robotics and Automation, pp. 226–231, 2009.

Bilmes, Jeff A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, 1997.

Branavan, S. R. K., Chen, Harr, Zettlemoyer, Luke S., and Barzilay, Regina. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 82–90, 2009.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

John, George H. When the best move isn’t optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1464, Seattle, WA, 1994.

Lopes, Manuel, Melo, Francisco S., and Montesano, Luis. Active learning for reward estimation in inverse reinforcement learning. In ECML/PKDD, pp. 31–46, 2009.

Moore, Adam B., Todd, Michael T., and Conway, Andrew R. A. A computational model of moral judgment. Poster at Psychonomics Society Meeting, 2009.

Neu, Gergely and Szepesvári, Csaba. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2007.

Neu, Gergely and Szepesvári, Csaba. Training parsers by inverse reinforcement learning. Machine Learning, 77(2–3):303–337, 2009.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In Proceedings of IJCAI, pp. 2586–2591, 2007.

Richardson, Matthew and Domingos, Pedro. Learning with knowledge from multiple experts. In Proceedings of the International Conference on Machine Learning, pp. 624–631, 2003.

Syed, Umar, Bowling, Michael, and Schapire, Robert E. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, pp. 1032–1039, 2008.

Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1433–1438, 2008.