Apprenticeship Learning About Multiple Intentions
MDP with respect to an unknown reward function. The goal in IRL is to find a proxy for the expert’s reward function. The goal in AL is to find a policy that performs well with respect to the expert’s reward function. As is common in earlier work, we focus on IRL as a means to solving AL problems. IRL is finding application in a broad range of problems, from inferring people’s moral values (Moore et al., 2009) to interpreting verbal instructions (Branavan et al., 2009).

We use the following notation: MDP\R (or MDP) is a tuple (S, A, T, γ), where S is the state space, A is the action space, the transition function T : S × A × S → [0, 1] gives the transition probabilities between states when actions are taken, and γ ∈ [0, 1) is a discount factor that weights the outcome of future actions versus present actions. We will assume the availability of a set of trajectories coming from expert agents taking actions in the MDP in the form D = {ξ1, ..., ξN}. A trajectory consists of a sequence of state-action pairs ξi = {(s1, a1), ...}.

Reward functions are parameterized by a vector of reward weights θ applied to a feature vector for each state-action pair φ(s, a). Thus, a reward function is written rθ(s, a) = θ^T φ(s, a). If the expert’s reward function is given by θE, the apprentice’s objective is to behave in a way that maximizes the discounted sum of expected future rewards with respect to rθE. However, the apprentice does not know θE and must use information from the observed trajectories to decide how to behave. It can, for example, hypothesize its own reward weights θA and behave accordingly.

IRL algorithms differ not just in their algorithmic approach but also in the objective function they seek to optimize (Neu & Szepesvári, 2009). In this work, we examined several existing algorithms for IRL/AL. In Projection (Abbeel & Ng, 2004), the objective is to make the features encountered by the apprentice’s policy match those of the expert. LPAL and MWAL (Syed et al., 2008) behave in such a way that they outperform the expert according to θA. Policy matching (Neu & Szepesvári, 2007) tries to make the actions taken by its policy as close as possible to those observed from the expert. Maximum Entropy IRL (Ziebart et al., 2008) defines a probability distribution over complete trajectories as a function of θA and produces the θA that maximizes the likelihood of the observed trajectories.

It is worth noting several approaches that we were not able to include in our comparisons. Bayesian IRL (Ramachandran & Amir, 2007) is a framework for estimating posterior probabilities over possible reward functions given the observed trajectories. It assumes that randomness is introduced into each decision made by the expert. In Active Learning (Lopes et al., 2009), transitions are provided dynamically. The apprentice queries the expert for additional examples in states where needed.

We devised two new IRL algorithms for our comparisons. The linear program that constitutes the optimization core of LPAL (Linear Programming Apprenticeship Learning) is a modified version of the standard LP dual for solving MDPs (Puterman, 1994). It has as its variables the “policy flow” and a minimum per-feature reward component. We note that taking the dual of this LP results in a modified version of the standard LP primal for solving MDPs. It has as its variables the value function and θA. Because it produces explicit reward weights instead of just behavior, we call this algorithm Linear Programming Inverse Reinforcement Learning (LPIRL). Because its behavior is defined indirectly by θA, it can produce slightly different answers from LPAL. Our second algorithm seeks to maximize the likelihood of the observed trajectories, as described in the next section.
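For concreteness, one plausible form of this optimization, adapted from the occupancy-measure linear program of Syed et al. (2008) and written in the notation above (the exact constraint set used by LPAL may differ in detail), is

    maximize B over B and λ(s, a) ≥ 0
    subject to   Σ_{s,a} λ(s, a) φ_k(s, a) − μ̂E,k ≥ B   for every feature k,
                 Σ_a λ(s, a) = α(s) + γ Σ_{s',a'} T(s', a', s) λ(s', a')   for every state s,

where λ is the “policy flow” (the discounted state-action occupancy), α is the start-state distribution, and μ̂E,k is the expert’s empirical discounted expectation of feature k; α and μ̂E are our notation, not the paper’s.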
3. Maximum Likelihood Inverse Reinforcement Learning (MLIRL)

We present a simple IRL algorithm we call Maximum Likelihood Inverse Reinforcement Learning (MLIRL). Like Bayesian IRL, it adopts a probability model that uses θA to create a value function and then assumes the expert randomizes at the level of individual action choices. Like Maximum Entropy IRL, it seeks a maximum likelihood model. Like Policy matching, it uses a gradient method to find optimal behavior. The resulting algorithm is quite simple and natural, but we have not seen it described explicitly.

To define the algorithm more formally, we start by detailing the process by which a hypothesized θA induces a probability distribution over action choices and thereby assigns a likelihood to the trajectories in D. First, θA provides the rewards from which discounted expected values are derived:

    QθA(s, a) = θA^T φ(s, a) + γ Σ_{s'} T(s, a, s') ⊞_{a'} QθA(s', a').

Here, the “max” in the standard Bellman equation is replaced with an operator ⊞ that blends values via Boltzmann exploration (John, 1994):

    ⊞_a Q(s, a) = Σ_a Q(s, a) e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}.

This approach makes the likelihood (infinitely) differentiable, although, in practice, other mappings could be used. In our work, we calculate these values via 100 iterations of value iteration and use β = 0.5.
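As a concrete illustration, the sketch below computes these Boltzmann-blended Q-values and the resulting log-likelihood of a set of trajectories for a candidate θA in a small tabular MDP; the array layout, function names, and use of plain NumPy are illustrative assumptions rather than the implementation used in our experiments.

import numpy as np

def soft_q_values(theta, phi, T, gamma=0.95, beta=0.5, iters=100):
    """Boltzmann-blended value iteration for a tabular MDP.

    phi: |S| x |A| x K feature array, T: |S| x |A| x |S| transition array.
    gamma is an illustrative choice; beta = 0.5 and 100 iterations follow the text.
    """
    r = phi @ theta                              # r_theta(s, a) = theta^T phi(s, a)
    Q = np.zeros_like(r)
    for _ in range(iters):
        w = np.exp(beta * Q)
        w /= w.sum(axis=1, keepdims=True)        # Boltzmann weights over actions
        blended = (w * Q).sum(axis=1)            # the blend operator applied at each state
        Q = r + gamma * T @ blended              # Bellman backup with max replaced by the blend
    return Q

def log_likelihood(theta, trajectories, phi, T, beta=0.5):
    """log Pr(D | theta): sum of log Boltzmann action probabilities along each trajectory."""
    Q = soft_q_values(theta, phi, T, beta=beta)
    log_pi = beta * Q - np.log(np.exp(beta * Q).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for xi in trajectories for (s, a) in xi)

A gradient method, as noted above, can then ascend this log-likelihood in θA.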
[...] of the intentions. An additional trajectory ξE is the test trajectory—the apprentice’s objective is to produce behavior that performs well with respect to θE, the reward weights that generated ξE. Many possible clustering algorithms could be applied [...] (Argall et al., 2009; Richardson & Domingos, 2003).

[...]
    = Σ_y Σ_{i=1..N} log(ρ_{y_i} Pr(ξi | θ_{y_i})) Π_{i'=1..N} Pr(y_{i'} | ξ_{i'}, Θ^t)
    = Σ_{y_1} ··· Σ_{y_N} Σ_{i=1..N} Σ_l δ_{l=y_i} log(ρ_l Pr(ξi | θ_l)) [...]
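A minimal sketch of the E-step/M-step loop behind these quantities, assuming the per-cluster reward weights θl are re-fit by a weighted IRL routine such as MLIRL and writing z[i, l] for the posterior probability that trajectory ξi was generated under cluster l; the helper signatures irl_fit and traj_log_lik are hypothetical.

import numpy as np

def em_cluster_trajectories(trajectories, K, irl_fit, traj_log_lik, iters=10, seed=0):
    """Alternate soft cluster assignments and per-cluster reward fits.

    irl_fit(trajectories, weights) -> theta      (e.g., a weighted MLIRL fit)
    traj_log_lik(xi, theta)        -> log Pr(xi | theta)
    """
    rng = np.random.default_rng(seed)
    N = len(trajectories)
    rho = np.full(K, 1.0 / K)                                   # cluster priors rho_l
    thetas = [irl_fit(trajectories, rng.dirichlet(np.ones(N)))  # random initial fits (our choice)
              for _ in range(K)]
    for _ in range(iters):                                      # e.g., 10 EM iterations (Section 5.2)
        # E-step: z[i, l] proportional to rho_l * Pr(xi_i | theta_l)
        log_z = np.array([[np.log(rho[l]) + traj_log_lik(xi, thetas[l])
                           for l in range(K)] for xi in trajectories])
        z = np.exp(log_z - log_z.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: update priors and re-fit each cluster's reward weights
        rho = z.mean(axis=0)
        thetas = [irl_fit(trajectories, z[:, l]) for l in range(K)]
    return rho, thetas, z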
4.2. Using Clusters for AL

The input of the EM method of the previous section is a set of trajectories D and a number of clusters K. The output is a set of K clusters. Associated with each cluster i are the reward weights θi, which induce a reward function rθi, and a cluster prior ρi. Next, we consider how to carry out AL on a new trajectory ξE under the assumption that it comes from the same population as the trajectories in D.

By Bayes rule, Pr(θi | ξE) = Pr(ξE | θi) Pr(θi) / Pr(ξE). Here, Pr(θi) = ρi and Pr(ξE | θi) is easily computable (z in Section 4.1). The quantity Pr(ξE) is a simple normalization factor. Thus, the apprentice can derive a probability distribution over reward functions given a trajectory (Ziebart et al., 2008). How should it behave? Let f^π(s, a) be the (weighted) fraction of the time policy π spends taking action a in state s. Then, with respect to reward function r, the value of policy π can be written Σ_{s,a} f^π(s, a) r(s, a). We should choose the policy with the highest expected reward:

    argmax_π Σ_i Pr(θi | ξE) Σ_{s,a} f^π(s, a) rθi(s, a).

Equivalently, because this objective is linear in the rewards, the apprentice can cluster the examples, use Bayes rule to figure out the probability that the current trajectory belongs in each cluster, create a merged reward function by combining the cluster reward functions using the derived probabilities, and finally compute a policy for the merged reward function to decide how to behave.
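The following sketch carries out this recipe, reusing a trajectory log-likelihood of the kind defined in Section 3; the helper names and the log-space normalization are illustrative rather than prescribed.

import numpy as np

def posterior_over_clusters(xi_E, thetas, rho, traj_log_lik):
    """Pr(theta_i | xi_E) proportional to rho_i * Pr(xi_E | theta_i), via Bayes rule."""
    log_post = np.array([np.log(rho[i]) + traj_log_lik(xi_E, thetas[i])
                         for i in range(len(thetas))])
    post = np.exp(log_post - log_post.max())     # subtract max for numerical stability
    return post / post.sum()                     # Pr(xi_E) is just this normalizer

def merged_reward_weights(xi_E, thetas, rho, traj_log_lik):
    """Combine the cluster reward weights using the derived posterior probabilities."""
    post = posterior_over_clusters(xi_E, thetas, rho, traj_log_lik)
    return post @ np.asarray(thetas)             # Sum_i Pr(theta_i | xi_E) * theta_i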
5. Experiments

Our experiments were designed to compare the performance of the MLIRL (Section 3) and LPIRL (Section 2) algorithms with five existing IRL/AL approaches summarized in Section 2. We compare these seven approaches in several ways to assess (a) how well they perform apprenticeship learning and (b) how well they function in the setting of learning about multiple intentions. We first look at their performance in a grid world with a single expert (single intention), a domain where a few existing approaches (Abbeel & Ng, 2004; Syed et al., 2008) have already been tested. Our second experiment, in a grid world with puddles, demonstrates the MLIRL algorithm as part of our EM approach (Section 4) to cluster trajectories from multiple intentions—each corresponding to a different reward function. Thirdly, we compare the performance of all the IRL/AL algorithms as part of the EM clustering approach in the simulated Highway Car domain (Abbeel & Ng, 2004; Syed et al., 2008), an infinite-horizon domain with stochastic transitions.

Our experiments used implementations of the MLIRL, LPIRL, Maximum Entropy IRL, LPAL, MWAL, Projection, and Policy Matching algorithms. We obtained implementations from the original authors wherever possible.

5.1. Learning from a Single Expert

[Figure: average value vs. number of sample trajectories for Optimal, MLIRL, LPAL, Policy Matching, LPIRL, Projection, MWAL, and Maximum Entropy.]

Figure 4. A plot of the average trajectory likelihood computed with increasing number of sample trajectories.

In this experiment, we tested the performance of each IRL/AL algorithm in a grid world environment
similar to one used by Abbeel & Ng (2004) and Syed et al. (2008). We use a grid of size 16×16. Movement of the agent is possible in the four compass directions, with each action having a 30% chance of causing a random transition. The grid is further subdivided into non-overlapping square regions, each of size 4×4. Using the same terminology as Abbeel & Ng (2004), we refer to the square regions as “macrocells”. The partitioning of the grid results in a total of 16 macrocells. Every cell in the gridworld is characterized by a 16-dimensional feature vector φ indicating, using a 0 or 1, which macrocell it belongs to. A random weight vector is chosen such that the true reward function just encodes that some macrocells are more desirable than others. The optimal policy π* is computed for the true reward function and the single expert trajectories are acquired by sampling π*. To maintain consistency across the algorithms, the start state is drawn from a fixed distribution and the lengths of the trajectories are truncated to 60 steps.
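The macrocell indicator features and the random true-reward weights can be constructed as in the sketch below; the cell indexing and the uniform sampling of the weights are illustrative choices rather than the exact ones used in the experiment.

import numpy as np

GRID, CELL = 16, 4                         # 16 x 16 grid split into 4 x 4 macrocells
N_MACRO = (GRID // CELL) ** 2              # 16 macrocell indicator features

def macrocell_features(x, y):
    """0/1 feature vector for grid cell (x, y): a single 1 marking its macrocell."""
    phi = np.zeros(N_MACRO)
    phi[(x // CELL) * (GRID // CELL) + (y // CELL)] = 1.0
    return phi

rng = np.random.default_rng(0)
theta_true = rng.uniform(size=N_MACRO)     # random weights; the sampling distribution is our choice

def true_reward(x, y):
    return theta_true @ macrocell_features(x, y)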
Of particular interest is the ability of the seven IRL/AL algorithms to learn from a small amount of data. Thus, we illustrate the performance of the algorithms by varying the number of sample trajectories available for learning. Results are averaged over 5 repetitions and standard error bars are given. Note that in this and the following experiments, we use Boltzmann exploration policies to transform the reward functions computed by the IRL algorithms into policies when required.

Figure 3 shows the average reward accumulated by the policy computed by each algorithm as more trajectories are available for training. With 30 or more trajectories, MLIRL outperforms the other six. LPAL and LPIRL also perform well. An advantage of LPIRL over LPAL is that it returns a reward function, which makes it able to generalize over states that the expert has not visited during the demonstration trajectories. However, we observed that designing a policy indirectly through the reward function was less stable than optimizing the policy directly. It is interesting to note that MaxEnt lags behind in this setting. MaxEnt appears best suited for settings with very long demonstration trajectories, as opposed to the relatively short trajectories we used in this experiment.

Figure 4 shows that, for the most part, in this dataset, the better an algorithm does at assigning high probability to the observed trajectories, the more likely it is to obtain higher rewards.

5.2. Learning about Multiple Intentions—Grid World with Puddles

In our second experiment, we test the ability of our proposed EM approach, described in Section 4, to accurately cluster trajectories associated with multiple intentions.

We make use of a 5×5 discrete grid world shown in Figure 5 (Left). The world contains a start state, a goal state and patches in the middle indicating puddles. Furthermore, the world is characterized by three feature vectors: one for the goal, one for the puddles and another for the remaining states. For added expressive power, we also included the negations of the features in the set, thereby doubling the number of features to six.

We imagine data comes from two experts with different intentions. Expert 1 goes to the goal avoiding the puddles at all times, and Expert 2 goes to the goal completely ignoring the puddles. Sample trajectories from these experts are shown in Figure 5 (Left). Trajectory T1 was generated by Expert 1; T2 and T3, by Expert 2. This experiment used a total of N = 12 sample trajectories of varying lengths, 5 from Expert 1 and 7 from Expert 2. We initiated the EM algorithm by setting the value of K, the number of clusters, to 5 to allow some flexibility in clustering. We ran the clustering, then hand-identified the two experts. Figure 5 (Right) shows the algorithm’s estimates that the three trajectories, T1, T2 and T3, belong to Expert 1. The EM approach was able to successfully cluster all of the 12 trajectories in the manner described above: the unambiguous trajectories were accurately assigned to their clusters and the ambiguous ones were “properly” assigned to multiple clusters. Since we set the value of K = 5, EM produced 5 clusters. On analyzing these clusters, we found that the algorithm produced 2 unique policies along with 3 copies. Thus, EM correctly extracted the preferences of the experts using the input sample trajectories.

Figure 5. Left: Grid world showing the start states (grey), goal state (G), puddles and three sample trajectories. Right: Posterior probabilities of the three trajectories belonging to Expert 1.
[Figure: average reward vs. driving time in seconds for MLIRL.]

Figure 7. Simulated Highway Car.

[Figure: average value vs. driving time in seconds for EM + MLIRL, EM + Maximum Entropy, Online AL, and Single Expert AL.]

Figure 8. Value of the computed policy as a function of length of driving trajectories for three approaches to learning about multiple intentions.

The probability values were computed at intermediate steps during the 10 iterations of the EM algorithm. After the 1st iteration, EM estimated that T1 belongs to Expert 1 with high probability and T2 belongs to Expert 1 with very low probability (implying that it therefore belongs to Expert 2). It is interesting to note here that EM estimated that trajectory T3 belongs to Expert 1 with probability 0.3. The uncertainty indicates that T3 could belong to either Expert 1 or Expert 2.
[...] intentions (single expert AL). Figure 8 shows that the EM approach (with either MLIRL or Maximum Entropy) makes much better use of the available data and that mixing data from multiple experts is undesirable.

6. Conclusion and Future Work

We defined an extension of inverse reinforcement learning and apprenticeship learning in which the learner is provided with unlabeled example trajectories generated from a number of possible reward functions. Using these examples as a kind of background knowledge, a learner can more quickly infer and optimize reward functions for novel trajectories.

Having shown that an EM clustering approach can successfully infer individual intentions from a collection of unlabeled trajectories, we next intend to pursue using these learned intentions to predict the behavior of and better interact with other agents in multiagent environments.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.

Argall, Brenna, Browning, Brett, and Veloso, Manuela M. Automatic weight learning for multiple data sources when learning from demonstration. In Proceedings of the International Conference on Robotics and Automation, pp. 226–231, 2009.

Bilmes, Jeff A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, 1997.

Branavan, S. R. K., Chen, Harr, Zettlemoyer, Luke S., and Barzilay, Regina. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 82–90, 2009.

John, George H. When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1464, Seattle, WA, 1994.

Lopes, Manuel, Melo, Francisco S., and Montesano, Luis. Active learning for reward estimation in inverse reinforcement learning. In ECML/PKDD, pp. 31–46, 2009.

Moore, Adam B., Todd, Michael T., and Conway, Andrew R. A. A computational model of moral judgment. Poster at Psychonomics Society Meeting, 2009.

Neu, Gergely and Szepesvári, Csaba. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2007.

Neu, Gergely and Szepesvári, Csaba. Training parsers by inverse reinforcement learning. Machine Learning, 77(2–3):303–337, 2009.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In Proceedings of IJCAI, pp. 2586–2591, 2007.

Richardson, Matthew and Domingos, Pedro. Learning with knowledge from multiple experts. In Proceedings of the International Conference on Machine Learning, pp. 624–631, 2003.

Syed, Umar, Bowling, Michael, and Schapire, Robert E. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, pp. 1032–1039, 2008.

Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1433–1438, 2008.