Apprenticeship Learning About Multiple Intentions
MDP with respect to an unknown reward function. The goal in IRL is to find a proxy for the expert’s reward function. The goal in AL is to find a policy that performs well with respect to the expert’s reward function. As is common in earlier work, we focus on IRL as a means to solving AL problems. IRL is finding application in a broad range of problems, from inferring people’s moral values (Moore et al., 2009) to interpreting verbal instructions (Branavan et al., 2009).

We use the following notation: MDP\R (or MDP) is a tuple (S, A, T, γ), where S is the state space, A is the action space, the transition function T : S × A × S → [0, 1] gives the transition probabilities between states when actions are taken, and γ ∈ [0, 1) is a discount factor that weights the outcome of future actions versus present actions. We will assume the availability of a set of trajectories coming from expert agents taking actions in the MDP in the form D = {ξ1, ..., ξN}. A trajectory consists of a sequence of state-action pairs ξi = {(s1, a1), ...}.

Reward functions are parameterized by a vector of reward weights θ applied to a feature vector for each state-action pair φ(s, a). Thus, a reward function is written rθ(s, a) = θ^T φ(s, a). If the expert’s reward function is given by θE, the apprentice’s objective is to behave in a way that maximizes the discounted sum of expected future rewards with respect to rθE. However, the apprentice does not know θE and must use information from the observed trajectories to decide how to behave. It can, for example, hypothesize its own reward weights θA and behave accordingly.

IRL algorithms differ not just in their algorithmic approach but also in the objective function they seek to optimize (Neu & Szepesvári, 2009). In this work, we examined several existing algorithms for IRL/AL. In Projection (Abbeel & Ng, 2004), the objective is to make the features encountered by the apprentice’s policy match those of the expert. LPAL and MWAL (Syed et al., 2008) behave in such a way that they outperform the expert according to θA. Policy matching (Neu & Szepesvári, 2007) tries to make the actions taken by its policy as close as possible to those observed from the expert. Maximum Entropy IRL (Ziebart et al., 2008) defines a probability distribution over complete trajectories as a function of θA and produces the θA that maximizes the likelihood of the observed trajectories.

It is worth noting several approaches that we were not able to include in our comparisons. Bayesian IRL (Ramachandran & Amir, 2007) is a framework for estimating posterior probabilities over possible reward functions given the observed trajectories. It assumes that randomness is introduced into each decision made by the expert. In Active Learning (Lopes et al., 2009), transitions are provided dynamically. The apprentice queries the expert for additional examples in states where needed.

We devised two new IRL algorithms for our comparisons. The linear program that constitutes the optimization core of LPAL (Linear Programming Apprenticeship Learning) is a modified version of the standard LP dual for solving MDPs (Puterman, 1994). It has as its variables the “policy flow” and a minimum per-feature reward component. We note that taking the dual of this LP results in a modified version of the standard LP primal for solving MDPs. It has as its variables the value function and θA. Because it produces explicit reward weights instead of just behavior, we call this algorithm Linear Programming Inverse Reinforcement Learning (LPIRL). Because its behavior is defined indirectly by θA, it can produce slightly different answers from LPAL. Our second algorithm seeks to maximize the likelihood of the observed trajectories, as described in the next section.
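For concreteness, one plausible form of this optimization, adapted from the occupancy-measure linear program of Syed et al. (2008) and written in the notation above (the exact constraint set used by LPAL may differ in detail), is

    maximize B over B and λ(s, a) ≥ 0
    subject to   Σ_{s,a} λ(s, a) φ_k(s, a) − μ̂E,k ≥ B   for every feature k,
                 Σ_a λ(s, a) = α(s) + γ Σ_{s',a'} T(s', a', s) λ(s', a')   for every state s,

where λ is the “policy flow” (the discounted state-action occupancy), α is the start-state distribution, and μ̂E,k is the expert’s empirical discounted expectation of feature k; α and μ̂E are our notation, not the paper’s.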
3. Maximum Likelihood Inverse Reinforcement Learning (MLIRL)

We present a simple IRL algorithm we call Maximum Likelihood Inverse Reinforcement Learning (MLIRL). Like Bayesian IRL, it adopts a probability model that uses θA to create a value function and then assumes the expert randomizes at the level of individual action choices. Like Maximum Entropy IRL, it seeks a maximum likelihood model. Like Policy matching, it uses a gradient method to find optimal behavior. The resulting algorithm is quite simple and natural, but we have not seen it described explicitly.

To define the algorithm more formally, we start by detailing the process by which a hypothesized θA induces a probability distribution over action choices and thereby assigns a likelihood to the trajectories in D. First, θA provides the rewards from which discounted expected values are derived:

    QθA(s, a) = θA^T φ(s, a) + γ Σ_{s'} T(s, a, s') ⊞_{a'} QθA(s', a').

Here, the “max” in the standard Bellman equation is replaced with an operator ⊞ that blends values via Boltzmann exploration (John, 1994):

    ⊞_a Q(s, a) = Σ_a Q(s, a) e^{βQ(s,a)} / Σ_{a'} e^{βQ(s,a')}.

This approach makes the likelihood (infinitely) differentiable, although, in practice, other mappings could be used. In our work, we calculate these values via 100 iterations of value iteration and use β = 0.5.
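As a concrete illustration, the sketch below computes these Boltzmann-blended Q-values and the resulting log-likelihood of a set of trajectories for a candidate θA in a small tabular MDP; the array layout, function names, and use of plain NumPy are illustrative assumptions rather than the implementation used in our experiments.

import numpy as np

def soft_q_values(theta, phi, T, gamma=0.95, beta=0.5, iters=100):
    """Boltzmann-blended value iteration for a tabular MDP.

    phi: |S| x |A| x K feature array, T: |S| x |A| x |S| transition array.
    gamma is an illustrative choice; beta = 0.5 and 100 iterations follow the text.
    """
    r = phi @ theta                              # r_theta(s, a) = theta^T phi(s, a)
    Q = np.zeros_like(r)
    for _ in range(iters):
        w = np.exp(beta * Q)
        w /= w.sum(axis=1, keepdims=True)        # Boltzmann weights over actions
        blended = (w * Q).sum(axis=1)            # the blend operator applied at each state
        Q = r + gamma * T @ blended              # Bellman backup with max replaced by the blend
    return Q

def log_likelihood(theta, trajectories, phi, T, beta=0.5):
    """log Pr(D | theta): sum of log Boltzmann action probabilities along each trajectory."""
    Q = soft_q_values(theta, phi, T, beta=beta)
    log_pi = beta * Q - np.log(np.exp(beta * Q).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for xi in trajectories for (s, a) in xi)

A gradient method, as noted above, can then ascend this log-likelihood in θA.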
[...] of the intentions. An additional trajectory ξE is the test trajectory—the apprentice’s objective is to produce behavior that performs well with respect to θE, the reward weights that generated ξE. Many possible clustering algorithms could be applied [...] (Argall et al., 2009; Richardson & Domingos, 2003).

[...]
    = Σ_y Σ_{i=1..N} log(ρ_{y_i} Pr(ξi | θ_{y_i})) Π_{i'=1..N} Pr(y_{i'} | ξ_{i'}, Θ^t)
    = Σ_{y_1} ··· Σ_{y_N} Σ_{i=1..N} Σ_l δ_{l=y_i} log(ρ_l Pr(ξi | θ_l)) [...]
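A minimal sketch of the E-step/M-step loop behind these quantities, assuming the per-cluster reward weights θl are re-fit by a weighted IRL routine such as MLIRL and writing z[i, l] for the posterior probability that trajectory ξi was generated under cluster l; the helper signatures irl_fit and traj_log_lik are hypothetical.

import numpy as np

def em_cluster_trajectories(trajectories, K, irl_fit, traj_log_lik, iters=10, seed=0):
    """Alternate soft cluster assignments and per-cluster reward fits.

    irl_fit(trajectories, weights) -> theta      (e.g., a weighted MLIRL fit)
    traj_log_lik(xi, theta)        -> log Pr(xi | theta)
    """
    rng = np.random.default_rng(seed)
    N = len(trajectories)
    rho = np.full(K, 1.0 / K)                                   # cluster priors rho_l
    thetas = [irl_fit(trajectories, rng.dirichlet(np.ones(N)))  # random initial fits (our choice)
              for _ in range(K)]
    for _ in range(iters):                                      # e.g., 10 EM iterations (Section 5.2)
        # E-step: z[i, l] proportional to rho_l * Pr(xi_i | theta_l)
        log_z = np.array([[np.log(rho[l]) + traj_log_lik(xi, thetas[l])
                           for l in range(K)] for xi in trajectories])
        z = np.exp(log_z - log_z.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: update priors and re-fit each cluster's reward weights
        rho = z.mean(axis=0)
        thetas = [irl_fit(trajectories, z[:, l]) for l in range(K)]
    return rho, thetas, z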
4.2. Using Clusters for AL

The input of the EM method of the previous section is a set of trajectories D and a number of clusters K. The output is a set of K clusters. Associated with each cluster i are the reward weights θi, which induce a reward function rθi, and a cluster prior ρi. Next, we consider how to carry out AL on a new trajectory ξE under the assumption that it comes from the same population as the trajectories in D.

By Bayes rule, Pr(θi | ξE) = Pr(ξE | θi) Pr(θi) / Pr(ξE). Here, Pr(θi) = ρi and Pr(ξE | θi) is easily computable (z in Section 4.1). The quantity Pr(ξE) is a simple normalization factor. Thus, the apprentice can derive a probability distribution over reward functions given a trajectory (Ziebart et al., 2008). How should it behave? Let f^π(s, a) be the (weighted) fraction of the time policy π spends taking action a in state s. Then, with respect to reward function r, the value of policy π can be written Σ_{s,a} f^π(s, a) r(s, a). We should choose the policy with the highest expected reward:

    argmax_π Σ_i Pr(θi | ξE) Σ_{s,a} f^π(s, a) rθi(s, a).

Equivalently, because this objective is linear in the rewards, the apprentice can cluster the examples, use Bayes rule to figure out the probability that the current trajectory belongs in each cluster, create a merged reward function by combining the cluster reward functions using the derived probabilities, and finally compute a policy for the merged reward function to decide how to behave.
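The following sketch carries out this recipe, reusing a trajectory log-likelihood of the kind defined in Section 3; the helper names and the log-space normalization are illustrative rather than prescribed.

import numpy as np

def posterior_over_clusters(xi_E, thetas, rho, traj_log_lik):
    """Pr(theta_i | xi_E) proportional to rho_i * Pr(xi_E | theta_i), via Bayes rule."""
    log_post = np.array([np.log(rho[i]) + traj_log_lik(xi_E, thetas[i])
                         for i in range(len(thetas))])
    post = np.exp(log_post - log_post.max())     # subtract max for numerical stability
    return post / post.sum()                     # Pr(xi_E) is just this normalizer

def merged_reward_weights(xi_E, thetas, rho, traj_log_lik):
    """Combine the cluster reward weights using the derived posterior probabilities."""
    post = posterior_over_clusters(xi_E, thetas, rho, traj_log_lik)
    return post @ np.asarray(thetas)             # Sum_i Pr(theta_i | xi_E) * theta_i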
5. Experiments

Our experiments were designed to compare the performance of the MLIRL (Section 3) and LPIRL (Section 2) algorithms with five existing IRL/AL approaches summarized in Section 2. We compare these seven approaches in several ways to assess (a) how well they perform apprenticeship learning and (b) how well they function in the setting of learning about multiple intentions. We first look at their performance in a grid world with a single expert (single intention), a domain where a few existing approaches (Abbeel & Ng, 2004; Syed et al., 2008) have already been tested. Our second experiment, in a grid world with puddles, demonstrates the MLIRL algorithm as part of our EM approach (Section 4) to cluster trajectories from multiple intentions—each corresponding to a different reward function. Thirdly, we compare the performance of all the IRL/AL algorithms as part of the EM clustering approach in the simulated Highway Car domain (Abbeel & Ng, 2004; Syed et al., 2008), an infinite-horizon domain with stochastic transitions.

Our experiments used implementations of the MLIRL, LPIRL, Maximum Entropy IRL, LPAL, MWAL, Projection, and Policy Matching algorithms. We obtained implementations from the original authors wherever possible.

5.1. Learning from a Single Expert

[Figure: average value vs. number of sample trajectories for Optimal, MLIRL, LPAL, Policy Matching, LPIRL, Projection, MWAL, and Maximum Entropy.]

Figure 4. A plot of the average trajectory likelihood computed with increasing number of sample trajectories.

In this experiment, we tested the performance of each IRL/AL algorithm in a grid world environment
similar to one used by Abbeel & Ng (2004) and Syed et al. (2008). We use a grid of size 16×16. Movement of the agent is possible in the four compass directions, with each action having a 30% chance of causing a random transition. The grid is further subdivided into non-overlapping square regions, each of size 4×4. Using the same terminology as Abbeel & Ng (2004), we refer to the square regions as “macrocells”. The partitioning of the grid results in a total of 16 macrocells. Every cell in the gridworld is characterized by a 16-dimensional feature vector φ indicating, using a 0 or 1, which macrocell it belongs to. A random weight vector is chosen such that the true reward function just encodes that some macrocells are more desirable than others. The optimal policy π* is computed for the true reward function and the single expert trajectories are acquired by sampling π*. To maintain consistency across the algorithms, the start state is drawn from a fixed distribution and the lengths of the trajectories are truncated to 60 steps.
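The macrocell indicator features and the random true-reward weights can be constructed as in the sketch below; the cell indexing and the uniform sampling of the weights are illustrative choices rather than the exact ones used in the experiment.

import numpy as np

GRID, CELL = 16, 4                         # 16 x 16 grid split into 4 x 4 macrocells
N_MACRO = (GRID // CELL) ** 2              # 16 macrocell indicator features

def macrocell_features(x, y):
    """0/1 feature vector for grid cell (x, y): a single 1 marking its macrocell."""
    phi = np.zeros(N_MACRO)
    phi[(x // CELL) * (GRID // CELL) + (y // CELL)] = 1.0
    return phi

rng = np.random.default_rng(0)
theta_true = rng.uniform(size=N_MACRO)     # random weights; the sampling distribution is our choice

def true_reward(x, y):
    return theta_true @ macrocell_features(x, y)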
Of particular interest is the ability of the seven IRL/AL algorithms to learn from a small amount of data. Thus, we illustrate the performance of the algorithms by varying the number of sample trajectories available for learning. Results are averaged over 5 repetitions and standard error bars are given. Note that in this and the following experiments, we use Boltzmann exploration policies to transform the reward functions computed by the IRL algorithms into policies when required.

Figure 3 shows the average reward accumulated by the policy computed by each algorithm as more trajectories are available for training. With 30 or more trajectories, MLIRL outperforms the other six. LPAL and LPIRL also perform well. An advantage of LPIRL over LPAL is that it returns a reward function, which makes it able to generalize over states that the expert has not visited during the demonstration trajectories. However, we observed that designing a policy indirectly through the reward function was less stable than optimizing the policy directly. It is interesting to note that MaxEnt lags behind in this setting. MaxEnt appears best suited for settings with very long demonstration trajectories, as opposed to the relatively short trajectories we used in this experiment.

Figure 4 shows that, for the most part, in this dataset, the better an algorithm does at assigning high probability to the observed trajectories, the more likely it is to obtain higher rewards.

5.2. Learning about Multiple Intentions—Grid World with Puddles

In our second experiment, we test the ability of our proposed EM approach, described in Section 4, to accurately cluster trajectories associated with multiple intentions.

We make use of a 5×5 discrete grid world shown in Figure 5 (Left). The world contains a start state, a goal state and patches in the middle indicating puddles. Furthermore, the world is characterized by three feature vectors: one for the goal, one for the puddles and another for the remaining states. For added expressive power, we also included the negations of the features in the set, thereby doubling the number of features to six.

We imagine data comes from two experts with different intentions. Expert 1 goes to the goal avoiding the puddles at all times, and Expert 2 goes to the goal completely ignoring the puddles. Sample trajectories from these experts are shown in Figure 5 (Left). Trajectory T1 was generated by Expert 1; T2 and T3, by Expert 2. This experiment used a total of N = 12 sample trajectories of varying lengths, 5 from Expert 1 and 7 from Expert 2. We initiated the EM algorithm by setting the value of K, the number of clusters, to 5 to allow some flexibility in clustering. We ran the clustering, then hand-identified the two experts. Figure 5 (Right) shows the algorithm’s estimates that the three trajectories, T1, T2 and T3, belong to Expert 1. The EM approach was able to successfully cluster all of the 12 trajectories in the manner described above: the unambiguous trajectories were accurately assigned to their clusters and the ambiguous ones were “properly” assigned to multiple clusters. Since we set the value of K = 5, EM produced 5 clusters. On analyzing these clusters, we found that the algorithm produced 2 unique policies along with 3 copies. Thus, EM correctly extracted the preferences of the experts using the input sample trajectories.

Figure 5. Left: Grid world showing the start states (grey), goal state (G), puddles and three sample trajectories. Right: Posterior probabilities of the three trajectories belonging to Expert 1.
[Figure: average reward vs. driving time in seconds for MLIRL.]

Figure 7. Simulated Highway Car.

[Figure: average value vs. driving time in seconds for EM + MLIRL, EM + Maximum Entropy, Online AL, and Single Expert AL.]

Figure 8. Value of the computed policy as a function of length of driving trajectories for three approaches to learning about multiple intentions.

The probability values were computed at intermediate steps during the 10 iterations of the EM algorithm. After the 1st iteration, EM estimated that T1 belongs to Expert 1 with high probability and T2 belongs to Expert 1 with very low probability (implying that it therefore belongs to Expert 2). It is interesting to note here that EM estimated that trajectory T3 belongs to Expert 1 with probability 0.3. The uncertainty indicates that T3 could belong to either Expert 1 or Expert 2.
[...] intentions (single expert AL). Figure 8 shows that the EM approach (with either MLIRL or Maximum Entropy) makes much better use of the available data and that mixing data from multiple experts is undesirable.

6. Conclusion and Future Work

We defined an extension of inverse reinforcement learning and apprenticeship learning in which the learner is provided with unlabeled example trajectories generated from a number of possible reward functions. Using these examples as a kind of background knowledge, a learner can more quickly infer and optimize reward functions for novel trajectories.

Having shown that an EM clustering approach can successfully infer individual intentions from a collection of unlabeled trajectories, we next intend to pursue using these learned intentions to predict the behavior of and better interact with other agents in multiagent environments.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.

Argall, Brenna, Browning, Brett, and Veloso, Manuela M. Automatic weight learning for multiple data sources when learning from demonstration. In Proceedings of the International Conference on Robotics and Automation, pp. 226–231, 2009.

Bilmes, Jeff A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, 1997.

Branavan, S. R. K., Chen, Harr, Zettlemoyer, Luke S., and Barzilay, Regina. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 82–90, 2009.

John, George H. When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1464, Seattle, WA, 1994.

Lopes, Manuel, Melo, Francisco S., and Montesano, Luis. Active learning for reward estimation in inverse reinforcement learning. In ECML/PKDD, pp. 31–46, 2009.

Moore, Adam B., Todd, Michael T., and Conway, Andrew R. A. A computational model of moral judgment. Poster at Psychonomics Society Meeting, 2009.

Neu, Gergely and Szepesvári, Csaba. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2007.

Neu, Gergely and Szepesvári, Csaba. Training parsers by inverse reinforcement learning. Machine Learning, 77(2–3):303–337, 2009.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In Proceedings of IJCAI, pp. 2586–2591, 2007.

Richardson, Matthew and Domingos, Pedro. Learning with knowledge from multiple experts. In Proceedings of the International Conference on Machine Learning, pp. 624–631, 2003.

Syed, Umar, Bowling, Michael, and Schapire, Robert E. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, pp. 1032–1039, 2008.

Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, pp. 1433–1438, 2008.