This document discusses temporal abstraction in reinforcement learning using options. Options generalize actions by allowing temporally extended courses of action. An option is defined by its initiation set, policy, and termination condition. Using options defines a semi-Markov decision process that can be analyzed using methods like value iteration and Q-learning that have been generalized for SMDPs. The document outlines how options can provide temporal abstraction and how their internal structure and policies can be learned through methods like intra-option learning.


Temporal Abstraction in RL

How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations?

• HAMs (Parr & Russell 1998; Parr 1998)
• MAXQ (Dietterich 2000)
• Options framework (Sutton, Precup & Singh 1999; Precup 2000)
Outline

• Options
• MDP + options = SMDP
• SMDP methods
• Looking inside the options
Markov Decision Processes (MDPs)

• $S$: set of states of the environment
• $A(s)$: set of actions possible in state $s$, for all $s \in S$
• $P_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \forall s, s' \in S,\ a \in A(s)$
• $R_{ss'}^a = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \forall s, s' \in S,\ a \in A(s)$
• $\gamma$: discount rate

[Figure: agent-environment interaction generating the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \dots$]
Example

• Actions: North, East, South, West; each fails 33% of the time
• Reward: +1 for transitions into the goal state G, 0 otherwise
• $\gamma = 0.9$

[Figure: gridworld with goal state G]
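
To make the setup concrete, here is a minimal sketch of such a gridworld MDP in Python. The grid dimensions, the `step` helper, and the assumption that a failed action leaves the agent in place are illustrative choices made for this example; the slide does not specify what a failed action does.

```python
import random

ACTIONS = {"North": (-1, 0), "East": (0, 1), "South": (1, 0), "West": (0, -1)}
GAMMA = 0.9

def step(state, action, goal, walls, n_rows, n_cols, fail_prob=1 / 3):
    """Sample (next_state, reward) for one primitive action.

    Assumption: with probability 1/3 the action fails and the agent stays put;
    moves into walls or off the grid also leave the agent where it is.
    """
    row, col = state
    if random.random() >= fail_prob:  # the action succeeds 2/3 of the time
        dr, dc = ACTIONS[action]
        nxt = (row + dr, col + dc)
        if 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols and nxt not in walls:
            row, col = nxt
    reward = 1.0 if (row, col) == goal else 0.0  # +1 only for transitions into G
    return (row, col), reward
```
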
Options

• A generalization of actions
• Starting from a finite MDP, specify a way of choosing actions until termination
• Example: go-to-hallway
Markov options

A Markov option can be represented as a triple $o = \langle I, \pi, \beta \rangle$, where:

• $I \subseteq S$ is the set of states in which $o$ may be started
• $\pi : S \times A \to [0,1]$ is the policy followed during $o$
• $\beta : S \to [0,1]$ is the probability of terminating in each state
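
As a rough sketch of what this triple looks like in code, the class below represents a Markov option in Python. The name `MarkovOption` and the use of a deterministic policy function are simplifying assumptions of this example; the slides allow a stochastic policy $\pi : S \times A \to [0,1]$.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class MarkovOption:
    """A Markov option o = <I, pi, beta>."""
    initiation_set: Set[State]                  # I: states where o may be started
    policy: Callable[[State], Action]           # pi: deterministic simplification of pi(s, a)
    termination_prob: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```
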
Examples

• Dock-into-charger
  • I: all states in which the charger is in sight
  • π: pre-defined docking controller
  • β: terminate when docked or when the charger is no longer visible
• Open-the-door
  • I: all states in which a closed door is within reach
  • π: pre-defined controller for reaching, grasping, and turning the door knob
  • β: terminate when the door is open
One-step options

A primitive action $a \in \bigcup_{s \in S} A_s$ of the base MDP is also an option, called a one-step option:

• $I = \{s : a \in A_s\}$
• $\pi(s, a) = 1, \ \forall s \in I$
• $\beta(s) = 1, \ \forall s \in S$
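
Continuing the illustrative `MarkovOption` sketch above, a primitive action can be wrapped as a one-step option like this (the helper name `one_step_option` is hypothetical):

```python
def one_step_option(a: Action, states_with_a: Set[State]) -> MarkovOption:
    """Wrap primitive action a as a one-step option: always pick a, always terminate."""
    return MarkovOption(
        initiation_set=states_with_a,     # I = {s : a in A_s}
        policy=lambda s: a,               # pi(s, a) = 1 for all s in I
        termination_prob=lambda s: 1.0,   # beta(s) = 1 for all s
    )
```
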
Markov vs. Semi-Markov options

• Markov option: policy and termination condition depend only on the current state
• Semi-Markov option: policy and termination condition may depend on the entire history of states, actions, and rewards since the initiation of the option. Examples:
  • options that terminate after a pre-specified number of time steps
  • options that execute other options
Semi-Markov Options

Let $H$ be the set of possible histories (segments of experience)
$s_t, a_t, r_{t+1}, s_{t+1}, \dots, s_T$

A semi-Markov option may be represented as a triple $o = \langle I, \pi, \beta \rangle$, where:

• $I \subseteq S$ is the set of states in which $o$ may be started
• $\pi : H \times A \to [0,1]$ is the policy followed during $o$
• $\beta : H \to [0,1]$ is the probability of terminating after each history
Value functions for options

$Q^{\mu}(s,o) \overset{\text{def}}{=} E\{r_{t+1} + \gamma r_{t+2} + \dots \mid o \text{ initiated in } s \text{ at time } t,\ \mu \text{ followed after termination}\}$

$Q_O^{*}(s,o) \overset{\text{def}}{=} \max_{\mu \in \Pi(O)} Q^{\mu}(s,o)$

where $\Pi(O)$ is the set of all policies selecting only from options in $O$.
Options define a Semi-Markov Decision Process (SMDP)

[Figure: state trajectories plotted against time for each case]

• MDP: discrete time, homogeneous discount
• SMDP: continuous time, discrete events, interval-dependent discount
• Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

The result is a discrete-time SMDP overlaid on an MDP, which can be analyzed at either level.
SMDPs

• The amount of time between one decision and the next is a random variable $\tau$
• Transition probabilities: $P(s', \tau \mid s, a)$
• Bellman equations:

$V^{*}(s) = \max_{a \in A_s} \Big[ R(s,a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a) V^{*}(s') \Big]$

$Q^{*}(s,a) = R(s,a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a) \max_{a' \in A_{s'}} Q^{*}(s', a')$
Option models

$R_s^o = E\{r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\tau - 1} r_{t+\tau} \mid o \text{ initiated in state } s \text{ at time } t \text{ and lasting } \tau \text{ steps}\}$

$P_{ss'}^o = \sum_{\tau = 1}^{\infty} \gamma^{\tau} \, p(s', \tau)$

where $p(s', \tau)$ is the probability that $o$ terminates in $s'$ after $\tau$ steps when initiated in state $s$.

These models generalize the reward and transition probabilities of an MDP in such a way that one can write a generalized form of the Bellman optimality equations.
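
To illustrate what the two models mean operationally, the sketch below estimates $R_s^o$ and $P_{ss'}^o$ by Monte Carlo from sampled executions of an option. The function name and the `execute_option` environment hook are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict

def estimate_option_model(execute_option, s, o, gamma, n_samples=1000):
    """Monte Carlo estimate of the option model (R_s^o, P_ss'^o).

    execute_option(s, o) is an assumed hook that runs option o from state s and
    returns (rewards, s_terminal), where rewards = [r_{t+1}, ..., r_{t+tau}].
    """
    R_est = 0.0
    P_est = defaultdict(float)  # maps s' -> estimate of P_ss'^o (discount included)
    for _ in range(n_samples):
        rewards, s_term = execute_option(s, o)
        tau = len(rewards)
        R_est += sum(gamma**k * r for k, r in enumerate(rewards)) / n_samples
        P_est[s_term] += gamma**tau / n_samples
    return R_est, dict(P_est)
```

Note that, as in the definition above, $P_{ss'}^o$ folds the discount $\gamma^{\tau}$ into the "transition probability", so its values sum to less than 1 for options that last more than one step.
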
Bellman optimality equations

$V_O^{*}(s) = \max_{o \in O_s} \Big[ R(s,o) + \sum_{s'} P(s' \mid s, o) V_O^{*}(s') \Big]$

$Q_O^{*}(s,o) = R(s,o) + \sum_{s'} P(s' \mid s, o) \max_{o' \in O_{s'}} Q_O^{*}(s', o')$

These Bellman optimality equations can be solved, exactly or approximately, using methods that generalize the usual DP and RL algorithms.
DP backups

$V_{k+1}(s) = \max_{o \in O_s} \Big[ R(s,o) + \sum_{s'} P(s' \mid s, o) V_k(s') \Big]$

$Q_{k+1}(s,o) = R(s,o) + \sum_{s'} P(s' \mid s, o) \max_{o' \in O_{s'}} Q_k(s', o')$
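
A minimal sketch of one synchronous value-iteration sweep with options, assuming the option models are stored as dictionaries `R[(s, o)]` and `P[(s, o)]` (the latter mapping each s' to its discounted termination probability); these data-structure choices are made up for the example.

```python
def value_iteration_sweep(V, options_in, R, P):
    """One synchronous DP backup over all states using option models.

    V:           dict state -> current estimate V_k(s)
    options_in:  dict state -> iterable of options available in s (O_s)
    R[(s, o)]:   expected discounted reward while o runs from s (R_s^o)
    P[(s, o)]:   dict s' -> discounted termination probability (P_ss'^o)
    No explicit gamma appears here because the option model already folds it in.
    """
    return {
        s: max(
            R[(s, o)] + sum(prob * V[s2] for s2, prob in P[(s, o)].items())
            for o in options_in[s]
        )
        for s in V
    }
```
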
Illustration: Rooms Example

[Figure: four-room gridworld with hallway states and goal G. Eight multi-step options, two per room, each leading to one of the room's two hallways. Goal states are given a terminal value of 1.]
Synchronous value iteration

[Figure: value functions at iterations #0, #1, and #2, with V(goal) = 1, shown (a) with cell-to-cell primitive actions and (b) with room-to-room options.]
SMDP Q-learning backups

[Figure: each backup spans from the end of one option to the beginning of the next.]

• At state s, initiate option o and execute it until termination
• Observe the termination state s', the number of steps τ, and the discounted return r accumulated during the option
• Update:

$Q_{k+1}(s,o) = (1 - \alpha_k) Q_k(s,o) + \alpha_k \Big[ r + \gamma^{\tau} \max_{o' \in O_{s'}} Q_k(s', o') \Big]$
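
A sketch of this backup as a Python function, using a Q-table keyed by (state, option) pairs; the argument names are illustrative.

```python
def smdp_q_update(Q, s, o, disc_return, tau, s_next, options_in, alpha, gamma):
    """One SMDP Q-learning backup after option o, initiated in s, terminates in s_next.

    disc_return: reward discounted within the option,
                 r_{t+1} + gamma * r_{t+2} + ... + gamma**(tau-1) * r_{t+tau}
    tau:         number of time steps the option lasted
    """
    target = disc_return + gamma**tau * max(Q[(s_next, o2)] for o2 in options_in[s_next])
    Q[(s, o)] = (1 - alpha) * Q[(s, o)] + alpha * target
```
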
Looking inside options

"SMDP methods apply to options, but only when they are treated as opaque indivisible units. Once an option has been selected, such methods require that its policy be followed until the option terminates. More interesting and potentially more powerful methods are possible by looking inside options and by altering their internal structure."
—Precup (2000)
Intra-option Q-learning

On every transition $s_t, a_t, r_{t+1}, s_{t+1}$, update every Markov option $o$ whose policy could have selected $a_t$ according to the same distribution $\pi(s_t, \cdot)$:

$Q_{k+1}(s_t, o) = (1 - \alpha_k) Q_k(s_t, o) + \alpha_k \big[ r_{t+1} + \gamma U_k(s_{t+1}, o) \big],$

where

$U_k(s, o) = (1 - \beta(s)) Q_k(s, o) + \beta(s) \max_{o' \in O} Q_k(s, o')$

is an estimate of the value of the state-option pair $(s, o)$ upon arrival in state $s$.
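
The sketch below applies this backup to every consistent option after a single environment transition, reusing the illustrative `MarkovOption` class from earlier. For simplicity it assumes deterministic option policies, so "could have selected $a_t$" reduces to an equality check; with stochastic policies the consistency condition from the slides would be needed instead.

```python
def intra_option_q_update(Q, options, s, a, r, s_next, alpha, gamma):
    """Intra-option Q-learning backup for one transition (s, a, r, s').

    options: list of (name, MarkovOption) pairs; Q is keyed by (state, name).
    """
    for name, option in options:
        if option.policy(s) != a:
            continue  # this option's (deterministic) policy would not have chosen a
        beta = option.termination_prob(s_next)
        # U_k(s', o): value of arriving in s' while executing o
        u = (1 - beta) * Q[(s_next, name)] + beta * max(Q[(s_next, n2)] for n2, _ in options)
        Q[(s, name)] = (1 - alpha) * Q[(s, name)] + alpha * (r + gamma * u)
```
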
Illustration: Intra-option Q-learning

[Figure: rooms task with random start states and the goal in the right hallway; actions are chosen 90% greedily from both primitive actions and options. Learned value functions, with V(goal) = 1, are shown with cell-to-cell primitive actions and with room-to-room options.]
Summary

[Figure: state trajectories plotted against time for each case]

• MDP: discrete time, homogeneous discount
• SMDP: continuous time, discrete events, interval-dependent discount
• Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP; it can be analyzed at either level.
What else?

• Intra-option learning of option models
• Early termination of options
• Improving an option's policy (given the option's reward function)
• Learning option policies given useful subgoals to reach (e.g., the hallways in the example problem)
Which states are useful subgoals?

States that …
• have a high reward gradient or are visited frequently (Digney 1998)
• are visited frequently only on successful trajectories (McGovern & Barto 2001)
• change the value of certain variables (Hengst 2002; Barto et al. 2004; Jonsson & Barto 2005)
• lie between densely connected regions (Menache et al. 2002; Mannor et al. 2004; Simsek & Barto 2004; Simsek, Wolfe & Barto 2005)
References

• D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000.
• R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
• A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341-379, October 2003.
