TA Lecture 2
Options
MDP + options = SMDP
SMDP methods
Looking inside the options
Markov Decision Processes (MDPs)
Options: a generalization of actions
Starting from a finite MDP, an option specifies a way of choosing actions until termination.
Example: go-to-hallway
Markov options
A Markov option is a triple o = \langle I, \pi, \beta \rangle:
I \subseteq S : the initiation set, the states in which the option may be started
\pi : S \times A \to [0,1] : the option's policy over primitive actions
\beta : S \to [0,1] : the termination condition, the probability of terminating in each state
Examples
Dock-into-charger
I : all states in which charger is in sight
π : pre-defined controller
β : terminate when docked or charger not visible
Open-the-door
I : all states in which a closed door is within reach
π : pre-defined controller for reaching, grasping, and
turning the door knob
β : terminate when the door is open
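As a concrete sketch (an assumption of this write-up, not code from the original slides), a Markov option can be stored directly as its three components; the class and type names below are illustrative.

from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable    # e.g. a grid cell (row, col); illustrative alias
Action = Hashable   # e.g. "up", "down", "dock"; illustrative alias

@dataclass
class MarkovOption:
    """A Markov option o = <I, pi, beta>."""
    initiation_set: Set[State]              # I: states where the option may be started
    policy: Callable[[State], Action]       # pi: deterministic controller (a simplification of pi : S x A -> [0,1])
    termination: Callable[[State], float]   # beta: probability of terminating in a given state

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set

For instance, dock-into-charger would use the states where the charger is in sight as initiation_set, the pre-defined docking controller as policy, and a termination function returning 1.0 once docked or the charger is out of view.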
One-Step options
Each primitive action a is a special case: a one-step option with
I = \{ s : a \in A_s \}, \pi(s, a) = 1, and \beta(s) = 1 for all s,
so it is available wherever a is available and always terminates after one step.
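Using the MarkovOption sketch above (again an illustrative assumption, not the slides' code), a primitive action becomes a one-step option like this:

def primitive_option(a: Action, states_with_a: Set[State]) -> MarkovOption:
    # I = {s : a is available in s}; pi always selects a; beta = 1 everywhere,
    # so the option lasts exactly one step.
    return MarkovOption(
        initiation_set=states_with_a,
        policy=lambda s: a,
        termination=lambda s: 1.0,
    )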
Markov vs. Semi-Markov options
A Markov option's policy and termination condition depend only on the current state.
A semi-Markov option may depend on the entire history since the option was initiated
(for example, on the number of steps elapsed), which is needed for behaviors such as timeouts.
Value functions for options
Q^\mu(s, o) \;\doteq\; E\{\, r_{t+1} + \gamma r_{t+2} + \cdots \mid o \text{ initiated in } s \text{ at time } t,\ \mu \text{ followed after termination} \,\}

Q^*_O(s, o) \;\doteq\; \max_{\mu \in \Pi(O)} Q^\mu(s, o)

where \Pi(O) is the set of all policies selecting only from options in O.
Options define a Semi-Markov Decision Process (SMDP)

[Figure: time lines comparing an MDP, an SMDP, and options overlaid on an MDP]
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount
They generalize the reward and transition probabilities of an
MDP in such a way that one can write a generalized form of
the Bellman optimality equations.
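Concretely, these are the standard multi-time option models of the options framework; the rendering below is one consistent notation for them:

R(s, o) \;=\; E\bigl\{\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\tau-1} r_{t+\tau} \,\bigm|\, o \text{ initiated in } s \text{ at time } t \,\bigr\}

P(s' \mid s, o) \;=\; \sum_{\tau=1}^{\infty} \gamma^{\tau}\, \Pr\bigl\{\, o \text{ terminates in } s' \text{ after } \tau \text{ steps} \,\bigm|\, o \text{ initiated in } s \,\bigr\}

Because the interval-dependent discount \gamma^{\tau} is folded into P, the Bellman equations below keep exactly the MDP form.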
Bellman optimality equations
V^*_O(s) \;=\; \max_{o \in O_s} \Bigl[\, R(s, o) + \sum_{s'} P(s' \mid s, o)\, V^*_O(s') \,\Bigr]

V_{k+1}(s) \;=\; \max_{o \in O_s} \Bigl[\, R(s, o) + \sum_{s'} P(s' \mid s, o)\, V_k(s') \,\Bigr]

Q_{k+1}(s, o) \;=\; R(s, o) + \sum_{s'} P(s' \mid s, o)\, \max_{o' \in O_{s'}} Q_k(s', o')
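A minimal sketch of the synchronous value-iteration update above, assuming the option models R(s, o) and P(s' | s, o) are given as plain dictionaries (the function name and data layout are assumptions of this sketch):

def smdp_value_iteration(states, options_in, R, P, sweeps=50):
    """Synchronous value iteration over options.

    states     : iterable of states
    options_in : dict s -> iterable of options o available in s (O_s)
    R          : dict (s, o) -> expected discounted reward while o runs from s
    P          : dict (s, o) -> {s': discounted transition probability}
                 (the gamma^tau discount is already folded into P,
                  so no separate discount factor appears below)
    """
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {
            s: max(
                (R[(s, o)] + sum(p * V[s2] for s2, p in P[(s, o)].items())
                 for o in options_in[s]),
                default=V[s],   # states with no options keep their value (e.g. terminal states)
            )
            for s in states
        }
    return V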
Illustration: Rooms Example
8 multi-step options, two per room, each taking the agent to one of the room's 2 hallways (labeled O1 and O2).
Goal states are given a terminal value of 1.
[Figure: rooms gridworld with goal G; synchronous value iteration with cell-to-cell primitive actions vs. with room-to-room options, V(goal) = 1 in both cases]
Looking inside options
On every transition (s_t, a_t, r_{t+1}, s_{t+1}), update every option whose policy could have selected a_t in s_t, not only the option currently being executed:

Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \bigl[\, r_{t+1} + \gamma\, U(s_{t+1}, o) - Q(s_t, o) \,\bigr]

where U(s, o) = (1 - \beta(s))\, Q(s, o) + \beta(s) \max_{o' \in O_s} Q(s, o') is the value of arriving in s while executing o.
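A minimal sketch of this update for options with deterministic policies, reusing the MarkovOption sketch from earlier (the step size alpha, discount gamma, and dictionary layout of Q are assumptions):

def intra_option_q_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One intra-option Q-learning update for the transition (s, a, r, s_next).

    Q       : dict (state, option_index) -> value estimate
    options : list of MarkovOption objects with deterministic policies
    """
    def U(s2, i):
        # Value of arriving in s2 while executing option i: continue with
        # probability 1 - beta(s2), otherwise terminate and switch to the
        # best option (here: max over all options, for simplicity).
        beta = options[i].termination(s2)
        best = max(Q.get((s2, j), 0.0) for j in range(len(options)))
        return (1.0 - beta) * Q.get((s2, i), 0.0) + beta * best

    for i, o in enumerate(options):
        # Update every option whose policy would have taken action a in s.
        if o.can_initiate(s) and o.policy(s) == a:
            q_old = Q.get((s, i), 0.0)
            target = r + gamma * U(s_next, i)
            Q[(s, i)] = q_old + alpha * (target - q_old)

Each primitive transition can update many options at once, including options that were never executed.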
Illustration: Intra-option Q-learning
[Figure: intra-option Q-learning value estimates at Iteration #0, #1, #2, with cell-to-cell primitive actions and with room-to-room options; V(goal) = 1]
Summary
[Figure: time lines comparing an MDP, an SMDP, and options overlaid on an MDP]
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount
Option discovery: proposed subgoals are states that …
have a high reward gradient or are visited frequently
(Digney 1998)
are visited frequently only on successful trajectories
(McGovern & Barto 2001)
change the value of certain variables
(Hengst 2002; Barto et al. 2004; Jonsson & Barto 2005)
lie between densely connected regions
(Menache et al. 2002; Mannor et al. 2004; Simsek & Barto 2004;
Simsek, Wolfe & Barto 2005)
References