This document discusses temporal abstraction in reinforcement learning using options. Options generalize actions by allowing temporally extended courses of action. An option is defined by its initiation set, policy, and termination condition. Using options defines a semi-Markov decision process that can be analyzed using methods like value iteration and Q-learning that have been generalized for SMDPs. The document outlines how options can provide temporal abstraction and how their internal structure and policies can be learned through methods like intra-option learning.


Temporal Abstraction in RL

How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations?

• HAMs (Parr & Russell 1998; Parr 1998)
• MAXQ (Dietterich 2000)
• Options framework (Sutton, Precup & Singh 1999; Precup 2000)
Outline

• Options
• MDP + options = SMDP
• SMDP methods
• Looking inside the options
Markov Decision Processes (MDPs)

• $S$: set of states of the environment
• $A(s)$: set of actions possible in state $s$, for all $s \in S$
• $P_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \forall s, s' \in S,\ a \in A(s)$
• $R_{ss'}^a = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \forall s, s' \in S,\ a \in A(s)$
• $\gamma$: discount rate

[Figure: agent-environment interaction generating the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \dots$]
Example

• Actions: North, East, South, West; each fails 33% of the time
• Reward: +1 for transitions into the goal state G, 0 otherwise
• $\gamma = 0.9$

[Figure: gridworld with goal state G]
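
To make the setup concrete, here is a minimal sketch of such a gridworld MDP in Python. The grid dimensions, the `step` helper, and the assumption that a failed action leaves the agent in place are illustrative choices made for this example; the slide does not specify what a failed action does.

```python
import random

ACTIONS = {"North": (-1, 0), "East": (0, 1), "South": (1, 0), "West": (0, -1)}
GAMMA = 0.9

def step(state, action, goal, walls, n_rows, n_cols, fail_prob=1 / 3):
    """Sample (next_state, reward) for one primitive action.

    Assumption: with probability 1/3 the action fails and the agent stays put;
    moves into walls or off the grid also leave the agent where it is.
    """
    row, col = state
    if random.random() >= fail_prob:  # the action succeeds 2/3 of the time
        dr, dc = ACTIONS[action]
        nxt = (row + dr, col + dc)
        if 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols and nxt not in walls:
            row, col = nxt
    reward = 1.0 if (row, col) == goal else 0.0  # +1 only for transitions into G
    return (row, col), reward
```
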
Options

• A generalization of actions
• Starting from a finite MDP, specify a way of choosing actions until termination
• Example: go-to-hallway
Markov options

A Markov option can be represented as a triple $o = \langle I, \pi, \beta \rangle$, where:

• $I \subseteq S$ is the set of states in which $o$ may be started
• $\pi : S \times A \to [0,1]$ is the policy followed during $o$
• $\beta : S \to [0,1]$ is the probability of terminating in each state
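
As a rough sketch of what this triple looks like in code, the class below represents a Markov option in Python. The name `MarkovOption` and the use of a deterministic policy function are simplifying assumptions of this example; the slides allow a stochastic policy $\pi : S \times A \to [0,1]$.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class MarkovOption:
    """A Markov option o = <I, pi, beta>."""
    initiation_set: Set[State]                  # I: states where o may be started
    policy: Callable[[State], Action]           # pi: deterministic simplification of pi(s, a)
    termination_prob: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```
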
Examples

• Dock-into-charger
  • I: all states in which the charger is in sight
  • π: pre-defined docking controller
  • β: terminate when docked or when the charger is no longer visible
• Open-the-door
  • I: all states in which a closed door is within reach
  • π: pre-defined controller for reaching, grasping, and turning the door knob
  • β: terminate when the door is open
One-step options

A primitive action $a \in \bigcup_{s \in S} A_s$ of the base MDP is also an option, called a one-step option:

• $I = \{s : a \in A_s\}$
• $\pi(s, a) = 1, \ \forall s \in I$
• $\beta(s) = 1, \ \forall s \in S$
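
Continuing the illustrative `MarkovOption` sketch above, a primitive action can be wrapped as a one-step option like this (the helper name `one_step_option` is hypothetical):

```python
def one_step_option(a: Action, states_with_a: Set[State]) -> MarkovOption:
    """Wrap primitive action a as a one-step option: always pick a, always terminate."""
    return MarkovOption(
        initiation_set=states_with_a,     # I = {s : a in A_s}
        policy=lambda s: a,               # pi(s, a) = 1 for all s in I
        termination_prob=lambda s: 1.0,   # beta(s) = 1 for all s
    )
```
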
Markov vs. Semi-Markov options

• Markov option: policy and termination condition depend only on the current state
• Semi-Markov option: policy and termination condition may depend on the entire history of states, actions, and rewards since the initiation of the option. Examples:
  • options that terminate after a pre-specified number of time steps
  • options that execute other options
Semi-Markov Options

Let $H$ be the set of possible histories (segments of experience)
$s_t, a_t, r_{t+1}, s_{t+1}, \dots, s_T$

A semi-Markov option may be represented as a triple $o = \langle I, \pi, \beta \rangle$, where:

• $I \subseteq S$ is the set of states in which $o$ may be started
• $\pi : H \times A \to [0,1]$ is the policy followed during $o$
• $\beta : H \to [0,1]$ is the probability of terminating after each history
Value functions for options

$Q^{\mu}(s,o) \overset{\text{def}}{=} E\{r_{t+1} + \gamma r_{t+2} + \dots \mid o \text{ initiated in } s \text{ at time } t,\ \mu \text{ followed after termination}\}$

$Q_O^{*}(s,o) \overset{\text{def}}{=} \max_{\mu \in \Pi(O)} Q^{\mu}(s,o)$

where $\Pi(O)$ is the set of all policies selecting only from options in $O$.
Options define a Semi-Markov Decision Process (SMDP)

[Figure: state trajectories plotted against time for each case]

• MDP: discrete time, homogeneous discount
• SMDP: continuous time, discrete events, interval-dependent discount
• Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

The result is a discrete-time SMDP overlaid on an MDP, which can be analyzed at either level.
SMDPs

• The amount of time between one decision and the next is a random variable $\tau$
• Transition probabilities: $P(s', \tau \mid s, a)$
• Bellman equations:

$V^{*}(s) = \max_{a \in A_s} \Big[ R(s,a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a) V^{*}(s') \Big]$

$Q^{*}(s,a) = R(s,a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a) \max_{a' \in A_{s'}} Q^{*}(s', a')$
Option models

$R_s^o = E\{r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\tau - 1} r_{t+\tau} \mid o \text{ initiated in state } s \text{ at time } t \text{ and lasting } \tau \text{ steps}\}$

$P_{ss'}^o = \sum_{\tau = 1}^{\infty} \gamma^{\tau} \, p(s', \tau)$

where $p(s', \tau)$ is the probability that $o$ terminates in $s'$ after $\tau$ steps when initiated in state $s$.

These models generalize the reward and transition probabilities of an MDP in such a way that one can write a generalized form of the Bellman optimality equations.
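
To illustrate what the two models mean operationally, the sketch below estimates $R_s^o$ and $P_{ss'}^o$ by Monte Carlo from sampled executions of an option. The function name and the `execute_option` environment hook are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict

def estimate_option_model(execute_option, s, o, gamma, n_samples=1000):
    """Monte Carlo estimate of the option model (R_s^o, P_ss'^o).

    execute_option(s, o) is an assumed hook that runs option o from state s and
    returns (rewards, s_terminal), where rewards = [r_{t+1}, ..., r_{t+tau}].
    """
    R_est = 0.0
    P_est = defaultdict(float)  # maps s' -> estimate of P_ss'^o (discount included)
    for _ in range(n_samples):
        rewards, s_term = execute_option(s, o)
        tau = len(rewards)
        R_est += sum(gamma**k * r for k, r in enumerate(rewards)) / n_samples
        P_est[s_term] += gamma**tau / n_samples
    return R_est, dict(P_est)
```

Note that, as in the definition above, $P_{ss'}^o$ folds the discount $\gamma^{\tau}$ into the "transition probability", so its values sum to less than 1 for options that last more than one step.
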
Bellman optimality equations

$V_O^{*}(s) = \max_{o \in O_s} \Big[ R(s,o) + \sum_{s'} P(s' \mid s, o) V_O^{*}(s') \Big]$

$Q_O^{*}(s,o) = R(s,o) + \sum_{s'} P(s' \mid s, o) \max_{o' \in O_{s'}} Q_O^{*}(s', o')$

These Bellman optimality equations can be solved, exactly or approximately, using methods that generalize the usual DP and RL algorithms.
DP backups

$V_{k+1}(s) = \max_{o \in O_s} \Big[ R(s,o) + \sum_{s'} P(s' \mid s, o) V_k(s') \Big]$

$Q_{k+1}(s,o) = R(s,o) + \sum_{s'} P(s' \mid s, o) \max_{o' \in O_{s'}} Q_k(s', o')$
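
A minimal sketch of one synchronous value-iteration sweep with options, assuming the option models are stored as dictionaries `R[(s, o)]` and `P[(s, o)]` (the latter mapping each s' to its discounted termination probability); these data-structure choices are made up for the example.

```python
def value_iteration_sweep(V, options_in, R, P):
    """One synchronous DP backup over all states using option models.

    V:           dict state -> current estimate V_k(s)
    options_in:  dict state -> iterable of options available in s (O_s)
    R[(s, o)]:   expected discounted reward while o runs from s (R_s^o)
    P[(s, o)]:   dict s' -> discounted termination probability (P_ss'^o)
    No explicit gamma appears here because the option model already folds it in.
    """
    return {
        s: max(
            R[(s, o)] + sum(prob * V[s2] for s2, prob in P[(s, o)].items())
            for o in options_in[s]
        )
        for s in V
    }
```
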
Illustration: Rooms Example

[Figure: four-room gridworld with hallway states and goal G. Eight multi-step options, two per room, each leading to one of the room's two hallways. Goal states are given a terminal value of 1.]
Synchronous value iteration

[Figure: value functions at iterations #0, #1, and #2, with V(goal) = 1, shown (a) with cell-to-cell primitive actions and (b) with room-to-room options.]
SMDP Q-learning backups

[Figure: each backup spans from the end of one option to the beginning of the next.]

• At state s, initiate option o and execute it until termination
• Observe the termination state s', the number of steps τ, and the discounted return r accumulated during the option
• Update:

$Q_{k+1}(s,o) = (1 - \alpha_k) Q_k(s,o) + \alpha_k \Big[ r + \gamma^{\tau} \max_{o' \in O_{s'}} Q_k(s', o') \Big]$
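
A sketch of this backup as a Python function, using a Q-table keyed by (state, option) pairs; the argument names are illustrative.

```python
def smdp_q_update(Q, s, o, disc_return, tau, s_next, options_in, alpha, gamma):
    """One SMDP Q-learning backup after option o, initiated in s, terminates in s_next.

    disc_return: reward discounted within the option,
                 r_{t+1} + gamma * r_{t+2} + ... + gamma**(tau-1) * r_{t+tau}
    tau:         number of time steps the option lasted
    """
    target = disc_return + gamma**tau * max(Q[(s_next, o2)] for o2 in options_in[s_next])
    Q[(s, o)] = (1 - alpha) * Q[(s, o)] + alpha * target
```
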
Looking inside options

"SMDP methods apply to options, but only when they are treated as opaque indivisible units. Once an option has been selected, such methods require that its policy be followed until the option terminates. More interesting and potentially more powerful methods are possible by looking inside options and by altering their internal structure."
—Precup (2000)
Intra-option Q-learning

On every transition $s_t, a_t, r_{t+1}, s_{t+1}$, update every Markov option $o$ whose policy could have selected $a_t$ according to the same distribution $\pi(s_t, \cdot)$:

$Q_{k+1}(s_t, o) = (1 - \alpha_k) Q_k(s_t, o) + \alpha_k \big[ r_{t+1} + \gamma U_k(s_{t+1}, o) \big],$

where

$U_k(s, o) = (1 - \beta(s)) Q_k(s, o) + \beta(s) \max_{o' \in O} Q_k(s, o')$

is an estimate of the value of the state-option pair $(s, o)$ upon arrival in state $s$.
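
The sketch below applies this backup to every consistent option after a single environment transition, reusing the illustrative `MarkovOption` class from earlier. For simplicity it assumes deterministic option policies, so "could have selected $a_t$" reduces to an equality check; with stochastic policies the consistency condition from the slides would be needed instead.

```python
def intra_option_q_update(Q, options, s, a, r, s_next, alpha, gamma):
    """Intra-option Q-learning backup for one transition (s, a, r, s').

    options: list of (name, MarkovOption) pairs; Q is keyed by (state, name).
    """
    for name, option in options:
        if option.policy(s) != a:
            continue  # this option's (deterministic) policy would not have chosen a
        beta = option.termination_prob(s_next)
        # U_k(s', o): value of arriving in s' while executing o
        u = (1 - beta) * Q[(s_next, name)] + beta * max(Q[(s_next, n2)] for n2, _ in options)
        Q[(s, name)] = (1 - alpha) * Q[(s, name)] + alpha * (r + gamma * u)
```
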
Illustration: Intra-option Q-learning

[Figure: rooms task with random start states and the goal in the right hallway; actions are chosen 90% greedily from both primitive actions and options. Learned value functions, with V(goal) = 1, are shown with cell-to-cell primitive actions and with room-to-room options.]
Summary

[Figure: state trajectories plotted against time for each case]

• MDP: discrete time, homogeneous discount
• SMDP: continuous time, discrete events, interval-dependent discount
• Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP; it can be analyzed at either level.
What else?

• Intra-option learning of option models
• Early termination of options
• Improving an option's policy (given the option's reward function)
• Learning option policies given useful subgoals to reach (e.g., the hallways in the example problem)
Which states are useful subgoals?

States that …
• have a high reward gradient or are visited frequently (Digney 1998)
• are visited frequently only on successful trajectories (McGovern & Barto 2001)
• change the value of certain variables (Hengst 2002; Barto et al. 2004; Jonsson & Barto 2005)
• lie between densely connected regions (Menache et al. 2002; Mannor et al. 2004; Simsek & Barto 2004; Simsek, Wolfe & Barto 2005)
References

• D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000.
• R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
• A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341-379, October 2003.
