RL Theory Tutorial
Satinder Singh
http://www.eecs.umich.edu/~baveja/ICML06Tutorial/
Outline
• What is RL?
• Markov Decision Processes (MDPs)
• Planning in MDPs
• Learning in MDPs
RL is like Life!
[Figure: agent-environment loop — the agent acts on the environment; the environment returns a perception and a reward]
• complete agent
• temporally situated
• continual learning and planning
• object is to affect the environment
• environment is stochastic and uncertain
RL (another view)
2. Unsupervised Learning
• learning approaches to dimensionality reduction, density
estimation, recoding data based on some principle, etc.
3. Reinforcement Learning
• learning approaches to sequential decision making
• learning from a critic, learning from delayed reward
Some Key Ideas in RL
• Temporal Differences (or updating a guess on the
basis of another guess)
• Eligibility traces
• Off-policy learning
• Function approximation for RL
• Hierarchical RL (options)
• Going beyond MDPs/POMDPs towards AI
Model of Agent-Environment Interaction
Model?
• Discrete time
• Discrete observations
• Discrete actions
Markov Decision Processes
(MDPs)
Markov Assumption:
P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1}, r_{t+1} | s_t, a_t)
(the next state and reward depend only on the current state and action)
MDP Preliminaries
An MDP is given by states S, actions A, transition probabilities P(s'|s,a), and rewards r(s,a).
Discounted framework: return = r_1 + γ r_2 + γ² r_3 + ⋯, with discount factor 0 ≤ γ < 1.
Markov assumption!
Bellman Optimality Equations (Optimal Control)
V*(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
Q*(s,a) = r(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a')
Graphical View of MDPs
[Figure: a trajectory of states linked by actions — state, action, state, action, …]
Learning from Delayed Reward
• Distinguishes RL from other forms of ML
Planning & Learning
in
MDPs
Planning in MDPs (Policy Evaluation)
• Given an exact model (i.e., reward function, transition probabilities), and a fixed policy π
Arbitrary initialization: V_0
Iterate: V_{k+1}(s) = r(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V_k(s')
Stopping criterion: max_s |V_{k+1}(s) − V_k(s)| < ε
(a Python sketch follows below)
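A minimal sketch of this iterative policy evaluation, assuming a tabular model given as arrays P[s, a, s'] and R[s, a] and a deterministic policy pi[s] (these names are illustrative, not from the slides):

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation for a tabular MDP.
    P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A);
    pi: deterministic policy, an array of action indices, shape (S,)."""
    S = P.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)                                   # arbitrary initialization V_0
    while True:
        # V_{k+1}(s) = r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_k(s')
        V_new = R[idx, pi] + gamma * (P[idx, pi] @ V)
        if np.max(np.abs(V_new - V)) < eps:           # stopping criterion
            return V_new
        V = V_new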
Planning in MDPs (Policy Evaluation, action values)
Given an exact model (i.e., reward function, transition probabilities), and a fixed policy π
Arbitrary initialization: Q_0
Iterate: Q_{k+1}(s,a) = r(s,a) + γ Σ_{s'} P(s'|s,a) Q_k(s', π(s'))
Stopping criterion: max_{s,a} |Q_{k+1}(s,a) − Q_k(s,a)| < ε
Planning in MDPs
Given an exact model (i.e., reward function, transition probabilities)
Value Iteration (Optimal Control)
For k = 0,1,2,...
  V_{k+1}(s) = max_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ]
Stopping criterion: max_s |V_{k+1}(s) − V_k(s)| < ε
(a Python sketch follows below)
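A minimal sketch of value iteration under the same illustrative tabular-model conventions (P[s, a, s'], R[s, a]) as above:

import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Value iteration for a tabular MDP with model P (S, A, S) and R (S, A)."""
    S = P.shape[0]
    V = np.zeros(S)                       # arbitrary initialization V_0
    while True:
        # Q_k(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V_k(s')
        Q = R + gamma * (P @ V)           # shape (S, A)
        V_new = Q.max(axis=1)             # V_{k+1}(s) = max_a Q_k(s, a)
        if np.max(np.abs(V_new - V)) < eps:
            greedy = Q.argmax(axis=1)     # greedy policy w.r.t. the final V
            return V_new, greedy
        V = V_new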
Convergence of Value Iteration
[Figure: successive estimates V_1, V_2, V_3, … converging to V*]
Contractions!
Proof of the DP contraction
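A short sketch of the standard max-norm contraction argument for the Bellman optimality backup B (the operator applied at each value-iteration step), using |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|; notation matches the update above:

\[
\big|(BV)(s) - (BV')(s)\big|
  \le \max_a \Big| \gamma \sum_{s'} P(s'|s,a)\,\big(V(s') - V'(s')\big) \Big|
  \le \gamma \,\|V - V'\|_\infty
\quad\Rightarrow\quad
\|BV - BV'\|_\infty \le \gamma\,\|V - V'\|_\infty .
\]

Since B is a γ-contraction in the max norm, value iteration converges to the unique fixed point V*.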
Learning in MDPs
• Have access to the “real system” but no model
[Figure: agent interacting with the real system — state, action, state, action, …]
This is what life looks like!
Q-Learning (Watkins, 1989):
  Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]    (α is the step-size)
Big table of Q-values?
Only updates state-action pairs that are visited...
(a Python sketch follows below)
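A minimal tabular Q-learning sketch, assuming an environment object with reset() and step(a) returning (next_state, reward, done) — an illustrative interface, not one specified in the slides:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))          # big table of Q-values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

Note that, as on the slide, only the state-action pairs actually visited are ever updated.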
So far...
• Q-Learning is the first provably convergent direct
adaptive optimal control algorithm
• Great impact on the field of modern
Reinforcement Learning
• smaller representation than models
• automatically focuses attention on where it is needed, i.e., no sweeps through state space
• though it does not solve the exploration versus exploitation dilemma
• epsilon-greedy, optimistic initialization, etc. (see the sketch below)
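As one illustration of the last bullet: optimistic initialization simply starts the Q-table at an upper bound on the discounted return so that unvisited actions look attractive. A hedged sketch — the bound r_max / (1 − gamma) assumes rewards are at most r_max:

import numpy as np

def optimistic_q_table(n_states, n_actions, r_max=1.0, gamma=0.99):
    """Initialize Q optimistically: every unvisited (s, a) looks at least as good
    as the best achievable discounted return, encouraging systematic exploration."""
    return np.full((n_states, n_actions), r_max / (1.0 - gamma))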
Monte Carlo?
Suppose you want to find V^π(s) for some fixed state s: average the observed returns
r_0 + γ r_1 + γ² r_2 + ⋯ over visits to s.
temporal difference
TD(0):  V(s_t) ← V(s_t) + α [ r_t + γ V(s_{t+1}) − V(s_t) ]
TD(λ)
Reward sequence: r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
n-step backed-up estimates:
  e_0:  r_0 + γ V(s_1)
  e_1:  r_0 + γ r_1 + γ² V(s_2)
  e_2:  r_0 + γ r_1 + γ² r_2 + γ³ V(s_3)
λ-return weights:
  w_0 = (1−λ)     on  r_0 + γ V(s_1)
  w_1 = (1−λ)λ    on  r_0 + γ r_1 + γ² V(s_2)
  w_2 = (1−λ)λ²   on  r_0 + γ r_1 + γ² r_2 + γ³ V(s_3)
TD errors:
  δ_0 = r_0 + γ V(s_1) − V(s_0)
  δ_1 = r_1 + γ V(s_2) − V(s_1)
  δ_2 = r_2 + γ V(s_3) − V(s_2)
  δ_k = r_k + γ V(s_{k+1}) − V(s_k)
(a TD(λ) sketch follows below)
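A minimal sketch of tabular TD(λ) with accumulating eligibility traces, which applies each TD error δ_t to recently visited states; the trajectory format is assumed for illustration:

import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation.
    episodes: iterable of trajectories, each a list of (s, r, s_next, done) tuples."""
    V = np.zeros(n_states)
    for trajectory in episodes:
        e = np.zeros(n_states)                     # eligibility traces
        for s, r, s_next, done in trajectory:
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            e[s] += 1.0                            # accumulating trace for s
            V += alpha * delta * e                 # update all eligible states
            e *= gamma * lam                       # decay traces
    return V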
Function Approximation for RL
The value function could be:
• a table
• a Backprop Neural Network          } gradient-descent
• a Radial-Basis-Function Network    } methods
with weight vector θ; the gradient is computed by standard backprop.
e.g., gradient-descent Sarsa:
  θ ← θ + α [ r + γ Q̂(s',a') − Q̂(s,a) ] ∇_θ Q̂(s,a)
  where r + γ Q̂(s',a') is the target value and Q̂(s,a) is the estimated value
Linear in the Parameters FAs
Each state s is represented by a feature vector φ_s:
  V̂(s) = θ^T φ_s    so    ∇_θ V̂(s) = φ_s
Or represent a state-action pair with a feature vector φ_{sa} and approximate action values:
  Q^π(s,a) = E[ r_1 + γ r_2 + γ² r_3 + ⋯ | s_t = s, a_t = a, π ]
  Q̂(s,a) = θ^T φ_{s,a}
(a linear Sarsa sketch follows below)
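A minimal linear gradient-descent Sarsa sketch: with Q̂(s,a) = θ^T φ_{s,a}, the gradient is just the feature vector, so the update above becomes a simple vector step. The features(s, a) function and the environment interface are assumptions for illustration:

import numpy as np

def linear_sarsa(env, features, n_features, episodes=500,
                 alpha=0.05, gamma=0.99, epsilon=0.1,
                 actions=(0, 1, 2, 3), rng=np.random.default_rng(0)):
    """Gradient-descent Sarsa with a linear function approximator.
    features(s, a) must return a length-n_features vector phi_{s,a}."""
    theta = np.zeros(n_features)
    q = lambda s, a: theta @ features(s, a)          # Q_hat(s,a) = theta^T phi_{s,a}
    policy = lambda s: (rng.choice(actions) if rng.random() < epsilon
                        else max(actions, key=lambda a: q(s, a)))
    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            target = r + (0.0 if done else gamma * q(s2, a2))
            # theta <- theta + alpha * [target - Q_hat(s,a)] * grad_theta Q_hat(s,a)
            theta += alpha * (target - q(s, a)) * features(s, a)
            s, a = s2, a2
    return theta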
Sparse Coarse Coding
[Figure: the state passes through a fixed, expansive re-representation into many sparse binary features, followed by a linear last layer]
(a tile-coding style sketch follows below)
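A simplified sketch of one common form of sparse coarse coding — tile coding over a 1-D state, assumed here purely for illustration: several offset tilings each activate one binary feature, and the value is linear in those sparse features:

import numpy as np

def tile_features(x, n_tilings=8, tiles_per_tiling=10, lo=0.0, hi=1.0):
    """Sparse coarse coding of a scalar x in [lo, hi]: one active binary
    feature per tiling, with each tiling offset by a fraction of a tile."""
    phi = np.zeros(n_tilings * tiles_per_tiling)
    width = (hi - lo) / tiles_per_tiling
    for t in range(n_tilings):
        offset = t * width / n_tilings                  # shift each tiling slightly
        idx = int(np.clip((x - lo + offset) / width, 0, tiles_per_tiling - 1))
        phi[t * tiles_per_tiling + idx] = 1.0           # exactly one feature per tiling
    return phi

# The approximate value is then linear in the sparse features:
# V_hat(x) = theta @ tile_features(x)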
High variance
Off-Policy with Linear Function Approximation
Abstraction in Learning and Planning
• MAXQ by Dietterich
• HAMs by Parr & Russell
• Options: o = ⟨I, π, β⟩.  Example: docking
    I : all states in which charger is in sight
    π : hand-crafted controller
    β : terminate when docked or charger not visible
(a small data-structure sketch follows below)
Rooms Example
[Figure: 4-rooms gridworld with hallways and goal G]
• 4 rooms, 4 hallways
• 4 unreliable primitive actions (up, down, left, right)
• 8 multi-step options (to each room's 2 hallways)
SMDP
• Continuous time
• Discrete events
• Interval-dependent discount
Theorem:
For any MDP,
and any set of options,
the decision process that chooses among the options,
executing each to termination,
is an SMDP.
This form follows from SMDP theory. Such models can be used
interchangeably with models of primitive actions in Bellman equations.
(the option-model equations are sketched below)
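A sketch of the option models this refers to, written as in the standard options/SMDP framework; the notation is assumed to match the slides' r_s^o and p_{ss'}^o:

% Multi-time model of option o initiated in state s at time t, lasting k steps:
\[
r_s^o = E\{\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid o \text{ initiated in } s \text{ at } t \,\},
\qquad
p_{ss'}^o = E\{\, \gamma^{k}\,\mathbb{1}[s_{t+k} = s'] \mid o \text{ initiated in } s \text{ at } t \,\}.
\]
% These enter Bellman equations exactly like one-step models:
\[
V_O(s) = \max_{o \in O_s} \Big[\, r_s^o + \sum_{s'} p_{ss'}^o\, V_O(s') \,\Big].
\]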
Rooms Example
[Figure: learned value functions with cell-to-cell primitive actions vs. with room-to-room options; V(goal) = 1 in both cases]
• Termination Improvement
Improving the value function by changing the termination
conditions of options
• Intra-Option Learning
Learning the values of options in parallel, without executing them to termination
(the intra-option update is sketched below)
Learning the models of options in parallel, without executing them to termination
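A sketch of the intra-option value-learning update referred to above, as in the options framework: after taking action a_t in s_t, update every option o whose policy would also have taken a_t,

\[
Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \big[\, r_{t+1} + \gamma\, U(s_{t+1}, o) - Q(s_t, o) \,\big],
\quad\text{where}\quad
U(s', o) = \big(1 - \beta(s')\big)\, Q(s', o) + \beta(s') \max_{o'} Q(s', o') .
\]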
"
Landmarks Task
[Figure: reconnaissance mission map with a base and landmark sites; compares ~100 decision steps over any state (~10^6) at the primitive level with decisions over sites only (6) at the option level]
• Temporal scales:
    Actions: which direction to fly now
    Options: which site to head for
• Options compress space and time
    Reduce steps from ~600 to ~6
    Reduce states from ~10^11 to ~10^6
Option-level Bellman equation:  Q_O(s, o) = r_s^o + Σ_{s'} p_{ss'}^o V_O(s')
(an option-level planning sketch follows below)
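A minimal sketch of planning at the option level with the equation above, assuming option models are given as dictionaries R[o][s] (the expected discounted reward r_s^o) and P[o][s][s'] (the discounted transition p_{ss'}^o); these names are illustrative:

def option_value_iteration(states, options, R, P, eps=1e-6):
    """SMDP value iteration over options using
    Q_O(s, o) = r_s^o + sum_s' p_{ss'}^o * V_O(s').
    Note: the discount factor is already folded into p_{ss'}^o."""
    V = {s: 0.0 for s in states}
    while True:
        Q = {s: {o: R[o][s] + sum(p * V[s2] for s2, p in P[o][s].items())
                 for o in options}
             for s in states}
        V_new = {s: max(Q[s].values()) for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new, Q
        V = V_new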
Illustration: Reconnaissance Mission Planning (Results)
[Bar chart: expected reward per mission for the three planners, under high fuel and low fuel]
• SMDP planner:
    Assumes options followed to completion
    Plans optimal SMDP solution
• SMDP planner with re-evaluation:
    Plans as if options must be followed to completion
    But actually takes them for only one step
    Re-picks a new option on every step
• Static planner:
    Assumes weather will not change
    Plans optimal tour among clear sites
    Re-plans whenever weather changes
Re-evaluating the temporally abstract options on each step finds a better approximation than the static planner, with little more computation than the SMDP planner.
Example of Intra-Option Value Learning
[Left plot: average value of the greedy policy vs. episodes (log scale, 1 to 6000) — the learned value approaches the value of the optimal policy.
 Right plot: learned option values vs. episodes (0 to 6000) for the upper-hallway and left-hallway options, approaching their true values.]
Example of Intra-Option Model Learning
[Plots: reward-prediction error (max and average) and state-prediction error vs. experience, comparing SMDP, SMDP 1/t, and intra-option model-learning methods]