Lecture 5 - Model-Free Prediction
UCL, 2021
Background
qt(a) = ( Σ_{i=0}^{t} I(Ai = a) Ri+1 ) / ( Σ_{i=0}^{t} I(Ai = a) ) ≈ E [Rt+1 | At = a] = q(a)
I Equivalently, the estimate can be updated incrementally:
qt+1(At) = qt(At) + αt (Rt+1 − qt(At)), with αt = 1/Nt(At) = 1 / Σ_{i=0}^{t} I(Ai = At)
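As a concrete illustration, here is a minimal Python sketch of this incremental sample-average estimate; the class name and interface are illustrative, not part of the lecture.

```python
import numpy as np

class ActionValueEstimator:
    """Incremental sample-average estimate qt(a) of E[Rt+1 | At = a]."""

    def __init__(self, num_actions):
        self.q = np.zeros(num_actions)  # current estimates qt(a)
        self.n = np.zeros(num_actions)  # action counts Nt(a)

    def update(self, action, reward):
        # With step size alpha_t = 1 / Nt(At), this is exactly the sample average.
        self.n[action] += 1
        alpha = 1.0 / self.n[action]
        self.q[action] += alpha * (reward - self.q[action])
```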
A solution for large MDPs, or when the environment state is not fully observable
I Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn )
I Henceforth, St denotes the agent state
I Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot
I For now we are not going to talk about how to learn the agent state update
I Feel free to consider St an observation
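If it helps, one can picture uω as a small parametric function of the previous state, previous action, and new observation. The specific form below (a linear map followed by a tanh) and the name agent_state_update are illustrative assumptions, not the lecture's choice.

```python
import numpy as np

def agent_state_update(prev_state, prev_action_onehot, observation, W):
    """Illustrative u_omega: concatenate the inputs and apply a learned linear map
    followed by tanh. W plays the role of the parameters omega. In the simplest
    case one would instead just return the observation (St = Ot)."""
    inputs = np.concatenate([prev_state, prev_action_onehot, observation])
    return np.tanh(W @ inputs)
```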
Linear Function Approximation
Feature Vectors
x(s) = ( x1(s), ..., xm(s) )⊤
I x : S → Rm is a fixed mapping from agent state (e.g., observation) to features
I Short-hand: xt = x(St )
I For example:
I Distance of robot from landmarks
I Trends in the stock market
I Piece and pawn configurations in chess
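A minimal sketch of such a feature map and the shorthand xt = x(St), assuming a scalar state purely for illustration:

```python
import numpy as np

def x(state):
    """Illustrative feature map x : S -> R^m with m = 3, assuming a scalar state.
    In practice the features could be distances to landmarks, market trends, etc."""
    return np.array([1.0,                 # constant (bias) feature
                     float(state),        # the raw state itself
                     float(state) ** 2])  # a simple nonlinear feature

x_t = x(2.0)  # shorthand xt = x(St)
```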
Linear Value Function Approximation
I q could be a parametric function, e.g., a neural network, and we could use the loss
L(w) = ½ E [ (Rt+1 − qw(St, At))² ]
I Then the (stochastic) gradient update is
wt+1 = wt + α (Rt+1 − qw(St, At)) ∇w qw(St, At)
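A sketch of one such stochastic-gradient step. For concreteness it assumes the simple linear form qw(s, a) = w[a]·x(s); the lecture only requires qw to be differentiable (e.g., a neural network), so this particular form is an assumption, not the general case.

```python
import numpy as np

def sgd_reward_prediction_step(w, x_t, action, reward, alpha=0.1):
    """One stochastic-gradient step on L(w) = 0.5 * E[(Rt+1 - qw(St, At))^2],
    assuming qw(s, a) = w[a] . x(s) with w of shape (num_actions, num_features)."""
    prediction = w[action] @ x_t          # qw(St, At)
    error = reward - prediction           # Rt+1 - qw(St, At)
    w[action] += alpha * error * x_t      # gradient of qw w.r.t. w[action] is x_t
    return w
```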
Monte Carlo Policy Evaluation
vπ(s) = E [Gt | St = s, π]
I We can use the sample average of observed returns instead of the expected return
I We call this Monte Carlo policy evaluation
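A minimal every-visit Monte Carlo evaluation sketch, assuming episodes are given as lists of (state, reward) pairs where the reward is the one received on leaving that state; the function name and data layout are assumptions for illustration.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """Every-visit Monte Carlo prediction: estimate v_pi(s) as the sample average
    of the returns observed from each visit to s."""
    values = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Iterate backwards so g is the return from each state onwards.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            values[state] += (g - values[state]) / counts[state]  # sample average
    return dict(values)
```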
Example: Blackjack
Blackjack Example
I States (200 of them):
I Current sum (12-21)
I Dealer’s showing card (ace-10)
I Do I have a "usable" ace? (yes-no)
I Action stick: Stop receiving cards (and terminate)
I Action draw: Take another card (random, no replacement)
I Reward for stick:
I +1 if sum of cards > sum of dealer cards
I 0 if sum of cards = sum of dealer cards
I -1 if sum of cards < sum of dealer cards
I Reward for draw:
I -1 if sum of cards > 21 (and terminate)
I 0 otherwise
I Transitions: automatically draw if sum of cards < 12
Blackjack Value Function after Monte-Carlo Learning
Disadvantages of Monte-Carlo Learning
I Temporal-difference learning:
I Update the value vt(St) towards the estimated return Rt+1 + γvt(St+1):
vt+1(St) ← vt(St) + α ( Rt+1 + γvt(St+1) − vt(St) ), where Rt+1 + γvt(St+1) is the target
I δt = Rt+1 + γvt (St+1 ) − vt (St ) is called the TD error
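A minimal tabular TD(0) sketch of this update, assuming values are stored in a dict with unseen states defaulting to 0:

```python
def td_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One tabular TD update: move v(St) towards the target Rt+1 + gamma * v(St+1)."""
    td_error = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error  # delta_t
```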
Dynamic Programming Backup
(backup diagram not shown)
Monte-Carlo Backup
(backup diagram not shown)
Bootstrapping and Sampling
qt+1(St, At) ← qt(St, At) + α ( Rt+1 + γqt(St+1, At+1) − qt(St, At) ),
where Rt+1 + γqt(St+1, At+1) is the target and the term in brackets is the TD error
I This algorithm is known as SARSA, because it uses (St, At, Rt+1, St+1, At+1)
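A minimal tabular SARSA sketch of this update, assuming action values are stored in a dict keyed by (state, action) with unseen pairs defaulting to 0:

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=1.0):
    """One SARSA update from the transition (St, At, Rt+1, St+1, At+1)."""
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    td_error = target - q.get((state, action), 0.0)
    q[(state, action)] = q.get((state, action), 0.0) + alpha * td_error
    return td_error
```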
Temporal-Difference Learning
Driving Home Example
State           Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
exit highway    20                   15                     35
behind truck    30                   10                     40
home street     40                   3                      43
arrive home     43                   0                      43
Driving Home Example: MC vs. TD
[Figure: two panels plotting RMSE against episodes (0 to 100), one per method, with curves for step sizes α = 0.01, 0.03, 0.1, 0.3]
Batch MC and TD
Example: Batch Learning in Two States
Two states A, B; no discounting; 8 episodes of experience
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is v(A), v(B)?
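One way to see the answer is to replay these eight episodes to convergence with batch MC and batch TD(0). The sketch below assumes undiscounted returns, a small constant step size, and repeated sweeps over the batch; it is an illustration of the two batch solutions, not the lecture's code.

```python
# The eight episodes above, written as lists of (state, reward) transitions.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

def batch_td(episodes, sweeps=2000, alpha=0.01):
    """Batch TD(0): repeatedly replay all transitions until the values settle."""
    v = {'A': 0.0, 'B': 0.0}
    for _ in range(sweeps):
        for episode in episodes:
            for i, (state, reward) in enumerate(episode):
                next_v = v[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
                v[state] += alpha * (reward + next_v - v[state])
    return v

def batch_mc(episodes, sweeps=2000, alpha=0.01):
    """Batch Monte Carlo: repeatedly regress each value towards the observed return."""
    v = {'A': 0.0, 'B': 0.0}
    for _ in range(sweeps):
        for episode in episodes:
            g = 0.0
            for state, reward in reversed(episode):
                g += reward
                v[state] += alpha * (g - v[state])
    return v

print(batch_mc(episodes))  # roughly {'A': 0.0,  'B': 0.75}
print(batch_td(episodes))  # roughly {'A': 0.75, 'B': 0.75}
```

The two methods agree on v(B) but disagree on v(A): batch MC fits the observed returns (the only return from A was 0), while batch TD fits the Markov model implied by the data (A always transitioned to B with reward 0).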
Differences in batch solutions
I Batch MC converges to the values with the best fit to the observed returns, i.e., minimising
Σ_{k=1}^{K} Σ_{t=1}^{Tk} ( Gt^k − v(St^k) )²
I Batch TD instead converges to the solution of the maximum-likelihood Markov model implied by the data
I The n-step return truncates after n steps and then bootstraps:
Gt(n) = Rt+1 + γRt+2 + ... + γ^{n−1} Rt+n + γ^n v(St+n)
I Multi-step temporal-difference learning:
v(St) ← v(St) + α ( Gt(n) − v(St) )
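A sketch of the n-step return and the corresponding update for a single episode, assuming rewards[k] holds Rk+1, values[k] holds v(Sk), and the value after the last transition is 0 (terminal); the data layout is an assumption for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = Rt+1 + gamma*Rt+2 + ... + gamma^(n-1)*Rt+n + gamma^n * v(St+n),
    truncated at the end of the episode."""
    g, discount = 0.0, 1.0
    end = min(t + n, len(rewards))
    for k in range(t, end):
        g += discount * rewards[k]
        discount *= gamma
    bootstrap = values[end] if end < len(values) else 0.0  # v(St+n), 0 if terminal
    return g + discount * bootstrap

def n_step_td_update(values, rewards, t, n, alpha=0.1, gamma=1.0):
    """v(St) <- v(St) + alpha * (G_t^(n) - v(St))."""
    values[t] += alpha * (n_step_return(rewards, values, t, n, gamma) - values[t])
```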
Multi-Step Examples
Grid Example
[Figure 7.2 (Sutton & Barto): performance of n-step TD methods as a function of the step size α, for various values of n, e.g., n = 2, 4, 8]
Mixed Multi-Step Returns
Mixing multi-step returns
I Multi-step returns bootstrap on one state, v(St+n):
Gt(n) = Rt+1 + γ Gt+1(n−1)   (while n > 1, continue)
Gt(1) = Rt+1 + γ v(St+1)   (truncate & bootstrap)
I Instead of committing to one n, we can average all n-step returns with weights (1 − λ)λ^{n−1}
I (Note: Σ_{n=1}^{∞} (1 − λ)λ^{n−1} = 1)
Mixing multi-step returns
I The resulting λ-return can be written recursively:
Gtλ = Rt+1 + γ ( (1 − λ) v(St+1) + λ Gt+1λ )
Special cases:
Gtλ=0 = Rt+1 + γ v(St+1)   (TD)
Gtλ=1 = Rt+1 + γ Gt+1   (MC)
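A sketch of computing Gtλ for every step of a finished episode with this backwards recursion, under the same data-layout assumptions as before (rewards[k] = Rk+1, values[k] = v(Sk), terminal value 0):

```python
def lambda_returns(rewards, values, lam, gamma=1.0):
    """Backwards recursion G_t^lam = Rt+1 + gamma*((1-lam)*v(St+1) + lam*G_{t+1}^lam).
    Assumes len(values) == len(rewards) + 1 with values[-1] = 0 at the terminal state."""
    g = 0.0                                # lambda-return at the terminal state
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out
```

With lam = 0 this reproduces the one-step TD target at every step, and with lam = 1 the Monte Carlo return.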
Mixing multi-step returns
∆wk = Σ_{t=0}^{T−1} α δt et ,   where   et = Σ_{j=0}^{t} γ^{t−j} xj
et = Σ_{j=0}^{t−1} γ^{t−j} xj + xt
   = γ Σ_{j=0}^{t−1} γ^{t−1−j} xj + xt
   = γ et−1 + xt .
Each Monte Carlo error decomposes into discounted TD errors (rows), and each column sums to δt et, so the total update ∆w can be accumulated either way:

                   δ0 e0     δ1 e1      δ2 e2      δ3 e3
(G0 − v(S0)) x0 =  δ0 x0     γ δ1 x0    γ² δ2 x0   γ³ δ3 x0
(G1 − v(S1)) x1 =            δ1 x1      γ δ2 x1    γ² δ3 x1
(G2 − v(S2)) x2 =                       δ2 x2      γ δ3 x2
(G3 − v(S3)) x3 =                                  δ3 x3
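A sketch of this derivation as code: accumulate ∆w = Σt α δt et over one episode with the accumulating trace et = γ et−1 + xt, for linear values vw(s) = w·x(s). The data layout (features[k] = x(Sk), rewards[k] = Rk+1, terminal value 0) and the offline, end-of-episode application of the update are assumptions chosen to match the derivation above.

```python
import numpy as np

def episode_trace_update(features, rewards, w, alpha=0.1, gamma=1.0):
    """Return w + Delta_w for one episode, where Delta_w = sum_t alpha * delta_t * e_t
    and e_t = gamma * e_{t-1} + x_t is the accumulating eligibility trace."""
    e = np.zeros_like(w)
    delta_w = np.zeros_like(w)
    for t in range(len(rewards)):
        x_t = features[t]
        v_next = w @ features[t + 1] if t + 1 < len(features) else 0.0
        td_error = rewards[t] + gamma * v_next - w @ x_t   # delta_t
        e = gamma * e + x_t                                # e_t = gamma * e_{t-1} + x_t
        delta_w += alpha * td_error * e                    # accumulate alpha * delta_t * e_t
    return w + delta_w
```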
Mixed Multi-Step Returns and Eligibility Traces
Mixing multi-step returns & traces
Next lecture:
Model-free control