
Lecture 5: Model-Free Prediction

Hado van Hasselt

UCL, 2021
Background

Sutton & Barto 2018, Chapters 5 + 6 + 7 + 9 + 12

Don’t worry about reading all of this at once!


Most important chapters, for now: 5 + 6
You can also defer some reading, e.g., until the reading week
Recap

I Reinforcement learning is the science of learning to make decisions


I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
Lecture overview

I Last lectures (3+4):


I Planning by dynamic programming to solve a known MDP
I This and next lectures (5→8):
I Model-free prediction to estimate values in an unknown MDP
I Model-free control to optimise values in an unknown MDP
I Function approximation and (some) deep reinforcement learning (but more to follow later)
I Off-policy learning
I Later lectures:
I Model-based learning and planning
I Policy gradients and actor critic systems
I More deep reinforcement learning
I More advanced topics and current research
Model-Free Prediction:
Monte Carlo Algorithms
Monte Carlo Algorithms

I We can use experience samples to learn without a model


I We call direct sampling of episodes Monte Carlo
I MC is model-free: no knowledge of MDP required, only samples
Monte Carlo: Bandits

I Simple example, multi-armed bandit:


I For each action, average reward samples

qt(a) = ( Σ_{i=0}^{t} I(Ai = a) Ri+1 ) / ( Σ_{i=0}^{t} I(Ai = a) ) ≈ E[Rt+1 | At = a] = q(a)

I Equivalently:

qt+1(At) = qt(At) + αt (Rt+1 − qt(At))
qt+1(a) = qt(a)    ∀a ≠ At

with αt = 1/N(At) = 1 / Σ_{i=0}^{t} I(Ai = At)

I Note: we changed notation Rt → Rt+1 for the reward after At


In MDPs, the reward is said to arrive on the time step after the action
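As a concrete illustration of the incremental form above, here is a minimal Python sketch (not from the slides; the three-armed Gaussian bandit is made up):

```python
import numpy as np

# Minimal sketch: incremental Monte Carlo estimation of action values in a
# multi-armed bandit, with step size alpha_t = 1 / N(A_t).
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # made-up bandit
q = np.zeros(3)                          # q_t(a)
n = np.zeros(3)                          # visit counts N(a)

for t in range(10_000):
    a = rng.integers(3)                      # any behaviour that tries all arms
    r = rng.normal(true_means[a], 1.0)       # R_{t+1}
    n[a] += 1
    q[a] += (r - q[a]) / n[a]                # q <- q + alpha_t * (R_{t+1} - q)

print(q)   # approaches true_means
```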
Monte Carlo: Bandits with States

I Consider bandits with different states


I episodes are still one step
I actions do not affect state transitions
I =⇒ no long-term consequences
I Then, we want to estimate

q(s, a) = E [Rt+1 |St = s, At = a]


I These are called contextual bandits
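A minimal sketch of the same idea for contextual bandits (not from the slides): keep one running average per (state, action) pair.

```python
from collections import defaultdict

# Minimal sketch: estimate q(s, a) = E[R_{t+1} | S_t = s, A_t = a] by a
# running average per (state, action) pair, with alpha = 1 / N(s, a).
counts = defaultdict(int)
q = defaultdict(float)

def update(state, action, reward):
    key = (state, action)
    counts[key] += 1
    q[key] += (reward - q[key]) / counts[key]   # incremental average

# Usage: feed in observed (state, action, reward) triples, e.g.
update("sunny", 2, 1.0)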
Introduction: Function Approximation
Value Function Approximation

I So far we mostly considered lookup tables


I Every state s has an entry v(s)
I Or every state-action pair s, a has an entry q(s, a)
I Problem with large MDPs:
I There are too many states and/or actions to store in memory
I It is too slow to learn the value of each state individually
I Individual states are often not fully observable
Value Function Approximation

Solution for large MDPs:


I Estimate value function with function approximation

vw (s) ≈ vπ (s) (or v∗ (s))


qw (s, a) ≈ qπ (s, a) (or q∗ (s, a))

I Update parameter w (e.g., using MC or TD learning)


I Generalise from seen states to unseen states
Agent state update

Solution for large MDPs, if the environment state is not fully observable
I Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn )
I Henceforth, St denotes the agent state
I Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot
I For now we are not going to talk about how to learn the agent state update
I Feel free to consider St an observation
Linear Function Approximation
Feature Vectors

I A useful special case: linear functions


I Represent state by a feature vector

x(s) = ( x1(s), . . . , xm(s) )⊤
I x : S → Rm is a fixed mapping from agent state (e.g., observation) to features
I Short-hand: xt = x(St )
I For example:
I Distance of robot from landmarks
I Trends in the stock market
I Piece and pawn configurations in chess
Linear Value Function Approximation

I Approximate value function by a linear combination of features


vw(s) = w⊤ x(s) = Σ_{j=1}^{m} xj(s) wj

I Objective function ('loss') is quadratic in w

L(w) = E_{S∼d}[ (vπ(S) − w⊤ x(S))² ]

I Stochastic gradient descent converges to the global optimum
I Update rule is simple

∇w vw(St) = x(St) = xt   =⇒   ∆w = α (vπ(St) − vw(St)) xt

Update = step-size × prediction error × feature vector
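A minimal Python sketch of this update rule (not from the slides), assuming the target vπ(St) is given; in practice it is replaced by a sampled return or a TD target, as discussed later:

```python
import numpy as np

# Minimal sketch: one gradient step for a linear value function v_w(s) = w^T x(s).
# `target` stands in for v_pi(S_t).
def linear_value_update(w, x_t, target, alpha):
    prediction = w @ x_t                               # v_w(S_t)
    return w + alpha * (target - prediction) * x_t     # step-size * error * feature vector

w = np.zeros(4)
x_t = np.array([1.0, 0.5, 0.0, -1.0])                  # hypothetical feature vector x(S_t)
w = linear_value_update(w, x_t, target=2.0, alpha=0.1)
```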


Table Lookup Features

I Table lookup is a special case of linear value function approximation


I Let the n states be given by S = {s1, . . . , sn }.
I Use one-hot features:

x(s) = ( I(s = s1), . . . , I(s = sn) )⊤

I The parameter vector w then just contains a value estimate for each state:

v(s) = w⊤ x(s) = Σ_j wj xj(s) = ws .
Model-Free Prediction:
Monte Carlo Algorithms
(Continuing from before...)
Monte Carlo: Bandits with States

I q could be a parametric function, e.g., neural network, and we could use loss
L(w) = ½ E[ (Rt+1 − qw(St, At))² ]

I Then the gradient update is

wt+1 = wt − α ∇wt L(wt)
     = wt − α ∇wt ½ E[ (Rt+1 − qwt(St, At))² ]
     = wt + α E[ (Rt+1 − qwt(St, At)) ∇wt qwt(St, At) ] .


We can sample this to get a stochastic gradient update (SGD)


I The tabular case is a special case (only updates the value in cell [St , At ])
I Also works for large (continuous) state spaces S — this is just regression
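Since this is just regression, here is a minimal sketch (not from the slides) of the sampled stochastic gradient update, using a linear qw and a made-up feature map and reward model:

```python
import numpy as np

# Minimal sketch: sampling the gradient of L(w) = 1/2 E[(R - q_w(S, A))^2],
# with linear q_w(s, a) = w^T x(s, a).
rng = np.random.default_rng(0)
w = np.zeros(3)
alpha = 0.05

def x(s, a):
    return np.array([1.0, float(s), float(a)])   # hypothetical features x(s, a)

for _ in range(5_000):
    s, a = rng.integers(4), rng.integers(2)
    r = 0.5 + 0.2 * s - 0.3 * a + rng.normal(scale=0.1)   # made-up reward model
    w += alpha * (r - w @ x(s, a)) * x(s, a)              # stochastic gradient step

print(w)   # approaches [0.5, 0.2, -0.3]
```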
Monte Carlo: Bandits with States

I When using linear functions, q(s, a) = w⊤ x(s, a) and

∇wt qwt(St, At) = x(St, At)

I Then the SGD update is

wt+1 = wt + α (Rt+1 − qwt(St, At)) x(St, At) .

I Linear update = step-size × prediction error × feature vector


I Non-linear update = step-size × prediction error × gradient
Monte-Carlo Policy Evaluation

I Now we consider sequential decision problems


I Goal: learn vπ from episodes of experience under policy π

S1, A1, R2, ..., Sk ∼ π


I The return is the total discounted reward (for an episode ending at time T > t ):

Gt = Rt+1 + γ Rt+2 + ... + γ^(T−t−1) RT


I The value function is the expected return:

vπ (s) = E [Gt | St = s, π]
I We can just use sample average return instead of expected return
I We call this Monte Carlo policy evaluation
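A minimal Python sketch of (every-visit) Monte Carlo policy evaluation (not from the slides); episodes are assumed to be given as lists of (St, Rt+1) pairs generated by following π:

```python
from collections import defaultdict

# Minimal sketch: every-visit Monte Carlo policy evaluation.
def mc_evaluate(episodes, gamma=1.0):
    values = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1} is easy to accumulate.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            values[state] += (g - values[state]) / counts[state]   # running average of returns
    return values

# Usage with a single made-up episode: A (reward 0), then B (reward 1), then termination.
print(dict(mc_evaluate([[("A", 0.0), ("B", 1.0)]])))   # {'B': 1.0, 'A': 1.0}
```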
Example: Blackjack
Blackjack Example
I States (200 of them):
I Current sum (12-21)
I Dealer’s showing card (ace-10)
I Do I have a "usable" ace? (yes/no)
I Action stick: Stop receiving cards (and terminate)
I Action draw: Take another card (random, no replacement)
I Reward for stick:
I +1 if sum of cards > sum of dealer cards
I 0 if sum of cards = sum of dealer cards
I -1 if sum of cards < sum of dealer cards
I Reward for draw:
I -1 if sum of cards > 21 (and terminate)
I 0 otherwise
I Transitions: automatically draw if sum of cards < 12
Blackjack Value Function after Monte-Carlo Learning
Disadvantages of Monte-Carlo Learning

I We have seen MC algorithms can be used to learn value predictions


I But when episodes are long, learning can be slow
I ...we have to wait until an episode ends before we can learn
I ...return can have high variance
I Are there alternatives? (Spoiler: yes)
Temporal-Difference Learning
Temporal Difference Learning by Sampling Bellman Equations
I Previous lecture: Bellman equations,

vπ (s) = E [Rt+1 + γvπ (St+1 ) | St = s, At ∼ π(St )]


I Previous lecture: Approximate by iterating,

vk+1 (s) = E [Rt+1 + γvk (St+1 ) | St = s, At ∼ π(St )]


I We can sample this!
vt+1 (St ) = Rt+1 + γvt (St+1 )
I This is likely quite noisy — better to take a small step (with parameter α):
 
vt+1(St) = vt(St) + αt ( Rt+1 + γ vt(St+1) − vt(St) )

where Rt+1 + γ vt(St+1) is called the target

(Note: tabular update)


Temporal difference learning

I Prediction setting: learn vπ online from experience under policy π


I Monte-Carlo
I Update value vn (St ) towards sampled return Gt

vn+1 (St ) = vn (St ) + α (Gt − vn (St ))

I Temporal-difference learning:
I Update value vt (St ) towards estimated return Rt+1 + γv(St+1 )

TD error
©z }| {ª
vt+1 (St ) ← vt (St ) + α ­ Rt+1 + γvt (St+1 ) −vt (St )®
­ ®
­| {z } ®
« target ¬
I δt = Rt+1 + γvt (St+1 ) − vt (St ) is called the TD error
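A minimal Python sketch (not from the slides) of the tabular TD(0) update and its TD error, applied after each transition (St, Rt+1, St+1):

```python
from collections import defaultdict

# Minimal sketch: one tabular TD(0) update,
# v(S_t) <- v(S_t) + alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t)).
def td0_update(v, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    target = r + (0.0 if terminal else gamma * v[s_next])
    delta = target - v[s]          # TD error
    v[s] += alpha * delta
    return delta

v = defaultdict(float)
td0_update(v, s="A", r=0.0, s_next="B")   # can be applied online, after every step
```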
Dynamic Programming Backup

v(St ) ← E [Rt+1 + γv(St+1 ) | At ∼ π(St )]

(Backup diagram: a full one-step expectation from St over all possible rewards Rt+1 and next states St+1.)
Monte-Carlo Backup

v(St ) ← v(St ) + α (Gt − v(St ))

(Backup diagram: one complete sampled trajectory from St to the end of the episode.)
Temporal-Difference Backup

v(St ) ← v(St ) + α (Rt+1 + γv(St+1 ) − v(St ))

(Backup diagram: one sampled transition from St to St+1 with reward Rt+1, then bootstrapping on v(St+1).)
Bootstrapping and Sampling

I Bootstrapping: update involves an estimate


I MC does not bootstrap
I DP bootstraps
I TD bootstraps
I Sampling: update samples an expectation
I MC samples
I DP does not sample
I TD samples
Temporal difference learning

I We can apply the same idea to action values


I Temporal-difference learning for action values:
I Update value qt (St , At ) towards estimated return Rt+1 + γq(St+1 , At+1 )

qt+1(St, At) ← qt(St, At) + α ( Rt+1 + γ qt(St+1, At+1) − qt(St, At) )

where Rt+1 + γ qt(St+1, At+1) is the target, and the term in parentheses is the TD error
I This algorithm is known as SARSA, because it uses (St , At , Rt+1, St+1, At+1 )
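A minimal sketch of this SARSA prediction update (not from the slides), applied per transition (St, At, Rt+1, St+1, At+1):

```python
from collections import defaultdict

# Minimal sketch: the SARSA update for action values.
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    target = r + (0.0 if terminal else gamma * q[(s_next, a_next)])
    q[(s, a)] += alpha * (target - q[(s, a)])

q = defaultdict(float)
sarsa_update(q, s="A", a=0, r=1.0, s_next="B", a_next=1)
```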
Temporal-Difference Learning

I TD is model-free (no knowledge of the MDP) and learns directly from experience


I TD can learn from incomplete episodes, by bootstrapping
I TD can learn during each episode
Example: Driving Home
Driving Home Example
State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office        0                    30                     30
reach car, raining    5                    35                     40
exit highway          20                   15                     35
behind truck          30                   10                     40
home street           40                   3                      43
arrive home           43                   0                      43
Driving Home Example: MC vs. TD

(Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1).)
Comparing MC and TD
Advantages and Disadvantages of MC vs. TD

I TD can learn before knowing the final outcome


I TD can learn online after every step
I MC must wait until end of episode before return is known
I TD can learn without the final outcome
I TD can learn from incomplete sequences
I MC can only learn from complete sequences
I TD works in continuing (non-terminating) environments
I MC only works for episodic (terminating) environments
I TD is independent of the temporal span of the prediction
I TD can learn from single transitions
I MC must store all predictions (or states) to update at the end of an episode
I TD needs reasonable value estimates
Bias/Variance Trade-Off

I MC return Gt = Rt+1 + γRt+2 + . . . is an unbiased estimate of vπ (St )


I TD target Rt+1 + γvt (St+1 ) is a biased estimate of vπ (St ) (unless vt (St+1 ) = vπ (St+1 ))
I But the TD target has lower variance:
I Return depends on many random actions, transitions, rewards
I TD target depends on one random action, transition, reward
Bias/Variance Trade-Off

I In some cases, TD can have irreducible bias


I The world may be partially observable
I MC would implicitly account for all the latent variables
I The function to approximate the values may fit poorly
I In the tabular case, both MC and TD will converge: vt → vπ
Example: Random Walk
Random Walk Example

I Uniform random transitions (50% left, 50% right)


I Initial values are v(s) = 0.5, for all s
I True values happen to be
v(A) = 1/6, v(B) = 2/6, v(C) = 3/6, v(D) = 4/6, v(E) = 5/6
Random Walk Example
Random Walk: MC vs. TD
(Figure: root-mean-square error over the first 100 episodes for TD (left) and MC (right), with step sizes α ∈ {0.01, 0.03, 0.1, 0.3}.)
Batch MC and TD
Batch MC and TD

I Tabular MC and TD converge: vt → vπ as experience → ∞ and αt → 0


I But what about finite experience?
I Consider a fixed batch of experience:

episode 1: S1^1, A1^1, R2^1, ..., S_{T1}^1
...
episode K: S1^K, A1^K, R2^K, ..., S_{TK}^K
I Repeatedly sample each episode k ∈ [1, K] and apply MC or TD(0)
I = sampling from an empirical model
Example:
Batch Learning in Two States
Example: Batch Learning in Two States
Two states A, B; no discounting; 8 episodes of experience

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

What is v(A), v(B)?
Differences in batch solutions

I MC converges to best mean-squared fit for the observed returns

Σ_{k=1}^{K} Σ_{t=1}^{Tk} ( Gt^k − v(St^k) )²

I In the AB example, v(A) = 0


I TD converges to solution of max likelihood Markov model, given the data
I Solution to the empirical MDP (S, A, p̂, γ) that best fits the data
I In the AB example: p̂(St+1 = B | St = A) = 1, and therefore v(A) = v(B) = 0.75
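A small sketch (not from the slides) reproducing the two batch solutions on the A/B data above, with episodes given as (state, reward) pairs and no discounting:

```python
import numpy as np

# Minimal sketch: batch solutions for the two-state A/B example above.
episodes = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]

# Monte Carlo batch solution: mean of the observed returns per state.
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for state, reward in reversed(ep):
        g += reward
        returns[state].append(g)
v_mc = {s: float(np.mean(gs)) for s, gs in returns.items()}

# TD / certainty-equivalence solution: solve the empirical Markov model.
# In the data, A always transitions to B with reward 0, and B terminates with
# mean reward 6/8, so v(B) = 0.75 and v(A) = 0 + v(B) = 0.75.
v_b = float(np.mean([r for ep in episodes for s, r in ep if s == "B"]))
v_td = {"A": 0.0 + v_b, "B": v_b}

print(v_mc)   # {'A': 0.0, 'B': 0.75}
print(v_td)   # {'A': 0.75, 'B': 0.75}
```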
Advantages and Disadvantages of MC vs. TD

I TD exploits Markov property


I Can help in fully-observable environments
I MC does not exploit Markov property
I Can help in partially-observable environments
I With finite data, or with function approximation, the solutions may differ
Between MC and TD:
Multi-Step TD
Unified View of Reinforcement Learning
Multi-Step Updates

I TD uses value estimates which might be inaccurate


I In addition, information can propagate back quite slowly
I In MC information propagates faster, but the updates are noisier
I We can go in between TD and MC
Multi-Step Prediction

I Let TD target look n steps into the future


Multi-Step Returns
I Consider the following n-step returns for n = 1, 2, ∞:

n = 1 (TD)    Gt^(1) = Rt+1 + γ v(St+1)
n = 2         Gt^(2) = Rt+1 + γ Rt+2 + γ^2 v(St+2)
...
n = ∞ (MC)    Gt^(∞) = Rt+1 + γ Rt+2 + ... + γ^(T−t−1) RT

I In general, the n-step return is defined by

Gt^(n) = Rt+1 + γ Rt+2 + ... + γ^(n−1) Rt+n + γ^n v(St+n)

I Multi-step temporal-difference learning

v(St) ← v(St) + α ( Gt^(n) − v(St) )
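A minimal sketch (not from the slides) computing Gt^(n) from a stored episode, where rewards[k] holds Rk+1 and values[k] holds v(Sk):

```python
# Minimal sketch: the n-step return
# G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n}),
# truncated at the end of the episode (where it becomes the MC return).
def n_step_return(rewards, values, t, n, gamma=1.0):
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, len(rewards))):
        g += discount * rewards[k]
        discount *= gamma
    if t + n < len(rewards):                  # bootstrap only if the episode has not ended
        g += discount * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]                     # made-up three-step episode
values = [0.5, 0.5, 0.5, 0.0]                 # v(S_0) .. v(S_3), with S_3 terminal
print(n_step_return(rewards, values, t=0, n=2, gamma=0.9))   # 0 + 0 + 0.81 * 0.5 = 0.405
```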
Multi-Step Examples
Grid Example

(Reminder: SARSA is TD for action values q(s, a))


Large Random Walk Example

..., but with 19 states, rather than 5

(Figure 7.2, Sutton & Barto: RMS error over the first 10 episodes of on-line and off-line n-step TD methods, as a function of the step size α, for various values of n. Intermediate values of n work best, illustrating how multi-step methods can outperform both one-step TD and Monte Carlo; the off-line methods were unstable for α much above 0.3.)
Mixed Multi-Step Returns
Mixing multi-step returns
I Multi-step returns bootstrap on one state, v(St+n):

Gt^(n) = Rt+1 + γ Gt+1^(n−1)          (while n > 1, continue)
Gt^(1) = Rt+1 + γ v(St+1) .           (truncate & bootstrap)

I You can also bootstrap a little bit on multiple states:

Gt^λ = Rt+1 + γ ( (1 − λ) v(St+1) + λ Gt+1^λ )

This gives a weighted average of n-step returns:

Gt^λ = Σ_{n=1}^{∞} (1 − λ) λ^(n−1) Gt^(n)

(Note: Σ_{n=1}^{∞} (1 − λ) λ^(n−1) = 1)
Mixing multi-step returns

 
Gt^λ = Rt+1 + γ ( (1 − λ) v(St+1) + λ Gt+1^λ )

Special cases:

Gt^(λ=0) = Rt+1 + γ v(St+1)          (TD)
Gt^(λ=1) = Rt+1 + γ Gt+1             (MC)
Mixing multi-step returns

Intuition: 1/(1 − λ) is the ‘horizon’. E.g., λ = 0.9 ≈ n = 10.


Benefits of Multi-Step Learning
Benefits of multi-step returns

I Multi-step returns have benefits from both TD and MC


I Bootstrapping can have issues with bias
I Monte Carlo can have issues with variance
I Typically, intermediate values of n or λ are good (e.g., n = 10, λ = 0.9)
Eligibility Traces
Independence of temporal span

I MC and multi-step returns are not independent of span of the predictions:


To update values in a long episode, you have to wait
I TD can update immediately, and is independent of the span of the predictions
I Can we get both?
Eligibility traces

I Recall linear function approximation


I The Monte Carlo and TD updates to vw(s) = w⊤ x(s) for a state s = St are

∆wt = α (Gt − v(St)) xt                          (MC)
∆wt = α (Rt+1 + γ v(St+1) − v(St)) xt            (TD)

I MC updates all states in episode k at once:

∆wk+1 = Σ_{t=0}^{T−1} α (Gt − v(St)) xt

where t ∈ {0, . . . , T − 1} enumerates the time steps in this specific episode


I Recall: tabular is a special case, with one-hot vector xt
Eligibility traces

I Accumulating a whole episode of updates:

∆wt ≡ αδt et (one time step)


where et = γλ et−1 + xt

I Note: if λ = 0, we get one-step TD


I Intuition: decay the eligibility of past states for the current TD error, then add it
I This is kind of magical: we can update all past states (to account for the new TD error)
with a single update! No need to recompute their values.
I This idea extends to function approximation: xt does not have to be one-hot
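A minimal sketch (not from the slides) of one step of accumulating-trace TD(λ) with linear features, matching the trace update et = γλ et−1 + xt:

```python
import numpy as np

# Minimal sketch: one step of TD(lambda) with an accumulating eligibility trace and
# linear values v_w(s) = w^T x(s). The trace decays by gamma * lambda, the current
# features are added, and the single TD error updates all eligible past states.
def td_lambda_step(w, e, x_t, r, x_next, alpha=0.1, gamma=1.0, lam=0.9, terminal=False):
    v_t = w @ x_t
    v_next = 0.0 if terminal else w @ x_next
    delta = r + gamma * v_next - v_t          # TD error
    e = gamma * lam * e + x_t                 # accumulating trace
    w = w + alpha * delta * e
    return w, e

w, e = np.zeros(3), np.zeros(3)
x_t = np.array([1.0, 0.0, 0.0])               # hypothetical feature vectors
x_next = np.array([0.0, 1.0, 0.0])
w, e = td_lambda_step(w, e, x_t, r=1.0, x_next=x_next)
```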
Eligibility traces
Eligibility traces

We can rewrite the MC error as a sum of TD errors:

Gt − v(St) = Rt+1 + γ Gt+1 − v(St)
           = Rt+1 + γ v(St+1) − v(St) + γ (Gt+1 − v(St+1))       (the first three terms equal δt)
           = δt + γ (Gt+1 − v(St+1))
           = δt + γ δt+1 + γ^2 (Gt+2 − v(St+2))
           = ...
           = Σ_{k=t}^{T} γ^(k−t) δk          (used in the next slide)
Eligibility traces
I Now consider accumulating a whole episode (from time t = 0 to T) of updates:

∆wk = Σ_{t=0}^{T−1} α (Gt − v(St)) xt
    = α Σ_{t=0}^{T−1} ( Σ_{k=t}^{T−1} γ^(k−t) δk ) xt            (using the result from the previous slide)
    = α Σ_{k=0}^{T−1} Σ_{t=0}^{k} γ^(k−t) δk xt                  (using Σ_{i=0}^{m} Σ_{j=i}^{m} zij = Σ_{j=0}^{m} Σ_{i=0}^{j} zij)
    = Σ_{k=0}^{T−1} α δk Σ_{t=0}^{k} γ^(k−t) xt = Σ_{k=0}^{T−1} α δk ek = Σ_{t=0}^{T−1} α δt et ,

where ek ≡ Σ_{t=0}^{k} γ^(k−t) xt, and the last step simply renames k → t.
Eligibility traces
Accumulating a whole episode of updates:

∆wk = Σ_{t=0}^{T−1} α δt et    where    et = Σ_{j=0}^{t} γ^(t−j) xj
                                           = Σ_{j=0}^{t−1} γ^(t−j) xj + xt
                                           = γ Σ_{j=0}^{t−1} γ^(t−1−j) xj + xt        (the sum equals et−1)
                                           = γ et−1 + xt .

The vector et is called an eligibility trace


Every step, it decays (according to γ ) and then the current feature xt is added
Eligibility traces

I Accumulating a whole episode of updates:

∆wt ≡ α δt et                  (one time step)

∆wk = Σ_{t=0}^{T−1} ∆wt        (whole episode)

where et = γ et−1 + xt .

(And then apply ∆w at the end of the episode)


I Intuition: the same TD error shows up in multiple MC errors—grouping them allows
applying it to all past states in one update
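A small numerical sketch (not from the slides) of this equivalence for one episode with the values held fixed ('offline' accumulation), using made-up random features and rewards:

```python
import numpy as np

# Minimal sketch: for a fixed value function, summing alpha * delta_t * e_t with
# e_t = gamma * e_{t-1} + x_t equals summing alpha * (G_t - v(S_t)) * x_t.
rng = np.random.default_rng(0)
gamma, alpha, T, dim = 0.9, 0.1, 5, 3
xs = rng.normal(size=(T, dim))                # feature vectors x_0 .. x_{T-1}
rewards = rng.normal(size=T)                  # R_1 .. R_T
w = rng.normal(size=dim)                      # value parameters, fixed during the episode
values = xs @ w                               # v(S_0) .. v(S_{T-1}); v(S_T) = 0

# Monte Carlo form of the accumulated update.
returns = np.zeros(T)
g = 0.0
for t in reversed(range(T)):
    g = rewards[t] + gamma * g
    returns[t] = g
dw_mc = sum(alpha * (returns[t] - values[t]) * xs[t] for t in range(T))

# TD-error-plus-trace form of the accumulated update.
dw_trace, e = np.zeros(dim), np.zeros(dim)
for t in range(T):
    v_next = values[t + 1] if t + 1 < T else 0.0
    delta = rewards[t] + gamma * v_next - values[t]
    e = gamma * e + xs[t]
    dw_trace += alpha * delta * e

print(np.allclose(dw_mc, dw_trace))           # True
```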
Eligibility Traces: Intuition
Eligibility traces

Consider a batch update on an episode with four steps: t ∈ {0, 1, 2, 3}

∆v =               δ0 e0       δ1 e1       δ2 e2        δ3 e3
(G0 − v(S0)) x0    δ0 x0       γ δ1 x0     γ^2 δ2 x0    γ^3 δ3 x0
(G1 − v(S1)) x1                δ1 x1       γ δ2 x1      γ^2 δ3 x1
(G2 − v(S2)) x2                            δ2 x2        γ δ3 x2
(G3 − v(S3)) x3                                         δ3 x3

Summing the rows gives the MC updates; summing the columns gives the trace updates δt et; the total ∆v is the same either way.
Mixed Multi-Step Returns
and Eligibility Traces
Mixing multi-step returns & traces

I Reminder: mixed multi-step return


 
Gt^λ = Rt+1 + γ ( (1 − λ) v(St+1) + λ Gt+1^λ )

I The associated error and trace update are


Gt^λ − v(St) = Σ_{k=0}^{T−t} (λγ)^k δt+k          (same as before, but with λγ instead of γ)

=⇒  et = γλ et−1 + xt    and    ∆wt = α δt et .
I This is called an accumulating trace with decay γλ
I It is exact for batched episodic updates (‘offline’), similar traces exist for online updating
End of Lecture

Next lecture:
Model-free control
