Lecture 3: Model-Free Policy Evaluation
Policy Evaluation Without Knowing How the World Works
Emma Brunskill
Winter 2025
Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
Last Time:
Markov reward / decision processes
Policy evaluation & control when we have the true model (of how the world works)
Today:
Policy evaluation without known dynamics & reward models
Next Time:
Control when we don't have a model of how the world works
Assume today that this experience comes from executing the policy π. Later we will consider how to do policy evaluation using data gathered from other policies.
This Lecture: Policy Evaluation
If trajectories are all finite, sample set of trajectories & average returns
Does not require MDP dynamics/rewards
Does not assume state is Markov
Can be applied to episodic MDPs
Averaging over returns from a complete episode
Requires each episode to terminate
First-visit Monte Carlo
Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}, a_{i,T_i}, r_{i,T_i}
G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + ... + γ^{T_i − t} r_{i,T_i}
For each time step t until T_i (the end of episode i):
  If this is the first time t that state s is visited in episode i (for first-visit MC):
    Increment counter of total first visits: N(s) = N(s) + 1
    Increment total return: G(s) = G(s) + G_{i,t}
    Update estimate: V^π(s) = G(s)/N(s)
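To make the first-visit procedure concrete, here is a minimal Python sketch under assumptions not in the slides: episodes are given as lists of (state, action, reward) tuples, and the function name first_visit_mc is illustrative.

from collections import defaultdict

def first_visit_mc(episodes, gamma):
    # First-visit Monte Carlo estimate of V^pi from sampled episodes.
    # episodes: list of episodes, each a list of (state, action, reward) tuples.
    N = defaultdict(int)          # N(s): number of first visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following first visits to s

    for episode in episodes:
        # Backwards pass: returns[t] = G_{i,t} = r_t + gamma * r_{t+1} + ...
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = r + gamma * G
            returns[t] = G

        visited = set()
        for t, (s, _, _) in enumerate(episode):
            if s in visited:      # only the first visit to s in this episode counts
                continue
            visited.add(s)
            N[s] += 1
            G_total[s] += returns[t]

    return {s: G_total[s] / N[s] for s in N}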
Incremental Monte Carlo
Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}, a_{i,T_i}, r_{i,T_i}
G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ^2 r_{i,t+2} + ... + γ^{T_i − t} r_{i,T_i}
For t = 1 : T_i, where T_i is the length of the i-th episode:
  V^π(s_{i,t}) = V^π(s_{i,t}) + α(G_{i,t} − V^π(s_{i,t}))
We will see many algorithms of this form with a learning rate, target, and
incremental update
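A sketch of the incremental MC update in the same spirit, again assuming episodes are lists of (state, action, reward) tuples; a constant learning rate α is used here (setting α = 1/N(s) would recover the running-average estimator above).

def incremental_mc_update(V, episode, gamma, alpha):
    # One incremental (every-visit) MC pass over a single episode.
    # V: dict mapping state -> current value estimate, updated in place.
    returns = [0.0] * len(episode)
    G = 0.0
    for t in reversed(range(len(episode))):
        _, _, r = episode[t]
        G = r + gamma * G
        returns[t] = G

    for t, (s, _, _) in enumerate(episode):
        v = V.get(s, 0.0)
        V[s] = v + alpha * (returns[t] - v)   # new = old + learning rate * (target - old)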
Consistency: with enough data, does the estimate converge to the true value
of the policy?
Computational complexity: as we get more data, what is the computational cost of updating the estimate?
Memory requirements
Statistical efficiency (intuitively, how does the accuracy of the estimate
change with the amount of data)
Empirical accuracy, often evaluated by mean squared error
Let n be the number of data points x used to estimate the parameter θ, and call the resulting estimate of θ using that data θ̂_n.
Then the estimator θ̂_n is consistent if, for all ε > 0,
  lim_{n→∞} Pr(|θ̂_n − θ| > ε) = 0
Properties:
First-visit Monte Carlo
The first-visit V^π estimator is an unbiased estimator of the true E_π[G_t | s_t = s]
By the law of large numbers, as N(s) → ∞, V^π(s) → E_π[G_t | s_t = s]
Every-visit Monte Carlo
The every-visit MC estimator is a biased estimator of V^π
But it is a consistent estimator and often has better MSE
Incremental Monte Carlo
Properties depend on the learning rate α
“If one had to identify one idea as central and novel to reinforcement
learning, it would undoubtedly be temporal-difference (TD) learning.” –
Sutton and Barto 2017
Combination of Monte Carlo & dynamic programming methods
Model-free
Can be used in episodic or infinite-horizon non-episodic settings
Immediately updates the estimate of V after each (s, a, r, s′) tuple
TD(0) error:
  δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t)
Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop:
  Sample tuple (s_t, a_t, r_t, s_{t+1})
  V^π(s_t) = V^π(s_t) + α([r_t + γ V^π(s_{t+1})] − V^π(s_t)), where [r_t + γ V^π(s_{t+1})] is the TD target
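A minimal Python sketch of the TD(0) update above; the convention that s_next is None at episode termination (so the bootstrap term is 0) is an assumption, not something specified in the pseudocode.

def td0_update(V, s, r, s_next, alpha, gamma):
    # One TD(0) update from a single (s, a, r, s') tuple; V is updated in place.
    bootstrap = 0.0 if s_next is None else V.get(s_next, 0.0)
    target = r + gamma * bootstrap            # TD target
    v = V.get(s, 0.0)
    V[s] = v + alpha * (target - v)           # alpha * TD error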
Example:
Mars rover: R = [1 0 0 0 0 0 +10] for any action
π(s) = a1 ∀s, γ = 1. Any action taken in s1 or s7 terminates the episode.
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
TD estimate of all states (initialized at 0) with α = 1, γ < 1 at the end of this episode?
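As a quick check of the question above, the td0_update sketch can be run over this trajectory (the state names and terminal-state convention are the assumptions made earlier):

V = {}
trajectory = [("s3", 0.0, "s2"),
              ("s2", 0.0, "s2"),
              ("s2", 0.0, "s1"),
              ("s1", 1.0, None)]   # final tuple: any action from s1 terminates

for s, r, s_next in trajectory:
    td0_update(V, s, r, s_next, alpha=1.0, gamma=1.0)

print(V)   # {'s3': 0.0, 's2': 0.0, 's1': 1.0}

With all values initialized at 0, only V(s1) changes (to 1), and the result is the same for any γ because every bootstrapped value is still 0; by contrast, first-visit MC on this trajectory also updates s2 and s3.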
r̂(s, a) = (1 / N(s, a)) Σ_{k=1}^{i} 1(s_k = s, a_k = a) r_k
Requires initializing for all (s, a) pairs
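A small sketch of maintaining this running average incrementally, one (s, a, r) sample at a time; the dictionary-based bookkeeping is an illustrative choice (a defaultdict plays the role of initializing all (s, a) pairs to 0).

from collections import defaultdict

N = defaultdict(int)        # N(s, a): visit counts
r_hat = defaultdict(float)  # r_hat(s, a): running mean reward, implicitly 0 for unseen pairs

def update_reward_estimate(s, a, r):
    # Incremental form of r_hat(s, a) = (1 / N(s, a)) * sum of rewards observed at (s, a).
    N[(s, a)] += 1
    r_hat[(s, a)] += (r - r_hat[(s, a)]) / N[(s, a)]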
Certainty Equivalence: V^π with MLE MDP Model Estimates
Model-based option for policy evaluation without true models
After each (s, a, r, s′) tuple:
Recompute the maximum likelihood MDP model for (s, a)
  P̂(s′ | s, a) = (1 / N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k − 1} 1(s_{k,t} = s, a_{k,t} = a, s_{k,t+1} = s′)
  r̂(s, a) = (1 / N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k − 1} 1(s_{k,t} = s, a_{k,t} = a) r_{k,t}
Compute V^π using the MLE MDP
Cost: updating the MLE model and MDP planning at each update (O(|S|^3) for the analytic matrix solution, O(|S|^2 |A|) for iterative methods)
Very data efficient and very computationally expensive
Consistent (will converge to the right estimate for Markov models)
Can also easily be used for off-policy evaluation (which we will shortly define and discuss)
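A sketch of the certainty-equivalence approach for a small tabular MDP, under assumptions not in the slides: integer states 0..|S|-1 and actions 0..|A|-1, a deterministic policy given as an array, γ < 1, and unvisited (s, a) pairs treated as zero-reward self-loops. The analytic solve corresponds to the O(|S|^3) option above.

import numpy as np

def certainty_equivalence_value(episodes, policy, num_states, num_actions, gamma):
    # Estimate V^pi by fitting the MLE MDP model to the data, then solving
    # V^pi = (I - gamma * P^pi)^{-1} R^pi analytically.
    # episodes: list of episodes, each a list of (s, a, r, s_next) tuples.
    N = np.zeros((num_states, num_actions))
    P_counts = np.zeros((num_states, num_actions, num_states))
    R_sums = np.zeros((num_states, num_actions))

    for episode in episodes:
        for s, a, r, s_next in episode:
            N[s, a] += 1
            P_counts[s, a, s_next] += 1
            R_sums[s, a] += r

    P_hat = np.zeros_like(P_counts)
    R_hat = np.zeros_like(R_sums)
    for s in range(num_states):
        for a in range(num_actions):
            if N[s, a] > 0:
                P_hat[s, a] = P_counts[s, a] / N[s, a]
                R_hat[s, a] = R_sums[s, a] / N[s, a]
            else:
                P_hat[s, a, s] = 1.0   # arbitrary choice for unvisited (s, a): self-loop, zero reward

    # Dynamics and rewards under the fixed policy pi.
    P_pi = P_hat[np.arange(num_states), policy]   # shape (num_states, num_states)
    R_pi = R_hat[np.arange(num_states), policy]   # shape (num_states,)
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)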
Monte Carlo in the batch setting converges to the estimate with minimum MSE (mean squared error)
Minimizes loss with respect to the observed returns
In the AB example, V(A) = 0