
AI512/EE633: Reinforcement Learning

Lecture 3 - Dynamic Programming

Seungyul Han

UNIST
[email protected]

Spring 2024



Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Background

Table of Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Background

Bellman Optimality Equation

Bellman Optimality Equation (BOE)


$$V^*(s) = \max_{a\in\mathcal{A}} \Big\{ R(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\, V^*(s') \Big\}$$

$$Q^*(s,a) = \mathbb{E}_P\Big[\, \underbrace{r_{t+1}}_{R(s,a,s_{t+1})} + \gamma V^*(s_{t+1}) \;\Big|\; s_t = s,\ a_t = a \Big]$$

Optimal Policy

$$\pi^*(a|s) = \begin{cases} 1, & a = \arg\max_{\alpha\in\mathcal{A}} Q^*(s,\alpha) \\ 0, & \text{otherwise} \end{cases}$$



Background

Bellman Optimality Equation


Bellman Optimality Equation (BOE)

$$V^*(s) = \max_{a\in\mathcal{A}} \Big\{ R(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\, V^*(s') \Big\}$$

An MDP with S = {s1 , · · · , sN } and A = {a1 , · · · , aM }:


$$V^*(s^1) = \max\Big\{\, R(s^1,a^1) + \gamma\big[\,p(s^1\mid s^1,a^1)V^*(s^1) + p(s^2\mid s^1,a^1)V^*(s^2) + \cdots + p(s^N\mid s^1,a^1)V^*(s^N)\big],$$
$$\qquad\vdots$$
$$\qquad R(s^1,a^M) + \gamma\big[\,p(s^1\mid s^1,a^M)V^*(s^1) + p(s^2\mid s^1,a^M)V^*(s^2) + \cdots + p(s^N\mid s^1,a^M)V^*(s^N)\big]\Big\}$$
$$\vdots$$
$$V^*(s^N) = \max\Big\{\, R(s^N,a^1) + \gamma\big[\,p(s^1\mid s^N,a^1)V^*(s^1) + p(s^2\mid s^N,a^1)V^*(s^2) + \cdots + p(s^N\mid s^N,a^1)V^*(s^N)\big],$$
$$\qquad\vdots$$
$$\qquad R(s^N,a^M) + \gamma\big[\,p(s^1\mid s^N,a^M)V^*(s^1) + p(s^2\mid s^N,a^M)V^*(s^2) + \cdots + p(s^N\mid s^N,a^M)V^*(s^N)\big]\Big\}$$

A system of nonlinear equations ⇒ the number of cases for the max operations is $|\mathcal{A}|^{|\mathcal{S}|}$ ⇒ difficult to solve directly.
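To make the $|\mathcal{A}|^{|\mathcal{S}|}$ count concrete, here is a minimal Python sketch (not from the slides; the 2-state, 2-action MDP and its numbers are made up) that enumerates every deterministic policy, solves the linear Bellman equation for each, and keeps the componentwise-best value. The loop runs over exactly $|\mathcal{A}|^{|\mathcal{S}|}$ candidates, which is why this brute-force approach does not scale.

```python
import itertools
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 2.0]])
N, M = R.shape

best_V, best_actions = None, None
# Enumerate all |A|^|S| deterministic policies: one action chosen per state.
for actions in itertools.product(range(M), repeat=N):
    P_pi = np.array([P[s, actions[s]] for s in range(N)])   # N x N
    r_pi = np.array([R[s, actions[s]] for s in range(N)])   # length N
    # For a fixed policy the Bellman equation V = r_pi + gamma * P_pi V is linear.
    V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
    # An optimal policy dominates every other policy componentwise,
    # so keeping any componentwise-dominating V ends at V*.
    if best_V is None or np.all(V >= best_V - 1e-12):
        best_V, best_actions = V, actions

print("V* ~=", best_V, " optimal deterministic policy:", best_actions)
```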
Background

Solution to Finite MDPs Based on Iterative Approaches

[Diagram: two parallel hierarchies. Statistical inference (an inverse problem) is solved either model-based or data-based (supervised learning); likewise, an MDP is solved either model-based (dynamic programming) or sample-based (model-free RL: MC & TD).]



Background

Solution by Dynamic Programming

Dynamic Programming assumes full knowledge of the dynamics model $P(s'\mid s,a)$ and $R(s,a,s')$.
It solves an MDP by iterating prediction and control:
Prediction (or estimation): policy evaluation. For a given policy π, estimate $V^\pi(s)$ and/or $Q^\pi(s,a)$.
Control: policy improvement. Improve the policy with respect to the estimated $Q^\pi(s,a)$.



Policy Evaluation

Table of Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Policy Evaluation

The Banach Fixed Point Theorem

Theorem (The Banach Fixed Point Theorem)


Let (X , d) be a complete metric space and T : X → X be a contraction
mapping on X . Then, the mapping T has a unique fixed point x∗ ∈ X ,
i.e., x∗ = T (x∗ ). Furthermore, for any x0 , we have

$$\lim_{n\to\infty} T^n(x_0) = x^*,$$

where

$$T^n(x) = \underbrace{T \circ T \circ \cdots \circ T}_{n \text{ times}}(x).$$
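As a toy illustration (my own example, not from the slides): $T(x) = 0.5x + 1$ is a contraction on $\mathbb{R}$ with modulus 0.5, so by the theorem repeated application from any starting point converges to its unique fixed point $x^* = 2$.

```python
def T(x):
    # An affine contraction on the real line: |T(x) - T(y)| = 0.5 |x - y|.
    return 0.5 * x + 1.0

x = 100.0                      # arbitrary starting point x0
for n in range(1, 11):
    x = T(x)
    print(f"n={n:2d}   T^n(x0) = {x:.6f}   |T^n(x0) - 2| = {abs(x - 2.0):.2e}")
# The distance to the unique fixed point x* = 2 (since 2 = 0.5*2 + 1)
# shrinks by the contraction factor 0.5 at every step.
```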



Policy Evaluation

Policy Evaluation

Problem: Policy Evaluation


For a given policy π, we want to estimate the value function $V^\pi(s)$ and/or $Q^\pi(s,a)$.

Tool: The Bellman Equation and The Fixed Point Theorem


$$V^\pi(s) = \sum_{a,s'} \pi(a|s)\, P^a_{ss'}\big[ R(s,a,s') + \gamma V^\pi(s') \big] = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^\pi(s')$$



Policy Evaluation

Policy Evaluation: Bellman Backup Operator

The Bellman Backup Operator


$$V^\pi(s) = \sum_{a,s'} \pi(a|s)\, P^a_{ss'}\big[ R(s,a,s') + \gamma V^\pi(s') \big] = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^\pi(s') = T^\pi(V^\pi(s))$$

In matrix-vector form,

$$\begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix} = \begin{bmatrix} R^\pi(s^1) \\ \vdots \\ R^\pi(s^N) \end{bmatrix} + \gamma \begin{bmatrix} P^\pi_{11} & \cdots & P^\pi_{1N} \\ \vdots & \ddots & \vdots \\ P^\pi_{N1} & \cdots & P^\pi_{NN} \end{bmatrix} \begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix}$$

$T^\pi : \mathcal{V} \to \mathcal{V}$ is an affine operator.



Policy Evaluation

Policy Evaluation: Bellman Backup Operator

Bellman Backup Operator:

$$V^\pi(s) = T^\pi(V^\pi(s))$$

$$\begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix} = \underbrace{\begin{bmatrix} R^\pi(s^1) \\ \vdots \\ R^\pi(s^N) \end{bmatrix}}_{\triangleq\, r^\pi} + \gamma \underbrace{\begin{bmatrix} P^\pi_{11} & \cdots & P^\pi_{1N} \\ \vdots & \ddots & \vdots \\ P^\pi_{N1} & \cdots & P^\pi_{NN} \end{bmatrix}}_{\triangleq\, P^\pi} \begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix}$$

In matrix-vector form, $v \mapsto T^\pi(v)$ with

$$T^\pi(v) = r^\pi + \gamma P^\pi v$$
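Because $T^\pi$ is affine and $\gamma < 1$, its fixed point can also be written in closed form as $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$. A minimal numpy sketch, using a made-up 2-state $P^\pi$ and $r^\pi$, checks that this closed-form solution is left unchanged by the backup $v \mapsto r^\pi + \gamma P^\pi v$:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state model induced by some policy pi (made-up numbers).
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])        # row-stochastic transition matrix P^pi
r_pi = np.array([1.0, -0.5])         # expected one-step rewards R^pi(s)

def T_pi(v):
    """Bellman backup operator T^pi(v) = r^pi + gamma * P^pi v."""
    return r_pi + gamma * P_pi @ v

# Closed-form fixed point: v^pi = (I - gamma * P^pi)^{-1} r^pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

print("v^pi       =", v_pi)
print("T^pi(v^pi) =", T_pi(v_pi))    # identical to v^pi: it is the fixed point
```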



Policy Evaluation

Policy Evaluation: Bellman Backup Operator

The Bellman Backup Operator


$$V^\pi(s) = \sum_{a,s'} \pi(a|s)\, P^a_{ss'}\big[ R(s,a,s') + \gamma V^\pi(s') \big] = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^\pi(s') \;\triangleq\; T^\pi(V^\pi(s))$$

V π (s) is the solution to the Bellman equation.


V π (s) is a fixed point of the Bellman backup operator T π .
If T π is a contraction, we can apply the fixed point theorem to obtain
V π (s).



Policy Evaluation

Iterative Method for Policy Evaluation Based on the Fixed Point Theorem

Set an arbitrary initial function V (0) (s) for a given policy π.


Apply the iteration $V^{(k+1)}(s) = T^\pi(V^{(k)}(s))$. That is, in the synchronous backup case:

    for k = 0, 1, 2, ...
        for s = s^1 : s^N
            update V^(k+1)(s) = T^π(V^(k)(s))
        end for
    end for
The asynchronous backup case will be explained later.
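A minimal Python sketch of this synchronous iteration, assuming the model is stored as arrays P[s, a, s'] and R[s, a, s'] and the policy as pi[s, a] (these names and shapes are my own convention, not the slides'):

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, num_iters=1000, tol=1e-8):
    """Synchronous policy evaluation: V^(k+1)(s) = T^pi(V^(k))(s) for every s.

    P  : shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R  : shape (S, A, S), R[s, a, s'] = reward for transition (s, a, s')
    pi : shape (S, A),    pi[s, a]    = pi(a | s)
    """
    V = np.zeros(P.shape[0])                     # arbitrary initial V^(0)
    for _ in range(num_iters):
        # Expected immediate reward R^pi(s) plus discounted expected next value,
        # computed for all states from the *old* value function (synchronous).
        V_new = np.einsum('sa,sat,sat->s', pi, P, R) \
              + gamma * np.einsum('sa,sat,t->s', pi, P, V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```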



Policy Evaluation

The Bellman Backup Operator

Bellman Backup Operator

$$V^{(k+1)}(s) = T^\pi(V^{(k)}(s)) = \sum_{a,s'} \pi(a|s)\, P^a_{ss'}\big[ R(s,a,s') + \gamma V^{(k)}(s') \big] = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^{(k)}(s')$$

[Backup diagram: the new value $V^{(k+1)}(s)$ at state s is backed up through the transition (a, r) from the old values $V^{(k)}(s')$ of the successor states s′.]
Remark: If the transition is sparse, the backup operation for each state s can be
performed very efficiently.
Policy Evaluation

An Example: Gridworld (from the Textbook, Example 4.1)

[4×4 gridworld; r = −1 for all transitions; actions: up, down, left, right]

     T   1   2   3
     4   5   6   7
     8   9  10  11
    12  13  14   T

Nonterminal states: 1, 2, . . . , 14
One terminal state: the shaded upper-left and lower-right cells (they represent a single state)
Actions: left, right, up and down (actions that would leave the grid leave the state unchanged, e.g. P(7|7, R) = 1, P(4|4, L) = 1)
Reward is −1 for all actions until the terminal state is reached, i.e., R(s, a, s′) = −1 for all s, a, s′.
Consider a uniform random policy π:

$$\pi(N\mid\cdot) = \pi(S\mid\cdot) = \pi(E\mid\cdot) = \pi(W\mid\cdot) = \tfrac{1}{4}.$$

Estimate $V^\pi$ using iterative policy evaluation.
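A self-contained sketch of this example (my own indexing: cells 0..15, with cells 0 and 15 playing the role of the single terminal state) that runs synchronous policy evaluation with γ = 1; its output approaches the value grids shown on the next slides.

```python
import numpy as np

# 4x4 grid, cells 0..15; cells 0 and 15 play the role of the single terminal state.
ROWS = COLS = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

def step(s, a):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = divmod(s, COLS)
    nr, nc = r + a[0], c + a[1]
    return nr * COLS + nc if 0 <= nr < ROWS and 0 <= nc < COLS else s

gamma = 1.0
V = np.zeros(ROWS * COLS)
for k in range(10000):                            # iterative policy evaluation
    V_new = np.zeros_like(V)
    for s in range(ROWS * COLS):
        if s in TERMINAL:
            continue                              # terminal value stays 0
        # Uniform random policy: each action w.p. 1/4, reward -1 per transition.
        V_new[s] = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in ACTIONS)
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

print(np.round(V.reshape(ROWS, COLS), 1))         # approaches the -14/-18/-20/-22 grid
```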
Policy Evaluation

Iterative Policy Evaluation: Ex. Small Gridworld


$V^{(k)}(s)$ from the Bellman backup operator with γ = 1 (the slides also show, next to each grid, the greedy action w.r.t. $Q^{(k)}(s,a)$):

k = 0 (greedy policy: uniform random policy)

     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0
     0.0   0.0   0.0   0.0

k = 1

     0.0  −1.0  −1.0  −1.0
    −1.0  −1.0  −1.0  −1.0
    −1.0  −1.0  −1.0  −1.0
    −1.0  −1.0  −1.0   0.0

k = 2

     0.0  −1.7  −2.0  −2.0
    −1.7  −2.0  −2.0  −2.0
    −2.0  −2.0  −2.0  −1.7
    −2.0  −2.0  −1.7   0.0
Policy Evaluation

Iterative Policy Evaluation: Ex. Small Gridworld

k = 3

     0.0  −2.4  −2.9  −3.0
    −2.4  −2.9  −3.0  −2.9
    −2.9  −3.0  −2.9  −2.4
    −3.0  −2.9  −2.4   0.0

k = 10 (greedy policy: optimal policy)

     0.0  −6.1  −8.4  −9.0
    −6.1  −7.7  −8.4  −8.4
    −8.4  −8.4  −7.7  −6.1
    −9.0  −8.4  −6.1   0.0

k = ∞

     0.0   −14   −20   −22
     −14   −18   −20   −20
     −20   −20   −18   −14
     −22   −20   −14   0.0



Policy Improvement and Policy Iteration

Table of Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Policy Improvement and Policy Iteration

Policy Improvement

Question 1
Why did we estimate V π (s) (and/or Qπ (s, a))? Just to evaluate a policy π?

Question 2
When we are given a policy π, can we find an improved policy π ′ from the old
policy π?



Policy Improvement and Policy Iteration

Policy Improvement

V π (s) versus Qπ (s, a)

V π (s) = Eπ [Gt |st = s]


Qπ (s, a) = Eπ [Gt |st = s, at = a]

[Backup diagrams: $V^\pi(s)$ backs up from state s with the first action drawn from π; $Q^\pi(s,a)$ backs up from state s with the first action a fixed, following π thereafter.]

What if we take the first action from π′ and then follow π, i.e., $Q^\pi(s, \pi'(\cdot|s))$?



Policy Improvement and Policy Iteration

Policy Improvement

[Backup diagrams: $V^\pi(s)$ with the first action drawn from π versus $Q^\pi(s, a')$ with the first action a′ drawn from π′, following π afterwards.]

Comparison of π ′ to π based on first-step action

Qπ (s, π ′ (·|s)) ≥ V π (s) = Qπ (s, π(·|s)), ∀s ∈ S



Policy Improvement and Policy Iteration

Policy Improvement Theorem

Theorem (Policy Improvement Theorem)


Let π and π ′ be two policies such that

$$Q^\pi(s, \pi'(\cdot|s)) \ge V^\pi(s) = Q^\pi(s, \pi(\cdot|s)), \quad \forall s \in \mathcal{S}.$$

Then we have

$$V^{\pi'}(s) \ge V^\pi(s), \quad \forall s \in \mathcal{S},$$

i.e., $\pi' \ge \pi$.



Policy Improvement and Policy Iteration

Policy Improvement Theorem: Proof

" #
X X
V π (s) ≤ Qπ (s, π ′ (s)) = π ′ (a|s) r(s, a) + γ p(s′ |s, a)V π (s′ )
a s′

= Eπ′ [rt+1 + γV π (st+1 ) | st = s]


 
≤ Eπ′ rt+1 + γQπ (st+1 , π ′ (st+1 )) | st = s
= Eπ′ [rt+1 + γEπ′ [rt+2 + γV π (st+2 )] | st = s]
 
= Eπ′ rt+1 + γrt+2 + γ 2 V π (st+2 ) | st = s
 
≤ Eπ′ rt+1 + γrt+2 + γ 2 Qπ (st+2 , π ′ (st+2 )) | st = s
 
= Eπ′ rt+1 + γrt+2 + γ 2 Eπ′ [rt+3 + γV π (st+3 )] | st = s
 
= Eπ′ rt+1 + γrt+2 + γ 2 rt+3 + γ 3 V π (st+3 ) | st = s
..
.
≤ Eπ′ [rt+1 + γrt+2 + · · · | st = s]

= V π (s)

Policy Improvement and Policy Iteration

Policy Improvement

[Backup diagrams: $V^\pi(\tilde s)$ with the first action from π versus $Q^\pi(\tilde s, a' \sim \pi'(\tilde s))$ with the first action drawn from π′, following π afterwards.]

If we have for some $\tilde s \in \mathcal{S}$

$$Q^\pi(\tilde s, \pi'(\tilde s)) > V^\pi(\tilde s), \quad \text{with } \pi'(\tilde s) \ne \pi(\tilde s),$$

then set action $\pi'(\tilde s)$ for this $\tilde s$, and set $\pi'(s) = \pi(s)$ for all other $s\,(\ne \tilde s) \in \mathcal{S}$.

Then this π′ and π satisfy the condition of the policy improvement theorem. Hence, π′ > π.



Policy Improvement and Policy Iteration

Policy Improvement

The Policy Improvement Theorem (PIT) Condition

Qπ (s, π ′ (·|s)) ≥ V π (s) = Qπ (s, π(·|s)), ∀s ∈ S.

Construction of New Policy π ′ :


For each $s \in \mathcal{S}$, set π′ as

$$\pi'(s) = \arg\max_a Q^\pi(s,a),$$

so that

$$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s,a) \ge \sum_a \pi(a|s)\, Q^\pi(s,a) = V^\pi(s), \quad \forall s \in \mathcal{S}.$$



Hence, this $\pi' \ge \pi$, i.e., $V^{\pi'}(s) \ge V^\pi(s)$ for all $s \in \mathcal{S}$ by the PIT.
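A small sketch of this construction, assuming $V^\pi$ has already been computed and the model is stored as arrays P[s, a, s'] and R[s, a, s'] as in the earlier sketches (the array convention is mine, not the slides'):

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9):
    """Return the deterministic improved policy pi'(s) = argmax_a Q^pi(s, a).

    Q^pi is recovered from V^pi via
    Q^pi(s, a) = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s']).
    """
    Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
    return np.argmax(Q, axis=1)                  # one greedy action per state
```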



Policy Improvement and Policy Iteration

Policy Improvement: Stopping Criterion

When improvement stops, we have

$$Q^\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}} Q^\pi(s,a) = Q^\pi(s, \pi(s)) = V^\pi(s), \quad \forall s \in \mathcal{S}.$$

That is,

$$V^\pi(s) = \max_{a\in\mathcal{A}} Q^\pi(s,a), \quad \forall s \in \mathcal{S}.$$

But, this is nothing but the Bellman optimality equation.


Hence, when improvement stops, V π (s) = V ∗ (s) and π = π ∗ , i.e., an
optimal policy.



Policy Improvement and Policy Iteration

Policy Iteration

Policy Iteration
1 Initialize a policy π.
2 Evaluate the policy π by the fixed-point iteration, i.e., apply the Bellman backup operator iteratively.
3 Once $V^\pi$ is obtained, compute $Q^\pi$ from $V^\pi$ via their relationship, and obtain an improved policy π′ by $\pi'(s) = \arg\max_a Q^\pi(s,a)$.
4 Iterate Steps 2 and 3 until convergence.

The iteration converges in finite time for a finite MDP, because we only need to consider deterministic policies and the total number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$. If there is no improvement, then the optimal value function has been obtained, and hence an optimal policy has been obtained.
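Putting the four steps together, a minimal policy-iteration sketch under the same hypothetical array conventions as the earlier sketches:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=1000, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S = P.shape[0]
    pi = np.zeros(S, dtype=int)                  # Step 1: arbitrary deterministic policy
    while True:
        # Step 2: evaluate pi by repeated application of the Bellman backup.
        V = np.zeros(S)
        for _ in range(eval_iters):
            V_new = np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V)
                              for s in range(S)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Step 3: greedy improvement with respect to Q^pi obtained from V^pi.
        Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
        pi_new = np.argmax(Q, axis=1)
        # Step 4: stop once no state changes its action.
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new
```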



Policy Improvement and Policy Iteration

Policy Iteration

[Diagram: generalized policy iteration depicted between the space of value functions and the space of policies Π. Starting from an initial π, evaluation steps (V ↦ V^π) and improvement steps (π ↦ π′ = greedy(V)) alternate and converge to (V^*, π^*).]

Policy evaluation: estimate $V^\pi$ (justified by the fixed point theorem).
Policy improvement: $\pi' = \text{greedy}(V^\pi)$ (justified by the policy improvement theorem).
When the process converges, we have reached $\pi^*$ and $V^*$.


Generalized Policy Iteration

Table of Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Generalized Policy Iteration

Value Iteration

Question
Should we fully estimate V π for a given π? Note that estimation of V π
itself theoretically requires an infinite number of iterations by the Banach
fixed point theorem.

Modifications:
Stop policy evaluation when the iteration difference is less than some
threshold.
Just stop policy evaluation iterations after K iterations.
In the extreme case of K = 1, it reduces to value iteration.



Generalized Policy Iteration

Value Iteration

[Diagram: truncated policy evaluation produces an approximate value $\tilde V^\pi$; improvement takes $\text{greedy}(\tilde V^\pi)$; the alternation still converges to $(V^*, \pi^*)$.]




Generalized Policy Iteration

Value Iteration
One-step policy evaluation for the current π:

$$Q'(s,a) = \mathbb{E}_P\big[ r_{t+1} + \gamma V^{(k)}(s_{t+1}) \mid s_t = s, a_t = a \big] = R(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\, V^{(k)}(s')$$

Policy improvement:

$$\pi^{(k+1)} = \text{greedy}(Q'(s,a)) = \arg\max_a \mathbb{E}_P\big[ r_{t+1} + \gamma V^{(k)}(s_{t+1}) \mid s_t = s, a_t = a \big]$$

The resulting value estimate for $\pi^{(k+1)}$ is given by

$$V^{(k+1)}(s) = \max_a \mathbb{E}_P\big[ r_{t+1} + \gamma V^{(k)}(s_{t+1}) \mid s_t = s, a_t = a \big]$$

But this is the Bellman optimality backup operator $T^*$. Recall

$$V^*(s) = \max_{a\in\mathcal{A}}\Big\{ R(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\, V^*(s')\Big\} \qquad (1)$$



Generalized Policy Iteration

Value Iteration

[Backup diagram: $V^{(k+1)}(s)$ at state s is backed up over all actions a and rewards r from $V^{(k)}(s')$ at the successor states s′.]

$$V^{(k+1)}(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\, V^{(k)}(s') \Big]$$

$$v^{(k+1)} = \max_a \big[ r^a + \gamma P^a v^{(k)} \big]$$
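A vectorized sketch of this backup, again assuming model arrays P[s, a, s'] and R[s, a, s'] (my convention, not the slides'):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, num_iters=1000, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup to a single value array."""
    V = np.zeros(P.shape[0])
    for _ in range(num_iters):
        # Q[s, a] = R(s, a) + gamma * sum_{s'} p(s'|s,a) V(s'),
        # with R(s, a) = sum_{s'} P[s, a, s'] * R[s, a, s'].
        Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)                    # elementwise max over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)                   # value estimate and greedy policy
```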



Generalized Policy Iteration

Generalized Policy Iteration

Definition (Generalized Policy Iteration (GPI))


Generalized policy iteration is an iterative process composed of interacting
policy evaluation and policy improvement, independent of granularity and
details of the two processes.



Generalized Policy Iteration

Summary: Synchronous Dynamic Programming

Problem     | Equation                                     | Algorithm
Prediction  | Bellman Equation                             | Iterative Policy Evaluation
Control     | Bellman Equation + Greedy Policy Improvement | Policy Iteration
Control     | Bellman Optimality Equation                  | Value Iteration



Modifications

Table of Contents

1 Background

2 Policy Evaluation

3 Policy Improvement and Policy Iteration

4 Generalized Policy Iteration

5 Modifications



Modifications

Modifications

Synchronous DP

[Diagram: synchronous DP alternates full evaluation and improvement sweeps, backing up every state $s^1, \dots, s^{|S|}$ from its successor states s′ in each sweep.]

⇒ Asynchronous DP



Modifications

Asynchronous Dynamic Programming

Asynchronous DP applies backups to individual states, in any order.
We apply the backup only to selected states.
Asynchronous DP reduces computational complexity significantly.
Still, convergence is guaranteed if all states continue to be selected.

Methods for Asynchronous DP:

In-place DP
Prioritized sweeping
...



Modifications

In-Place Dynamic Programming

Synchronous value iteration stores two arrays of the value function.
For $s = s^1 : s^{|S|}$,

$$V^{\text{new}}(s) \leftarrow \max_{a\in\mathcal{A}}\Big[ R(s,a) + \gamma \sum_{s'\in N(s)} P(s'\mid s,a)\, V^{\text{old}}(s') \Big],$$

then $V^{\text{old}} \leftarrow V^{\text{new}}$.

In-place value iteration stores only a single array of the value function.
For $s = s^1 : s^{|S|}$,

$$V(s) \leftarrow \max_{a\in\mathcal{A}}\Big[ R(s,a) + \gamma \sum_{s'\in N(s)} P(s'\mid s,a)\, V(s') \Big].$$

[Diagram: the states $s^1, \dots, s^{|S|}$ are swept in order and each V(s) is updated immediately from the current values of its successors s′.]
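A sketch contrasting the two variants under the same hypothetical array convention; the in-place sweep overwrites V(s) immediately, so later states in the same sweep already see the updated values:

```python
import numpy as np

def _backup(P, R, V, s, gamma):
    """max_a sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s']) for one state s."""
    return np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))

def two_array_value_iteration(P, R, gamma=0.9, sweeps=100):
    V_old = np.zeros(P.shape[0])
    for _ in range(sweeps):
        # Every state is backed up from the values of the previous sweep.
        V_old = np.array([_backup(P, R, V_old, s, gamma) for s in range(P.shape[0])])
    return V_old

def in_place_value_iteration(P, R, gamma=0.9, sweeps=100):
    V = np.zeros(P.shape[0])                     # single array
    for _ in range(sweeps):
        for s in range(P.shape[0]):
            V[s] = _backup(P, R, V, s, gamma)    # later states see updated values
    return V
```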



Modifications

Prioritized Sweeping

Backup states with priority, e.g., back up the state with the largest Bellman error:

$$\Big|\, \max_{a\in\mathcal{A}}\Big( R(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, V(s') \Big) - V(s) \,\Big|.$$

After each backup, update the Bellman errors of the backed-up states.
This can be implemented efficiently by using a priority queue.
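A rough sketch of the idea using a priority queue (Python's heapq is a min-heap, so priorities are negated); the predecessor bookkeeping and stopping rule here are simplified stand-ins, not the slides' exact procedure:

```python
import heapq
import numpy as np

def bellman_error(P, R, V, s, gamma):
    """| max_a [R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')] - V(s) |"""
    backed_up = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return abs(backed_up - V[s])

def prioritized_sweeping_vi(P, R, gamma=0.9, max_backups=100000, theta=1e-8):
    S = P.shape[0]
    V = np.zeros(S)
    # Predecessors: states s that can reach t under some action with p > 0.
    pred = {t: {s for s in range(S) if P[s, :, t].max() > 0} for t in range(S)}
    heap = [(-bellman_error(P, R, V, s, gamma), s) for s in range(S)]
    heapq.heapify(heap)                          # max-priority via negated errors
    for _ in range(max_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break                                # even the largest queued error is tiny
        V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))   # back up state s
        for p in pred[s] | {s}:                  # these errors may have changed
            heapq.heappush(heap, (-bellman_error(P, R, V, p, gamma), p))
    return V
```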



Modifications

Drawback of DP: Full Backups

DP requires knowledge of the model dynamics: the transition probabilities and the reward function.
DP uses full backups: for each backup (sync or async) of $V^{(k+1)}(s)$, all possible actions a and next states s′ are considered to compute the update from r and $V^{(k)}(s')$.
So DP is computationally expensive; it is effective for medium-size problems (millions of states).
For large problems, it becomes difficult to apply DP as the number of states increases.
⇒ Model-free, sample-based approach
Modifications

References

Textbook: Sutton and Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 2018.
Dr. David Silver's course material.

