AI512/EE633: Reinforcement Learning: Lecture 3 - Dynamic Programming
Seungyul Han
UNIST
[email protected]
Spring 2024
Table of Contents
1 Background
2 Policy Evaluation
3 Policy Improvement and Policy Iteration
4 Value Iteration
5 Modifications
Optimal Policy
$$\pi^*(a|s) = \begin{cases} 1, & a = \arg\max_{\alpha \in \mathcal{A}} Q^*(s, \alpha) \\ 0, & \text{otherwise} \end{cases}$$

$$V^*(s) = \max_{a \in \mathcal{A}} \left\{ R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^*(s') \right\}$$
Writing this out for all states $s^1, \ldots, s^N$ and actions $a^1, \ldots, a^M$:

$$V^*(s^1) = \max\Big\{ R(s^1,a^1) + \gamma\big[ p(s^1|s^1,a^1)V^*(s^1) + p(s^2|s^1,a^1)V^*(s^2) + \cdots + p(s^N|s^1,a^1)V^*(s^N) \big],\ \ldots,\ R(s^1,a^M) + \gamma\big[ p(s^1|s^1,a^M)V^*(s^1) + p(s^2|s^1,a^M)V^*(s^2) + \cdots + p(s^N|s^1,a^M)V^*(s^N) \big] \Big\}$$
$$\vdots$$
$$V^*(s^N) = \max\Big\{ R(s^N,a^1) + \gamma\big[ p(s^1|s^N,a^1)V^*(s^1) + p(s^2|s^N,a^1)V^*(s^2) + \cdots + p(s^N|s^N,a^1)V^*(s^N) \big],\ \ldots,\ R(s^N,a^M) + \gamma\big[ p(s^1|s^N,a^M)V^*(s^1) + p(s^2|s^N,a^M)V^*(s^2) + \cdots + p(s^N|s^N,a^M)V^*(s^N) \big] \Big\}$$
A system of nonlinear equations ⇒ the number of cases for the max operation is $|\mathcal{A}|^{|\mathcal{S}|}$ ⇒ difficult to solve directly.
Background
[Diagram: Statistical inference — model-based (inverse problem) vs. data-based (supervised learning). Analogously, dynamic programming is the model-based approach (the MDP is known), in contrast to sample-based (model-free) methods.]
Policy Evaluation
Banach fixed point theorem: for a contraction $T$, repeated application from any starting point $x_0$ converges to the unique fixed point $x^*$:
$$\lim_{n \to \infty} T^n(x_0) = x^*, \qquad \text{where } T^n(x) \triangleq \underbrace{T \circ T \circ \cdots \circ T}_{n \text{ times}}(x).$$
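To make the iteration concrete, here is a tiny Python sketch (the contraction $T(x) = 0.5x + 1$ and the starting point $x_0 = 0$ are made-up illustrations, not from the lecture):

# T(x) = 0.5 * x + 1 is a contraction (modulus 0.5) with unique fixed point x* = 2.
def T(x):
    return 0.5 * x + 1.0

x = 0.0                    # arbitrary starting point x0
for _ in range(50):        # repeated application: x <- T(x), i.e., T^n(x0)
    x = T(x)
print(x)                   # approaches the fixed point x* = 2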
Policy Evaluation
The Bellman expectation equation for a given policy $\pi$:
$$V^\pi(s) = \sum_{a, s'} \pi(a|s)\, p(s'|s,a) \big[ R(s,a) + \gamma V^\pi(s') \big] = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^\pi(s') \triangleq T^\pi(V^\pi(s)),$$
where $R^\pi(s) \triangleq \sum_a \pi(a|s) R(s,a)$ and $P^\pi_{ss'} \triangleq \sum_a \pi(a|s)\, p(s'|s,a)$.
In matrix-vector form,
$$\begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix} = \underbrace{\begin{bmatrix} R^\pi(s^1) \\ \vdots \\ R^\pi(s^N) \end{bmatrix}}_{\triangleq\, \mathbf{r}^\pi} + \gamma \underbrace{\begin{bmatrix} P^\pi_{11} & \cdots & P^\pi_{1N} \\ \vdots & \ddots & \vdots \\ P^\pi_{N1} & \cdots & P^\pi_{NN} \end{bmatrix}}_{\triangleq\, \mathbf{P}^\pi} \begin{bmatrix} V^\pi(s^1) \\ \vdots \\ V^\pi(s^N) \end{bmatrix}$$

$T^\pi: \mathcal{V} \to \mathcal{V}$ is an affine operator:
$$V^\pi(s) = T^\pi(V^\pi(s)), \qquad T^\pi(\mathbf{v}) = \mathbf{r}^\pi + \mathbf{P}^\pi \mathbf{v}.$$
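Because $T^\pi$ is affine, the fixed point can also be obtained directly by solving the linear system $\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi$, i.e., $\mathbf{v}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{r}^\pi$. A minimal NumPy sketch (the two-state $\mathbf{r}^\pi$, $\mathbf{P}^\pi$ below are made-up numbers for illustration):

import numpy as np

gamma = 0.9
r_pi = np.array([1.0, 0.0])                 # hypothetical expected rewards under pi
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])               # hypothetical transition matrix under pi

# Fixed point of the affine operator: v = r_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)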
Iterative policy evaluation applies the Bellman backup to the current estimate:
$$V^{(k+1)}(s) = R^\pi(s) + \gamma \sum_{s'} P^\pi_{ss'} V^{(k)}(s') = T^\pi(V^{(k)}(s))$$
[Backup diagram: $V^{(k+1)}(s)$ at state $s$ is computed by backing up over the transition $(a, r, s')$ from the successor values $V^{(k)}(s')$.]
Remark: If the transition is sparse, the backup operation for each state s can be
performed very efficiently.
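A minimal sketch of this iterative backup, assuming the model under $\pi$ is given as arrays R_pi[s] and P_pi[s, s'] (illustrative names, not from the slides):

import numpy as np

def iterative_policy_evaluation(R_pi, P_pi, gamma=0.9, tol=1e-8):
    # R_pi[s]: expected reward under pi;  P_pi[s, s2]: state-transition matrix under pi.
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V     # one synchronous Bellman expectation backup T^pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new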
Policy Evaluation
[Figure: 4×4 gridworld with nonterminal states numbered 1–14, two shaded terminal corner cells, r = −1 for all transitions, and the four movement actions.]
Nonterminal states: 1, 2, …, 14
One terminal state: the shaded upper-left and lower-right cells (treated as a single state)
Actions: left, right, up, and down (actions that would move off the grid leave the state unchanged, e.g., $P(7\,|\,7,\text{R}) = 1$, $P(4\,|\,4,\text{L}) = 1$)
Reward is −1 for all actions until the terminal state is reached, i.e., $R(s, a, s') = -1$ for all $s, a, s'$.
Consider a uniform random policy $\pi$:
$$\pi(\text{N}\,|\,\cdot) = \pi(\text{S}\,|\,\cdot) = \pi(\text{E}\,|\,\cdot) = \pi(\text{W}\,|\,\cdot) = \frac{1}{4}.$$
Estimate V π using iterative policy evaluation.
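A sketch of this example under my own encoding (cells 0 and 15 both act as the absorbing terminal state, nonterminal cells 1–14 match the slide's numbering, and the setting is assumed undiscounted, $\gamma = 1$, as in the standard version of this example):

import numpy as np

# 4x4 grid, cells 0..15; cells 0 and 15 play the role of the single terminal state.
gamma = 1.0
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # up, down, left, right

def step(s, a):
    if s in (0, 15):                                   # terminal: absorbing, reward 0
        return s, 0.0
    row, col = divmod(s, 4)
    nr, nc = row + a[0], col + a[1]
    if not (0 <= nr < 4 and 0 <= nc < 4):              # moves off the grid leave the state unchanged
        nr, nc = row, col
    return nr * 4 + nc, -1.0                           # reward -1 for every transition

V = np.zeros(16)
for _ in range(1000):                                  # iterative policy evaluation, uniform random pi
    V_new = np.zeros(16)
    for s in range(1, 15):
        V_new[s] = sum(0.25 * (r + gamma * V[s2]) for s2, r in (step(s, a) for a in actions))
    V = V_new

print(V.reshape(4, 4).round(1))                        # converged values, e.g. V(1) ≈ -14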
Policy Improvement
Question 1
Why did we estimate V π (s) (and/or Qπ (s, a))? Just to evaluate a policy π?
Question 2
When we are given a policy π, can we find an improved policy π ′ from the old
policy π?
Policy Improvement
[Backup diagrams: $V^\pi(s)$ — take $a \sim \pi(\cdot|s)$ in $s$ and follow $\pi$ thereafter; $Q^\pi(s, a)$ — take action $a$ in $s$, then follow $\pi$.]
What if we take the first action $a$ from $\pi'$ and then follow $\pi$, i.e., $Q^\pi(s, \pi'(\cdot|s))$?
Policy Improvement
[Backup diagrams: $V^\pi(s)$ — follow $\pi$ from $s$; $Q^\pi(s, a')$ — take the first action $a'$ from $\pi'$ in $s$, then follow $\pi$.]
Policy improvement theorem: if for all $s \in \mathcal{S}$
$$V^\pi(s) \le Q^\pi(s, \pi'(s)) = \sum_{a} \pi'(a|s) \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^\pi(s') \right],$$
then
$$V^{\pi'}(s) \ge V^\pi(s), \quad \forall s \in \mathcal{S}, \quad \text{i.e., } \pi' \ge \pi.$$
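For completeness, the standard unrolling argument behind the theorem (not spelled out on this slide): repeatedly apply the assumed inequality inside the expectation under $\pi'$,
$$\begin{aligned}
V^\pi(s) &\le Q^\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right] \\
&\le \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s \right] \\
&\le \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 Q^\pi(s_{t+2}, \pi'(s_{t+2})) \mid s_t = s \right] \\
&\le \cdots \le \mathbb{E}_{\pi'}\!\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \right] = V^{\pi'}(s).
\end{aligned}$$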
Policy Improvement
[Backup diagrams: $V^\pi(\tilde{s})$ — follow $\pi$ from $\tilde{s}$; $Q^\pi(\tilde{s}, a')$ with $a' \sim \pi'(\cdot|\tilde{s})$ — take the first action from $\pi'$, then follow $\pi$.]
If for some $\tilde{s} \in \mathcal{S}$ we have $Q^\pi(\tilde{s}, a') > V^\pi(\tilde{s})$ for some action $a'$, then set $\pi'(\tilde{s}) = a'$ for this $\tilde{s}$, and set $\pi'(s) = \pi(s)$ for all other $s (\ne \tilde{s}) \in \mathcal{S}$. Then this $\pi'$ and $\pi$ satisfy the condition of the policy improvement theorem, with strict inequality at $\tilde{s}$. Hence,
$\pi' > \pi$.
Policy Improvement
When no further improvement is possible, the greedy step leaves $\pi$ unchanged. That is,
$$V^\pi(s) = \max_{a \in \mathcal{A}} Q^\pi(s, a) \quad \forall s \in \mathcal{S},$$
which is the Bellman optimality equation, so $\pi$ is an optimal policy.
Policy Iteration
1 Initialize a policy π.
2 Evaluate the policy π by the fixed-point iteration, i.e., apply the Bellman expectation backup operator $T^\pi$ iteratively.
3 Once $V^\pi$ is obtained, compute $Q^\pi$ from $V^\pi$ via their relationship, and obtain an improved policy $\pi'$ by $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q^\pi(s, a)$ (greedy).
4 Iterate Steps 2 and 3 until convergence.
The iteration converges in finite time for a finite MDP, because we only need to consider deterministic policies and the total number of deterministic policies is $|\mathcal{A}|^{|\mathcal{S}|}$. If no improvement occurs, the optimal value function has been reached, and hence an optimal policy is obtained.
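A compact sketch of the loop above, assuming the model is given as arrays P[s, a, s'] and R[s, a] (an illustrative layout, not from the slides):

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    # P[s, a, s2]: transition probability s -> s2 under action a;  R[s, a]: expected reward.
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # deterministic policy: state -> action index
    while True:
        # Step 2: policy evaluation -- iterate the Bellman expectation backup V <- r_pi + gamma * P_pi V.
        P_pi = P[np.arange(n_states), pi]           # (n_states, n_states) transitions under pi
        R_pi = R[np.arange(n_states), pi]
        V = np.zeros(n_states)
        while True:
            V_new = R_pi + gamma * P_pi @ V
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Step 3: policy improvement -- greedy with respect to Q^pi.
        Q = R + gamma * P @ V                       # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] V[s2]
        pi_new = Q.argmax(axis=1)
        # Step 4: stop when the policy no longer changes.
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new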
Policy Iteration
[Diagram: generalized policy iteration — starting from an initial $\pi$ and $V$, alternate evaluation ($V \mapsto V^\pi$) and improvement ($\pi \mapsto \pi' = \text{greedy}(V)$); the trajectory in the joint space of value functions and policies converges to $(V^*, \pi^*)$.]
Value Iteration
Question
Should we fully estimate $V^\pi$ for a given $\pi$? Note that computing $V^\pi$ exactly requires, in principle, an infinite number of iterations of the Banach fixed-point iteration.
Modifications:
Stop policy evaluation when the iteration difference is less than some
threshold.
Just stop policy evaluation iterations after K iterations.
In the extreme case of K = 1, it reduces to value iteration.
Value Iteration
[Diagram: with truncated evaluation, the improvement step uses $\text{greedy}(\tilde{V}^\pi)$ based on an approximate value $\tilde{V}^\pi$ rather than the exact $V^\pi$, while the iterates still move toward $(V^*, \pi^*)$.]
Value Iteration
Value iteration alternates policy improvement with one-step policy evaluation for the current $\pi$.
Policy improvement:
$$\pi^{(k+1)} = \text{greedy}(Q'(s,a)) = \arg\max_a \mathbb{E}_P\!\left[ r_{t+1} + \gamma V^{(k)}(s_{t+1}) \mid s_t = s, a_t = a \right]$$
One step of policy evaluation for $\pi^{(k+1)}$ then gives
$$V^{(k+1)}(s) = \max_a \mathbb{E}_P\!\left[ r_{t+1} + \gamma V^{(k)}(s_{t+1}) \mid s_t = s, a_t = a \right]$$
Value Iteration
[Backup diagram: $V^{(k+1)}(s)$ at state $s$ is obtained by maximizing over actions $a$ and backing up the reward $r$ and successor values $V^{(k)}(s')$.]

$$V^{(k+1)}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^{(k)}(s') \right]$$

In matrix-vector form:
$$\mathbf{v}^{(k+1)} = \max_a \left[ \mathbf{r}^a + \gamma \mathbf{P}^a \mathbf{v}^{(k)} \right]$$
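A minimal sketch of this update, using the same illustrative P[s, a, s'] / R[s, a] layout as in the policy iteration sketch above:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P[s, a, s2]: transition probabilities;  R[s, a]: expected rewards.
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V                # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] V[s2]
        V_new = Q.max(axis=1)                # v^(k+1) = max_a [ r^a + gamma * P^a v^(k) ]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value estimate and a greedy policy
        V = V_new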
Modifications
Synchronous DP
[Diagram: synchronous DP — one sweep backs up every state $s^1, \ldots, s^{|\mathcal{S}|}$ using the old value array, then sets $V_{\text{old}} \leftarrow V_{\text{new}}$.]
⇒ Asynchronous DP: back up states individually, in any order, using the most recently computed values.
In-place value iteration stores only a single array of the value function.
For $s = s^1, \ldots, s^{|\mathcal{S}|}$:
$$V(s) \leftarrow \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s' \in \mathcal{N}(s)} P(s'|s,a)\, V(s') \right]$$
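A sketch of the in-place sweep (same illustrative P[s, a, s'] / R[s, a] layout; for simplicity the sum runs over all successor states rather than a sparse neighbor set):

import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, n_sweeps=100):
    # A single value array is kept; each backup immediately reuses values updated earlier in the sweep.
    V = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        for s in range(R.shape[0]):
            V[s] = np.max(R[s] + gamma * P[s] @ V)   # in-place Bellman optimality backup for state s
    return V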
Prioritized Sweeping
Backup states with priority, e.g., back up the state with the largest Bellman error:
$$\max_{a \in \mathcal{A}} \left( R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, V(s') \right) - V(s).$$
(This requires knowledge of the predecessors of states.)
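A simple illustrative sketch of prioritized backups: at each step, recompute the Bellman errors and back up only the state with the largest one (a practical implementation would instead maintain a priority queue over predecessor states; that bookkeeping is omitted here):

import numpy as np

def prioritized_backups(P, R, gamma=0.9, n_backups=1000, eps=1e-8):
    # Repeatedly back up the single state with the largest Bellman error.
    V = np.zeros(R.shape[0])
    for _ in range(n_backups):
        Q = R + gamma * P @ V                          # Q[s, a]
        bellman_error = np.abs(Q.max(axis=1) - V)      # priority of each state
        s = int(bellman_error.argmax())
        if bellman_error[s] < eps:                     # all errors negligible -> converged
            break
        V[s] = Q[s].max()                              # back up only state s
    return V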
For large problems, DP becomes difficult to apply as the number of states grows.
⇒ Model-free, sample-based approaches.