08 - Markov Decision Processes
Saverio Bolognani
[Diagram: the control problem as a dynamical system, with input, state, disturbance input, performance output, measurements, and dynamics indicated]
2 / 29
A state-space representation of such a system is problematic.
discretized state space (number of vehicles per queue)
nonlinear dynamics
[Figure: traffic-light queue with arrival and service processes]
3 / 29
Markov chain
[Figure: Markov chain of the queue length q = 0, 1, 2, 3, 4, with arrival probability p]
4 / 29
Markov decision process
set X of N states
set U of M control actions
transition probabilities
P : X × U × X → [0, 1]
P^u_{x,x′} = P[x′ | x, u]
Markov property
For every x′, P^u_{x,x′} depends only on x and u.
It does not depend on how the system got to x (past states and past inputs).
immediate cost
c(x, u, x ′ )
is the cost after transitioning to state x ′ given that the MDP is in state x and
action u is taken.
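One possible way to store such a finite MDP in code, used for illustration in the snippets below (the layout is an assumption, not prescribed here), is with numpy arrays:

```python
import numpy as np

# Illustrative layout for a finite MDP with N states and M actions:
#   P[x, u, y] = P[x' = y | x, u]   (transition probabilities)
#   c[x, u, y] = c(x, u, x' = y)    (immediate cost)
N, M = 5, 2
rng = np.random.default_rng(0)

P = rng.random((N, M, N))
P /= P.sum(axis=2, keepdims=True)        # each row P[x, u, :] must sum to 1
c = rng.random((N, M, N))

# Expected immediate cost C_x^u = E[c_k | x_k = x, u_k = u]
C = np.einsum('xuy,xuy->xu', P, c)
```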
5 / 29
Policies
Deterministic policies
µ : X → U
Stochastic policies
π : X × U → [0, 1]
π(x, u) = P[u|x] is the conditional probability of selecting the input (action) u given
that the MDP is in state x.
Although in many control applications one looks for deterministic policies, the
general formulation with stochastic policies is not more complicated (and will be
useful when it comes to learning).
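With the same illustrative array layout as before, a stochastic policy can be stored as an N × M row-stochastic matrix, and a deterministic policy is the special case with one-hot rows:

```python
import numpy as np

N, M = 5, 2

# Stochastic policy: pi[x, u] = P[u | x], each row sums to 1
pi = np.full((N, M), 1.0 / M)            # e.g. uniformly random actions

# Deterministic policy mu : X -> U encoded as one-hot rows of the same matrix
mu = np.array([0, 0, 1, 1, 1])           # action index chosen in each state
pi_det = np.eye(M)[mu]                   # pi_det[x, u] = 1 if u == mu[x], else 0
```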
6 / 29
Optimal sequential decision problems
c_k = c(x_k, u_k, x_{k+1})
is the cost incurred at time k when the system goes from xk to xk+1 under input uk .
7 / 29
Example
8 / 29
Dynamic programming on MDP
In order to solve this problem recursively (as we did for LQR problems), we define
the value
V_k^π(x) = E[ Σ_{i=k}^{K} c_i | x_k = x ]
where
C_x^u = E[c_k | x_k = x, u_k = u]
9 / 29
" " K
##
X X X ′
Vkπ (x) = πk (x, u) Cxu + u
Px,x ′E ci |xk+1 = x
u x′ i=k+1
" Vk∗ : optimal value from stage k, Vkπ : value from stage k if πk , πk+1 , . . . is used.
10 / 29
Optimal policy via backward induction
" #
X X
Vk∗ = min πk (x, u) Cxu + u
Px,x ∗
′ Vk+1
πk
u x′
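A minimal sketch of this backward induction, assuming the arrays P[x, u, x'] and expected cost C[x, u] from the earlier snippet; since the minimum over π_k is attained by a deterministic choice, it is computed as a minimum over actions:

```python
import numpy as np

def backward_induction(P, C, K):
    """Finite-horizon dynamic programming: P[x, u, y] transition probabilities,
    C[x, u] expected immediate cost, horizon k = 0, ..., K (zero terminal cost assumed)."""
    N, M, _ = P.shape
    V = np.zeros(N)                          # V_{K+1} = 0
    policy = np.zeros((K + 1, N), dtype=int)
    for k in range(K, -1, -1):
        Q = C + P @ V                        # Q[x, u] = C_x^u + sum_y P[x, u, y] V_{k+1}(y)
        policy[k] = np.argmin(Q, axis=1)     # greedy (deterministic) action at stage k
        V = Q.min(axis=1)                    # V_k^*
    return V, policy
```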
11 / 29
Infinite horizon / stationary problems
Most of the time, when working with MDPs, we look at infinite-horizon problems.
stationary behavior of the Markov process
time-invariant expected cost C_x^u
time-invariant policy π
Infinite-horizon cost
J = Σ_{k=0}^{∞} γ^k c_k
12 / 29
Traffic light example
[Figure: traffic-light queue with arrival and service processes]
Possible costs:
negative throughput
R_3^green = ?    R_2^green = ?
total time spent in the queue
R_3^green = ?    R_3^red = ?
13 / 29
Discount factor
For 0 < γ < 1, a balance between immediate cost and future cost is
achieved.
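A quick bound (assuming a bounded stage cost |c_k| ≤ c_max) shows why 0 < γ < 1 keeps the infinite-horizon cost finite:

```latex
|J| \;=\; \Bigl|\sum_{k=0}^{\infty} \gamma^k c_k\Bigr|
\;\le\; \sum_{k=0}^{\infty} \gamma^k c_{\max}
\;=\; \frac{c_{\max}}{1-\gamma},
\qquad 0 < \gamma < 1 .
```

γ close to 0 makes the decision maker myopic, while γ close to 1 weights future costs almost as much as immediate ones.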
14 / 29
Closed-loop MDP
For a fixed policy π, the MDP reduces to a Markov chain with transition
probabilities
P^π_{x,x′} = Σ_u P[x′ | x, u] P[u | x] = Σ_u π(x, u) P^u_{x,x′}
Evolution (linear!)
d_{k+1}^⊤ = d_k^⊤ P^π
d̄_π^⊤ = d̄_π^⊤ P^π
For every policy π, there exists a stationary distribution d̄_π(x) that gives the
steady-state probability that the system is in state x.
Assuming ergodicity of the Markov chain: possible path from any state to any state.
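With the assumed P[x, u, x'] and pi[x, u] arrays from the earlier snippets, this mixing step is a single contraction:

```python
import numpy as np

def closed_loop_transitions(P, pi):
    """P_pi[x, y] = sum_u pi[x, u] * P[x, u, y] for a fixed stochastic policy pi."""
    return np.einsum('xu,xuy->xy', pi, P)
```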
15 / 29
Example
What is the closed-loop MDP for the traffic light queue problem, with the
deterministic policy  µ(q) = green if q ≥ 3, red otherwise?
[Figure: closed-loop Markov chain on states q = 0, 1, 2, 3; for q < 3 the queue grows by one with probability p and stays put with probability 1−p; from q = 3 it moves to q = 0 with probability 1−p and to q = 1 with probability p]
16 / 29
Example
Stationary distribution
The stationary distribution on the states can be computed by normalizing the left
eigenvector of P^π associated with the eigenvalue 1.
Example: p = 0.2
q = 0:  26.7%
q = 1:  33.3%
q = 2:  33.3%
q = 3:   6.7%
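A sketch of this eigenvector computation for the closed-loop chain above (the matrix P^π is the one spelled out a few slides later), using numpy:

```python
import numpy as np

p = 0.2
# Closed-loop transition matrix P^pi for mu(q) = green if q >= 3, red otherwise
P_pi = np.array([[1 - p, p,     0,     0],
                 [0,     1 - p, p,     0],
                 [0,     0,     1 - p, p],
                 [1 - p, p,     0,     0]])

# Left eigenvector of P^pi for eigenvalue 1 = right eigenvector of P_pi.T
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmin(np.abs(w - 1))])
d /= d.sum()                      # normalize to a probability distribution
print(d)                          # approx. [0.267, 0.333, 0.333, 0.067]
```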
17 / 29
Infinite-horizon value of a policy
The value of a policy π is defined as the conditional expected value of
infinite-horizon future cost when starting in state x
"∞ #
π
X k
V (x) = E γ ck |x0 = x
k=0
18 / 29
" #
X X ′
π
V (x) = π(x, u) Cxu +γ u
Pxx π
′ V (x )
u x′
19 / 29
Computational complexity of the Bellman equation
" #
X X ′
π
V (x) = π(x, u) Cxu +γ u
Pxx π
′ V (x )
u x′
If the MDP is finite with N states, then the Bellman equation is a system of N
linear equations.
Policy evaluation
V^π = C^π + γ P^π V^π
where
C^π(x) = Σ_u π(x, u) C_x^u   (expected cost)
P^π_{x,x′} = Σ_u π(x, u) P^u_{x,x′}   (expected transition probabilities)
V^π = (I − γ P^π)^{−1} C^π
20 / 29
Example
V^π = (I − γ P^π)^{−1} C^π
[Figure: closed-loop Markov chain on states q = 0, 1, 2, 3, as on the previous slides]
P^π = [[1−p,  p,   0,   0 ],
       [ 0,  1−p,  p,   0 ],
       [ 0,   0,  1−p,  p ],
       [1−p,  p,   0,   0 ]]

C^π = [0, 1, 2, 3]^⊤   (cars in queue)
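A short numpy sketch of this policy evaluation for the queue example; the discount factor γ = 0.9 is an arbitrary choice for illustration:

```python
import numpy as np

p, gamma = 0.2, 0.9               # gamma chosen arbitrarily for illustration
P_pi = np.array([[1 - p, p,     0,     0],
                 [0,     1 - p, p,     0],
                 [0,     0,     1 - p, p],
                 [1 - p, p,     0,     0]])
C_pi = np.array([0.0, 1.0, 2.0, 3.0])   # cars in queue

# Policy evaluation: V^pi = (I - gamma * P^pi)^{-1} C^pi
V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, C_pi)
print(V_pi)
```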
21 / 29
Bellman principle for infinite horizon MDPs
22 / 29
Iterative computation of the optimal policy
Policy evaluation
Given a policy π, we can evaluate the resulting value V^π.
Interpretation
Can I improve the cost by deviating from the current policy π for one step, and
then falling back to the policy π?
23 / 29
Policy improvement theorem
It can be shown that the greedy policy improvement always improves the value, i.e.,
V^{π′}(x) ≤ V^π(x) ∀x, and ∃x′ such that V^{π′}(x′) < V^π(x′)
The r.h.s. is equal to the expectation E[c_0 + γ V^π(x_1)] (with respect to π′).
The strict inequality follows by assuming "<" for x′ (and "≤" for the others).
Corollary: after a finite number of improvements we reach the optimum (why?).
24 / 29
Policy iteration algorithm
Computational complexity
Finite number of iterations of
1 Policy evaluation: expensive → O(N^3 + N^2 M)
2 Policy improvement: easy → min over M alternatives, N times
(it is enough to search over deterministic policies).
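A compact sketch of policy iteration under the array layout assumed in the earlier snippets (P[x, u, x'] and expected cost C[x, u]), restricted to deterministic policies as noted above:

```python
import numpy as np

def policy_iteration(P, C, gamma, max_iter=1000):
    """Policy iteration for a finite MDP with P[x, u, y] and expected cost C[x, u]."""
    N, M, _ = P.shape
    policy = np.zeros(N, dtype=int)           # start from an arbitrary deterministic policy
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P^pi) V = C^pi
        P_pi = P[np.arange(N), policy]        # N x N closed-loop transitions
        C_pi = C[np.arange(N), policy]        # N expected costs under the policy
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, C_pi)
        # Policy improvement: greedy with respect to the current value
        Q = C + gamma * (P @ V)               # N x M
        new_policy = np.argmin(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
    return V, policy
```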
25 / 29
Value iteration
Alternatively, we can look for a fixed point of the Bellman optimality equation
" #
∗
X u
X u ∗ ′
V (x) = min π(x, u) Cx + γ Pxx ′ V (x )
π
u x′
Convergence result
The value iteration is contractive: if V^{(t)} and W^{(t)} are two value functions, then
∥V^{(t+1)} − W^{(t+1)}∥_∞ ≤ γ ∥V^{(t)} − W^{(t)}∥_∞
Therefore, the value iteration converges at rate γ to V^*:
∥V^{(t)} − V^*∥_∞ ≤ γ^t ∥V^{(0)} − V^*∥_∞
26 / 29
Value iteration algorithm
Computational complexity
Asymptotic convergence at rate γ
Bellman iteration: usually cheap → O(N^2 M)
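A matching sketch of value iteration under the same assumed arrays:

```python
import numpy as np

def value_iteration(P, C, gamma, tol=1e-8, max_iter=10000):
    """Value iteration: P[x, u, y] transition probabilities, C[x, u] expected cost."""
    N, M, _ = P.shape
    V = np.zeros(N)
    for _ in range(max_iter):
        V_new = (C + gamma * (P @ V)).min(axis=1)    # Bellman backup, O(N^2 M)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = np.argmin(C + gamma * (P @ V), axis=1)  # greedy policy w.r.t. the final value
    return V, policy
```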
27 / 29
Toy example: Policy iteration
[Figure: policy iteration iterates on a toy example]
28 / 29
Toy example: Value iteration
[Figure: value iteration iterates on a toy example]
29 / 29
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO