08 - Markov Decision Processes

The document discusses the application of Markov decision processes (MDPs) to control traffic lights with the goal of minimizing waiting times and maximizing throughput while managing queue lengths. It outlines the structure of MDPs, including states, control actions, transition probabilities, and policies, as well as the principles of dynamic programming and optimal policy determination through backward induction. Additionally, it addresses the computational complexity of evaluating policies and the Bellman equation in the context of infinite-horizon problems.


Computational Control

Markov decision processes

Saverio Bolognani

Automatic Control Laboratory (IfA)


ETH Zurich
1 / 29
Problem:
turn traffic lights to red/green
minimize waiting time / maximize throughput
ensure queue lengths are limited

input:
state:
disturbance input:
performance output:
measurements:
dynamics:

What would a “PID-like” controller look like?


What is preventing you from using, for example, MPC?

2 / 29
A state-space representation of such a system is problematic.
discretized state space (number of vehicles per queue)
nonlinear dynamics

Is another Markovian representation possible?

[Figure: a queue of cars with stochastic arrivals and a service rate controlled by the traffic light.]

Update equation at every time step (a simulation sketch follows below):


with probability p:  queue ← queue + 1   (arrival)
if the light is green:  queue ← max(0, queue − 3)   (service)
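As a rough illustration of this update rule, here is a minimal simulation sketch in Python; the policy function, the value p = 0.2, and the order of service and arrival within a step are assumptions made for illustration only.

```python
import random

def simulate_queue(policy, p=0.2, steps=1000, seed=0):
    """Simulate the single-queue traffic-light model.

    policy: maps the current queue length to True (green) or False (red).
    p:      arrival probability per time step (illustrative value).
    """
    rng = random.Random(seed)
    queue, history = 0, []
    for _ in range(steps):
        if policy(queue):                    # green light: serve up to 3 cars
            queue = max(0, queue - 3)
        if rng.random() < p:                 # one arrival with probability p
            queue += 1
        history.append(queue)
    return history

# Example policy from the next slides: green light if the queue is longer than 3.
trace = simulate_queue(lambda q: q > 3)
print(max(trace), sum(trace) / len(trace))   # worst-case and average queue length
```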

3 / 29
Markov chain
[Figure: Markov chains on the queue length q = 0, 1, 2, 3, 4, ... Under a red light the queue grows by one with probability p and stays the same with probability 1 − p; under a fixed green/red duty cycle the green phase additionally applies queue = max(0, queue − 3).]


more complex policies:
▶ green light if the queue is longer than 3
▶ red light if the queue is empty
▶ etc.

4 / 29
Markov decision process

set X of N states
set U of M control actions
transition probabilities

P : X × U × X → [0, 1]

    P^u_{x,x'} = P[x' | x, u]

is the conditional probability of transitioning to state x' if the MDP is in state x
and the input u is applied.

Markov property
For every x', P^u_{x,x'} depends only on x and u.

It does not depend on how the system got to x (past states and past inputs).

immediate cost
c(x, u, x')
is the cost incurred after transitioning to state x', given that the MDP is in state x and
action u is taken.
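For concreteness, a finite MDP like this one can be stored as plain arrays. The sketch below encodes the traffic-light queue truncated at q = 3; the truncation, the value p = 0.2, and the choice of cost equal to the queue length are assumptions made for illustration only.

```python
import numpy as np

p = 0.2                 # arrival probability (illustrative)
N, M = 4, 2             # states q = 0..3, actions: 0 = red, 1 = green

# P[u, x, x'] stores the transition probabilities P^u_{x,x'},
# C[x, u] the expected immediate cost C^u_x.
P = np.zeros((M, N, N))
for q in range(N):
    # red light: one arrival with probability p (queue capped at 3 by the truncation)
    P[0, q, min(q + 1, N - 1)] += p
    P[0, q, q] += 1 - p
    # green light: serve up to 3 cars, then one arrival with probability p
    served = max(0, q - 3)
    P[1, q, min(served + 1, N - 1)] += p
    P[1, q, served] += 1 - p

C = np.tile(np.arange(N).reshape(N, 1), (1, M))   # cost = queue length, for both actions
assert np.allclose(P.sum(axis=2), 1.0)            # each row is a probability distribution
```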

5 / 29
Policies

Control design problems on MDPs consist in designing a policy that selects a
suitable action in U based on the current state x ∈ X of the system.

Deterministic policies

µ : X → U

Stochastic (or mixed) policies

π : X × U → [0, 1]
π(x, u) = P[u|x] is the conditional probability of selecting the input (action) u given
that the MDP is in state x.

Although in many control applications one looks for deterministic policies, the
general formulation for stochastic policies is not more complicated (and will be
useful when it comes to learning).
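A possible array encoding of both kinds of policy (a minimal sketch; the specific probabilities below are arbitrary illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy mu : X -> U, one action index per state
mu = np.array([0, 0, 0, 1])            # e.g. red for q = 0, 1, 2 and green for q = 3

# Stochastic policy pi : X x U -> [0, 1], one probability row per state
pi = np.array([[1.0, 0.0],
               [0.9, 0.1],
               [0.5, 0.5],
               [0.0, 1.0]])
assert np.allclose(pi.sum(axis=1), 1.0)

def sample_action(x):
    """Draw an action u with probability pi(x, u)."""
    return rng.choice(pi.shape[1], p=pi[x])

print(sample_action(2))
```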

6 / 29
Optimal sequential decision problems

Consider an MDP “through time” (dynamic decision problem), where k is a time index.
Stage cost

    c_k = c(x_k, u_k, x_{k+1})

is the cost incurred at time k when the system goes from x_k to x_{k+1} under input u_k.

As control designers, we are often interested in a performance index of the form

    Σ_{k=0}^{K} c_k

and therefore in finding the policy that minimizes

    V^π(x) = E[ Σ_{k=0}^{K} c_k | x_0 = x ]

7 / 29
Example

8 / 29
Dynamic programming on MDP

In order to solve this problem recursively (as we did for LQR problems), we define
the value

    V^π_k(x) = E[ Σ_{i=k}^{K} c_i | x_k = x ]

This allows us to derive a backward recursion for the value:

    V^π_k(x) = E[ Σ_{i=k}^{K} c_i | x_k = x ]

             = E[ c_k + Σ_{i=k+1}^{K} c_i | x_k = x ]

             = Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} E[ Σ_{i=k+1}^{K} c_i | x_{k+1} = x' ] ]

where
    C^u_x = E[ c_k | x_k = x, u_k = u ]

9 / 29

    V^π_k(x) = Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} E[ Σ_{i=k+1}^{K} c_i | x_{k+1} = x' ] ]

can then be turned directly into a recursive relation:

    V^π_k(x) = Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} V^π_{k+1}(x') ]

Bellman’s optimality principle


The optimal cost, defined as

    V*_k(x) = min_π Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} V^π_{k+1}(x') ]

can be computed inductively:

    V*_k(x) = min_{π_k} Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} V*_{k+1}(x') ]

V*_k: optimal value from stage k;  V^π_k: value from stage k if π_k, π_{k+1}, ... is used.

10 / 29
Optimal policy via backward induction

    V*_k(x) = min_{π_k} Σ_u π_k(x,u) [ C^u_x + Σ_{x'} P^u_{x,x'} V*_{k+1}(x') ]

Each of these backward steps corresponds to a linear program per state x.

For a finite horizon K, V_K acts as the base case from which to start:


K steps are needed
in each step a policy π_k is found
under weak conditions, the policy is deterministic (π_k is a singleton)

Backward induction via LP for a general class of nonlinear stochastic systems – where is the catch?
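A minimal backward-induction sketch, assuming the array encoding P[u, x, x'] and C[x, u] from the earlier sketch and searching only over deterministic policies (which, as noted above, is enough under weak conditions):

```python
import numpy as np

def backward_induction(P, C, K):
    """Finite-horizon backward induction over deterministic policies.

    P[u, x, x'] : transition probabilities, C[x, u] : expected stage cost.
    Returns the values V[k, x] and greedy policies mu[k, x] for k = 0, ..., K.
    """
    M, N, _ = P.shape
    V = np.zeros((K + 2, N))              # V[K+1] = 0 acts as the terminal base case
    mu = np.zeros((K + 1, N), dtype=int)
    for k in range(K, -1, -1):            # K steps, going backward in time
        # Q[x, u] = C^u_x + sum_{x'} P^u_{x,x'} V_{k+1}(x')
        Q = C + np.einsum("uxy,y->xu", P, V[k + 1])
        mu[k] = Q.argmin(axis=1)          # one minimization per state
        V[k] = Q.min(axis=1)
    return V[:K + 1], mu
```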

11 / 29
Infinite horizon / stationary problems

Most of the time, when working with MDPs, we look at infinite-horizon problems.
stationary behavior of the Markov process
time-invariant expected cost C^u_x
time-invariant policy π

Infinite-horizon cost

    J = Σ_{k=0}^{∞} γ^k c_k

where 0 ≤ γ < 1 is a discount factor.

12 / 29
Traffic light example

[Figure: queue with stochastic arrivals and light-controlled service, as before.]

Possible costs:

negative throughput:            R_3^green = ?    R_2^green = ?

total time spent in the queue:  R_3^green = ?    R_3^red = ?

length of the queue:            R_3^green = ?    R_2^red = ?

13 / 29
Discount factor

The discount factor has multiple interpretations.

From a mathematical perspective, it ensures that the cost is finite.

In some problems, it accounts for the lower impact of future costs


▶ Economics: money spent later can be invested at a guaranteed base rate of γ

In some problems, it is a proxy for a Bernoulli termination probability


▶ Processes: time scale of operation

For γ = 0, only the immediate cost is considered

For 0 < γ < 1, a balance between immediate cost and future cost is
achieved.

14 / 29
Closed-loop MDP

For a fixed policy π, the MDP reduces to a Markov chain with transition
probabilities

    P^π_{x,x'} = Σ_u P[x'|x,u] P[u|x] = Σ_u π(x,u) P^u_{x,x'}

Evolution (linear!)

    d_{k+1}^⊤ = d_k^⊤ P^π

Steady state (eigenvector problem)

    d̄_π^⊤ = d̄_π^⊤ P^π

For every policy π, there exists a stationary distribution d̄_π(x) that gives the
steady-state probability that the system is in state x.

Assuming ergodicity of the Markov chain: there is a possible path from any state to any state.

15 / 29
Example
What is the closed-loop MDP for the traffic light queue problem, with the
deterministic policy

    µ(q) = green   if q ≥ 3
           red     otherwise

[Figure: closed-loop Markov chain on q = 0, 1, 2, 3 — the queue grows by one with probability p and stays put with probability 1 − p, except from q = 3, where the green light empties it (to q = 0 with probability 1 − p, to q = 1 with probability p).]

16 / 29
Example

The closed-loop transition probability matrix is

          [ 1−p    p     0    0 ]
    P^π = [  0    1−p    p    0 ]
          [  0     0    1−p   p ]
          [ 1−p    p     0    0 ]

where, remember, P^π_{x,x'} is the probability of transitioning from state x to x'
under the policy π.

Stationary distribution
The stationary distribution on the states can be computed by normalizing the left
eigenvector of P^π associated with the eigenvalue 1.

Example: p = 0.2

    q = 0    26.7%
    q = 1    33.3%
    q = 2    33.3%
    q = 3     6.7%
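A quick numerical check of these percentages (a sketch; it recomputes the left eigenvector of P^π for p = 0.2):

```python
import numpy as np

p = 0.2
P_pi = np.array([[1 - p,     p,     0,  0],
                 [    0, 1 - p,     p,  0],
                 [    0,     0, 1 - p,  p],
                 [1 - p,     p,     0,  0]])

# Left eigenvector of P_pi for eigenvalue 1, i.e. solve d^T P_pi = d^T
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d = d / d.sum()                     # normalize to a probability distribution
print(np.round(d, 3))               # approx. [0.267, 0.333, 0.333, 0.067]
```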

17 / 29
Infinite-horizon value of a policy
The value of a policy π is defined as the conditional expected value of
the infinite-horizon future cost when starting in state x:

    V^π(x) = E[ Σ_{k=0}^{∞} γ^k c_k | x_0 = x ]

Discounted Bellman equation


We can express the value of a policy recursively:

    V^π(x) = Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ]

Prove it: (take inspiration from slide 9)

18 / 29

    V^π(x) = Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ]

The Bellman equation can be interpreted as a consistency equation.


It allows us to compute the value of a given policy (its performance) based on the
system dynamics (encoded in P^u_{x,x'})
cost function (encoded in C^u_x)

Bellman equation in learning


In the context of learning, V^π(x) is a key piece of information: it is a predictor of
the quality of a candidate policy.
The Bellman equation allows us to split this predictor into two parts:
the immediate observable cost C^u_x
the estimate of the future cost V^π(x')

19 / 29
Computational complexity of the Bellman equation

    V^π(x) = Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ]

If the MDP is finite with N states, then the Bellman equation is a system of N
linear equations.

Policy evaluation

    V^π = C^π + γ P^π V^π

where
    C^π(x) = Σ_u π(x,u) C^u_x              (expected cost)
    P^π_{x,x'} = Σ_u π(x,u) P^u_{x,x'}     (expected transition probabilities)

    V^π = (I − γ P^π)^{-1} C^π

Alternative: iterative update of the Bellman equation, which is contractive.
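A possible implementation of policy evaluation, assuming the array encoding P[u, x, x'], C[x, u], pi[x, u] from the earlier sketches and an illustrative discount factor:

```python
import numpy as np

def policy_evaluation(P, C, pi, gamma=0.9):
    """Evaluate a stochastic policy by solving (I - gamma * P^pi) V = C^pi."""
    N = C.shape[0]
    # Closed-loop quantities: C^pi(x) = sum_u pi(x,u) C^u_x,
    #                         P^pi_{x,x'} = sum_u pi(x,u) P^u_{x,x'}
    C_pi = (pi * C).sum(axis=1)
    P_pi = np.einsum("xu,uxy->xy", pi, P)
    # Direct solve; the alternative mentioned above is to iterate
    # V <- C_pi + gamma * P_pi @ V, which is contractive.
    return np.linalg.solve(np.eye(N) - gamma * P_pi, C_pi)
```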

20 / 29
Example

    V^π = (I − γ P^π)^{-1} C^π

[Figure: closed-loop Markov chain of the traffic light queue, as on the previous slides.]

   
          [ 1−p    p     0    0 ]              [ 0 ]
    P^π = [  0    1−p    p    0 ]       C^π =  [ 1 ]    (cars in the queue)
          [  0     0    1−p   p ]              [ 2 ]
          [ 1−p    p     0    0 ]              [ 3 ]

I − γ P^π is always invertible, because of the Perron–Frobenius theorem and
γ ∈ [0, 1)
The solution is unique
It can be computationally expensive for large systems → O(N^3 + N^2 M)
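Plugging the matrices of this example into the formula above (a sketch; p = 0.2 as in the earlier example, and γ = 0.9 is an assumed discount factor):

```python
import numpy as np

p, gamma = 0.2, 0.9
P_pi = np.array([[1 - p,     p,     0,  0],
                 [    0, 1 - p,     p,  0],
                 [    0,     0, 1 - p,  p],
                 [1 - p,     p,     0,  0]])
C_pi = np.array([0.0, 1.0, 2.0, 3.0])            # cost = cars in the queue

# Unique solution: gamma < 1 keeps I - gamma * P_pi invertible
V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, C_pi)
print(np.round(V_pi, 2))
```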

21 / 29
Bellman principle for infinite horizon MDPs

The optimal value satisfies

    V*(x) = min_π V^π(x) = min_π Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ]

The Bellman optimality principle allows us to make this minimization problem recursive:

    V*(x) = min_π Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V*(x') ]

If the minimum is achieved at a deterministic policy, then this is equivalent to

    V*(x) = min_u [ C^u_x + γ Σ_{x'} P^u_{x,x'} V*(x') ]

→ system of N non-linear equations (min operator on a finite set).

22 / 29
Iterative computation of the optimal policy

Policy evaluation
Given a policy π, we can evaluate the resulting value V π .

At the same time, if we have a value V^π, we can do a greedy improvement of the
policy:

    π'(x, ·) = argmin_ν Σ_u ν(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ]

Interpretation
Can I improve the cost by deviating from the current policy π for one step, and
then falling back to the policy π?

23 / 29
Policy improvement theorem
It can be shown that the greedy policy improvement always improves the value, i.e.,

    V^{π'}(x) ≤ V^π(x)  ∀x,   and  ∃x' such that V^{π'}(x') < V^π(x')

unless V^π is already the optimal value.


Proof
We prove the ≤ relation. By definition of argmin, we have that, for every x,

    V^π(x) ≥ Σ_u π'(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^π(x') ].

The r.h.s. is equal to the expectation E[ c_0 + γ V^π(x_1) ] (taken with respect to π').

    V^π(x) ≥ E[ c_0 + γ V^π(x_1) ]
           ≥ E[ c_0 + γ (c_1 + γ V^π(x_2)) ] = E[ c_0 + γ c_1 + γ² V^π(x_2) ]
           ≥ ... ≥ E[ c_0 + γ c_1 + γ² c_2 + γ³ c_3 + ... ] = V^{π'}(x).

The strict inequality follows by assuming "<" for x' (and "≤" for the others).
Corollary: after a finite number of improvements we reach the optimum (why?).

24 / 29
Policy iteration algorithm

Initialize at a policy guess π ← π0


For each iteration
1  Compute the value V associated with the policy π
       V ← (I − γ P^π)^{-1} C^π
2  Greedy update of the policy π
       π(x, ·) ← argmin_ν Σ_u ν(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V(x') ]

Computational complexity
Finite number of iterations of
1  Policy evaluation: expensive → O(N^3 + N^2 M)
2  Policy improvement: easy → min over M alternatives, N times
(it is enough to search over deterministic policies).
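A compact sketch of the full policy iteration loop, again assuming the P[u, x, x'], C[x, u] array encoding, an illustrative discount factor, and a search restricted to deterministic policies:

```python
import numpy as np

def policy_iteration(P, C, gamma=0.9, max_iter=100):
    """Policy iteration for a finite MDP (deterministic policies)."""
    M, N, _ = P.shape
    mu = np.zeros(N, dtype=int)                  # arbitrary initial policy guess
    for _ in range(max_iter):
        # 1. Policy evaluation: V = (I - gamma * P^mu)^(-1) C^mu
        P_mu = P[mu, np.arange(N), :]            # closed-loop transition matrix
        C_mu = C[np.arange(N), mu]
        V = np.linalg.solve(np.eye(N) - gamma * P_mu, C_mu)
        # 2. Greedy policy improvement
        Q = C + gamma * np.einsum("uxy,y->xu", P, V)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):           # policy stable -> optimal
            break
        mu = mu_new
    return V, mu
```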

25 / 29
Value iteration
Alternatively, we can look for a fixed point of the Bellman optimality equation:

    V*(x) = min_π Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V*(x') ]

Obvious candidate to iteratively compute a fixed point?

    V^{(t+1)}(x) = min_π Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V^{(t)}(x') ]

t: iteration index

Convergence result
The value iteration is contractive: if V^{(t)} and W^{(t)} are two value functions, then

    ∥V^{(t+1)} − W^{(t+1)}∥_∞ ≤ γ ∥V^{(t)} − W^{(t)}∥_∞

Therefore, the value iteration converges to V* at rate γ:

    ∥V^{(t)} − V*∥_∞ ≤ γ^t ∥V^{(0)} − V*∥_∞

∥v∥_∞ := max_i |v_i|

26 / 29
Value iteration algorithm

Initialize at a value guess V ← V0


For each iteration
1  Apply the Bellman iteration

       V(x) ← min_π Σ_u π(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V(x') ]

At convergence, extract the optimal policy π (greedy policy):

       π(x, ·) ← argmin_ν Σ_u ν(x,u) [ C^u_x + γ Σ_{x'} P^u_{x,x'} V(x') ]

Computational complexity
Asymptotic convergence at rate γ
Bellman iteration: usually cheap → O(N^2 M)
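A matching value iteration sketch under the same assumptions (array encoding P[u, x, x'], C[x, u], illustrative γ, sup-norm stopping criterion):

```python
import numpy as np

def value_iteration(P, C, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP; returns the value and the greedy policy."""
    M, N, _ = P.shape
    V = np.zeros(N)                                    # arbitrary initial guess V_0
    while True:
        Q = C + gamma * np.einsum("uxy,y->xu", P, V)   # Q[x, u], cost O(N^2 M)
        V_new = Q.min(axis=1)                          # Bellman iteration
        if np.max(np.abs(V_new - V)) < tol:            # sup-norm stopping criterion
            V = V_new
            break
        V = V_new
    Q = C + gamma * np.einsum("uxy,y->xu", P, V)       # greedy policy at convergence
    return V, Q.argmin(axis=1)
```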

27 / 29
Toy example: Policy iteration

[Figure: toy example on a small graph — columns show the initial value V and policy π, the result of one policy evaluation, and the result of one policy improvement.]

28 / 29
Toy example: Value iteration

[Figure: the same toy example — columns show the value V after successive Bellman iterations.]

29 / 29
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License

https://bsaver.io/COCO
