08 - Markov Decision Processes
Saverio Bolognani
[Diagram: the control problem as a dynamical system, with input, state, disturbance input, performance output, measurements, and dynamics indicated]
2 / 29
A state-space representation of such a system is problematic.
discretized state space (number of vehicles per queue)
nonlinear dynamics
[Figure: traffic-light queue with arrival and service processes]
3 / 29
Markov chain
[Figure: Markov chain of the queue length q = 0, 1, 2, 3, 4, with arrival probability p]
4 / 29
Markov decision process
set X of N states
set U of M control actions
transition probabilities
P : X × U × X → [0, 1]
P^u_{x,x′} = P[x′ | x, u]
Markov property
For every x′, P^u_{x,x′} depends only on x and u.
It does not depend on how the system got to x (past states and past inputs).
immediate cost
c(x, u, x ′ )
is the cost after transitioning to state x ′ given that the MDP is in state x and
action u is taken.
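One possible way to store such a finite MDP in code, used for illustration in the snippets below (the layout is an assumption, not prescribed here), is with numpy arrays:

```python
import numpy as np

# Illustrative layout for a finite MDP with N states and M actions:
#   P[x, u, y] = P[x' = y | x, u]   (transition probabilities)
#   c[x, u, y] = c(x, u, x' = y)    (immediate cost)
N, M = 5, 2
rng = np.random.default_rng(0)

P = rng.random((N, M, N))
P /= P.sum(axis=2, keepdims=True)        # each row P[x, u, :] must sum to 1
c = rng.random((N, M, N))

# Expected immediate cost C_x^u = E[c_k | x_k = x, u_k = u]
C = np.einsum('xuy,xuy->xu', P, c)
```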
5 / 29
Policies
Deterministic policies
µ : X → U
Stochastic policies
π : X × U → [0, 1]
π(x, u) = P[u|x] is the conditional probability of selecting the input (action) u given
that the MDP is in state x.
Although in many control applications one looks for deterministic policies, the
general formulation with stochastic policies is not more complicated (and will be
useful when it comes to learning).
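With the same illustrative array layout as before, a stochastic policy can be stored as an N × M row-stochastic matrix, and a deterministic policy is the special case with one-hot rows:

```python
import numpy as np

N, M = 5, 2

# Stochastic policy: pi[x, u] = P[u | x], each row sums to 1
pi = np.full((N, M), 1.0 / M)            # e.g. uniformly random actions

# Deterministic policy mu : X -> U encoded as one-hot rows of the same matrix
mu = np.array([0, 0, 1, 1, 1])           # action index chosen in each state
pi_det = np.eye(M)[mu]                   # pi_det[x, u] = 1 if u == mu[x], else 0
```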
6 / 29
Optimal sequential decision problems
c_k = c(x_k, u_k, x_{k+1})
is the cost incurred at time k when the system goes from xk to xk+1 under input uk .
7 / 29
Example
8 / 29
Dynamic programming on MDP
In order to solve this problem recursively (as we did for LQR problems), we define
the value
V_k^π(x) = E[ Σ_{i=k}^{K} c_i | x_k = x ]
where
C_x^u = E[c_k | x_k = x, u_k = u]
9 / 29
" " K
##
X X X ′
Vkπ (x) = πk (x, u) Cxu + u
Px,x ′E ci |xk+1 = x
u x′ i=k+1
" Vk∗ : optimal value from stage k, Vkπ : value from stage k if πk , πk+1 , . . . is used.
10 / 29
Optimal policy via backward induction
" #
X X
Vk∗ = min πk (x, u) Cxu + u
Px,x ∗
′ Vk+1
πk
u x′
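A minimal sketch of this backward induction, assuming the arrays P[x, u, x'] and expected cost C[x, u] from the earlier snippet; since the minimum over π_k is attained by a deterministic choice, it is computed as a minimum over actions:

```python
import numpy as np

def backward_induction(P, C, K):
    """Finite-horizon dynamic programming: P[x, u, y] transition probabilities,
    C[x, u] expected immediate cost, horizon k = 0, ..., K (zero terminal cost assumed)."""
    N, M, _ = P.shape
    V = np.zeros(N)                          # V_{K+1} = 0
    policy = np.zeros((K + 1, N), dtype=int)
    for k in range(K, -1, -1):
        Q = C + P @ V                        # Q[x, u] = C_x^u + sum_y P[x, u, y] V_{k+1}(y)
        policy[k] = np.argmin(Q, axis=1)     # greedy (deterministic) action at stage k
        V = Q.min(axis=1)                    # V_k^*
    return V, policy
```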
11 / 29
Infinite horizon / stationary problems
Most of the time, when working with MDPs, we look at infinite-horizon problems.
stationary behavior of the Markov process
time-invariant expected cost C_x^u
time-invariant policy π
Infinite-horizon cost
J = Σ_{k=0}^{∞} γ^k c_k
12 / 29
Traffic light example
[Figure: traffic-light queue with arrival and service processes]
Possible costs:
negative throughput
R_3^green = ?    R_2^green = ?
total time spent in the queue
R_3^green = ?    R_3^red = ?
13 / 29
Discount factor
For 0 < γ < 1, a balance between immediate cost and future cost is
achieved.
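A quick bound (assuming a bounded stage cost |c_k| ≤ c_max) shows why 0 < γ < 1 keeps the infinite-horizon cost finite:

```latex
|J| \;=\; \Bigl|\sum_{k=0}^{\infty} \gamma^k c_k\Bigr|
\;\le\; \sum_{k=0}^{\infty} \gamma^k c_{\max}
\;=\; \frac{c_{\max}}{1-\gamma},
\qquad 0 < \gamma < 1 .
```

γ close to 0 makes the decision maker myopic, while γ close to 1 weights future costs almost as much as immediate ones.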
14 / 29
Closed-loop MDP
For a fixed policy π, the MDP reduces to a Markov chain with transition
probabilities
P^π_{x,x′} = Σ_u P[x′ | x, u] P[u | x] = Σ_u π(x, u) P^u_{x,x′}
Evolution (linear!)
d_{k+1}^⊤ = d_k^⊤ P^π
d̄_π^⊤ = d̄_π^⊤ P^π
For every policy π, there exists a stationary distribution d̄_π(x) that gives the
steady-state probability that the system is in state x.
Assuming ergodicity of the Markov chain: possible path from any state to any state.
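With the assumed P[x, u, x'] and pi[x, u] arrays from the earlier snippets, this mixing step is a single contraction:

```python
import numpy as np

def closed_loop_transitions(P, pi):
    """P_pi[x, y] = sum_u pi[x, u] * P[x, u, y] for a fixed stochastic policy pi."""
    return np.einsum('xu,xuy->xy', pi, P)
```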
15 / 29
Example
What is the closed-loop MDP for the traffic light queue problem, with the
deterministic policy  µ(q) = green if q ≥ 3, red otherwise?
[Figure: closed-loop Markov chain on states q = 0, 1, 2, 3; for q < 3 the queue grows by one with probability p and stays put with probability 1−p; from q = 3 it moves to q = 0 with probability 1−p and to q = 1 with probability p]
16 / 29
Example
Stationary distribution
The stationary distribution on the states can be computed by normalizing the left
eigenvector of P^π associated with the eigenvalue 1.
Example: p = 0.2
q = 0:  26.7%
q = 1:  33.3%
q = 2:  33.3%
q = 3:   6.7%
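A sketch of this eigenvector computation for the closed-loop chain above (the matrix P^π is the one spelled out a few slides later), using numpy:

```python
import numpy as np

p = 0.2
# Closed-loop transition matrix P^pi for mu(q) = green if q >= 3, red otherwise
P_pi = np.array([[1 - p, p,     0,     0],
                 [0,     1 - p, p,     0],
                 [0,     0,     1 - p, p],
                 [1 - p, p,     0,     0]])

# Left eigenvector of P^pi for eigenvalue 1 = right eigenvector of P_pi.T
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmin(np.abs(w - 1))])
d /= d.sum()                      # normalize to a probability distribution
print(d)                          # approx. [0.267, 0.333, 0.333, 0.067]
```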
17 / 29
Infinite-horizon value of a policy
The value of a policy π is defined as the conditional expected value of
infinite-horizon future cost when starting in state x
"∞ #
π
X k
V (x) = E γ ck |x0 = x
k=0
18 / 29
" #
X X ′
π
V (x) = π(x, u) Cxu +γ u
Pxx π
′ V (x )
u x′
19 / 29
Computational complexity of the Bellman equation
" #
X X ′
π
V (x) = π(x, u) Cxu +γ u
Pxx π
′ V (x )
u x′
If the MDP is finite with N states, then the Bellman equation is a system of N
linear equations.
Policy evaluation
V^π = C^π + γ P^π V^π
where
C^π(x) = Σ_u π(x, u) C_x^u   (expected cost)
P^π_{x,x′} = Σ_u π(x, u) P^u_{x,x′}   (expected transition probabilities)
V^π = (I − γ P^π)^{−1} C^π
20 / 29
Example
V^π = (I − γ P^π)^{−1} C^π
[Figure: closed-loop Markov chain on states q = 0, 1, 2, 3, as on the previous slides]
P^π = [[1−p,  p,   0,   0 ],
       [ 0,  1−p,  p,   0 ],
       [ 0,   0,  1−p,  p ],
       [1−p,  p,   0,   0 ]]

C^π = [0, 1, 2, 3]^⊤   (cars in queue)
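A short numpy sketch of this policy evaluation for the queue example; the discount factor γ = 0.9 is an arbitrary choice for illustration:

```python
import numpy as np

p, gamma = 0.2, 0.9               # gamma chosen arbitrarily for illustration
P_pi = np.array([[1 - p, p,     0,     0],
                 [0,     1 - p, p,     0],
                 [0,     0,     1 - p, p],
                 [1 - p, p,     0,     0]])
C_pi = np.array([0.0, 1.0, 2.0, 3.0])   # cars in queue

# Policy evaluation: V^pi = (I - gamma * P^pi)^{-1} C^pi
V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, C_pi)
print(V_pi)
```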
21 / 29
Bellman principle for infinite horizon MDPs
22 / 29
Iterative computation of the optimal policy
Policy evaluation
Given a policy π, we can evaluate the resulting value V^π.
Interpretation
Can I improve the cost by deviating from the current policy π for one step, and
then falling back to the policy π?
23 / 29
Policy improvement theorem
It can be shown that the greedy policy improvement always improves the value, i.e.,
V^{π′}(x) ≤ V^π(x) ∀x, and ∃x′ such that V^{π′}(x′) < V^π(x′)
The r.h.s. is equal to the expectation E[c_0 + γ V^π(x_1)] (with respect to π′).
The strict inequality follows by assuming "<" for x′ (and "≤" for the others).
Corollary: after a finite number of improvements we reach the optimum (why?).
24 / 29
Policy iteration algorithm
Computational complexity
Finite number of iterations of
1 Policy evaluation: expensive → O(N^3 + N^2 M)
2 Policy improvement: easy → min over M alternatives, N times
(it is enough to search over deterministic policies).
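A compact sketch of policy iteration under the array layout assumed in the earlier snippets (P[x, u, x'] and expected cost C[x, u]), restricted to deterministic policies as noted above:

```python
import numpy as np

def policy_iteration(P, C, gamma, max_iter=1000):
    """Policy iteration for a finite MDP with P[x, u, y] and expected cost C[x, u]."""
    N, M, _ = P.shape
    policy = np.zeros(N, dtype=int)           # start from an arbitrary deterministic policy
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P^pi) V = C^pi
        P_pi = P[np.arange(N), policy]        # N x N closed-loop transitions
        C_pi = C[np.arange(N), policy]        # N expected costs under the policy
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, C_pi)
        # Policy improvement: greedy with respect to the current value
        Q = C + gamma * (P @ V)               # N x M
        new_policy = np.argmin(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
    return V, policy
```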
25 / 29
Value iteration
Alternatively, we can look for a fixed point of the Bellman optimality equation
" #
∗
X u
X u ∗ ′
V (x) = min π(x, u) Cx + γ Pxx ′ V (x )
π
u x′
Convergence result
The value iteration is contractive: if V^{(t)} and W^{(t)} are two value functions, then
∥V^{(t+1)} − W^{(t+1)}∥_∞ ≤ γ ∥V^{(t)} − W^{(t)}∥_∞
Therefore, the value iteration converges at rate γ to V^*:
∥V^{(t)} − V^*∥_∞ ≤ γ^t ∥V^{(0)} − V^*∥_∞
26 / 29
Value iteration algorithm
Computational complexity
Asymptotic convergence at rate γ
Bellman iteration: usually cheap → O(N^2 M)
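A matching sketch of value iteration under the same assumed arrays:

```python
import numpy as np

def value_iteration(P, C, gamma, tol=1e-8, max_iter=10000):
    """Value iteration: P[x, u, y] transition probabilities, C[x, u] expected cost."""
    N, M, _ = P.shape
    V = np.zeros(N)
    for _ in range(max_iter):
        V_new = (C + gamma * (P @ V)).min(axis=1)    # Bellman backup, O(N^2 M)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = np.argmin(C + gamma * (P @ V), axis=1)  # greedy policy w.r.t. the final value
    return V, policy
```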
27 / 29
Toy example: Policy iteration
[Figure: policy iteration iterates on a toy example]
28 / 29
Toy example: Value iteration
[Figure: value iteration iterates on a toy example]
29 / 29
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO