
Reinforcement Learning and Optimization-based Control

Assoc. Prof. Dr. Emre Koyuncu

Department of Aeronautics Engineering


Istanbul Technical University

Lecture 2: MDP and Bellman Optimality



Table of Contents

1 Markov Decision Process



Sequential Decision

Optimal decision
• At the current state, apply the decision that minimizes
Current stage cost + J*(Next state),
where J*(Next state) is the optimal future cost starting from the next state.
• This defines the optimal policy: an optimal control to apply at each state.



Principle of Optimality

Principle of optimality
Let {u_0^*, ..., u_{N-1}^*} be an optimal control sequence with corresponding state sequence {x_0^*, ..., x_N^*}. Consider the tail subproblem that starts at x_k^* at time k and minimizes over {u_k, ..., u_{N-1}} the cost-to-go from k to N:

g_k(x_k^*, u_k) + \sum_{m=k+1}^{N-1} g_m(x_m, u_m) + g_N(x_N)

Then the tail control sequence {u_k^*, ..., u_{N-1}^*} is optimal for the tail subproblem.
Dynamic Programming
Solve all the tail subproblems of a given time length using the solution of
all the tail subproblems of shorter time length.
By the principle of optimality:
• Consider every possible u_k and solve the tail subproblem that starts at the next state x_{k+1} = f_k(x_k, u_k)
• Optimize over all u_k

Start with

J_N^*(x_N) = g_N(x_N), for all x_N,

and for k = N-1, ..., 0, let

J_k^*(x_k) = \min_{u_k \in U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k)) ], for all x_k.

The optimal cost J^*(x_0) is then obtained at the last step: J_0^*(x_0) = J^*(x_0).
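As a concrete illustration, here is a minimal Python sketch of the backward DP recursion for a hypothetical discretized scalar problem; the dynamics f, costs g and gN, state grid, and control set below are illustrative assumptions, not part of the lecture.

import numpy as np

# Hypothetical problem data (for illustration only): scalar system x_{k+1} = f(x, u)
N = 5                                   # horizon
X = np.linspace(-2.0, 2.0, 41)          # discretized state grid (assumption)
U = np.array([-1.0, 0.0, 1.0])          # finite control set (assumption)

def f(x, u):                            # dynamics x_{k+1} = f_k(x_k, u_k)
    return 0.9 * x + u

def g(x, u):                            # stage cost g_k(x_k, u_k)
    return x**2 + 0.1 * u**2

def gN(x):                              # terminal cost g_N(x_N)
    return 10.0 * x**2

# Backward DP: start from J_N^*(x) = g_N(x) and recurse J_k^* from J_{k+1}^*
J = gN(X)                               # J_N^* on the grid
policy = []
for k in range(N - 1, -1, -1):
    Q = np.empty((X.size, U.size))
    for j, u in enumerate(U):
        x_next = f(X, u)
        # interpolate J_{k+1}^* at the next states (values outside the grid are clamped)
        Q[:, j] = g(X, u) + np.interp(x_next, X, J)
    policy.append(U[np.argmin(Q, axis=1)])   # optimal control on the grid at time k
    J = Q.min(axis=1)                        # J_k^*(x) for all grid states

print("J_0^*(x=1.0) ≈", np.interp(1.0, X, J))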


Optimal Sequential Decision

Remark
• DP solves optimal decision problems by working backward through time; it provides an offline solution that cannot be implemented online.
• RL and adaptive control are concerned with forward-in-time solutions that run in real time.

Let us formulate the problem as a Markov Decision Process (MDP).
Consider the tuple (X, U, P, R), where X is a set of states and U is a set of actions (controls). The transition probability P : X × U × X → [0, 1] gives, for each state x ∈ X and action u ∈ U, the conditional probability P^u_{xx'} = Pr{x' | x, u} of transitioning to state x' ∈ X when the MDP is in state x ∈ X and takes action u ∈ U. The cost function R : X × U × X → ℝ gives the expected immediate cost R^u_{xx'} paid after the given transition.

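For concreteness, one possible in-memory representation of the tuple (X, U, P, R) is sketched below in Python/NumPy; the two-state, two-action numbers are made up purely for illustration.

import numpy as np

nX, nU = 2, 2                        # |X| states, |U| actions (toy sizes)

# P[u, x, x'] = Pr{x' | x, u}; each row over x' must sum to one
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.1, 0.9]]])

# R[u, x, x'] = expected immediate cost paid on the transition (x, u) -> x'
R = np.array([[[1.0, 2.0],
               [0.0, 3.0]],
              [[2.0, 0.5],
               [1.0, 1.0]]])

assert np.allclose(P.sum(axis=2), 1.0)   # valid transition probabilities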


Optimal Sequential Decision

The Markov property means that the transition probabilities P^u_{xx'} depend only on the current state x, not on the history.


Remark
The basic problem of the MDP is to find a mapping π : X × U → [0, 1] that gives the conditional probability π(x, u) = Pr{u | x} of taking action u given that the MDP is in state x.
• Such a mapping is termed a closed-loop control, strategy, or policy.
• A policy π(x, u) = Pr{u | x} is called stochastic (or mixed) if there is a nonzero probability of selecting more than one control in state x.
• A policy is called deterministic if it admits only one control in each state, with probability one.



Optimal Sequential Decision

Remark
• Dynamical systems evolve through time or, more generally, through a sequence of events
• therefore, we consider sequential decision problems
• optimality is often desirable in terms of conserving resources such as
time, fuel, energy, etc.

Define a stage cost at time k by r_k = r_k(x_k, u_k, x_{k+1}). Then

R^u_{xx'} = E{ r_k | x_k = x, u_k = u, x_{k+1} = x' },

where E{·} is the expected value operator. The sum of future costs over the time interval [k, k + T] is

J_{k,T} = \sum_{i=0}^{T} γ^i r_{k+i} = \sum_{i=k}^{k+T} γ^{i-k} r_i,

where 0 ≤ γ < 1 is a discount factor that reduces the weight of costs incurred further in the future.
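A quick numeric sanity check of the two equivalent index conventions in the sum above (the cost sequence, horizon, and γ are arbitrary assumptions):

import numpy as np

gamma, k, T = 0.9, 3, 5
r = np.arange(20, dtype=float)                                 # hypothetical stage costs r_0, r_1, ...

J1 = sum(gamma**i * r[k + i] for i in range(T + 1))            # sum_{i=0}^{T} gamma^i r_{k+i}
J2 = sum(gamma**(i - k) * r[i] for i in range(k, k + T + 1))   # sum_{i=k}^{k+T} gamma^{i-k} r_i
assert np.isclose(J1, J2)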
Optimal Sequential Decision
Consider that the agent selects a fixed stationary policy π(x, u) = Pr{u | x}, i.e., the conditional probabilities π_k(x_k, u_k) are independent of k. Then the closed-loop MDP reduces to a Markov chain with state space X. The fixed transition probabilities of this Markov chain are given by

P_{x,x'} ≡ P^π_{x,x'} = \sum_u Pr{x' | x, u} Pr{u | x} = \sum_u π(x, u) P^u_{xx'},

where the Chapman-Kolmogorov identity is used. If the Markov chain is ergodic, it can be shown that every MDP has a stationary deterministic optimal policy (Bertsekas and Tsitsiklis, 1996; Wheeler and Narendra, 1986).
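A short sketch of this reduction for a fixed stochastic policy, reusing the toy transition array introduced earlier (all numbers are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x'] (toy values)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.6, 0.4],                   # pi[x, u] = Pr{u | x}
               [0.2, 0.8]])

# Closed-loop chain: P_pi[x, x'] = sum_u pi(x, u) * P[u, x, x']
P_pi = np.einsum('xu,uxy->xy', pi, P)
assert np.allclose(P_pi.sum(axis=1), 1.0)    # still a valid Markov chain
print(P_pi)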

Remark
Ergodicity expresses the idea that a point of a dynamical system or a stochastic process will eventually visit all parts of the space the system moves in, in a uniform and random sense. This implies that the average behavior of the system can be deduced from the trajectory of a "typical" point. Equivalently, a sufficiently large collection of random samples from the process can represent the average statistical properties of the entire process, meaning that the system cannot be reduced or factored into smaller components.



Value Function
The value of a policy is defined as the conditional expected value of the future cost when starting in state x at time k and following the policy π(x, u):

V_k^π(x) = E_π{ J_{k,T} | x_k = x } = E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

A main objective of the MDP is to determine a policy π(x, u) that minimizes the expected future cost:

π^*(x, u) = \arg\min_π V_k^π(x) = \arg\min_π E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

This policy is termed the optimal policy, and the corresponding optimal value is given as:

V_k^*(x) = \min_π V_k^π(x) = \min_π E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

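Because V_k^π is a conditional expectation, it can be estimated by simply rolling out the policy and averaging discounted costs. A minimal Monte Carlo sketch on the toy MDP used above (all data are assumptions):

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma, T, n_rollouts = 0.9, 50, 2000

def rollout(x):
    # one sampled realization of sum_{i=0}^{T} gamma^i r_{k+i} starting from state x
    J = 0.0
    for i in range(T + 1):
        u = rng.choice(2, p=pi[x])
        x_next = rng.choice(2, p=P[u, x])
        J += gamma**i * R[u, x, x_next]
        x = x_next
    return J

V_hat = [np.mean([rollout(x) for _ in range(n_rollouts)]) for x in range(2)]
print("Monte Carlo estimate of V_k^pi:", V_hat)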


Backward Recursion
Using the Chapman-Kolmogorov identity and the Markov property, write the value of the policy as:

V_k^π(x) = E_π{ J_{k,T} | x_k = x } = E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

V_k^π(x) = E_π{ r_k + γ \sum_{i=k+1}^{k+T} γ^{i-(k+1)} r_i | x_k = x }

V_k^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ E_π{ \sum_{i=k+1}^{k+T} γ^{i-(k+1)} r_i | x_{k+1} = x' } ]

Finally, the value function satisfies the backward recursion:

V_k^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^π(x') ]

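This recursion translates almost directly into code. A sketch of finite-horizon policy evaluation on the same toy MDP (data are assumptions); for matching horizon and discount it should agree with the Monte Carlo estimate above:

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma, T = 0.9, 50

V_next = np.zeros(2)                          # value beyond the horizon is zero
for _ in range(T + 1):
    # V_k(x) = sum_u pi(x,u) sum_x' P^u_{xx'} [ R^u_{xx'} + gamma * V_{k+1}(x') ]
    V = np.einsum('xu,uxy,uxy->x', pi, P, R) \
        + gamma * np.einsum('xu,uxy,y->x', pi, P, V_next)
    V_next = V

print("V_k^pi over horizon T:", V)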


Dynamic Programming
The optimal cost can be written as:

V_k^*(x) = \min_π V_k^π(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^π(x') ]

Bellman Optimality
An optimal policy has the property that, no matter what the previous control actions have been, the remaining control actions constitute an optimal policy with regard to the state resulting from those previous controls.

Then we can write:

V_k^*(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

Suppose an arbitrary control u is applied at time k and the optimal policy is applied from k + 1 on. Then the optimal control at time k is given by:

π^*(x_k = x, u) = \arg\min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]
Dynamic Programming
Under the assumption that the Markov chain corresponding to each policy is ergodic, every MDP has a stationary deterministic optimal policy. Then we can minimize the conditional expectation directly over the actions u available in state x. Therefore:

V_k^*(x) = \min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

u_k^* = \arg\min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

Dynamic Programming
The backward recursion forms the basis for dynamic programming (DP) (Bellman, 1957), which gives offline methods that work backward in time to determine optimal policies.
• DP requires knowledge of the complete system dynamics in the form of transition probabilities and expected costs.

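A compact sketch of this backward recursion on the toy MDP (data are assumptions), returning both the optimal values and the minimizing action in each state:

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
gamma, T = 0.9, 50

V_next = np.zeros(2)                          # value beyond the horizon
for _ in range(T + 1):
    # Q[x, u] = sum_x' P^u_{xx'} [ R^u_{xx'} + gamma * V_{k+1}^*(x') ]
    Q = np.einsum('uxy,uxy->xu', P, R) + gamma * np.einsum('uxy,y->xu', P, V_next)
    V_next = Q.min(axis=1)                    # V_k^*(x) = min_u Q[x, u]

u_star = Q.argmin(axis=1)                     # minimizing (optimal) action in each state
print("V_0^*:", V_next, "u^*:", u_star)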


Bellman Equation
• DP is a backward-in-time method for finding the optimal value and policy.
• RL uses causal experience, obtained by executing sequential decisions, to improve control actions based on the observed results of using the current policy.
To derive forward-in-time methods for finding optimal values and policies, now set the time horizon T to infinity and define the infinite horizon cost:

J_k = \sum_{i=0}^{∞} γ^i r_{k+i} = \sum_{i=k}^{∞} γ^{i-k} r_i

The associated infinite horizon value function is:

V^π(x) = E_π{ J_k | x_k = x } = E_π{ \sum_{i=k}^{∞} γ^{i-k} r_i | x_k = x }

It can be seen that the value function for policy π(x, u) satisfies the Bellman equation:

V^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^π(x') ]
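For a finite MDP this Bellman equation is linear in the values V^π(x), so it can be solved directly as a linear system; a sketch with the toy data used earlier (all numbers are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma = 0.9

P_pi = np.einsum('xu,uxy->xy', pi, P)                 # closed-loop transition matrix
r_pi = np.einsum('xu,uxy,uxy->x', pi, P, R)           # expected one-step cost under pi

# Bellman equation V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("V^pi:", V_pi)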
Bellman Optimality Equation
If the MDP is finite and has N states, the Bellman equation is a system of N linear equations for the value V^π(x) of being in each state x given the current policy π(x, u). The optimal infinite horizon value satisfies:

V^*(x) = \min_π V^π(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^π(x') ]

which yields the Bellman optimality equation:

V^*(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]

Under the ergodicity assumption, the Bellman optimality equation becomes:

V^*(x) = \min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]

This is known as the Hamilton-Jacobi-Bellman equation in control systems. The optimal control is given by:

u^* = \arg\min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]
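The Bellman optimality equation is nonlinear in V^* because of the minimization, but it can be solved by fixed-point (value) iteration; a sketch on the same toy MDP (data are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(x) <- min_u sum_x' P^u_{xx'} [ R^u_{xx'} + gamma V(x') ]
    Q = np.einsum('uxy,uxy->xu', P, R) + gamma * np.einsum('uxy,y->xu', P, V)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:    # stop at the fixed point
        break
    V = V_new

print("V*:", V_new, "u*:", Q.argmin(axis=1))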
Ex. Discrete-time LQR
Consider the discrete-time LQR problem, where the MDP is deterministic and satisfies the state transition equation:

x_{k+1} = A x_k + B u_k

The associated performance index has deterministic stage costs:

J_k = \frac{1}{2} \sum_{i=k}^{∞} r_i = \frac{1}{2} \sum_{i=k}^{∞} ( x_i^T Q x_i + u_i^T R u_i )

where the cost weighting matrices satisfy Q = Q^T ≥ 0 and R = R^T > 0. The state space X and action space U are infinite and continuous. The value function is then:

V(x_k) = \frac{1}{2} \sum_{i=k}^{∞} ( x_i^T Q x_i + u_i^T R u_i )

V(x_k) = \frac{1}{2} ( x_k^T Q x_k + u_k^T R u_k ) + \frac{1}{2} \sum_{i=k+1}^{∞} ( x_i^T Q x_i + u_i^T R u_i )
Ex. Discrete-time LQR

V(x_k) = \frac{1}{2} ( x_k^T Q x_k + u_k^T R u_k ) + V(x_{k+1})

This is exactly the Bellman equation for LQR. Assume the value is quadratic in the state,

V_k(x_k) = \frac{1}{2} x_k^T P x_k,

for a kernel matrix P = P^T > 0. Then the Bellman equation takes the form:

2V(x_k) = x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}

Remark
Note that the state dynamics (A, B) do not appear explicitly in this Bellman equation. RL algorithms for learning optimal solutions can be devised via temporal difference methods; that is, RL allows the Lyapunov equation to be solved online without knowing (A, B).
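To make the remark concrete: the quadratic Bellman equation above is linear in the entries of P, so P can be estimated by least squares from observed data (x_k, u_k, x_{k+1}) and measured stage costs, without using (A, B) in the estimator. The system matrices, gain, and noise-free data generation below are illustrative assumptions, not the lecture's algorithm:

import numpy as np

rng = np.random.default_rng(0)
# Illustrative system and gain (assumptions): x_{k+1} = A x_k + B u_k, u_k = -K x_k
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.2, 0.5]])               # some stabilizing gain
Acl = A - B @ K

# Generate trajectory data; only (x_k, u_k, x_{k+1}, cost) are used by the estimator
rows, costs = [], []
for _ in range(200):
    x = rng.standard_normal(2)
    u = -K @ x
    x_next = A @ x + B @ u               # "measured" next state
    # Bellman eq.  x^T P x - x'^T P x' = x^T Q x + u^T R u  is linear in vec(P)
    rows.append(np.kron(x, x) - np.kron(x_next, x_next))
    costs.append(x @ Q @ x + u @ R @ u)

p_vec, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
P_hat = p_vec.reshape(2, 2)

# Compare with the model-based Lyapunov solution (fixed-point iteration)
P = np.zeros((2, 2))
for _ in range(2000):
    P = Acl.T @ P @ Acl + Q + K.T @ R @ K
print(np.allclose(P_hat, P, atol=1e-6))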
Ex. Discrete-time LQR

2V(x_k) = x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}

Using the state equation, this can be written as:

2V(x_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)

Assuming a constant state-feedback policy u_k = μ(x_k) = -K x_k for some gain K, write:

x_k^T P x_k = x_k^T Q x_k + x_k^T K^T R K x_k + x_k^T (A - B K)^T P (A - B K) x_k

Since this equation holds for all state trajectories, we have:

(A - B K)^T P (A - B K) - P + Q + K^T R K = 0,

which is a Lyapunov equation, equivalent to the Bellman equation for discrete-time LQR.
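A quick numerical check of this Lyapunov equation, using SciPy's discrete Lyapunov solver on the same illustrative system and gain as above (all matrices are assumptions):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.2, 0.5]])
Acl = A - B @ K

# solve_discrete_lyapunov(a, q) solves  a X a^T - X + q = 0; pass a = Acl^T to match the equation above
P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
residual = Acl.T @ P @ Acl - P + Q + K.T @ R @ K
print(np.allclose(residual, 0))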
Ex. Discrete-time LQR

The discrete-time LQR Hamiltonian function is:

H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) - x_k^T P x_k

A necessary condition for optimality is ∂H(x_k, u_k)/∂u_k = 0. Solving this equation gives:

u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k

Using this relation, the Lyapunov equation yields the discrete-time algebraic Riccati equation (ARE):

A^T P A - P + Q - A^T P B (B^T P B + R)^{-1} B^T P A = 0

The ARE is exactly the Bellman optimality equation for discrete-time LQR.
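A sketch that computes P by iterating the Riccati recursion suggested by the ARE, and then the optimal gain K, for the same illustrative system; scipy.linalg.solve_discrete_are could be used instead of the iteration:

import numpy as np

# Illustrative system (assumptions)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# Riccati recursion: P <- A^T P A - A^T P B (B^T P B + R)^{-1} B^T P A + Q
P = np.zeros((2, 2))
for _ in range(2000):
    P_new = A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A) + Q
    if np.allclose(P_new, P, atol=1e-12):
        break
    P = P_new

K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # optimal feedback gain, u_k = -K x_k
print("P =", P, "\nK =", K)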

