
Reinforcement Learning and Optimization-based Control

Assoc. Prof. Dr. Emre Koyuncu

Department of Aeronautics Engineering


Istanbul Technical University

Lecture 2: MDP and Bellman Optimality



Table of Contents

1 Markov Decision Process



Sequential Decision

Optimal decision
• At the current state, apply the decision that minimizes
Current stage cost + J*(Next state),
where J*(Next state) is the optimal future cost starting from the next state.
• This defines the optimal policy: an optimal control to apply at each state.



Principle of Optimality

Principle of optimality
Let {u_0^*, ..., u_{N-1}^*} be an optimal control sequence with corresponding state sequence {x_0^*, ..., x_N^*}. Consider the tail subproblem that starts at x_k^* at time k and minimizes over {u_k, ..., u_{N-1}} the cost-to-go from k to N:

g_k(x_k^*, u_k) + \sum_{m=k+1}^{N-1} g_m(x_m, u_m) + g_N(x_N)

Then the tail control sequence {u_k^*, ..., u_{N-1}^*} is optimal for the tail subproblem.
Dynamic Programming
Solve all the tail subproblems of a given time length using the solution of
all the tail subproblems of shorter time length.
By the principle of optimality:
• Consider every possible u_k and solve the tail subproblem that starts at the next state x_{k+1} = f_k(x_k, u_k)
• Optimize over all u_k

Start with

J_N^*(x_N) = g_N(x_N), for all x_N,

and for k = N-1, ..., 0, let

J_k^*(x_k) = \min_{u_k \in U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k)) ], for all x_k.

The optimal cost J^*(x_0) is then obtained at the last step: J_0^*(x_0) = J^*(x_0).
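As a concrete illustration, here is a minimal Python sketch of the backward DP recursion for a hypothetical discretized scalar problem; the dynamics f, costs g and gN, state grid, and control set below are illustrative assumptions, not part of the lecture.

import numpy as np

# Hypothetical problem data (for illustration only): scalar system x_{k+1} = f(x, u)
N = 5                                   # horizon
X = np.linspace(-2.0, 2.0, 41)          # discretized state grid (assumption)
U = np.array([-1.0, 0.0, 1.0])          # finite control set (assumption)

def f(x, u):                            # dynamics x_{k+1} = f_k(x_k, u_k)
    return 0.9 * x + u

def g(x, u):                            # stage cost g_k(x_k, u_k)
    return x**2 + 0.1 * u**2

def gN(x):                              # terminal cost g_N(x_N)
    return 10.0 * x**2

# Backward DP: start from J_N^*(x) = g_N(x) and recurse J_k^* from J_{k+1}^*
J = gN(X)                               # J_N^* on the grid
policy = []
for k in range(N - 1, -1, -1):
    Q = np.empty((X.size, U.size))
    for j, u in enumerate(U):
        x_next = f(X, u)
        # interpolate J_{k+1}^* at the next states (values outside the grid are clamped)
        Q[:, j] = g(X, u) + np.interp(x_next, X, J)
    policy.append(U[np.argmin(Q, axis=1)])   # optimal control on the grid at time k
    J = Q.min(axis=1)                        # J_k^*(x) for all grid states

print("J_0^*(x=1.0) ≈", np.interp(1.0, X, J))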


Optimal Sequential Decision

Remark
• DP solves optimal decision problems by working backward through time; it provides an offline solution that cannot be implemented online.
• RL and adaptive control are concerned with forward-in-time solutions that run in real time.

Let us formulate the problem as a Markov Decision Process (MDP).
Consider the tuple (X, U, P, R), where X is a set of states and U is a set of actions (controls). The transition probability P : X × U × X → [0, 1] gives, for each state x ∈ X and action u ∈ U, the conditional probability P^u_{xx'} = Pr{x' | x, u} of transitioning to state x' ∈ X when the MDP is in state x ∈ X and takes action u ∈ U. The cost function R : X × U × X → ℝ gives the expected immediate cost R^u_{xx'} paid after the given transition.

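For concreteness, one possible in-memory representation of the tuple (X, U, P, R) is sketched below in Python/NumPy; the two-state, two-action numbers are made up purely for illustration.

import numpy as np

nX, nU = 2, 2                        # |X| states, |U| actions (toy sizes)

# P[u, x, x'] = Pr{x' | x, u}; each row over x' must sum to one
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.1, 0.9]]])

# R[u, x, x'] = expected immediate cost paid on the transition (x, u) -> x'
R = np.array([[[1.0, 2.0],
               [0.0, 3.0]],
              [[2.0, 0.5],
               [1.0, 1.0]]])

assert np.allclose(P.sum(axis=2), 1.0)   # valid transition probabilities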


Optimal Sequential Decision

The Markov property means that the transition probabilities P^u_{xx'} depend only on the current state x, not on the history.


Remark
The basic problem of the MDP is to find a mapping π : X × U → [0, 1] that gives the conditional probability π(x, u) = Pr{u | x} of taking action u given that the MDP is in state x.
• Such a mapping is termed a closed-loop control, strategy, or policy.
• A policy π(x, u) = Pr{u | x} is called stochastic (or mixed) if there is a nonzero probability of selecting more than one control in state x.
• A policy is called deterministic if it admits only one control in each state, with probability one.



Optimal Sequential Decision

Remark
• Dynamical systems evolve through time or, more generally, through a sequence of events
• therefore, we consider sequential decision problems
• optimality is often desirable in terms of conserving resources such as
time, fuel, energy, etc.

Define a stage cost at time k by r_k = r_k(x_k, u_k, x_{k+1}). Then

R^u_{xx'} = E{ r_k | x_k = x, u_k = u, x_{k+1} = x' },

where E{·} is the expected value operator. The sum of future costs over the time interval [k, k + T] is

J_{k,T} = \sum_{i=0}^{T} γ^i r_{k+i} = \sum_{i=k}^{k+T} γ^{i-k} r_i,

where 0 ≤ γ < 1 is a discount factor that reduces the weight of costs incurred further in the future.
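A quick numeric sanity check of the two equivalent index conventions in the sum above (the cost sequence, horizon, and γ are arbitrary assumptions):

import numpy as np

gamma, k, T = 0.9, 3, 5
r = np.arange(20, dtype=float)                                 # hypothetical stage costs r_0, r_1, ...

J1 = sum(gamma**i * r[k + i] for i in range(T + 1))            # sum_{i=0}^{T} gamma^i r_{k+i}
J2 = sum(gamma**(i - k) * r[i] for i in range(k, k + T + 1))   # sum_{i=k}^{k+T} gamma^{i-k} r_i
assert np.isclose(J1, J2)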
Optimal Sequential Decision
Consider that the agent selects a fixed stationary policy π(x, u) = Pr{u | x}, i.e., the conditional probabilities π_k(x_k, u_k) are independent of k. Then the closed-loop MDP reduces to a Markov chain with state space X. The fixed transition probabilities of this Markov chain are given by

P_{x,x'} ≡ P^π_{x,x'} = \sum_u Pr{x' | x, u} Pr{u | x} = \sum_u π(x, u) P^u_{xx'},

where the Chapman-Kolmogorov identity is used. If the Markov chain is ergodic, it can be shown that every MDP has a stationary deterministic optimal policy (Bertsekas and Tsitsiklis, 1996; Wheeler and Narendra, 1986).
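A short sketch of this reduction for a fixed stochastic policy, reusing the toy transition array introduced earlier (all numbers are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x'] (toy values)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.6, 0.4],                   # pi[x, u] = Pr{u | x}
               [0.2, 0.8]])

# Closed-loop chain: P_pi[x, x'] = sum_u pi(x, u) * P[u, x, x']
P_pi = np.einsum('xu,uxy->xy', pi, P)
assert np.allclose(P_pi.sum(axis=1), 1.0)    # still a valid Markov chain
print(P_pi)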

Remark
Ergodicity expresses the idea that a point of a dynamical system or a stochastic process will eventually visit all parts of the space the system moves in, in a uniform and random sense. This implies that the average behavior of the system can be deduced from the trajectory of a "typical" point. Equivalently, a sufficiently large collection of random samples from the process can represent the average statistical properties of the entire process, meaning that the system cannot be reduced or factored into smaller components.



Value Function
The value of a policy is defined as the conditional expected value of the future cost when starting in state x at time k and following the policy π(x, u):

V_k^π(x) = E_π{ J_{k,T} | x_k = x } = E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

A main objective of the MDP is to determine a policy π(x, u) that minimizes the expected future cost:

π^*(x, u) = \arg\min_π V_k^π(x) = \arg\min_π E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

This policy is termed the optimal policy, and the corresponding optimal value is given as:

V_k^*(x) = \min_π V_k^π(x) = \min_π E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

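Because V_k^π is a conditional expectation, it can be estimated by simply rolling out the policy and averaging discounted costs. A minimal Monte Carlo sketch on the toy MDP used above (all data are assumptions):

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma, T, n_rollouts = 0.9, 50, 2000

def rollout(x):
    # one sampled realization of sum_{i=0}^{T} gamma^i r_{k+i} starting from state x
    J = 0.0
    for i in range(T + 1):
        u = rng.choice(2, p=pi[x])
        x_next = rng.choice(2, p=P[u, x])
        J += gamma**i * R[u, x, x_next]
        x = x_next
    return J

V_hat = [np.mean([rollout(x) for _ in range(n_rollouts)]) for x in range(2)]
print("Monte Carlo estimate of V_k^pi:", V_hat)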


Backward Recursion
Using the Chapman-Kolmogorov identity and the Markov property, write the value of the policy as:

V_k^π(x) = E_π{ J_{k,T} | x_k = x } = E_π{ \sum_{i=k}^{k+T} γ^{i-k} r_i | x_k = x }

V_k^π(x) = E_π{ r_k + γ \sum_{i=k+1}^{k+T} γ^{i-(k+1)} r_i | x_k = x }

V_k^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ E_π{ \sum_{i=k+1}^{k+T} γ^{i-(k+1)} r_i | x_{k+1} = x' } ]

Finally, the value function satisfies the backward recursion:

V_k^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^π(x') ]

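This recursion translates almost directly into code. A sketch of finite-horizon policy evaluation on the same toy MDP (data are assumptions); for matching horizon and discount it should agree with the Monte Carlo estimate above:

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma, T = 0.9, 50

V_next = np.zeros(2)                          # value beyond the horizon is zero
for _ in range(T + 1):
    # V_k(x) = sum_u pi(x,u) sum_x' P^u_{xx'} [ R^u_{xx'} + gamma * V_{k+1}(x') ]
    V = np.einsum('xu,uxy,uxy->x', pi, P, R) \
        + gamma * np.einsum('xu,uxy,y->x', pi, P, V_next)
    V_next = V

print("V_k^pi over horizon T:", V)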


Dynamic Programming
The optimal cost can be written as:

V_k^*(x) = \min_π V_k^π(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^π(x') ]

Bellman Optimality
An optimal policy has the property that, no matter what the previous control actions have been, the remaining control actions constitute an optimal policy with regard to the state resulting from those previous controls.

Then we can write:

V_k^*(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

Suppose an arbitrary control u is applied at time k and the optimal policy is applied from k + 1 on. Then the optimal control at time k is given by:

π^*(x_k = x, u) = \arg\min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]
Dynamic Programming
Under the assumption that the Markov chain corresponding to each policy is ergodic, every MDP has a stationary deterministic optimal policy. Then we can minimize the conditional expectation directly over the actions u available in state x. Therefore:

V_k^*(x) = \min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

u_k^* = \arg\min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V_{k+1}^*(x') ]

Dynamic Programming
The backward recursion forms the basis for dynamic programming (DP) (Bellman, 1957), which gives offline methods that work backward in time to determine optimal policies.
• DP requires knowledge of the complete system dynamics in the form of transition probabilities and expected costs.

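A compact sketch of this backward recursion on the toy MDP (data are assumptions), returning both the optimal values and the minimizing action in each state:

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
gamma, T = 0.9, 50

V_next = np.zeros(2)                          # value beyond the horizon
for _ in range(T + 1):
    # Q[x, u] = sum_x' P^u_{xx'} [ R^u_{xx'} + gamma * V_{k+1}^*(x') ]
    Q = np.einsum('uxy,uxy->xu', P, R) + gamma * np.einsum('uxy,y->xu', P, V_next)
    V_next = Q.min(axis=1)                    # V_k^*(x) = min_u Q[x, u]

u_star = Q.argmin(axis=1)                     # minimizing (optimal) action in each state
print("V_0^*:", V_next, "u^*:", u_star)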


Bellman Equation
• DP is a backward-in-time method for finding the optimal value and policy.
• RL uses causal experience, obtained by executing sequential decisions, to improve control actions based on the observed results of using the current policy.
To derive forward-in-time methods for finding optimal values and policies, now set the time horizon T to infinity and define the infinite horizon cost:

J_k = \sum_{i=0}^{∞} γ^i r_{k+i} = \sum_{i=k}^{∞} γ^{i-k} r_i

The associated infinite horizon value function is:

V^π(x) = E_π{ J_k | x_k = x } = E_π{ \sum_{i=k}^{∞} γ^{i-k} r_i | x_k = x }

It can be seen that the value function for policy π(x, u) satisfies the Bellman equation:

V^π(x) = \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^π(x') ]
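For a finite MDP this Bellman equation is linear in the values V^π(x), so it can be solved directly as a linear system; a sketch with the toy data used earlier (all numbers are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])      # pi[x, u]
gamma = 0.9

P_pi = np.einsum('xu,uxy->xy', pi, P)                 # closed-loop transition matrix
r_pi = np.einsum('xu,uxy,uxy->x', pi, P, R)           # expected one-step cost under pi

# Bellman equation V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print("V^pi:", V_pi)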
Bellman Optimality Equation
If the MDP is finite and has N states, the Bellman equation is a system of N linear equations for the value V^π(x) of being in each state x given the current policy π(x, u). The optimal infinite horizon value satisfies:

V^*(x) = \min_π V^π(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^π(x') ]

which yields the Bellman optimality equation:

V^*(x) = \min_π \sum_u π(x, u) \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]

Under the ergodicity assumption, the Bellman optimality equation becomes:

V^*(x) = \min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]

This is known as the Hamilton-Jacobi-Bellman equation in control systems. The optimal control is given by:

u^* = \arg\min_u \sum_{x'} P^u_{xx'} [ R^u_{xx'} + γ V^*(x') ]
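The Bellman optimality equation is nonlinear in V^* because of the minimization, but it can be solved by fixed-point (value) iteration; a sketch on the same toy MDP (data are assumptions):

import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],      # P[u, x, x']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 2.0], [0.0, 3.0]],      # R[u, x, x']
              [[2.0, 0.5], [1.0, 1.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(x) <- min_u sum_x' P^u_{xx'} [ R^u_{xx'} + gamma V(x') ]
    Q = np.einsum('uxy,uxy->xu', P, R) + gamma * np.einsum('uxy,y->xu', P, V)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:    # stop at the fixed point
        break
    V = V_new

print("V*:", V_new, "u*:", Q.argmin(axis=1))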
Ex. Discrete-time LQR
Consider the discrete-time LQR problem, where the MDP is deterministic and satisfies the state transition equation:

x_{k+1} = A x_k + B u_k

The associated performance index has deterministic stage costs:

J_k = \frac{1}{2} \sum_{i=k}^{∞} r_i = \frac{1}{2} \sum_{i=k}^{∞} ( x_i^T Q x_i + u_i^T R u_i )

where the cost weighting matrices satisfy Q = Q^T ≥ 0 and R = R^T > 0. The state space X and action space U are infinite and continuous. The value function is then:

V(x_k) = \frac{1}{2} \sum_{i=k}^{∞} ( x_i^T Q x_i + u_i^T R u_i )

V(x_k) = \frac{1}{2} ( x_k^T Q x_k + u_k^T R u_k ) + \frac{1}{2} \sum_{i=k+1}^{∞} ( x_i^T Q x_i + u_i^T R u_i )
Ex. Discrete-time LQR

V(x_k) = \frac{1}{2} ( x_k^T Q x_k + u_k^T R u_k ) + V(x_{k+1})

This is exactly the Bellman equation for LQR. Assume the value is quadratic in the state,

V_k(x_k) = \frac{1}{2} x_k^T P x_k,

for a kernel matrix P = P^T > 0. Then the Bellman equation takes the form:

2V(x_k) = x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}

Remark
Note that the state dynamics (A, B) do not appear explicitly in this Bellman equation. RL algorithms for learning optimal solutions can be devised via temporal difference methods; that is, RL allows the Lyapunov equation to be solved online without knowing (A, B).
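To make the remark concrete: the quadratic Bellman equation above is linear in the entries of P, so P can be estimated by least squares from observed data (x_k, u_k, x_{k+1}) and measured stage costs, without using (A, B) in the estimator. The system matrices, gain, and noise-free data generation below are illustrative assumptions, not the lecture's algorithm:

import numpy as np

rng = np.random.default_rng(0)
# Illustrative system and gain (assumptions): x_{k+1} = A x_k + B u_k, u_k = -K x_k
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.2, 0.5]])               # some stabilizing gain
Acl = A - B @ K

# Generate trajectory data; only (x_k, u_k, x_{k+1}, cost) are used by the estimator
rows, costs = [], []
for _ in range(200):
    x = rng.standard_normal(2)
    u = -K @ x
    x_next = A @ x + B @ u               # "measured" next state
    # Bellman eq.  x^T P x - x'^T P x' = x^T Q x + u^T R u  is linear in vec(P)
    rows.append(np.kron(x, x) - np.kron(x_next, x_next))
    costs.append(x @ Q @ x + u @ R @ u)

p_vec, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
P_hat = p_vec.reshape(2, 2)

# Compare with the model-based Lyapunov solution (fixed-point iteration)
P = np.zeros((2, 2))
for _ in range(2000):
    P = Acl.T @ P @ Acl + Q + K.T @ R @ K
print(np.allclose(P_hat, P, atol=1e-6))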
Ex. Discrete-time LQR

2V(x_k) = x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}

Using the state equation, this can be written as:

2V(x_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)

Assuming a constant state-feedback policy u_k = μ(x_k) = -K x_k for some gain K, write:

x_k^T P x_k = x_k^T Q x_k + x_k^T K^T R K x_k + x_k^T (A - B K)^T P (A - B K) x_k

Since this equation holds for all state trajectories, we have:

(A - B K)^T P (A - B K) - P + Q + K^T R K = 0,

which is a Lyapunov equation, equivalent to the Bellman equation for discrete-time LQR.
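A quick numerical check of this Lyapunov equation, using SciPy's discrete Lyapunov solver on the same illustrative system and gain as above (all matrices are assumptions):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.2, 0.5]])
Acl = A - B @ K

# solve_discrete_lyapunov(a, q) solves  a X a^T - X + q = 0; pass a = Acl^T to match the equation above
P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
residual = Acl.T @ P @ Acl - P + Q + K.T @ R @ K
print(np.allclose(residual, 0))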
Ex. Discrete-time LQR

The discrete-time LQR Hamiltonian function is:

H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) - x_k^T P x_k

A necessary condition for optimality is ∂H(x_k, u_k)/∂u_k = 0. Solving this equation gives:

u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k

Using this relation, the Lyapunov equation yields the discrete-time algebraic Riccati equation (ARE):

A^T P A - P + Q - A^T P B (B^T P B + R)^{-1} B^T P A = 0

The ARE is exactly the Bellman optimality equation for discrete-time LQR.
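A sketch that computes P by iterating the Riccati recursion suggested by the ARE, and then the optimal gain K, for the same illustrative system; scipy.linalg.solve_discrete_are could be used instead of the iteration:

import numpy as np

# Illustrative system (assumptions)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# Riccati recursion: P <- A^T P A - A^T P B (B^T P B + R)^{-1} B^T P A + Q
P = np.zeros((2, 2))
for _ in range(2000):
    P_new = A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A) + Q
    if np.allclose(P_new, P, atol=1e-12):
        break
    P = P_new

K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # optimal feedback gain, u_k = -K x_k
print("P =", P, "\nK =", K)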

