RL and ObC Lecture 2
Control
Optimal decision
• At the current state, apply the decision that minimizes
(current stage cost) + J^*(next state),
where J^*(next state) is the optimal future cost starting from the next state
• This defines the optimal policy: an optimal control to apply at each state (see the sketch below)
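A minimal sketch of this one-step rule, assuming a finite action set and hypothetical names f (dynamics), g (stage cost), and J_star (optimal cost-to-go table); none of these names come from the lecture:

```python
def greedy_action(x, actions, f, g, J_star):
    """Return the action minimizing stage cost plus optimal cost of the next state."""
    return min(actions, key=lambda u: g(x, u) + J_star(f(x, u)))
```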
Principle of optimality
Let {u_0^*, ..., u_{N-1}^*} be an optimal control sequence, with corresponding state sequence {x_0^*, ..., x_N^*}. Consider the tail subproblem that starts at x_k^* at time k and minimizes the cost-to-go from k to N: the tail control sequence {u_k^*, ..., u_{N-1}^*} is then optimal for this tail subproblem.
By the principle of optimality, start with

J_N^*(x_N) = g_N(x_N), for all x_N,

and for k = 0, ..., N − 1, let

J_k^*(x_k) = min_{u_k ∈ U_k(x_k)} [ g_k(x_k, u_k) + J_{k+1}^*(f_k(x_k, u_k)) ], for all x_k.

Then the optimal cost J^*(x_0) is obtained at the last step of this backward recursion: J_0^*(x_0) = J^*(x_0).
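A minimal tabular sketch of this backward recursion, assuming a finite deterministic problem where f[k][x][u] is the next-state index, g[k][x][u] the stage cost, and g_N the terminal cost (names and data layout are illustrative only):

```python
def dp_backward(f, g, g_N, N):
    """Backward DP: return cost-to-go tables J[k][x] and an optimal policy mu[k][x]."""
    n_x = len(g_N)
    J = [[0.0] * n_x for _ in range(N + 1)]
    mu = [[0] * n_x for _ in range(N)]
    J[N] = list(g_N)                              # J_N^*(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):                # work backward in time
        for x in range(n_x):
            costs = [g[k][x][u] + J[k + 1][f[k][x][u]] for u in range(len(g[k][x]))]
            u_best = min(range(len(costs)), key=costs.__getitem__)
            J[k][x], mu[k][x] = costs[u_best], u_best
    return J, mu
```

The last step of the recursion returns J[0][x_0] = J^*(x_0), as stated above.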
Remark
• DP solves optimal decision problems by working backward through time - it provides an offline solution that cannot be implemented online
• RL and Adaptive Control are concerned with forward-in-time solutions - they run in real time
Let’s formulate the problem as a Markov Decision Process:
Consider (X, U, P, R), where X is a set of states and U is a set of actions or controls.
The transition probability P : X × U × X → [0, 1] gives, for each state x ∈ X and action u ∈ U, the conditional probability P^u_{x,x'} = Pr{x' | x, u} of transitioning to state x' ∈ X when the system is in state x ∈ X and takes action u ∈ U.
The cost function R : X × U × X → R gives the expected immediate cost R^u_{x,x'} paid after the given transition.
The Markov property refers to the fact that the transition probabilities P^u_{x,x'} depend only on the current state x and action u, not on the history of earlier states and actions.
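As a concrete illustration, such an MDP can be stored as two arrays indexed by (u, x, x'); the numbers below are made up for the example:

```python
import numpy as np

# P[u, x, x'] = Pr{x' | x, u};  R[u, x, x'] = expected cost paid on that transition.
P = np.array([[[0.9, 0.1],     # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],     # action 1
               [0.4, 0.6]]])
R = np.array([[[1.0, 2.0],
               [0.0, 3.0]],
              [[2.0, 0.5],
               [1.0, 1.0]]])

assert np.allclose(P.sum(axis=2), 1.0)   # each P(. | x, u) is a probability distribution

# Expected immediate cost of taking action u in state x: sum over x' of P[u,x,x'] R[u,x,x'].
expected_cost = (P * R).sum(axis=2)      # shape (n_actions, n_states)
```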
Remark
• Dynamical systems evolve through time or, more generally, through a sequence of events
• therefore, we consider sequential decision problems
• optimality is often desirable in terms of conserving resources such as time, fuel, energy, etc.
Remark
Ergodicity expresses the idea that a point of a dynamical system or a stochastic process will eventually visit all parts of the space that the system moves in, in a uniform and random sense. This implies that the average behavior of the system can be deduced from the trajectory of a "typical" point. Equivalently, a sufficiently large collection of random samples from a process can represent the average statistical properties of the entire process, meaning that the system cannot be reduced or factored into smaller components.
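A rough numerical illustration of this idea (the two-state chain below is made up): for an ergodic Markov chain, the fraction of time a single long trajectory spends in each state approaches the stationary distribution, so time averages match ensemble averages.

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1],         # row-stochastic transition matrix
              [0.3, 0.7]])

x, visits = 0, np.zeros(2)
for _ in range(100_000):          # simulate one long trajectory
    visits[x] += 1
    x = rng.choice(2, p=T[x])

# Stationary distribution: left eigenvector of T for eigenvalue 1, normalized.
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

print(visits / visits.sum(), pi)  # the empirical frequencies should be close to pi
```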
Bellman Optimality
An optimal policy has the property that, no matter what the previous control actions have been, the remaining control actions constitute an optimal policy with regard to the state resulting from those previous controls.
Dynamic Programming
The backward recursion forms the basis for dynamic programming (DP) (Bellman, 1957), which gives offline methods that work backward in time to determine optimal policies
• it requires knowledge of the complete system dynamics in the form of transition probabilities and expected costs (see the sketch below)
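For the MDP formulation above, a hedged sketch of this offline backward pass, assuming the model is given as the arrays P[u, x, x'] and R[u, x, x'] from the earlier example (the function name and layout are illustrative, not from the lecture):

```python
import numpy as np

def dp_mdp(P, R, g_N, N):
    """Finite-horizon backward DP on an MDP: return expected cost-to-go J[k, x] and policy mu[k, x]."""
    n_u, n_x, _ = P.shape
    J = np.zeros((N + 1, n_x))
    mu = np.zeros((N, n_x), dtype=int)
    J[N] = g_N
    for k in range(N - 1, -1, -1):
        # Q_vals[u, x] = expected stage cost + expected cost-to-go of the next state
        Q_vals = (P * R).sum(axis=2) + P @ J[k + 1]
        J[k] = Q_vals.min(axis=0)
        mu[k] = Q_vals.argmin(axis=0)
    return J, mu

# e.g. J, mu = dp_mdp(P, R, g_N=np.zeros(2), N=10)
```

Note that the full model (P and R) is needed at every step, which is exactly the offline requirement mentioned above.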
V(x_k) = (1/2) (x_k^T Q x_k + u_k^T R u_k) + V(x_{k+1})

This is exactly the Bellman equation for LQR. Assume the value is quadratic in the state, so that

V_k(x_k) = (1/2) x_k^T P x_k.

Any kernel matrix P = P^T > 0 then yields the Bellman equation

(1/2) x_k^T P x_k = (1/2) (x_k^T Q x_k + u_k^T R u_k) + (1/2) x_{k+1}^T P x_{k+1}.
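A small numerical check of this substitution for a fixed feedback u_k = −K x_k, assuming SciPy is available; the matrices and the gain K are made up, chosen so that A − BK is stable, and the Lyapunov equation used to compute P here appears on the LQR slide below:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[0.1, 0.3]])       # some stabilizing (not necessarily optimal) gain

# P solving the closed-loop Lyapunov equation (A - BK)^T P (A - BK) - P + Q + K^T R K = 0
P = solve_discrete_lyapunov((A - B @ K).T, Q + K.T @ R @ K)

x = np.array([1.0, -2.0])        # arbitrary test state
u = -K @ x
x_next = A @ x + B @ u
lhs = 0.5 * x @ P @ x
rhs = 0.5 * (x @ Q @ x + u @ R @ u) + 0.5 * x_next @ P @ x_next
print(np.isclose(lhs, rhs))      # True: the quadratic V satisfies the Bellman equation
```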
Remark
So far, we have not used the state dynamics (A, B).
RL algorithms for learning optimal solutions can be devised via temporal difference methods. That is, RL allows the Lyapunov equation to be solved online without knowing (A, B).
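One possible data-based sketch of this idea (not necessarily the algorithm developed later in the course): estimate the kernel P by least squares from observed transitions (x_k, u_k, x_{k+1}) under u_k = −K x_k, without using (A, B) in the estimator. The model below exists only to generate data and to check the answer.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.2], [0.0, 0.8]])    # used only as a simulator / for the final check
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[0.1, 0.3]])

# Each transition gives one linear equation in vec(P):
#   x_k^T P x_k - x_{k+1}^T P x_{k+1} = x_k^T Q x_k + u_k^T R u_k
rows, targets = [], []
for _ in range(20):
    x = rng.standard_normal(2)
    u = -K @ x
    x_next = A @ x + B @ u                # in practice: the measured next state
    rows.append(np.kron(x, x) - np.kron(x_next, x_next))
    targets.append(x @ Q @ x + u @ R @ u)

vec_P, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
P_hat = vec_P.reshape(2, 2)
P_hat = 0.5 * (P_hat + P_hat.T)           # enforce symmetry

P_true = solve_discrete_lyapunov((A - B @ K).T, Q + K.T @ R @ K)
print(np.allclose(P_hat, P_true))         # True, yet the estimator never touched (A, B)
```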
Ex. Discrete-time LQR
(A − BK)^T P (A − BK) − P + Q + K^T R K = 0

H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) − x_k^T P x_k
Using this relation, the Lyapunov equation yields the discrete-time algebraic Riccati equation:

A^T P A − P + Q − A^T P B (B^T P B + R)^{-1} B^T P A = 0
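A quick numerical sanity check of this equation, assuming SciPy is available and using made-up system matrices:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)        # solve the DARE numerically

# Residual of A^T P A - P + Q - A^T P B (B^T P B + R)^{-1} B^T P A = 0
residual = A.T @ P @ A - P + Q - A.T @ P @ B @ np.linalg.inv(B.T @ P @ B + R) @ B.T @ P @ A
print(np.allclose(residual, 0.0, atol=1e-8))   # True

# Corresponding optimal feedback gain K = (B^T P B + R)^{-1} B^T P A
K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
```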