10 - Reinforcement Learning

Computational Control

Reinforcement learning

Saverio Bolognani

Automatic Control Laboratory (IfA)


ETH Zurich
Can we find the optimal policy without model information?

Two different settings:

Based on collected data (typically in the form of repeated episodes)

$(x_0, u_0, r_0),\ (x_1, u_1, r_1),\ \dots,\ (x_T, u_T, r_T)$

▶ repetition of the control task in a numerical simulator
▶ repetition of the control task in a controlled environment

Online during the control task with no prior training

→ adaptive control
- extremely challenging dual control task (learn AND control)
- few guarantees
- an open problem for decades!

1 / 22
Policy iteration and value iteration methods require either
a full (and sufficiently rich) trajectory or
a model.

Is it possible to learn the system as data is collected along a single trajectory?

Remember the difficult step in policy iteration: for a given policy π, estimate

$Q^\pi(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'}\, V^\pi(x')$

where

$V^\pi(x) = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^k r_k \,\Big|\, x_0 = x \right]$

In Monte Carlo learning we did this by simply evaluating it from the episode data:

$Q^\pi(x_k, u_k) \approx g_k = \sum_{i=k}^{T} \gamma^{i-k} r_i$

2 / 22
Monte Carlo learning

[Figure: a sample trajectory $x_1, x_2, x_3, \dots$ with inputs $u_1, u_2, u_3, \dots$ and rewards $r_1, \dots, r_9$; each visited pair $(x_k, u_k)$ is credited with the return $g_k$ computed from the tail of the episode.]

$g_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$
$g_2 = r_2 + \gamma r_3 + \gamma^2 r_4 + \dots$
$g_3 = r_3 + \gamma r_4 + \gamma^2 r_5 + \dots$
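As a concrete illustration, here is a minimal sketch of Monte Carlo Q estimation from episode data. The episode format (a list of (state, action, reward) tuples) and the dictionary-based Q table are assumptions made for the example, not part of the slides.

```python
# Minimal sketch of Monte Carlo Q estimation (every-visit running averages).
# Assumed data layout: an episode is a list of (x, u, r) tuples with discrete x, u.
from collections import defaultdict

def mc_update(Q, counts, episode, gamma=0.95):
    """Update the running-average Q estimate with the returns g_k of one episode."""
    g = 0.0
    # Walk the episode backwards so that g accumulates the discounted tail return.
    for x, u, r in reversed(episode):
        g = r + gamma * g                                # g_k = r_k + gamma * g_{k+1}
        counts[(x, u)] += 1
        Q[(x, u)] += (g - Q[(x, u)]) / counts[(x, u)]    # running average of returns
    return Q

# Toy usage with a hand-made episode.
Q, counts = defaultdict(float), defaultdict(int)
episode = [(0, 1, -1.0), (1, 0, -1.0), (2, 1, 0.0)]
mc_update(Q, counts, episode)
```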

3 / 22
We haven't used the Bellman equation for Q:

$Q(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} \sum_{u'} \pi(x', u')\, Q(x', u')$

(although, with infinite data, the estimate of Q will of course satisfy it)

Temporal Difference error


Given an individual data point

$(x_k, u_k, x_{k+1}, u_{k+1}, r_k)$

define the TD error as

$e_k = r_k + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k)$

According to the Bellman equation, we have

$\mathbb{E}[e_k] = 0$

4 / 22
Stochastic approximation
Suppose that we have a random variable

e(q)

and we want to find q that solves E[e(q)] = 0.

Stochastic approximation
The iteration
$q_{k+1} \leftarrow q_k - \alpha_k\, e(q_k)$

converges to the solution q∗ of E[e(q)] = 0 if


e(q) is bounded
E[e(q)] is non-decreasing in q (and increasing at q∗ )
the sequence αk satisfies

$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty$
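A minimal numerical sketch of this iteration, under an illustrative noise model that is not from the slides: $e(q) = q - Z$ with $Z \sim \mathcal{N}(2, 1)$, so $\mathbb{E}[e(q)] = q - 2$ and the root is $q^* = 2$.

```python
# Stochastic approximation (Robbins-Monro) on a toy root-finding problem.
import numpy as np

rng = np.random.default_rng(0)
q = 0.0
for k in range(1, 20001):
    e = q - rng.normal(2.0, 1.0)     # one noisy sample of e(q); E[e(q)] = q - 2
    alpha = 1.0 / k                  # sum(alpha_k) diverges, sum(alpha_k^2) converges
    q -= alpha * e                   # q_{k+1} <- q_k - alpha_k e(q_k)

print(q)                             # close to the root q* = 2
```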

5 / 22
TD-learning (SARSA)
Key idea: Use the empirical evaluations of the TD error

$e_k = r_k + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k)$

to update the Q function via the stochastic approximation iteration

$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \big[\, r_k + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \,\big]$

Usually used with a constant $\alpha_k$ ⇒ non-zero steady-state variance

[Figure: the same trajectory as before; each observed transition now updates the corresponding entry $Q(x_k, u_k)$ along the way, without waiting for the end of the episode.]

6 / 22
TD-learning (SARSA)

Moreover, instead of alternating between


complete policy evaluation (full episode)
policy improvement
we can interleave the two operations.

TD-learning (policy iteration)


At every time k
1 Select $u_k$ via an ε-greedy policy based on the latest estimate of $Q(x_k, u_k)$

$\pi(x, u) \leftarrow \begin{cases} \operatorname{argmax}_u Q(x, u) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(\mathcal{U}) & \text{with probability } \epsilon \end{cases}$

2 Perform an iterative update of the Q function

$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha \big[\, r_k + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \,\big]$

Meanwhile, allow ϵ → 0 as k → ∞.
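A minimal sketch of this interleaved SARSA loop in Python. The environment interface (reset() returning the state, step(u) returning (next state, reward, done)) and the tabular encoding of states and actions are assumptions made for illustration; the greedy step follows the slide's argmax convention.

```python
# Sketch of TD-learning (SARSA) with an epsilon-greedy policy, tabular case.
import numpy as np

def eps_greedy(Q, x, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over actions
    return int(np.argmax(Q[x]))                # exploit: greedy w.r.t. current Q

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = env.reset()                        # assumed: returns the initial state
        u = eps_greedy(Q, x, eps, rng)
        done = False
        while not done:
            x_next, r, done = env.step(u)      # assumed: returns (state, reward, done)
            u_next = eps_greedy(Q, x_next, eps, rng)
            # TD error e_k = r_k + gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k)
            e = r + gamma * Q[x_next, u_next] * (not done) - Q[x, u]
            Q[x, u] += alpha * e               # stochastic approximation step
            x, u = x_next, u_next
    return Q
```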

7 / 22
TD-learning (value iteration)

Instead of using the temporal difference error to perform policy evaluation (via
stochastic approximation), we can directly use it to do a stochastic
approximation of the Bellman optimality principle.

Bellman optimality principle (in Q)

$Q^*(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} Q^*(x', u')$

Individual realization

$Q^*(x_k, u_k) = r_k + \gamma \min_{u} Q^*(x_{k+1}, u)$

Notice that the expectation of the individual realization is equal to the Bellman
optimality principle.
(Can we do the same with the Bellman optimality principle on the value function?)

8 / 22
Q-learning

The stochastic approximation update for the Bellman optimality principle is

$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \big[\, r_k + \gamma \min_{u} Q(x_{k+1}, u) - Q(x_k, u_k) \,\big]$

Important things to notice:


The next input $u_{k+1}$ is irrelevant (off-policy algorithm); it does not need to be the optimal one.
Convergence to the optimal Q is guaranteed under
▶ non-summability/square-summability assumption on αk
▶ full state-action space exploration
The policy is typically updated as ϵ-greedy
Same scalability issue, and same solution: function approximations
▶ linear approximants
▶ neural networks
Forward dynamic programming!
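A minimal tabular sketch of the Q-learning update above, written for this slide's cost-minimization convention (min over the next action). The environment interface and the ε-greedy exploration schedule are illustrative assumptions.

```python
# Sketch of tabular Q-learning (off-policy), cost-minimization convention.
import numpy as np

def q_learning(env, n_states, n_actions, steps=50000, alpha=0.1, gamma=0.95, eps=0.2):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    x = env.reset()                            # assumed environment interface
    for _ in range(steps):
        # epsilon-greedy behaviour policy (greedy = cost-minimizing action)
        u = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmin(Q[x]))
        x_next, cost, done = env.step(u)
        # Q(x_k,u_k) <- Q(x_k,u_k) + alpha [ r_k + gamma min_u Q(x_{k+1},u) - Q(x_k,u_k) ]
        target = cost + gamma * np.min(Q[x_next]) * (not done)
        Q[x, u] += alpha * (target - Q[x, u])
        x = env.reset() if done else x_next
    return Q
```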

9 / 22
Q-learning with linear approximant

Consider again the linear parametrization

$Q_\theta(x, u) = \sum_{\ell=1}^{d} \phi_\ell(x, u)\, \theta_\ell = \phi^\top(x, u)\, \theta$

How do we update the parameter θ in order to satisfy the Bellman optimality principle

$Q^*(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} Q^*(x', u')$

that is,

$\phi^\top(x, u)\, \theta^* = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\, \theta^*$

iteratively as samples become available?

10 / 22
$\phi^\top(x, u)\, \theta^* = \underbrace{R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\, \theta^*}_{Q^+}$

First, notice that this equation (most likely) does not have a solution.

Loss function

$\min_\theta\ \underbrace{\tfrac{1}{2} \big( \phi^\top(x, u)\, \theta - Q^+ \big)^2}_{L(\theta)}$

A simple iteration that converges to this minimum is gradient descent:

$\theta_{k+1} = \theta_k - \alpha \nabla L(\theta_k) = \theta_k - \alpha \big( \phi^\top(x, u)\, \theta_k - Q^+ \big) \phi(x, u)$

11 / 22
Q-learning as stochastic gradient descent

$\theta_{k+1} = \theta_k - \alpha \big( \phi^\top(x, u)\, \theta_k - Q^+ \big) \phi(x, u)$

Similarly to stochastic approximation, we can


replace the real gradient with samples whose expectation is the gradient
use a non-summable/square-summable step size

Replace

$Q^+ = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\, \theta^*$

with

$r_k + \gamma \min_{u} \phi^\top(x_{k+1}, u)\, \theta_k$
u

and obtain the iterative update

$\theta_{k+1} = \theta_k - \alpha_k \Big( \phi^\top(x_k, u_k)\, \theta_k - \big( r_k + \gamma \min_{u} \phi^\top(x_{k+1}, u)\, \theta_k \big) \Big) \phi(x_k, u_k)$
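A minimal sketch of this update with a linear approximant. The feature map φ and the discrete action set below are placeholders chosen for the example; only the update rule itself comes from the slide.

```python
# Sketch of one Q-learning step with a linear approximant Q_theta(x,u) = phi(x,u)^T theta.
import numpy as np

def q_step(theta, phi, x, u, cost, x_next, actions, alpha=0.05, gamma=0.95):
    """One stochastic gradient step on 1/2 (phi(x,u)^T theta - Q+)^2."""
    q_pred = phi(x, u) @ theta
    # Sample-based target: r_k + gamma * min_u phi(x_{k+1}, u)^T theta_k
    target = cost + gamma * min(phi(x_next, a) @ theta for a in actions)
    return theta - alpha * (q_pred - target) * phi(x, u)

# Example with an assumed feature map (quadratic in a scalar state, bilinear in u).
phi = lambda x, u: np.array([1.0, x, x * x, u, u * x])
theta = np.zeros(5)
theta = q_step(theta, phi, x=0.3, u=1, cost=0.1, x_next=0.2, actions=[0, 1])
```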

12 / 22
General approximators

Much more complex approximators (e.g., neural networks) can be used, as long as:
they provide a parametrized approximation Qθ(x, u)
they allow Qθ(x, u) to be minimized over u
they allow a stochastic gradient step in θ to minimize a loss function (the quality of the approximation)

13 / 22
www.incontrolpodcast.com

Listen to Anu telling us about the challenge of adaptation

14 / 22
Really model free?

Monte Carlo and Temporal Difference methods seem to do without a model of the system.
Is that really true? What model information are we assuming?

It is possible to learn the optimal policy π(x, u) directly, without also learning the value function or the Q function (which implies knowledge of a Markovian state x).

Policy gradient methods

15 / 22
Policy gradient

Consider a trajectory τ of the system

$\tau = (x_0, u_0, x_1, u_1, \dots, x_T, u_T)$

Assume for simplicity that x0 is fixed over multiple episodes/experiments.

For a fixed x0 , let P(τ ) be the probability of trajectory τ happening.


Let the associated cost be

$R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$

Parametrized policy
Let πθ (x, u) be a stochastic policy parametrized in θ
based on a linear combination of basis functions ϕ⊤ θ
based on some parametric form (Gaussian)
neural network with weights θ

16 / 22
Goal: minimize the expected cost

$J(\theta) := \mathbb{E}_{\tau \sim \pi_\theta} R(\tau) = \sum_{\tau} P_\theta(\tau)\, R(\tau)$

where Pθ (τ ) is the probability that trajectory τ happens when the policy πθ is used.

Two sources of complexity


1 It’s an expectation, therefore a sum over all possible trajectories
2 Pθ (τ ) is unknown: it depends on both
▶ the transition probabilities of the system
▶ the policy πθ

Trajectory probability

$P_\theta(\tau) = \prod_{t=0}^{T} \underbrace{P(x_{t+1} \mid x_t, u_t)}_{\text{transition probabilities}}\ \underbrace{\pi_\theta(u_t, x_t)}_{\text{policy}}$

17 / 22
As we are trying to minimize J(θ), let’s try to derive the gradient ∇J(θ).

Proposition

$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ \nabla \log P_\theta(\tau)\, R(\tau) \big]$

where $P_\theta(\tau)$ is the probability of τ when $\pi_\theta$ is used.

Good news! We knew that J(θ) is an expectation, but ∇J(θ) is also an expectation. Do you remember stochastic gradient descent?
Proof:

$\begin{aligned} \nabla J(\theta) = \nabla\, \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ R(\tau) \big] &= \nabla \sum_{\tau} P_\theta(\tau)\, R(\tau) \\ &= \sum_{\tau} \nabla P_\theta(\tau)\, R(\tau) \\ &= \sum_{\tau} \frac{P_\theta(\tau)}{P_\theta(\tau)}\, \nabla P_\theta(\tau)\, R(\tau) = \sum_{\tau} P_\theta(\tau)\, \nabla \log P_\theta(\tau)\, R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ \nabla \log P_\theta(\tau)\, R(\tau) \big] \end{aligned}$

18 / 22
$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ \nabla \log P_\theta(\tau)\, R(\tau) \big]$

What is ∇ log Pθ (τ )?

Proposition

$\nabla \log P_\theta(\tau) = \sum_{t=0}^{T} \nabla \log \pi_\theta(u_t, x_t)$

Amazing result!
no more dependence on the transition probabilities
depending on the parametrization of πθ , ∇ log πθ is known
we need a trajectory τ to compute it

Proof:

$\begin{aligned} \nabla \log P_\theta(\tau) &= \nabla \log \Big( \prod_{t=0}^{T} P(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t, x_t) \Big) \\ &= \nabla \Big( \sum_{t=0}^{T} \log P(x_{t+1} \mid x_t, u_t) + \sum_{t=0}^{T} \log \pi_\theta(u_t, x_t) \Big) = \sum_{t=0}^{T} \nabla \log \pi_\theta(u_t, x_t) \end{aligned}$

19 / 22
Putting things together

Instead of computing

$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ \nabla \log P_\theta(\tau)\, R(\tau) \big]$

we use a sample

$\nabla \log P_\theta(\hat{\tau})\, R(\hat{\tau})$

where $\hat{\tau}$ is drawn from the distribution $\tau \sim \pi_\theta$.
How do we sample this distribution?

Policy gradient algorithm


Iteratively, repeat
1 Generate τ via πθ
2 Compute R(τ )
3 Update $\theta \leftarrow \theta - \alpha\, R(\tau) \sum_{t=0}^{T} \nabla \log \pi_\theta(u_t, x_t)$
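A minimal sketch of one iteration of this algorithm, assuming a Gaussian policy $\pi_\theta(u \mid x) = \mathcal{N}(\theta^\top x, \sigma^2)$ so that $\nabla \log \pi_\theta$ has a closed form; the policy class, σ, and the rollout format are assumptions for illustration, only the update rule comes from the slide.

```python
# Sketch of one policy gradient iteration with a Gaussian policy.
import numpy as np

def grad_log_pi(theta, x, u, sigma=0.5):
    # Gradient of log N(u; theta^T x, sigma^2) with respect to theta
    return (u - theta @ x) * x / sigma**2

def policy_gradient_step(theta, rollout, alpha=1e-3, gamma=0.99, sigma=0.5):
    """rollout: list of (x_t, u_t, r_t) obtained by playing pi_theta once."""
    R = sum(gamma**t * r for t, (_, _, r) in enumerate(rollout))     # cost R(tau)
    g = sum(grad_log_pi(theta, x, u, sigma) for x, u, _ in rollout)  # grad log P_theta(tau)
    return theta - alpha * R * g    # theta <- theta - alpha R(tau) sum_t grad log pi_theta

# Usage: repeat (1) generate a rollout with pi_theta, (2) compute R(tau), (3) update theta.
theta = np.zeros(2)
rollout = [(np.array([1.0, 0.0]), 0.2, 1.0), (np.array([1.0, 0.2]), -0.1, 0.8)]
theta = policy_gradient_step(theta, rollout)
```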

20 / 22
Why not learn all the time?

Consider a simple system

$x_{t+1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} x_t + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_t$

and an LQR-type cost with

$Q = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad R = r$

Consider the alternatives


Q-learning (what is the Q function in an LQR problem?)
▶ prior information: Markovian state
▶ good parametrization of the Q function
policy gradient
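As a rough sketch of this comparison setup, the example system and its LQR-type cost can be simulated under a linear state feedback $u_t = K x_t$; the gain, horizon, and initial state below are illustrative assumptions, not values from the slides.

```python
# Sketch: evaluate a candidate linear policy u_t = K x_t on the example system.
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
Qc = np.array([[1.0, 0.0], [0.0, 0.0]])
r = 1.0

def rollout_cost(K, x0=np.array([1.0, 0.0]), T=100):
    """Finite-horizon cost of the linear policy u_t = K x_t starting from x0."""
    x, cost = x0.copy(), 0.0
    for _ in range(T):
        u = float(K @ x)                   # linear state feedback
        cost += x @ Qc @ x + r * u**2      # stage cost x^T Q x + r u^2
        x = A @ x + B * u                  # x_{t+1} = A x_t + B u_t
    return cost

print(rollout_cost(K=np.array([-0.5, -1.0])))   # a stabilizing (not optimal) candidate gain
```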

21 / 22
[Figure: cost versus number of samples (0 to 30000) for policy gradient and random search, compared with the optimal cost.]

22 / 22
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License

https://bsaver.io/COCO
