10 - Reinforcement Learning
Reinforcement learning
Saverio Bolognani
Policy iteration and value iteration methods require either
a full (and sufficiently rich) trajectory or
a model.
Remember the difficult step in policy iteration: for a given policy π, estimate
Q^\pi(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} V^\pi(x')

where

V^\pi(x) = \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, x_0 = x \right]
In Monte Carlo learning we did this by simply evaluating it from the episode data:

Q^\pi(x_k, u_k) \approx g_k = \sum_{i=k}^{T} \gamma^{i-k} r_i
Monte Carlo learning
[Diagram: an episode x_1, u_1, r_1, x_2, u_2, r_2, x_3, u_3, r_3, ... generated under the policy, with the return observed from each step onwards]

g_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots
g_2 = r_2 + \gamma r_3 + \gamma^2 r_4 + \dots
g_3 = r_3 + \gamma r_4 + \gamma^2 r_5 + \dots

Each g_k is used as a sample of Q^\pi(x_k, u_k).
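A minimal Python sketch of this estimator (the toy episode data and the value γ = 0.9 below are made up for illustration): returns are accumulated backwards along each episode and averaged per state-action pair.

```python
import numpy as np
from collections import defaultdict

def monte_carlo_q(episodes, gamma=0.9):
    """Estimate Q^pi(x, u) by averaging the returns g_k observed
    after each visit of the pair (x_k, u_k) in the episode data.

    episodes: list of episodes, each a list of (x_k, u_k, r_k) tuples.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # walk the episode backwards: g_k = r_k + gamma * g_{k+1}
        for x, u, r in reversed(episode):
            g = r + gamma * g
            returns[(x, u)].append(g)
    # Q estimate = empirical mean of the observed returns
    return {xu: np.mean(gs) for xu, gs in returns.items()}

# hypothetical toy episode: states 'A'/'B', actions 0/1, rewards
episodes = [[('A', 0, 1.0), ('B', 1, 0.5), ('A', 1, 2.0)]]
print(monte_carlo_q(episodes))
```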
We haven't used the Bellman equation for Q
Q(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} \sum_{u'} \pi(x', u')\, Q(x', u')
(although of course the estimate of Q will satisfy that, with infinite data)
Given a sample (x_k, u_k, x_{k+1}, u_{k+1}, r_k) collected while following π, the temporal-difference (TD) error is

e_k = r_k + \gamma\, Q^\pi(x_{k+1}, u_{k+1}) - Q^\pi(x_k, u_k)

and, by the Bellman equation above,

\mathbb{E}[e_k] = 0
Stochastic approximation
Suppose that we have a random variable e(q) whose distribution depends on a parameter q, that we can only sample, and that we want to find q^* such that \mathbb{E}[e(q^*)] = 0.

Stochastic approximation
The iteration

q_{k+1} \leftarrow q_k - \alpha_k\, e(q_k)

converges to q^* under suitable conditions on e and on the step sizes \alpha_k (e.g., \sum_k \alpha_k = \infty, \sum_k \alpha_k^2 < \infty).
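As a hedged illustration, the toy Python experiment below (with a made-up e(q) whose mean is q − q*) shows the iteration converging to q* even though only noisy samples are available, using step sizes α_k = 1/k.

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = 3.0          # unknown root of E[e(q)] = 0 (here E[e(q)] = q - q_star)

def e(q):
    # noisy sample: mean q - q_star, plus zero-mean noise
    return (q - q_star) + rng.normal(scale=1.0)

q = 0.0
for k in range(1, 20001):
    alpha = 1.0 / k   # step sizes: sum alpha_k = inf, sum alpha_k^2 < inf
    q = q - alpha * e(q)

print(q)              # close to q_star = 3.0
```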
TD-learning (SARSA)
Key idea: Use the empirical evaluations of the TD error
[Diagram: along the trajectory x_1, u_1, r_1, x_2, u_2, r_2, ..., each transition yields a sample of the TD error, which is used to update the corresponding entry Q(x_k, u_k).]
TD-learning (SARSA)
To keep exploring, actions are selected ϵ-greedily with respect to the current estimate of Q. Meanwhile, allow ϵ → 0 as k → ∞.
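A minimal sketch of the resulting tabular SARSA loop, assuming a hypothetical environment interface env.reset() / env.step(u) and illustrative parameter values: each tuple (x_k, u_k, r_k, x_{k+1}, u_{k+1}) yields a TD error that drives a stochastic-approximation update of Q, while actions are chosen ϵ-greedily (greedy = argmin here, since we minimize cost) with ϵ → 0.

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, gamma=0.9, alpha=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for ep in range(episodes):
        eps = 1.0 / (ep + 1)                    # epsilon -> 0 as k -> infinity
        x = env.reset()
        u = rng.integers(n_actions) if rng.random() < eps else np.argmin(Q[x])
        done = False
        while not done:
            x_next, r, done = env.step(u)
            u_next = rng.integers(n_actions) if rng.random() < eps else np.argmin(Q[x_next])
            # TD error: e_k = r_k + gamma * Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k)
            td_error = r + gamma * Q[x_next, u_next] * (not done) - Q[x, u]
            Q[x, u] += alpha * td_error         # stochastic-approximation step
            x, u = x_next, u_next
    return Q
```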
TD-learning (value iteration)
Instead of using the temporal difference error to perform policy evaluation (via
stochastic approximation), we can directly use it to do a stochastic
approximation of the Bellman optimality principle.
Q^*(x, u) = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} Q^*(x', u')
Individual realization

Q^*(x_k, u_k) \approx r_k + \gamma \min_{u'} Q^*(x_{k+1}, u')
Notice that the expectation of the individual realization equals the right-hand side of the Bellman optimality equation.
(Can we do the same with the Bellman optimality principle on the value function?)
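A minimal tabular sketch of the resulting update, under the same assumptions on the environment interface and with illustrative constants: the sampled target r_k + γ min_{u'} Q(x_{k+1}, u') replaces the expectation in the Bellman optimality equation, while the behavior policy only needs to keep exploring.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.9, alpha=0.1, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration (cost minimization: greedy = argmin)
            u = rng.integers(n_actions) if rng.random() < eps else np.argmin(Q[x])
            x_next, r, done = env.step(u)
            # stochastic approximation of the Bellman optimality equation
            target = r + gamma * np.min(Q[x_next]) * (not done)
            Q[x, u] += alpha * (target - Q[x, u])
            x = x_next
    return Q
```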
Q-learning
Q-learning with linear approximant
Parametrize the Q-function linearly as Q_\theta(x, u) = \phi^\top(x, u)\,\theta. If the optimal Q-function were exactly representable, Q^*(x, u) = \phi^\top(x, u)\,\theta^*, the Bellman optimality equation would require that it satisfies

\phi^\top(x, u)\,\theta^* = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\,\theta^*
\phi^\top(x, u)\,\theta^* = \underbrace{R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\,\theta^*}_{Q^+}
First, notice that this equation (most likely) does not have a solution.
Loss function

\min_\theta\; \underbrace{\tfrac{1}{2}\left( \phi^\top(x, u)\,\theta - Q^+ \right)^2}_{L(\theta)}

\theta_{k+1} = \theta_k - \alpha \nabla L(\theta_k) = \theta_k - \alpha \left( \phi^\top(x, u)\,\theta_k - Q^+ \right) \phi(x, u)
Q-learning as stochastic gradient descent
\theta_{k+1} = \theta_k - \alpha \left( \phi^\top(x, u)\,\theta_k - Q^+ \right) \phi(x, u)

Replace

Q^+ = R_x^u + \gamma \sum_{x'} P^u_{xx'} \min_{u'} \phi^\top(x', u')\,\theta^*

with

r_k + \gamma \min_{u} \phi^\top(x_{k+1}, u)\,\theta_k
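A sketch of one such update with a generic feature map φ (the function signatures, the one-hot features, and the step size below are illustrative assumptions): the sample (x_k, u_k, r_k, x_{k+1}) produces the target r_k + γ min_u φ⊤(x_{k+1}, u)θ_k and one gradient step on θ.

```python
import numpy as np

def linear_q_update(theta, phi, x, u, r, x_next, actions, gamma=0.9, alpha=0.01):
    """One Q-learning step for the linear approximant Q_theta(x,u) = phi(x,u)^T theta."""
    # sampled target: r_k + gamma * min_u phi(x_{k+1}, u)^T theta_k
    target = r + gamma * min(phi(x_next, a) @ theta for a in actions)
    # gradient step on the loss 0.5 * (phi(x,u)^T theta - target)^2
    td = phi(x, u) @ theta - target
    return theta - alpha * td * phi(x, u)

# hypothetical feature map: one-hot encoding of the (state, action) pair
phi = lambda x, u: np.eye(6)[2 * x + u]   # 3 states x 2 actions
theta = np.zeros(6)
theta = linear_q_update(theta, phi, x=0, u=1, r=0.5, x_next=2, actions=[0, 1])
```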
General approximators
Much more complex approximators can be used (e.g., neural networks), as long as (see the sketch below):
they provide a parametrized approximation Qθ(x, u)
they allow minimizing Qθ(x, u) over u
they allow performing a stochastic gradient step in θ to minimize a loss function (quality of the approximation)
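These three requirements can be summarized as a small interface; the Python sketch below uses illustrative names (not taken from any specific library) for what an approximate Q-learning loop needs from the approximator.

```python
from abc import ABC, abstractmethod

class QApproximator(ABC):
    """Minimal interface required by approximate Q-learning (illustrative)."""

    @abstractmethod
    def value(self, x, u):
        """Return the parametrized approximation Q_theta(x, u)."""

    @abstractmethod
    def minimize_over_u(self, x):
        """Return (u*, Q_theta(x, u*)) minimizing Q_theta(x, u) over u."""

    @abstractmethod
    def gradient_step(self, x, u, target, alpha):
        """Take one stochastic gradient step in theta to reduce
        the loss (Q_theta(x, u) - target)^2."""
```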
www.incontrolpodcast.com
Really model free?
Policy gradient
\tau = (x_0, u_0, x_1, u_1, \dots, x_T, u_T)
Parametrized policy
Let πθ(x, u) be a stochastic policy parametrized by θ, for example
based on a linear combination of basis functions ϕ⊤θ
based on some parametric form (e.g., Gaussian)
a neural network with weights θ
Goal: minimize the expected cost
J(\theta) := \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] = \sum_{\tau} P_\theta(\tau)\, R(\tau)
where Pθ (τ ) is the probability that trajectory τ happens when the policy πθ is used.
Trajectory probability

P_\theta(\tau) = \prod_{t=0}^{T} P(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t, x_t)
As we are trying to minimize J(θ), let’s try to derive the gradient ∇J(θ).
Proposition

\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla \log P_\theta(\tau)\, R(\tau) \right]
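This identity follows from the standard log-derivative trick \nabla P_\theta(\tau) = P_\theta(\tau)\, \nabla \log P_\theta(\tau), spelled out here for completeness:

\nabla J(\theta) = \nabla \sum_{\tau} P_\theta(\tau)\, R(\tau) = \sum_{\tau} \nabla P_\theta(\tau)\, R(\tau) = \sum_{\tau} P_\theta(\tau)\, \nabla \log P_\theta(\tau)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla \log P_\theta(\tau)\, R(\tau) \right]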
\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla \log P_\theta(\tau)\, R(\tau) \right]
What is ∇ log Pθ (τ )?
Proposition

\nabla \log P_\theta(\tau) = \sum_{t=0}^{T} \nabla \log \pi_\theta(u_t, x_t)
Amazing result!
no more dependence on the transition probabilities
depending on the parametrization of πθ , ∇ log πθ is known
we need a trajectory τ to compute it
Proof:
\nabla \log P_\theta(\tau) = \nabla \log \prod_{t=0}^{T} P(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t, x_t)

= \nabla \left( \sum_{t=0}^{T} \log P(x_{t+1} \mid x_t, u_t) + \sum_{t=0}^{T} \log \pi_\theta(u_t, x_t) \right) = \sum_{t=0}^{T} \nabla \log \pi_\theta(u_t, x_t)

where the first sum drops out because it does not depend on θ.
Putting things together
Instead of computing
\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla \log P_\theta(\tau)\, R(\tau) \right]
we use a sample
\nabla \log P_\theta(\hat{\tau})\, R(\hat{\tau})
where τ̂ comes from the distribution τ ∼ πθ .
How do we sample this distribution?
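A sketch of one way to produce this sample (the softmax parametrization, the environment interface, and the horizon below are illustrative assumptions): roll out πθ on the system, accumulate ∇ log πθ(u_t, x_t) along the trajectory, and weight the sum by the observed cost R(τ̂).

```python
import numpy as np

def softmax_policy(theta, x):
    """pi_theta(x, .) as a softmax over one parameter per (state, action)."""
    logits = theta[x]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def policy_gradient_sample(env, theta, n_actions, T=100, rng=np.random.default_rng(0)):
    """Return the sample estimate grad log P_theta(tau_hat) * R(tau_hat)."""
    grad_log = np.zeros_like(theta)
    total_cost = 0.0
    x = env.reset()
    for _ in range(T):
        p = softmax_policy(theta, x)
        u = rng.choice(n_actions, p=p)      # sample u_t ~ pi_theta(x_t, .)
        # gradient of log softmax: one-hot(u) - p, in the row of the visited state
        grad_log[x, u] += 1.0
        grad_log[x] -= p
        x, cost, done = env.step(u)
        total_cost += cost
        if done:
            break
    return grad_log * total_cost            # unbiased estimate of grad J(theta)

# gradient step (minimizing the expected cost): theta <- theta - alpha * estimate
```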
Why not learn all the time?
[Figure: cost versus number of samples (0 to 30000), comparing policy gradient and random search against the optimal cost; the cost axis ranges roughly from 5 to 10.]
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO