Policy Gradient Methods
Week 6 @ spring 2019
Yandex Research
Small experiment
Left or right?
Approximation error
DQN is trained to minimize

    L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world:

                 True   (A)   (B)
    Q(s0, a0)      1     1     2
    Q(s0, a1)      2     2     1
    Q(s1, a0)      3     3     3
    Q(s1, a1)    100    50   100

Q: Which prediction is better (A/B)?
(A) has the better policy; (B) has the smaller MSE.
Q-learning will prefer the worse policy (B)!
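To make the point concrete, here is a minimal numpy check of the table above (the `mse` and `greedy_policy` helpers are just for illustration):

import numpy as np

# Q-values from the 2-state example: rows are states s0, s1; columns are actions a0, a1
q_true = np.array([[1., 2.], [3., 100.]])
q_a    = np.array([[1., 2.], [3.,  50.]])
q_b    = np.array([[2., 1.], [3., 100.]])

def mse(q_pred, q_ref):
    return np.mean((q_pred - q_ref) ** 2)

def greedy_policy(q):
    return q.argmax(axis=1)            # best action in each state

print(mse(q_a, q_true), mse(q_b, q_true))   # 625.0 vs 0.5 -- (B) wins on MSE
print(greedy_policy(q_true))                # [1 1]
print(greedy_policy(q_a))                   # [1 1] -- same actions as the true Q
print(greedy_policy(q_b))                   # [0 1] -- wrong action in s0 despite tiny MSE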
Conclusion
How humans survived:

    π(run | s) = 1
Policies
In general, two kinds:
● Deterministic policy:   a = π_θ(s)
● Stochastic policy:      a ∼ π_θ(a|s)

Q: Any case where stochastic is better?
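A tiny illustration of the difference for a discrete action space (`logits` stands in for the output of some policy network):

import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])                 # stand-in for a policy network's output

a_deterministic = int(np.argmax(logits))           # always the same action for this state

probs = np.exp(logits) / np.exp(logits).sum()      # softmax -> π_θ(a|s)
a_stochastic = rng.choice(len(probs), p=probs)     # a ∼ π_θ(a|s); different runs may differ

print(a_deterministic, a_stochastic, probs.round(3))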
Recap: crossentropy method
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate
  – Maximize the probability of the elite actions
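A compact sketch of that loop; `sample_session` and `policy.fit` are illustrative names, not a fixed API:

import numpy as np

def crossentropy_step(policy, sample_session, n_sessions=100, percentile=70):
    """One iteration of the crossentropy method (sketch).

    sample_session(policy) plays one episode and returns (states, actions, total_reward).
    """
    sessions = [sample_session(policy) for _ in range(n_sessions)]
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, percentile)

    # elite = best sessions, concatenated
    elite_states, elite_actions = [], []
    for states, actions, r in sessions:
        if r >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)

    # maximize probability of elite actions: a plain supervised step
    policy.fit(elite_states, elite_actions)
    return rewards.mean()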
Why so complicated?
We'd rather maximize reward directly!
Objective
Expected reward:

    J = E_{s∼p(s), a∼π_θ(a|s), ...} [ R(s, a, s', a', ...) ]

Equivalently, in terms of the return G:

    J = E_{s∼p(s), a∼π_θ(a|s)} [ G(s, a) ]
Objective
Consider a 1-step process for simplicity. Sample N sessions under the current policy:

    J ≈ (1/N) · Σ_{i=0}^{N} R(s_i, a_i)

For full sessions z_i, sum the rewards inside each sampled session:

    J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} R(s, a)
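This estimate is easy to compute directly; a minimal sketch, reusing the illustrative `sample_session` signature from the crossentropy sketch above:

import numpy as np

def estimate_objective(policy, sample_session, n_sessions=100):
    """Monte-Carlo estimate of J: average total reward over N sampled sessions."""
    total_rewards = [sample_session(policy)[2] for _ in range(n_sessions)]
    return np.mean(total_rewards)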
Optimization
Finite differences
– Change the policy a little, evaluate:

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of "elite" actions
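A sketch of the finite-difference estimate, reusing the illustrative `estimate_objective` above; note it needs one (noisy) evaluation per parameter, which is exactly why we would rather have an analytical gradient:

import numpy as np

def finite_difference_gradient(theta, evaluate_J, eps=1e-2):
    """Estimate ∇J by perturbing each policy parameter separately.

    theta       -- 1-D array of policy parameters
    evaluate_J  -- function: parameters -> (estimated) expected reward J
    """
    grad = np.zeros_like(theta)
    J_base = evaluate_J(theta)
    for k in range(len(theta)):
        theta_eps = theta.copy()
        theta_eps[k] += eps
        grad[k] = (evaluate_J(theta_eps) - J_base) / eps
    return grad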
Objective
Wish list:
– Analytical gradient
– Easy/stable approximations
Log-derivative trick
Simple math:

    ∇ log π(z) = (1/π(z)) · ∇π(z)
Policy gradient
Analytical inference:

    ∇J = E_{s∼p(s), a∼π_θ(a|s)} [ ∇ log π_θ(a|s) · R(s, a) ]

Q: Anything curious about that formula?
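A sketch of where that formula comes from (1-step case; full sessions work the same way), using the log-derivative trick from the previous slide:

    ∇J = ∇ ∫ p(s) · π_θ(a|s) · R(s, a) ds da
       = ∫ p(s) · ∇π_θ(a|s) · R(s, a) ds da
       = ∫ p(s) · π_θ(a|s) · ∇ log π_θ(a|s) · R(s, a) ds da
       = E_{s∼p(s), a∼π_θ(a|s)} [ ∇ log π_θ(a|s) · R(s, a) ]

The curious part: the gradient is itself an expectation under the current policy, so it can be estimated by plain sampling of sessions, which is exactly what REINFORCE does next.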
REINFORCE algorithm
● Loop:
  – Sample N sessions z under the current policy
  – Evaluate the policy gradient:

        ∇J ≈ (1/N) · Σ_{i=0}^{N} ∇ log π_θ(a|s) · R(s, a)
REINFORCE algorithm
We can estimate Q using G:

[figure: a trajectory (prev s, prev a) → (s, a) → (s', a') → (s'', a'') with rewards r, r', r'', r''' along the transitions]
Recap: discounted rewards

    G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
        = r_t + γ·(r_{t+1} + γ·r_{t+2} + ...)
        = r_t + γ·G_{t+1}
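The recursion gives a one-pass way to compute all returns from a list of rewards, going backwards (a minimal sketch):

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step to the first."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

# discounted_returns([1, 0, 0, 2], gamma=0.5) -> [1.25, 0.5, 1.0, 2.0]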
REINFORCE algorithm
● Loop:
  – Sample N sessions z under the current π_θ(a|s)
  – Evaluate the policy gradient:

        ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Take a gradient ascent step on θ
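Putting the pieces together: a minimal REINFORCE training step sketched with PyTorch and a Gym-style environment (newer API where `reset` returns `(obs, info)` and `step` returns a 5-tuple). The network size, learning rate and environment are illustrative assumptions, not the exact seminar code:

import numpy as np
import torch
import torch.nn as nn
import gym

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def play_session(max_steps=500):
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    for _ in range(max_steps):
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        next_s, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = next_s
        if terminated or truncated:
            break
    return states, actions, rewards

def reinforce_step():
    states, actions, rewards = play_session()

    # returns G_t via the recursion from the "discounted rewards" slide
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)

    logits = policy(torch.as_tensor(np.array(states), dtype=torch.float32))
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_probs[torch.arange(len(actions)), torch.tensor(actions)]

    loss = -(log_pi_a * returns).mean()     # maximize J  <=>  minimize -J
    opt.zero_grad(); loss.backward(); opt.step()
    return sum(rewards)                     # session return, useful for monitoring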
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
● Gradient direction ∇J stays the same (in expectation)
● Variance may change

A popular choice is b(s) = V(s):

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · (Q(s, a) − V(s))
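Why a baseline doesn't bias the gradient (for any b(s) that does not depend on the action):

    E_{a∼π_θ(a|s)} [ ∇ log π_θ(a|s) · b(s) ]
        = b(s) · Σ_a π_θ(a|s) · ∇ log π_θ(a|s)
        = b(s) · Σ_a ∇π_θ(a|s)
        = b(s) · ∇ Σ_a π_θ(a|s)
        = b(s) · ∇1 = 0

So subtracting b(s) changes nothing in expectation, while a good choice like b(s) ≈ V(s) centers the returns and usually reduces the variance of the estimate.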
Actor-critic
● Learn both V(s) and π_θ(a|s)
● Hope for the best of both worlds :)
Advantage actor-critic

[figure: a single model with parameters W takes state s and outputs both π_θ(a|s) and V_θ(s)]

Improve the policy (actor), with A(s, a) = Q(s, a) − V(s):

    ∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)

Improve the value estimate (critic):

    L_critic ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])²
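A sketch of both updates with a shared two-headed network; the architecture, the detached targets and the helper names below are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """One body, two heads: action logits (actor) and state value (critic)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # -> logits of π_θ(a|s)
        self.value_head = nn.Linear(hidden, 1)            # -> V_θ(s)

    def forward(self, states):
        h = self.body(states)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_losses(model, states, actions, rewards, next_states, done, gamma=0.99):
    logits, values = model(states)
    with torch.no_grad():
        _, next_values = model(next_states)
        target = rewards + gamma * next_values * (1 - done)   # r + γ·V(s')

    advantage = (target - values).detach()                    # A(s,a) estimate
    log_pi_a = F.log_softmax(logits, dim=-1)[torch.arange(len(actions)), actions]

    actor_loss = -(log_pi_a * advantage).mean()               # maximize J_actor
    critic_loss = F.mse_loss(values, target)                  # (V_θ(s) - [r + γ·V(s')])²
    return actor_loss, critic_loss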
Continuous action spaces
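The slide's own details are not in the extracted text; as a hedged illustration, a common way to apply the same policy gradient machinery to continuous actions is a Gaussian policy whose mean comes from the network, so that log π_θ(a|s) stays differentiable:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """π_θ(a|s) = Normal(μ_θ(s), σ_θ); log π is differentiable in θ."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std, a common choice

    def forward(self, states):
        return torch.distributions.Normal(self.mu(states), self.log_std.exp())

# usage inside REINFORCE / actor-critic:
# dist = policy(states); actions = dist.sample()
# log_pi_a = dist.log_prob(actions).sum(-1)     # sum over action dimensions
# loss = -(log_pi_a * advantage).mean()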
Asynchronous advantage actor-critic (A3C)
● LSTM policy
● N-step advantage
● No experience replay

Read more: https://arxiv.org/abs/1602.01783
IMPALA
● Massively parallel
● Separate actor / learner processes
● Small experience replay w/ importance sampling

Read more: https://arxiv.org/abs/1802.01561
Duct tape zone
● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just more slowly
Let's code!