Practical RL
@spring '18
Terms
● Ask!
– Even if the question feels stupid.
– Chances are, half of the group is just like you.
– If it's necessary, interrupt the speaker.
● Contribute!
– Found an error? Got a useful link? Ported the seminar from py2 to py3? Answered a peer's question in the chat?
– You're awesome!
<a convenient slide for public survey>
Supervised learning
Given:
● objects and answers (x, y), e.g. [banner, page], ctr
● algorithm family a_θ(x) → y: linear / tree / NN
● loss function L(y, a_θ(x)): MSE, crossentropy
Find: θ that minimizes the loss
Online Ads
Great... except we have no reference answers
We have:
● YouTube at your disposal
● Live data stream (banner & video features, #clicked)
● (insert your favorite ML toolkit)
We want:
● Learn to pick relevant ads
Ideas?
Duct tape approach
Common idea:
● Initialize with naïve solution
● Get data by trial and error and error and error and error
● Repeat
8
Giant Death Robot (GDR)
Great... except we have no reference answers
We have:
● Evil humanoid robot
● A lot of spare parts to repair it :)
We want:
● Enslave humanity
● Learn to walk forward
Ideas?
Duct tape approach (again)
Common idea:
● Initialize with naïve solution
● Get data by trial and error and error and error and error
● Repeat
10
Duct tape approach
11
Problems
Problem 1:
● What exactly does the “optimal action” mean?
Extract as much money as you can right now
VS
Make the user happy so that they visit you again
Problems
Problem 2:
● If you always follow the “current optimal” action, you never explore the alternatives.
Ideas?
Duct tape approach
14
Reinforcement learning
15
What-what learning?
● Supervised / unsupervised learning: the model does not affect the input data
● Reinforcement learning: the agent can affect its own observations
What is: bandit
The agent receives an observation, picks an action, gets feedback.
Examples:
– banner ads (RTB)
– recommendations
– medical treatment
Q: what are the observation, action and feedback in the banner ads problem?
Observation: user features, time of year, trends
Action: show a banner
Feedback: click, money
What is: bandit
Q: You're Yandex/Google/YouTube. There's a kind of banner that would get great click rates: the “clickbait”.
What is: decision process
[Figure: agent-environment loop. The agent (can do anything) sends actions; the environment (worst case: can't even see into it) returns observations.]
Reality check: web
● Cases:
● Pick ads to maximize profit
● Example
● Observation – user features
26
Reality check: dynamic systems
27
Reality check: dynamic systems
● Cases:
● Robots
● Self-driving vehicles
● Pilot assistant
● More robots!
● Example
● Observation: sensor feed
28
Reality check: videogames
29
● Q: What are observations, actions and feedback?
Reality check: videogames
30
● Q: What are observations, actions and feedback?
Other use cases
● Personalized medical treatment
31
● Q: What are observations, actions and feedback?
Other use cases
● Conversation systems
– learning to make user happy
● Quantitative finance
– portfolio management
● Deep learning
– optimizing non-differentiable loss
– finding optimal architecture
32
The MDP formalism
a ∈A
s ∈S
P(s t+ 1∣s t , at )
a ∈A
s ∈S
P(s t+ 1∣s t , at )
π (a∣s): E π [R ]→max 36
Objective
How do we solve it?
General idea: Repeat
Crossentropy method
Initialize policy
Repeat:
– Sample N (e.g. 100) sessions
– Select the M best sessions (elites) and fit the policy to them
Step-by-step view
[Figure: step-by-step animation of the crossentropy method; the best sessions are selected as elites.]
Tabular crossentropy method
● Policy is a matrix: π(a|s) = A_{s,a}
● Update: A_{s,a} = (times took a at s in the M best games) / (times was at s in the M best games)
Grim reality
55
Approximate crossentropy method
● Policy is approximated
– Neural network predicts π_W(a|s) given s
– Linear model / Random Forest / ...
● Fit to elites: π = argmax_π Σ_{s_i, a_i ∈ Elite} log π(a_i|s_i)
Approximate crossentropy method
● Initialize NN weights W_0 ← random
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– nn.fit(elite_states, elite_actions)
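Below is a minimal sketch of this loop on CartPole, using scikit-learn's MLPClassifier as the policy (the environment name, hyperparameters, and the old gym step API returning 4-tuples are assumptions for illustration):

```python
# Approximate crossentropy method: sample sessions, keep the elites, fit the policy to them.
import gym
import numpy as np
from sklearn.neural_network import MLPClassifier

env = gym.make("CartPole-v0")          # assumes old gym API: step -> (obs, reward, done, info)
n_actions = env.action_space.n

# warm_start + max_iter=1 makes each .fit() call one incremental training pass
agent = MLPClassifier(hidden_layer_sizes=(20, 20), activation="tanh",
                      warm_start=True, max_iter=1)
agent.fit([env.reset()] * n_actions, list(range(n_actions)))   # dummy init so all classes are known

def generate_session(t_max=1000):
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        probs = agent.predict_proba([s])[0]           # pi(a|s)
        a = np.random.choice(n_actions, p=probs)      # sample an action
        new_s, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward

for i in range(50):
    sessions = [generate_session() for _ in range(100)]        # sample N sessions
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, 70)                      # elites = top 30%
    elite_states = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])
    agent.fit(elite_states, elite_actions)                      # maximize log pi(a|s) on elites
    print(f"iter {i}: mean reward {rewards.mean():.1f}")
```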
Continuous action spaces
● Continuous state space
● Model π_W(a|s) = N(μ(s), σ²)
– μ(s) is the neural network output
– σ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α ∇_{W_i} [ Σ_{s_i, a_i ∈ Elite} log π_W(a_i|s_i) ]
What changed?
Approximate crossentropy method
● Initialize NN weights: nn = MLPRegressor(...)
● Loop:
– Sample N sessions
– nn.fit(elite_states, elite_actions)
Almost nothing!
Tricks
● Remember sessions from 3-5 past iterations
– Threshold and use all of them when training
– May converge slower if env is easy to solve.
● Parallelize sampling
● Use RNNs if partially-observable (later)
Reinforcement learning
Episode 1
1
Recap: reinforcement learning
[Figure: agent-environment loop: observations in, actions out.]
Recap: MDP
Feedback (Monte-Carlo)
Whole session: z = [s_0, a_0, s_1, a_1, s_2, a_2, ..., s_n, a_n]
Deterministic policy:
● Find the policy with the highest expected reward
π(s) → a : E[R] → max
Black box optimization setup
[Figure: the agent's policy parameters go into a black box (environment + rollout), which returns the session reward R(s_0, a_0, s_1, a_1, ..., s_n, a_n).]
Today's menu
Evolution strategies
– A general black box optimization
– Easy to implement & scale
Crossentropy method
– A general method with special case for RL
– Works remarkably well in practice
10
Evolution Strategies
● Sample parameters θ ~ N(θ | μ, σ²) (any other P(θ) will work as well)
● Maximize expected reward J = E_{θ ~ N(θ|μ,σ²)} R
Evolution Strategies
– Expected reward (as a mathematical expectation), with θ sampled from N(θ|μ,σ²):
J = ∫ N(θ|μ,σ²) · ∫ P(s, a, s', a', ...) R(s, a, s', a', ...)
– What we need:
∇J = ∫ ∇[N(θ|μ,σ²)] · ∫ P(s, a, s', a', ...) R(s, a, s', a', ...)
∇ log f(x) = ???
Logderivative trick
Simple math:
∇ log f(x) = (1 / f(x)) · ∇ f(x)
Analytical inference:
∇J = ∫ ∇[N(θ|μ,σ²)] · ∫ P(s, a, s', a', ...) R(s, a, s', a', ...)
∇ N(θ|μ,σ²) = N(θ|μ,σ²) · ∇ log N(θ|μ,σ²)
Evolution Strategies
Analytical inference (over sessions s, a, ...).
Sampled estimate, with θ_i sampled from N(θ|μ,σ²):
∇J ≈ (1/N) Σ_{i=0}^{N} ∇ log N(θ_i|μ,σ²) · Σ_{s,a ∈ z_i} R(s, a, ...)
Evolution strategies: algorithm
1. Initialize μ_0, σ²_0
2. Forever:
– Sample θ_i ~ N(θ|μ,σ²), run sessions z_i, estimate
∇J ≈ (1/N) Σ_{i=0}^{N} ∇ log N(θ_i|μ,σ²) · Σ_{s,a ∈ z_i} R(s, a, ...)
– Update: μ ← μ + α·∇_μ J,  σ² ← σ² + α·∇_{σ²} J
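A minimal sketch of this algorithm on a toy black-box objective (the reward function, fixed σ, and hyperparameters are illustrative; in RL the reward would be the total session return of a policy with parameters θ):

```python
# Evolution strategies: estimate grad J from sampled parameters and their rewards.
import numpy as np

def reward(theta):
    # illustrative black box: maximized at theta = [3, -1]
    return -np.sum((theta - np.array([3.0, -1.0])) ** 2)

mu = np.zeros(2)          # mean of the search distribution N(theta | mu, sigma^2)
sigma = 0.5               # fixed std for simplicity (the slides also adapt sigma)
alpha = 0.03              # learning rate
N = 100                   # population size per iteration

for it in range(200):
    eps = np.random.randn(N, mu.size)                 # theta_i = mu + sigma * eps_i
    R = np.array([reward(mu + sigma * e) for e in eps])
    # grad_mu log N(theta|mu,sigma^2) = (theta - mu)/sigma^2 = eps/sigma,
    # so the sampled gradient estimate is mean_i [ R_i * eps_i / sigma ]
    grad_mu = (R[:, None] * eps).mean(axis=0) / sigma
    mu = mu + alpha * grad_mu
print("found:", mu)       # should approach [3, -1]
```

In practice a baseline (e.g. subtracting the mean reward) is usually added to reduce the variance of this estimate.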
Evolution strategies
Features
– A general black box optimization
– Needs a lot of samples
– Easy to implement & scale
26
Evolution strategies
Features
– A general black box optimization
– Easy to implement & scale
∇J ≈ (1/N) Σ_{i=0}^{N} ∇ log N(θ_i|μ,σ²) · Σ_{s,a ∈ z_i} R(s, a, ...)
https://blog.openai.com/evolution-strategies/ 28
Today's menu
Evolution strategies
– A general black box optimization
– Requires a lot of sampling
Crossentropy method
– A general method with special case for RL
– Works remarkably well in practice
29
Estimation problem
● You want to estimate E_{x~p(x)} H(x)
Estimation is not a problem
Ideas?
Estimation in the wild
● You want to estimate profits!
Estimation in the wild
● Sampling = asking users to pass a survey
● Usually costs money!
● Guess H(median russian gamer)?
● H(median russian gamer) ~ 0; it's H(hard-core donators) that matters!
● Most H(x) are small, few are very large
● Say 99% of H(x) = 0, 1% of H(x) = $1000 (a whale), and you survey N = 50 people
● Naive Monte-Carlo: ∫ p(x)·H(x) dx ≈ (1/N) Σ_{x_k ~ p(x)} H(x_k)
Importance sampling
● Math:
E_{x~p(x)} H(x) = ∫ p(x)·H(x) dx = ∫ p(x)·(q(x)/q(x))·H(x) dx =
= ∫ q(x)·(p(x)/q(x))·H(x) dx = E_{x~q(x)} (p(x)/q(x))·H(x)
● TL;DR:
E_{x~p(x)} H(x) = E_{x~q(x)} (p(x)/q(x))·H(x)
● Sampled estimate:
E_{x~p(x)} H(x) ≈ (1/N) Σ_{x_k ~ p(x)} H(x_k) ≈ (1/N) Σ_{x_k ~ q(x)} (p(x_k)/q(x_k))·H(x_k)
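A tiny numeric check of this identity on the "whale" example above (the whale fraction, survey distribution and sample size are made-up illustrative numbers):

```python
# Importance sampling: sample from q(x), reweight by p(x)/q(x) to estimate E_{x~p} H(x).
import numpy as np

rng = np.random.default_rng(0)
p_whale = 0.01          # true population p(x): 1% whales
q_whale = 0.50          # survey distribution q(x): deliberately over-sample likely whales
N = 50

# naive Monte-Carlo: sample straight from p(x)
x_p = rng.random(N) < p_whale
naive = np.mean(np.where(x_p, 1000.0, 0.0))

# importance sampling: sample from q(x), reweight by p(x)/q(x)
x_q = rng.random(N) < q_whale
weights = np.where(x_q, p_whale / q_whale, (1 - p_whale) / (1 - q_whale))
is_est = np.mean(weights * np.where(x_q, 1000.0, 0.0))

print("true value:", 1000.0 * p_whale)   # $10 per user
print("naive estimate:", naive)          # often exactly 0 with N=50
print("importance-sampled estimate:", is_est)
```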
Importance sampling
● Idea: we may know that all whales are
– 30-40 years old
– single
– wage > 100k
● Sample from a different q(x)
● Adjust for the difference in distributions
What's q(x)?
Crossentropy method
● Pick q(x) ~ p(x)·H(x)
[Figure: q(x) concentrated where p(x)·H(x) is large.]
Crossentropy method
● Minimize the difference between q(x) and p(x)·H(x)
● Kullback-Leibler divergence:
KL(p_1(x) ∥ p_2(x)) = E_{x~p_1(x)} log (p_1(x)/p_2(x)) =
= E_{x~p_1(x)} log p_1(x) − E_{x~p_1(x)} log p_2(x)
(the first term is the negative entropy of p_1 and is constant w.r.t. p_2(x); the second term is the crossentropy)
Crossentropy method
● Pick q(x) to minimize the crossentropy
Iterative approach
● Start with q_0(x) = p(x)
● Iteration:
q_{i+1}(x) = argmin_{q_{i+1}} − E_{x~q_i(x)} [ (p(x)/q_i(x)) · H(x) · log q_{i+1}(x) ]
Finally, reinforcement learning
● Objective: H(x) = [R > threshold]
● p(x) = uniform
● Threshold = M'th (e.g. 50th) percentile of R
Tabular crossentropy method
● Policy is a matrix: π(a|s) = A_{s,a}
● Update: A_{s,a} = (times took a at s in the M best games) / (times was at s in the M best games)
Smoothing
● If you visited some state only once, you will always repeat whatever action you took there.
● Apply smoothing, e.g. add a small constant to the counts.
Approximate (deep) version
● Policy is approximated
– Neural network predicts π_W(a|s) given s
– Linear model / Random Forest / ...
● Fit to elites: π = argmax_π Σ_{s_i, a_i ∈ Elite} log π(a_i|s_i)
● Initialize NN weights W_0 ← random (model = MLPClassifier())
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α ∇_{W_i} [ Σ_{s_i, a_i ∈ Elite} log π_W(a_i|s_i) ]  (model.fit(elite_states, elite_actions))
Continuous action spaces
● Continuous state space (model = MLPRegressor())
● Model π_W(a|s) = N(μ(s), σ²)
– μ(s) is the neural network output
– σ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α ∇_{W_i} [ Σ_{s_i, a_i ∈ Elite} log π_W(a_i|s_i) ]  (model.fit(elite_states, elite_actions))
What changed? Nothing!
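A minimal sketch of the continuous-action version with a Gaussian policy: a regressor predicts μ(s), σ is kept fixed, and fitting the regressor on elite (state, action) pairs maximizes log N(a | μ(s), σ²) for that fixed σ. Environment id, σ, and hyperparameters are illustrative, and the old gym step API is assumed:

```python
# Crossentropy method with pi_W(a|s) = N(mu(s), sigma^2) for continuous actions.
import gym
import numpy as np
from sklearn.neural_network import MLPRegressor

env = gym.make("Pendulum-v0")       # name depends on gym version (Pendulum-v1 in newer gym)
sigma = 0.5                         # fixed exploration std; could also be learned

model = MLPRegressor(hidden_layer_sizes=(32, 32), warm_start=True, max_iter=1)
model.fit([env.reset()], [np.zeros(env.action_space.shape[0])])   # dummy init

def generate_session(t_max=200):
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        mu = np.atleast_1d(model.predict([s])[0])     # mean of pi_W(a|s)
        a = np.random.normal(mu, sigma)               # sample a ~ N(mu(s), sigma^2)
        a = np.clip(a, env.action_space.low, env.action_space.high)
        new_s, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward

for i in range(100):
    sessions = [generate_session() for _ in range(50)]
    rewards = np.array([r for *_, r in sessions])
    threshold = np.percentile(rewards, 70)
    elite_states = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])
    model.fit(elite_states, elite_actions)            # maximize log N(a | mu(s), sigma^2) on elites
```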
Tricks
● Remember sessions from 3-5 past iterations
– Threshold and use all of them when training
– May converge slower if env is easy to solve.
● Parallelize sampling
● Use RNNs if partially-observable 76
Monte-carlo: upsides
● Great for short episodic problems
● Very modest assumptions
– Easy to extend to continuous actions, partial
observations and more
77
Monte-carlo: downsides
● Need full session to start learning
● Require a lot of interaction
– A lot of crashed robots / simulations
78
Gonna fix that next lecture!
● Need full session to start learning
● Require a lot of interaction
– A lot of crashed robots / simulations
79
Seminar
80
Practical RL – Week 2
Shvechikov Pavel
Previously in the course
1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
Explaining goals to the agent through reward
● immediate reward
● discount factor
Discounting makes sums finite
Any discounting changes the optimisation task and its solution!
Discounting is inherent to humans
● Quasi-hyperbolic
● Hyperbolic discounting
Mathematical convenience
Laibson, D. (1997). Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2), 443-478.
Discounting is a stationary end-of-effect model
Any action affects (1) the immediate reward and (2) the next state.
An action indirectly affects future rewards, but how long does this effect last?
[Figure: at each step the effect either continues (with some probability) or ends.]
[Figure: a small MDP with states S1 (Start), S2, S3, S4 (End) and rewards -1 and -9 on the transitions.]
Reward design – don't shift, reward for WHAT you want
● E.g. chess: rewarding the value of taken opponent pieces
○ Problem: the agent will not have a desire to win!
Take away: do not subtract the mean from rewards.
[Figure: the same small MDP with rewards -9 and -1.]
Reward design – scaling, shaping
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML (Vol. 99, pp. 278-287).
Reward design – scaling, shaping
1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
How to find optimal policy?
Dynamic programming!
Environment stochasticity vs. policy stochasticity
By definition, v(s) is the expected return under the policy; intuition: the value of following the policy from state s.
Bellman expectation equations
● Bellman expectation equation for v(s) (with its backup diagram)
● Action-value function q(s, a) and its backup diagram
● Bellman expectation equation for q(s, a) and its backup diagram
What are we going to do with value functions?
Already know:
● Bellman equations – assess policy performance
● Return, value- and action-value functions
Want to find an optimal policy:
● optimal actions in each possible state
[Figure: backup diagram with max nodes (Bellman optimality).]
1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
Policy evaluation
If you can't measure it, you can't improve it.
Peter Drucker
Bellman expectation
equation for v(s)
Policy improvement
Policy improvement: an idea
then it is optimal !
Policy improvement: convergence
1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
The idea of policy and value iterations
Policy evaluation
Policy improvement
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Policy improvement
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Robustness:
● No dependence on initialization
● No need for complete policy evaluation (all states / full convergence)
● No need for exhaustive updates (all states)
○ Example of update robustness:
■ Update only one state at a time
■ in a random direction
■ that is correct only in expectation
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Policy improvement
Policy iteration
1. Evaluate policy until convergence (with some tolerance)
2. Improve policy
Value iteration
1. Evaluate policy only with single iteration
2. Improve policy
Policy iteration
Policy iteration: scheme
Bellman expectation
equation for v(s)
q(s,a)
Value iteration
Value iteration
Bellman optimality
equation for v(s)
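A minimal sketch of value iteration on a toy MDP; the transition table and rewards below are made up for illustration (in practice they come from the known model P(s'|s,a)):

```python
# Value iteration: repeat the Bellman optimality backup until V stops changing.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward (toy numbers)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 0], P[0, 0, 1] = 0.5, 0.5
P[0, 1, 2] = 1.0
P[1, :, 2] = 1.0
P[2, :, 2] = 1.0                      # state 2 is absorbing
R = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [0.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P.dot(V)          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)             # greedy policy w.r.t. the converged values
print("V* =", V, "policy =", policy)
```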
Value iteration (VI) vs. Policy iteration (PI)
1
Previously...
2
Decision process in the wild
[Figure: agent-environment loop. The agent (can do anything) sends actions; the environment (worst case: can't even see into it) returns observations.]
Model-free setting:
We don't know actual
P(s',r|s,a)
Whachagonnado?
5
Model-free setting:
We don't know actual
P(s',r|s,a)
Learn it?
Get rid of it?
6
More new letters
To sum up
Q*(s, a) = E_{s',r} [ r(s, a) + γ·V*(s') ]
V*(s) = max_a Q*(s, a)
Image: cs188x
Action value Qπ(s,a) is the expected total reward G the agent gets from state s by taking action a and following policy π from the next state onward.
● A trajectory is a sequence of
– states (s)
– actions (a)
– rewards (r)
Q: What should we learn, V(s) or Q(s,a)?
V(s) is useless without P(s'|s,a).
Idea 1: monte-carlo
● Get all trajectories containing particular (s,a)
● Estimate G(s,a) for each trajectory
● Average them to get expectation
takes a lot of sessions
17
18
Idea 2: temporal difference
● Remember we can improve Q(s,a) iteratively!
● Problem: the target contains an expectation over next states, which we don't have. What do we do?
● Replace the expectation with sampling:
E_{r_t, s_{t+1}} [ r_t + γ·max_{a'} Q(s_{t+1}, a') ] ≈ (1/N) Σ_i [ r_i + γ·max_{a'} Q(s'_i, a') ]
● Works on a sequence of
– states (s)
– actions (a)
– rewards (r)
Q-learning
[Figure: a chain of transitions ... s → a → (r, s') → a' → s'' ..., with Q(s', a_0), Q(s', a_1), Q(s', a_2) shown for the possible next actions.]
● Initialize Q(s,a) with zeros
● Loop:
– Sample <s, a, r, s'> from the environment
– Compute the target Q̂(s,a) = r + γ·max_{a'} Q(s', a')
– Update Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a)
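A minimal sketch of this loop with ε-greedy exploration; the environment name and hyperparameters are illustrative, and the old gym step API (4-tuple) is assumed:

```python
# Tabular Q-learning with epsilon-greedy exploration.
import gym
import numpy as np
from collections import defaultdict

env = gym.make("Taxi-v3")
n_actions = env.action_space.n
Q = defaultdict(lambda: np.zeros(n_actions))    # Q(s,a) initialized with zeros

alpha, gamma, epsilon = 0.5, 0.99, 0.1

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: random action with probability epsilon, else greedy
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # Q(s,a) <- alpha * (r + gamma * max_a' Q(s',a')) + (1 - alpha) * Q(s,a)
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s][a] = alpha * target + (1 - alpha) * Q[s][a]
        s = s_next
```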
Recap
Monte-Carlo: estimate Q(s,a) by averaging returns over whole sessions. Temporal Difference: bootstrap from Q(s', ·) one step ahead.
Nuts and bolts: MC vs TD
Ideas?
Exploration Vs Exploitation
Balance between using what you learned and trying to find
something even better
33
Exploration Vs Exploitation
Strategies:
• ε-greedy
• With probability ε take random action;
34
Exploration Vs Exploitation
Strategies:
• ε-greedy
• With probability ε take random action;
Idea:
If you want to converge to optimal policy,
you need to gradually reduce exploration
Example:
37
Picture from Berkeley CS188x
Cliff world
Conditions:
• Q-learning
• γ = 0.99, ε = 0.1
• no slipping
Trivia: What will Q-learning learn?
Answer: follow the short path.
Generalized update rule
Update rule (from the Bellman equation):
Q(s_t, a_t) ← α·Q̂(s_t, a_t) + (1−α)·Q(s_t, a_t)
where Q̂(s_t, a_t) is a “better Q(s,a)” estimate.
Q-learning VS SARSA
They differ in how the target Q̂ is computed:
SARSA: Q̂(s, a) = r(s, a) + γ·E_{a'~π(a'|s')} Q(s', a')
Recap: Q-learning
∀s ∈ S, ∀a ∈ A: Q(s,a) ← 0
Loop:
– Sample <s, a, r, s'> from the environment
– Update Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a), with Q̂(s,a) = r + γ·max_{a'} Q(s', a')
SARSA
∀s ∈ S, ∀a ∈ A: Q(s,a) ← 0
Loop:
– Sample <s, a, r, s', a'> from the environment (hence “SARSA”)
– Update Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a), with Q̂(s,a) = r + γ·Q(s', a') (a single sample of the expectation above)
Expected value SARSA
∀s ∈ S, ∀a ∈ A: Q(s,a) ← 0
Loop:
– Sample <s, a, r, s'> from the environment
– Update Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a), with Q̂(s,a) = r + γ·E_{a'~π(a'|s')} Q(s', a') (the expected value is computed explicitly)
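A small sketch of the three targets on a single transition, assuming a Q-table and an ε-greedy behaviour policy (names and signature are illustrative):

```python
# TD targets for Q-learning, SARSA and expected value SARSA.
import numpy as np

def td_target(Q, s_next, a_next, r, gamma, epsilon, kind):
    q_next = Q[s_next]                      # Q(s', a) for all actions
    n_actions = len(q_next)
    if kind == "q_learning":                # off-policy: best next action
        future = np.max(q_next)
    elif kind == "sarsa":                   # on-policy: the action actually taken next
        future = q_next[a_next]
    elif kind == "expected_sarsa":          # on-policy: expectation under epsilon-greedy pi(a'|s')
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(q_next)] += 1.0 - epsilon
        future = np.dot(probs, q_next)
    return r + gamma * future

# Usage: Q[s][a] <- alpha * td_target(...) + (1 - alpha) * Q[s][a]
```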
Difference
● Q-learning learns the policy that would be optimal without exploration; SARSA learns values under the exploring policy it actually follows.
On-policy vs Off-policy
Two problem setups: on-policy (learn about the policy you are executing) vs off-policy (learn about one policy from data generated by another).
Experience replay
Interaction: store each transition <s, a, r, s'> in a replay buffer; training: sample batches of <s, a, r, s'> from the buffer.
Training curriculum:
- play 1 step and record it
- pick N random transitions to train on
Profit: you don't need to re-visit the same (s,a) many times to learn it.
Caveat: old <s, a, r, s'> come from an older/weaker version of the policy, so this only works with off-policy algorithms!
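A minimal sketch of such a buffer (capacity and batch size are illustrative):

```python
# A simple FIFO replay buffer of <s, a, r, s', done> transitions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)   # old transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

# Usage: after each environment step, buffer.add(s, a, r, s_next, done);
# then train on buffer.sample(batch_size) with any off-policy update (e.g. Q-learning).
```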
New stuff we learned
● Anything?
59
New stuff we learned
● Q(s,a),Q*(s,a)
● Q-learning, SARSA
– We can learn from trajectories (model-free)
61
Reinforcement learning
Episode 4
1
Recap: Q-learning
One approach:
action Q-values
Action value Q(s,a) is the expected total reward G the agent gets from state s by taking action a and following policy π from the next state onward.
L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²
How do we optimize this?
Q-learning as MSE minimization
L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²
Real world
|S| ≈ 2^(8·210·160) = 729179546432630... (an 80917-digit number :)
Problem:
State space is usually large, sometimes continuous. And so is the action space.
Two solutions:
– Binarize the state space (last week): too many bins or handcrafted features
– Approximate the agent with a function (crossentropy method)
Which one would you prefer for Atari?
● Before:
– For all states, for all actions, remember Q(s,a)
● Now:
– Approximate Q(s,a) with some function, e.g. a linear model over state features
argmin_{w,b} ( Q(s_t, a_t) − [r_t + γ·max_{a'} Q(s_{t+1}, a')] )²
[Figure: agent-environment loop: observe, apply action.]
Approximate Q-learning
Model with parameters w takes the image/state and predicts Q-values.
Objective:
L = ( Q(s_t, a_t) − [r + γ·max_{a'} Q(s_{t+1}, a')] )²
(the target in brackets is considered constant when differentiating)
Gradient step:
w_{t+1} = w_t − α·∂L/∂w_t
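A minimal sketch of one such gradient step in PyTorch (the course materials used theano/lasagne; PyTorch is used here only for illustration, and the network size is arbitrary):

```python
# One approximate Q-learning step: MSE to a target that is treated as a constant.
import torch
import torch.nn as nn

n_state_dim, n_actions, gamma = 4, 2, 0.99
network = nn.Sequential(nn.Linear(n_state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(network.parameters(), lr=1e-3)

def q_learning_step(states, actions, rewards, next_states, dones):
    # predicted Q(s_t, a_t) for the actions that were actually taken
    q_pred = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # target r + gamma * max_a' Q(s_{t+1}, a'); no_grad makes it "considered constant"
    with torch.no_grad():
        q_next = network(next_states).max(dim=1).values
    target = rewards + gamma * q_next * (1 - dones)
    loss = ((q_pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()       # gradient flows only through Q(s_t, a_t)
    opt.step()
    return loss.item()
```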
Approximate SARSA
Objective: the same squared error, but the target uses the next action from the policy (target considered constant).
Network architecture:
● Observation goes through whatever layers you found in your favorite deep learning toolkit (Dense, ...)
● Q-values are a dense layer with no nonlinearity
● Actions are picked with an ε-greedy rule (tune ε or use a probabilistic rule)
Architectures
Deep learning approach: DQN
[Figure: DQN architecture. Image (i, w, h, 3) → dimshuffle to (i, 3, w, h) to fit lasagne's convolution layout → Conv0 → Conv1 → ... (any neural network you can think of: conv, pool, dense, dropout, batchnorm) → Dense → Q-values (a dense layer with no nonlinearity) → ε-greedy rule (tune ε or use a probabilistic rule) → apply action.]
● Any ideas?
Multiple agent trick
Idea: throw in several agents with shared weights W, coordinated through a parameter server.
Any +/- ?
[Figure: interaction produces <s,a,r,s'> transitions that go into a replay buffer; training samples batches from it.]
Experience replay
● Older interactions were obtained under a weaker policy
● Better versions coming next week
Summary so far
● Use experience replay (or parallel agents) to make the data closer to i.i.d.
An important question
– Q-learning
– CEM
– SARSA
– Expected Value SARSA
Deep learning meets MDP
– Dropout, noise
● Used in experience replay only: like the usual dropout
● Used when interacting: a special kind of exploration
● You may want to decrease p over time.
– Batchnorm
● Faster training but may break moving averages
● Experience replay: may break down if the buffer is too small
● Parallel agents: may break down with too few agents
<same problem of being non-i.i.d.>
Final problem
Left or right?
44
Problem:
Most practical cases are partially observable: the agent's observation does not hold all information about the process state (e.g. a human's field of view).
Any ideas?
Partially observable MDP
[Figure: the agent observes and acts on the environment's hidden state; the Markov assumption holds for the hidden state (but no one cares).]
N-gram heuristic
Idea: s_t ≠ o(s_t), but
s_t ≈ ( o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t) )
e.g. ball movement in Breakout
[Figure: the same DQN architecture, but the input is a stack of the last 4 frames (image_t, image_{t−1}, image_{t−2}, image_{t−3}) instead of a single image.]
N-grams:
• N-th order Markov assumption
Alternative approach:
• Infer hidden variables given the observation sequence
• Kalman Filters, Recurrent Neural Networks
52
Autocorrelation
● Any ideas?
Target networks
Idea: use older network snapshot
to compute reference
old 2
L=(Q(s t ,a t )−[r+γ⋅max a ' Q (s t +1 ,a' )])
54
Target networks
Idea: use older network snapshot
to compute reference
old 2
L=(Q(s t ,a t )−[r+γ⋅max a ' Q (s t +1 ,a' )])
● Smooth version:
● use moving average
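A minimal sketch of both variants in PyTorch (framework choice, update period and τ are illustrative):

```python
# Hard (periodic copy) and soft (moving-average) target network updates.
import copy
import torch.nn as nn

network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_network = copy.deepcopy(network)          # frozen snapshot used to compute targets

def hard_update(step, period=1000):
    # copy the online weights into the target network every `period` steps
    if step % period == 0:
        target_network.load_state_dict(network.state_dict())

def soft_update(tau=0.01):
    # "smooth version": exponential moving average of the online weights
    for p_target, p_online in zip(target_network.parameters(), network.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p_online.data)

# In the loss, targets are computed with target_network: r + gamma * target_network(s_next).max(...)
```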
What we already know:
- Q-learning
- Experience replay
3
Target network
Idea: use network with frozen weights to compute the target
4
Target network
Idea: use network with frozen weights to compute the target
5
Playing Atari with Deep Reinforcement Learning (2013, Deepmind)
Experience replay
CNN q-values
6
Asynchronous Methods for Deep Reinforcement Learning (2016, Deepmind)
Learning *
*: 7
Problem of overestimation
We use the “max” operator to compute the target.
● Sample 3·10⁶ values from a standard normal distribution: mean ≈ 0.0004
● Sample 3·10⁶ tuples of 3 and take the maximum of each tuple: mean ≈ 0.8467
● Sample 3·10⁶ tuples of 10 and take the maximum of each tuple: mean ≈ 1.538
The max of noisy estimates is biased upwards.
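A quick numeric check of this effect (sample count reduced for speed; the expected means match the slide values):

```python
# The maximum of several zero-mean noisy estimates has a positive mean.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((1_000_000, 10))       # tuples of 10 N(0,1) values
print(samples[:, 0].mean())                 # ~0.0  (a single estimate is unbiased)
print(samples[:, :3].max(axis=1).mean())    # ~0.85 (max of 3)
print(samples.max(axis=1).mean())           # ~1.54 (max of 10)
```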
Double Q-learning (NIPS 2010)
Idea: use two estimators of the Q-values. They should compensate for each other's mistakes because they are independent.
Let's take the argmax from the other estimator!
- Q-learning target
Double Q-learning (NIPS 2010)
15
Prioritized Experience Replay (2016, Deepmind)
Idea: sample transitions from the experience replay more cleverly.
We want to set a sampling probability for every transition: use the absolute value of the transition's TD-error as its (unnormalized) probability!
Prioritized Experience Replay (2016, Deepmind)
19
Prioritized Experience Replay (2016, Deepmind)
21
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
Decompose Q(s,a) into a state value V(s) and an advantage A(s,a).
Here is a problem: there is one extra degree of freedom.
[Figure: a numeric example showing that the same Q-values can be split into V(s) and A(s,a) in several ways by shifting a constant between them.]
What is correct?
Hints (from the paper): force max_a A(s,a) = 0, or subtract the mean advantage.
Reinforcement learning
Episode 6
1
Small experiment
left or right?
Approximation error
DQN is trained to minimize
L ≈ E[ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²
Simple 2-state world:
            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100
Trivia: Which prediction is better (A/B)?
(A) gives the better policy, (B) gives the lower MSE.
Q-learning will prefer the worse policy (B)!
Conclusion
12
How humans survived
π(run | s) = 1
Policies
In general, two kinds
● Deterministic policy
a=πθ (s )
● Stochastic policy
a∼πθ (a∣s)
14
Trivia: Any case where stochastic is better?
Policies
In general, two kinds:
● Deterministic policy: a = π_θ(s)
● Stochastic policy: a ~ π_θ(a|s)
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
Why so complicated?
We'd rather simply maximize R over π!
Objective
Expected reward:
J = E_{s~p(s), a~π_θ(a|s)} R(s, a, s', a', ...)
J = E_{s~p(s), a~π_θ(a|s)} Q(s, a)
using the “true” Q-function.
Objective
The agent's policy: sample N sessions, approximate J by the average return.
Optimization
Finite differences:
– Change the policy a little, evaluate
∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
Objective wish list:
– Analytical gradient
– Easy/stable approximations
Logderivative trick
Simple math:
∇ log π(z) = (1/π(z)) · ∇π(z)
Policy gradient
Analytical inference:
∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)
Trivia: anything curious about that formula?
Policy gradient (REINFORCE)
REINFORCE algorithm
● Initialize NN weights θ_0 ← random
● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient
∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
– Ascend: θ ← θ + α·∇J
● Variance reduction: subtract a baseline b(s)
∇J ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · (Q(s, a) − b(s))
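A minimal sketch of the REINFORCE gradient estimate in PyTorch (framework choice is illustrative; `sessions` is assumed to be a list of (states, actions, rewards) arrays collected under the current policy, and Q(s,a) is estimated by the Monte-Carlo return):

```python
# REINFORCE: maximize mean log pi(a|s) * return over sampled sessions.
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # logits of pi_theta(a|s)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def cumulative_returns(rewards):
    # discounted return from step t onwards (Monte-Carlo estimate of Q(s_t, a_t))
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_step(sessions):
    loss = 0.0
    for states, actions, rewards in sessions:
        states = torch.as_tensor(states, dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        returns = torch.as_tensor(cumulative_returns(rewards), dtype=torch.float32)
        log_probs = F.log_softmax(policy(states), dim=-1)
        log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        # minimize the negative of the policy gradient surrogate
        loss = loss - (log_pi_a * returns).mean()
    loss = loss / len(sessions)
    opt.zero_grad(); loss.backward(); opt.step()
```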
Advantage actor-critic
One model (params W) takes state s and outputs both the policy π_θ(a|s) and the value V_θ(s).
Improve policy (actor):
∇J_actor ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)
Improve value (critic):
L_critic ≈ (1/N) Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ( V_θ(s) − [r + γ·V(s')] )²
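A minimal sketch of these two losses for one batch of transitions in PyTorch (illustrative shapes; entropy bonus and other practical terms omitted):

```python
# Advantage actor-critic losses: policy gradient with A(s,a), plus a value regression term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)   # logits of pi(a|s)
        self.value_head = nn.Linear(64, 1)            # V(s)

    def forward(self, s):
        h = self.body(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_losses(model, states, actions, rewards, next_states, dones, gamma=0.99):
    logits, values = model(states)
    with torch.no_grad():
        _, next_values = model(next_states)
        target = rewards + gamma * next_values * (1 - dones)
    advantage = (target - values).detach()            # A(s,a) estimate; no gradient into the actor term
    log_pi_a = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi_a * advantage).mean()       # gradient ascent on log pi * A
    critic_loss = F.mse_loss(values, target)          # (V(s) - [r + gamma V(s')])^2
    return actor_loss + 0.5 * critic_loss
```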
Continuous action spaces
54
Continuous action spaces
57
Practical RL
week 5, spring’18
1
Multi-armed bandits
2
Multi-armed bandits
3
Multi-armed bandits: the simplest setting
Same “state” each time: the agent just picks an action and gets feedback.
Ideas?
How to measure exploration
Bad idea: by the sound of the name
Good idea: by the $$$ it brought/lost you (regret):
η = Σ_t [ E_{s,a~π*} r(s, a) − E_{s,a~π_t} r(s, a) ]
Exploration Vs Exploitation
What exploration strategies
did we use before?
10
Exploration strategies so far...
Strategies:
• ε-greedy
  • With probability ε take a uniformly random action;
  • Otherwise take the optimal action.
• Boltzmann
  • Pick an action proportionally to transformed Q-values: P(a) = softmax(Q(s,a) / std)
• Optimistic initialization
  • Start from high initial Q(s,a) for all states/actions
  • Good for tabular algorithms, hard to approximate
Exploration strategies so far...
Say we use ε-greedy with a constant ε = 0.25 on top of Q-learning to play a videogame. A constant fraction of actions stays random forever, so the regret
η = Σ_t [ E_{s,a~π*} r(s, a) − E_{s,a~π_t} r(s, a) ]
keeps growing linearly.
Idea:
If you want to converge to the optimal policy, you need to gradually reduce exploration.
Example:
How many lucky random actions does it take to
● Apply medical treatment
● Control robots
● Invent efficient VAE training
16
BTW, how do humans explore?
Whether some new particles violate physics
Vs
Whether you still can't fly by pulling your hair up
17
Uncertainty in returns
18
Q(s,a)
Thompson sampling
● Policy:
– sample once from each Q distribution
– take argmax over samples
– which actions will be taken?
19
Q(s,a)
Thompson sampling
● Policy:
– sample once from each Q distribution
– take argmax over samples
– which actions will be taken?
Takes a1 with p ~ 0.65, a2 with p ~ 0.35, a0 ~ never
20
Q(s,a)
Optimism in face of uncertainty
Idea:
Prioritize actions with uncertain outcomes!
21
Optimism in face of uncertainty
● Policy:
– Compute 95% upper confidence bound for each a
– Take action with highest confidence bound
– What can we tune here to explore more/less?
22
Q(s,a)
Optimism in face of uncertainty
● Policy:
– Compute 95% upper confidence bound for each a
– Take action with highest confidence bound
– Adjust: change 95% to more/less
23
points = 95% percentiles of Q(s,a)
Frequentist approach
There are a number of inequalities that bound P(x > t) < something.
UCB-1:
ṽ_a = v_a + sqrt(2·log N / n_a)
For MDPs:
Q̃(s, a) = Q(s, a) + α·sqrt(2·log N_s / n_{s,a})
where
– N_s is the number of visits to state s
– n_{s,a} is the number of times action a was taken from state s
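A minimal sketch of this rule on a toy Bernoulli bandit (the arm probabilities and horizon are illustrative):

```python
# UCB-1: pick the arm with the highest upper confidence bound v_a + sqrt(2 log N / n_a).
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])      # unknown to the agent
n_arms = len(true_probs)

values = np.zeros(n_arms)                   # empirical mean reward v_a
counts = np.zeros(n_arms)                   # n_a, times each arm was pulled

for t in range(1, 10_001):
    if t <= n_arms:
        a = t - 1                           # pull every arm once first
    else:
        ucb = values + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = float(rng.random() < true_probs[a])
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]             # running mean update

print("pull counts:", counts)               # the best arm should dominate
```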
Bayesian UCB
[Figure: a small network with tanh units maps input X to μ and σ; the prediction is P(y|x) with y ~ N(μ, σ).]
A regular NN maps X → y; a Bayesian NN maps X → P(y|x).
Idea:
● No explicit weights: keep a distribution q(θ|φ) over them.
Derivation:
−∫_θ q(θ|φ)·log [ q(θ|φ)·p(d) / (p(d|θ)·p(θ)) ] = −∫_θ q(θ|φ)·[ log (q(θ|φ)/p(θ)) − log p(d|θ) + log p(d) ]
In other words, the BNN likelihood term is Σ_{x,y~d} log p(y | x, μ + σψ).
● Bayesian UCB:
– Prior can make or break it
– Sometimes parametric guys win
(vs bnn)
47
Markov Decision Processes
● Naive approach:
– Infer posterior distribution on Q(s,a)
– Do UCB or Thompson Sampling on those Q-values
– Agent is “greedy” w.r.t. exploration:
it would rather take one uncertain action now than make several steps to end up in unexplored regions
48
Markov Decision Processes
● Naive approach:
– Infer posterior distribution on Q(s,a)
– Do UCB or Thompson Sampling on those Q-values
– Agent is “greedy” w.r.t. exploration
● Reward augmentation
– Devise a surrogate “reward” for exploration
● We “pay” our agent for exploring
– Maximize this reward with a (separate) RL agent
49
I got it!
50
Reward augmentation
51
Reward augmentation
Q: Any suggestions on
surrogate r for atari? 52
UNREAL main idea
● Auxiliary objectives:
– Pixel control: maximize pixel change in an NxN grid over the image
– Feature control: maximize the activation of some neuron deep inside the neural network
– Reward prediction: predict future rewards given history
article: arxiv.org/abs/1611.05397
blog post: bit.ly/2g9Yv2A
UNREAL main idea
(Keep calm! We'll get more theoretically sound models in a few slides.)
Environment: Labyrinth
56
Results: Atari
57
Count-based models
58
Count-based models
59
Examples: arxiv:1606.01868, arxiv:1703.01310
Count-based models
60
Estimating counts
We need some way to estimate N(s)
62
Density ratio trick
We need some way to estimate N(s).
Train another model (a discriminator) to distinguish between s ~ d(s) and s ~ q(s).
A perfect discriminator satisfies: p(s ∈ d) = d(s) / (d(s) + q(s))
Hence: 1 / p(s ∈ d) = (d(s) + q(s)) / d(s) = 1 + q(s) / d(s)
q(s) / d(s) = 1 / p(s ∈ d) − 1 = (1 − p(s ∈ d)) / p(s ∈ d)
d(s) / q(s) = p(s ∈ d) / (1 − p(s ∈ d)) ≈ Discriminator(s) / (1 − Discriminator(s))
d(s) = q(s) · d(s)/q(s) ≈ q(s) · Discriminator(s) / (1 − Discriminator(s))
Uniform q(s): simple math, high-variance d(s); task-specific q(s): possibly smaller variance.
Variational Information-Maximizing Exploration
arxiv:1605.09674 70
Variational Information-Maximizing Exploration
Curiosity
Taking actions that increase your knowledge about
the world (a.k.a. the environment)
arxiv:1605.09674 71
Vime main idea
Curiosity definition:
r_vime(τ, a, s') = I(θ; s' | τ, a)
Environment model: P(s' | s, a, θ)
Session: τ_t = ⟨s_0, a_0, s_1, a_1, ..., s_t⟩
Surrogate reward:
r̃(τ, a, s') = r(s, a, s') + β·r_vime(τ, a, s') = r(s, a, s') + β·I(θ; s' | τ, a)
Curiosity term:
I(θ; s' | τ, a) = H(θ | τ, a) − H(θ | τ, a, s') = E_{s_{t+1} ~ P(s_{t+1}|s,a)} KL[ P(θ | τ, a, s') ∥ P(θ | τ) ]
(need a proof for that last line?)
Naive objective:
E_{s_{t+1} ~ P(s_{t+1}|s,a)} KL[ P(θ|τ,a,s') ∥ P(θ|τ) ] = ∫_{s'} P(s'|s,a) · ∫_θ P(θ|τ,a,s') · log [ P(θ|τ,a,s') / P(θ|τ) ] dθ ds'
where
P(θ|τ) = P(τ|θ)·P(θ) / P(τ) = Π_t P(s_{t+1}|s_t,a_t,θ)·P(θ) / ∫_θ P(τ|θ)·P(θ) dθ
We can sample s' from the MDP, but the posterior over θ we can only get "somehow"... dunno.
BNN approximation:
E_{s_{t+1} ~ P(s_{t+1}|s,a)} KL[ P(θ|τ,a,s') ∥ P(θ|τ) ] ≈ ∫_{s'} P(s'|s,a) · ∫_θ q(θ|τ,a,s') · log [ q(θ|τ,a,s') / q(θ|τ) ] dθ ds'
(sample s' from the environment, sample θ from the BNN; q(θ|τ) is the BNN from the last tick)
Algorithm
Forever:
1. Interact with the environment, get <s,a,r,s'>
2. Compute the curiosity reward: r̃(τ,a,s') = r(s,a,s') + β·KL[ q(θ|φ') ∥ q(θ|φ) ]
3. train_agent(s, a, r̃, s')   // with any RL algorithm
4. train_BNN(s, a, s')        // maximize the lower bound
Dirty hacks
● Use batches of many <s,a,r,s'>
– for CPU/GPU efficiency
– greatly improves RL stability
● Simple formula for the KL
– Assuming a fully-factorized normal distribution:
KL[ q(θ|φ') ∥ q(θ|φ) ] = (1/2) Σ_{i<|θ|} [ (σ'_i/σ_i)² + 2·log σ_i − 2·log σ'_i + (μ'_i − μ_i)²/σ_i² ]
– Even simpler: second-order Taylor approximation
● Divide the KL by its running average over past iterations
Results
[Figure: session reward vs. epoch on several tasks.]
Pitfalls
● It's curious about irrelevant things
● Predicting (210x160x3) images is hard
● We don't observe full states (POMDP)
State = hidden NN activation
[Figure: the Q-values sit on top of more layers; the "state" is taken from a hidden layer's activations rather than from the raw observation S. About the raw observations O_{t-3}..O_t themselves: we don't care.]