
Practical RL

@spring ‘18

Intro to Reinforcement Learning


Why should you care

1
Terms
● Ask!
– Even if the question feels stupid.
– Chances are, half of the group is just like you.
– If it's necessary, interrupt the speaker.

● Contribute!
– Found an error? Got a useful link? Ported the
seminar from py2 to py3? Answered a peer's question
in the chat?
– You're awesome! 2
<a convenient slide for public survey>

3
Supervised learning

Given:
● objects and answers: (x, y)
● algorithm family: a_θ(x) → y
● loss function: L(y, a_θ(x))
Find:

θ' ← argmin_θ L(y, a_θ(x))


4
Supervised learning

Given:
● objects and answers: (x, y)          e.g. [banner, page] → CTR
● algorithm family: a_θ(x) → y         linear / tree / NN
● loss function: L(y, a_θ(x))          MSE, crossentropy
Find:

θ' ← argmin_θ L(y, a_θ(x))


5
Supervised learning

Great... except if we have no reference answers

6
Online Ads
Great... except if we have no reference answers

We have:
● YouTube at your disposal
● Live data stream
(banner & video features, #clicked)
● (insert your favorite ML toolkit)

We want:
● Learn to pick relevant ads

Ideas? 7
Duct tape approach

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

8
Giant Death Robot (GDR)
Great... except if we have no reference answers

We have:
● Evil humanoid robot
● A lot of spare parts
to repair it :)

We want:
● Enslave humanity
● Learn to walk forward

Ideas? 9
Duct tape approach (again)

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

10
Duct tape approach

11
Problems

Problem 1:
● What exactly does the “optimal action”

mean?

Extract as much
money as you can
right now
VS
Make user happy
so that he would
visit you again 12
Problems

Problem 2:
● If you always follow the “current optimal”

strategy, you may never discover something


better.

● If you show the same banner to 100% users,


you will never learn how other ads affect them.

Ideas?

13
Duct tape approach

14
Reinforcement learning

15
What-what learning?
Supervised learning                      Reinforcement learning

● Learning to approximate                ● Learning an optimal strategy
  reference answers                        by trial and error

● Needs correct answers                  ● Needs feedback on the agent's
                                           own actions

● Model does not affect the              ● Agent can affect its own
  input data                               observations
What-what learning?
Unsupervised learning                    Reinforcement learning

● Learning the underlying                ● Learning an optimal strategy
  data structure                           by trial and error

● No feedback required                   ● Needs feedback on the agent's
                                           own actions

● Model does not affect the              ● Agent can affect its own
  input data                               observations
What is: bandit

observation action
Agent Feedback

Examples:
– banner ads (RTB)
– recommendations
– medical treatment
18
What is: bandit

observation action
Agent Feedback

Examples:
– banner ads (RTB)
Q: what's observation, action and
– recommendations feedback in the banner ads problem?
– medical treatment
20
What is: bandit

observation action
Agent Feedback
observation: user features, time of year, trends   →   action: show banner   →   feedback: click, money
Examples:
– banner ads (RTB)
– recommendations
– medical treatment
21
What is: bandit

observation action
Agent Feedback
observation: user features, time of year, trends   →   action: show banner   →   feedback: click, money

Q: You're Yandex/Google/Youtube.
There's a kind of banners that would
have great click rates: the “clickbait”.

Is it a good idea to show clickbait?


22
What is: bandit

observation action
Agent Feedback

Q: You're Yandex/Google/Youtube.
There's a kind of banners that would
have great click rates: the “clickbait”.

Is it a good idea to show clickbait?


23
No, no one will trust you after that!
What is: decision process

Agent
observation

action
24
Environment
What is: decision process

Agent
Can do anything
observation

action
Can't even see
(worst case)

25
Environment
Reality check: web
● Cases:
● Pick ads to maximize profit

● Design landing page to

maximize user retention


● Recommend movies to users

● Find pages relevant to queries

● Example
● Observation – user features

● Action – show banner #i

● Feedback – did user click?

26
Reality check: dynamic systems

27
Reality check: dynamic systems

● Cases:
● Robots

● Self-driving vehicles

● Pilot assistant

● More robots!

● Example
● Observation: sensor feed

● Action: voltage sent to motors

● Feedback: how far did it move

forward before falling

28
Reality check: videogames

29
● Q: What are observations, actions and feedback?
Reality check: videogames

30
● Q: What are observations, actions and feedback?
Other use cases
● Personalized medical treatment

● Even more games (Go, chess, etc)

31
● Q: What are observations, actions and feedback?
Other use cases
● Conversation systems
– learning to make user happy
● Quantitative finance
– portfolio management
● Deep learning
– optimizing non-differentiable loss
– finding optimal architecture

32
The MDP formalism

Markov Decision Process


● Environment states: s ∈ S
● Agent actions: a ∈ A
● Rewards: r ∈ ℝ
● Dynamics: P(s_{t+1} ∣ s_t, a_t)                                          33
The MDP formalism

a ∈ A
s ∈ S

P(s_{t+1} ∣ s_t, a_t)

Markov Decision Process


Markov assumption:

P(s_{t+1} ∣ s_t, a_t, s_{t−1}, a_{t−1}) = P(s_{t+1} ∣ s_t, a_t)


34
Total reward

Total reward for a session:

R = ∑_t r_t

Agent's policy:

π(a∣s) = P(take action a ∣ in state s)

Problem: find the policy with the
highest expected reward:

π(a∣s):  E_π[R] → max                                                      36
Objective

The easy way:

E_π[R] is the expected sum of rewards
that an agent with policy π earns per session.

The hard way:

E_{s_0∼p(s_0)} E_{a_0∼π(a∣s_0)} E_{s_1,r_0∼P(s',r∣s,a)} ... E_{s_T,r_T∼P(s',r∣s_{T−1},a_{T−1})} [ r_0 + r_1 + r_2 + ... + r_T ]

37
How do we solve it?

General idea:

Play a few sessions

Update your policy

Repeat

39
Crossentropy method
Initialize policy

Repeat:
– Sample N[100] sessions

– Pick M[25] best sessions, called elite sessions

– Change policy so that it prioritizes


actions from elite sessions
40
Step-by-step view

(Slides 41–51: a sequence of figures showing the method iteration by iteration; the only surviving label is “elites”.)
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Get M best sessions (elites)

Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_k, a_k)]

52
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Take M best sessions (elites)
● Aggregate by states

π(a∣s) = ∑_{(s_t,a_t)∈Elite} [s_t=s]·[a_t=a]  /  ∑_{(s_t,a_t)∈Elite} [s_t=s]                53
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Take M best sessions (elite)
● Aggregate by states

π(a∣s) = (times a was taken in s) / (times s was visited), counted over the M best games
54
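A minimal sketch of this tabular update in numpy (assuming `elite` is a list of (state, action) index pairs and n_states, n_actions are known; not from the slides):

import numpy as np

def update_policy(elite, n_states, n_actions):
    """Tabular CEM update: pi(a|s) = #[took a in s] / #[was in s], counted over elite sessions."""
    counts = np.zeros((n_states, n_actions))
    for s, a in elite:
        counts[s, a] += 1
    visits = counts.sum(axis=1, keepdims=True)
    # states never visited in the elite sessions fall back to a uniform policy
    policy = np.divide(counts, visits, out=np.full_like(counts, 1.0 / n_actions), where=visits > 0)
    return policy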
Grim reality

If your environment has infinite/large state space

55
Approximate crossentropy method
● Policy is approximated
– A neural network predicts π_W(a∣s) given s
– Linear model / Random Forest / ...

Can't set π(a∣s) explicitly

All state-action pairs from the M best sessions:


Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_k, a_k)]
56
Approximate crossentropy method
A neural network predicts π_W(a∣s) given s

All state-action pairs from the M best sessions:

Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_k, a_k)]

Maximize the likelihood of actions taken in the “best” games:

π = argmax_π ∑_{(s_i,a_i)∈Elite} log π(a_i∣s_i)
57
Approximate crossentropy method
● Initialize NN weights: W_0 ← random

● Loop:
– Sample N sessions

– Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_k, a_k)]

– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]
58
Approximate crossentropy method
● Initialize NN weights nn = MLPClassifier(...)

● Loop:
– Sample N sessions

– Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_k, a_k)]

– nn.fit(elite_states,elite_actions)

59
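Putting the loop together: a minimal sketch on a discrete-action gym environment (CartPole-v0 is just an illustrative choice, and the classic gym API — reset() returning an observation, step() returning (obs, reward, done, info) — is assumed), with a percentile threshold used to pick the elite sessions:

import numpy as np
import gym
from sklearn.neural_network import MLPClassifier

env = gym.make("CartPole-v0")
n_actions = env.action_space.n
nn = MLPClassifier(hidden_layer_sizes=(20, 20))
# initialize the classifier so that predict_proba works before real training
nn.partial_fit([env.reset()] * n_actions, list(range(n_actions)), classes=list(range(n_actions)))

def generate_session(t_max=1000):
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        probs = nn.predict_proba([s])[0]            # pi(a|s) predicted by the network
        a = np.random.choice(n_actions, p=probs)    # sample an action
        new_s, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward

for i in range(50):
    sessions = [generate_session() for _ in range(100)]            # sample N = 100 sessions
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, 75)                         # keep roughly the best M = 25 sessions
    elite_states = np.concatenate([s for s, _, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for _, a, r in sessions if r >= threshold])
    nn.partial_fit(elite_states, elite_actions)                    # maximize log pi(a_i|s_i) on elites
    print("iteration %d: mean reward = %.1f" % (i, rewards.mean()))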
Continuous action spaces
● Continuous action space

● Model π_W(a∣s) = N(μ(s), σ²)
– μ(s) is the neural network output
– σ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate

– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]

What changed? 60
Approximate crossentropy method
● Initialize NN weights nn = MLPRegressor(...)

● Loop:
– Sample N sessions

– Elite=[(s 0 , a0 ),(s 1 , a1 ),(s 2 , a2 ),... ,(s k , a k )]

– nn.fit(elite_states,elite_actions)

Almost nothing! 61
Tricks
● Remember sessions from 3-5 past iterations
– Threshold and use all of them when training
– May converge slower if env is easy to solve.

● Regularize with entropy


– to prevent premature convergence.

● Parallelize sampling
● Use RNNs if partially-observable (later) 62
Reinforcement learning
Episode 1

Black box optimization

1
Recap: reinforcement learning

Agent
observation

action
2
Environment
Recap: MDP

Classic MDP(Markov Decision Process)


Agent interacts with environment
● Environment states:
s ∈S
● Agent actions: a ∈A
● State transition: P(s t+ 1∣s t , at )
3
Recap: reinforcement learning

Agent
observation

action
4
Environment
Feedback (Monte-Carlo)

● Naive objective: R(z), where z is the whole session

z = [s_0, a_0, s_1, a_1, s_2, a_2, ..., s_n, a_n]

Deterministic policy:
● Find the policy with the highest
expected reward

π(s) → a:  E[R] → max

5
Black box optimization setup

Agent

R(s 0 , a0 , s 1 , a1 ,... , s n , a n)

7
Black box optimization setup

Agent

Policy R
params

R(s 0 , a0 , s 1 , a1 ,... , s n , a n)

8
Black box optimization setup

Agent

Policy black R
params
box
R(s 0 , a0 , s 1 , a1 ,... , s n , a n)

9
Today's menu
Evolution strategies
– A general black box optimization
– Easy to implement & scale

Crossentropy method
– A general method with special case for RL
– Works remarkably well in practice

10
Evolution Strategies

– Introduce a distribution over weights

θ ∼ N(θ∣μ, σ²)

– Maximize expected reward

J = E_{N(θ∣μ,σ²)} [R]

11
Evolution Strategies

– Introduce a distribution over weights

θ ∼ N(θ∣μ, σ²)        (any other P(θ) will work as well)

– Maximize expected reward

J = E_{N(θ∣μ,σ²)} [R]

12
Evolution Strategies

– Introduce a distribution over weights

θ ∼ N(θ∣μ, σ²)

– Maximize expected reward

J = E_{N(θ∣μ,σ²)} E_{s,a,s',a',...} [ R(s, a, s', a', ...) ]

13
Evolution Strategies
– Expected reward (written as a math. expectation)

J = E_{N(θ∣μ,σ²)} E_{s,a,s',a',...} [ R(s, a, s', a', ...) ]

Q: How can we estimate J in practice,
for a large/infinite state space?

14
Evolution Strategies
– Expected reward (written as a math. expectation)

J = E_{N(θ∣μ,σ²)} E_{s,a,s',a',...} [ R(s, a, s', a', ...) ]

– Approximate with sampling:

J ≈ (1/N) ∑_i ∑_{s,a∈z_i} R(s, a, ...)        (the inner sum is the total reward of session z_i)

Sample each θ_i from θ ∼ N(θ∣μ, σ²)                                        15
Evolution Strategies
– Expected reward (written as an integral)

J = ∫ N(θ∣μ, σ²) · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

– What we need:

∇J = ∫ ∇[N(θ∣μ, σ²)] · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

17
Evolution Strategies
– Expected reward (written as an integral)

J = ∫ N(θ∣μ, σ²) · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

– What we need:

∇J = ∫ ∇[N(θ∣μ, σ²)] · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

Q: Can we estimate ∇J with samples?


18
Evolution Strategies
– Expected reward (written as an integral)

J = ∫ N(θ∣μ, σ²) · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

– What we need:

∇J = ∫ ∇[N(θ∣μ, σ²)] · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

Not a valid expectation:


19
∇N(θ∣μ, σ²) is not a distribution
Logderivative trick
Simple math

∇ log f ( x )=? ? ?

(try chain rule)

20
Logderivative trick
Simple math

∇ log f(x) = (1 / f(x)) · ∇ f(x)

f(x) · ∇ log f(x) = ∇ f(x)

21
Logderivative trick
Analytical inference

∇J = ∫ ∇[N(θ∣μ, σ²)] · ∫ P(s, a, s', a', ...) · R(s, a, s', a', ...)

∇N(θ∣μ, σ²) = N(θ∣μ, σ²) · ∇ log N(θ∣μ, σ²)

22
Evolution Strategies
Analytical inference

∇J = ∫ [ N(θ∣μ, σ²) · ∇ log N(θ∣μ, σ²) ] · E_{s,a,...} R(s, a, ...)

Q: How can we estimate ∇J now?

23
Evolution Strategies
Analytical inference

∇J = ∫ [ N(θ∣μ, σ²) · ∇ log N(θ∣μ, σ²) ] · E_{s,a,...} R(s, a, ...)

Sampled estimate:

∇J ≈ (1/N) ∑_i ∇ log N(θ_i∣μ, σ²) · ∑_{s,a∈z_i} R(s, a, ...)

Sample each θ_i from θ ∼ N(θ∣μ, σ²)                                        24
Evolution strategies
Algorithm

1. Initialize μ_0, σ²_0

2. Forever:

∇J ≈ (1/N) ∑_i ∇ log N(θ_i∣μ, σ²) · ∑_{s,a∈z_i} R(s, a, ...)

μ = μ + α·∇_μ J        σ² = σ² + α·∇_{σ²} J
25
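A minimal numpy sketch of one such update for the mean μ (σ kept fixed for brevity); evaluate_session(θ) is a hypothetical function that plays one session with policy parameters θ and returns its total reward:

import numpy as np

def es_step(mu, sigma, evaluate_session, n_samples=50, lr=0.1):
    """One evolution-strategies update of mu using the log-derivative trick."""
    mu = np.asarray(mu, dtype=float)
    thetas = np.random.normal(mu, sigma, size=(n_samples, mu.size))  # theta_i ~ N(mu, sigma^2)
    rewards = np.array([evaluate_session(theta) for theta in thetas])
    # grad_mu log N(theta | mu, sigma^2) = (theta - mu) / sigma^2
    grad_J = np.mean((thetas - mu) / sigma**2 * rewards[:, None], axis=0)
    return mu + lr * grad_J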
Evolution strategies
Features
– A general black box optimization
– Needs a lot of samples
– Easy to implement & scale

26
Evolution strategies
Features
– A general black box optimization
– Easy to implement & scale

∇J ≈ (1/N) ∑_i ∇ log N(θ_i∣μ, σ²) · ∑_{s,a∈z_i} R(s, a, ...)

Q: You have 1000 CPUs.
Optimize this formula!

27
Evolution strategies
Features
– A general black box optimization
– Some results on gym

https://blog.openai.com/evolution-strategies/ 28
Today's menu
Evolution strategies
– A general black box optimization
– Requires a lot of sampling

Crossentropy method
– A general method with special case for RL
– Works remarkably well in practice

29
Estimation problem
● You want to estimate

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

30
Estimation is not a problem
● You want to estimate

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

● So what? You just compute it!

31
Estimation problem
● You want to estimate

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

● So what? You just compute it!


– x may be 1000-dimensional
– H(x) may be costly to compute

32
Ideas?
Estimation in the wild
● You want to estimate

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

● So what? You just compute it!


– x may be 1000-dimensional
– H(x) may be costly to compute

∫_x p(x)·H(x) dx ≈ (1/N) ∑_{x_k∼p(x)} H(x_k)                               33
Estimation in the wild
● You want to estimate profits!

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

● x – user of your online game (age, gender, ...)


● p(x) – probability of such user
● H(x) – try to guess :)

34
Estimation in the wild
● You want to estimate profits!

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx

● x – user of your online game (age, gender, ...)


● p(x) – probability of such user
● H(x) – money donated by such user

35
Estimation in the wild
● Sampling = asking users to pass survey
● Usually costs money!
● Guess H(median russian gamer)?

36
Estimation in the wild
● Sampling = asking users to pass survey
● Usually costs money!
● H(median russian gamer) ~ 0
● It's H(hard-core donators) that matters!

37
Estimation in the wild
● Sampling = asking users to pass survey
● Usually costs money!
● Most H(x) are small, few are very large

38
Estimation in the wild
● Sampling = asking users to pass survey
● Usually costs money!
● 99% of H(x)=0, 1% H(x)=$1000 (whale)
● You make a survey of N=50 people

How accurate are we?

39
Estimation in the wild
● Sampling = asking users to pass survey
● Usually costs money!
● 99% of H(x)=0, 1% H(x)=$1000 (whale)
● You make a survey of N=50 people

∫_x p(x)·H(x) dx ≈ (1/N) ∑_{x_k∼p(x)} H(x_k)

0 whales in the survey: estimate H = 0;  1 whale: estimate ≈ 5× the true value               40
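A tiny numpy illustration of how noisy this naive estimate is (the 99% / 1% / $1000 / N=50 numbers are taken from the slide):

import numpy as np

np.random.seed(1)
true_mean = 0.99 * 0 + 0.01 * 1000                    # = $10 per user
whales = np.random.binomial(50, 0.01, size=10000)     # whales caught in each 50-person survey
estimates = whales * 1000 / 50                        # naive Monte-Carlo estimate per survey
print(true_mean, estimates.mean(), estimates.std())   # unbiased on average, but the spread is huge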


Importance sampling
● Idea: we know that most whales are
– 30-40 year old
– single
– wage >100k
● Sample 50% in that group, 50% rest
● Adjust for difference in distributions

41
Importance sampling
● Math:

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx = ∫_x p(x) · (q(x)/q(x)) · H(x) dx

42
Importance sampling
● Math:

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx = ∫_x p(x) · (q(x)/q(x)) · H(x) dx =

= ∫_x q(x) · (p(x)/q(x)) · H(x) dx = E_{x∼q(x)} ???

43
Importance sampling
● Math:

E_{x∼p(x)} H(x) = ∫_x p(x)·H(x) dx = ∫_x p(x) · (q(x)/q(x)) · H(x) dx =

= ∫_x q(x) · (p(x)/q(x)) · H(x) dx = E_{x∼q(x)} [ (p(x)/q(x)) · H(x) ]

44
Importance sampling
● TL;DR:

E_{x∼p(x)} H(x) = E_{x∼q(x)} [ (p(x)/q(x)) · H(x) ]

45
Importance sampling
● TL;DR:

E_{x∼p(x)} H(x) = E_{x∼q(x)} [ (p(x)/q(x)) · H(x) ]

(1/N) ∑_{x_k∼p(x)} H(x_k)  ≈  (1/N) ∑_{x_k∼q(x)} (p(x_k)/q(x_k)) · H(x_k)

46
Importance sampling
● TL;DR:

E_{x∼p(x)} H(x) = E_{x∼q(x)} [ (p(x)/q(x)) · H(x) ]

(requires: if p(x) > 0, then q(x) > 0)

E_{x∼p(x)} H(x) ≈ (1/N) ∑_{x_k∼p(x)} H(x_k)  ≈  (1/N) ∑_{x_k∼q(x)} (p(x_k)/q(x_k)) · H(x_k)

        (original distribution)                 (other distribution)                        47
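A small self-contained numpy/scipy illustration of the same trick on a toy rare-event problem (this particular p, q and H are my own example, not from the slides):

import numpy as np
from scipy.stats import norm

np.random.seed(2)
p, q = norm(0, 1), norm(3, 1)          # q oversamples the region where H(x) is nonzero
H = lambda x: 1000.0 * (x > 3)         # rare, expensive "whale" event
true_value = 1000 * p.sf(3)            # ~1.35

x_p = p.rvs(1000)
naive = np.mean(H(x_p))                # plain Monte-Carlo: usually sees only 0-3 whales

x_q = q.rvs(1000)
weights = p.pdf(x_q) / q.pdf(x_q)      # p(x)/q(x) correction
importance = np.mean(weights * H(x_q))

print(true_value, naive, importance)   # the importance-sampled estimate is far less noisy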


Importance sampling
● Idea: we may know that all whales are
– 30-40 year old
– single
– wage >100k
● Sample q(x): 50% that group, 50% rest
● Adjust for difference in distributions

(1/N) ∑_{x_k∼p(x)} H(x_k)  ≈  (1/N) ∑_{x_k∼q(x)} (p(x_k)/q(x_k)) · H(x_k)
48
Importance sampling
● Idea: we may know that all whales are
– 30-40 year old
– single
– wage >100k
● Sample from different q(x)
● Adjust for difference in distributions

Which q(x) is best?


49
Crossentropy method
● Pick q ( x)∼ p(x )⋅H ( x)

What's q(x)?

H(x) p(x)

50
Crossentropy method
● Pick q ( x)∼ p(x )⋅H ( x)

q(x)

H(x) p(x)

51
Crossentropy method
● Minimize difference between q ( x) and p(x )H ( x)

● How to measure that difference? Ideas?

52
Crossentropy method
● Minimize the difference between q(x) and p(x)·H(x)

● Kullback–Leibler divergence:

KL( p_1(x) ∥ p_2(x) ) = E_{x∼p_1(x)} log ( p_1(x) / p_2(x) )

53
Crossentropy method
● Minimize the difference between q(x) and p(x)·H(x)

● Kullback–Leibler divergence:

KL( p_1(x) ∥ p_2(x) ) = E_{x∼p_1(x)} log ( p_1(x) / p_2(x) ) =

= E_{x∼p_1(x)} log p_1(x) − E_{x∼p_1(x)} log p_2(x)
        (what?)                     (what?)
54
Crossentropy method
● Minimize the difference between q(x) and p(x)·H(x)

● Kullback–Leibler divergence:

KL( p_1(x) ∥ p_2(x) ) = E_{x∼p_1(x)} log ( p_1(x) / p_2(x) ) =

= E_{x∼p_1(x)} log p_1(x) − E_{x∼p_1(x)} log p_2(x)
        (entropy: const w.r.t. p_2(x))      (crossentropy)

55
Crossentropy method
● Minimize the difference between q(x) and p(x)·H(x)

● Minimize the Kullback–Leibler divergence:

argmin_{q(x)} [ const − E_{x∼p(x)} H(x)·log q(x) ]
                 (entropy)        (crossentropy)
56
Crossentropy method
● Pick q(x) to minimize the crossentropy:

q(x) = argmin_{q(x)} [ − E_{x∼p(x)} H(x)·log q(x) ]

● Exact solution exists in many cases (e.g. Gaussian)


● Otherwise use numeric optimization
– e.g. when q(x) is a neural network

57
Iterative approach
● Pick q(x) to minimize the crossentropy:

q(x) = argmin_{q(x)} [ − E_{x∼p(x)} H(x)·log q(x) ]

● Start with q_0(x) = p(x)
● Iteration:

q_{i+1}(x) = argmin_{q_{i+1}(x)} [ − E_{x∼q_i(x)} (p(x)/q_i(x)) · H(x) · log q_{i+1}(x) ]

58
Finally, reinforcement learning
● Objective: H(x) = [R > threshold]
● p(x) = uniform
● Threshold ψ = M'th (e.g. 50th) percentile of R

π_{i+1}(a∣s) = argmin_{π_{i+1}} − E_{z∼π_i(a∣s)} [R(z) ≥ ψ_i] · log π_{i+1}(a∣s)

ψ_i = M'th percentile of R(z ∼ π_i)


59
Finally, reinforcement learning
● Objective: H(x) = [R > threshold]
● p(x) = uniform
● Threshold ψ = M'th (e.g. 50th) percentile of R

π_{i+1}(a∣s) = argmin_{π_{i+1}} − E_{z∼π_i(a∣s)} [R(z) ≥ ψ_i] · log π_{i+1}(a∣s)

ψ_i = M'th percentile of R(z ∼ π_i)

Something wrong with the formula!                                          60


Finally, reinforcement learning
● Objective: H(x) = [R > threshold]
● p(x) = uniform
● Threshold ψ = M'th (e.g. 50th) percentile of R

π_{i+1}(a∣s) = argmin_{π_{i+1}} − E_{z∼π_i(a∣s)} [R(z) ≥ ψ_i] · log π_{i+1}(a∣s)

No p(x)/q(x) term, since it's okay to take the expectation over π_i(a∣s)

ψ_i = M'th percentile of R(z ∼ π_i)

61
TL;DR, simplified

● Sample N=100 sessions


● Take M=25 best
● Fit policy to behave as in M best sessions
● Repeat until satisfied

Policy will gradually get better.


62
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Get M best games (highest reward)
● Concatenate them: K state-action pairs total

Elite = [(s_0, a_0), (s_1, a_1), (s_2, a_2), ..., (s_K, a_K)]

63
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Take M best (highest reward)
● Aggregate by states

π(a∣s) = ∑_{(s_t,a_t)∈Elite} [s_t=s]·[a_t=a]  /  ∑_{(s_t,a_t)∈Elite} [s_t=s]                64
Tabular crossentropy method
● Policy is a matrix

π(a∣s) = A[s, a]

● Sample N games with that policy


● Take M best (highest reward)
● Aggregate by states

π(a∣s) = (times a was taken in s) / (times s was visited), counted over the M best games
65
Smoothing
● If you were in some state only once, you will
always take that one action from now on.
● Apply smoothing:

π(a∣s) = ([took a at s] + λ) / ([was at s] + λ·N_actions), counted over the M best games

Alternative idea: smooth the updates:

π_{i+1}(a∣s) = α·π_opt(a∣s) + (1−α)·π_i(a∣s)

66
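Both smoothing variants in a short numpy sketch (counts[s, a] is the elite state-action count matrix from the previous slides; λ and α follow the formulas above):

import numpy as np

def smoothed_policy(counts, lam=0.1):
    """Laplace smoothing: pi(a|s) = (counts[s,a] + lam) / (counts[s].sum() + lam * n_actions)."""
    n_actions = counts.shape[1]
    return (counts + lam) / (counts.sum(axis=1, keepdims=True) + lam * n_actions)

def blended_policy(old_policy, new_policy, alpha=0.5):
    """Smooth updates: pi_{i+1}(a|s) = alpha * pi_opt(a|s) + (1 - alpha) * pi_i(a|s)."""
    return alpha * new_policy + (1 - alpha) * old_policy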
Stochastic MDPs
● If there's randomness in environment, algorithm
will prefer “lucky” sessions.
● Training on lucky sessions is no good

● Solution: sample action for each state and run


several simulations with these state-action
pairs. Average the results.

67
Approximate (deep) version
● Policy is approximated
– Neural network predicts π W ( a∣s) given s
– Linear model / Random Forest / ...

Can't set π (a∣s) explicitly

All state-action pairs from M best sessions


Elite=[( s 0 , a0 ),(s 1 , a1 ) ,(s 2 , a2 ) ,... ,( s k , a k )]
68
Approximate (deep) version
Neural network predicts π W ( a∣s) given s

All state-action pairs from M best sessions

Elite=[( s 0 , a0 ) ,(s 1 , a1 ) ,(s 2 , a2 ) ,... ,( s k , a k )]

Maximize the likelihood of actions taken in the “best” games:

π = argmax_π ∑_{(s_i,a_i)∈Elite} log π(a_i∣s_i)
69
Approximate (deep) version
Neural network predicts π W ( a∣s) given s

All state-action pairs from M best sessions

best=[( s 0 , a0 ) ,(s1 , a1 ) ,( s2 , a 2) , ..., (s K , a K )]

Maximize likelihood of actions in “best” games


conveniently,
nn.fit(elite_states,elite_actions)
70
Approximate (deep) version

71
Approximate (deep) version
● Initialize NN weights: W_0 ← random

● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]

72
Approximate (deep) version
● Initialize NN weights: W_0 ← random
  model = MLPClassifier()
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]

model.fit(elite_states,elite_actions)
73
Continuous action spaces
● Continuous action space

● Model π_W(a∣s) = N(μ(s), σ²)
– μ(s) is the neural network output
– σ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]
74

What changed?
Continuous action spaces
● Continuous action space                          model = MLPRegressor()

● Model π_W(a∣s) = N(μ(s), σ²)
– μ(s) is the neural network output
– σ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– W_{i+1} = W_i + α·∇_{W_i} [ ∑_{(s_i,a_i)∈Elite} log π_W(a_i∣s_i) ]
  model.fit(elite_states, elite_actions)                                   75

Nothing!
Tricks
● Remember sessions from 3-5 past iterations
– Threshold and use all of them when training
– May converge slower if env is easy to solve.

● Regularize with entropy


– to prevent premature convergence.

● Parallelize sampling
● Use RNNs if partially-observable 76
Monte-carlo: upsides
● Great for short episodic problems
● Very modest assumptions
– Easy to extend to continuous actions, partial
observations and more

77
Monte-carlo: downsides
● Need full session to start learning
● Require a lot of interaction
– A lot of crashed robots / simulations

78
Gonna fix that next lecture!
● Need full session to start learning
● Require a lot of interaction
– A lot of crashed robots / simulations

79
Seminar

80
Practical RL – Week 2
Shvechikov Pavel
Previously in the course

● The MDP formalism


○ State, Action, Reward, next State
● Cross-Entropy Method (CEM)
○ easy to implement
○ competitive results
○ black box
■ no knowledge of environment
■ no knowledge of intermediate rewards

Improve on the CEM → dive into the black box


Provided we know everything about the environment, how do we find an optimal policy?
Goal: solve the MDP by finding an optimal policy

1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
Explaining goals to agent through reward

Reward hypothesis (R.Sutton)

Goals and purposes can be thought of as the maximization of the


expected value of the cumulative sum of a received scalar signal
Explaining goals to agent through reward

Reward hypothesis (R.Sutton)

Goals and purposes can be thought of as the maximization of the


expected value of the cumulative sum of a received scalar signal

Cumulative reward is called a return:

E.g.: reward in chess – value of taken opponent's piece


Explaining goals to agent through reward

Reward hypothesis (R.Sutton)

Goals and purposes can be thought of as the maximization of the


expected value of the cumulative sum of a received scalar signal

Cumulative reward is called a return: the sum of immediate rewards
collected up to the end of the episode.

E.g.: reward in chess – value of taken opponent's piece


E.g.: data center non-stop cooling system

● States – temperature measurements


● Actions – different fans speed
● R = 0 for exceeding temperature thresholds
● R = +1 for each second system is cool
What could go wrong with such a design?
E.g.: data center non-stop cooling system

● States – temperature measurements


● Actions – different fans speed
● R = 0 for exceeding temperature thresholds
● R = +1 for each second system is cool
What could go wrong with such a design?

Infinite return for non-optimal behaviour!


E.g.: cleaning robot
● States – dust sensors, air
● Actions – cleaning / rest / conditioning on or off
● R = 100 for long tedious floor cleaning task done
● R = 1 for turning air conditioning on-off
● Episode ends each day
What could go wrong with such a design?
E.g.: cleaning robot
● States – dust sensors, air
● Actions – cleaning / rest / conditioning on or off
● R = 100 for long tedious floor cleaning task done
● R = 1 for turning air conditioning on-off
● Episode ends each day
What could go wrong with such a design?

Reward(air) = 1  <  Reward(cleaning) = 100
Time(air) ≈ one second  <<  Time(cleaning) ≈ whole day

The robot earns more per day by toggling the air conditioner than by cleaning,
so it rests and toggles: a positive feedback loop!
Reward discounting
Reward discounting

Get rid of the infinite sum by discounting:

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...,   with discount factor γ ∈ (0, 1)

The same cake, compared to eating it today, is worth

● γ times less tomorrow

● γ² times less the day after tomorrow

(if you eat it day by day)
Discounting makes sums finite

Maximal return for a constant reward R = +1 (this is 1/(1−γ)):

    γ:             0.9     0.95    0.99
    max return:     10       20     100

Any discounting
changes the optimisation
task and its solution!
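A tiny sketch of computing a discounted return from a finite list of rewards, checking the table above for γ = 0.99:

def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... (computed back-to-front)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0] * 1000, gamma=0.99))   # ~99.995, approaching the 1/(1-gamma) = 100 bound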
Discounting is inherent to humans

● Quasi-hyperbolic

● Hyperbolic discounting

Laibson, D. (1997). Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2), 443-478.
Discounting is inherent to humans

● Quasi-hyperbolic

● Hyperbolic discounting

Mathematical convenience

Remember this one!


We will need it later

Laibson, D. (1997). Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2), 443-478.
Discounting is a stationary end-of-effect model
Any action affects (1) immediate reward (2) next state
Discounting is a stationary end-of-effect model
Any action affects (1) immediate reward (2) next state
Action indirectly affects future rewards
But how long does this effect last?

G is expected return under stationary end-of-effect model


Discounting is a stationary end-of-effect model
Any action affects (1) immediate reward (2) next state
Action indirectly affects future rewards
But how long does this effect last?

“Effect continuation” probability: γ
“End of effect” probability: 1 − γ

G is expected return under stationary end-of-effect model


Reward design – don’t shift, reward for WHAT
● E.g.: chess – value of taken opponent's piece
○ Problem: agent will not have a desire to win!

● E.g.: cleaning robot, +100 (cleaning), +0.1 (on-off)


○ Problem: agent will not bother cleaning the floor!
Reward design – don’t shift, reward for WHAT
● E.g.: chess – value of taken opponent's piece
○ Problem: agent will not have a desire to win!

● E.g.: cleaning robot, +100 (cleaning), +0.1 (on-off)


○ Problem: agent will not bother cleaning the floor!
Take away: reward only for WHAT, but never for HOW

(Diagram: Start at S1, End at S4; the path through S2 has rewards −9 and −1, the path through S3 has rewards −1 and −1.)
Reward design – don’t shift, reward for WHAT
● E.g.: chess – value of taken opponent's piece
○ Problem: agent will not have a desire to win!

● E.g.: cleaning robot, +100 (cleaning), +0.1 (on-off)


○ Problem: agent will not bother cleaning the floor!
Take away: reward only for WHAT, but never for HOW

(Same diagram: Start at S1, End at S4; path through S2 with rewards −9, −1; path through S3 with rewards −1, −1.)

Take away: do not subtract the mean from rewards
Reward design – scaling, shaping

What transformations do not change optimal policy?


● Reward scaling – division by nonzero constant
○ May be useful in practice for approximate methods

Ng, A. Y., Harada, D., & Russell, S. (1999, June). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML (Vol. 99,
pp. 278-287).
Reward design – scaling, shaping

What transformations do not change optimal policy?


● Reward scaling – division by nonzero constant
○ May be useful in practice for approximate methods
● Reward shaping – we could add to all rewards in
MDP values of potential-based shaping function F(s,
a, s’) without changing an optimal policy:

Intuition: when no discounting F adds as much as it


subtracts from the total return
Ng, A. Y., Harada, D., & Russell, S. (1999, June). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML (Vol. 99,
pp. 278-287).
Lecture plan

1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
How to find optimal policy?

Dynamic programming!

Method to solve a complex problem by


● breaking it into small pieces
● until no more unsolved pieces
○ solve a single piece using solutions of previous pieces

DP equations lie at the heart of RL


It is essential to deeply understand them.
How to find optimal policy?

We know! Maximize cumulative discounted return!


How to find optimal policy?

We know! Maximize cumulative discounted return!

But policy and / or


environment could be
random!

Let's get rid of the randomness by taking an expectation!


Equivalent variants of notation in RL
Equivalent variants of notation in RL
State-value function v(s)

v(s) is expected return conditional on state:

Intuition: value of following policy from state s


State-value function v(s)

v(s) is expected return conditional on state:


stochasticity in policy & environment

Intuition: value of following policy from state s


State-value function v(s)

v(s) is expected return conditional on state:


stochasticity in policy & environment

Environment
stochasticity

Policy
stochasticity

Intuition: value of following policy from state s


State-value function v(s)

v(s) is expected return conditional on state:


stochasticity in policy & environment

Environment
stochasticity

Policy
stochasticity
By definition
Intuition: value of following policy from state s
Bellman expectation equations
Bellman expectation equation for v(s)

Recursive definition of v(s) is an important concept in RL


Bellman expectation equation for v(s)

Recursive definition of v(s) is an important concept in RL

Backup
diagram
Bellman expectation equation for v(s)

Recursive definition of v(s) is an important concept in RL

Backup
diagram
Action-value function q(s, a)

Is expected return conditional on state and action:

Intuition: value of following policy after committing


action a in state s
Action-value function q(s, a)

Is expected return conditional on state and action:

Intuition: value of following policy after committing


action a in state s
No policy
stochasticity at
first step
Relations between v(s) and q(s,a)

We already know how to write q(s,a) in terms of v(s)

What about v(s) in terms of q(s,a)?


Relations between v(s) and q(s,a)

We already know how to write q(s,a) in terms of v(s)

What about v(s) in terms of q(s,a)?

So, we could now write q(s, a) in terms of q(s,a)!
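The formulas on these slides were images in the original deck and did not survive extraction; the standard relations being described are:

v_π(s) = ∑_a π(a∣s) · q_π(s, a)

q_π(s, a) = ∑_{s', r} p(s', r ∣ s, a) · [ r + γ·v_π(s') ]

Substituting the first into the second gives q in terms of q:

q_π(s, a) = ∑_{s', r} p(s', r ∣ s, a) · [ r + γ·∑_{a'} π(a'∣s') · q_π(s', a') ]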


Bellman expectation equation for q(s,a)

Backup
diagram
for q(s, a)
Bellman expectation equation for q(s,a)

Backup
diagram
for q(s, a)
What do we gonna do with value functions?

Already know
● Bellman equations – assess policy performance
● Return, value- and action-value functions
Want to find an optimal policy:
● optimal actions in each possible state

But how to know which policy is better?


How to compare them?
Optimal policy is the one with biggest v(s)

We could compare policies on the basis of v(s)

Best policy is better or equal to any other policy

Use optimal policy from s In any finite MDP there is


always at least one
deterministic optimal policy

Commit action a, and afterwards use optimal policy


Bellman optimality equations
Bellman optimality equation for v(s)

Bellman expectation Bellman optimality


equation for v(s) equation for v*(s)
Bellman optimality equation for v(s)

max

Bellman expectation Bellman optimality


equation for v(s) equation for v*(s)
Bellman optimality equation for v(s)

max

Bellman expectation Bellman optimality


equation for v(s) equation for v*(s)
Bellman optimality equation for q(s,a)

Bellman expectation Bellman optimality


equation for q(s,a) equation for q*(s, a)
Bellman optimality equation for q(s,a)

max max

Bellman expectation Bellman optimality


equation for q(s,a) equation for q*(s, a)
Bellman optimality equation for q(s,a)

max max

Bellman expectation Bellman optimality


equation for q(s,a) equation for q*(s, a)
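Again, the equations themselves were images; the standard Bellman optimality equations being compared with the expectation equations are:

v*(s) = max_a ∑_{s', r} p(s', r ∣ s, a) · [ r + γ·v*(s') ]

q*(s, a) = ∑_{s', r} p(s', r ∣ s, a) · [ r + γ·max_{a'} q*(s', a') ]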
Bellman equations: operator view
Bellman equations: operator view

Bellman expectation equation for v(s)

Bellman expectation equation for q(s,a)

Bellman optimality equation for v*(s)

Bellman optimality equation for q*(s,a)


What’s next?

Now we are equipped with heavy artillery of


● Bellman expectation equation for v(s) and q(s,a)
● Bellman optimality equation for v*(s) and q*(s,a)

That will be our toolkit for finding optimal policy


using dynamic programming!
Lecture plan

1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
Policy evaluation
If you can't measure it, you can't improve it.
Peter Drucker

Policy evaluation: motivation

Policy evaluation is also called prediction problem:


● predict value function for a particular policy.

Bellman expectation equation

is basically a system of linear equations where


● # of unknowns = # of equations = # of states
Policy evaluation: algorithm

Bellman expectation
equation for v(s)
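The algorithm box on this slide was an image; a minimal numpy sketch of iterative policy evaluation, assuming the dynamics are given as arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected immediate reward):

import numpy as np

def evaluate_policy(policy, P, R, gamma=0.99, tol=1e-6):
    """Iterative policy evaluation via the Bellman expectation equation for v(s)."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v                  # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v(s')
        v_new = (policy * q).sum(axis=1)       # v(s) = sum_a pi(a|s) * q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new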
Policy improvement
Policy improvement: an idea

Once we know what is v(s) for a particular policy


We could improve it by acting greedily w.r.t. v(s)!

This procedure is guaranteed to produce a better policy!


Policy improvement: an idea

Once we know what is v(s) for a particular policy


We could improve it by acting greedily w.r.t. v(s)!

This procedure is guaranteed to produce a better policy!


if  q_π(s, π'(s)) ≥ v_π(s)  for all states s,
then  v_{π'}(s) ≥ v_π(s),
meaning that π' is at least as good as π
Policy improvement: convergence

If new policy after improvement

is the same as old one

then it is optimal !
Policy improvement: convergence

If new policy after improvement

is the same as old one


Bellman
optimality
then it is optimal ! equation
Determining optimal policy from v*(s), q*(s,a)

If q* is known – how to recover the optimal policy?

If v* is known – how to recover the optimal policy?


Determining optimal policy from v*(s), q*(s,a)

If q* is known – how to recover the optimal policy?

If v* is known – how to recover the optimal policy?

Unknown model dynamics → unable to recover optimal


policy from v*
Precise evaluation is not needed
(Figure: value function → greedy policy, shown at two stages of the iteration.)
Roadmap

Now we know what is

● Policy evaluation (based on Bellman expectation eq)


● Policy improvement (based on Bellman optimality eq)

The finishing touches:


how to combine them to obtain optimal policy?
Lecture plan

1. Reward design
2. Bellman Equations
a. state-value function
b. action-value function
3. Policy: evaluation and improvement
4. Generalized Policy Iteration
a. Policy Iteration
b. Value iteration
The idea of policy and value iterations
Policy evaluation

Policy improvement
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Policy improvement
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Robustness: Policy improvement
● No dependence on initialization
● No need in complete policy evaluation (states / converg.)
● No need in exhaustive update (states)
○ Example of update robustness:
■ Update only one state at a time
■ in a random direction
■ that is correct only in expectation
The idea of policy and value iterations
Generalized policy iteration Policy evaluation
1. Evaluate given policy
2. Improve policy by acting greedily
w.r.t. to its value function
Policy improvement
Policy iteration
1. Evaluate policy until convergence (with some tolerance)
2. Improve policy

Value iteration
1. Evaluate policy only with single iteration
2. Improve policy
Policy iteration
Policy iteration: scheme

Bellman expectation
equation for v(s)

q(s,a)
Value iteration
Value iteration

Bellman optimality
equation for v(s)
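As with policy evaluation, the update itself was an image; a minimal numpy sketch of value iteration under the same P[s, a, s'] / R[s, a] assumptions:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Value iteration via the Bellman optimality equation for v(s)."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v                  # q[s, a]
        v_new = q.max(axis=1)                  # v(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)     # optimal state values and a greedy policy
        v = v_new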
Value iteration (VI) vs. Policy iteration (PI)

● VI is faster per iteration – O(|A|·|S|²)


● VI requires many iterations

● PI is slower per iteration – O(|A|·|S|² + |S|³)


● PI requires few iterations

No silver bullet → experiment with # of steps spent in


policy evaluation phase to find the best
Model-free reinforcement learning

1
Previously...

● V(s) and V*(s)

● know V* and P(s'|s,a) → know optimal policy

● We can learn V* with dynamic programming

V_{i+1}(s) := max_a [ r(s, a) + γ·E_{s'∼P(s'∣s,a)} V_i(s') ]

2
Decision process in the wild

Agent
Can do anything
observation

action
Can't even see
(worst case)

3
Environment
Model-free setting:
We don't know actual
P(s',r|s,a)

Whachagonnado?

5
Model-free setting:
We don't know actual
P(s',r|s,a)

Learn it?
Get rid of it?

6
More new letters

● Vπ(s) – expected G from state s if you follow π


● V*(s) – expected G from state s if you follow π*

7
More new letters

● Vπ(s) – expected G from state s if you follow π


● V*(s) – expected G from state s if you follow π*

● Qπ(s,a) – expected G from state s


– if you start by taking action a
– and follow π from next state on

● Q*(s,a) – guess what it is :)


8
More new letters

● Vπ(s) – expected G from state s if you follow π


● V*(s) – expected G from state s if you follow π*

● Qπ(s,a) – expected G from state s


– if you start by taking action a
– and follow π from next state on

● Q*(s,a) – same as Qπ(s,a) where π = π*


9
Trivia
● Assuming you know Q*(s,a),
– how do you compute π*

– how do you compute V*(s)?

● Assuming you know V(s)


– how do you compute Q(s,a)?

10
To sum up
Q*(s, a) = E_{s',r} [ r(s, a) + γ·V*(s') ]

V*(s) = max_a Q*(s, a)

Image: cs188x

Action value Q_π(s,a) is the expected total reward G the agent gets from
state s by taking action a and following policy π from the next state on.

π*(s) = argmax_a Q*(s, a)                                                  11


Learning from
r
trajectories
prev s s'
s a'
a s''
prev a a''

r'

Model-based: you know P(s'|s,a)


- can apply dynamic programming
- can plan ahead

Model-free: you can sample trajectories


- can try stuff out
- insurance not included 12
MDP trajectory
r
prev s s'
s a'
a s''
prev a a''

r'

● Trajectory is a sequence of
– states (s)
– actions (a)
– rewards (r)

● We can only sample trajectories 13


MDP trajectory
r
prev s s'
s a'
a s''
prev a a''

r'

● Trajectory is a sequence of
– states (s) Q: What to learn?
V(s) or Q(s,a)
– actions (a)
– rewards (r)

● We can only sample trajectories 14


MDP trajectory
r
prev s s'
s a'
a s''
prev a a''

r'

● Trajectory is a sequence of
– states (s) Q: What to learn?
V(s) or Q(s,a)
– actions (a)
– rewards (r) V(s) is useless
without P(s'|s,a)

● We can only sample trajectories 15


Idea 1: monte-carlo
● Get all trajectories containing particular (s,a)
● Estimate G(s,a) for each trajectory
● Average them to get expectation

+1

Cake!

16
Idea 1: monte-carlo
● Get all trajectories containing particular (s,a)
● Estimate G(s,a) for each trajectory
● Average them to get expectation
takes a lot of sessions

17

Image: super meat boy


Idea 2: temporal difference
● Remember we can improve Q(s,a) iteratively!

Q(s_t, a_t) ← E_{r_t, s_{t+1}} [ r_t + γ·max_{a'} Q(s_{t+1}, a') ]

18
Idea 2: temporal difference
● Remember we can improve Q(s,a) iteratively!

Q(s_t, a_t) ← E_{r_t, s_{t+1}} [ r_t + γ·max_{a'} Q(s_{t+1}, a') ]

That's Q*(s,a) That's value for π*


aka optimal policy

19
Idea 2: temporal difference
● Remember we can improve Q(s,a) iteratively!

Q(s_t, a_t) ← E_{r_t, s_{t+1}} [ r_t + γ·max_{a'} Q(s_{t+1}, a') ]

That's Q*(s,a) That's value for π*


aka optimal policy

That's something
we don't have

What do we do?
20
Idea 2: temporal difference

21
Idea 2: temporal difference
● Replace expectation with sampling

1
E r t + γ⋅max a ' Q(s t +1 , a ')≈ ∑ r i + γ⋅max a ' Q (si , a ')
next

r ,s
t t+1
N i

22
Idea 2: temporal difference
● Replace expectation with sampling

E_{r_t, s_{t+1}} [ r_t + γ·max_{a'} Q(s_{t+1}, a') ] ≈ (1/N) ∑_i [ r_i + γ·max_{a'} Q(s_i', a') ]

● Use a moving average with just one sample!

Q(s_t, a_t) ← α·(r_t + γ·max_{a'} Q(s_{t+1}, a')) + (1−α)·Q(s_t, a_t)


23
Q-learning r
prev s s'
s a'
a s''
prev a a''

r'

● Works on a sequence of
– states (s)
– actions (a)
– rewards (r)

24
Q-learning
r
prev s s'
s a'
a s''
prev a a''

r'
Initialize Q(s,a) with zeros
● Loop:
– Sample <s,a,r,s'> from env

25
Q-learning
(diagram: trajectory … s → a → r → s', with possible next actions and their values Q(s', a_0), Q(s', a_1), Q(s', a_2))

Initialize Q(s,a) with zeros
● Loop:
– Sample <s,a,r,s'> from the environment

– Compute  Q̂(s,a) = r(s,a) + γ·max_{a_i} Q(s', a_i)

– Update   Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a)
27
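A minimal tabular Q-learning sketch (assuming a discrete gym-style environment with the classic API where step() returns (obs, reward, done, info); the ε-greedy exploration used here is discussed a few slides below):

import numpy as np

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))      # initialize Q(s,a) with zeros
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)                          # sample <s, a, r, s'>
            target = r + gamma * Q[s_next].max() * (not done)         # Q_hat(s,a); no bootstrap at terminal states
            Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]          # moving-average update
            s = s_next
    return Q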
Recap
Monte-carlo                              Temporal Difference

● Averages Q over sampled paths          ● Uses a recurrent formula for Q

(diagrams: a sampled trajectory ending in a reward vs. a one-step backup over Q(s', a_0), Q(s', a_1), Q(s', a_2))

28
Nuts and bolts: MC vs TD
Monte-carlo                              Temporal Difference

● Averages Q over sampled paths          ● Uses a recurrent formula for Q

● Needs a full trajectory to learn       ● Learns from partial trajectories;
                                           works with infinite MDPs

● Less reliant on the Markov property    ● Needs less experience to learn


What could possibly go wrong?
Our mobile robot learns to walk.

Initial Q(s,a) are zeros


robot uses argmax Q(s,a)

He has just learned to crawl with positive reward! 30


What could possibly go wrong?
Our mobile robot learns to walk.

Initial Q(s,a) are zeros


robot uses argmax Q(s,a)

Too bad, now he will never learn to walk upright =(


31
What could possibly go wrong?
New problem:

If our agent always takes “best” actions


from his current point of view,

How will he ever learn that other actions


may be better than his current best one?

Ideas?
32
Exploration Vs Exploitation
Balance between using what you learned and trying to find
something even better

33
Exploration Vs Exploitation
Strategies:
• ε-greedy
• With probability ε take random action;

otherwise take optimal action.

34
Exploration Vs Exploitation
Strategies:
• ε-greedy
• With probability ε take random action;

otherwise take optimal action.


• Softmax
Pick action proportional to softmax of shifted
normalized Q-values.
π(a∣s) = softmax( Q(s, a) / τ )

• More cool stuff coming later


35
Exploration over time

Idea:
If you want to converge to optimal policy,
you need to gradually reduce exploration

Example:

Initialize ε-greedy ε = 0.5, then gradually reduce it

• If ε → 0, it's greedy in the limit


• Be careful with non-stationary environments 36
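A short sketch of both strategies plus gradual ε decay (the decay schedule is just an illustrative choice):

import numpy as np

def epsilon_greedy(Q_s, epsilon):
    """Q_s: vector of Q(s, a) for the current state."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_s))             # explore: random action
    return int(np.argmax(Q_s))                         # exploit: current best action

def softmax_action(Q_s, tau=1.0):
    """Pick an action with probability proportional to softmax(Q(s, a) / tau)."""
    z = (Q_s - Q_s.max()) / tau                        # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(Q_s), p=p))

epsilon = 0.5
for episode in range(10000):
    # ... play one episode, picking actions with epsilon_greedy(Q[s], epsilon) ...
    epsilon = max(0.01, epsilon * 0.999)               # gradually reduce exploration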
Cliff world

37
Picture from Berkeley CS188x
Cliff world

Conditions
• Q-learning

γ=0.99 ϵ=0.1
• no slipping

Trivia:
What will q-learning learn?

38
Cliff world

Conditions
• Q-learning

γ=0.99 ϵ=0.1
• no slipping

Trivia:
What will q-learning learn?
follow the short path

Will it maximize reward?


39
Cliff world

Conditions
• Q-learning

γ=0.99 ϵ=0.1
• no slipping

Trivia:
What will q-learning learn?
follow the short path

Will it maximize reward?


no, robot will fall due to 40
epsilon-greedy “exploration”
Cliff world

Conditions
• Q-learning

γ=0.99 ϵ=0.1
• no slipping

Decisions must account


for actual policy!
e.g. ε-greedy policy

41
Generalized update rule
Update rule (from Bellman eq.)
Q(s_t, a_t) ← α·Q̂(s_t, a_t) + (1−α)·Q(s_t, a_t)

“better Q(s,a)”

42
Q-learning VS SARSA
Update rule (from Bellman eq.)
Q(s_t, a_t) ← α·Q̂(s_t, a_t) + (1−α)·Q(s_t, a_t)

Q-learning (“better Q(s,a)”):

Q̂(s,a) = r(s,a) + γ·max_{a'} Q(s', a')

43
Q-learning VS SARSA
Update rule (from Bellman eq.)
Q(s_t, a_t) ← α·Q̂(s_t, a_t) + (1−α)·Q(s_t, a_t)

Q-learning (“better Q(s,a)”):

Q̂(s,a) = r(s,a) + γ·max_{a'} Q(s', a')

SARSA:

Q̂(s,a) = r(s,a) + γ·E_{a'∼π(a'∣s')} Q(s', a')
44
Recap: Q-learning
(diagram: trajectory … s → a → r → s', with possible next actions and their values Q(s', a_0), Q(s', a_1), Q(s', a_2))

∀ s ∈ S, ∀ a ∈ A:  Q(s,a) ← 0
Loop:
– Sample <s,a,r,s'> from env

– Compute  Q̂(s,a) = r(s,a) + γ·max_{a_i} Q(s', a_i)

– Update   Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a)
45
SARSA

(diagram: trajectory … s → a → r → s' → a' → …, bootstrapping on Q(s', a'))

∀ s ∈ S, ∀ a ∈ A:  Q(s,a) ← 0
Loop:                                                  (hence “SARSA”)
– Sample <s,a,r,s',a'> from env

– Compute  Q̂(s,a) = r(s,a) + γ·Q(s', a')               (the next action actually taken, not the max)

– Update   Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a)
47
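A minimal tabular SARSA sketch (same assumptions as the Q-learning sketch above; policy(Q, s) stands for whatever behaviour policy the agent actually follows, e.g. ε-greedy):

import numpy as np

def sarsa(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = policy(Q, s)
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = policy(Q, s_next)                             # <s, a, r, s', a'>
            target = r + gamma * Q[s_next, a_next] * (not done)    # bootstrap on the next action, not the max
            Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]
            s, a = s_next, a_next
    return Q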
Expected value SARSA
(diagram: trajectory … s → a → r → s', with possible next actions and their values Q(s', a_0), Q(s', a_1), Q(s', a_2))

∀ s ∈ S, ∀ a ∈ A:  Q(s,a) ← 0
Loop:
– Sample <s,a,r,s'> from env

– Compute  Q̂(s,a) = r(s,a) + γ·E_{a_i∼π(a∣s')} Q(s', a_i)      (expected value over the policy)

– Update   Q(s,a) ← α·Q̂(s,a) + (1−α)·Q(s,a)
49
Difference

● SARSA gets optimal


rewards under current
policy

● Q-learning policy
would be optimal Q-learning

under SARSA

50
On-policy vs Off-policy
Two problem setups
on-policy                                off-policy

Agent can pick actions                   Agent can't pick actions

– Most obvious setup :)                  – Learning with exploration,
                                           playing without exploration
– Agent always follows its               – Learning from an expert
  own policy                               (the expert is imperfect)
                                         – Learning from sessions
                                           (recorded data)
51
On-policy vs Off-policy
Two problem setups
on-policy                                off-policy

Agent can pick actions                   Agent can't pick actions

– On-policy algorithms can't             – Off-policy algorithms can
  learn off-policy                         learn on-policy

                                         – Can learn the optimal policy even if
                                           the agent takes random actions

Q: which of Q-learning, SARSA and exp. val. SARSA 52


will only work on-policy?
On-policy vs Off-policy
Two problem setups
on-policy off-policy

Agent can pick actions Agent can't pick actions

– On-policy algorithms can't – Off-policy algorithms can


learn off-policy learn on-policy
– SARSA – Q-learning
– more later – Expected Value SARSA

53
On-policy vs Off-policy
Two problem setups
on-policy off-policy

Agent can pick actions Agent can't pick actions

– On-policy algorithms can't – Off-policy algorithms can


learn off-policy learn on-policy
– SARSA – Q-learning
– more coming soon – Expected Value SARSA

54
Experience replay Interaction

Idea: store several past interactions


<s,a,r,s'>
Train on random subsamples Agent

training

<s,a,r,s'>
<s,a,r,s'>
training
batches ~ <s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>

Replay
56
buffer
Experience replay Interaction

Idea: store several past interactions


<s,a,r,s'>
Train on random subsamples Agent

Training curriculum:
- play 1 step and record it training
- pick N random transitions to train
<s,a,r,s'>
<s,a,r,s'>
Profit: you don't need to re-visit same
(s,a) many times to learn it.
training
batches ~ <s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>

Only works with off-policy algorithms!        (Btw, why only them?)         Replay buffer   57
Experience replay Interaction

Idea: store several past interactions


<s,a,r,s'>
Train on random subsamples Agent

Training curriculum:
- play 1 step and record it training
- pick N random transitions to train
<s,a,r,s'>
<s,a,r,s'>
Profit: you don't need to re-visit same
(s,a) many times to learn it.
training
batches ~ <s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
<s,a,r,s'>
Old (s,a,r,s') tuples come from an older/weaker version of the policy!
Only works with off-policy algorithms!                                      Replay buffer   58
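A minimal replay-buffer sketch (a plain deque with uniform sampling; the class and names are my own, not the course implementation):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past <s, a, r, s', done> transitions."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)          # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.storage, min(batch_size, len(self.storage)))
        return map(list, zip(*batch))                  # states, actions, rewards, next_states, dones

# after each environment step:  buffer.add(s, a, r, s_next, done)
# then train an off-policy method (e.g. Q-learning) on buffer.sample(batch_size)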
New stuff we learned
● Anything?

59
New stuff we learned
● Q(s,a),Q*(s,a)

● Q-learning, SARSA
– We can learn from trajectories (model-free)

● Exploration vs exploitation (basics)

● Learning On-policy vs Off-policy


– Using experience replay 60
Coming next...
● What if state space is large/continuous
● Deep reinforcement learning

61
Reinforcement learning
Episode 4

Approximate reinforcement learning

1
Recap: Q-learning
One approach:
action Q-values

Q(s, a) = E_{s'} [ r(s, a) + γ·V(s') ]

Action value Q(s,a) is the expected total reward G the agent gets from
state s by taking action a and following policy π from the next state on.

π(s) = argmax_a Q(s, a)                                                    2


Recap: Q-learning
One approach:
action Q-values

Q(s, a) = E_{s'} [ r(s, a) + γ·V(s') ]

We can replace P(s'∣s,a) with sampling:

Q(s_t, a_t) ← α·(r_t + γ·max_{a'} Q(s_{t+1}, a')) + (1−α)·Q(s_t, a_t)

π(s) = argmax_a Q(s, a)                                                    3


Q-learning as MSE minimization

Given <s,a,r,s'> minimize


L = [ Q(s_t, a_t) − Q^true(s_t, a_t) ]²

L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²

How to optimize?
4
Q-learning as MSE minimization

Given <s,a,r,s'> minimize


L = [ Q(s_t, a_t) − Q^true(s_t, a_t) ]²

L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²

For tabular Q(s,a):

∇L = 2·[ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]


5
Q-learning as MSE minimization

Given <s,a,r,s'> minimize


L = [ Q(s_t, a_t) − Q^true(s_t, a_t) ]²

L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²

For tabular Q(s,a):

∇L ≈ 2·[ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]

Something's sooo wrong!                                                    6


Q-learning as MSE minimization

Given <s,a,r,s'> minimize


L = [ Q(s_t, a_t) − Q^true(s_t, a_t) ]²                                  (target considered const)

L ≈ [ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]²                 (second term considered const)

For tabular Q(s,a):

∇L ≈ 2·[ Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')) ]


7
Q-learning as MSE minimization

For tabular Q(s,a)


2
L≈[Q(s t , a t )−(r t +γ⋅max a ' Q(s t +1 , a ' ))]

∇ L≈2⋅[Q (s t , a t )−(r t +γ⋅max a ' Q (s t +1 , a '))]


Gradient descent step:

Q(s , a):=Q(s , a)−α⋅2[Q(st , at )−(r t +γ⋅maxa ' Q(st +1 , a '))]

8
Q-learning as MSE minimization

For tabular Q(s,a)


2
L≈[Q(s t , a t )−(r t +γ⋅max a ' Q(s t +1 , a ' ))]

∇ L≈2⋅[Q (s t , a t )−(r t +γ⋅max a ' Q (s t +1 , a '))]


Gradient descent step:

Q(s , a):=Q(s , a)(1−2 α)+2 α (r t +γ⋅max a ' Q(s t +1 , a ' ))

9
Q-learning as MSE minimization

For tabular Q(s,a)


2
L≈[Q(s t , a t )−(r t +γ⋅max a ' Q(s t +1 , a ' ))]

∇ L≈2⋅[Q (s t , a t )−(r t +γ⋅max a ' Q (s t +1 , a '))]


Gradient descent step:

Q(s , a):=Q(s , a)(1−2 α)+2 α (r t +γ⋅max a ' Q(s t +1 , a ' ))

= moving average formula 10


(define alpha' = 2*alpha)
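The same loss with function approximation: a minimal sketch of a linear model Q_w(s, a) = w[a]·s trained with the semi-gradient step described above (the bracketed target treated as a constant); this is my own illustration, not the exact course implementation:

import numpy as np

class LinearQ:
    """Q_w(s, a) = w[a] . s, trained by a semi-gradient step on the squared TD error."""
    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.99):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma = alpha, gamma

    def q_values(self, s):
        return self.w @ np.asarray(s)

    def update(self, s, a, r, s_next, done):
        s = np.asarray(s)
        target = r + self.gamma * self.q_values(s_next).max() * (not done)   # treated as const
        td_error = self.q_values(s)[a] - target
        self.w[a] -= self.alpha * td_error * s       # gradient of 0.5 * td_error^2 w.r.t. w[a]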
Real world

How many states are there?


approximately

11
Real world

|S| ≈ 2^(8·210·160) = 729179546432630...

(80917 digits :)

12
Problem:
State space is usually large,
sometimes continuous.
And so is action space;

However, states do have a structure, similar


states have similar action outcomes.

13
Problem:
State space is usually large,
sometimes continuous.
And so is action space;

Two solutions:
– Binarize state space (last week)
– Approximate agent with a function (crossentropy method)
14
Which one would you prefer for atari?
Problem:
State space is usually large,
sometimes continuous.
And so is action space;

Two solutions:
– Binarize state space Too many bins or handcrafted features

– Approximate agent with a function Let's pick this one


15
From tables to approximations

● Before:
– For all states, for all actions, remember Q(s,a)

● Now:
– Approximate Q(s,a) with some function
– e.g. linear model over state features

argmin_{w,b} ( Q(s_t, a_t) − [r_t + γ·max_{a'} Q(s_{t+1}, a')] )²

Trivia: should we use classification or regression model?


(e.g. logistic regression Vs linear regression) 16
From tables to approximations

● Before:
– For all states, for all actions, remember Q(s,a)

● Now:
– Approximate Q(s,a) with some function
– e.g. linear model over state features

argmin_{w,b} ( Q(s_t, a_t) − [r_t + γ·max_{a'} Q(s_{t+1}, a')] )²

● Solve it as a regression problem!


17
MDP again
Agent

apply
observe
action

Obser
action
vation

Environment
Approximate Q-learning

(diagram: a model with parameters W maps an image observation to Q(s,a_0), Q(s,a_1), Q(s,a_2))

Q-values:

Q̂(s_t, a_t) = r + γ·max_{a'} Q̂(s_{t+1}, a')

Objective:

L = ( Q(s_t, a_t) − [r + γ·max_{a'} Q(s_{t+1}, a')] )²

Gradient step:

w_{t+1} = w_t − α·∂L/∂w_t
Approximate Q-learning

(same diagram)

Q-values:

Q̂(s_t, a_t) = r + γ·max_{a'} Q̂(s_{t+1}, a')

Objective:

L = ( Q(s_t, a_t) − [r + γ·max_{a'} Q(s_{t+1}, a')] )²       (the bracketed target is considered const)

Gradient step:

w_{t+1} = w_t − α·∂L/∂w_t
Approximate SARSA
Objective:

Q(s,a0), Q(s,a1), Q(s,a2) ^


L=(Q(s t ,a t )−Q (s t ,a t ))2

consider const

Q-learning:

model ^ t ,at )=r+γ⋅maxa' Q(st +1 ,a')


Q(s
W = params
SARSA:
^ t ,at )=r+γ⋅Q (s t +1 ,at+1 )
Q(s
Expected Value SARSA:
image ^ t ,at )=r+γ⋅
Q(s ??? E Q(s t +1 ,a' )
a'∼π(a∣s)
Approximate SARSA

[diagram: image → model (W = params) → Q(s,a0), Q(s,a1), Q(s,a2)]

Objective:
L = (Q(s_t, a_t) − Q̂(s_t, a_t))^2        (second term considered const)

Q-learning:
Q̂(s_t, a_t) = r + γ·max_{a'} Q(s_{t+1}, a')

SARSA:
Q̂(s_t, a_t) = r + γ·Q(s_{t+1}, a_{t+1})

Expected Value SARSA:
Q̂(s_t, a_t) = r + γ·E_{a'∼π(a|s)} Q(s_{t+1}, a')
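A sketch of the three targets above for a discrete-action agent; q_next stands for the vector Q(s_{t+1}, ·), pi_next for π(·|s_{t+1}), and gamma is an assumed constant.

```python
import numpy as np

gamma = 0.99

def q_learning_target(r, q_next):
    return r + gamma * q_next.max()                 # greedy over a'

def sarsa_target(r, q_next, a_next):
    return r + gamma * q_next[a_next]               # the action actually taken

def expected_value_sarsa_target(r, q_next, pi_next):
    return r + gamma * (pi_next * q_next).sum()     # expectation under pi(a'|s')
```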
Deep RL 101

[diagram: observation → Dense → Dense → Dense → Q-values → ε-greedy rule → apply action]

● Q-values is a dense layer with no nonlinearity
● ε-greedy rule (tune ε or use a probabilistic rule)
● In between: whatever you found in your favorite deep learning toolkit
Architectures

Left: given (s,a), predict Q(s,a)
Right: given s, predict all q-values: Q(s,a0), Q(s,a1), Q(s,a2)

24
Architectures

Left: given (s,a), predict Q(s,a)
Right: given s, predict all q-values: Q(s,a0), Q(s,a1), Q(s,a2)

Trivia: in which situation does the left model work better? And the right one? 25
Architectures

Left: needs one forward pass for each action;
works if the action space is large;
efficient when not all actions are available from each state.

Right: needs one forward pass for all actions (faster).

26
What kind of network digests images well?

27
Deep learning approach: DQN

28
DQN

[diagram: image (i,w,h,3) → Dimshuffle to (i,3,w,h) → Conv0 → Conv1 → ... → Dense → Q-values → ε-greedy rule → apply action]

● Q-values is a dense layer with no nonlinearity
● ε-greedy rule (tune ε or use a probabilistic rule)
● Dimshuffle: change the axis order to fit in with lasagne convolutions
● In between: any neural network you can think of (conv, pool, dense, dropout, batchnorm, ...)
DQN

[diagram: image (i,w,h,3) → Dimshuffle to (i,3,w,h) → Conv0 → Conv1 → ... → Dense → Q-values → ε-greedy rule → apply action]

● Q-values is a dense layer with no nonlinearity
● ε-greedy rule (tune ε or use a probabilistic rule)
● Dimshuffle: change the axis order to fit in with lasagne convolutions
● In between: any neural network you can think of (conv, pool, dense, dropout, batchnorm, ...)
● Dropout and batchnorm: those two are a bit tricky (later)
TSNE makes every slide 40% better

● Embedding of pre-last layer activations


31
● Color = V(s) = max_a Q(s,a)
32
How bad is it if the agent spends
the next 1000 ticks under the left rock?
(while training)
33
Problem

● Training samples are not “i.i.d.”

● The model forgets parts of the environment
it hasn't visited for some time

● Drops on the learning curve

● Any ideas?
Multiple agent trick

Idea: throw in several agents with shared W.

[diagram: parameter server (W) ↔ agent0, agent1, agent2 ↔ env0, env1, env2]


Multiple agent trick

Idea: throw in several agents with shared W.

[diagram: parameter server (W) ↔ agent0, agent1, agent2 ↔ env0, env1, env2]

● Chances are, they will be exploring different parts of the environment
● More stable training
● Requires a lot of interaction

Trivia: your agent is a real robot car. Any problems?
Experience replay

Idea: store several past interactions <s,a,r,s'> in a replay buffer;
train on random subsamples (training batches).

Any +/- ?

[diagram: Agent → interaction <s,a,r,s'> → replay buffer → training batches → training]

37
Experience replay

Idea: store several past interactions <s,a,r,s'> in a replay buffer;
train on random subsamples (training batches).

● Atari DQN: >10^5 interactions
● Closer to i.i.d. training: the pool contains several sessions
● Older interactions were obtained under a weaker policy

Better versions coming next week

38
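A minimal replay buffer sketch matching the idea above; the capacity and batch size are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10**5):
        self.storage = deque(maxlen=capacity)     # old transitions fall out automatically

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.storage, batch_size)
        s, a, r, s_next, done = zip(*batch)       # random subsample -> training batch
        return list(s), list(a), list(r), list(s_next), list(done)
```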
Summary so far
To make the data closer to i.i.d., use one or several of:

– experience replay
– multiple agents
– infinitely small learning rate :)

advanced stuff coming next lecture

39
An important question

● You approximate Q(s,a) with a neural network


● You use experience replay when training

Trivia: which of those algorithms will fail?

– Q-learning – CEM
– SARSA
– Expected Value SARSA 40
An important question

● You approximate Q(s,a) with a neural network


● You use experience replay when training
The agent trains off-policy on an older version of itself

Trivia: which of those algorithms will fail?


Off-policy methods work, On-policy is super-slow (fail)
– Q-learning – CEM
– SARSA
– Expected Value SARSA 41
When training with on-policy methods,
– use no (or small) experience replay
– compensate with parallel game sessions

42
Deep learning meets MDP

– Dropout, noise
● Used in experience replay only: like the usual dropout
● Used when interacting: a special kind of exploration
● You may want to decrease p over time.

– Batchnorm
● Faster training but may break moving average
● Experience replay: may break down if buffer is too small
● Parallel agents: may break down under too few agents
<same problem of being non i.i.d.>
Final problem

Left or right?
44
Problem:
Most practical cases are partially observable:
Agent observation does not hold all information about process state
(e.g. human field of view).

Any ideas?

45
Problem:
Most practical cases are partially observable:
Agent observation does not hold all information about process state
(e.g. human field of view).

● However, we can try to infer hidden states from sequences of observations.

s_t ≃ m_t : P(m_t | o_t, m_{t−1})

● Intuitively, that's the agent's memory state.

46
Partially observable MDP

[diagram: hidden state → observation → Agent → action → environment]

The Markov assumption holds for the hidden state [but no one cares]

47
N-gram heuristic
Idea:
s_t ≠ o(s_t)
s_t ≈ (o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t))
e.g. ball movement in breakout

• One frame • Several frames 48


4-frame DQN

[diagram: image (i,w,h,3) → push onto a stack of 4 images (image t, t-1, t-2, t-3; delete the oldest frame) → Conv0 → Conv1 → ... → Dense → Q-values → ε-greedy rule → apply action]

● Q-values is a dense layer with no nonlinearity
● ε-greedy rule (tune ε or use a probabilistic rule)
● In between: any neural network you can think of (conv, pool, dense, dropout, batchnorm, ...)


N-gram heuristic
Idea:
s_t ≠ o(s_t)
s_t ≈ (o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t))
e.g. ball movement in breakout

• One frame • Several frames 50
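A sketch of the frame-stack heuristic above: the "state" is the last n observations concatenated along the channel axis. The class name and n = 4 are illustrative assumptions.

```python
import numpy as np
from collections import deque

class FrameStack:
    def __init__(self, n=4):
        self.frames = deque(maxlen=n)

    def reset(self, first_obs):
        for _ in range(self.frames.maxlen):       # fill the stack with the first frame
            self.frames.append(first_obs)
        return np.concatenate(self.frames, axis=-1)

    def step(self, obs):
        self.frames.append(obs)                   # push the new frame, drop the oldest
        return np.concatenate(self.frames, axis=-1)
```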


Alternatives

Ngrams:
• Nth-order Markov assumption
• Works for velocity/timers
• Fails for anything longer than N frames
• Impractical for large N

Alternative approach:
• Infer hidden variables given the observation sequence
• Kalman Filters, Recurrent Neural Networks
• More on that in a few lectures


51
Seminar

52
Autocorrelation

● The reference is based on predictions:

r + γ·max_{a'} Q(s_{t+1}, a')

● Any error in the Q approximation is propagated to neighbors

● If some Q(s,a) is mistakenly overestimated,
neighboring Q-values will also be increased in a cascade

● Worst case: divergence

● Any ideas?
Target networks
Idea: use older network snapshot
to compute reference

L = (Q(s_t, a_t) − [r + γ·max_{a'} Q^old(s_{t+1}, a')])^2

● Update Q^old periodically

● Slows down training

54
Target networks
Idea: use older network snapshot
to compute reference

L = (Q(s_t, a_t) − [r + γ·max_{a'} Q^old(s_{t+1}, a')])^2

● Update Q^old periodically

● Slows down training

● Smooth version: use a moving average
  θ^old := (1−α)·θ^old + α·θ^new      (θ = weights)

55
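A sketch of both target-network updates above; the weight lists are illustrative numpy arrays standing in for real network parameters.

```python
import numpy as np

def hard_update(theta_old, theta_new):
    """Copy the online weights into the target network (do this every n steps)."""
    for old, new in zip(theta_old, theta_new):
        old[...] = new

def soft_update(theta_old, theta_new, alpha=0.01):
    """Moving average: theta_old := (1 - alpha)*theta_old + alpha*theta_new."""
    for old, new in zip(theta_old, theta_new):
        old[...] = (1 - alpha) * old + alpha * new

# tiny usage example with made-up weights
theta_old, theta_new = [np.zeros(3)], [np.ones(3)]
soft_update(theta_old, theta_new, alpha=0.1)      # theta_old[0] is now [0.1, 0.1, 0.1]
```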
Reinforcement learning
Episode 4B

Deep reinforcement learning

1
What we already know:

- Q-learning

- Approximation of q-values with respect to state: Q(s, a; w), where w is the vector of weights

- Experience replay

This is not enough!


2
Autocorrelation
- The target is based on a prediction
- Since we use function approximation, when we update Q(s, a) we also
  move Q(s', a') in the same direction
- In the worst case the network may diverge, but usually it just becomes unstable
- How to stabilize the weights?

3
Target network
Idea: use a network with frozen weights to compute the target

L = (Q(s_t, a_t) − [r + γ·max_{a'} Q^old(s_{t+1}, a')])^2, where Q^old has the frozen weights
(the target is a const)

Hard target network:
Update Q^old every n steps and set its weights as θ^old := θ^new

4
Target network
Idea: use a network with frozen weights to compute the target

L = (Q(s_t, a_t) − [r + γ·max_{a'} Q^old(s_{t+1}, a')])^2, where Q^old has the frozen weights
(the target is a const)

Hard target network:
Update Q^old every n steps and set its weights as θ^old := θ^new

Soft target network:
Update Q^old every step: θ^old := (1−α)·θ^old + α·θ^new

5
Playing Atari with Deep Reinforcement Learning (2013, Deepmind)

● Experience replay: 10⁶ last transitions
● CNN predicts q-values; 4 last frames as input
● Update weights using the loss above
● Update every 5000 train steps

6
Asynchronous Methods for Deep Reinforcement Learning (2016, Deepmind)

[diagram: several workers collect transitions <s, a, r, s'> in parallel and send them to the learning process]

7
Problem of overestimation
We use the “max” operator to compute the target.

Surprisingly, here is a problem:
the max over several noisy estimates is biased upwards
(although we want the bias to be equal to zero)

8
Problem of overestimation
Normal distribution
3*10⁶ samples

mean: ~0.0004

9
Problem of overestimation
Normal distribution
3*10⁶ x 3 samples
Then take maximum of every tuple
mean: ~0.8467

10
Problem of overestimation
Normal distribution
3*10⁶ x 10 samples
Then take maximum of every tuple
mean: ~1.538

11
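A short numpy script reproducing the experiment above: each sample has zero mean, yet the mean of the maximum over k samples grows with k. The seed and exact numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
for k in (1, 3, 10):
    samples = rng.standard_normal((3 * 10**6, k))
    print(k, samples.max(axis=1).mean())          # roughly 0.0, 0.85, 1.54
```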
Problem of overestimation

Suppose the true Q(s',a) are equal to 0 for all a.
But we have an approximation (or other kind of) error, so our estimates of Q(s',a) are noisy.
So Q(s,a) should be equal to r.
But if we update Q(s,a) towards r + γ·max_{a'} Q(s',a'),
we will have an overestimated Q(s,a) > r, because the expected max of the noise is > 0.

12
Double Q-learning (NIPS 2010)
Idea: use two estimators of q-values: Q₁ and Q₂.
They should compensate for each other's mistakes because they are independent.
Let's take the argmax from the other estimator!

- Q-learning target: r + γ·max_{a'} Q(s', a')

- Rewritten Q-learning target: r + γ·Q(s', argmax_{a'} Q(s', a'))

- Double Q-learning target: r + γ·Q₂(s', argmax_{a'} Q₁(s', a'))

13
Double Q-learning (NIPS 2010)

How to apply this algorithm in deep reinforcement learning?


14
Deep Reinforcement Learning with Double Q-learning (Deepmind, 2015)
Idea: use the main network to choose the action!

15
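A sketch of the Double DQN target in the spirit of the slide: the main (online) network chooses the argmax action and the target network evaluates it. q_online_next and q_target_next stand for the two networks' Q(s',·) vectors; gamma is assumed.

```python
import numpy as np

gamma = 0.99

def double_dqn_target(r, q_online_next, q_target_next, done):
    a_star = int(np.argmax(q_online_next))        # argmax from the main network
    return r + (0.0 if done else gamma * q_target_next[a_star])
```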
Prioritized Experience Replay (2016, Deepmind)
Idea: sample transitions from the experience replay more cleverly.

We want to set a probability for every transition. Let's use the absolute value of
the TD-error of a transition as its priority:

P(i) ∝ |TD-error_i|^α, where α is the priority parameter (when α is 0 it's the uniform case)

Do you see the problem?

16
Prioritized Experience Replay (2016, Deepmind)
Idea: sample transitions from the experience replay more cleverly.

We want to set a probability for every transition. Let's use the absolute value of
the TD-error of a transition as its priority:

P(i) ∝ |TD-error_i|^α, where α is the priority parameter (when α is 0 it's the uniform case)

Do you see the problem?

Transitions become non-i.i.d. and therefore we introduce a bias. 17
Prioritized Experience Replay (2016, Deepmind)
Solution: we can correct the bias by using importance-sampling weights

w_i = (N·P(i))^(−β), where β is the parameter

We also normalize the weights by max_i w_i (there is no mathematical reason here, it's a practical trick)

When we put a transition into the experience replay, we set its priority to the current maximum

18
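A sketch of prioritized sampling with the importance-sampling correction described above; alpha, beta and eps are assumed hyperparameters.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()                   # P(i)
    idx = np.random.choice(len(probs), batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)          # importance-sampling weights
    weights /= weights.max()                                # normalize by the max weight
    return idx, weights
```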
Prioritized Experience Replay (2016, Deepmind)

19
Prioritized Experience Replay (2016, Deepmind)

It is the bonus homework! 20


Let’s watch a video…
https://www.youtube.com/watch?v=UXurvvDY93o

21
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Idea: change the network's architecture.

Recall:
Advantage function: A(s,a) = Q(s,a) − V(s)
So, Q(s,a) = A(s,a) + V(s)

[diagram: a standard net outputs Q(s,·) directly; a dueling net outputs separate V(s) and A(s,·) streams that are combined into Q(s,·)]

22
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Idea: change the network's architecture.

Recall:
Advantage function: A(s,a) = Q(s,a) − V(s)
So, Q(s,a) = A(s,a) + V(s)

[diagram: a standard net outputs Q*(s,·); a dueling net outputs V*(s) and A*(s,·) combined into Q*(s,·)]

Here is a problem!
23
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
Here is one extra degree of freedom!

Example: the same Q(s,·) can be decomposed in several ways:

Q(s,·):  (2, 4, 3)   (2, 4, 3)    (2, 4, 3)
V(s):     0           3            4
A(s,·):  (2, 4, 3)   (−1, 1, 0)   (−2, 0, −1)

24
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
Here is one extra degree of freedom!

Example: the same Q(s,·) can be decomposed in several ways:

Q(s,·):  (2, 4, 3)   (2, 4, 3)    (2, 4, 3)
V(s):     0           3            4
A(s,·):  (2, 4, 3)   (−1, 1, 0)   (−2, 0, −1)

What is correct?
Hint 1:
25
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
Here is one extra degree of freedom!

Example: the same Q(s,·) can be decomposed in several ways:

Q(s,·):  (2, 4, 3)   (2, 4, 3)    (2, 4, 3)
V(s):     0           3            4
A(s,·):  (2, 4, 3)   (−1, 1, 0)   (−2, 0, −1)

What is correct?
Hint 1: Hint 2:
26
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Solution: require max_a A(s,a) to be equal to zero!

So the Q-function is computed as: Q(s,a) = V(s) + A(s,a) − max_{a'} A(s,a')

The authors of this paper also introduced this way to compute Q-values:
Q(s,a) = V(s) + A(s,a) − (1/|A|)·Σ_{a'} A(s,a')

They wrote that this variant increases the stability of the optimization

(The fact that this loses the original semantics of Q doesn't matter) 27
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Solution: require max_a A(s,a) to be equal to zero!

So the Q-function is computed as: Q(s,a) = V(s) + A(s,a) − max_{a'} A(s,a')

It's the homework!

The authors of this paper also introduced this way to compute Q-values:
Q(s,a) = V(s) + A(s,a) − (1/|A|)·Σ_{a'} A(s,a')

They wrote that this variant increases the stability of the optimization

(The fact that this loses the original semantics of Q doesn't matter) 28
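A sketch of the two aggregation rules above for one state; value is a scalar V(s) and advantages is the vector A(s,·) produced by the two network heads.

```python
import numpy as np

def dueling_q_max(value, advantages):
    # require max_a A(s,a) = 0:  Q = V + A - max_a' A
    return value + advantages - advantages.max()

def dueling_q_mean(value, advantages):
    # the authors' more stable variant:  Q = V + A - mean_a' A
    return value + advantages - advantages.mean()
```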
Questions?

29
Reinforcement learning
Episode 6

Policy gradient methods

1
Small experiment

The next slide contains a question

Please respond as fast as you can!

2
Small experiment

left or right? 3
Small experiment

Right! Ready for next one? 4


Small experiment

What's Q(s,right) under gamma=0.99?


5
Small experiment

What's Q(s,right) under gamma=0.99?


6
Approximation error
DQN is trained to minimize

L ≈ E[Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a'))]^2
Simple 2-state world
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100 7
Trivia: Which prediction is better (A/B)?
Approximation error
DQN is trained to minimize

L ≈ E[Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a'))]^2
Simple 2-state world
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100 8
(A): better policy    (B): less MSE
Approximation error
DQN is trained to minimize

L ≈ E[Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a'))]^2
Simple 2-state world
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100 9
Q-learning will prefer the worse policy (B)!    ((A): better policy, (B): less MSE)
Conclusion

● Often computing q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Trivia: what algorithm works that way?


(of those we studied) 10
Conclusion

● Often computing q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Trivia: what algorithm works that way?


e.g. crossentropy method 11
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]

12
how humans survived

π (run∣s)=1

13
Policies
In general, two kinds

● Deterministic policy

a = π_θ(s)
● Stochastic policy

a ∼ π_θ(a|s)
14
Trivia: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy

a = π_θ(s)
● Stochastic policy

a ∼ π_θ(a|s)      e.g. rock-paper-scissors
15
Trivia: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy: a = π_θ(s)      (same action each time)
  Genetic algos (week 0), deterministic policy gradient

● Stochastic policy: a ∼ π_θ(a|s)      (sampling takes care of exploration)
  Crossentropy method, policy gradient

16
Trivia: how to represent a policy in a continuous action space?
Policies
In general, two kinds

● Deterministic policy: a = π_θ(s)      (same action each time)
  Genetic algos (week 0), deterministic policy gradient

● Stochastic policy: a ∼ π_θ(a|s)      (sampling takes care of exploration)
  Crossentropy method, policy gradient

17
categorical, normal, mixture of normals, whatever
Two approaches
● Value based:
  Learn a value function Q_θ(s,a) or V_θ(s)
  Infer the policy: π(a|s) = [a = argmax_{a'} Q_θ(s,a')]

● Policy based:
  Explicitly learn the policy π_θ(a|s) or π_θ(s) → a
  Implicitly maximize reward over the policy

18
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate

θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i)·[s_i, a_i ∈ Elite]

Trivia: Can we adapt it to discounted rewards? (with γ) 19
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate

θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i)·[s_i, a_i ∈ Elite]

TD version: elite = the (s,a) that have the highest R(s,a)

20
(select elites independently from each state)
Policy gradient main idea

Why so complicated?
We'd rather simply maximize R over pi!

21
Objective

Expected reward:
J = E_{s∼p(s), a∼π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward:
J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a)

22
Objective

Expected reward (the R(z) setting):
J = E_{s∼p(s), a∼π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward, R(s,a) = r + γ·R(s',a'):
J = E_{s∼p(s), a∼π_θ(a|s)} Q(s, a)      (the "true" Q-function) 23
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

24
Objective
J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

π_θ(a|s): the agent's policy
Q(s,a): the true action value
p(s): state visitation frequency (may depend on the policy)

Trivia: how do we compute that?


25
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

Q(s,a): the true action value, a.k.a. E[R(s,a)]

J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} Q(s,a)      (sample N sessions z_i)

26
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

Q(s,a): the true action value, a.k.a. E[R(s,a)]

J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} Q(s,a)      (sample N sessions z_i)

Can we optimize the policy now? 27


Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

The parameters θ "sit" inside π_θ(a|s).
Q(s,a): the true action value, a.k.a. E[R(s,a)]

J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} Q(s,a)

We don't know how to compute dJ/dθ 28


Optimization
Finite differences
– Change policy a little, evaluate

∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions

29
Optimization
Finite differences
– Change policy a little, evaluate

∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions

Trivia: any problems with those two? 30


Optimization
Finite differences
– Change policy a little, evaluate

∇J ≈ (J_{θ+ε} − J_θ) / ε      (VERY noisy, especially if both J's are sampled)

Stochastic optimization      ("quantile convergence" problems with stochastic MDPs)
– Good old crossentropy method

– Maximize probability of “elite” actions

31
Objective

J = E_{s∼p(s), a∼π_θ(a|s)} Q(s,a) = ∫_s p(s) ∫_a π_θ(a|s)·Q(s,a) da ds

Wish list:
– Analytical gradient
– Easy/stable approximations

32
Logderivative trick
Simple math

∇ log π ( z )=? ? ?

(try chain rule)

33
Logderivative trick
Simple math

∇ log π(z) = (1/π(z)) · ∇π(z)

π(z) · ∇ log π(z) = ∇π(z)

34
Policy gradient
Analytical inference

∇J = ∫_s p(s) ∫_a ∇π_θ(a|s)·Q(s,a) da ds

π(z) · ∇ log π(z) = ∇π(z)

35
Policy gradient
Analytical inference

∇J = ∫_s p(s) ∫_a ∇π_θ(a|s)·Q(s,a) da ds

π(z) · ∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a π_θ(a|s) · ∇ log π_θ(a|s) · Q(s,a) da ds

36
Trivia: anything curious about that formula?
Policy gradient
Analytical inference

∇J = ∫_s p(s) ∫_a ∇π_θ(a|s)·Q(s,a) da ds

π(z) · ∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a π_θ(a|s) · ∇ log π_θ(a|s) · Q(s,a) da ds      (that's an expectation :)
37
Policy gradient
Analytical inference

∇J = ∫_s p(s) ∫_a ∇π_θ(a|s)·Q(s,a) da ds

π(z) · ∇ log π(z) = ∇π(z)

∇J = E_{s∼p(s), a∼π_θ(a|s)} ∇ log π_θ(a|s)·Q(s,a)      38
Policy gradient (REINFORCE)
● Policy gradient

∇J = E_{s∼p(s), a∼π_θ(a|s)} ∇ log π_θ(a|s)·Q(s,a)

● Approximate with sampling:

∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)

39
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)
– Ascend: θ_{i+1} ← θ_i + α·∇J


40
REINFORCE algorithm
● Initialize NN weights θ_0 ← random
Trivia: is it off- or on-policy?
● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)
– Ascend: θ_{i+1} ← θ_i + α·∇J


41
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:      (actions under the current policy = on-policy)
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)
– Ascend: θ_{i+1} ← θ_i + α·∇J


42
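A minimal sketch of the two ingredients of one REINFORCE update: discounted returns (playing the role of Q(s,a) on the slide) and the surrogate loss whose gradient matches the formula above. In an autograd framework the loss would be differentiated w.r.t. the policy weights; the function names are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def reinforce_loss(log_probs, returns):
    # minimizing -sum(log pi(a|s) * G) == ascending the sampled policy gradient
    return -np.sum(log_probs * returns)
```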
value-based Vs policy-based

Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)

Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration, innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
value-based Vs policy-based

Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)

Policy-based (we'll learn much more soon!):
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration, innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)

What is better for learning:
a random action in a good state
or
a great action in a bad state? 45
REINFORCE baseline
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·Q(s,a)

Q(s,a) = V(s) + A(s,a)

46
Actions influence A(s,a) only, so V(s) is irrelevant
REINFORCE baseline
● Initialize NN weights θ_0 ← random

● Loop:
– Sample N sessions z under the current π_θ(a|s)
– Evaluate the policy gradient:
  ∇J ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·(Q(s,a) − b(s))

b(s): anything that doesn't depend on the action; ideally, b(s) = V(s)

47
Actor-critic
● Learn both V(s) and π θ (a∣s )
● Hope for best of both worlds :)

48
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s)

Use V_θ(s) to learn π_θ(a|s) faster!

Non-trivia: how can we estimate A(s,a) from (s,a,r,s') and the V-function?

49
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s)

Use V_θ(s) to learn π_θ(a|s) faster!

A(s,a) = Q(s,a) − V(s)

Q(s,a) = r + γ·V(s')

A(s,a) = r + γ·V(s') − V(s)


50
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s)

Use V_θ(s) to learn π_θ(a|s) faster!

A(s,a) = Q(s,a) − V(s)

Q(s,a) = r + γ·V(s')      (also: n-step version)

A(s,a) = r + γ·V(s') − V(s)


51
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s)

Use V_θ(s) to learn π_θ(a|s) faster!

A(s,a) = r + γ·V(s') − V(s)

∇J_actor ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·A(s,a)      (A considered const)

Trivia: how do we train V then? 52


Advantage actor-critic

[diagram: state s → model (W = params) → π_θ(a|s) and V_θ(s)]

Improve the policy:
∇J_actor ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s)·A(s,a)

Improve the value:
L_critic ≈ (1/N) · Σ_{i=0..N} Σ_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])^2

53
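A sketch of the two objectives above for one batch of transitions; all inputs are assumed to be numpy arrays (log_probs = log π(a|s), values = V(s), next_values = V(s')), and in a real autograd setup the "treated as const" terms would be detached from the graph.

```python
import numpy as np

gamma = 0.99

def a2c_losses(log_probs, values, next_values, rewards, dones):
    targets = rewards + gamma * next_values * (1.0 - dones)
    advantages = targets - values
    actor_loss = -np.mean(log_probs * advantages)     # advantage treated as const
    critic_loss = np.mean((values - targets) ** 2)    # target treated as const
    return actor_loss, critic_loss
```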
Continuous action spaces

What if there are continuously many actions?


● Robot control: control motor voltage
● Trading: assign money to equity

How does the algorithm change?

54
Continuous action spaces

What if there are continuously many actions?


● Robot control: control motor voltage
● Trading: assign money to equity

How does the algorithm change?


it doesn't :)
Just plug in a different formula for
pi(a|s), e.g. normal distribution 55
Duct tape zone
● V(s) errors less important than in Q-learning
– actor still learns even if critic is random, just slower

● Regularize with entropy


– to prevent premature convergence

● Learn on parallel sessions


– Or super-small experience replay

● Use logsoftmax for numerical stability 56


Let's code!

57
Practical RL
week 5, spring’18

Bandits, exploration, production hacks

1
Multi-armed bandits

2
Multi-armed bandits

3
Multi-armed bandits: simplest

Simple case: no different “states”, just N actions

[diagram: Agent → action → Feedback; same state each time]

Exploration: figure out which action is best overall;
take as few bad actions as you can 4
Multi-armed bandits: contextual

A simplified MDP with only one step

[diagram: observation → Agent → action → Feedback]

Why bandits: it's simpler to explain the math here,
formulae are, like, 50% shorter
(we will generalize to MDP in the second half) 5
What is: contextual bandit

[diagram: state → Agent → action → Feedback r(s,a)]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment

Basically it's a 1-step MDP where:
– G(s,a) = r(s,a)
– Q(s,a) = E r(s,a)
– all formulae are 50% shorter
6
How to measure exploration

Ideas?

7
How to measure exploration
Bad idea: by the sound of the name
Good idea: by $$$ it brought/lost you

Regret of policy π(a|s):


Consider an optimal policy, π*(a|s)

Regret = sum over training time [ optimal – yours ]

η = Σ_t [ E_{s,a∼π*} r(s,a) − E_{s,a∼π_t} r(s,a) ]

Finite horizon: t < max_t.  Infinite horizon: t → ∞. 8


How to measure exploration
Bad idea: by the sound of the name
Good idea: by $$$ it brought/lost you

Regret of policy π(a|s):


Regret per tick = optimal – yours

9
Exploration Vs Exploitation
What exploration strategies
did we use before?

10
Exploration strategies so far...

Strategies:
• ε-greedy
• With probability ε take a uniformly random action;
• Otherwise take optimal action.
• Boltzmann
• Pick actions proportionally to transformed Q-values:
  P(a) = softmax( Q(s,a) / std )
• Optimistic initialization
• start from high initial Q(s,a) for all states/actions 11
• good for tabular algorithms, hard to approximate
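A sketch of the first two strategies above; epsilon and the temperature (used here in place of the std from the slide) are assumed parameters.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))       # uniformly random action
    return int(np.argmax(q_values))                   # otherwise act greedily

def boltzmann(q_values, temperature=1.0):
    logits = (q_values - q_values.max()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()     # softmax over transformed Q-values
    return int(np.random.choice(len(q_values), p=probs))
```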
Exploration strategies so far...

Strategies:
• ε-greedy
• With probability ε take a uniformly random action;
• Otherwise take optimal action.
Say, we use ε-greedy with const ε = 0.25
on top of q-learning to play a videogame.

What can you say about regret?

η = Σ_t [ E_{s,a∼π*} r(s,a) − E_{s,a∼π_t} r(s,a) ]      12
Exploration strategies so far...

Strategies:
• ε-greedy
• With probability ε take a uniformly random action;
• Otherwise take optimal action.
Say, we use ε-greedy with const ε = 0.25
on top of q-learning to play a videogame.

Regret grows linearly over time!

Agent always acts suboptimally due to ε


13
Exploration over time

Idea:
If you want to converge to optimal policy,
you need to gradually reduce exploration

Example:

Initialize ε-greedy ε = 0.5, then gradually reduce it

• If ε → 0, it's greedy in the limit


• Be careful with non-stationary environments 14
How many lucky random actions does it take to
● Apply medical treatment
● Control robots
● Invent efficient VAE training

Except humans can learn these in less than a lifetime

15
How many lucky random actions does it take to
● Apply medical treatment
● Control robots
● Invent efficient VAE training

We humans don't explore with an ε-greedy policy!

16
BTW, how do humans explore?
Whether some new particles violate physics
Vs
Whether you still can't fly by pulling your hair up

17
Uncertainty in returns

We want to try actions if we believe there's a


chance they turn out optimal.

Idea: let's model how certain we are that Q(s,a)


is what we predicted

18
Thompson sampling
● Policy:
– sample once from each Q distribution
– take argmax over samples
– which actions will be taken?

19
Thompson sampling
● Policy:
– sample once from each Q distribution
– take argmax over samples
– which actions will be taken?
Takes a1 with p ~ 0.65, a2 with p ~ 0.35, a0 ~ never

20
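A sketch of Thompson sampling for a Bernoulli bandit with Beta posteriors (the Gaussian-posterior picture above works the same way): sample once from each posterior, then take the argmax. successes and failures are assumed per-action count arrays.

```python
import numpy as np

def thompson_action(successes, failures):
    samples = np.random.beta(successes + 1, failures + 1)   # one draw per action
    return int(np.argmax(samples))                           # argmax over the samples
```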
Optimism in face of uncertainty

Idea:
Prioritize actions with uncertain outcomes!

More uncertain = better.


Greater expected value = better

Math: try until upper confidence bound is small enough.

21
Optimism in face of uncertainty
● Policy:
– Compute 95% upper confidence bound for each a
– Take action with highest confidence bound
– What can we tune here to explore more/less?

22
Optimism in face of uncertainty
● Policy:
– Compute 95% upper confidence bound for each a
– Take action with highest confidence bound
– Adjust: change 95% to more/less

23
(figure: points = the 95% percentiles of Q(s,a))
Frequentist approach
There's a number of inequalities that bound
P(x>t) < something

● E.g. Hoeffding inequality (arbitrary x in [0,1])


P(x̄ − Ex ≥ t) ≤ e^(−2nt²)      (x̄ = mean of n samples)

● Remember any others?


24
Frequentist approach
There's a number of inequalities that bound
P(x>t) < something

● E.g. Hoeffding inequality (arbitrary x in [0,1])


P(x̄ − Ex ≥ t) ≤ e^(−2nt²)      (x̄ = mean of n samples)

● Remember any others?


(Chernoff, Chebyshev, over9000) 25
Count-based exploration
UCB-1 for bandits
Take actions in proportion to ṽ_a:

ṽ_a = v_a + √(2·log N / n_a)

– N: the number of time-steps so far
– n_a: the number of times action a was taken 26
Count-based exploration
UCB-1 for bandits
Take actions in proportion to ṽ_a:

ṽ_a = v_a + √(2·log N / n_a)      (upper conf. bound for r in [0,1]; if not?)

– N: the number of time-steps so far
– n_a: the number of times action a was taken 27
Count-based exploration
UCB-1 for bandits
Take actions in proportion to ṽ_a:

ṽ_a = v_a + √(2·log N / n_a)      (upper conf. bound for r in [0,1]; if not, divide by r_max)

– N: the number of time-steps so far
– n_a: the number of times action a was taken 28
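A sketch of the UCB-1 rule above: mean reward plus the √(2·ln N / n_a) bonus, assuming rewards in [0, 1]; sum_rewards and counts are assumed per-action arrays, and untried actions are taken first.

```python
import numpy as np

def ucb1_action(sum_rewards, counts):
    untried = np.flatnonzero(counts == 0)
    if len(untried) > 0:                              # try every action once first
        return int(untried[0])
    N = counts.sum()
    means = sum_rewards / counts
    bonus = np.sqrt(2.0 * np.log(N) / counts)
    return int(np.argmax(means + bonus))
```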
Count-based exploration
UCB generalized for multiple states

Q̃(s,a) = Q(s,a) + α·√(2·log N_s / n_{s,a})

where
– N_s: visits to state s
– n_{s,a}: the number of times action a was taken from state s

29
Bayesian UCB

The usual way:


– Start from prior P(Q)
– Learn posterior P(Q|data)
– Take q-th percentile

What models can learn that?


30
Bayesian UCB

The usual way:


– Start from prior P(Q)
– Learn posterior P(Q|data)
– Take q-th percentile

Approach 1: learn parametric P(Q), e.g. normal


Approach 2: use bayesian neural networks
31
Parametric

[diagram: a regular NN maps X through tanh layers with point weights to a single output y]


Parametric

[diagram: the network outputs μ and σ; predict P(y|x) as x ~ N(μ, σ), a normal distribution]

[diagram: compare with a regular NN that maps X to a single point y]


BNNs

Disclaimer: this is a hacker's guide to BNNs!
It does not cover all the philosophy and general cases.

[diagram: a regular NN with fixed scalar weights maps X to a point estimate y]


BNNs

[diagram: a Bayesian NN replaces each scalar weight with a distribution, e.g. N(0.3, 0.04), N(−0.25, 0.1), N(−0.1, 0.043), N(1.3, 1.97), and outputs P(y|x); compare with a regular NN with point weights and a point output y]


BNNs

[diagram: Bayesian NN with weight distributions, outputs P(y|x)]

Idea:
● No explicit weights
● Maintain a parametric distribution on them instead!
● Practical: fully-factorized normal or similar

q(θ|φ:[μ,σ]) = Π_i N(θ_i | μ_i, σ_i)

P(y|x) = E_{θ∼q(θ|φ)} P(y|x, θ)


BNNs

[diagram: Bayesian NN with weight distributions, outputs P(y|x)]

Idea:
● No explicit weights
● Maintain a parametric distribution on them instead!
● Practical: fully-factorized normal or similar

q(θ|φ:[μ,σ]) = Π_i N(θ_i | μ_i, σ_i)

P(y|x) = E_{θ∼q(θ|φ)} P(y|x, θ)


BNNs

[diagram: Bayesian NN with weight distributions, outputs P(y|x)]

Idea:
● No explicit weights
● Inference: sample from the weight distributions, predict 1 point
● To get a distribution, aggregate K samples (e.g. with a histogram)
● Yes, it means running the network multiple times per one X

P(y|x) = E_{θ∼q(θ|φ)} P(y|x, θ)


BNNs
Idea:
● No explicit weights
● Maintain a parametric distribution on them instead!
● Practical: fully-factorized normal or similar

q(θ|φ:[μ,σ]) = Π_i N(θ_i | μ_i, σ_i)

P(y|x) = E_{θ∼q(θ|φ)} P(y|x, θ)

● Learn the parameters of that distribution (reparameterization trick)
● Less variance: local reparameterization trick

φ* = argmax_φ E_{x_i,y_i∼d} E_{θ∼q(θ|φ)} P(y_i | x_i, θ)      (d = dataset)

wanna explicit formulae?


Lower bound      (d = dataset)

−KL(q(θ|φ) ∥ p(θ|d)) = −∫_θ q(θ|φ)·log [ q(θ|φ) / p(θ|d) ]

= −∫_θ q(θ|φ)·log [ q(θ|φ) / (p(d|θ)·p(θ)/p(d)) ] = −∫_θ q(θ|φ)·log [ q(θ|φ)·p(d) / (p(d|θ)·p(θ)) ]

= −∫_θ q(θ|φ)·[ log (q(θ|φ)/p(θ)) − log p(d|θ) + log p(d) ]

= [ E_{θ∼q(θ|φ)} log p(d|θ) ] − KL(q(θ|φ) ∥ p(θ)) − log p(d)

  (log-likelihood) − (distance to prior) + const
Lower bound

φ = argmax_φ ( −KL(q(θ|φ) ∥ p(θ|d)) )

  = argmax_φ ( [ E_{θ∼q(θ|φ)} log p(d|θ) ] − KL(q(θ|φ) ∥ p(θ)) )

Can we perform gradient ascent directly?


Reparameterization trick

φ = argmax_φ ( −KL(q(θ|φ) ∥ p(θ|d)) )

  = argmax_φ ( [ E_{θ∼q(θ|φ)} log p(d|θ) ] − KL(q(θ|φ) ∥ p(θ)) )

Use the reparameterization trick for the expectation; the KL term has a simple formula (for normal q).

What does this log BNN likelihood P(d|...) mean?

E_{θ∼N(θ|μ_φ,σ_φ)} log p(d|θ) = E_{ψ∼N(0,1)} log p(d | (μ_φ + σ_φ·ψ))

Better: local reparameterization trick (google it)


Reparameterization trick

φ = argmax_φ ( −KL(q(θ|φ) ∥ p(θ|d)) )

  = argmax_φ ( [ E_{θ∼q(θ|φ)} log p(d|θ) ] − KL(q(θ|φ) ∥ p(θ)) )

In other words, the BNN likelihood is Σ_{x,y∼d} log p(y | x, μ + σ·ψ)

E_{θ∼N(θ|μ_φ,σ_φ)} log p(d|θ) = E_{ψ∼N(0,1)} log p(d | (μ_φ + σ_φ·ψ))

Better: local reparameterization trick (google it)


Using BNNs
● If you sample from BNNs
– Can learn ~arbitrary distribution (e.g. multimodal)
– But it takes running network many times
– Use empirical percentiles for exploration priority
● Again, 3 points on horizontal axis are percentiles
Using BNNs
● If you sample from BNNs
– Can learn ~arbitrary distribution (e.g. multimodal)
– But it takes running network many times
– Use empirical percentiles for exploration priority
● Again, 3 points on horizontal axis are percentiles
Practical stuff
● Approximate exploration policy
with something cheaper

● Bayesian UCB:
– Prior can make or break it
– Sometimes parametric guys win
(vs bnn)

● Of course, neural nets aren't


46
always the best model
Markov Decision Processes
● Naive approach:
– Infer posterior distribution on Q(s,a)
– Do UCB or Thompson Sampling on those Q-values
– Anything wrong with this?

47
Markov Decision Processes
● Naive approach:
– Infer posterior distribution on Q(s,a)
– Do UCB or Thompson Sampling on those Q-values
– Agent is “greedy” w.r.t. exploration
It would prefer taking one uncertain action now rather than
making several steps to end up in unexplored regions

48
Markov Decision Processes
● Naive approach:
– Infer posterior distribution on Q(s,a)
– Do UCB or Thompson Sampling on those Q-values
– Agent is “greedy” w.r.t. exploration

● Reward augmentation
– Devise a surrogate “reward” for exploration
● We “pay” our agent for exploring
– Maximize this reward with a (separate) RL agent
49
I got it!

It's all about funding!


(Just kidding)

50
Reward augmentation

Let's “pay” agent for exploration!

r̃(s,a,s') = r(s,a,s') + r_exploration(s,a,s')

51
Reward augmentation

Let's “pay” agent for exploration!

r̃(s,a,s') = r(s,a,s') + r_exploration(s,a,s')

Q: Any suggestions on
surrogate r for atari? 52
UNREAL main idea
● Auxiliary objectives:
– Pixel control: maximize pixel change in an NxN grid over the image
– Feature control: maximize the activation of some neuron deep inside the neural network
– Reward prediction: predict future rewards given history

article: arxiv.org/abs/1611.05397
blog post: bit.ly/2g9Yv2A 53
UNREAL main idea
● Auxiliary objectives:
– Pixel control: maximize pixel change in an NxN grid over the image
– Feature control: maximize the activation of some neuron deep inside the neural network
– Reward prediction: predict future rewards given history

Keep calm! We'll get more theoretically sound models in a few slides.

article: arxiv.org/abs/1611.05397
blog post: bit.ly/2g9Yv2A 54
Environment: Labyrinth

● Maze with rewards


● Partially observable
– Used a2c + LSTM + experience replay 55
Results: labyrinth

56
Results: Atari

57
Count-based models

TL;DR encourage visiting underexplored states

● Use approximate density model, N(s)


N(s) ~ how many times agent
visited s over training time

58
Count-based models

TL;DR encourage visiting underexplored states

● Use approximate density model, N(s)


N(s) ~ how many times agent
visited s over training time

● Reward for visiting low-N states, e.g. N(s)^-0.5

59
Examples: arxiv:1606.01868, arxiv:1703.01310
Count-based models

On-policy: start training with surrogate rewards,


gradually reduce their weight over time

Off-policy: train a separate agent to maximize


exploration via surrogate rewards, learn Q(s,a)
for original reward from that with off-policy algo

60
Estimating counts
We need some way to estimate N(s)

Task-specific: e.g. image density models


Generic: e.g. VAE / alphaGAN, expected log p(x|z)
Density ratio: estimate p(s) / q(s) for baseline q 61
Density ratio trick
We need some way to estimate N(s)

count = state density (probability)


N(s)∼d(s) up to a coefficient of n_steps

62
Density ratio trick
We need some way to estimate N(s)

count = state density (probability)


N(s)∼d(s) up to a coefficient of n_steps

To estimate d(s), we introduce any known distribution q(s), e.g. uniform:

d(s) = q(s) · d(s)/q(s)

63
Density ratio trick
We need some way to estimate N(s)

count = state density (probability)


N(s)∼d(s) up to a coefficient of n_steps

To estimate d(s), we introduce any known distribution q(s), e.g. uniform:

d(s) = q(s) · d(s)/q(s)

Train another model to discriminate between s∼d(s) and s∼q(s):  p(s∈d(s)) = ???

Q: What is P(s ∈ d(s)) under the optimal model? 64


Density ratio trick
Train another model to discriminate
between s∼d(s) and s∼q(s)

Perfect discriminator: p(s∈d) = d(s) / (d(s) + q(s))

65
Density ratio trick
Train another model to discriminate
between s∼d(s) and s∼q(s)

Perfect discriminator: p(s∈d) = d(s) / (d(s) + q(s))

Q: Can you express d(s)/q(s) in terms of p(s∈d(s))?
66
Density ratio trick
Train another model to discriminate
between s∼d(s) and s∼q(s)

Perfect discriminator: p(s∈d) = d(s) / (d(s) + q(s))

1 / p(s∈d) = (d(s) + q(s)) / d(s) = 1 + q(s)/d(s)

q(s)/d(s) = 1/p(s∈d(s)) − 1 = (1 − p(s∈d(s))) / p(s∈d(s)) 67
Density ratio trick
Train another model to discriminate
between s∼d(s) and s∼q(s)

Perfect discriminator: p(s∈d) = d(s) / (d(s) + q(s))

d(s)/q(s) = p(s∈d(s)) / (1 − p(s∈d(s))) ≈ Discriminator(s) / (1 − Discriminator(s))

d(s) = q(s) · d(s)/q(s) ≈ q(s) · Discriminator(s) / (1 − Discriminator(s)) 68
Density ratio trick
We need some way to estimate N(s)

count = state density (probability)


N(s)∼d(s) up to a coefficient of n_steps

Train a neural network to discriminate between
visited states s∼d(s) and arbitrary known s∼q(s)

d(s) = q(s) · d(s)/q(s) ≈ q(s) · Discriminator(s) / (1 − Discriminator(s))

Uniform q(s) = simple math, high-variance d(s);
task-specific q(s) = possibly smaller variance 69
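A sketch of turning the density-ratio estimate above into an exploration bonus: given the trained discriminator's output D(s) = P(s came from the visited-state distribution d), recover d(s)/q(s), form a pseudo-count N(s) ~ n_steps·d(s), and reward its inverse square root (the N(s)^-0.5 shape from the count-based slide earlier). Function names and the scale parameter are assumptions.

```python
import numpy as np

def density_ratio(discriminator_prob):
    return discriminator_prob / (1.0 - discriminator_prob)

def exploration_bonus(discriminator_prob, q_density, n_steps, scale=1.0):
    d = q_density * density_ratio(discriminator_prob)     # estimated d(s)
    pseudo_count = n_steps * d                             # N(s) ~ n_steps * d(s)
    return scale / np.sqrt(np.maximum(pseudo_count, 1e-8))
```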
Variational Information-Maximizing Exploration

I hope I don’t have enough time for this…

arxiv:1605.09674 70
Variational Information-Maximizing Exploration

Curiosity
Taking actions that increase your knowledge about
the world (a.k.a. the environment)

Knowledge about the world


Whatever allows you to predict how the world works
depending on your behaviour

arxiv:1605.09674 71
Vime main idea

Add curiosity to the reward:

r̃(τ,a,s') = r(s,a,s') + β·r_vime(τ,a,s')

Curiosity definition:
r_vime(τ,a,s') = I(θ; s' | τ, a)
Vime main idea

Environment model: P(s'|s,a,θ)

Session: τ_t = ⟨s_0, a_0, s_1, a_1, ..., s_t⟩

Surrogate reward:
r̃(τ,a,s') = r(s,a,s') + β·r_vime(τ,a,s') = r(s,a,s') + β·I(θ; s'|τ,a)

Curiosity:
I(θ; s'|τ,a) = H(θ|τ,a) − H(θ|τ,a,s') = E_{s_{t+1}∼P(s_{t+1}|s,a)} KL[P(θ|τ,a,s') ∥ P(θ|τ)]

need proof for that last line?
Naive objective

E_{s_{t+1}∼P(s_{t+1}|s,a)} KL[P(θ|τ,a,s') ∥ P(θ|τ)] = ∫_{s'} P(s'|s,a) · ∫_θ P(θ|τ,a,s') · log [ P(θ|τ,a,s') / P(θ|τ) ] dθ ds'

where

P(θ|τ) = P(τ|θ)·P(θ) / P(τ) = Π_t P(s_{t+1}|s_t, a_t, θ) · P(θ) / ∫_θ P(τ|θ)·P(θ) dθ
Naive objective

E_{s_{t+1}∼P(s_{t+1}|s,a)} KL[P(θ|τ,a,s') ∥ P(θ|τ)] = ∫_{s'} P(s'|s,a) · ∫_θ P(θ|τ,a,s') · log [ P(θ|τ,a,s') / P(θ|τ) ] dθ ds'
    (sample s' from the MDP; sample θ ... somehow)

P(θ|τ) = P(τ|θ)·P(θ) / P(τ) = Π_t P(s_{t+1}|s_t, a_t, θ) · P(θ) / ∫_θ P(τ|θ)·P(θ) dθ
    (model · prior / dunno)

Better avoid computing P(theta|tau) directly!


BAYES
We want a model that

- predicts P(s'|s,a,s,…,a, theta)


- we can estimate P(theta | D)
- we can sample from P(theta | D)
We want a model that

- predicts P(s'|s,a,s,…,a, theta)


- we can estimate P(theta | D)
- we can sample from P(theta | D)
- Bayesian Neural Networks!
Vime objective

E_{s_{t+1}∼P(s_{t+1}|s,a)} KL[P(θ|τ,a,s') ∥ P(θ|τ)] = ∫_{s'} P(s'|s,a) · ∫_θ P(θ|τ,a,s') · log [ P(θ|τ,a,s') / P(θ|τ) ] dθ ds'

Approximate the posterior with a BNN:
KL[P(θ|τ,a,s') ∥ P(θ|τ)] ≈ KL[q(θ|τ,a,s') ∥ q(θ|τ)] ≡ KL[q(θ|φ') ∥ q(θ|φ)]

E_{s_{t+1}∼P(s_{t+1}|s,a)} KL[P(θ|τ,a,s') ∥ P(θ|τ)] ≈ ∫_{s'} P(s'|s,a) · ∫_θ q(θ|τ,a,s') · log [ q(θ|τ,a,s') / q(θ|τ) ] dθ ds'
    (sample s' from the env; sample θ from the BNN; q(θ|τ) is the BNN from the last tick)
Algorithm
Forever:
1. Interact with the environment, get <s,a,r,s'>
2. Compute the curiosity reward:
   r̃(s,a,s') = r(s,a,s') + β·KL[q(θ|φ') ∥ q(θ|φ)]
3. train_agent(s,a,r̃,s')   // with any RL algorithm
4. train_BNN(s,a,s')        // maximize the lower bound
Dirty hacks
● Use batches of many <s,a,r,s'>
– for CPU/GPU efficiency
– greatly improves RL stability
● Simple formula for KL
– Assuming fully-factorized normal distribution
  KL[q(θ|φ') ∥ q(θ|φ)] = (1/2) · Σ_{i<|θ|} [ (σ_i'/σ_i)^2 + 2·log σ_i − 2·log σ_i' + (μ_i' − μ_i)^2 / σ_i^2 ]
– Even simpler: second order Taylor approximation
● Divide KL by its running average over past iterations
Results

[figures: session reward vs. epoch learning curves]
Pitfalls
● It's curious about irrelevant things
● Predicting (210x160x3) images is hard
● We don't observe full states (POMDP)
State = hidden NN activation

[diagram: raw observation S → some layers → more layers → Q-values Q(s,a)]
State = hidden NN activation

[diagram: raw observation (pixels, 210x160x3: all information) → some layers → more layers
(after the DNN, ~256 units: only policy-relevant information) → q-values (~n actions)]
State = hidden NN activation

[diagram: raw observation (we don't care) → some layers → more layers
(this is our new state: predict it with the BNN) → Q-values]


A case for POMDP

[diagram: observations O_{t-3}, O_{t-2}, O_{t-1}, O_t (we don't care) → RNN chain → more layers
(this is our new state: predict it with the BNN) → Q-values]
