
Reinforcement learning

Episode 4B

Deep reinforcement learning

1
What we already know:

- Q-learning

- Approximation of Q-values with respect to the state: $\hat{Q}(s, a) = Q(s, a; \theta)$, where $\theta$ is the vector of weights

- Experience replay

This is not enough!


2
Autocorrelation
- The target $y = r + \gamma \max_{a'} Q(s', a'; \theta)$ is based on our own prediction

- Since we use function approximation, when we update $Q(s, a; \theta)$ towards the target we also
  move $Q(s', a'; \theta)$ in the same direction

- In the worst case the network may diverge; usually it just becomes unstable

- How do we stabilize the weights?

3
Target network
Idea: use a network with frozen weights to compute the target:

$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$, where $\theta^{-}$ are the frozen weights (held constant between updates)

Hard target network:

Update $\theta^{-}$ every $n$ steps by setting $\theta^{-} \leftarrow \theta$

4
Target network
Idea: use a network with frozen weights to compute the target:

$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$, where $\theta^{-}$ are the frozen weights (held constant between updates)

Hard target network:

Update $\theta^{-}$ every $n$ steps by setting $\theta^{-} \leftarrow \theta$

Soft target network:

Update the target network $Q_t$ every step: $\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-}$

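A minimal PyTorch sketch of both update rules (the tiny network, the value of $n$, and $\tau = 0.005$ are illustrative assumptions, not values from the slides):

```python
import copy
import torch
import torch.nn as nn

# Online network Q(s, a; theta) and a frozen copy Q(s, a; theta^-).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)

def hard_update(step, n=5000):
    """Hard variant: every n steps copy the online weights into the target network."""
    if step % n == 0:
        target_net.load_state_dict(q_net.state_dict())

def soft_update(tau=0.005):
    """Soft variant: every step move theta^- a small fraction tau towards theta."""
    with torch.no_grad():
        for p_tgt, p in zip(target_net.parameters(), q_net.parameters()):
            p_tgt.mul_(1.0 - tau).add_(tau * p)
```

With the hard variant the target jumps every $n$ steps; with the soft variant it trails the online network smoothly.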
5
Playing Atari with Deep Reinforcement Learning (2013, Deepmind)

- Experience replay over the last $10^6$ transitions

- A CNN outputs the Q-values; the 4 last frames are the input

- Update the weights using the loss
  $L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^2\Big]$

- Update every 5000 train steps

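A condensed sketch of one training step with uniform sampling from the replay buffer (the buffer layout, optimizer, and tensor shapes are assumptions for illustration; `target_net` plays the role of the frozen network from the previous slides):

```python
import random
import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One gradient step on a minibatch sampled uniformly from the replay buffer."""
    batch = random.sample(replay, batch_size)          # replay: list of (s, a, r, s', done) tensors
    s, a, r, s2, done = map(torch.stack, zip(*batch))  # a: int64 action index, done: 0./1. float flag
    with torch.no_grad():                              # the target uses the frozen network
        target = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)  # Q(s, a; theta) for the actions taken
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```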
6
Asynchronous Methods for Deep Reinforcement Learning (2016, Deepmind)

[Figure: several parallel workers each collect transitions $\langle s, a, r, s' \rangle$ and stream them to a central learning process.]

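The figure only shows the data flow, so here is a minimal queue-based sketch of workers streaming transitions to a central learner (environment, policy, and process count are placeholders; in the actual A3C algorithm each worker computes gradients locally rather than shipping raw transitions):

```python
import multiprocessing as mp
import random

def worker(queue, worker_id, n_steps=1000):
    """Each worker runs its own copy of the environment and streams transitions."""
    s = 0
    for _ in range(n_steps):
        a = random.choice([0, 1])           # placeholder policy
        r, s2 = random.random(), s + 1      # placeholder environment step
        queue.put((s, a, r, s2))            # send <s, a, r, s'> to the central learner
        s = s2

def learner(queue, n_updates=100):
    """The central process consumes transitions and would update the shared model."""
    for _ in range(n_updates):
        s, a, r, s2 = queue.get()
        # ... gradient update on the shared parameters would go here ...

if __name__ == "__main__":
    q = mp.Queue()
    workers = [mp.Process(target=worker, args=(q, i)) for i in range(3)]
    for w in workers:
        w.start()
    learner(q)
    for w in workers:
        w.terminate()
```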
7
Problem of overestimation
We use the “max” operator to compute the target:

$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$

Surprisingly, there is a problem here:

$\mathbb{E}\big[\max_{a'} \varepsilon_{a'}\big] > 0$ for zero-mean estimation noise $\varepsilon_{a'}$

(although we want this bias to be equal to zero)

8
Problem of overestimation
Normal distribution
$3 \cdot 10^6$ samples

mean: ~0.0004

9
Problem of overestimation
Normal distribution
$3 \cdot 10^6 \times 3$ samples
Then take the maximum of every tuple of 3
mean of the maxima: ~0.8467

10
Problem of overestimation
Normal distribution
$3 \cdot 10^6 \times 10$ samples
Then take the maximum of every tuple of 10
mean of the maxima: ~1.538

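The three means above can be reproduced in a few lines of NumPy (exact values vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3_000_000

print(rng.standard_normal(n).mean())                    # single samples: mean ~ 0
print(rng.standard_normal((n, 3)).max(axis=1).mean())   # max over 3 samples: ~0.85
print(rng.standard_normal((n, 10)).max(axis=1).mean())  # max over 10 samples: ~1.54
```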
11
Problem of overestimation

Suppose the true $Q(s', a)$ is equal to 0 for all $a$.

But we have an approximation (or other kind of) error $\varepsilon_a$, so the estimate is $\hat{Q}(s', a) = Q(s', a) + \varepsilon_a$.

So $Q(s, a)$ should be equal to $r$.

But if we update $Q(s, a)$ towards $r + \gamma \max_a \hat{Q}(s', a)$,
we will have an overestimate $Q(s, a) > r$, because $\mathbb{E}\big[\max_a \varepsilon_a\big] > 0$.

12
Double Q-learning (NIPS 2010)
Idea: use two estimators of Q-values, $Q_A$ and $Q_B$.
They should compensate for each other's mistakes because their errors will be independent.
Let's take the argmax from the other estimator!
- Q-learning target: $y = r + \gamma \max_a Q(s', a)$

- Rewritten Q-learning target: $y = r + \gamma\, Q\big(s', \arg\max_a Q(s', a)\big)$

- Double Q-learning target: $y = r + \gamma\, Q_B\big(s', \arg\max_a Q_A(s', a)\big)$

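A tabular sketch of the double estimator (state and action counts, the learning rate, and the 50/50 choice of which table to update are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s2, alpha=0.1, gamma=0.99):
    """Flip a coin: one estimator picks the argmax, the other evaluates it."""
    if np.random.rand() < 0.5:
        a_star = Q_A[s2].argmax()                                       # A chooses the action
        Q_A[s, a] += alpha * (r + gamma * Q_B[s2, a_star] - Q_A[s, a])  # B evaluates it
    else:
        b_star = Q_B[s2].argmax()
        Q_B[s, a] += alpha * (r + gamma * Q_A[s2, b_star] - Q_B[s, a])
```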
13
Double Q-learning (NIPS 2010)

How to apply this algorithm in deep reinforcement learning?


14
Deep Reinforcement Learning with Double Q-learning (Deepmind, 2015)
Idea: use the main network to choose the action and the target network to evaluate it:

$y = r + \gamma\, Q\big(s', \arg\max_a Q(s', a; \theta);\, \theta^{-}\big)$

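In code the change from vanilla DQN is confined to the target computation; a sketch assuming the same batch tensors as in the DQN snippet above:

```python
import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma=0.99):
    """Main network selects the action, frozen target network evaluates it."""
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)        # argmax from the main network
        q_next = target_net(s2).gather(1, a_star).squeeze(1)  # value from the target network
        return r + gamma * q_next * (1.0 - done)
```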
15
Prioritized Experience Replay (2016, Deepmind)
Idea: sample transitions from experience replay more cleverly.

We want to assign a probability to every transition. Let's use the absolute value of the
transition's TD-error as its priority!

$P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$, where $p_i = |\delta_i|$ and $\alpha$ is the priority parameter (when $\alpha$ is 0 it's the uniform case)

Do you see the problem?


16
Prioritized Experience Replay (2016, Deepmind)
Idea: sample transitions from experience replay more cleverly.

We want to assign a probability to every transition. Let's use the absolute value of the
transition's TD-error as its priority!

$P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$, where $p_i = |\delta_i|$ and $\alpha$ is the priority parameter (when $\alpha$ is 0 it's the uniform case)

Do you see the problem?


Transitions become non-i.i.d., and therefore we introduce a bias.

17
Prioritized Experience Replay (2016, Deepmind)
Solution: we can correct the bias by using importance-sampling weights:

$w_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^{\beta}$, where $\beta$ is the correction parameter

We also normalize the weights by $\max_i w_i$ (there is no deep mathematical reason here, it just keeps the updates bounded)

When we put a transition into experience replay, we set its priority to the current maximum

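A NumPy sketch of proportional prioritization with importance-sampling correction (this is the naive $O(N)$ version; the paper uses a sum-tree, and the small constant added to the priorities is an assumption for numerical safety):

```python
import numpy as np

class PrioritizedReplay:
    """Naive O(N) proportional prioritization; a sum-tree would make sampling O(log N)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.prios = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        self.data.append(transition)
        self.prios.append(max(self.prios, default=1.0))
        self.data, self.prios = self.data[-self.capacity:], self.prios[-self.capacity:]

    def sample(self, batch_size):
        p = np.array(self.prios) ** self.alpha
        probs = p / p.sum()                                   # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        w = (len(self.data) * probs[idx]) ** (-self.beta)     # importance-sampling weights
        w /= w.max()                                          # normalize by max_i w_i
        return idx, [self.data[i] for i in idx], w

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.prios[i] = abs(float(d)) + self.eps          # priority = |TD-error| (+ small eps)
```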
18
Prioritized Experience Replay (2016, Deepmind)

19
Prioritized Experience Replay (2016, Deepmind)

This is the bonus homework!

20


Let’s watch a video…
https://www.youtube.com/watch?v=UXurvvDY93o

21
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Idea: change the network’s architecture.


Recall the advantage function: $A(s, a) = Q(s, a) - V(s)$

So $Q(s, a) = A(s, a) + V(s)$

[Figure: a standard network maps the state directly to $Q(s, \cdot)$; the dueling network splits into a $V(s)$ stream and an $A(s, \cdot)$ stream that are combined into $Q(s, \cdot)$.]

22
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Idea: change the network’s architecture.


Recall the advantage function: $A^*(s, a) = Q^*(s, a) - V^*(s)$

So $Q^*(s, a) = A^*(s, a) + V^*(s)$

[Figure: the dueling network estimates $V^*(s)$ and $A^*(s, \cdot)$ in separate streams and combines them into $Q^*(s, \cdot)$.]

Here is a problem!
23
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
There is one extra degree of freedom!

Example: the same $Q(s, \cdot) = (2, 4, 3)$ can be produced by different $V$/$A$ pairs:

  V(s)    A(s, ·)          Q(s, ·)
  0       ( 2,  4,  3)     (2, 4, 3)
  3       (-1,  1,  0)     (2, 4, 3)
  4       (-2,  0, -1)     (2, 4, 3)

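The ambiguity is easy to check numerically: any constant can be moved between $V$ and $A$ without changing $Q$ (the numbers are the ones from the table above):

```python
import numpy as np

q = np.array([2.0, 4.0, 3.0])       # the Q-values from the example above
for v in (0.0, 3.0, 4.0):           # three different candidate values V(s)
    a = q - v                       # the corresponding advantages A(s, ·)
    print(v, a, v + a)              # every pair reproduces exactly the same Q(s, ·)
```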
24
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
There is one extra degree of freedom!

Example: the same $Q(s, \cdot) = (2, 4, 3)$ can be produced by different $V$/$A$ pairs:

  V(s)    A(s, ·)          Q(s, ·)
  0       ( 2,  4,  3)     (2, 4, 3)
  3       (-1,  1,  0)     (2, 4, 3)
  4       (-2,  0, -1)     (2, 4, 3)

Which decomposition is the correct one?
Hint 1:
25
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)
There is one extra degree of freedom!

Example: the same $Q(s, \cdot) = (2, 4, 3)$ can be produced by different $V$/$A$ pairs:

  V(s)    A(s, ·)          Q(s, ·)
  0       ( 2,  4,  3)     (2, 4, 3)
  3       (-1,  1,  0)     (2, 4, 3)
  4       (-2,  0, -1)     (2, 4, 3)

Which decomposition is the correct one?
Hint 1: Hint 2:
26
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Solution: require $\max_a A(s, a)$ to be equal to zero!

So the Q-function is computed as:

$Q(s, a) = V(s) + A(s, a) - \max_{a'} A(s, a')$

The authors of this paper also introduced another way to compute the Q-values:

$Q(s, a) = V(s) + A(s, a) - \dfrac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$

They wrote that this variant increases the stability of the optimization.

(The fact that this loses the original semantics of Q doesn't matter.)

27
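A minimal PyTorch sketch of the dueling head using the mean-subtracted aggregation (layer sizes and action count are placeholders):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation: separate V(s) and A(s, ·) streams combined into Q(s, ·)."""

    def __init__(self, n_features=64, n_actions=4):
        super().__init__()
        self.value = nn.Linear(n_features, 1)               # V(s) stream
        self.advantage = nn.Linear(n_features, n_actions)   # A(s, ·) stream

    def forward(self, x):
        v, a = self.value(x), self.advantage(x)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a') removes the extra degree of freedom.
        return v + a - a.mean(dim=1, keepdim=True)
```

Replacing `a.mean(dim=1, keepdim=True)` with `a.max(dim=1, keepdim=True).values` gives the max-constrained variant shown above.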
Dueling Network Architectures for Deep Reinforcement Learning (2016, Deepmind)

Solution: require $\max_a A(s, a)$ to be equal to zero!

So the Q-function is computed as:

$Q(s, a) = V(s) + A(s, a) - \max_{a'} A(s, a')$

It's the homework!

The authors of this paper also introduced another way to compute the Q-values:

$Q(s, a) = V(s) + A(s, a) - \dfrac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$

They wrote that this variant increases the stability of the optimization.

(The fact that this loses the original semantics of Q doesn't matter.)

28
Questions?

29
