4b - Deep Reinforcement Learning
Episode 4B
What we already know:
- Q-learning
- Experience replay
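As a refresher, here is a minimal sketch of the tabular Q-learning update (the table sizes and environment interface are assumed for illustration, this is not the lecture's code). Note the max over next-state actions in the target; that operator is exactly what the overestimation discussion below is about.

```python
import numpy as np

# Tabular Q-learning recap (sizes and interface are assumed).
n_states, n_actions, alpha, gamma = 10, 4, 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    # Bootstrapped target with a max over next-state actions.
    target = r + (1.0 - done) * gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```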
Target network
Idea: use a network with frozen weights to compute the target.
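A minimal sketch of the idea in PyTorch (the network shape and sync period are my choices, not the lecture's): the online network is trained as usual, while a frozen copy computes the bootstrapped target and is only synced once in a while.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)      # frozen copy of the online weights
for p in target_net.parameters():
    p.requires_grad_(False)

gamma, sync_every = 0.99, 1000

def td_target(reward, next_state, done):
    # The target uses the *frozen* network, so the regression target
    # stays fixed between syncs instead of moving on every gradient step.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def maybe_sync(step):
    # Periodically copy the online weights into the frozen target network.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```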
Playing Atari with Deep Reinforcement Learning (2013, DeepMind)
- Experience replay
- A CNN that maps raw pixels to Q-values
Asynchronous Methods for Deep Reinforcement Learning (2016, DeepMind)
Many actor-learners run in parallel and asynchronously update shared network parameters (the best-known variant is A3C).
Problem of overestimation
We use the "max" operator to compute the target.
Problem of overestimation
A quick experiment with a standard normal distribution (true mean 0):
- 3*10⁶ samples: mean ≈ 0.0004
- 3*10⁶ tuples of 3 samples each, then take the maximum of every tuple: mean ≈ 0.8467
- 3*10⁶ tuples of 10 samples each, then take the maximum of every tuple: mean ≈ 1.538
So even when every individual estimate is unbiased zero-mean noise, the max over several estimates is biased upwards; this is exactly what the max in the Q-learning target does.
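These numbers are easy to reproduce; here is a short numpy sketch (my reconstruction of the experiment, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3_000_000

# Plain samples from N(0, 1): the empirical mean is close to 0.
print(rng.standard_normal(n).mean())

# Draw tuples of k zero-mean samples and take the max of every tuple:
# the mean of the maxima is clearly biased upwards.
for k in (3, 10):
    tuples = rng.standard_normal((n, k))
    print(k, tuples.max(axis=1).mean())   # ~0.85 for k=3, ~1.54 for k=10
```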
Double Q-learning (NIPS 2010)
Idea: use two estimators of the Q-values. Because they are updated on different samples, their mistakes are independent and should compensate each other.
Let's take the argmax from the other estimator!
- Q-learning target: y = r + γ max_a Q(s', a)
- Double Q-learning target: y = r + γ Q_B(s', argmax_a Q_A(s', a))
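A tabular sketch of the update (table sizes and hyperparameters are assumed): on each step a coin flip decides which estimator is updated, and the greedy action is chosen with one table but evaluated with the other.

```python
import numpy as np

n_states, n_actions, alpha, gamma = 10, 4, 0.1, 0.99
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def double_q_update(s, a, r, s_next, done):
    if rng.random() < 0.5:
        a_star = Q_A[s_next].argmax()   # select the action with A ...
        target = r + (1.0 - done) * gamma * Q_B[s_next, a_star]  # ... evaluate it with B
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        a_star = Q_B[s_next].argmax()   # and symmetrically with B/A
        target = r + (1.0 - done) * gamma * Q_A[s_next, a_star]
        Q_B[s, a] += alpha * (target - Q_B[s, a])
```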
Prioritized Experience Replay (2016, DeepMind)
Idea: sample transitions from the experience replay buffer more cleverly.
We want to assign a sampling probability to every transition. Let's use the absolute value of the transition's TD-error as its (unnormalized) priority!
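A minimal sketch of proportional prioritization (class and method names are mine; the paper additionally uses a sum-tree for efficiency, a priority exponent, and importance-sampling weights, all omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

class PrioritizedReplay:
    def __init__(self, capacity, eps=1e-2):
        self.capacity, self.eps = capacity, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        # Priority = |TD-error| plus a small eps so nothing gets probability 0.
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) + self.eps)
        if len(self.buffer) > self.capacity:   # drop the oldest transition
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        # Normalize priorities into a probability distribution and sample.
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=p)
        return idx, [self.buffer[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After a training step, refresh priorities with the new TD-errors.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps
```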
Dueling Network Architectures for Deep Reinforcement Learning (2016, DeepMind)
Idea: split the network into two streams, a state value V(s) and an advantage A(s, a), and combine them into Q(s, a) = V(s) + A(s, a).
Here is a problem!
Here is one extra degree of freedom! The decomposition Q(s, a) = V(s) + A(s, a) is not unique.
Example: the same Q-values, Q(s, a1) = 2, Q(s, a2) = 4, Q(s, a3) = 3, can be produced by several (V, A) pairs:

            V = 0   V = 3   V = 4
A(s, a1)        2      -1      -2
A(s, a2)        4       1       0
A(s, a3)        3       0      -1

What is correct?
Hint 1: for the greedy action a* we have Q(s, a*) = V(s), so the advantage of the greedy action should be zero: A(s, a*) = 0. Only the V = 4 column satisfies this.
Hint 2: the network can be forced into an identifiable decomposition by subtracting an aggregate of the advantages, e.g. Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a') (the averaging variant used in the paper).
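A minimal sketch of the dueling head with the mean-subtraction aggregator (PyTorch; layer sizes are my assumptions):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s) stream
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a) stream

    def forward(self, x):
        h = self.torso(x)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'):
        # subtracting the mean advantage removes the extra degree of freedom.
        return v + a - a.mean(dim=1, keepdim=True)
```

With this aggregator the decomposition is identifiable: in every state the advantages average to zero, so V and A can no longer trade a constant back and forth.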