2.4 Advanced Tricks for DQNs
[1] Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
[2] Reinforcement Learning for Robots using Neural Networks. Technical Report, 1993.
2
Experience Replay
Experience Replay
4
Experience Replay
5
TD with Experience Replay
"
# (
• Find 𝒘 by minimizing 𝐿 𝒘 = ∑&!'# !
& )
6
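As a minimal sketch of the idea on this slide (assuming a NumPy setting and a hypothetical q_net that maps a batch of states to a (batch, n_actions) array of Q-values), the code below stores transitions in a replay buffer and estimates the loss 𝐿(𝒘) on a uniformly sampled minibatch rather than on the whole buffer.

    import random
    from collections import deque
    import numpy as np

    class ReplayBuffer:
        """Fixed-size FIFO buffer of transitions (s, a, r, s', done)."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def __len__(self):
            return len(self.buffer)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            batch = random.sample(self.buffer, batch_size)
            s, a, r, s_next, done = map(np.array, zip(*batch))
            return s, a, r, s_next, done

    def td_loss(q_net, batch, gamma=0.99):
        """Mean of 0.5 * delta_t^2 over the sampled transitions, where
        delta_t = Q(s_t, a_t; w) - (r_t + gamma * max_a Q(s_{t+1}, a; w)).
        Note: the target here still uses the same network; target networks come later."""
        s, a, r, s_next, done = batch
        q_sa = q_net(s)[np.arange(len(a)), a]                  # Q(s_t, a_t; w), a is an int array
        target = r + gamma * (1 - done) * q_net(s_next).max(axis=1)
        delta = q_sa - target
        return 0.5 * np.mean(delta ** 2)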
Advantages of Experience Replay
7
Algorithm: Deep Q-Learning
8
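A rough sketch of the deep Q-learning loop this slide refers to, assuming the ReplayBuffer from the previous sketch, the classic 4-tuple Gym step API, and hypothetical q_net / update_step callables (update_step would perform one SGD step on the TD loss for the sampled minibatch):

    import numpy as np

    def epsilon_greedy(q_net, s, epsilon, n_actions):
        """Behaviour policy: random action with prob. epsilon, else greedy w.r.t. Q."""
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(q_net(s[None])[0]))

    def dqn_training_loop(env, q_net, update_step, n_actions,
                          episodes=500, batch_size=32, epsilon=0.1):
        buffer = ReplayBuffer()
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = epsilon_greedy(q_net, s, epsilon, n_actions)
                s_next, r, done, _ = env.step(a)            # classic Gym API assumed
                buffer.add(s, a, r, s_next, done)
                if len(buffer) >= batch_size:
                    update_step(buffer.sample(batch_size))  # one SGD step on the TD loss
                s = s_next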
Improvement 1: Prioritized Experience Replay
Prioritized Experience Replay
10
Recap: Asynchronous Dynamic Programming
11
Recap: Prioritized Sweeping (Lecture 1.4)
max_𝑎 [ 𝑅(𝑠, 𝑎) + 𝛾 ∑_{𝑠′∈𝒮} 𝑃(𝑠′ | 𝑠, 𝑎) 𝑉(𝑠′) ] − 𝑉(𝑠)
12
Prioritized Experience Replay
• Prioritized sweeping (Moore & Atkeson, 1993; Andre et al., 1998) selects
which state to update next, prioritized according to the change in value, if
that update was executed.
13
Prioritized Experience Replay
14
Prioritized Experience Replay
• Note the log-log scale, which highlights the exponential speed-up from
replaying with an oracle (bright blue), compared to uniform replay (black).
15
Prioritized Experience Replay
16
Prioritized Experience Replay
17
Update TD Error
18
Scaling Learning Rate
19
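A minimal NumPy sketch of proportional prioritization as in Schaul et al. (2016), tying the last few slides together: sampling probabilities 𝑃(𝑖) ∝ 𝑝_𝑖^𝛼, importance-sampling weights (𝑁 ⋅ 𝑃(𝑖))^(−𝛽) normalized by their maximum to scale the learning rate, and priorities refreshed with the new |δ| after each update. The names priorities, compute_td_errors, and apply_sgd_step are placeholders, not from the paper.

    import numpy as np

    def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
        """Sample indices proportionally to p_i^alpha and return IS weights."""
        n = len(priorities)
        probs = priorities ** alpha
        probs /= probs.sum()
        idx = np.random.choice(n, size=batch_size, p=probs)
        weights = (n * probs[idx]) ** (-beta)   # importance-sampling correction
        weights /= weights.max()                # scale so the largest weight is 1
        return idx, weights

    def per_update(priorities, batch_size, compute_td_errors, apply_sgd_step,
                   alpha=0.6, beta=0.4, eps=1e-6):
        idx, weights = sample_prioritized(priorities, batch_size, alpha, beta)
        deltas = compute_td_errors(idx)            # delta_t for the sampled transitions
        apply_sgd_step(idx, weights * deltas)      # IS weights rescale the learning step
        priorities[idx] = np.abs(deltas) + eps     # refresh priorities with the new |delta|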
Possible Question
21
Prioritized Experience Replay
• This led to both faster learning and better final policy quality across most games of the Atari benchmark suite.
Prioritized Experience Replay. In ICLR, 2016.
22
Weakness of Experience Replay
23
Improvement 2: Target Network
Improvement 3: Double DQN
TD Learning for DQN
25
Problem of Overestimation
26
Recap: Maximization Bias (Lecture 2.2)
• This is because we use the same estimate 𝑄 both to choose the argmax and
to evaluate it
27
Reason 1: Maximization Bias
28
Reason 1: Maximization Bias
𝑞 ≥ max_𝑎 𝑥(𝑎)
29
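A tiny simulation of the effect on this slide, with assumed numbers: every true action value 𝑥(𝑎) is 0 and the estimates are unbiased, yet the expectation of their maximum is clearly positive.

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n_trials = 10, 100_000
    true_values = np.zeros(n_actions)                 # x(a) = 0 for every action

    # Unbiased noisy estimates: x(a) plus zero-mean Gaussian noise.
    estimates = true_values + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

    print(estimates.mean())               # ~0: each individual estimate is unbiased
    print(estimates.max(axis=1).mean())   # ~1.5 > 0 = max_a x(a): the max overestimates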
Reason 1: Maximization Bias
30
Reason 2: Bootstrapping
31
Why does overestimation happen?
32
Why is overestimation harmful?
33
Why is overestimation harmful?
34
Why is overestimation harmful?
35
Why is overestimation harmful?
• The more frequently (𝑠, 𝑎) appears in the replay buffer, the more severely 𝑄(𝑠, 𝑎; 𝒘) overestimates 𝑄∗(𝑠, 𝑎)
36
Solution
38
TD Learning with Target Network
• TD error: δ_𝑡 = 𝑄(𝑠_𝑡, 𝑎_𝑡; 𝒘) − 𝑌_𝑡
• SGD: 𝒘 ← 𝒘 − 𝛼 ⋅ δ_𝑡 ⋅ ∂𝑄(𝑠_𝑡, 𝑎_𝑡; 𝒘)/∂𝒘
39
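A PyTorch-style sketch of this update, assuming q_net and target_net share the same architecture (parameters 𝒘 and 𝒘⁻), actions a are integer tensors, and done is a 0/1 float tensor; the TD target is computed with the target network, so gradients flow only through 𝑄(𝑠_𝑡, 𝑎_𝑡; 𝒘).

    import torch
    import torch.nn.functional as F

    def td_step(q_net, target_net, optimizer, batch, gamma=0.99):
        s, a, r, s_next, done = batch                        # tensors for a sampled minibatch
        with torch.no_grad():                                # no gradient through the target
            y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1) # Q(s_t, a_t; w)
        loss = F.mse_loss(q_sa, y)                           # mean of delta_t^2 (up to a constant)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()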
Update Target Network
• Periodically update 𝒘⁻
• Option 1: 𝒘⁻ ← 𝒘
• Option 2: 𝒘⁻ ← 𝜏 ⋅ 𝒘 + (1 − 𝜏) ⋅ 𝒘⁻
40
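Both options from this slide in PyTorch-style code (a sketch, assuming q_net and target_net have identical architectures); Option 1 copies 𝒘 into 𝒘⁻ every C gradient steps, Option 2 is the soft (Polyak) update.

    import torch

    @torch.no_grad()
    def hard_update(target_net, q_net):
        """Option 1: w_minus <- w (typically applied every C gradient steps)."""
        target_net.load_state_dict(q_net.state_dict())

    @torch.no_grad()
    def soft_update(target_net, q_net, tau=0.005):
        """Option 2: w_minus <- tau * w + (1 - tau) * w_minus (applied every step)."""
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)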
Deep Q-Network
• Experience replay
• Target network: we used an iterative update that adjusts the action-values
(Q) towards target values that are only periodically updated, thereby
reducing correlations with the target.
• Results:
• Outperforms the best existing reinforcement learning methods on 43 of 49
games
• Its performance was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games
Human-level Control through Deep Reinforcement Learning. Nature, 2015.
41
Algorithm
42
Comparisons
43
Recap: Double Q-Learning (Lecture 2.2)
44
Recap: Double Tabular Q-Learning (Lecture 2.2)
45
Double DQN
Naïve Update
46
Double DQN
47
Double DQN
Double DQN
• Selection using DQN:
𝑎∗ = arg max_𝑎 𝑄(𝑠_{𝑡+1}, 𝑎; 𝒘)
48
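A sketch contrasting the naive target with the Double DQN target, under the same assumptions as the earlier PyTorch snippets: the online network 𝒘 selects 𝑎∗, the target network 𝒘⁻ evaluates it.

    import torch

    @torch.no_grad()
    def naive_target(target_net, r, s_next, done, gamma=0.99):
        # max over the same network that evaluates: prone to overestimation
        return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    @torch.no_grad()
    def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection with w
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation with w_minus
        return r + gamma * (1 - done) * q_eval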
Double DQN (DDQN)
Theorem
• Consider a state 𝑠 in which all the true optimal action values are equal: 𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) for some 𝑉∗(𝑠).
• Let 𝑄_𝑡 be arbitrary value estimates that are on the whole unbiased, in the sense that ∑_𝑎 (𝑄_𝑡(𝑠, 𝑎) − 𝑉∗(𝑠)) = 0, but that are not all correct, such that (1/𝑚) ∑_𝑎 (𝑄_𝑡(𝑠, 𝑎) − 𝑉∗(𝑠))² = 𝐶 for some 𝐶 > 0, where 𝑚 ≥ 2 is the number of actions in 𝑠.
• Under these conditions, max_𝑎 𝑄_𝑡(𝑠, 𝑎) ≥ 𝑉∗(𝑠) + √(𝐶 / (𝑚 − 1)).
• This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.
49
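A quick numerical check of the theorem with assumed numbers: for 𝑚 actions, taking errors of +𝜀 on 𝑚 − 1 actions and −(𝑚 − 1)𝜀 on the last one satisfies the unbiasedness and mean-squared-error conditions, and the maximum estimate sits exactly at the lower bound 𝑉∗(𝑠) + √(𝐶/(𝑚 − 1)).

    import numpy as np

    m, V_star, eps = 4, 0.0, 0.5
    # Errors: +eps on m-1 actions, -(m-1)*eps on one action -> unbiased overall.
    errors = np.array([eps] * (m - 1) + [-(m - 1) * eps])
    Q = V_star + errors

    C = np.mean((Q - V_star) ** 2)            # here C = (m - 1) * eps^2
    bound = V_star + np.sqrt(C / (m - 1))     # theorem's lower bound on max_a Q(s, a)

    print(Q.max(), ">=", bound)               # 0.5 >= 0.5: the bound holds and is tight here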
Why does double DQN work better?
50
Double DQN (DDQN)
[Figure: true value 𝑄∗(𝑠, 𝑎) = sin(𝑠), approximated with a degree-6 polynomial (d = 6)]
51
Double DQN (DDQN)
52
Summary: Overestimation
Computing TD targets
53
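For reference, a reconstruction of the three TD targets compared in this part of the lecture (standard forms, written in the notation of the earlier slides):
• DQN without target network: 𝑌_𝑡 = 𝑟_𝑡 + 𝛾 ⋅ max_𝑎 𝑄(𝑠_{𝑡+1}, 𝑎; 𝒘)
• DQN with target network: 𝑌_𝑡 = 𝑟_𝑡 + 𝛾 ⋅ max_𝑎 𝑄(𝑠_{𝑡+1}, 𝑎; 𝒘⁻)
• Double DQN: 𝑌_𝑡 = 𝑟_𝑡 + 𝛾 ⋅ 𝑄(𝑠_{𝑡+1}, arg max_𝑎 𝑄(𝑠_{𝑡+1}, 𝑎; 𝒘); 𝒘⁻)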
Improvement 4: Dueling
Network
Dueling Network
55
Advantage Function
56
Dueling Network: Formulation
57
Recap: Bellman Optimality Equation for 𝑉∗ (Lec 1.3)
𝑉∗(𝑠) = max_𝑎 𝑄∗(𝑠, 𝑎)
58
Recap: Bellman Optimality Equation for 𝑄∗ (Lec 1.3)
𝑄∗(𝑠, 𝑎) = 𝑅(𝑠, 𝑎) + 𝛾 ∑_{𝑠′∈𝒮} 𝑃(𝑠′ | 𝑠, 𝑎) max_{𝑎′} 𝑄∗(𝑠′, 𝑎′)
59
Recap: Bellman Expectation Equation for 𝑉^𝜋 (Lec 1.3)
𝑉^𝜋(𝑠) = ∑_{𝑎∈𝒜} 𝜋(𝑎 | 𝑠) 𝑄^𝜋(𝑠, 𝑎)
60
Recap: Bellman Expectation Equation for 𝑄^𝜋 (Lec 1.3)
61
Properties of Advantage Function
62
Properties of Advantage Function
𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) + 𝐴∗(𝑠, 𝑎)
𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) + 𝐴∗(𝑠, 𝑎) − max_𝑎 𝐴∗(𝑠, 𝑎)
(The second identity holds because max_𝑎 𝐴∗(𝑠, 𝑎) = max_𝑎 𝑄∗(𝑠, 𝑎) − 𝑉∗(𝑠) = 0.)
63
Dueling Network: Formulation
𝒘 = (𝒘^𝑉, 𝒘^𝐴)
64
Training
65
Overcome Non-identifiability
➢ Equation 1: 𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) + 𝐴∗(𝑠, 𝑎)
66
Dueling Network
𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) + 𝐴∗(𝑠, 𝑎) − max_𝑎 𝐴∗(𝑠, 𝑎)
• An alternative module replaces the max operator with an average:
𝑄∗(𝑠, 𝑎) = 𝑉∗(𝑠) + 𝐴∗(𝑠, 𝑎) − mean_𝑎 𝐴∗(𝑠, 𝑎)
➢ On the one hand this loses the original semantics of V and A, because they are now off-target by a constant; on the other hand it increases the stability of the optimization.
➢ The dueling network controls the agent in the same way as DQN.
➢ Train the dueling network by TD in the same way as DQN.
➢ (Do not train V and A separately.)
Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.
67
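A PyTorch-style sketch of the dueling head described above: a shared feature vector feeds separate value and advantage streams, combined with the mean-subtraction aggregation; the layer sizes are illustrative.

    import torch
    import torch.nn as nn

    class DuelingHead(nn.Module):
        def __init__(self, feature_dim, n_actions, hidden=128):
            super().__init__()
            self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))              # V(s; w_V)
            self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))  # A(s, a; w_A)

        def forward(self, features):
            v = self.value(features)          # shape (batch, 1)
            a = self.advantage(features)      # shape (batch, n_actions)
            # Q = V + A - mean_a A  (the "average" aggregation from the slide)
            return v + a - a.mean(dim=1, keepdim=True)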
Dueling Network
68
Advantages of Dueling Network
• Improved efficiency. By separating these two components, the network can more effectively learn
the value of states and the relative advantages of actions, which helps in scenarios where many actions
have similar values.
• Value function: how good it is to be in a given state, regardless of the action taken.
• Advantage function: how much better an action is compared to the average action in that state.
• Improved stability. Dueling networks help to stabilize training by allowing the model to learn the
value of states even when some actions are rarely chosen. This is particularly useful in environments with
many actions, as it prevents the network from overfitting to specific action choices.
• Better generalization. By focusing on the state value, the dueling architecture can generalize better
across similar states, improving performance in environments where certain actions are rarely taken but
can still be crucial in specific states.
69
Dueling Network
Vertical section: 10 states; 5 actions (go up, down, left, right, and no-op)
Horizontal section: 50 states; 10 and 20 actions, obtained by adding no-ops to the original environment
70
Dueling Network
Improvements of dueling architecture over the baseline Single network
71
Improvement 5: Noisy Network
NoisyNet
• Parameter 𝑤 (vector):
𝑤_𝑖 = 𝜇_𝑖 + 𝜎_𝑖 ⋅ 𝜉_𝑖
• Parameter 𝑤 (matrix):
𝑤_{𝑖𝑗} = 𝜇_{𝑖𝑗} + 𝜎_{𝑖𝑗} ⋅ 𝜉_{𝑖𝑗}
Noisy Networks for Exploration. In ICLR, 2018.
73
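A sketch of a noisy linear layer following the parameterization on this slide, 𝑤 = 𝜇 + 𝜎 ⋅ 𝜉 with independent Gaussian noise; the initialization constants here are illustrative, not necessarily the paper's exact values.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        """Linear layer whose weights are w_ij = mu_ij + sigma_ij * xi_ij."""
        def __init__(self, in_features, out_features, sigma_init=0.017):
            super().__init__()
            self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
            self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
            self.b_mu = nn.Parameter(torch.zeros(out_features))
            self.b_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

        def forward(self, x):
            w_xi = torch.randn_like(self.w_sigma)      # fresh noise xi each forward pass
            b_xi = torch.randn_like(self.b_sigma)
            weight = self.w_mu + self.w_sigma * w_xi   # mu + sigma * xi (mu, sigma are learned)
            bias = self.b_mu + self.b_sigma * b_xi
            return F.linear(x, weight, bias)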
NoisyNet
74
Recap: Bayesian Learning for Model Parameters (Lecture 1.2)
• Step 1: Given 𝑛 data points 𝐷 = {𝑥_1, 𝑥_2, …, 𝑥_𝑛}, write down the expression for the likelihood 𝑃(𝐷 | 𝜃)
• Step 2: Specify a prior: 𝑃(𝜃)
• Step 3: Compute the posterior:
𝑃(𝜃 | 𝐷) = 𝑃(𝐷 | 𝜃) 𝑃(𝜃) / 𝑃(𝐷)
75
Recap: Thompson Sampling (1933) (Lecture 1.2)
76
Algorithm
• Env: environment
• 𝜀: set of random variables of the network
• DUELING: Boolean; "true" for NoisyNet-Dueling and "false" for NoisyNet-DQN
• 𝐵: empty replay buffer; 𝑁_𝐵: replay buffer size
• 𝜁: initial network parameters; 𝜁⁻: initial target network parameters
• 𝑁_𝑇: training batch size
• 𝑁⁻: target network replacement frequency
77
Rainbow: Combining All Tricks
78
Rainbow: Combining All Tricks
79
Homework Assignment 2
82
Extra Reading Materials
83
Thanks & Q&A