
Some slides are from: Sergey Levine (UCB), Shusen Wang (Meta), Katerina Fragkiadaki (CMU)

COMP 4901Z: Reinforcement Learning


2.4 Advanced Tricks for DQNs

Long Chen (Dept. of CSE)


Deep Q-Learning [1]

• To alleviate the problems of correlated data and non-stationary distributions, [1] proposed experience replay [2].
• Results: outperformed all previous RL algorithms on six of the seven games.

[1] Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
[2] Reinforcement Learning for Robots using Neural Networks. Technical Report, 1993.
2
Experience Replay
Experience Replay

• A transition: (s_t, a_t, r_t, s_{t+1})

• Store the most recent n transitions in a replay buffer.
• Remove old transitions so that the buffer holds at most n transitions.
• The buffer capacity n is a tuning hyper-parameter.
• n is typically large, e.g., 10^5 ~ 10^6.
• The setting of n is application-specific. (A minimal buffer sketch follows below.)
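The buffer itself is just a bounded FIFO store with uniform sampling. A minimal sketch, assuming plain Python; the class and method names are illustrative, not from the lecture:

```python
# Minimal replay-buffer sketch (illustrative, not from the lecture).
# Stores the most recent n transitions and samples uniformly at random.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```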

A deeper look at experience replay. In NeurIPS workshop, 2017


Revisiting fundamentals of experience replay. In ICML, 2020

5
TD with Experience Replay

"
# (
• Find 𝒘 by minimizing 𝐿 𝒘 = ∑&!'# !
& )

• Stochastic gradient descent (SGD)


• Randomly sample a transition, (𝑠* , 𝑎* , 𝑟* , 𝑠*"#), from the buffer
• Compute TD errors, 𝛿*
+(#" /) + /(1# , 3# ; 𝒘)
• Stochastic gradient: 𝒈* = +𝒘
= 𝛿* ⋅ +𝒘
• SGD: 𝒘 ← 𝒘 − 𝛼 ⋅ 𝒈*
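A minimal sketch of one such update, assuming linear function approximation Q(s, a; w) = w·φ(s, a) with a hypothetical feature function `phi`; with a deep network the gradient would instead come from autodiff:

```python
# One TD update from a sampled transition (sketch under assumptions:
# a linear value model q(s, a; w) = w . phi(s, a) with a hypothetical
# feature function phi; gamma and alpha are hyper-parameters).
import numpy as np

def td_update(w, transition, phi, actions, gamma=0.99, alpha=1e-3):
    s, a, r, s_next = transition                      # sampled from the replay buffer
    q_sa = w @ phi(s, a)                              # current estimate Q(s, a; w)
    target = r + gamma * max(w @ phi(s_next, b) for b in actions)  # TD target Y
    delta = q_sa - target                             # TD error
    grad = delta * phi(s, a)                          # delta * dQ/dw for a linear model
    return w - alpha * grad                           # SGD step: w <- w - alpha * g
```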

6
Advantages of Experience Replay

• Each step of experience is potentially used in many weight updates, which


allows for greater data efficiency.
• Learning directly from consecutive samples is inefficient, due to the strong
correlations between the samples.
• Instead, randomizing the samples breaks these correlations and therefore
reduces the variance of the updates.
• By using experience replay the behavior distribution is averaged over many
of its previous states, smoothing out learning and avoiding oscillations or
divergence in the parameters.

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

7
Algorithm: Deep Q-Learning

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

8
Improvement 1: Prioritized
Experience Replay
Prioritized Experience Replay

• Basic Idea: not all transitions are equally important


• Q: Which kind of transitions is more important, left or right?

Prioritized Experience Replay. In ICLR, 2016

10
Recap: Asynchronous Dynamic Programming

• Several simple ideas for asynchronous dynamic programming:


• In-place dynamic programming
• Prioritized sweeping
• Real-time dynamic programming

11
Recap: Prioritized Sweeping (Lecture 1.4)

• Use the magnitude of the Bellman error to guide state selection, e.g.
| max_a ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V(s') ) − V(s) |

• Backup the state with the largest remaining Bellman error


• Update the Bellman error of affected states after each backup
• Requires knowledge of reverse dynamics (predecessor states)
• Can be implemented efficiently by maintaining a priority queue

12
Prioritized Experience Replay

• Prioritized sweeping (Moore & Atkeson, 1993; Andre et al., 1998) selects
which state to update next, prioritized according to the change in value, if
that update was executed.

Prioritized Experience Replay. In ICLR, 2016

13
Prioritized Experience Replay

A Motivative Example (Blind Cliffwalk)


• With only n states, the environment requires an exponential number of random steps until the first non-zero reward; to be precise, the chance that a random sequence of actions will lead to the reward is 2^{-n}.

Red arrow: leads to the terminal state

Prioritized Experience Replay. In ICLR, 2016

14
Prioritized Experience Replay

• Note the log-log scale, which highlights the exponential speed-up from
replaying with an oracle (bright blue), compared to uniform replay (black).

Prioritized Experience Replay. In ICLR, 2016

15
Prioritized Experience Replay

• Q: How to measure the importance of each transition?


• Simplest way: the magnitude of a transition’s TD error 𝛿
• It is particularly suitable for incremental, online RL algorithms (SARSA, Q-
Learning)
P(i) = p_i^α / Σ_k p_k^α
where p_i > 0 is the priority of transition i.
The exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
Prioritized Experience Replay. In ICLR, 2016

16
Prioritized Experience Replay

• Motivation: a big |δ_i| should be given a high priority.

• Option 1 (proportional): priority p_i = |δ_i| + ε
• Option 2 (rank-based): priority p_i = 1 / rank(i)
• The transitions are sorted so that |δ_i| is in descending order
• rank(i) is the rank of the i-th transition
(Both options are sketched in code below.)
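A small sketch of both priority options and of the resulting sampling distribution P(i) = p_i^α / Σ_k p_k^α, assuming NumPy; the function names are illustrative:

```python
# Sketch of the two priority definitions and the sampling distribution
# P(i) = p_i^alpha / sum_k p_k^alpha (illustrative, NumPy only).
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    return np.abs(td_errors) + eps                 # Option 1: p_i = |delta_i| + eps

def rank_based_priorities(td_errors):
    order = np.argsort(-np.abs(td_errors))         # indices sorted by |delta_i|, descending
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)  # rank(i) of each transition
    return 1.0 / ranks                             # Option 2: p_i = 1 / rank(i)

def sampling_probabilities(priorities, alpha=0.6):
    scaled = priorities ** alpha                   # alpha = 0 recovers uniform sampling
    return scaled / scaled.sum()
```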

Prioritized Experience Replay. In ICLR, 2016

17
Update TD Error

• Associate each transition, (s_t, a_t, r_t, s_{t+1}), with a TD error, δ_t

• If a transition is newly collected, we do not know its δ_t
• Simply set its δ_t to the maximum, so it has the highest priority
• Each time (s_t, a_t, r_t, s_{t+1}) is sampled from the buffer, we update its δ_t (see the sketch below)
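A sketch of this bookkeeping, assuming a list of priorities kept alongside the buffer; the names are illustrative:

```python
# Sketch of priority bookkeeping (illustrative): new transitions get the
# current maximum priority; priorities are refreshed after being replayed.

class PriorityStore:
    def __init__(self):
        self.priorities = []                    # one priority per stored transition

    def add(self):
        # A new transition has an unknown delta, so give it the highest priority.
        p_max = max(self.priorities) if self.priorities else 1.0
        self.priorities.append(p_max)

    def update(self, indices, td_errors, eps=1e-6):
        # After transitions are replayed, overwrite their priorities with |delta|.
        for i, delta in zip(indices, td_errors):
            self.priorities[i] = abs(delta) + eps
```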

Prioritized Experience Replay. In ICLR, 2016

18
Scaling Learning Rate

• SGD: 𝒘 ← 𝒘 − 𝛼 ⋅ 𝒈, where 𝛼 is the learning rate


• If uniform sampling is used, α is the same for all transitions
• If weighted (prioritized) sampling is used, α should be adjusted according to the importance
• Q: why? A: non-uniform sampling biases the gradient estimate towards high-priority transitions; scaling the learning rate acts as an importance-sampling correction.
• Scale the learning rate by (n · p_i)^{-β}, where β ∈ (0, 1).
• If p_1 = p_2 = … = p_n = 1/n (uniform sampling), the scaling factor equals 1.
• High-importance transitions (with high p_i) get low learning rates.
• In the beginning, set β small; increase β to 1 over time. (A sketch of the weight computation follows below.)
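A sketch of the scaling factor, following the slide's (n·p_i)^{-β} form; normalizing by the maximum weight is a common extra stabilization and is an assumption here, not part of the slide:

```python
# Importance-sampling weights used to scale the learning rate (sketch).
import numpy as np

def is_weights(probabilities, beta):
    n = len(probabilities)
    w = (n * probabilities) ** (-beta)      # equals 1 when sampling is uniform (p_i = 1/n)
    return w / w.max()                      # optional: keep the scaled step sizes <= 1

# Example: probabilities from P(i) = p_i^alpha / sum_k p_k^alpha,
# with beta annealed from a small value (e.g. 0.4) towards 1.0 over training.
```

The resulting weight multiplies each sampled transition's gradient, which is equivalent to giving it a per-sample learning rate.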
Prioritized Experience Replay. In ICLR, 2016

19
Possible Question

• Q: If a transition (s_j, a_j, r_j, s_{j+1}) is important, we increase its sampling weight 10 times but decrease its learning rate to α/10. Does prioritized experience replay still work?
• Case 1: learning rate α; use the sample (s_j, a_j, r_j, s_{j+1}) to compute the gradient once and update the parameters once.
• Case 2: learning rate α/10; use the sample (s_j, a_j, r_j, s_{j+1}) to compute the gradient 10 times and update the parameters 10 times.
• Q: Which one is better?
• A: Case 2 usually works better, but it costs 10 times the computation.
20
Prioritized Experience Replay

Big |δ_t| ==> high sampling probability ==> small learning rate

21
Prioritized Experience Replay

This led to both faster learning and better final policy quality across most games of the Atari benchmark suite.
Prioritized Experience Replay. In ICLR, 2016
22
Weakness of Experience Replay

• Q: What is the weakness of experience replay?


• A: On-policy methods, such as SARSA, REINFORCE, and A2C, cannot use experience replay.

23
Improvement 2: Target Network
Improvement 3: Double DQN
TD Learning for DQN

• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• The TD target Y_t is partly an estimate made by the DQN Q itself

• TD error: δ_t = Q(s_t, a_t; w) − Y_t
• SGD: w ← w − α · δ_t · ∂Q(s_t, a_t; w)/∂w
• Rewrite SGD: w ← w − α · (Q(s_t, a_t; w) − Y_t) · ∂Q(s_t, a_t; w)/∂w

• We use Y_t, which is partly based on Q, to update Q itself.

25
Problem of Overestimation

• TD learning makes DQN overestimate action-values. (Why?)


• Reason 1: The maximization bias.
• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)
• TD target is bigger than the real action-value
• Reason 2: Bootstrapping propagates the overestimation.

26
Recap: Maximization Bias (Lecture 2.2)

• We often need to maximize over our value estimates. The estimated


maxima suffer from maximization bias
• Consider a state for which, for all actions a, Q*(s, a) = 0. Our estimates Q(s, a) are uncertain: some are positive and some negative
• Intuitively (Jensen's inequality, since max is convex):
E[ max_i μ_i ] ≥ max_i E[ μ_i ]

• This is because we use the same estimate 𝑄 both to choose the argmax and
to evaluate it

27
Reason 1: Maximization Bias

• Let x_1, x_2, …, x_n be observed real numbers.

• Add zero-mean random noise to x_1, x_2, …, x_n and obtain Q_1, Q_2, …, Q_n
• The zero-mean noise does not affect the mean:
E[ mean_i(Q_i) ] = mean_i(x_i)
• The zero-mean noise increases the maximum:
E[ max_i(Q_i) ] ≥ max_i(x_i)
• The zero-mean noise decreases the minimum:
E[ min_i(Q_i) ] ≤ min_i(x_i)
(A small numerical check follows below.)
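A quick numerical check of these three claims (illustrative only):

```python
# Tiny numerical check: zero-mean noise leaves the mean unchanged,
# but inflates the max and deflates the min.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 1.5])                    # "true" values x_1..x_n
Q = x + rng.normal(0.0, 1.0, size=(100000, x.size))   # noisy estimates Q_1..Q_n

print(Q.mean(), x.mean())              # ~equal: noise does not affect the mean
print(Q.max(axis=1).mean(), x.max())   # E[max_i Q_i] > max_i x_i
print(Q.min(axis=1).mean(), x.min())   # E[min_i Q_i] < min_i x_i
```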

28
Reason 1: Maximization Bias

• True action-values: x(a_1), …, x(a_n)

• Noisy estimates made by the DQN: Q(s, a_1; w), …, Q(s, a_n; w)
• Suppose the estimates are unbiased:
mean_a x(a) = mean_a Q(s, a; w)

• q = max_a Q(s, a; w) is typically an overestimation:
q ≥ max_a x(a)

29
Reason 1: Maximization Bias

• We conclude that Q_{t+1} = max_a Q(s_{t+1}, a; w) is an overestimation of the true action-value at time t + 1
• The TD target, Y_t = r_t + γ · Q_{t+1}, is thereby an overestimation
• TD learning pushes Q(s_t, a_t; w) towards Y_t, which overestimates the true action-value

30
Reason 2: Bootstrapping

• TD learning performs bootstrapping


• The TD target in part uses Q_{t+1} = max_a Q(s_{t+1}, a; w)
• The TD target is then used for updating Q(s_t, a_t; w)
• Suppose the DQN overestimates the action-value
• Then Q(s_{t+1}, a; w) is an overestimation
• The maximization further pushes Q_{t+1} up
• When Q_{t+1} is used for updating Q(s_t, a_t; w), the overestimation is propagated back to the DQN

31
Why does overestimation happen?

32
Why is overestimation harmful?

• The agent is controlled by the DQN: a_t = argmax_a Q(s_t, a; w)

• Uniform overestimation is not a problem:

• Q*(s, a_1) = 200, Q*(s, a_2) = 100, Q*(s, a_3) = 230
• Action a_3 will be selected
• Suppose Q(s, a_i; w) = Q*(s, a_i) + 100, for all a_i
• Then the DQN believes a_3 has the highest value and will select a_3

34
Why is overestimation harmful?

• The agent is controlled by the DQN: a_t = argmax_a Q(s_t, a; w)

• Uniform overestimation is not a problem

• Non-uniform overestimation is problematic:

• Q*(s, a_1) = 200, Q*(s, a_2) = 100, Q*(s, a_3) = 230
• Q(s, a_1; w) = 280, Q(s, a_2; w) = 300, Q(s, a_3; w) = 240
• Action a_2 will be selected, even though a_3 has the highest true value

35
Why is overestimation harmful?

• Unfortunately, the overestimation is non-uniform


• Why?
• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w
• The TD target, Y_t, overestimates Q*(s_t, a_t)
• The TD algorithm pushes Q(s_t, a_t; w) towards Y_t
• Thus, Q(s_t, a_t; w) overestimates Q*(s_t, a_t)

• The more frequently (s, a) appears in the replay buffer, the worse Q(s, a; w) overestimates Q*(s, a)

36
Solution

• Problem: DQN trained by TD overestimates action-values


• Solution 1: Use a target network [1] to compute TD targets (address the
problem caused by bootstrapping)
• Solution 2: Use double DQN [2] to alleviate the overestimation caused by
maximization.

[1] Human-level Control Through Deep Reinforcement Learning. Nature, 2015.


[2] Deep Reinforcement Learning with Double Q-Learning. In AAAI, 2016.
37
Target Network

• Target network: Q(s, a; w⁻)

• The same network structure as the DQN, Q(s, a; w)
• Different parameters: w⁻ ≠ w
• Use Q(s, a; w) to control the agent and collect experience: {(s_t, a_t, r_t, s_{t+1})}
• Use Q(s, a; w⁻) to compute the TD target:

Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)

38
TD Learning with Target Network

• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)

• TD error: δ_t = Q(s_t, a_t; w) − Y_t
• SGD: w ← w − α · δ_t · ∂Q(s_t, a_t; w)/∂w

39
Update Target Network

• Periodically update w⁻:
• Option 1: w⁻ ← w
• Option 2: w⁻ ← τ · w + (1 − τ) · w⁻
(Both options are sketched below.)
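A sketch of both options, assuming PyTorch modules `q_net` and `target_net` with identical architecture; the function names and the value of τ are illustrative:

```python
# Sketch of the two update rules for the target-network parameters w-.
import torch.nn as nn

def hard_update(target_net: nn.Module, q_net: nn.Module):
    # Option 1: w- <- w (copy all parameters, done periodically)
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net: nn.Module, q_net: nn.Module, tau: float = 0.005):
    # Option 2: w- <- tau * w + (1 - tau) * w-
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)
```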

40
Deep Q-Network

• Experience replay
• Target network: we used an iterative update that adjusts the action-values
(Q) towards target values that are only periodically updated, thereby
reducing correlations with the target.
• Results:
• Outperforms the best existing reinforcement learning methods on 43 of 49
games
• Its performance was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games
Human-level Control through Deep Reinforcement Learning. Nature, 2015.

41
Algorithm

Human-level Control through Deep Reinforcement Learning. Nature, 2015.

42
Comparisons

• TD learning with the naïve update:

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• TD learning with a target network:

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)
• Though better than the naïve update, TD learning with a target network nevertheless overestimates action-values

43
Recap: Double Q-Learning (Lecture 2.2)

• Train two action-value functions, Q1 and Q2

• Do Q-learning on both, but
• Never on the same time steps (Q1 and Q2 are independent)
• Pick Q1 or Q2 at random to be updated on each step
• If updating Q1, use Q2 for the value of the next state:

Q1(S_t, A_t) ← Q1(S_t, A_t) + α [ R_{t+1} + γ · Q2(S_{t+1}, argmax_a Q1(S_{t+1}, a)) − Q1(S_t, A_t) ]

• Action selections are ε-greedy with respect to the sum of Q1 and Q2 (a one-step sketch follows below)
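A sketch of a single tabular Double Q-learning update, following the recap above; the table representation (anything indexable by (state, action)) and the hyper-parameters are assumptions:

```python
# One tabular Double Q-learning update (sketch). Q1 and Q2 can be
# dicts keyed by (state, action) or 2-D arrays indexed by [state, action].
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # With probability 0.5 swap roles, so Q1 and Q2 are each updated half the time.
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    # Select the next action with the table being updated ...
    a_star = max(actions, key=lambda b: Q1[s_next, b])
    # ... and evaluate it with the other table.
    target = r + gamma * Q2[s_next, a_star]
    Q1[s, a] += alpha * (target - Q1[s, a])
```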

44
Recap: Double Tabular Q-Learning (Lecture 2.2)

45
Double DQN

Naïve Update

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• Selection using the DQN:
a* = argmax_a Q(s_{t+1}, a; w)
• Evaluation using the DQN:
Y_t = r_t + γ · Q(s_{t+1}, a*; w)
• Serious overestimation

46
Double DQN

Using Target Network

• Selection using the target network:
a* = argmax_a Q(s_{t+1}, a; w⁻)

• Evaluation using the target network:
Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)
• It works better, but the overestimation is still serious.

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

47
Double DQN

Double DQN
• Selection using the DQN:
a* = argmax_a Q(s_{t+1}, a; w)

• Evaluation using the target network:
Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)
• It is the best among the three, but overestimation may still happen (the three targets are sketched in code below)
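A sketch that puts the three targets side by side, assuming PyTorch networks `q_net` (parameters w) and `target_net` (parameters w⁻) that map a batch of states to per-action values; all names are illustrative:

```python
# Sketch comparing the three TD targets: naive, target network, Double DQN.
import torch

def td_targets(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        q_next = q_net(s_next)            # Q(s_{t+1}, ., w)
        q_next_tgt = target_net(s_next)   # Q(s_{t+1}, ., w-)

        # Naive: select and evaluate with the DQN itself.
        y_naive = r + gamma * q_next.max(dim=1).values

        # Target network: select and evaluate with w-.
        y_target = r + gamma * q_next_tgt.max(dim=1).values

        # Double DQN: select with w, evaluate with w-.
        a_star = q_next.argmax(dim=1, keepdim=True)
        y_double = r + gamma * q_next_tgt.gather(1, a_star).squeeze(1)
    return y_naive, y_target, y_double
```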

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

48
Double DQN (DDQN)

Theorem
• Consider a state s in which all the true optimal action values are equal: Q*(s, a) = V*(s) for some V*(s).

• Let Q_t be arbitrary value estimates that are on the whole unbiased, in the sense that Σ_a (Q_t(s, a) − V*(s)) = 0, but that are not all correct, such that (1/m) Σ_a (Q_t(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s.

• Under these conditions, max_a Q_t(s, a) ≥ V*(s) + sqrt( C / (m − 1) ).

• This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

49
Why does double DQN work better?

• Double DQN decouples selection from evaluation.

• Selection using the DQN: a* = argmax_a Q(s_{t+1}, a; w)

• Evaluation using the target network: Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)

• Double DQN alleviates overestimation:
Q(s_{t+1}, a*; w⁻) ≤ max_a Q(s_{t+1}, a; w⁻)
(left-hand side: the estimate used by Double DQN; right-hand side: the estimate used by the target-network update)

50
Double DQN (DDQN)

[Figure: true action-values (purple) — Q*(s, a) = sin(s) fitted with a degree-6 polynomial, and Q*(s, a) = 2·exp(−s²) fitted with degree-6 and degree-9 polynomials.]

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

51
Double DQN (DDQN)

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

52
Summary: Overestimation

• Because of the maximization, the TD target overestimates the true action-value


• By creating a “positive feedback loop”, bootstrapping further exacerbates the
overestimation.
• The target network can partly avoid bootstrapping (not completely, because w⁻ depends on w).
• Double DQN alleviates the overestimation caused by the maximization.


53
Improvement 4: Dueling Network
Dueling Network

• Dueling Architecture: Two-stream Q network (vs. single-stream Q network)

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

55
Advantage Function

• Action-value function: Q_π(s_t, a_t) = E[ G_t | S_t = s_t, A_t = a_t ]

• State-value function: V_π(s_t) = E_A[ Q_π(s_t, A) ]

• Optimal action-value function: Q*(s, a) = max_π Q_π(s, a)

• Optimal state-value function: V*(s) = max_π V_π(s)

• Optimal advantage function: A*(s, a) = Q*(s, a) − V*(s)

• Idea borrowed from Policy Gradient (will be introduced in the following lectures)
Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

56
Dueling Network: Formulation

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

57
Recap: Bellman Optimality Equation for 𝑉∗ (Lec 1.3)

V*(s) = max_a Q*(s, a)

V*(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s') ]

58
Recap: Bellman Optimality Equation for 𝑄∗ (Lec 1.3)

Q*(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s')

Q*(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'} Q*(s', a')

59
Recap: Bellman Expectation Equation for V_π (Lec 1.3)

V_π(s) = Σ_{a∈A} π(a|s) Q_π(s, a)

V_π(s) = Σ_{a∈A} π(a|s) [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s') ]

60
Recap: Bellman Expectation Equation for Q_π (Lec 1.3)

Q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s')

Q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') Q_π(s', a')

61
Properties of Advantage Function

• Theorem 1: V*(s) = max_a Q*(s, a)

• Recall the definition of the optimal advantage function:
A*(s, a) = Q*(s, a) − V*(s)
• It follows that
max_a A*(s, a) = max_a Q*(s, a) − V*(s) = 0

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

62
Properties of Advantage Function

• Definition of advantage: A*(s, a) = Q*(s, a) − V*(s)

Q*(s, a) = V*(s) + A*(s, a)

Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

63
Dueling Network: Formulation

• Theorem 2: Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

• Approximate V*(s) by a neural network, V(s; w_V)

• Approximate A*(s, a) by a neural network, A(s, a; w_A)
• Thus, approximate Q*(s, a) by the dueling network (sketched below):

Q(s, a; w_A, w_V) = V(s; w_V) + A(s, a; w_A) − max_a A(s, a; w_A)

w = (w_A, w_V)
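A sketch of a dueling head on top of a shared feature trunk, assuming PyTorch; the layer sizes and names are illustrative:

```python
# Dueling head sketch: two streams V(s; w_V) and A(s, a; w_A) combined as
# Q = V + A - max_a A (or mean_a A, see the later slide on the mean variant).
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions, use_mean=False):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)               # V(s; w_V)
        self.advantage = nn.Linear(feat_dim, n_actions)   # A(s, a; w_A)
        self.use_mean = use_mean

    def forward(self, features):
        v = self.value(features)                          # shape (batch, 1)
        a = self.advantage(features)                      # shape (batch, n_actions)
        if self.use_mean:
            a_ref = a.mean(dim=1, keepdim=True)           # mean_a A(s, a)
        else:
            a_ref = a.max(dim=1, keepdim=True).values     # max_a A(s, a)
        return v + a - a_ref                              # Q(s, a; w_A, w_V)
```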

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

64
Training

• Dueling network, 𝑄(𝑠, 𝑎; 𝒘), is an approximation to 𝑄∗ (𝑠, 𝑎)


• Learn the parameters, w = (w_A, w_V), in the same way as for the other DQNs
• Previous tricks can be used in the same way:
• Prioritized experience replay
• Double DQN
• Multi-step TD target

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

65
Overcome Non-identifiability

Ø Equation 1: Q*(s, a) = V*(s) + A*(s, a)

Ø Equation 2: Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

Question: why is the extra term necessary, given that it equals zero?

Ø Equation 1 has the problem of non-identifiability.
Ø Let V' = V* + 10 and A' = A* − 10
Ø Then Q*(s, a) = V*(s) + A*(s, a) = V'(s) + A'(s, a)
Ø Equation 2 does not have this problem.

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

66
Dueling Network

Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)
• An alternative module replaces the max operator with an average:
Q*(s, a) = V*(s) + A*(s, a) − mean_a A*(s, a)
Ø On the one hand this loses the original semantics of V and A, because they are now off-target by a constant, but on the other hand it increases the stability of the optimization.
Ø The dueling network controls the agent in the same way as DQN.
Ø Train the dueling network by TD in the same way as DQN.
Ø (Do not train V and A separately.)
Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

67
Dueling Network

• Value function + Advantage function

• The value stream learns to pay attention


to the road.
• The advantage stream learns to pay
attention only when there are cars
immediately in front, so as to avoid
collisions.

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

68
Advantages of Dueling Network

• Improved efficiency. By separating these two components, the network can more effectively learn
the value of states and the relative advantages of actions, which helps in scenarios where many actions
have similar values.
• Value function: how good it is to be in a given state, regardless of the action taken.
• Advantage function: how much better an action is compared to the average action in that state.

• Improved stability. Dueling networks help to stabilize training by allowing the model to learn the
value of states even when some actions are rarely chosen. This is particularly useful in environments with
many actions, as it prevents the network from overfitting to specific action choices.

• Better generalization. By focusing on the state value, the dueling architecture can generalize better
across similar states, improving performance in environments where certain actions are rarely taken but
can still be crucial in specific states.

69
Dueling Network

Vertical section: 10 states; 5 actions (go up, down, left, right, and no-op).
Horizontal section: 50 states; 10 and 20 actions (adding no-ops to the original environment).

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

70
Dueling Network
Improvements of dueling architecture over the baseline Single network

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

71
Improvement 5: Noisy Network
NoisyNet

• Parameter w (vector): w_i = μ_i + σ_i · ξ_i
• Parameter w (matrix): w_ij = μ_ij + σ_ij · ξ_ij
Noisy Networks for Exploration. In ICLR, 2018.

73
NoisyNet

• Fully connected layer:
z = ReLU(W x + b)
• NoisyNet version (the products with the noise ξ are element-wise; a layer sketch follows below):
z = ReLU( (W_μ + W_σ · W_ξ) x + b_μ + b_σ · b_ξ )
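A sketch of such a noisy fully connected layer, assuming PyTorch and the simpler independent-Gaussian-noise variant (the paper also describes a factorised variant); the initial values are illustrative:

```python
# NoisyLinear sketch: mu and sigma are learned, the noise xi is resampled
# at every forward pass (independent Gaussian noise per weight).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        w_xi = torch.randn_like(self.w_sigma)        # fresh noise each forward pass
        b_xi = torch.randn_like(self.b_sigma)
        weight = self.w_mu + self.w_sigma * w_xi     # W_mu + W_sigma (element-wise) W_xi
        bias = self.b_mu + self.b_sigma * b_xi       # b_mu + b_sigma (element-wise) b_xi
        return F.linear(x, weight, bias)

# Usage sketch: z = torch.relu(NoisyLinear(128, n_actions)(h))
```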

Noisy Networks for Exploration. In ICLR, 2018.

74
Recap: Bayesian Learning for Model Parameters (Lecture 1.2)

• Step 1: Given n data points, D = {x_1, x_2, …, x_n}, write down the expression for the likelihood:
P(D | θ)
• Step 2: Specify a prior: P(θ)
• Step 3: Compute the posterior:
P(θ | D) = P(D | θ) P(θ) / P(D)

75
Recap: Thompson Sampling (1933) (Lecture 1.2)

• Represent a distribution over the mean reward of each bandit, as opposed to a point estimate of the mean reward alone. At each timestep:
1. Sample from the mean reward distributions:
θ̄_1 ~ P̂(θ_1), θ̄_2 ~ P̂(θ_2), …, θ̄_k ~ P̂(θ_k)
2. Choose the action a = argmax_a E_θ̄[ R(a) ]
3. Observe the reward
4. Update the mean reward posterior distributions: P̂(θ_1), P̂(θ_2), …, P̂(θ_k)
(A Beta-Bernoulli sketch follows below.)
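A sketch for Bernoulli bandits with Beta posteriors; the Beta-Bernoulli choice and the `pull(a)` environment callback are assumptions made for illustration, not part of the slide:

```python
# Thompson sampling sketch for k Bernoulli arms with Beta posteriors.
import numpy as np

def thompson_sampling(pull, k, steps, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.ones(k)   # Beta parameters: successes + 1
    beta = np.ones(k)    # Beta parameters: failures + 1
    for _ in range(steps):
        theta = rng.beta(alpha, beta)     # 1. sample a mean reward per arm
        a = int(np.argmax(theta))         # 2. choose the greedy arm under the samples
        r = pull(a)                       # 3. observe a 0/1 reward from the environment
        alpha[a] += r                     # 4. update the posterior of the chosen arm
        beta[a] += 1 - r
    return alpha, beta
```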

76
Algorithm
• Env: environment
• ε: set of random variables of the network
• DUELING: Boolean; "true" for NoisyNet-Dueling and "false" for NoisyNet-DQN
• B: empty replay buffer; N_B: replay buffer size
• ζ: initial network parameters; ζ⁻: initial target network parameters
• N_T: training batch size
• N⁻: target network replacement frequency

Noisy Networks for Exploration. In ICLR, 2018.

77
Rainbow: Combining All Tricks

• Multi-step bootstrap targets


• Maximization bias (Double DQN)
• Prioritized experience replay
• Dueling network architecture
• Noisy DQN
• Distributional Q-Learning (skip)

Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

78
Rainbow: Combining All Tricks

Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

79
Homework Assignment 2

• HW2 will be released today


• The deadline is Oct 31st (two weeks)

82
Extra Reading Materials

• Double Q-learning. In NIPS, 2010.


• Playing Atari with Deep Reinforcement Learning. In NeurIPS workshop, 2013.
• Human-level Control through Deep Reinforcement Learning. Nature, 2015.
• Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.
• Prioritized Experience Replay. In ICLR, 2016.
• Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.
• A Distributional Perspective on Reinforcement Learning. In ICML, 2017.
• Noisy Networks for Exploration. In ICLR, 2018.
• Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

83
Thanks & Q&A
