
Some slides are from: Sergey Levine (UCB), Shusen Wang (Meta), Katerina Fragkiadaki (CMU)

COMP 4901Z: Reinforcement Learning


2.4 Advanced Tricks for DQNs

Long Chen (Dept. of CSE)


Deep Q-Learning [1]

• To alleviate the problems of correlated data and non-stationary distributions, [1] proposed experience replay [2].
• Results: outperformed all previous RL algorithms on six of the seven games.

[1] Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.
[2] Reinforcement Learning for Robots using Neural Networks. Technical Report, 1993.
2
Experience Replay
Experience Replay

• A transition: (s_t, a_t, r_t, s_{t+1})

• Store the most recent n transitions in a replay buffer.
• Remove old transitions so that the buffer holds at most n transitions.
• The buffer capacity n is a tuning hyper-parameter.
• n is typically large, e.g., 10^5 ~ 10^6.
• The setting of n is application-specific. (A minimal buffer sketch follows below.)
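The buffer itself is just a bounded FIFO store with uniform sampling. A minimal sketch, assuming plain Python; the class and method names are illustrative, not from the lecture:

```python
# Minimal replay-buffer sketch (illustrative, not from the lecture).
# Stores the most recent n transitions and samples uniformly at random.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```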

A deeper look at experience replay. In NeurIPS workshop, 2017


Revisiting fundamentals of experience replay. In ICML, 2020

5
TD with Experience Replay

"
# (
• Find 𝒘 by minimizing 𝐿 𝒘 = ∑&!'# !
& )

• Stochastic gradient descent (SGD)


• Randomly sample a transition, (𝑠* , 𝑎* , 𝑟* , 𝑠*"#), from the buffer
• Compute TD errors, 𝛿*
+(#" /) + /(1# , 3# ; 𝒘)
• Stochastic gradient: 𝒈* = +𝒘
= 𝛿* ⋅ +𝒘
• SGD: 𝒘 ← 𝒘 − 𝛼 ⋅ 𝒈*
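A minimal sketch of one such update, assuming linear function approximation Q(s, a; w) = w·φ(s, a) with a hypothetical feature function `phi`; with a deep network the gradient would instead come from autodiff:

```python
# One TD update from a sampled transition (sketch under assumptions:
# a linear value model q(s, a; w) = w . phi(s, a) with a hypothetical
# feature function phi; gamma and alpha are hyper-parameters).
import numpy as np

def td_update(w, transition, phi, actions, gamma=0.99, alpha=1e-3):
    s, a, r, s_next = transition                      # sampled from the replay buffer
    q_sa = w @ phi(s, a)                              # current estimate Q(s, a; w)
    target = r + gamma * max(w @ phi(s_next, b) for b in actions)  # TD target Y
    delta = q_sa - target                             # TD error
    grad = delta * phi(s, a)                          # delta * dQ/dw for a linear model
    return w - alpha * grad                           # SGD step: w <- w - alpha * g
```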

6
Advantages of Experience Replay

• Each step of experience is potentially used in many weight updates, which


allows for greater data efficiency.
• Learning directly from consecutive samples is inefficient, due to the strong
correlations between the samples.
• Instead, randomizing the samples breaks these correlations and therefore
reduces the variance of the updates.
• By using experience replay the behavior distribution is averaged over many
of its previous states, smoothing out learning and avoiding oscillations or
divergence in the parameters.

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

7
Algorithm: Deep Q-Learning

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

8
Improvement 1: Prioritized
Experience Replay
Prioritized Experience Replay

• Basic Idea: not all transitions are equally important


• Q: Which kind of transitions is more important, left or right?

Prioritized Experience Replay. In ICLR, 2016

10
Recap: Asynchronous Dynamic Programming

• Several simple ideas for asynchronous dynamic programming:


• In-place dynamic programming
• Prioritized sweeping
• Real-time dynamic programming

11
Recap: Prioritized Sweeping (Lecture 1.4)

• Use the magnitude of the Bellman error to guide state selection, e.g.
| max_a ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V(s') ) − V(s) |

• Backup the state with the largest remaining Bellman error


• Update the Bellman error of affected states after each backup
• Requires knowledge of reverse dynamics (predecessor states)
• Can be implemented efficiently by maintaining a priority queue

12
Prioritized Experience Replay

• Prioritized sweeping (Moore & Atkeson, 1993; Andre et al., 1998) selects
which state to update next, prioritized according to the change in value, if
that update was executed.

Prioritized Experience Replay. In ICLR, 2016

13
Prioritized Experience Replay

A Motivative Example (Blind Cliffwalk)


• With only n states, the environment requires an exponential number of random steps until the first non-zero reward; to be precise, the chance that a random sequence of actions will lead to the reward is 2^{-n}.

Red arrow: leads to the terminal state

Prioritized Experience Replay. In ICLR, 2016

14
Prioritized Experience Replay

• Note the log-log scale, which highlights the exponential speed-up from
replaying with an oracle (bright blue), compared to uniform replay (black).

Prioritized Experience Replay. In ICLR, 2016

15
Prioritized Experience Replay

• Q: How to measure the importance of each transition?


• Simplest way: the magnitude of a transition’s TD error 𝛿
• It is particularly suitable for incremental, online RL algorithms (SARSA, Q-
Learning)
P(i) = p_i^α / Σ_k p_k^α
where p_i > 0 is the priority of transition i.
The exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
Prioritized Experience Replay. In ICLR, 2016

16
Prioritized Experience Replay

• Motivation: a big |δ_i| should be given a high priority.

• Option 1 (proportional): priority p_i = |δ_i| + ε
• Option 2 (rank-based): priority p_i = 1 / rank(i)
• The transitions are sorted so that |δ_i| is in descending order
• rank(i) is the rank of the i-th transition
(Both options are sketched in code below.)
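A small sketch of both priority options and of the resulting sampling distribution P(i) = p_i^α / Σ_k p_k^α, assuming NumPy; the function names are illustrative:

```python
# Sketch of the two priority definitions and the sampling distribution
# P(i) = p_i^alpha / sum_k p_k^alpha (illustrative, NumPy only).
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    return np.abs(td_errors) + eps                 # Option 1: p_i = |delta_i| + eps

def rank_based_priorities(td_errors):
    order = np.argsort(-np.abs(td_errors))         # indices sorted by |delta_i|, descending
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)  # rank(i) of each transition
    return 1.0 / ranks                             # Option 2: p_i = 1 / rank(i)

def sampling_probabilities(priorities, alpha=0.6):
    scaled = priorities ** alpha                   # alpha = 0 recovers uniform sampling
    return scaled / scaled.sum()
```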

Prioritized Experience Replay. In ICLR, 2016

17
Update TD Error

• Associate each transition, (s_t, a_t, r_t, s_{t+1}), with a TD error, δ_t

• If a transition is newly collected, we do not know its δ_t
• Simply set its δ_t to the maximum, so it has the highest priority
• Each time (s_t, a_t, r_t, s_{t+1}) is sampled from the buffer, we update its δ_t (see the sketch below)
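A sketch of this bookkeeping, assuming a list of priorities kept alongside the buffer; the names are illustrative:

```python
# Sketch of priority bookkeeping (illustrative): new transitions get the
# current maximum priority; priorities are refreshed after being replayed.

class PriorityStore:
    def __init__(self):
        self.priorities = []                    # one priority per stored transition

    def add(self):
        # A new transition has an unknown delta, so give it the highest priority.
        p_max = max(self.priorities) if self.priorities else 1.0
        self.priorities.append(p_max)

    def update(self, indices, td_errors, eps=1e-6):
        # After transitions are replayed, overwrite their priorities with |delta|.
        for i, delta in zip(indices, td_errors):
            self.priorities[i] = abs(delta) + eps
```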

Prioritized Experience Replay. In ICLR, 2016

18
Scaling Learning Rate

• SGD: 𝒘 ← 𝒘 − 𝛼 ⋅ 𝒈, where 𝛼 is the learning rate


• If uniform sampling is used, α is the same for all transitions
• If weighted (prioritized) sampling is used, α should be adjusted according to the importance
• Q: why? A: non-uniform sampling biases the gradient estimate towards high-priority transitions; scaling the learning rate acts as an importance-sampling correction.
• Scale the learning rate by (n · p_i)^{-β}, where β ∈ (0, 1).
• If p_1 = p_2 = … = p_n = 1/n (uniform sampling), the scaling factor equals 1.
• High-importance transitions (with high p_i) get low learning rates.
• In the beginning, set β small; increase β to 1 over time. (A sketch of the weight computation follows below.)
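A sketch of the scaling factor, following the slide's (n·p_i)^{-β} form; normalizing by the maximum weight is a common extra stabilization and is an assumption here, not part of the slide:

```python
# Importance-sampling weights used to scale the learning rate (sketch).
import numpy as np

def is_weights(probabilities, beta):
    n = len(probabilities)
    w = (n * probabilities) ** (-beta)      # equals 1 when sampling is uniform (p_i = 1/n)
    return w / w.max()                      # optional: keep the scaled step sizes <= 1

# Example: probabilities from P(i) = p_i^alpha / sum_k p_k^alpha,
# with beta annealed from a small value (e.g. 0.4) towards 1.0 over training.
```

The resulting weight multiplies each sampled transition's gradient, which is equivalent to giving it a per-sample learning rate.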
Prioritized Experience Replay. In ICLR, 2016

19
Possible Question

• Q: If a transition (s_j, a_j, r_j, s_{j+1}) is important, we increase its sampling weight 10 times but decrease its learning rate to α/10. Does prioritized experience replay still work?
• Case 1: learning rate α; use the sample (s_j, a_j, r_j, s_{j+1}) to compute the gradient once and update the parameters once.
• Case 2: learning rate α/10; use the sample (s_j, a_j, r_j, s_{j+1}) to compute the gradient 10 times and update the parameters 10 times.
• Q: Which one is better?
• A: Case 2 usually works better, but it costs 10 times the computation.
20
Prioritized Experience Replay

Big |δ_t| ==> high sampling probability ==> small learning rate

21
Prioritized Experience Replay

This led to both faster learning and better final policy quality across most games of the Atari benchmark suite.
Prioritized Experience Replay. In ICLR, 2016
22
Weakness of Experience Replay

• Q: What is the weakness of experience replay?


• A: On-policy methods, such as SARSA, REINFORCE, and A2C, cannot use experience replay.

23
Improvement 2: Target Network
Improvement 3: Double DQN
TD Learning for DQN

• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• The TD target Y_t is partly an estimate made by the DQN Q itself

• TD error: δ_t = Q(s_t, a_t; w) − Y_t
• SGD: w ← w − α · δ_t · ∂Q(s_t, a_t; w)/∂w
• Rewrite SGD: w ← w − α · (Q(s_t, a_t; w) − Y_t) · ∂Q(s_t, a_t; w)/∂w

• We use Y_t, which is partly based on Q, to update Q itself.

25
Problem of Overestimation

• TD learning makes DQN overestimate action-values. (Why?)


• Reason 1: The maximization bias.
• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)
• TD target is bigger than the real action-value
• Reason 2: Bootstrapping propagates the overestimation.

26
Recap: Maximization Bias (Lecture 2.2)

• We often need to maximize over our value estimates. The estimated


maxima suffer from maximization bias
• Consider a state for which, for all actions a, Q*(s, a) = 0. Our estimates Q(s, a) are uncertain: some are positive and some negative
• Intuitively (Jensen's inequality, since max is convex):
E[ max_i μ_i ] ≥ max_i E[ μ_i ]

• This is because we use the same estimate 𝑄 both to choose the argmax and
to evaluate it

27
Reason 1: Maximization Bias

• Let x_1, x_2, …, x_n be observed real numbers.

• Add zero-mean random noise to x_1, x_2, …, x_n and obtain Q_1, Q_2, …, Q_n
• The zero-mean noise does not affect the mean:
E[ mean_i(Q_i) ] = mean_i(x_i)
• The zero-mean noise increases the maximum:
E[ max_i(Q_i) ] ≥ max_i(x_i)
• The zero-mean noise decreases the minimum:
E[ min_i(Q_i) ] ≤ min_i(x_i)
(A small numerical check follows below.)
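A quick numerical check of these three claims (illustrative only):

```python
# Tiny numerical check: zero-mean noise leaves the mean unchanged,
# but inflates the max and deflates the min.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 1.5])                    # "true" values x_1..x_n
Q = x + rng.normal(0.0, 1.0, size=(100000, x.size))   # noisy estimates Q_1..Q_n

print(Q.mean(), x.mean())              # ~equal: noise does not affect the mean
print(Q.max(axis=1).mean(), x.max())   # E[max_i Q_i] > max_i x_i
print(Q.min(axis=1).mean(), x.min())   # E[min_i Q_i] < min_i x_i
```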

28
Reason 1: Maximization Bias

• True action-values: x(a_1), …, x(a_n)

• Noisy estimates made by the DQN: Q(s, a_1; w), …, Q(s, a_n; w)
• Suppose the estimates are unbiased:
mean_a x(a) = mean_a Q(s, a; w)

• q = max_a Q(s, a; w) is typically an overestimation:
q ≥ max_a x(a)

29
Reason 1: Maximization Bias

• We conclude that Q_{t+1} = max_a Q(s_{t+1}, a; w) is an overestimation of the true action-value at time t + 1
• The TD target, Y_t = r_t + γ · Q_{t+1}, is thereby an overestimation
• TD learning pushes Q(s_t, a_t; w) towards Y_t, which overestimates the true action-value

30
Reason 2: Bootstrapping

• TD learning performs bootstrapping


• The TD target in part uses Q_{t+1} = max_a Q(s_{t+1}, a; w)
• The TD target is then used for updating Q(s_t, a_t; w)
• Suppose the DQN overestimates the action-value
• Then Q(s_{t+1}, a; w) is an overestimation
• The maximization further pushes Q_{t+1} up
• When Q_{t+1} is used for updating Q(s_t, a_t; w), the overestimation is propagated back to the DQN

31
Why does overestimation happen?

32
Why is overestimation harmful?

• The agent is controlled by the DQN: a_t = argmax_a Q(s_t, a; w)

• Uniform overestimation is not a problem:

• Q*(s, a_1) = 200, Q*(s, a_2) = 100, Q*(s, a_3) = 230
• Action a_3 will be selected
• Suppose Q(s, a_i; w) = Q*(s, a_i) + 100, for all a_i
• Then the DQN believes a_3 has the highest value and will select a_3

34
Why is overestimation harmful?

• The agent is controlled by the DQN: a_t = argmax_a Q(s_t, a; w)

• Uniform overestimation is not a problem

• Non-uniform overestimation is problematic:

• Q*(s, a_1) = 200, Q*(s, a_2) = 100, Q*(s, a_3) = 230
• Q(s, a_1; w) = 280, Q(s, a_2; w) = 300, Q(s, a_3; w) = 240
• Action a_2 will be selected, even though a_3 has the highest true value

35
Why is overestimation harmful?

• Unfortunately, the overestimation is non-uniform


• Why?
• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w
• The TD target, Y_t, overestimates Q*(s_t, a_t)
• The TD algorithm pushes Q(s_t, a_t; w) towards Y_t
• Thus, Q(s_t, a_t; w) overestimates Q*(s_t, a_t)

• The more frequently (s, a) appears in the replay buffer, the worse Q(s, a; w) overestimates Q*(s, a)

36
Solution

• Problem: DQN trained by TD overestimates action-values


• Solution 1: Use a target network [1] to compute TD targets (address the
problem caused by bootstrapping)
• Solution 2: Use double DQN [2] to alleviate the overestimation caused by
maximization.

[1] Human-level Control Through Deep Reinforcement Learning. Nature, 2015.


[2] Deep Reinforcement Learning with Double Q-Learning. In AAAI, 2016.
37
Target Network

• Target network: Q(s, a; w⁻)

• The same network structure as the DQN, Q(s, a; w)
• Different parameters: w⁻ ≠ w
• Use Q(s, a; w) to control the agent and collect experience: {(s_t, a_t, r_t, s_{t+1})}
• Use Q(s, a; w⁻) to compute the TD target:

Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)

38
TD Learning with Target Network

• Use a transition, (s_t, a_t, r_t, s_{t+1}), to update w

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)

• TD error: δ_t = Q(s_t, a_t; w) − Y_t
• SGD: w ← w − α · δ_t · ∂Q(s_t, a_t; w)/∂w

39
Update Target Network

• Periodically update w⁻:
• Option 1: w⁻ ← w
• Option 2: w⁻ ← τ · w + (1 − τ) · w⁻
(Both options are sketched below.)
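A sketch of both options, assuming PyTorch modules `q_net` and `target_net` with identical architecture; the function names and the value of τ are illustrative:

```python
# Sketch of the two update rules for the target-network parameters w-.
import torch.nn as nn

def hard_update(target_net: nn.Module, q_net: nn.Module):
    # Option 1: w- <- w (copy all parameters, done periodically)
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net: nn.Module, q_net: nn.Module, tau: float = 0.005):
    # Option 2: w- <- tau * w + (1 - tau) * w-
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)
```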

40
Deep Q-Network

• Experience replay
• Target network: we used an iterative update that adjusts the action-values
(Q) towards target values that are only periodically updated, thereby
reducing correlations with the target.
• Results:
• Outperforms the best existing reinforcement learning methods on 43 of 49
games
• Its performance was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games
Human-level Control through Deep Reinforcement Learning. Nature, 2015.

41
Algorithm

Human-level Control through Deep Reinforcement Learning. Nature, 2015.

42
Comparisons

• TD learning with the naïve update:

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• TD learning with a target network:

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w⁻)
• Though better than the naïve update, TD learning with a target network nevertheless overestimates action-values

43
Recap: Double Q-Learning (Lecture 2.2)

• Train two action-value functions, Q1 and Q2

• Do Q-learning on both, but
• Never on the same time steps (Q1 and Q2 are independent)
• Pick Q1 or Q2 at random to be updated on each step
• If updating Q1, use Q2 for the value of the next state:

Q1(S_t, A_t) ← Q1(S_t, A_t) + α [ R_{t+1} + γ · Q2(S_{t+1}, argmax_a Q1(S_{t+1}, a)) − Q1(S_t, A_t) ]

• Action selections are ε-greedy with respect to the sum of Q1 and Q2 (a one-step sketch follows below)
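A sketch of a single tabular Double Q-learning update, following the recap above; the table representation (anything indexable by (state, action)) and the hyper-parameters are assumptions:

```python
# One tabular Double Q-learning update (sketch). Q1 and Q2 can be
# dicts keyed by (state, action) or 2-D arrays indexed by [state, action].
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # With probability 0.5 swap roles, so Q1 and Q2 are each updated half the time.
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    # Select the next action with the table being updated ...
    a_star = max(actions, key=lambda b: Q1[s_next, b])
    # ... and evaluate it with the other table.
    target = r + gamma * Q2[s_next, a_star]
    Q1[s, a] += alpha * (target - Q1[s, a])
```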

44
Recap: Double Tabular Q-Learning (Lecture 2.2)

45
Double DQN

Naïve Update

• TD target: Y_t = r_t + γ · max_a Q(s_{t+1}, a; w)

• Selection using the DQN:
a* = argmax_a Q(s_{t+1}, a; w)
• Evaluation using the DQN:
Y_t = r_t + γ · Q(s_{t+1}, a*; w)
• Serious overestimation

46
Double DQN

Using Target Network

• Selection using the target network:
a* = argmax_a Q(s_{t+1}, a; w⁻)

• Evaluation using the target network:
Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)
• It works better, but the overestimation is still serious.

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

47
Double DQN

Double DQN
• Selection using the DQN:
a* = argmax_a Q(s_{t+1}, a; w)

• Evaluation using the target network:
Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)
• It is the best among the three, but overestimation may still happen (the three targets are sketched in code below)
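A sketch that puts the three targets side by side, assuming PyTorch networks `q_net` (parameters w) and `target_net` (parameters w⁻) that map a batch of states to per-action values; all names are illustrative:

```python
# Sketch comparing the three TD targets: naive, target network, Double DQN.
import torch

def td_targets(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        q_next = q_net(s_next)            # Q(s_{t+1}, ., w)
        q_next_tgt = target_net(s_next)   # Q(s_{t+1}, ., w-)

        # Naive: select and evaluate with the DQN itself.
        y_naive = r + gamma * q_next.max(dim=1).values

        # Target network: select and evaluate with w-.
        y_target = r + gamma * q_next_tgt.max(dim=1).values

        # Double DQN: select with w, evaluate with w-.
        a_star = q_next.argmax(dim=1, keepdim=True)
        y_double = r + gamma * q_next_tgt.gather(1, a_star).squeeze(1)
    return y_naive, y_target, y_double
```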

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

48
Double DQN (DDQN)

Theorem
• Consider a state s in which all the true optimal action values are equal: Q*(s, a) = V*(s) for some V*(s).

• Let Q_t be arbitrary value estimates that are on the whole unbiased, in the sense that Σ_a (Q_t(s, a) − V*(s)) = 0, but that are not all correct, such that (1/m) Σ_a (Q_t(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s.

• Under these conditions, max_a Q_t(s, a) ≥ V*(s) + sqrt( C / (m − 1) ).

• This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

49
Why does double DQN work better?

• Double DQN decouples selection from evaluation.

• Selection using the DQN: a* = argmax_a Q(s_{t+1}, a; w)

• Evaluation using the target network: Y_t = r_t + γ · Q(s_{t+1}, a*; w⁻)

• Double DQN alleviates overestimation:
Q(s_{t+1}, a*; w⁻) ≤ max_a Q(s_{t+1}, a; w⁻)
(left-hand side: the estimate used by Double DQN; right-hand side: the estimate used by the target-network update)

50
Double DQN (DDQN)

[Figure: true action-values (purple) — Q*(s, a) = sin(s) fitted with a degree-6 polynomial, and Q*(s, a) = 2·exp(−s²) fitted with degree-6 and degree-9 polynomials.]

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

51
Double DQN (DDQN)

Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.

52
Summary: Overestimation

• Because of the maximization, the TD target overestimates the true action-value


• By creating a “positive feedback loop”, bootstrapping further exacerbates the
overestimation.
• The target network can partly avoid bootstrapping (not completely, because w⁻ depends on w).
• Double DQN alleviates the overestimation caused by the maximization.


53
Improvement 4: Dueling Network
Dueling Network

• Dueling Architecture: Two-stream Q network (vs. single-stream Q network)

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

55
Advantage Function

• Action-value function: Q_π(s_t, a_t) = E[ G_t | S_t = s_t, A_t = a_t ]

• State-value function: V_π(s_t) = E_A[ Q_π(s_t, A) ]

• Optimal action-value function: Q*(s, a) = max_π Q_π(s, a)

• Optimal state-value function: V*(s) = max_π V_π(s)

• Optimal advantage function: A*(s, a) = Q*(s, a) − V*(s)

• Idea borrowed from Policy Gradient (will be introduced in the following lectures)
Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

56
Dueling Network: Formulation

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

57
Recap: Bellman Optimality Equation for 𝑉∗ (Lec 1.3)

V*(s) = max_a Q*(s, a)

V*(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s') ]

58
Recap: Bellman Optimality Equation for 𝑄∗ (Lec 1.3)

Q*(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s')

Q*(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'} Q*(s', a')

59
Recap: Bellman Expectation Equation for V_π (Lec 1.3)

V_π(s) = Σ_{a∈A} π(a|s) Q_π(s, a)

V_π(s) = Σ_{a∈A} π(a|s) [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s') ]

60
Recap: Bellman Expectation Equation for Q_π (Lec 1.3)

Q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_π(s')

Q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') Q_π(s', a')

61
Properties of Advantage Function

• Theorem 1: V*(s) = max_a Q*(s, a)

• Recall the definition of the optimal advantage function:
A*(s, a) = Q*(s, a) − V*(s)
• It follows that
max_a A*(s, a) = max_a Q*(s, a) − V*(s) = 0

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

62
Properties of Advantage Function

• Definition of advantage: A*(s, a) = Q*(s, a) − V*(s)

Q*(s, a) = V*(s) + A*(s, a)

Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

63
Dueling Network: Formulation

• Theorem 2: Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

• Approximate V*(s) by a neural network, V(s; w_V)

• Approximate A*(s, a) by a neural network, A(s, a; w_A)
• Thus, approximate Q*(s, a) by the dueling network (sketched below):

Q(s, a; w_A, w_V) = V(s; w_V) + A(s, a; w_A) − max_a A(s, a; w_A)

w = (w_A, w_V)
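A sketch of a dueling head on top of a shared feature trunk, assuming PyTorch; the layer sizes and names are illustrative:

```python
# Dueling head sketch: two streams V(s; w_V) and A(s, a; w_A) combined as
# Q = V + A - max_a A (or mean_a A, see the later slide on the mean variant).
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions, use_mean=False):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)               # V(s; w_V)
        self.advantage = nn.Linear(feat_dim, n_actions)   # A(s, a; w_A)
        self.use_mean = use_mean

    def forward(self, features):
        v = self.value(features)                          # shape (batch, 1)
        a = self.advantage(features)                      # shape (batch, n_actions)
        if self.use_mean:
            a_ref = a.mean(dim=1, keepdim=True)           # mean_a A(s, a)
        else:
            a_ref = a.max(dim=1, keepdim=True).values     # max_a A(s, a)
        return v + a - a_ref                              # Q(s, a; w_A, w_V)
```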

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

64
Training

• Dueling network, 𝑄(𝑠, 𝑎; 𝒘), is an approximation to 𝑄∗ (𝑠, 𝑎)


• Learn the parameters, w = (w_A, w_V), in the same way as for the other DQNs
• Previous tricks can be used in the same way:
• Prioritized experience replay
• Double DQN
• Multi-step TD target

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

65
Overcome Non-identifiability

Ø Equation 1: Q*(s, a) = V*(s) + A*(s, a)

Ø Equation 2: Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)

Question: why is the extra term necessary, given that it equals zero?

Ø Equation 1 has the problem of non-identifiability.
Ø Let V' = V* + 10 and A' = A* − 10
Ø Then Q*(s, a) = V*(s) + A*(s, a) = V'(s) + A'(s, a)
Ø Equation 2 does not have this problem.

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

66
Dueling Network

Q*(s, a) = V*(s) + A*(s, a) − max_a A*(s, a)
• An alternative module replaces the max operator with an average:
Q*(s, a) = V*(s) + A*(s, a) − mean_a A*(s, a)
Ø On the one hand this loses the original semantics of V and A, because they are now off-target by a constant, but on the other hand it increases the stability of the optimization.
Ø The dueling network controls the agent in the same way as DQN.
Ø Train the dueling network by TD in the same way as DQN.
Ø (Do not train V and A separately.)
Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

67
Dueling Network

• Value function + Advantage function

• The value stream learns to pay attention


to the road.
• The advantage stream learns to pay
attention only when there are cars
immediately in front, so as to avoid
collisions.

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

68
Advantages of Dueling Network

• Improved efficiency. By separating these two components, the network can more effectively learn
the value of states and the relative advantages of actions, which helps in scenarios where many actions
have similar values.
• Value function: how good it is to be in a given state, regardless of the action taken.
• Advantage function: how much better an action is compared to the average action in that state.

• Improved stability. Dueling networks help to stabilize training by allowing the model to learn the
value of states even when some actions are rarely chosen. This is particularly useful in environments with
many actions, as it prevents the network from overfitting to specific action choices.

• Better generalization. By focusing on the state value, the dueling architecture can generalize better
across similar states, improving performance in environments where certain actions are rarely taken but
can still be crucial in specific states.

69
Dueling Network

Vertical section: 10 states; 5 actions (go up, down, left, right, and no-op).
Horizontal section: 50 states; 10 and 20 actions (adding no-ops to the original environment).

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

70
Dueling Network
Improvements of dueling architecture over the baseline Single network

Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.

71
Improvement 5: Noisy Network
NoisyNet

• Parameter w (vector): w_i = μ_i + σ_i · ξ_i
• Parameter w (matrix): w_ij = μ_ij + σ_ij · ξ_ij
Noisy Networks for Exploration. In ICLR, 2018.

73
NoisyNet

• Fully connected layer:
z = ReLU(W x + b)
• NoisyNet version (the products with the noise ξ are element-wise; a layer sketch follows below):
z = ReLU( (W_μ + W_σ · W_ξ) x + b_μ + b_σ · b_ξ )
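A sketch of such a noisy fully connected layer, assuming PyTorch and the simpler independent-Gaussian-noise variant (the paper also describes a factorised variant); the initial values are illustrative:

```python
# NoisyLinear sketch: mu and sigma are learned, the noise xi is resampled
# at every forward pass (independent Gaussian noise per weight).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        w_xi = torch.randn_like(self.w_sigma)        # fresh noise each forward pass
        b_xi = torch.randn_like(self.b_sigma)
        weight = self.w_mu + self.w_sigma * w_xi     # W_mu + W_sigma (element-wise) W_xi
        bias = self.b_mu + self.b_sigma * b_xi       # b_mu + b_sigma (element-wise) b_xi
        return F.linear(x, weight, bias)

# Usage sketch: z = torch.relu(NoisyLinear(128, n_actions)(h))
```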

Noisy Networks for Exploration. In ICLR, 2018.

74
Recap: Bayesian Learning for Model Parameters (Lecture 1.2)

• Step 1: Given n data points, D = {x_1, x_2, …, x_n}, write down the expression for the likelihood:
P(D | θ)
• Step 2: Specify a prior: P(θ)
• Step 3: Compute the posterior:
P(θ | D) = P(D | θ) P(θ) / P(D)

75
Recap: Thompson Sampling (1933) (Lecture 1.2)

• Represent a distribution over the mean reward of each bandit, as opposed to a point estimate of the mean reward alone. At each timestep:
1. Sample from the mean reward distributions:
θ̄_1 ~ P̂(θ_1), θ̄_2 ~ P̂(θ_2), …, θ̄_k ~ P̂(θ_k)
2. Choose the action a = argmax_a E_θ̄[ R(a) ]
3. Observe the reward
4. Update the mean reward posterior distributions: P̂(θ_1), P̂(θ_2), …, P̂(θ_k)
(A Beta-Bernoulli sketch follows below.)
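A sketch for Bernoulli bandits with Beta posteriors; the Beta-Bernoulli choice and the `pull(a)` environment callback are assumptions made for illustration, not part of the slide:

```python
# Thompson sampling sketch for k Bernoulli arms with Beta posteriors.
import numpy as np

def thompson_sampling(pull, k, steps, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.ones(k)   # Beta parameters: successes + 1
    beta = np.ones(k)    # Beta parameters: failures + 1
    for _ in range(steps):
        theta = rng.beta(alpha, beta)     # 1. sample a mean reward per arm
        a = int(np.argmax(theta))         # 2. choose the greedy arm under the samples
        r = pull(a)                       # 3. observe a 0/1 reward from the environment
        alpha[a] += r                     # 4. update the posterior of the chosen arm
        beta[a] += 1 - r
    return alpha, beta
```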

76
Algorithm
• Env: environment
• ε: set of random variables of the network
• DUELING: Boolean; "true" for NoisyNet-Dueling and "false" for NoisyNet-DQN
• B: empty replay buffer; N_B: replay buffer size
• ζ: initial network parameters; ζ⁻: initial target network parameters
• N_T: training batch size
• N⁻: target network replacement frequency

Noisy Networks for Exploration. In ICLR, 2018.

77
Rainbow: Combining All Tricks

• Multi-step bootstrap targets


• Maximization bias (Double DQN)
• Prioritized experience replay
• Dueling network architecture
• Noisy DQN
• Distributional Q-Learning (skip)

Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

78
Rainbow: Combining All Tricks

Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

79
Homework Assignment 2

• HW2 will be released today


• The deadline is Oct 31st (two weeks)

82
Extra Reading Materials

• Double Q-learning. In NIPS, 2010.


• Playing Atari with Deep Reinforcement Learning. In NeurIPS workshop, 2013.
• Human-level Control through Deep Reinforcement Learning. Nature, 2015.
• Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.
• Prioritized Experience Replay. In ICLR, 2016.
• Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.
• A Distributional Perspective on Reinforcement Learning. In ICML, 2017.
• Noisy Networks for Exploration. In ICLR, 2018.
• Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.

83
Thanks & Q&A
