Lecture 6: CNNs and Deep Q Learning
Emma Brunskill
Winter 2019
With many slides for DQN from David Silver and Ruslan Salakhutdinov and some
vision slides from Gianni Di Caro and images from Stanford CS231n,
http://cs231n.github.io/convolutional-networks/
Table of Contents
1 Convolutional Neural Nets (CNNs)
2 Deep Q Learning
[Figure: a function approximator with weights w maps a state s to V̂(s; w), or a state–action pair (s, a) to Q̂(s, a; w)]
Recall: Stochastic Gradient Descent
Goal: find the parameter vector w that minimizes the loss between a true value function V^π(s) and its approximation V̂^π(s; w), as represented by a particular function class parameterized by w.
Generally use mean squared error and define the loss as
J(w) = E_π[(V^π(s) − V̂^π(s; w))^2]
Can use gradient descent to find a local minimum:
Δw = −(1/2) α ∇_w J(w)
Stochastic gradient descent (SGD) samples the gradient using a single state:
Δw = α(V^π(s) − V̂^π(s; w)) ∇_w V̂^π(s; w)
In expectation this is the same as the full gradient update.
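A minimal sketch of this sampled update, where v_hat(s, w) and grad_v_hat(s, w) are hypothetical placeholders for the approximator and its gradient:

def sgd_value_update(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One sampled SGD step for value-function approximation.

    w          : parameter vector of the approximator (e.g., a numpy array)
    v_target   : sample of V^pi(s), e.g. a Monte Carlo return
    v_hat      : callable (s, w) -> scalar estimate V_hat^pi(s; w)
    grad_v_hat : callable (s, w) -> gradient of v_hat with respect to w
    """
    error = v_target - v_hat(s, w)          # sampled (V^pi(s) - V_hat^pi(s; w))
    return w + alpha * error * grad_v_hat(s, w)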
2 Deep Q Learning
The sum over i runs only over the inputs in the hidden neuron's local receptive field.
The same weights w and bias b are used for each of the hidden neurons.
In this example, there are 24 × 24 hidden neurons.
All the neurons in the first hidden layer detect exactly the same
feature, just at different locations in the input image.
Feature: the kind of input pattern (e.g., a local edge) that makes the
neuron produce a certain response level
Why does this make sense?
Suppose the weights and bias are learned such that the hidden neuron can pick out a vertical edge in a particular local receptive field.
That ability is also likely to be useful at other places in the image.
Useful to apply the same feature detector everywhere in the image.
Yields translation (spatial) invariance (try to detect feature at any part
of the image)
Inspired by visual system
¹ http://cs231n.github.io/convolutional-networks/
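A small sketch of this weight sharing, assuming a 28 × 28 input and a single 5 × 5 filter (sizes chosen to reproduce the 24 × 24 hidden layer above); PyTorch is used here only as one possible implementation:

import torch
import torch.nn as nn

# One 5x5 filter slid over a 28x28 grayscale image: the same 25 weights and
# 1 bias are reused by every one of the 24x24 = 576 hidden neurons.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, bias=True)

x = torch.randn(1, 1, 28, 28)                      # (batch, channels, height, width)
h = conv(x)
print(h.shape)                                     # torch.Size([1, 1, 24, 24])
print(sum(p.numel() for p in conv.parameters()))   # 26 shared parameters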
Pooling Layers
2 Deep Q Learning
Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:
J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w))^2]
Given a sampled experience tuple (s_t, a_t, r_t, s_{t+1}), the Q-learning update is
Δw = α(r + γ max_{a'} Q̂(s', a'; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
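A minimal sketch of this update for a linear approximator Q̂(s, a; w) = w·x(s, a), where the feature function x(s, a) is a hypothetical placeholder returning a numpy array:

def q_learning_step(w, x, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    """One Q-learning update with a linear approximator Q_hat(s, a; w) = w @ x(s, a)."""
    q_sa = w @ x(s, a)
    target = r + gamma * max(w @ x(s_next, a2) for a2 in actions)
    # For a linear Q_hat, grad_w Q_hat(s, a; w) = x(s, a)
    return w + alpha * (target - q_sa) * x(s, a)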
Can treat the target as a scalar, but the weights will get
updated on the next round, changing the target value
To help improve stability, fix the target weights used in the target
calculation for multiple updates
Use a different set of weights to compute the target than the set being updated
Let parameters w⁻ be the weights used in the target, and w be the weights that are being updated
Slight change to computation of the target value:
(s, a, r, s') ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ max_{a'} Q̂(s', a'; w⁻)
Use stochastic gradient descent to update the network weights:
Δw = α(r + γ max_{a'} Q̂(s', a'; w⁻) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
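A sketch of the same step with a separate frozen target-weight vector w⁻ (w_minus below), reusing the hypothetical linear features x(s, a) from the earlier sketch; only w receives gradient updates, and the copy interval is illustrative:

def dqn_style_step(w, w_minus, x, sample, actions, gamma=0.99, alpha=0.01):
    """One update in which the target is computed with frozen weights w_minus."""
    s, a, r, s_next = sample                                   # (s, a, r, s') ~ D
    target = r + gamma * max(w_minus @ x(s_next, a2) for a2 in actions)
    td_error = target - w @ x(s, a)
    return w + alpha * td_error * x(s, a)                      # only w is updated

# Every C gradient steps (C is a hyperparameter), sync the target weights:
#     w_minus = w.copy()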
Prioritized experience replay: sample tuple i with probability
P(i) = p_i^α / Σ_k p_k^α
where p_i is the priority of tuple i.
¹ See the paper for details and an alternative.
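A minimal numpy sketch of turning priorities p_i into these sampling probabilities and drawing a minibatch; the buffer contents and the value of α are illustrative:

import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6, rng=None):
    """Sample indices i with probability P(i) = p_i^alpha / sum_k p_k^alpha."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs), probs

idx, probs = sample_indices([2.0, 0.5, 1.0, 4.0], batch_size=2)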
Check Your Understanding
P(i) = p_i^α / Σ_k p_k^α
Advantage function: A^π(s, a) = Q^π(s, a) − V^π(s)
Identifiable? No: adding a constant to V̂ and subtracting it from Â leaves Q̂ unchanged, so the decomposition is unidentifiable.
Option 1: Force Â(s, a) = 0 if a is the action taken:
Q̂(s, a; w) = V̂(s; w) + Â(s, a; w) − max_{a'∈A} Â(s, a'; w)
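A sketch of this max-subtracted combination as a dueling network head in PyTorch; the layer sizes and names are illustrative:

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combine V_hat(s) and A_hat(s, a) so the maximizing action has zero advantage."""
    def __init__(self, in_dim, n_actions, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.adv = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                          # (batch, 1)
        a = self.adv(features)                            # (batch, n_actions)
        # Q_hat(s, a) = V_hat(s) + A_hat(s, a) - max_a' A_hat(s, a')
        return v + a - a.max(dim=1, keepdim=True).values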
Try Huber loss on the Bellman error:
L(x) = x^2 / 2            if |x| ≤ δ
L(x) = δ|x| − δ^2 / 2     otherwise
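A direct numpy transcription of this piecewise loss (the value of δ is illustrative):

import numpy as np

def huber(x, delta=1.0):
    """Quadratic for small errors, linear for large ones: limits exploding gradients."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * np.abs(x) - 0.5 * delta ** 2)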
2 Deep Q Learning