18-deeprl
Source: D. Silver
Deep Q learning
• Regular TD update: “nudge” Q(s,a) towards the target estimate r + γ max_a' Q(s',a')
• Compare to supervised learning, where the target plays the role of the label y (a short code sketch of this analogy follows below):
L(w) = (y − f(x; w))²
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
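A minimal PyTorch-style sketch of this analogy (my own illustration, not the paper's code): the bootstrapped TD target is treated as a fixed label y, and the squared TD error is minimized like a supervised regression loss. Names such as td_loss and q_net are hypothetical.

import torch
import torch.nn.functional as F

def td_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """Squared TD error for a batch of transitions (s, a, r, s', done)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; w)
    with torch.no_grad():                                  # the target is treated as a fixed label y
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                             # (y - Q(s, a; w))^2, averaged over the batch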
Experience replay
• At each time step:
– Take action a_t according to an ε-greedy policy
– Store experience (s_t, a_t, r_{t+1}, s_{t+1}) in the replay memory buffer
– Randomly sample a mini-batch of experiences from the buffer
– Perform an update to reduce the objective function below (a code sketch follows this slide):
E_{s,a,s'}[ (R(s) + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w))² ]
Keep the parameters of the target network (w⁻) fixed; update them only every once in a while
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
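A sketch of the loop above, assuming PyTorch; the names ReplayBuffer, epsilon_greedy, dqn_update, and sync_every are my own, not from Mnih et al.

import random
from collections import deque
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """FIFO memory of transitions (s_t, a_t, r_{t+1}, s_{t+1}, done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        s, a, r, s2, d = zip(*random.sample(self.memory, batch_size))
        return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s2), torch.tensor(d, dtype=torch.float32))

def epsilon_greedy(q_net, s, num_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return q_net(s.unsqueeze(0)).argmax(dim=1).item()

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on E[(R + gamma * max_a' Q(s',a'; w-) - Q(s,a; w))^2]."""
    s, a, r, s2, d = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # bootstrap from the *frozen* target network (parameters w-)
        y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The target network is a copy of the online network whose weights are
# resynced only periodically, e.g.:
#   if step % sync_every == 0:
#       target_net.load_state_dict(q_net.state_dict())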
Atari
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
Deep Q learning in Atari
• End-to-end learning of Q(s,a) from pixels s
• Output is Q(s,a) for 18 joystick/button configurations (a network sketch follows below)
• Reward is change in score for that step
[Figure: deep convolutional network taking the pixel state s as input and producing one output per action, Q(s,a_1), Q(s,a_2), …, Q(s,a_18)]
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
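A PyTorch sketch of such a network; the layer sizes follow the architecture reported in Mnih et al. 2015 (a stack of 4 grayscale 84×84 frames in, one Q-value per action out), but the class name AtariQNetwork is my own.

import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """CNN mapping a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # Q(s, a_1), ..., Q(s, a_18)
        )

    def forward(self, x):
        # x: uint8 pixels of shape (batch, 4, 84, 84), rescaled to [0, 1]
        return self.head(self.features(x / 255.0))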
Deep Q learning in Atari
• Input state s is a stack of raw pixels from the last 4 frames (a frame-stacking sketch follows below)
• Network architecture and hyperparameters fixed for all games
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
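A minimal sketch of the frame stacking, assuming frames have already been preprocessed to 84×84 grayscale; the FrameStack class below is illustrative, not from the paper.

from collections import deque
import numpy as np

class FrameStack:
    """Maintains the last k preprocessed frames as the state s."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames)        # shape (4, 84, 84)

    def step(self, frame):
        # Append the newest frame; the oldest one falls out of the deque.
        self.frames.append(frame)
        return np.stack(self.frames)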
Deep Q learning in Atari
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
Breakout demo
https://www.youtube.com/watch?v=TmPfTpjtdgg
Playing Go
• Go is a known (and deterministic) environment
• Therefore, learning to play Go involves solving a known MDP
• Key challenges: huge state and action space, long sequences, sparse rewards
Review: AlphaGo
• Policy network: initialized by supervised training on a large amount of human games
• Value network: trained to predict the outcome of the game based on self-play
• Networks are used to guide Monte Carlo tree search (MCTS); a selection-rule sketch follows below
D. Silver et al., Mastering the Game of Go with Deep Neural Networks and Tree Search,
Nature 529, January 2016
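A highly simplified sketch of how the two networks enter the tree search: during selection, each move is scored by an exploitation term Q (backed-up value estimates) plus an exploration bonus weighted by the policy network's prior. The Node fields and select_child function below are my own schematic illustration of this PUCT-style rule, not AlphaGo's actual implementation.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                  # P(s, a) from the policy network
    visit_count: int = 0
    value_sum: float = 0.0        # backed-up value-network (and rollout) estimates
    children: list = field(default_factory=list)

def select_child(node, c_puct=1.0):
    """PUCT-style selection: exploit high-value moves, but explore moves the policy net favors."""
    total = sum(c.visit_count for c in node.children) or 1
    def score(c):
        q = c.value_sum / c.visit_count if c.visit_count else 0.0
        u = c_puct * c.prior * math.sqrt(total) / (1 + c.visit_count)
        return q + u
    return max(node.children, key=score)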
Convolutional neural network
Summary
• Deep Learning Strengths
– universal approximators: learn non-trivial functions
– compositional models, loosely similar to the human brain
– universal representation across modalities
– discover features automagically
• in a task-specific manner
• features not limited by human creativity
• Deep Learning Weaknesses
– resource hungry (data/compute)
– uninterpretable
• Deep RL: replace value/policy tables by deep nets
– Great successes in Go and Atari.