07 Deep Reinforcement Learning (John)
[Figure: the agent-environment loop. The agent sends actions to the environment; the environment returns an observation and a reward.]
Robotics:
- Observations: camera images, joint angles
- Actions: joint torques
- Rewards: stay balanced, navigate to target locations, serve and protect humans
Business Operations
Inventory Management
- Observations: current inventory levels
- Actions: number of units of each item to purchase
- Rewards: profit
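As a concrete illustration of this formulation, here is a minimal simulation sketch of a single-item inventory problem, assuming Poisson demand, made-up prices and costs, and a naive "order up to 10" policy; none of these specifics come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (not from the slides): a single item type.
PRICE, UNIT_COST, HOLDING_COST = 5.0, 3.0, 0.1

def step(inventory, order, rng):
    """One day: observation = inventory level, action = units to order,
    reward = profit (sales revenue - purchase cost - holding cost)."""
    inventory += order                      # ordered stock arrives
    demand = rng.poisson(8)                 # assumed demand model
    sold = min(inventory, demand)
    inventory -= sold
    reward = PRICE * sold - UNIT_COST * order - HOLDING_COST * inventory
    return inventory, reward

inventory, total = 0, 0.0
for day in range(30):
    order = max(0, 10 - inventory)          # naive "order up to 10" policy
    inventory, reward = step(inventory, order, rng)
    total += reward
print(f"30-day profit under the naive policy: {total:.1f}")
```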
In Other ML Problems
- Hard attention¹
- Structured prediction²
¹ V. Mnih et al. Recurrent models of visual attention. In: Advances in Neural Information Processing Systems. 2014, pp. 2204-2212.
² H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. In: Machine Learning 75.3 (2009), pp. 297-325; S. Ross, G. J. Gordon, and D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In: AISTATS. Vol. 1. 2. 2011, p. 6; M. Ranzato et al. Sequence level training with recurrent neural networks. In: arXiv preprint arXiv:1511.06732 (2015).
Supervised learning:
- Environment samples input-output pair (xt, yt)
- Agent predicts ŷt = f(xt)
- Agent receives loss ℓ(yt, ŷt)
- Environment asks agent a question, and then tells her the right answer
Contextual bandits:
- Environment samples input xt
- Agent takes action ŷt = f(xt)
- Agent receives cost ct ∼ P(ct | xt, ŷt), where P is an unknown probability distribution
- Environment asks agent a question, and gives her a noisy score on her answer
- Application: personalized recommendations
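A small sketch of the contextual-bandit protocol above, with an invented environment (2-D contexts, two actions, noisy costs) and a deliberately crude epsilon-greedy agent that tracks only the running average cost per action and ignores the context; all specifics are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
base_cost = np.array([0.7, 0.3])        # unknown to the agent
counts = np.zeros(2)
avg_cost = np.zeros(2)

for t in range(1000):
    x_t = rng.normal(size=2)                            # environment samples input x_t
    # agent takes action y_t = f(x_t): epsilon-greedy on running average cost
    if rng.random() < 0.1:
        y_t = int(rng.integers(2))
    else:
        y_t = int(np.argmin(avg_cost))
    # agent receives cost c_t ~ P(c_t | x_t, y_t); P is unknown to the agent
    c_t = base_cost[y_t] + 0.1 * x_t[0] + rng.normal(scale=0.1)
    counts[y_t] += 1
    avg_cost[y_t] += (c_t - avg_cost[y_t]) / counts[y_t]

print("average cost per action:", np.round(avg_cost, 2))  # should favor action 1
```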
Reinforcement learning:
- Environment samples input xt ∼ P(xt | xt−1, yt−1); the environment is stateful, so the input depends on the agent's previous actions
- Agent takes action ŷt = f(xt)
- Agent receives cost ct ∼ P(ct | xt, ŷt), where P is an unknown probability distribution
- Environment asks agent a question, gives her a noisy score on her answer, and the next question depends on her previous answers
Might be overkill
Other methods worth investigating first, e.g., approximate dynamic programming³
³ W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Vol. 703. John Wiley & Sons, 2007.
Recent successes of deep reinforcement learning:
- Atari games from raw pixels, using deep Q-learning⁴, policy gradients⁵, and offline Monte-Carlo tree search planning⁶
- The game of Go, using deep neural networks and tree search⁷
- Robotic visuomotor control⁸ and high-dimensional continuous control⁹ with policy gradient methods
- Asynchronous methods that scale across many parallel workers¹⁰

⁴ V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint arXiv:1312.5602 (2013).
⁵ J. Schulman et al. Trust Region Policy Optimization. In: arXiv preprint arXiv:1502.05477 (2015).
⁶ X. Guo et al. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Advances in Neural Information Processing Systems. 2014, pp. 3338-3346.
⁷ D. Silver et al. Mastering the game of Go with deep neural networks and tree search. In: Nature 529.7587 (2016), pp. 484-489.
⁸ S. Levine et al. End-to-end training of deep visuomotor policies. In: arXiv preprint arXiv:1504.00702 (2015).
⁹ J. Schulman et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint arXiv:1506.02438 (2015).
¹⁰ V. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. In: arXiv preprint arXiv:1602.01783 (2016).
Definition
S: state space
A: action space
P(r, s′ | s, a): transition and reward probability distribution
μ: initial state distribution
Episodic Setting
Policies
- Deterministic policies: a = π(s)
- Stochastic policies: a ∼ π(a | s)
Episodic Setting
s0 ∼ μ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
…
aT−1 ∼ π(aT−1 | sT−1)
sT, rT−1 ∼ P(sT, rT−1 | sT−1, aT−1)
Objective: maximize η(π), where
η(π) = E[r0 + r1 + ⋯ + rT−1 | π]
Episodic Setting
[Figure: one episode of agent-environment interaction. The agent produces actions a0, …, aT−1; the environment produces states s0, s1, …, sT and rewards r0, r1, …, rT−1.]
Objective: maximize η(π), where
η(π) = E[r0 + r1 + ⋯ + rT−1 | π]
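To make the objective concrete, here is a sketch that samples episodes from a small invented MDP under a fixed stochastic policy and estimates η(π) = E[r0 + r1 + ⋯ + rT−1 | π] by Monte Carlo; the transition probabilities, rewards, and horizon are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 10
mu = np.array([1.0, 0.0, 0.0])                        # initial state distribution
# P[s, a, s'] : transition probabilities; R[s, a] : expected reward (illustrative numbers)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform stochastic policy

def sample_episode():
    s = rng.choice(n_states, p=mu)                    # s0 ~ mu(s0)
    ret = 0.0
    for t in range(T):
        a = rng.choice(n_actions, p=pi[s])            # a_t ~ pi(a_t | s_t)
        r = R[s, a] + rng.normal(scale=0.1)           # r_t
        s = rng.choice(n_states, p=P[s, a])           # s_{t+1} ~ P(. | s_t, a_t)
        ret += r
    return ret

eta_hat = np.mean([sample_episode() for _ in range(2000)])
print(f"Monte Carlo estimate of eta(pi): {eta_hat:.2f}")
```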
Parameterized Policies
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
¹² N. Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015, pp. 2926-2934.
¹³ J. Tang and P. Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. In: Advances in Neural Information Processing Systems. 2010, pp. 1000-1008.
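A minimal sketch of a parameterized stochastic policy π(a | s, θ): here θ is a single weight matrix and the action distribution is a softmax over linear scores of the state. The linear form is only to keep the example short; in practice the scores would come from a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 3
theta = 0.01 * rng.normal(size=(obs_dim, n_actions))    # policy parameters

def pi_probs(s, theta):
    """pi(. | s, theta): softmax over linear scores of the state."""
    logits = s @ theta
    logits -= logits.max()                               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_action(s, theta):
    return rng.choice(n_actions, p=pi_probs(s, theta))   # a ~ pi(a | s, theta)

s = rng.normal(size=obs_dim)
print("action probabilities:", np.round(pi_probs(s, theta), 3))
print("sampled action:", sample_action(s, theta))
```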
The probability of a trajectory τ = (s0, a0, r0, …, sT) under policy πθ factorizes as
p(τ | θ) = μ(s0) ∏_{t=0}^{T−1} [ π(at | st, θ) P(st+1, rt | st, at) ]
The dynamics terms do not depend on θ, so
∇θ log p(τ | θ) = ∑_{t=0}^{T−1} ∇θ log π(at | st, θ)
and the likelihood-ratio (score function) gradient estimator is
∇θ Eτ[R] = Eτ[ R ∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ]
where R = r0 + r1 + ⋯ + rT−1 is the total reward of the trajectory.
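The identity above can be sanity-checked numerically on the simplest case: a one-step "trajectory" where a single action is drawn from a categorical distribution with logits θ and the reward depends only on the action. The sketch below compares the score-function estimate E[R ∇θ log π(a | θ)] against the exact gradient; the toy numbers are assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])           # logits of a categorical policy
R = np.array([1.0, 3.0, 0.0])                # reward of each action

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p = softmax(theta)

# Exact gradient of E[R] = sum_a p(a) R(a) with respect to the logits.
exact = p * (R - p @ R)

# Score-function (likelihood-ratio) estimate: E[ R(a) * grad_theta log pi(a | theta) ].
n = 200_000
a = rng.choice(3, size=n, p=p)
one_hot = np.eye(3)[a]
grad_log_pi = one_hot - p                    # grad of log softmax(theta)[a]
estimate = (R[a][:, None] * grad_log_pi).mean(axis=0)

print("exact:   ", np.round(exact, 4))
print("estimate:", np.round(estimate, 4))
```

With this many samples the two vectors typically agree to about two decimal places, which is the point: the estimator is unbiased but noisy.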
Previous slide:
∇θ Eτ[R] = Eτ[ (∑_{t=0}^{T−1} rt) (∑_{t=0}^{T−1} ∇θ log π(at | st, θ)) ]
Because the action at time t cannot affect rewards received at earlier times, the terms pairing ∇θ log π(at | st, θ) with r_{t'} for t' < t have zero expectation. Exploiting this temporal structure gives
∇θ Eτ[R] = Eτ[ ∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ∑_{t'=t}^{T−1} r_{t'} ]
To reduce variance, subtract a state-dependent baseline b(st) from the reward-to-go; this leaves the estimator unbiased. A near-optimal choice is the expected return from st:
b(st) ≈ E[ rt + γ r_{t+1} + γ² r_{t+2} + ⋯ + γ^{T−1−t} r_{T−1} ]
Write the gradient estimator more generally as
∇θ Eτ[R] ≈ Eτ[ ∑_{t=0}^{T−1} ∇θ log π(at | st, θ) Ât ]
where Ât is an estimate of the advantage of action at in state st.
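A sketch of this estimator end to end, reusing the kind of toy tabular MDP and softmax policy from the earlier sketches: Ât is taken to be the (undiscounted) reward-to-go minus a constant baseline (the mean episode return), and θ is updated by gradient ascent. The environment, baseline choice, and step size are illustrative assumptions, not the slides' method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 10
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # toy rewards r(s, a)
theta = np.zeros((n_states, n_actions))                           # tabular softmax policy parameters

def probs(s):
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()

def rollout():
    s, states, actions, rewards = 0, [], [], []
    for _ in range(T):
        a = rng.choice(n_actions, p=probs(s))
        states.append(s)
        actions.append(a)
        rewards.append(R[s, a])
        s = rng.choice(n_states, p=P[s, a])
    return states, actions, np.array(rewards)

for _ in range(200):
    episodes = [rollout() for _ in range(20)]
    baseline = np.mean([rew.sum() for _, _, rew in episodes])     # crude constant baseline
    grad = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        reward_to_go = np.cumsum(rewards[::-1])[::-1]             # sum_{t' >= t} r_t'
        for t, (s, a) in enumerate(zip(states, actions)):
            adv_hat = reward_to_go[t] - baseline                  # A_hat_t
            g_logpi = -probs(s)
            g_logpi[a] += 1.0                                     # grad of log pi(a | s, theta) wrt theta[s]
            grad[s] += adv_hat * g_logpi
    theta += 0.01 * grad / len(episodes)                          # gradient ascent on eta(theta)

print("pi(. | s=0) after training:", np.round(probs(0), 2))
```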
Reinforcement learning
- Trust region¹⁴ and natural gradient¹⁵ policy optimization
- Generalized advantage estimation and asynchronous actor-critic methods¹⁶
- Deterministic and stochastic value gradient methods¹⁷

¹⁴ J. Schulman et al. Trust Region Policy Optimization. In: arXiv preprint arXiv:1502.05477 (2015).
¹⁵ S. Kakade. A Natural Policy Gradient. In: NIPS. Vol. 14. 2001, pp. 1531-1538; J. A. Bagnell and J. Schneider. Covariant policy search. In: IJCAI. 2003; J. Peters and S. Schaal. Natural actor-critic. In: Neurocomputing 71.7 (2008), pp. 1180-1190.
¹⁶ J. Schulman et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint arXiv:1506.02438 (2015); V. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. In: arXiv preprint arXiv:1602.01783 (2016).
¹⁷ D. Silver et al. Deterministic policy gradient algorithms. In: ICML. 2014; N. Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015, pp. 2926-2934.
Interlude
Value Functions
Definitions:
Q^π(s, a) = E[ r0 + γ r1 + γ² r2 + ⋯ | s0 = s, a0 = a ]
Called the Q-function or state-action value function
V^π(s) = E[ r0 + γ r1 + γ² r2 + ⋯ | s0 = s ] = E_{a∼π}[ Q^π(s, a) ]
Called the state-value function
A^π(s, a) = Q^π(s, a) − V^π(s)
Called the advantage function
Expanding the Q-function k steps forward:
Q^π(s0, a0) = E_{s1,a1,…,sk,ak | s0,a0}[ r0 + γ r1 + ⋯ + γ^{k−1} r_{k−1} + γ^k Q^π(sk, ak) ]
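These definitions can be estimated directly by Monte Carlo rollouts. The sketch below does this for a small invented MDP with a fixed uniform policy, truncating the infinite discounted sum at a horizon H; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, H = 3, 2, 0.9, 60               # H: truncation horizon for the infinite sum
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # toy dynamics P[s, a, s']
R = rng.normal(size=(nS, nA))                  # toy rewards r(s, a)
pi = np.full((nS, nA), 1.0 / nA)               # fixed uniform policy

def discounted_return(s, a):
    """One rollout of r0 + gamma r1 + gamma^2 r2 + ... starting from (s0, a0) = (s, a)."""
    ret = 0.0
    for t in range(H):
        ret += gamma ** t * R[s, a]
        s = rng.choice(nS, p=P[s, a])
        a = rng.choice(nA, p=pi[s])
    return ret

def Q_hat(s, a, n=3000):
    return np.mean([discounted_return(s, a) for _ in range(n)])

s = 0
Q = np.array([Q_hat(s, a) for a in range(nA)])
V = pi[s] @ Q                                  # V^pi(s) = E_{a~pi}[ Q^pi(s, a) ]
A = Q - V                                      # advantage function
print("Q^pi(s, .):", np.round(Q, 2), " V^pi(s):", round(V, 2), " A^pi(s, .):", np.round(A, 2))
```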
Bellman Backups
Introducing Q*
- Q*(s, a) = max_π Q^π(s, a): the Q-function of the optimal policy
- Define the backup operator B acting on Q-functions:
  [BQ](s0, a0) = E_{s1}[ r0 + γ max_{a1} Q(s1, a1) ]
- Q* is a fixed point of B:
  BQ* = Q*
Value iteration:
- Initialize Q
- Do Q ← BQ until convergence
Policy iteration:
- Initialize π
- Repeat:
  - Compute Q^π
  - π ← GQ^π (greedy policy for Q^π), where [GQ](s) = arg max_a Q(s, a)
To compute Q^π, use the policy backup operator Bπ:
[BπQ](s0, a0) = E[ r0 + γ E_{a1∼π}[ Q(s1, a1) ] ]
Q^π is a fixed point of Bπ, and both iterative schemes converge under suitable conditions.¹⁸
¹⁸ T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. In: Neural Computation 6.6 (1994), pp. 1185-1201; D. P. Bertsekas. Dynamic Programming and Optimal Control. Vol. 2. 2. Athena Scientific, 2012.
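A sketch of value iteration in the tabular case: Q is an array over states and actions, the backup Q ← BQ applies Q(s, a) ← r(s, a) + γ E_{s'}[max_{a'} Q(s', a')], and iteration stops once Q is (numerically) a fixed point, at which point GQ gives the greedy policy. The toy MDP is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))                   # expected reward r(s, a)

def backup(Q):
    """[BQ](s, a) = r(s, a) + gamma * E_{s'}[ max_{a'} Q(s', a') ]."""
    return R + gamma * P @ Q.max(axis=1)

Q = np.zeros((nS, nA))
for i in range(1000):
    Q_new = backup(Q)
    if np.max(np.abs(Q_new - Q)) < 1e-8:         # Q is (numerically) a fixed point of B
        break
    Q = Q_new

greedy = Q.argmax(axis=1)                        # [GQ](s) = argmax_a Q(s, a)
print(f"converged after {i} iterations; greedy policy: {greedy}")
```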
Neural-Fitted Algorithms
- Parameterize the Q-function with a neural network Qθ
- Fit it to backed-up target values by regression:
  minimize_θ ∑_t ( Qθ(st, at) − q̂t )²,  where q̂t = rt + γ max_{a′} Qθ(st+1, a′)   (1)
- One version¹⁹
Online Algorithms
- Deep Q-learning with experience replay²⁰
- Refinements: dueling network architectures, double Q-learning, prioritized experience replay, and recurrent Q-networks for partial observability²¹
- Asynchronous methods that run many learners in parallel²²
²⁰ V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint arXiv:1312.5602 (2013).
²¹ Z. Wang, N. de Freitas, and M. Lanctot. Dueling Network Architectures for Deep Reinforcement Learning. In: arXiv preprint arXiv:1511.06581 (2015); H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In: CoRR abs/1509.06461 (2015); T. Schaul et al. Prioritized experience replay. In: arXiv preprint arXiv:1511.05952 (2015); M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In: arXiv preprint arXiv:1507.06527 (2015).
²² V. Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. In: arXiv preprint arXiv:1602.01783 (2016).
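In the spirit of the online Q-learning recipe cited above, here is a stripped-down sketch of the pattern: act ε-greedily, store transitions in a replay buffer, sample minibatches, and regress Q toward r + γ max_{a'} Q_target(s', a') with a periodically synced target copy. A lookup table stands in for the neural network, and the environment and hyperparameters are all assumptions, so this illustrates the structure rather than the cited algorithm.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
nS, nA, gamma, eps, lr = 5, 2, 0.95, 0.1, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # toy dynamics
R = rng.normal(size=(nS, nA))                     # toy rewards

Q = np.zeros((nS, nA))
Q_target = Q.copy()                               # periodically synced target copy
replay = deque(maxlen=10_000)

s = 0
for step in range(20_000):
    a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())  # epsilon-greedy
    r = R[s, a] + rng.normal(scale=0.1)
    s_next = rng.choice(nS, p=P[s, a])
    replay.append((s, a, r, s_next))              # store transition in the replay buffer
    s = s_next

    if len(replay) >= 64:
        batch = [replay[i] for i in rng.integers(len(replay), size=32)]
        for bs, ba, br, bs2 in batch:
            target = br + gamma * Q_target[bs2].max()
            Q[bs, ba] += lr * (target - Q[bs, ba])  # regress Q toward the backed-up target
    if step % 500 == 0:
        Q_target = Q.copy()                       # sync target

print("greedy policy:", Q.argmax(axis=1))
```

The replay buffer breaks the correlation between consecutive transitions, and the target copy keeps the regression target from chasing its own updates; those are the two stabilizing ideas commonly combined with a neural-network Q-function.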
Conclusion
Policy gradient methods and Q-function methods trade off differently:

                 Vanilla PG   Natural PG   Q-Learning
                 OK           Good         Bad
Data Efficient   Bad          OK           OK
Fin