A Deeper Look at Experience Replay
Algorithm 2: Buffer-Q
Initialize the value function Q
Initialize the replay buffer M
while not converged do
    Get the initial state S
    while S is not the terminal state do
        Select an action A according to an ε-greedy policy derived from Q
        Execute the action A, get the reward R and the next state S′
        Store the transition (S, A, R, S′) into the replay buffer M
        Sample a batch of transitions B from M
        Update the value function Q with B
        S ← S′
    end
end
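A minimal sketch of the replay buffer M used above, assuming a FIFO buffer with uniform random sampling (the class and method names are illustrative, not from the paper; sampling with replacement is chosen only to keep the sketch short):

import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer with uniform random sampling (illustrative sketch)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transition is dropped when full

    def store(self, transition):
        # transition is a tuple (S, A, R, S_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling with replacement, so it also works while the
        # buffer still holds fewer transitions than batch_size
        return [random.choice(self.buffer) for _ in range(batch_size)]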
Algorithm 3: Combined-Q
Initialize the value function Q
Initialize the replay buffer M
while not converged do
    Get the initial state S
    while S is not the terminal state do
        Select an action A according to an ε-greedy policy derived from Q
        Execute the action A, get the reward R and the next state S′
        Store the transition t = (S, A, R, S′) into the replay buffer M
        Sample a batch of transitions B from M
        Update the value function Q with B and t
        S ← S′
    end
end
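The only difference between Buffer-Q and Combined-Q is how the mini-batch B is assembled: combined experience replay (CER) always adds the latest on-line transition t to the sampled batch. A hedged sketch of that batch construction, reusing the ReplayBuffer sketch above (the Q-update itself is omitted):

def build_batch(buffer, latest_transition, batch_size, use_cer):
    """Construct the mini-batch for one update step (illustrative sketch).

    Buffer-Q samples `batch_size` transitions uniformly.
    Combined-Q (CER) samples `batch_size - 1` transitions and always
    includes the latest on-line transition.
    """
    if use_cer:
        batch = buffer.sample(batch_size - 1)
        batch.append(latest_transition)
    else:
        batch = buffer.sample(batch_size)
    return batch

Everything else in the training loop is unchanged.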
Our testbeds are a grid world, the Lunar Lander, and the Atari game Pong. Figure 1 elaborates the tasks.

Our first task is a grid world. The agent is placed at the same location at the beginning of each episode (S in Figure 1(a)), and the location of the goal is fixed (G in Figure 1(a)). There are four possible actions {Left, Right, Up, Down}, and the reward is −1 at every time step, implying that the agent should learn to reach the goal as soon as possible. Some fixed walls are placed in the grid world, and if the agent bumps into a wall, it remains in the same position.
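For concreteness, a minimal step function consistent with this description (fixed goal, reward −1 per step, walls that block movement); the layout, wall positions, and names here are illustrative assumptions rather than the authors' exact grid:

ACTIONS = {'Left': (0, -1), 'Right': (0, 1), 'Up': (-1, 0), 'Down': (1, 0)}

def step(state, action, walls, goal, height, width):
    """One transition in the grid world: the reward is -1 every step, and
    bumping into a wall (or the border) leaves the agent in place."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = state
    if 0 <= r < height and 0 <= c < width and (r, c) not in walls:
        next_state = (r, c)
    done = next_state == goal
    return next_state, -1.0, done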
Our second task is the Lunar Lander task in Box2D (Catto (2011)). The state space is R^8 with each dimension unbounded, and there are four discrete actions. Solving the Lunar Lander task requires careful exploration: negative rewards are constantly given during the landing, so the algorithm can easily get trapped in a local minimum where it avoids negative rewards by doing nothing after a certain number of steps until timeout.

The last task is the Atari game Pong. It is important to note that our evaluation aims to study the idea of experience replay; we do not study how experience replay interacts with a deep convolutional network. To this end, it is better to use an accurate state representation of the game rather than to learn the representation through end-to-end training. We therefore use the RAM of the game as the state rather than the raw pixels. A state is then a vector in {0, ..., 255}^128, and we normalize each element of this vector into [0, 1] by dividing by 255. The game Pong has six discrete actions.
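A short sketch of this state preprocessing, assuming the Gym Atari RAM environment and the pre-0.26 Gym API (the environment id 'Pong-ram-v0' is our assumption; the paper only states that the 128-byte RAM vector is used and divided by 255):

import gym
import numpy as np

env = gym.make('Pong-ram-v0')   # 128-byte RAM observation, 6 discrete actions
obs = env.reset()               # vector in {0, ..., 255}^128

def preprocess(ram):
    # normalize each byte into [0, 1] by dividing by 255
    return np.asarray(ram, dtype=np.float32) / 255.0

state = preprocess(obs)         # float vector in [0, 1]^128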
To conduct experiments efficiently, we introduce a timeout in our tasks; in other words, an episode ends automatically after a certain number of time steps. Timeout is necessary in practice, as otherwise an episode can be arbitrarily long. However, we have to note that timeout makes the environment non-stationary. To reduce the influence of timeout on our experimental results, we manually selected a large enough timeout for each task, so that an episode rarely ends due to timeout: we set the timeout to 5,000, 1,000 and 10,000 steps for the grid world, the Lunar Lander and the game Pong respectively. Furthermore, we use the partial-episode-bootstrap (PEB) technique introduced by Pardo et al. (2017), where we continue bootstrapping from the next state during training when the episode ends due to timeout. Pardo et al. (2017) show that PEB significantly reduces the negative influence of the timeout mechanism.
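A sketch of how PEB changes the update target: on a timeout we keep bootstrapping from the next state, and only a genuine environment termination stops the bootstrap. Function and argument names are illustrative:

import numpy as np

def td_target(reward, next_q_values, terminal, timeout, gamma=1.0):
    """Q-learning target with partial-episode bootstrapping (PEB).

    `terminal` is True when the environment itself ended the episode;
    `timeout` is True when the episode was cut off by the time limit.
    With PEB we only stop bootstrapping on true terminations.
    """
    if terminal and not timeout:
        return reward                               # genuine terminal state
    return reward + gamma * np.max(next_q_values)   # bootstrap through timeouts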
Different mini-batch sizes have different computational costs; as a result, throughout our evaluation we do not vary the batch size and use a mini-batch of fixed size 10 for all the tasks. In other words, we sample 10 transitions from the replay buffer at each time step. For CER, we only sample 9 transitions, and the mini-batch consists of the sampled 9 transitions and the latest transition.

The behavior policy is an ε-greedy policy with ε = 0.1. We plot the on-line training progression for each experiment; in other words, we plot the episode return against the number of training episodes during the on-line training.
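The behavior policy can be sketched as follows (ε = 0.1 as in the experiments; the tie-breaking of the argmax is an implementation detail not specified in the paper):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))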
Figure 1: From left to right: the grid world, Lunar Lander, Pong
5 Evaluation

5.1 Tabular Function Representation

Among the three tasks, only the grid world is compatible with tabular methods. In the tabular methods, the value function q is represented by a look-up table. The initial values for all state-action pairs are set to 0, which is an optimistic initialization (Sutton (1996)) to encourage exploration. The discount factor is 1.0, and the learning rate is 0.1.
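With this setup, one tabular Q-learning update on a single transition can be sketched as follows (a standard update; the dictionary-based table and the names are illustrative):

from collections import defaultdict

GAMMA = 1.0         # discount factor used in the tabular experiments
ALPHA = 0.1         # learning rate
ACTIONS = range(4)  # Left, Right, Up, Down

q = defaultdict(float)  # every Q(s, a) starts at 0, optimistic since rewards are -1

def q_learning_update(s, a, r, s_next, done):
    """One tabular Q-learning update on a single transition."""
    target = r if done else r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])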
Figures 2(a-c) show the training progression of the different algorithms with different replay buffer sizes on the grid world task. In Figure 2(a), Online-Q solves the task in about 1,000 episodes. In Figure 2(b), although all the Buffer-Q agents with various replay buffer sizes tend to find the solution, it is interesting to see that the smallest replay buffer works best in terms of both the learning speed and the final performance. When we increase the buffer size from 10^2 to 10^5, the learning speed keeps decreasing. When we keep increasing the buffer size to 10^6, the learning speed catches up but is still slower than with buffer size 10^2. We do not increase the replay buffer size beyond 10^6, as in all of our experiments the total number of training steps is less than 10^6. Things are different in Figure 2(c): all of the Combined-Q agents with different buffer sizes learn to solve the task at a similar speed. When we zoom in, we find that the agents with a large replay buffer learn fastest, as suggested by the purple line and the yellow line. This is contrary to what we observed with the Buffer-Q agents. From Figure 2(b), we can see that in the original experience replay a large replay buffer hurts the performance, and from Figure 2(c) it is clear that CER makes the agent less sensitive to the replay buffer size.

Q-learning with a tabular function representation is guaranteed to converge under any data distribution as long as each state-action pair is visited infinitely many times (together with some other weak conditions). However, the data distribution does influence the convergence speed. In the original experience replay, if a large replay buffer is used, a rare on-line transition is likely to influence the agent much later than it would with a small replay buffer. We use a simple example to show this. Assume we have a replay buffer of size m, and we sample 1 transition from the replay buffer per time step. We assume the replay buffer is full at the current time step and a new transition t arrives; we then remove the oldest transition in the replay buffer and add t. The probability that t is replayed within k (k ≤ m) time steps is 1 − (1 − 1/m)^k, which shrinks towards 0 as the buffer size m grows for any fixed k.
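Plugging concrete numbers into this expression shows how a large buffer delays the effect of a new transition, assuming the one-sample-per-step uniform sampling described above:

def prob_replayed_within(k, m):
    """Probability that a new transition is sampled at least once
    within k steps from a full buffer of size m (k <= m)."""
    return 1.0 - (1.0 - 1.0 / m) ** k

# chance the new transition has been replayed within k = 100 steps
print(prob_replayed_within(100, 10**2))  # ~0.63 for a buffer of size 10^2
print(prob_replayed_within(100, 10**6))  # ~0.0001 for a buffer of size 10^6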
Figure 2: Training progression with tabular function representation in the grid world. Lines with different colors represent replay buffers with different sizes, and the number inside the image shows the replay buffer size. The results are averaged over 30 independent runs, and standard errors are plotted.
Figure 3: Training progression of a linear function approximator on the grid world task. Lines with different colors represent replay buffers with different sizes, and the number inside the image shows the replay buffer size. The results are averaged over 30 independent runs, and standard errors are plotted.
… in Buffer-Q. Compared with Figure 3(c), it is clear that adding the on-line transition significantly improves the learning speed, especially for a large replay buffer. The results are similar to what we observed with the tabular function representation.

… trade-off between the data quality and the data correlation. With a smaller replay buffer, the data tends to be fresher; however, it is highly temporally correlated, while training a neural network often needs i.i.d. data. With a larger replay buffer, the sampled data tends to be uncorrelated, but it is more outdated. The Buffer-Q agent with an extremely large replay buffer (e.g., 10^5 or 10^6) fails to find the optimal solution. Comparing Figures 4(a) and (b), it is clear that CER significantly speeds up the learning, especially for a large replay buffer.
Figure 4: Training progression with non-linear function representation in the grid world. Lines with different colors represent replay buffers with different sizes, and the number inside the image shows the replay buffer size. The results are averaged over 30 independent runs, and standard errors are plotted.

Figure 5: Training progression with non-linear function representation in the Lunar Lander. Lines with different colors represent replay buffers with different sizes, and the number inside the image shows the replay buffer size. The results are averaged over 30 independent runs, and standard errors are plotted. The curves are smoothed by a sliding window of size 30.

Figure 6: Training progression with non-linear function representation in the game Pong. Lines with different colors represent replay buffers with different sizes, and the number inside the image shows the replay buffer size. The results are averaged over 10 independent runs, and standard errors are plotted. The curves are smoothed by a sliding window of size 30. It is expected that the agent does not solve the game Pong, as it is too difficult to approximate the state-value function with a single-hidden-layer network.

6 Conclusion

… future effort should focus on developing a new principled algorithm to fully replace experience replay.

Acknowledgements

The authors thank Kristopher De Asis and Yi Wan for their thoughtful comments. We also thank Arash Tavakoli, Vitaly Levdik and Fabio Pardo for pointing out the improper processing of timeout termination in the previous version of the paper.
References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. arXiv preprint arXiv:1707.01495.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279.

Catto, E. (2011). Box2D: A 2D physics engine for games.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-H. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4):69–97.

Liu, R. and Zou, J. (2017). The effects of memory replay in reinforcement learning.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2017). Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind control suite. arXiv preprint arXiv:1801.00690.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge.