
A Deeper Look at Experience Replay

Shangtong Zhang, Richard S. Sutton


Dept. of Computing Science
University of Alberta
{shangtong.zhang, rsutton}@ualberta.ca
arXiv:1712.01275v3 [cs.LG] 30 Apr 2018

Abstract

Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper we rethink its utility. Experience replay introduces a new hyper-parameter, the replay buffer size, which needs careful tuning, yet the importance of this hyper-parameter has long been underestimated by the community. We conduct a systematic empirical study of experience replay under various function representations and show that a large replay buffer can significantly hurt performance. Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer, and we demonstrate its utility in both a simple grid world and challenging domains such as Atari games.

1 Introduction

Experience replay has recently enjoyed great success in the deep RL community and has become a standard component of many deep RL algorithms (Lillicrap et al. 2015; Andrychowicz et al. 2017). To date it is the only method that can generate uncorrelated data for the online training of deep RL systems, apart from the use of multiple workers (Mnih et al. 2016), which unfortunately changes the problem setting. In this paper, we rethink the utility of experience replay. A critical flaw of experience replay is hidden by the complexity of deep RL systems, which explains an otherwise puzzling fact: experience replay was proposed in the early days of RL, yet it drew little attention while tabular methods and linear function approximation dominated the field. Experience replay not only provides uncorrelated data for training a neural network, but also significantly improves data efficiency (Lin 1992; Wang et al. 2016), a desirable property for many RL algorithms, which are often hungry for data. Although algorithms in the pre-deep-RL era did not need to worry about stabilizing a neural network, they did care about data efficiency. If experience replay were a perfect idea, it would already have been widely used in those early days. However, to the best of our knowledge, no previous work has shown what is wrong with experience replay.

Moreover, with the success of the Deep Q-Network (DQN, Mnih et al. 2015), the community seems to have adopted a default value for the replay buffer size, namely 10^6. For instance, Mnih et al. (2015) set the replay buffer size for DQN to 10^6 across various Atari games (Bellemare et al. 2013), after which Lillicrap et al. (2015) also set the replay buffer for Deep Deterministic Policy Gradient (DDPG) to 10^6 to address various MuJoCo tasks (Todorov et al. 2012). Andrychowicz et al. (2017) set their replay buffer to 10^6 in Hindsight Experience Replay (HER) for a physical robot arm, and Tassa et al. (2018) use a replay buffer of capacity 10^6 to solve the tasks in the DeepMind Control Suite. In these works, the tasks range from simulated environments to real-world robots and the function approximators range from shallow fully connected networks to deep convolutional networks, yet they all use a replay buffer of the same capacity. The setting appears robust inside complex deep RL systems, and nobody bothers to tune the replay buffer size. However, once we separate experience replay from the complex learning system, we find that the agent is quite sensitive to the size of the replay buffer; some facts about the buffer size are hidden by the complexity of the learning system.

Our first contribution is a systematic evaluation of experience replay under various function representations, i.e. the tabular case, linear function approximation and non-linear function approximation. We show that both a small and a large replay buffer can heavily hurt the learning process. In other words, the size of the replay buffer, which has long been underestimated by the community, is an important task-dependent hyper-parameter that needs careful tuning; this fact is obscured by the complexity of modern deep RL systems.

Our second contribution is a simple method to remedy the negative influence of a large replay buffer, requiring only O(1) extra computation: whenever we sample a batch of transitions, we add the latest transition to the batch and use the corrected batch to train the agent. We refer to this method as combined experience replay (CER) in the rest of this paper.

It is important to note that experience replay itself is not a complete learning algorithm; it has to be combined with other algorithms to form a complete learning system. In our evaluation, we consider the combination of experience replay with Q-learning (Watkins 1989), following the DQN paradigm. We perform our evaluation and showcase the utility of CER in both a small toy task, a grid world, and larger challenging domains, namely the Lunar Lander and Atari games.

2 Related Work

CER can loosely be viewed as a special case of prioritized experience replay (PER, Schaul et al. 2015), in which Schaul et al. (2015) proposed giving the latest transition the largest priority. However, PER is still a stochastic replay method: giving the latest transition the largest priority does not guarantee that it is replayed immediately. Moreover, PER and CER aim to solve different problems. CER is designed to remedy the negative influence of a large replay buffer, while PER is designed to replay the transitions in the buffer more efficiently. In particular, if the replay buffer size is set properly, we do not expect CER to further improve performance, whereas PER is always expected to improve performance. Although PER contains a component similar to CER, i.e. giving the largest priority to the latest transition, it has never been shown how that component interacts with the size of the replay buffer or whether it alone makes a significant contribution to the whole learning system. Furthermore, PER is an O(log N) algorithm that relies on special data structures such as a sum-tree, which hinders its wide adoption, whereas CER is an O(1) plug-in that needs only a little extra computation and engineering effort.

Liu and Zou (2017) conducted a theoretical study of the influence of the replay buffer size. However, their analysis applies only to an ordinary differential equation model, and their experiments did not properly handle episodes that end by timeout.

Experience replay can also be interpreted as a planning method, because it is comparable to Dyna (Sutton 1991) with a look-up table. The key difference is that Dyna only samples states and actions, while experience replay samples full transitions, which may be biased and potentially harmful.

There have also been successful attempts to eliminate experience replay in deep RL. The most prominent is the Asynchronous Advantage Actor-Critic method (Mnih et al. 2016), where experience replay is replaced by parallelized workers. The workers are distributed among processes, and different workers have different random seeds, so the collected data is still uncorrelated.

3 Algorithms

Experience replay was first introduced by Lin (1992). The key idea is to train the agent with transitions sampled from a buffer of previously experienced transitions. A transition is a quadruple (s, a, r, s′), where s is the state, a is the action, r is the reward received after executing action a in state s, and s′ is the next state. At each time step, the current transition is added to the replay buffer and some transitions are sampled from the buffer to train the agent. Various sampling strategies exist, among which uniform sampling is the most popular. Prioritized sampling (Schaul et al. 2015), where each transition is associated with a priority, is also possible, but it suffers from O(log N) time complexity, so we restrict our evaluation to uniform sampling.

We compare three algorithms: Q-learning with online transitions (referred to as Online-Q, Algorithm 1), Q-learning with experience replay, where transitions for training come only from the buffer (referred to as Buffer-Q, Algorithm 2), and Q-learning with CER (referred to as Combined-Q, Algorithm 3). Online-Q is primitive Q-learning, where the transition at every time step is used to update the value function immediately. Buffer-Q is DQN-like Q-learning, where the current transition is not used to update the value function immediately; instead, it is stored in the replay buffer and only transitions sampled from the buffer are used for learning. Combined-Q uses both the current transition and the transitions sampled from the replay buffer to update the value function at every time step.
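For concreteness, a uniform-sampling replay buffer of the kind described above can be sketched in a few lines of Python; the class and method names here are illustrative choices rather than the authors' code. Algorithms 1-3 below then differ only in how the training batch is drawn from this buffer.

import random
from collections import deque

class ReplayBuffer:
    # A fixed-capacity FIFO buffer: when full, the oldest transition is evicted,
    # matching the setting assumed in the analysis later in the paper.
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # transition is the quadruple (s, a, r, s_next)
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling; the batch is capped while the buffer is still filling up.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))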
Algorithm 1: Online-Q
Initialize the value function Q
while not converged do
    Get the initial state S
    while S is not the terminal state do
        Select an action A according to an ε-greedy policy derived from Q
        Execute the action A, get the reward R and the next state S′
        Update the value function Q with (S, A, R, S′)
        S ← S′
    end
end

Algorithm 2: Buffer-Q
Initialize the value function Q
Initialize the replay buffer M
while not converged do
    Get the initial state S
    while S is not the terminal state do
        Select an action A according to an ε-greedy policy derived from Q
        Execute the action A, get the reward R and the next state S′
        Store the transition (S, A, R, S′) into the replay buffer M
        Sample a batch of transitions B from M
        Update the value function Q with B
        S ← S′
    end
end

Algorithm 3: Combined-Q
Initialize the value function Q
Initialize the replay buffer M
while not converged do
    Get the initial state S
    while S is not the terminal state do
        Select an action A according to an ε-greedy policy derived from Q
        Execute the action A, get the reward R and the next state S′
        Store the transition t = (S, A, R, S′) into the replay buffer M
        Sample a batch of transitions B from M
        Update the value function Q with B and t
        S ← S′
    end
end
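To make the difference between Algorithms 1-3 concrete, the following minimal Python sketch shows a single time step under each scheme. The helpers q_update (which applies a Q-learning update to every transition in a batch) and buffer (a uniform-sampling replay buffer like the sketch above) are illustrative assumptions rather than the authors' code.

def online_q_step(q_update, transition):
    # Online-Q: the current transition is used immediately and never stored.
    q_update([transition])

def buffer_q_step(q_update, buffer, transition, batch_size=10):
    # Buffer-Q (DQN-like): store the transition, then train only on sampled ones;
    # the current transition itself is not guaranteed to appear in the batch.
    buffer.add(transition)
    q_update(buffer.sample(batch_size))

def combined_q_step(q_update, buffer, transition, batch_size=10):
    # Combined-Q (CER): sample batch_size - 1 transitions and always append
    # the latest transition, so it influences the agent immediately.
    buffer.add(transition)
    q_update(buffer.sample(batch_size - 1) + [transition])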
4 Testbeds

We use three tasks to evaluate the aforementioned algorithms: a grid world, the Lunar Lander and the Atari game Pong. Figure 1 illustrates the tasks.

Figure 1: From left to right: the grid world, the Lunar Lander, Pong.

Our first task is a grid world. The agent is placed at the same location at the beginning of each episode (S in Figure 1(a)), and the location of the goal is fixed (G in Figure 1(a)). There are four possible actions {Left, Right, Up, Down}, and the reward is −1 at every time step, implying that the agent should learn to reach the goal as quickly as possible. Some fixed walls are placed in the grid world, and if the agent bumps into a wall it remains in the same position.

Our second task is the Lunar Lander task in Box2D (Catto 2011). The state space is R^8 with each dimension unbounded, and there are four discrete actions. Solving the Lunar Lander task requires careful exploration: negative rewards are constantly given during the landing, so the algorithm can easily get trapped in a local minimum where it avoids negative rewards by doing nothing after a certain number of steps until timeout.

The last task is the Atari game Pong. It is important to note that our evaluation is aimed at studying the idea of experience replay; we are not studying how experience replay interacts with a deep convolutional network. To this end, it is better to use an accurate state representation of the game rather than learning the representation end to end. We therefore use the RAM of the game as the state rather than the raw pixels. A state is then a vector in {0, ..., 255}^128, and we normalize each element into [0, 1] by dividing by 255. The game Pong has six discrete actions.
To conduct experiments efficiently, we introduce a timeout in our tasks: an episode ends automatically after a certain number of time steps. A timeout is necessary in practice, as otherwise an episode can be arbitrarily long; however, we note that a timeout makes the environment non-stationary. To reduce its influence on our experimental results, we manually selected a large enough timeout for each task so that an episode rarely ends due to timeout. We set the timeout to 5,000, 1,000 and 10,000 steps for the grid world, the Lunar Lander and the game Pong respectively. Furthermore, we use the partial-episode-bootstrap (PEB) technique introduced by Pardo et al. (2017), where we continue bootstrapping from the next state during training when an episode ends due to timeout. Pardo et al. (2017) show that PEB significantly reduces the negative influence of the timeout mechanism.
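The partial-episode-bootstrap rule can be expressed directly in the computation of the one-step target. Below is a minimal sketch under the assumption that the environment reports separately whether an episode ended at a true terminal state or was merely cut off by the timeout; the variable names are ours.

def one_step_target(reward, max_q_next, terminal, timed_out, gamma=1.0):
    # PEB (Pardo et al. 2017), sketched: stop bootstrapping only at a true
    # terminal state; if the episode was cut off by the timeout, keep
    # bootstrapping from the next state as if the episode had continued.
    if terminal and not timed_out:
        return reward
    return reward + gamma * max_q_next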
Because different mini-batch sizes have different computational costs, we do not vary the batch size throughout our evaluation and use a mini-batch of fixed size 10 for all tasks. In other words, we sample 10 transitions from the replay buffer at each time step. For CER, we sample only 9 transitions, and the mini-batch consists of the 9 sampled transitions plus the latest transition. The behavior policy is an ε-greedy policy with ε = 0.1. We plot the online training progression for each experiment; in other words, we plot the episode return against the number of training episodes during online training.
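The ε-greedy behavior policy mentioned above can be sketched as follows; q_values is assumed to hold the current action-value estimates for one state.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon take a uniformly random action,
    # otherwise act greedily with respect to the current estimates.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])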

5 Evaluation

5.1 Tabular Function Representation

Among the three tasks, only the grid world is compatible with tabular methods. In the tabular case, the value function q is represented by a look-up table. The initial values for all state-action pairs are set to 0, an optimistic initialization (Sutton 1996) that encourages exploration. The discount factor is 1.0 and the learning rate is 0.1.

Figures 2(a-c) show the training progression of the different algorithms with different replay buffer sizes on the grid world task. In Figure 2(a), Online-Q solves the task in about 1,000 episodes. In Figure 2(b), although all the Buffer-Q agents with various replay buffer sizes tend to find the solution, it is interesting to see that the smallest replay buffer works best in terms of both learning speed and final performance. When we increase the buffer size from 10^2 to 10^5, the learning speed keeps decreasing; when we increase the buffer size further to 10^6, the learning speed catches up but is still slower than with buffer size 10^2. We do not increase the replay buffer size beyond 10^6, as in all of our experiments the total number of training steps is less than 10^6. Things are different in Figure 2(c): all of the Combined-Q agents with different buffer sizes learn to solve the task at a similar speed. Zooming in, the agents with a large replay buffer learn fastest, as suggested by the purple and yellow lines, which is contrary to what we observed with the Buffer-Q agents. From Figure 2(b) we learn that with the original experience replay a large replay buffer hurts performance, and from Figure 2(c) it is clear that CER makes the agent less sensitive to the replay buffer size.

Figure 2: Training progression with tabular function representation in the grid world; panels (a) Online-Q, (b) Buffer-Q, (c) Combined-Q. Lines with different colors represent replay buffers of different sizes, and the number inside each panel shows the replay buffer size. Results are averaged over 30 independent runs, and standard errors are plotted.

Q-learning with a tabular function representation is guaranteed to converge under any data distribution as long as each state-action pair is visited infinitely many times (together with some other mild conditions). However, the data distribution does influence the convergence speed. With the original experience replay, if a large replay buffer is used, a rare online transition is likely to influence the agent later than with a small replay buffer. We use a simple example to show this. Assume we have a replay buffer of size m and we sample one transition from the buffer per time step. Assume further that the buffer is full at the current time step and a new transition t arrives; we then remove the oldest transition from the buffer and add t. The probability that t is replayed within k (k ≤ m) time steps is

1 − (1 − 1/m)^k,

which is monotonically decreasing in m. So with a larger replay buffer, a rare transition tends to exert its influence later. If that transition happens to be important, it will in turn affect the agent's future data collection, and the overall learning speed is slowed down. This explains the phenomenon in Figure 2(b) that increasing the replay buffer size from 10^2 to 10^6 slows down learning. Note that with a replay buffer of size 10^7 the buffer never fills up, so all transitions are preserved; this is a special case.
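To give a feel for the formula, the snippet below evaluates 1 − (1 − 1/m)^k in the one-sample-per-step setting of the example above; the particular buffer sizes are only illustrative.

def replay_probability(m, k):
    # Probability that a newly added transition is replayed at least once
    # within k steps when one transition is sampled uniformly per step
    # from a full buffer of size m.
    return 1.0 - (1.0 - 1.0 / m) ** k

print(replay_probability(10 ** 2, 100))  # ~0.63 for a small buffer
print(replay_probability(10 ** 6, 100))  # ~0.0001 for a large buffer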
In CER, all transitions influence the agent immediately, so the agent becomes less sensitive to the choice of replay buffer size.

5.2 Linear Function Approximation

We consider linear function approximation with tile coding (Sutton and Barto 1998). Among our three tasks, only the Lunar Lander task is compatible with tile coding, so we consider only this task in this part of the evaluation. In our experiments, tile coding is done via the tile coding software (http://incompleteideas.net/sutton/tiles/tiles3.html) with 8 tilings. We set the initial weight parameters to 0 to encourage exploration. The discount factor is 1.0, and the learning rate is 0.1/8 = 0.0125.
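As a sketch of how the linear learner is assembled, the update below assumes the interface of the tile coding software referenced above (IHT and tiles from tiles3); the hash-table size, the state scaling and the helper names are our own illustrative assumptions.

from tiles3 import IHT, tiles  # tile coding software referenced above

NUM_TILINGS = 8
iht = IHT(4096)                  # hash table size chosen arbitrarily for this sketch
weights = [0.0] * 4096           # zero-initialized weights, as in the paper
alpha = 0.1 / NUM_TILINGS        # per-tiling learning rate, i.e. 0.0125

def active_tiles(state, action):
    # state is assumed to be pre-scaled so one unit corresponds to one tile width
    return tiles(iht, NUM_TILINGS, list(state), [action])

def q(state, action):
    return sum(weights[i] for i in active_tiles(state, action))

def q_learning_update(s, a, r, s_next, actions, gamma=1.0):
    # One-step Q-learning with linear function approximation
    target = r + gamma * max(q(s_next, b) for b in actions)
    error = target - q(s, a)
    for i in active_tiles(s, a):
        weights[i] += alpha * error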
The results are summarized in Figure 3. Figure 3(b) shows that a larger replay buffer hurts the learning speed in Buffer-Q. Compared with Figure 3(c), it is clear that adding the online transition significantly improves the learning speed, especially for a large replay buffer. The results are similar to what we observed with the tabular function representation.

Figure 3: Training progression with linear function approximation (tile coding) in the Lunar Lander task; panels (a) Online-Q, (b) Buffer-Q, (c) Combined-Q. Lines with different colors represent replay buffers of different sizes, and the number inside each panel shows the replay buffer size. Results are averaged over 30 independent runs, and standard errors are plotted.

5.3 Non-linear Function Approximation

We use a network with a single hidden layer as our non-linear function approximator. We apply the ReLU nonlinearity to the hidden units, and the output units are linear and produce the state-action values. With a neural network as the function approximator, Buffer-Q is almost the same as DQN, so we also employ a target network to obtain stable update targets, following Mnih et al. (2015). Our preliminary experiments show that random exploration at the beginning of training and a decayed exploration rate (ε) do not help the learning process in our tasks.

In the grid world task we use 50 hidden units, and for the other tasks we use 100 hidden units. In the grid world task, we use a one-hot vector to encode the current position of the agent. We use the RMSProp optimizer (Tieleman and Hinton 2012) for all tasks, while the initial learning rates vary among tasks: 0.01, 0.0005 and 0.0025 for the grid world, the Lunar Lander and the game Pong respectively. These initial learning rates were tuned empirically to achieve the best performance.
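One possible realization of this approximator is sketched below in PyTorch; the paper does not specify a framework, so the choice of library and the wiring here are our assumptions, while the hidden size and learning rate follow the Lunar Lander settings above.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # A single hidden layer with ReLU and a linear output head, one output per action.
    def __init__(self, state_dim, num_actions, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork(state_dim=8, num_actions=4)        # e.g. the Lunar Lander setting
target_net = QNetwork(state_dim=8, num_actions=4)
target_net.load_state_dict(q_net.state_dict())      # target network synced periodically, as in DQN
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=0.0005)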
Figure 4 shows the learning progression of the agents with various replay buffer sizes in the grid world task. The replay-buffer-based agents with buffer size 100 and the Online-Q agent fail to learn anything. This is expected: in this case the network tends to over-fit recent transitions and thus forgets what it has learned from previous transitions. In Figure 4(a), the Buffer-Q agent with replay buffer size 10^4 learns fast. This is a medium buffer size, rather than the smallest one as we observed with the tabular and linear function representations.
We hypothesize that there is a trade-off between data quality and data correlation. With a smaller replay buffer, the data tends to be fresher but is highly temporally correlated, while training a neural network typically needs i.i.d. data. With a larger replay buffer, the sampled data tends to be less correlated but more outdated. The Buffer-Q agent with an extremely large replay buffer (e.g., 10^5 or 10^6) fails to find the optimal solution. Comparing Figures 4(a) and (b), it is clear that CER significantly speeds up learning, especially for a large replay buffer.

Figure 4: Training progression with non-linear function representation in the grid world; panels (a) Online-Q, (b) Buffer-Q, (c) Combined-Q. Lines with different colors represent replay buffers of different sizes, and the number inside each panel shows the replay buffer size. Results are averaged over 30 independent runs, and standard errors are plotted.

Figure 5 shows the learning progression of the agents with various replay buffer sizes in the Lunar Lander task. Unlike in the grid world task, the Online-Q agent and the replay-buffer-based agents with buffer size 100 do achieve a good performance level; in fact, the Online-Q agent achieves almost the best performance among all the agents. This suggests that in this task the neural network function approximator is less likely to over-fit recent transitions. From Figure 5(b), it is clear that the Buffer-Q agent with a medium buffer size (10^3) achieves the best performance level, while with a large replay buffer (10^5 or 10^6) the Buffer-Q agent fails to solve the task. Comparing Figures 5(b) and (c), we can see that CER does improve the performance of agents with a large replay buffer. One interesting observation is that some replay-buffer-based agents tend to over-fit the task after a certain number of time steps, so their performance drops; even if we decrease the initial learning rate, this drop persists.

Figure 5: Training progression with non-linear function representation in the Lunar Lander; panels (a) Online-Q, (b) Buffer-Q, (c) Combined-Q. Lines with different colors represent replay buffers of different sizes, and the number inside each panel shows the replay buffer size. Results are averaged over 30 independent runs, and standard errors are plotted. The curves are smoothed by a sliding window of size 30.

Figure 6 shows the learning progression of the agents with various replay buffer sizes in the game Pong. We observed phenomena similar to those in the grid world task; however, in this task CER does not provide much improvement.

Figure 6: Training progression with non-linear function representation in the game Pong; panels (a) Online-Q, (b) Buffer-Q, (c) Combined-Q. Lines with different colors represent replay buffers of different sizes, and the number inside each panel shows the replay buffer size. Results are averaged over 10 independent runs, and standard errors are plotted. The curves are smoothed by a sliding window of size 30. It is expected that the agent does not solve the game Pong, as it is too difficult to approximate the action-value function with a single-hidden-layer network.

6 Conclusion

Experience replay can improve data efficiency and stabilize the training of a neural network, but it does not come for free: with experience replay, some important transitions are delayed in taking effect. This flaw is hidden by the complexity of modern deep RL systems. The negative effect is partially controlled by the size of the replay buffer, which this paper shows to be an important task-dependent hyper-parameter that has long been underestimated by the community. PER is a promising approach to this issue, but it often comes with O(log N) complexity and non-negligible extra engineering effort. We propose CER, which is similar to a component of PER but requires only O(1) extra computation, and show that it can significantly remedy the negative influence of a large replay buffer. However, it is important to note that CER is only a workaround; the idea of experience replay itself is heavily flawed. Future effort should therefore focus on developing a new, principled algorithm to fully replace experience replay.

Acknowledgements

The authors thank Kristopher De Asis and Yi Wan for their thoughtful comments. We also thank Arash Tavakoli, Vitaly Levdik and Fabio Pardo for pointing out the improper processing of timeout termination in the previous version of the paper.
References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. arXiv preprint arXiv:1707.01495.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Catto, E. (2011). Box2D: A 2D physics engine for games.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4):69–97.

Liu, R. and Zou, J. (2017). The effects of memory replay in reinforcement learning.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2017). Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge.
