RL Course Report
Abstract
Experience replay plays an important role in reinforcement learning. It reuses previous experiences to prevent the input data from being highly correlated. Recently, a deep reinforcement learning algorithm with experience replay, called deep deterministic policy gradient (DDPG), has achieved good performance in many continuous control tasks. However, it assumes all experiences are of equal importance and samples them uniformly, which neglects the difference in the value of each individual experience. To improve the efficiency of experience replay in the DDPG method, we propose to replace the original uniform experience replay with prioritized experience replay. We test the algorithms on five tasks in the OpenAI Gym, a testbed for reinforcement learning algorithms. In the experiments, we find that DDPG with the prioritized experience replay mechanism significantly outperforms DDPG with uniform sampling in terms of training time, training stability and final performance. We also find that our algorithm achieves better performance in three tasks than some state-of-the-art algorithms such as Q-Prop, which further demonstrates the effectiveness of the proposed scheme. Our weekly reports and code are available at https://github.com/cardwing/Codes-for-RL-Project.
1 Introduction
Reinforcement learning (RL) is a process in which an agent learns to interact with the environment to maximize long-term rewards [19]. At each time step t, the agent observes state s_t, performs action a_t and receives a reward r_t. Conventional reinforcement learning algorithms such as tabular Q-learning [19] were proposed to handle problems with finite state-action spaces. To apply these methods to domains with infinite state spaces, such as speech recognition [15] and image processing [2], features are hand-crafted to represent the high-dimensional input states. The emergence of deep learning [6] allows this feature extraction process to be learned in an end-to-end manner.
However, deep learning cannot be directly applied to RL, since the states an RL agent receives have strong temporal correlations. This strong correlation violates the independence assumption of deep learning algorithms, which rely on gradient-based optimization to update the network parameters. Therefore, experience replay [3, 7] was proposed to break the temporal correlation between input samples. Experience replay uses a finite-size memory to store previously encountered samples. Then, at each iteration, a fixed number of samples are selected randomly from the memory to update the network.
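As a concrete illustration, the following minimal Python sketch shows a uniform replay buffer of this kind; the class and method names are ours and do not correspond to the actual data structures in our released code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay buffer (illustrative sketch only)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)
```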
With experience replay, deep reinforcement learning, which combines deep learning with RL, has achieved remarkable performance in many tasks. For instance, the deep Q-network (DQN) [14, 13] was proposed to play Atari 2600 games and surpasses human players. For domains with infinite (continuous) action spaces, deep deterministic policy gradient (DDPG) [10] was designed to handle continuous control tasks.
However, DDPG assumes all samples to be of equal importance and adopts uniform experience replay. This is suboptimal since it neglects the difference in the value of each individual sample. A more rational sampling strategy should replace the uniform experience replay. Inspired by [16], which applied prioritized experience replay to DQN, we propose to adopt prioritized experience replay (PER) in DDPG. The difference between our method and [16] is that our application domain has an infinite action space and is therefore more challenging than theirs, which has a discrete and finite action space. In the presented DDPG with PER, we first select the absolute temporal-difference (TD) error as the measure of the value of each individual sample, while we continue to look for better metrics of sample importance for training. Besides, stochastic prioritization is adopted to select experiences in order to prevent over-fitting. Moreover, importance-sampling weights are used to correct the state visitation bias introduced by prioritized sampling.
Therefore, our contribution is to combine the prioritized experience replay mechanism with the DDPG method in the continuous control domain. We test our algorithm on the inverted pendulum, inverted double pendulum, halfcheetah, walker and hopper tasks in the OpenAI Gym [4], which uses the MuJoCo physics engine [20] as the simulator; our model is implemented in TensorFlow [1]. In the experiments, we find that compared with agents using uniform sampling, agents utilizing the prioritized experience replay approach not only obtain more rewards in the long term, but also achieve more stable performance within the same number of training steps. More specifically, prioritized sampling helps the agent reach almost the same average reward as uniform sampling in about half as many training episodes.
2 Methodology
2.1 DRL and DDPG
A standard reinforcement learning process is described as follows: an agent interacts with its
surrounding environment through a sequence of observations st , actions at and rewards rt . Here,
actions a_t are real-valued and rewards r_t are functions of s_t and a_t. We define the policy the agent uses to choose actions as π = P(a_t | s_t). The goal of the agent is to select actions that maximize the discounted future return R_t = Σ_{i=t}^{T} γ^{i−t} r_i, where γ ∈ [0, 1] is the discount factor. We define the optimal action-value function Q∗(s, a) as the maximum return under state s and action a, which is described by the following equation:
Q∗(s_t, a_t) = max_π E[r_t + γ r_{t+1} + γ² r_{t+2} + · · ·]        (1)
Many RL algorithms make use of the following equation, known as the Bellman equation, to obtain the optimal action-value function:

Q∗(s_t, a_t) = E[r_t + γ max_{a_{t+1}} Q∗(s_{t+1}, a_{t+1})]        (2)
This equation is based on the following idea: the maximum return under state s_t and action a_t equals the sum of the immediate reward and the discounted maximum return under the next state s_{t+1}. In order to obtain the optimal action-value function, we can use the Bellman equation to update the action-value function iteratively. This approach approximates the optimal action-value function well as the time step t approaches infinity [19] (i.e., Q(s_t, a_t) → Q∗(s_t, a_t) as t → ∞).
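For illustration, a minimal tabular version of this iterative Bellman backup might look as follows; the learning rate alpha is an assumption of the sketch, not a quantity defined above.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One iterative Bellman backup for tabular Q-learning.

    Q is a 2-D array of shape (num_states, num_actions); alpha is an
    assumed learning rate used only for this illustration.
    """
    td_target = r + gamma * np.max(Q[s_next])   # r_t + γ max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s_t, a_t) towards the target
    return Q
```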
Such an approach is limited to situations where states are low-dimensional and discrete, because it needs to approximate the action value for every state and every action. It is a common practice to use a non-linear neural network as the function approximator to estimate the action-value function in high-dimensional state domains. We refer to the neural network designed to estimate the action value of the state-action pair (s_t, a_t) as the action-value network Q(s_t, a_t; θ_1), with network parameters θ_1. Using a neural network as the approximator has an added advantage: it can generalize to unobserved states, which greatly expands the range of problems the action-value function can be applied to.
However, reinforcement learning can be unstable because of the correlations present in the sequence of observations when a nonlinear function approximator, such as a neural network with rectified linear units, is used to represent the action-value function. The problem is addressed by a variant of Q-learning [22] which consists of an experience replay mechanism and an iterative update. The former randomly samples the data, thereby removing the correlations in the observation sequence and smoothing the data distribution. The latter iteratively updates the action value towards a target value to reduce the correlation with the target. The Q-learning update is performed on the action-value network by minimizing the following loss function:
L(θ_1) = (r_t + γ Q′(s_{t+1}, a_{t+1}; θ_1) − Q(s_t, a_t; θ_1))²        (3)
where s_{t+1} is the next state and Q(s_t, a_t; θ_1) is the action value of the state-action pair (s_t, a_t) under network parameters θ_1. Instead of computing the full gradient over all samples, stochastic gradient descent (SGD), which computes the gradient over a small subset of the samples, is employed to reduce the expensive and time-consuming computation.
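A minibatch version of the loss in Equation (3) can be sketched as follows; q_network and target_q_network are placeholder callables standing in for Q(s, a; θ_1) and its slowly updated copy Q′, and the rule that supplies a_{t+1} is left abstract.

```python
import numpy as np

def critic_loss(batch, q_network, target_q_network, gamma=0.99):
    """Mean squared TD loss of Equation (3) on a sampled minibatch.

    q_network(s, a) and target_q_network(s, a) are assumed callables that
    return arrays of action values; next_actions come from whatever rule
    selects a_{t+1} (e.g., a greedy rule or, later, the target actor).
    """
    states, actions, rewards, next_states, next_actions = batch
    targets = rewards + gamma * target_q_network(next_states, next_actions)
    td_errors = targets - q_network(states, actions)
    # SGD minimizes the mean squared TD-error over the minibatch instead
    # of the full gradient over all stored samples.
    return np.mean(td_errors ** 2)
```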
However, Q-learning cannot be directly extended from low-dimensional to high-dimensional action spaces, because in high-dimensional action domains it is costly to find, at every time step t, the optimal action a_{t+1} that maximizes the Q value under state s_{t+1}. Therefore, we exploit the idea of the deterministic policy gradient (DPG) algorithm [18].
To be more specific, we use the actor function µ(s_t; θ_2) to map state s_t to a deterministic action a_t: a_t = µ(s_t; θ_2). Similar to the definition of the Q network, the actor network µ(s_t; θ_2) refers to the neural network that acts as a function approximator for the actor function µ(s_t), with network parameters θ_2. Therefore, we use two neural networks, the action-value network Q(s_t, a_t; θ_1) and the actor network µ(s_t; θ_2), to approximate the optimal action-value function Q∗(s_t, a_t) and the actor function µ(s_t), respectively. The actor network is updated using the policy gradient [18] (i.e., the gradient of the policy's performance):
∇_{θ_2} Q(s, a; θ_1)|_{s=s_t, a=µ(s_t;θ_2)} = ∇_a Q(s, a; θ_1)|_{s=s_t, a=µ(s_t;θ_2)} ∇_{θ_2} µ(s; θ_2)|_{s=s_t}        (4)
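The chain rule in Equation (4) can be checked on a toy example with a linear critic and a linear actor; this parameterization is purely illustrative and is not the network used in this report.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2

# Toy linear critic Q(s, a) = w_s·s + w_a·a and linear actor a = theta2 @ s
# (illustrative parameterizations only).
w_s = rng.normal(size=state_dim)
w_a = rng.normal(size=action_dim)
theta2 = rng.normal(size=(action_dim, state_dim))

s = rng.normal(size=state_dim)
a = theta2 @ s                        # deterministic action a = µ(s; θ2)

grad_a_Q = w_a                        # ∇_a Q(s, a) for the linear critic
# For the linear actor, ∂a_i/∂θ2[i, j] = s_j, so the chain rule of
# Equation (4) gives ∇_θ2 Q = outer(∇_a Q, s).
policy_gradient = np.outer(grad_a_Q, s)
print(policy_gradient.shape)          # (action_dim, state_dim), same shape as θ2
```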
Previously, large non-linear neural networks with thousands of parameters were shown to be prone to divergence during training, and learning tends to be unstable under these conditions. However, with recently proposed techniques such as the replay buffer, batch normalization [8] and target networks, the training of large non-linear neural networks becomes more stable and more likely to converge. To be more specific, the replay buffer stores previously encountered experiences, and the neural network is trained with randomly selected experiences so that the correlations between the input data are broken. Batch normalization normalizes the input data to zero mean and unit variance, keeping the network parameters within a proper magnitude and thus stabilizing the learning process. The target network approach makes a copy of the trained network, lets this copy change only slowly towards the trained network, and uses it to compute the training targets. This approach, which can be seen as a form of supervised learning, greatly improves the stability of the training process. All these parts constitute the DDPG algorithm.
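For reference, the soft target update that makes the copy slowly track the trained network can be sketched as follows; parameters are represented as lists of NumPy arrays purely for simplicity.

```python
def soft_update(target_params, online_params, eta=0.01):
    """Move the target network slowly towards the trained (online) network.

    Both arguments are assumed to be lists of NumPy weight arrays; eta is
    the small update rate (set to 0.01 in our experiments).
    """
    for target_w, online_w in zip(target_params, online_params):
        # θ' ← (1 − η)·θ' + η·θ, applied in place to every weight array
        target_w *= (1.0 - eta)
        target_w += eta * online_w
```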
2.2 Prioritized Experience Replay
The main component of prioritized experience replay is the criterion used to reflect the importance of each transition. It is natural to use how much the agent can learn from a transition in its current state as this criterion; however, this quantity is easy to conceive of but hard to measure in practice. A good proxy is the absolute TD-error of the transition, which roughly measures how surprising the transition is to the agent. This value is also readily available in RL algorithms such as Q-learning or SARSA, which already compute it and use it to update the policy.
Since the TD-error implicitly reflects how much the agent can learn from an experience, it is natural to consider greedily replaying only the experiences with the largest TD-error, which is called greedy TD-error prioritization. However, this approach has three shortcomings. First, TD-errors are only updated for the transitions that are replayed, so that costly sweeps over the whole replay buffer are avoided. As a consequence, transitions that receive a low TD-error magnitude on their first replay tend not to be replayed for a long time, or may even never be replayed before being discarded, which wastes many potentially valuable transitions. Second, greedy TD-error prioritization only works well when rewards are deterministic. When the rewards are stochastic and the function approximation itself introduces noise (e.g., in simulated control tasks with DDPG), greedy TD-error prioritization performs poorly. Third, greedy TD-error prioritization replays only a small portion of the experiences frequently; since the errors shrink slowly when deep neural networks are used as function approximators, transitions with high TD-error are replayed over and over again. This lack of experience diversity makes the network prone to over-fitting.
Therefore, to alleviate these problems, we introduce some randomness into experience selection, making a trade-off between greedy TD-error prioritization and pure random sampling. This practice is called stochastic prioritization. We guarantee that even transitions with the lowest magnitude of TD-error have a non-zero probability of being replayed, and that a transition's probability of being replayed is proportional to its priority. To be more specific, when sampling the transitions used to train the neural network, we make use of the rank-based prioritized experience replay approach [16]. The main difference between the original paper and ours is that we apply the rank-based prioritized experience replay approach to a new field (i.e., continuous control domains). Concretely, we rank the transitions in the replay buffer according to the absolute value of their TD-error. Based on the intuition that transitions with high TD-error should be sampled more frequently because they are more valuable to the learning process, we define the probability of transition i being sampled as P(i) = D_i^α / Σ_k D_k^α, where D_i = 1/rank(i) > 0 and rank(i) is the rank of transition i in the replay buffer with the absolute TD-error as the criterion. If α is set to zero, this reduces to uniform sampling; when α approaches infinity, it corresponds to greedy prioritization. Therefore, α controls the extent to which prioritization is used.
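The rank-based probabilities can be computed as in the following sketch; the exact data structures in our code may differ, so this is only an illustration of the formula P(i) = D_i^α / Σ_k D_k^α with D_i = 1/rank(i).

```python
import numpy as np

def rank_based_probabilities(td_errors, alpha=0.7):
    """Rank-based sampling probabilities; rank 1 is the largest |TD-error|."""
    td_errors = np.asarray(td_errors)
    order = np.argsort(-np.abs(td_errors))       # indices sorted by decreasing |δ|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    priorities = (1.0 / ranks) ** alpha          # D_i^α with D_i = 1/rank(i)
    return priorities / priorities.sum()

def sample_indices(probabilities, k=32, rng=None):
    # Stochastic prioritization: sample k transitions according to P(i).
    rng = rng or np.random.default_rng()
    return rng.choice(len(probabilities), size=k, p=probabilities, replace=False)
```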
Such randomness has an added advantage. While pure random sampling minimizes the correlations between the transitions used to train the neural network, selecting only transitions with a high magnitude of TD-error inevitably reintroduces correlation: an unsuitable sequence of consecutive actions produces large TD-errors in consecutive transitions, which increases the probability of selecting a run of such correlated transitions. Adding randomness to the selection corrects this problem because it also includes transitions with small TD-errors. Therefore, the drawback brought by biasing towards transitions with large TD-errors is alleviated by such randomness.
Prioritized experience replay inevitably changes the distribution that the expected value converges to, because it tends to sample valuable transitions and thus changes the state visitation frequency. Such a change may significantly affect the convergence of the training process, since reinforcement learning requires unbiased updates. In order to remove the effect of this change, we add the importance-sampling weights W_i = 1/(N^β · P(i)^β) to the computation of the weight updates [11]. If β is set to 1, the prioritized sampling probability P(i) is fully corrected.
The unbiased property of policy updates matters most near the end of training in many RL settings, because the training process is highly dynamic: the policy, the state visitation frequency and the target network change constantly throughout the process, making the training of the neural network prone to divergence. Therefore, we use importance-sampling weights to only slightly correct the sampling probability in the early training stages and increase the correction as training proceeds, reaching full correction at the end of the training process. In other words, in the experiments, we anneal β linearly from 0.5 to 1.
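A sketch of the importance-sampling weights together with the linear β schedule is given below; normalizing by the largest weight, as in Algorithm 1, keeps the scale of the updates bounded.

```python
import numpy as np

def importance_weights(probabilities, beta):
    """W_i = 1 / (N^β · P(i)^β), normalized by the largest weight."""
    n = len(probabilities)
    weights = (n * np.asarray(probabilities)) ** (-beta)
    return weights / weights.max()   # max-normalization keeps updates bounded

def annealed_beta(step, total_steps, beta_start=0.5, beta_end=1.0):
    """Linear schedule: weak correction early, full correction (β = 1) at the end."""
    frac = min(step / float(total_steps), 1.0)
    return beta_start + frac * (beta_end - beta_start)
```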
Importance-sampling weights have an added advantage in our experiments. When training a neural network, small step sizes are necessary to keep the gradient estimates accurate. In our method, prioritization makes transitions with high TD-error be replayed more frequently, while the importance-sampling weights correct the introduced bias and reduce the change in gradient magnitude, which effectively restricts the step size. Under this condition, the gradient can be approximated more accurately, stabilizing the training of the neural networks.
Therefore, we combine the rank-based prioritized experience replay mechanism with the state-of-the-art DDPG algorithm. Our main contribution is to replace the uniform sampling approach with a prioritized sampling approach (see Algorithm 1).
Algorithm 1 DDPG with rank-based prioritization
1: Initialize critic network Q(s_t, a_t; θ_1) and actor network µ(s_t; θ_2) with random weights θ_1, θ_2
2: Initialize replay buffer B with size V
3: Set target networks Q′(s_t, a_t; θ_1) = Q(s_t, a_t; θ_1) and µ′(s_t; θ_2) = µ(s_t; θ_2)
4: Set priority D_1 = 1, exponents α = 0.7, β = 0.5, minibatch size K = 32
5: for episode = 1, H do
6:   Initialize a random process N used to explore actions thoroughly
7:   Receive initial state s_1
8:   for t = 1, T do
9:     Select action a_t = µ(s_t; θ_2) + N_t
10:    Obtain reward r_t and new state s_{t+1}
11:    Store transition (s_t, a_t, r_t, s_{t+1}) in replay buffer B and set D_t = max_{i<t} D_i
12:    if t > V then
13:      for j = 1, K do
14:        Sample transition j with probability P(j) = D_j^α / Σ_i D_i^α
15:        Compute importance-sampling weight W_j = 1 / (N^β · P(j)^β · max_i w_i)
16:        Compute TD-error δ_j = r_j + γ Q′(s_{j+1}, µ′(s_{j+1})) − Q(s_j, a_j)
17:        Update the priority of transition j according to the absolute TD-error |δ_j|
18:      end for
19:      Minimize the loss L = (1/K) Σ_i w_i δ_i² to update the critic network
20:      Update the actor network: ∇_{θ_2} µ|_{s_i} ≈ (1/K) Σ_i ∇_a Q(s, a; θ_1)|_{s=s_i, a=µ(s_i)} ∇_{θ_2} µ(s; θ_2)|_{s_i}
21:      Update the target critic network: Q′(s_t, a_t; θ_1) = (1 − η) · Q′(s_t, a_t; θ_1) + η · Q(s_t, a_t; θ_1)
We now give an intuitive explanation of the characteristics of DDPG with PER. Suppose experience j is selected to update the network. The gradient of the loss function with respect to θ_1 is then

∇_{θ_1} L(θ_1) = δ_j ∇_{θ_1} Q(s, a; θ_1).        (5)

A prioritized sampling strategy tends to replay experiences with high magnitudes of TD-errors. Thus the TD-error used in Equation (5) has a large magnitude, which drives the parameters of the action-value network in the direction of steepest descent. In this case, local optima can be avoided when computing the weights of the action-value network. However, the training process is prone to oscillation, since the change of the network weights is enlarged by the high TD-error. Fortunately, the importance-sampling weight reduces the magnitude of the gradient and can thus stabilize the training process, i.e.,
W_j ∇_{θ_1} L(θ_1) = W_j δ_j ∇_{θ_1} Q(s, a; θ_1).        (6)
In prioritized sampling, the experiences with high TD-errors δ_j are replayed more frequently, which may lead to low accuracy of the gradient estimation. From Equation (6), it is clear that the addition of the IS weight W_j reduces the magnitude of the gradient change. Thus the gradient change is constrained and the gradient can be more accurately approximated, which greatly stabilizes the training process.
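The effect of the IS weight on the gradient magnitude can be seen in a small numerical example; all numbers below are illustrative only.

```python
import numpy as np

# Suppose transition j has a large TD-error and, because of prioritization,
# a relatively high sampling probability P(j).
delta_j = 4.0                          # TD-error δ_j
grad_q = np.array([0.3, -0.8, 0.5])    # ∇_θ1 Q(s, a; θ1) for this transition

unweighted = delta_j * grad_q          # Equation (5): a large step, prone to oscillation

n, p_j, beta = 10_000, 0.01, 1.0       # buffer size, sampling probability, full correction
w_j = (n * p_j) ** (-beta)             # importance-sampling weight W_j
weighted = w_j * unweighted            # Equation (6): magnitude reduced by W_j

print(np.linalg.norm(unweighted), np.linalg.norm(weighted))
```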
As shown in Fig. 1, uniform experience replay usually selects experiences with low magnitudes of TD-errors and gradually minimizes the loss function, which may fall into a local optimum. Prioritized experience replay with IS weights makes the replay incline towards experiences with larger magnitudes of TD-errors. Therefore, it optimizes the loss function along a steeper path, and its training process remains rather stable thanks to the constraints imposed by the IS weights. In contrast, prioritized experience replay without IS weights may suffer from larger step sizes and incline towards a local optimum; besides, its training process may oscillate. By introducing IS weights, prioritized experience replay can reach a more global and stable solution to the reinforcement learning problem.
Figure 1: Illustration of the behavior of uniform sampling, PER without IS and PER with IS during the training process, respectively. The contour lines depict the loss function L(θ) over all experiences.
3 Experiments
To test the proposed DDPG with PER algorithm, several experiments are conducted. The performance of the proposed algorithm is compared with the original DDPG algorithm and related state-of-the-art algorithms.
The network we use is the same as the low-dimensional network introduced in [10], except that the size of the replay buffer is set to 10^4 to speed up training. More specifically, Adam [9] is used to update the parameters of the networks, with learning rates of 10^−4 and 10^−3 for the actor and critic network, respectively. To avoid over-fitting, an L2 regularization term is added to the loss function of the critic network. The discount factor γ is set to 0.99, and the update rate of the target networks is set to η = 0.01. A tanh activation is applied to the final layer of the actor network to restrict the actions to an appropriate range. The actions are fed into the critic network at its last hidden layer. Each neural network has one input layer, two hidden layers and one output layer, with 400 and 300 units in the two hidden layers, respectively. The parameters of all layers of both the critic and the actor networks, except the final layers, are initialized from uniform distributions whose ranges depend on the input dimension of the layer. The parameters of the final layers are drawn from the uniform distribution [−3 × 10^−3, 3 × 10^−3]. For all experiments, an Ornstein-Uhlenbeck (OU) process [21] with θ = 0.15 and σ = 0.2 is used as the exploration noise.
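A sketch of the OU noise generator with these parameters is given below; it uses a simple Euler discretization with unit time step, which may differ slightly from the discretization in our code.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0, seed=None):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.state = np.full(action_dim, mu, dtype=np.float64)
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = θ(µ − x) dt + σ dW, discretized with dt = 1
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * self.rng.standard_normal(self.state.shape)
        self.state += dx
        return self.state.copy()
```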
A circular queue is used to store the experiences, and a binary heap is utilized as the data structure of the priority queue to minimize the additional time cost of prioritized sampling (see the sketch after this paragraph). The minibatch size is set to 32. The observations the agents receive are low-dimensional vectors describing the agent and the environment (e.g., velocities, joint angles and coordinates). The MuJoCo physics engine [20] is used to test all the algorithms, and the OpenAI Gym platform [4] is used to load the MuJoCo models as well as to record the average rewards and the training episodes. The model is implemented in TensorFlow [1]. Four criteria are selected to evaluate the performance of the algorithms, i.e., rewards, learning speed, learning stability and sensitivity to the hyperparameters.
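As an illustration of the priority structure mentioned above, the following sketch emulates a max-heap over absolute TD-errors with Python's heapq module; it is a simplified stand-in for the actual implementation and, as in [16], the heap order only approximates an exact sorting by rank.

```python
import heapq
import itertools

class PriorityQueue:
    """Binary max-heap over |TD-error|, built on heapq (a min-heap)."""

    def __init__(self):
        self.heap = []                     # entries are (-|δ|, insertion id, transition index)
        self.counter = itertools.count()   # tie-breaker so heapq never compares transitions

    def push(self, td_error, transition_index):
        heapq.heappush(self.heap, (-abs(td_error), next(self.counter), transition_index))

    def pop_most_surprising(self):
        # Returns the transition with the currently largest |TD-error| and its priority.
        neg_priority, _, transition_index = heapq.heappop(self.heap)
        return transition_index, -neg_priority
```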
The experiments are carried out for five typical simulated continuous control tasks as shown in Fig. 2.
Figure 2: OpenAI Gym tasks in the experiments. (1)∼(5): inverted pendulum, inverted double pendulum, hopper, halfcheetah and walker.
Figure 3: Learning performance regarding averaged rewards for the tasks of (1)∼(5): inverted pendulum, inverted double pendulum, hopper, halfcheetah and walker. Each experiment is run 3 times.
The experimental results of DDPG with PER are shown in Fig. 3 with comparison to the original DDPG with uniform sampling. For the inverted pendulum task, as shown in Fig. 3(1), it is clear
that DDPG with PER outperforms DDPG with uniform sampling in terms of training episodes. DDPG with PER takes about 700 episodes to achieve a total reward of around 600, while DDPG with uniform sampling takes around 1300 episodes to achieve a total reward of 100. In addition, fewer spikes occur in the reward curve of DDPG with PER, which demonstrates its more stable performance.
For the other four simulated tasks, i.e., the inverted double pendulum, halfcheetah, hopper and walker tasks, the experimental settings are the same as for the inverted pendulum task, except that in halfcheetah the maximum number of steps per episode is set to 500 for the first 500 episodes and then increased to 2000 to mitigate the effect of the less accurate action-value estimates in the early stages. As shown in Fig. 3, it is clear that the reward curve of DDPG with PER is almost monotonically increasing over the whole training process, while DDPG with uniform sampling is unstable. In addition, DDPG with PER achieves average rewards of 2000 and 800 in the hopper and walker tasks, respectively, outperforming the original DDPG with uniform sampling within the same number of training episodes.
For continuous control problems, there have been several effective algorithms, such as A3C [12], Q-Prop [5], TRPO [17] and DDPG. In particular, according to [5], conservative Q-Prop with trust-region updates outperforms the aggressive version on OpenAI Gym tasks. Hence we further compare the proposed DDPG with PER (PER-DDPG) against TRPO, the conservative trust-region Q-Prop algorithm (TR-c-Q-Prop) and the original DDPG on the halfcheetah, hopper and walker tasks.
Figure 4: Learning performance with respect to different hyperparameters (replay buffer size V = 10,000, 100,000, 1,000,000 and target update rate η = 0.01, 0.05, 0.1) for DDPG with PER and uniform sampling, respectively. (1)∼(5): inverted pendulum, inverted double pendulum, hopper, halfcheetah and walker. Each experiment is run 3 times.
The experimental results are shown in Table 1. The maximum average reward of DDPG with PER is almost twice that of the other three algorithms in the hopper and walker tasks. Besides, in the
halfcheetah task, DDPG with PER obtains far higher average rewards (21,000) within fewer training episodes (500) compared with the other three algorithms. All these results further demonstrate the effectiveness of DDPG with PER.
4 Conclusion
A novel DDPG with PER algorithm is proposed for continuous control, in which the uniform experience replay of the DDPG method is replaced by prioritized experience replay. Compared with the original DDPG with uniform sampling and other state-of-the-art methods, the experimental results show that DDPG with PER achieves better performance in terms of time efficiency, stability of the learning process and robustness to the hyperparameters. Our future work will focus on prioritization criteria beyond the TD-error, the experience discarding mechanism, and applications to more complicated tasks such as object grasping and humanoid walking.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16),
pages 265–283, 2016.
[2] Michael D Abràmoff, Paulo J Magalhães, and Sunanda J Ram. Image processing with ImageJ. Biophotonics International, 11(7):36–42, 2004.
[3] Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for real-time reinforcement
learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
42(2):201–212, 2012.
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Shixiang Gu, Timothy P. Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-prop:
Sample-efficient policy gradient with an off-policy critic. In ICLR. OpenReview.net, 2017.
[6] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks.
science, 313(5786):504–507, 2006.
[7] Yuenan Hou, Lifeng Liu, Qing Wei, Xudong Xu, and Chunlin Chen. A novel DDPG method with prioritized
experience replay. In Systems, Man, and Cybernetics (SMC), 2017 IEEE International Conference on,
pages 316–321. IEEE, 2017.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[10] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David
Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.
[11] A Rupam Mahmood, Hado P van Hasselt, and Richard S Sutton. Weighted importance sampling for
off-policy learning with linear function approximation. In Advances in Neural Information Processing
Systems, pages 3014–3022, 2014.
[12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley,
David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In
International conference on machine learning, pages 1928–1937, 2016.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra,
and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,
2013.
[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through
deep reinforcement learning. Nature, 518(7540):529, 2015.
[15] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2):257–286, 1989.
[16] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv
preprint arXiv:1511.05952, 2015.
[17] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy
optimization. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1889–1897.
JMLR.org, 2015.
[18] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Determin-
istic policy gradient algorithms. In ICML, 2014.
[19] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.
[20] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In
2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE,
2012.
[21] George E Uhlenbeck and Leonard S Ornstein. On the theory of the Brownian motion. Physical Review,
36(5):823, 1930.
[22] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College,
Cambridge, 1989.