
Received July 6, 2019, accepted July 21, 2019, date of publication July 31, 2019, date of current version August 15, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2932257

Deep Reinforcement Learning With Optimized Reward Functions for Robotic Trajectory Planning

JIEXIN XIE 1,2, ZHENZHOU SHAO 1,2,3, YUE LI 1,2, YONG GUAN 1,2,3, AND JINDONG TAN 4, (Member, IEEE)
1 Information Engineering College, Capital Normal University, Beijing 100048, China
2 Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China
3 Beijing Key Laboratory of Light Industrial Robot and Safety Verification, Capital Normal University, Beijing 100048, China
4 Engineering College, University of Tennessee, Knoxville, TN 37996, USA

Corresponding author: Zhenzhou Shao ([email protected])


This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFB1303000, in part
by the Project of Beijing Municipal Commission of Education under Grant KM201710028017, in part by the National Natural Science
Foundation of China under Grant 61702348, Grant 61772351, Grant 61602326, and Grant 61602324, in part by the Project of the Beijing
Municipal Science and Technology Commission under Grant LJ201607, in part by Capacity Building for Sci-Tech Innovation-Fundamental
Scientific Research Fund under Grant 025185305000, and in part by the Youth Innovative Research Team of Capital Normal University.

ABSTRACT This paper aims to improve the efficiency of deep reinforcement learning (DRL)-based methods for robotic trajectory planning in unstructured working environments with obstacles. Different from the traditional sparse reward function, this paper presents two brand-new dense reward functions. First, an azimuth reward function is proposed to accelerate the learning process locally and to yield a more reasonable trajectory by modeling the position and orientation constraints, which reduces the blindness of exploration dramatically. To further improve the efficiency, a reward function at the subtask level is proposed to provide global guidance for the agent in the DRL. The subtask-level reward function is designed under the assumption that the task can be divided into several subtasks, which reduces invalid exploration greatly. Extensive experiments show that the proposed reward functions are able to improve the convergence rate by up to three times with the state-of-the-art DRL methods. The convergence mean increases by 2.25%-13.22%, and the standard deviation decreases by 10.8%-74.5%.

INDEX TERMS Deep reinforcement learning, robot manipulator, trajectory planning, reward function.

I. INTRODUCTION
Trajectory planning is a fundamental problem in the motion control of robot manipulators. Conventional trajectory planning methods are usually appropriate for structured environments [1]-[5]. However, in practice the working environment of a robot manipulator may vary across complex tasks. In recent years, trajectory planning with Deep Reinforcement Learning (DRL) has paved an alternative way to solve this problem [6]-[8]. It enables the robot manipulator to autonomously learn and plan an optimal trajectory in an unstructured working environment. As shown in Fig. 1, the agent in DRL explores the possible actions with a ''Trial and Error'' mechanism [9], [10], according to the current state of the robot manipulator. By maximizing the accumulated reward with the optimization strategy, the robot manipulator can finish the trajectory planning task in an unstructured environment [11]-[13].

(The associate editor coordinating the review of this manuscript and approving it for publication was Francisco J. Garcia-Penalvo.)

FIGURE 1. Scheme of Deep Reinforcement Learning.

In DRL, the typical optimization strategies include Deep Q Network (DQN) [14]-[16], Deep SARSA (State Action Reward State Action) [17], Rainbow [18], and so on.

However, the action spaces yielded by those methods are discrete, so they cannot be applied to tasks with continuous action spaces, such as the trajectory planning of a robot manipulator. To solve this problem, Deep Deterministic Policy Gradient (DDPG) [19] and Asynchronous Advantage Actor-Critic (A3C) [20] were put forward. Through nonlinear approximation, DDPG makes the action space continuous, and Paolo et al. [11] further improved the performance of DDPG with an asynchronous execution strategy. However, the efficiency of DDPG is low due to the operation of experience replay. This weakness is addressed by the asynchronous update in A3C, whose multithreaded implementation is capable of improving the learning efficiency greatly. However, the drawback of A3C is its fixed learning rate; especially in a complicated working environment, the robustness of A3C may degrade badly. Recently, Distributed Proximal Policy Optimization (DPPO) was proposed [21]. DPPO introduces a penalty term that provides a more reasonable update proportion, thereby reducing the impact of an unreasonable learning rate.

Nevertheless, randomness and blindness are still problems of DRL methods. In particular, when the unstructured working environment contains obstacles, this issue becomes much more prominent. The core of this problem is the reward function. To the best of our knowledge, all the reward functions used in robot manipulator trajectory planning tasks are sparse reward functions. The value of a sparse reward function is zero everywhere except for a few places (the obstacles and the target in the trajectory planning task) [22]. This kind of reward function always leads to a lot of ineffective exploration, which decreases the efficiency of the algorithm severely [23]-[25]. To cope with this problem, we present an optimized method for robotic trajectory planning. The primary contributions of this paper are summarized as follows:

1) Considering the features of the trajectory planning task and the working environment, two brand-new dense reward functions are proposed. A dense reward function gives non-zero rewards most of the time, which is quite different from a sparse reward function. Dense reward functions provide more information after each action, which reduces the blindness of exploration of DRL methods in the trajectory planning task.

2) First, the azimuth reward function is proposed. It consists of a position reward function and an orientation reward function. The position reward function is built from a Gaussian distribution and the triplet loss function; the orientation reward function is modeled by Coulomb's law. The azimuth reward function gives the robot manipulator a reasonable constraint on position and orientation during exploration, thereby reducing invalid exploration.

3) To further improve learning efficiency, we propose another reward function at the subtask level to provide global guidance for the agent. The subtask-level reward function is built on the idea of serialization. With this structure, we model the characteristics of each subtask accurately while greatly reducing the computation overhead.

The rest of this paper is organized as follows. The structure of the azimuth reward function is presented in Section II. In Section III, the subtask-level reward function is introduced. The implementation of the reward functions is illustrated in Section IV, which mainly discusses how to implement the proposed reward functions on the current mainstream DRL methods. Next, experimental results are demonstrated and discussed in Section V. Finally, conclusions are drawn in Section VI.

II. AZIMUTH REWARD FUNCTION
For DRL-based methods, blindness of exploration in unstructured environments is a major problem. As a consequence, the trajectory planning task always suffers from inefficiency and poor robustness. To cope with this problem, we replace the traditional sparse reward function with a new dense reward function. A dense reward function gives more information after each action, but it is much more difficult to construct than a sparse one. In this paper, we select the position and orientation as constraints to build the azimuth reward function for DRL-based methods. The azimuth reward function uses the relative position and relative orientation among the endpoint of the robot manipulator, the obstacle, and the target, captured by a position reward function and an orientation reward function, respectively. The azimuth reward function improves the learning process locally and yields a more reasonable trajectory by giving each action a more accurate and comprehensive evaluation, which helps the robot manipulator reduce the blindness of exploration effectively and improve work efficiency.

A. POSITION REWARD FUNCTION
In an unstructured environment with obstacles, the robot manipulator must avoid obstacles and navigate itself to the destination in real time. Therefore, our position reward function consists of two items: obstacle avoidance and target guidance. The obstacle avoidance term is responsible for alerting the robot manipulator to keep a certain safety distance from the obstacles, while the target guidance term motivates the robot manipulator to reach the target as soon as possible.

1) OBSTACLE AVOIDANCE
A Gaussian distribution is used to model obstacle avoidance. E denotes the endpoint of the robot manipulator, the obstacle is denoted by O, and D_EO is the relative distance between E and O. The risk of collision increases as D_EO decreases, and the robot manipulator receives more punishment accordingly. Obstacle avoidance is modeled by the Gaussian function f_obstacle(D_EO) in (1).

f_{obstacle}(D_{EO}) = -\frac{1}{\sqrt{2\pi}} \, e^{-D_{EO}^{2}/2}. \quad (1)

2) TARGET GUIDANCE
Inspired by the idea of the triplet loss function [26], [27], we describe target guidance as shown in (2).

f_{triplet}(D_{EO}, D_{ET}) = \left[ D_{EO}^{2} - D_{ET}^{2} - \alpha \right]_{+}, \quad (2)

where [.]_+ denotes the operation whose result is kept when the expression inside [.]_+ is non-negative and is 0 otherwise. D_ET indicates the relative distance between the endpoint of the robot manipulator E and the target T, and α is the margin [28] between D_ET and D_EO. The value of α needs to be adjusted according to the actual working environment; in this paper, we set α to 0.095 empirically.
By combining obstacle avoidance and target guidance, we describe the position reward function as (3):

R_{location}(D_{EO}, D_{ET}) = f_{obstacle}(D_{EO}) + f_{triplet}(D_{EO}, D_{ET}). \quad (3)
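Equations (1)-(3) depend only on the two relative distances, so the position reward can be computed in a few lines. The following is a minimal Python sketch; the function and variable names are ours rather than from the paper's code, and only the margin value follows the text:

```python
import math

ALPHA = 0.095  # margin between D_ET and D_EO, set empirically in the paper


def f_obstacle(d_eo):
    """Obstacle-avoidance term of Eq. (1): a negative Gaussian that punishes
    the agent more strongly as the endpoint approaches the obstacle."""
    return -math.exp(-0.5 * d_eo ** 2) / math.sqrt(2.0 * math.pi)


def f_triplet(d_eo, d_et, alpha=ALPHA):
    """Target-guidance term of Eq. (2), inspired by the triplet loss:
    positive only while D_EO^2 - D_ET^2 - alpha is non-negative."""
    return max(d_eo ** 2 - d_et ** 2 - alpha, 0.0)


def position_reward(d_eo, d_et):
    """Position reward of Eq. (3): obstacle avoidance plus target guidance."""
    return f_obstacle(d_eo) + f_triplet(d_eo, d_et)


# Example: endpoint 0.6 m from the obstacle and 0.2 m from the target.
print(position_reward(0.6, 0.2))  # approximately -0.11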

B. ORIENTATION REWARD FUNCTION
How to avoid obstacles safely and rapidly is a crucial issue in an unstructured environment with obstacles. In practice, for the endpoint of the robot manipulator, the directions of relative movement toward the target and away from the obstacle often overlap, which increases the difficulty of obstacle avoidance. In this case, it is necessary to design a strategy for choosing a reasonable direction.
Motivated by the attraction and repulsion between electric charges, we model the orientation reward function for trajectory planning according to Coulomb's law [29]. The relation between the obstacle and the endpoint of the robot manipulator can be described as like charges repelling each other. Similarly, the relation between the target and the endpoint of the robot manipulator can be expressed as unlike charges attracting each other.

FIGURE 2. Scheme of orientation reward function.

The orientation reward function is illustrated in Fig. 2, where ET' is the attraction vector toward the target and EO' is the repulsion vector away from the obstacle. The expressions of ET' and EO' are formulated in (4) and (5):

\vec{ET'} = \frac{Q_E Q_T}{r_1^2} \cdot \frac{\vec{ET}}{|\vec{ET}|}, \quad (4)

\vec{EO'} = -\frac{Q_E Q_O}{r_2^2} \cdot \frac{\vec{EO}}{|\vec{EO}|}, \quad (5)

where r_1 is the relative distance between the target and the endpoint of the robot manipulator, and r_2 is the relative distance between the obstacle and the endpoint of the robot manipulator. Q_E denotes the charge of the endpoint of the robot manipulator, Q_O is the charge of the obstacle, and Q_T represents the charge of the target. In practice, the attraction of the target should be greater than the repulsion of the obstacle; otherwise, the robot manipulator may never reach the target while avoiding obstacles. In this paper, we set Q_T to twice Q_O. EB indicates the desired direction of relative movement, and EC is the actual motion vector. The angle between EB and EC is written as φ, which measures the similarity between the current motion vector and the desired motion vector produced by the agent. A smaller φ means a higher similarity between the two vectors, and φ can be calculated by (6):

\varphi = \arccos \frac{(\vec{EO'} + \vec{ET'}) \cdot \vec{EC}}{|\vec{EO'} + \vec{ET'}| \cdot |\vec{EC}|}. \quad (6)

Combining all the factors above, the proposed orientation reward function is shown in (7), where τ is a compensating parameter. In this paper, we set τ to 0.785 according to experimental experience.

R_{orientation}(\varphi) = \tau - \frac{\varphi \times \pi}{180}. \quad (7)
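The attraction/repulsion construction of (4)-(7) reduces to a handful of vector operations. A small NumPy sketch is given below; the charge values and example positions are illustrative assumptions (with Q_T set to twice Q_O, as in the text), and the function names are ours:

```python
import numpy as np

Q_E, Q_O, Q_T = 1.0, 1.0, 2.0  # "charges" of endpoint, obstacle, target (Q_T = 2 * Q_O)
TAU = 0.785                    # compensating parameter tau of Eq. (7)


def orientation_reward(p_end, p_obst, p_target, motion_vec):
    """Coulomb-style orientation reward of Eqs. (4)-(7).

    p_end, p_obst, p_target: 3-D positions; motion_vec: actual endpoint motion (EC).
    """
    et = p_target - p_end                      # vector from endpoint E to target T
    eo = p_obst - p_end                        # vector from endpoint E to obstacle O
    r1, r2 = np.linalg.norm(et), np.linalg.norm(eo)

    et_prime = (Q_E * Q_T / r1 ** 2) * et / r1     # attraction vector, Eq. (4)
    eo_prime = -(Q_E * Q_O / r2 ** 2) * eo / r2    # repulsion vector, Eq. (5)

    desired = et_prime + eo_prime                  # desired motion direction (EB)
    cos_phi = desired @ motion_vec / (np.linalg.norm(desired) * np.linalg.norm(motion_vec))
    phi = np.degrees(np.arccos(np.clip(cos_phi, -1.0, 1.0)))   # angle phi of Eq. (6), in degrees

    return TAU - phi * np.pi / 180.0               # Eq. (7): positive when phi is below ~45 degrees


# Example call with illustrative positions and a small motion step toward the target.
print(orientation_reward(np.array([0.0, 0.0, 0.0]),
                         np.array([0.3, 0.3, 0.0]),
                         np.array([0.8, 0.0, 0.0]),
                         np.array([0.05, -0.01, 0.0])))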
C. MODELING OF AZIMUTH REWARD FUNCTION
In the process of trajectory planning, position and orientation are two key factors to be considered comprehensively. However, the working environment of the robot manipulator is complex, and the weights of the two items in the azimuth reward function differ across scenarios. To solve this problem, we introduce a weight vector λ = [λ_location, λ_orientation] to build the azimuth reward function. As shown in Fig. 3, the workspace around an obstacle is divided into three parts: safety, warning, and danger areas. To improve the learning efficiency, λ is adjusted dynamically across the different working areas.

FIGURE 3. Hazard division of work area for robot manipulator.

In the safety area, the position reward function plays a leading role; in the warning area, as the robot manipulator approaches the obstacle, the weight of the position reward function decreases and that of the orientation reward function increases; in the danger area, the orientation reward function begins to dominate. The adjustment strategy of λ is summarized in (8):

\lambda = \begin{cases} [\,0, \; 1\,]^{T}, & E \in \text{danger} \\ \left[ \dfrac{D_{EO} - d_w}{d_m - d_w}, \; \dfrac{d_m - D_{EO}}{d_m - d_w} \right]^{T}, & E \in \text{warning} \\ [\,1, \; 0\,]^{T}, & E \in \text{safety} \end{cases} \quad (8)

where d_w and d_m are the radii of the danger area and the warning area, as shown in Fig. 3. Combining with the weight λ, the final expression of the azimuth reward function is defined as (9):

R = \begin{bmatrix} \lambda_{location} R_{location}(D_{EO}, D_{ET}) \\ \lambda_{orientation} R_{orientation}(\varphi) \end{bmatrix}^{T}. \quad (9)
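Reading (9) as the λ-weighted combination of the position and orientation terms, the area-dependent schedule of (8) can be coded as follows. The radii d_w and d_m are scene-dependent and the default values below are only placeholders; position_reward refers to the sketch given after (3), and the orientation value is assumed to come from the sketch given after (7):

```python
def azimuth_weights(d_eo, d_w=0.1, d_m=0.3):
    """Weight vector lambda of Eq. (8), chosen by the hazard area the endpoint is in.

    d_eo: endpoint-obstacle distance; d_w, d_m: radii of the danger and warning
    areas (placeholder values, to be tuned per scene as in Fig. 3).
    """
    if d_eo <= d_w:                        # danger area: orientation term dominates
        return 0.0, 1.0
    if d_eo <= d_m:                        # warning area: linear blend of the two terms
        lam_loc = (d_eo - d_w) / (d_m - d_w)
        lam_ori = (d_m - d_eo) / (d_m - d_w)
        return lam_loc, lam_ori
    return 1.0, 0.0                        # safety area: position term dominates


def azimuth_reward(d_eo, d_et, phi_reward, d_w=0.1, d_m=0.3):
    """Azimuth reward of Eq. (9), read as the weighted sum of the two terms.

    phi_reward: the value of R_orientation(phi) computed from Eq. (7).
    """
    lam_loc, lam_ori = azimuth_weights(d_eo, d_w, d_m)
    return lam_loc * position_reward(d_eo, d_et) + lam_ori * phi_reward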


III. SUBTASK-LEVEL REWARD FUNCTION
Although the azimuth reward function proposed in Section II is able to reduce the local blindness of exploration with DRL methods, it mainly focuses on the local exploration at one moment and lacks global guidance for the trajectory planning task. Therefore, another reward function at the subtask level is designed to provide macroscopic guidance. Like the azimuth reward function, the subtask reward function is also a dense reward function: it works at all times during exploration and gives each action a meaningful evaluation at the subtask level. The trajectory planning task of the robot manipulator can be divided into several simple subtasks, mainly target approaching and obstacle avoidance. Considering the characteristics of the subtasks, the reward functions should be designed specifically. Thus, to further improve the learning efficiency of trajectory planning, the corresponding subtask-level reward functions, namely the target approaching and obstacle avoidance reward functions, are proposed in this paper.

FIGURE 4. Diagram of subtask switching.

The robot manipulator switches between the two subtasks during trajectory planning, as shown in Fig. 4. Initially, the robot manipulator moves toward the target with the target approaching reward function. During the exploration, if the relative distance between the robot manipulator and the obstacle is less than a threshold, the obstacle avoidance subtask is activated. When the robot manipulator bypasses the obstacle, it switches back to the target approaching subtask. The subtask switching is determined by the threshold on the relative distance, which can be adjusted according to the actual requirement, as different thresholds will result in different motion trajectories. The details are introduced as follows.

A. DETERMINATION OF SUBTASK REWARD FUNCTION
In this paper, the determination of the subtask reward function is implemented based on the time nodes in a discipline sequence, as shown in Fig. 5. Each small box represents a time node, and the value at each time node can only be 0 or 1. The values are calculated from the relative distances, namely the distances from the robot manipulator to the target and to the obstacle in the target approaching and obstacle avoidance subtasks, respectively. The value of each node depends only on the relative distances at the current time node and the previous adjacent time node. Specifically, if the relative distance at the current time node is smaller than that at the previous adjacent one, the value of the node is 1; otherwise, it equals 0.

FIGURE 5. Calculation structure of subtask reward function.

The parameter n in the blue box indicates the number of time nodes in the discipline sequence considered for the reward calculation. A larger n means a stricter requirement of the critic network on the actions. The output of the reward function is completely determined by the values of the time nodes in the discipline sequence. From the characteristics of the discipline sequence, we can see that only one operation is required for each decision, namely comparing the relative distance at the current time node with that at the previous adjacent time node. The values of the remaining n - 1 time nodes can be read from the previous results. Based on such a computation structure, the computation overhead can be reduced greatly.
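In other words, the discipline sequence is a sliding window of n binary nodes in which each new decision costs a single distance comparison while the other n - 1 values are reused. A possible Python sketch follows; the class and variable names are ours, not from the paper:

```python
from collections import deque


class DisciplineSequence:
    """Sliding window of n binary time nodes used by the subtask reward functions.

    A node is 1 if the relevant relative distance shrank since the previous time
    node, and 0 otherwise; only one comparison is needed per decision.
    """

    def __init__(self, n):
        self.n = n
        self.nodes = deque(maxlen=n)   # older values are reused, not recomputed
        self.prev_distance = None

    def update(self, distance):
        """Append the node for the current time step and return (P, Q)."""
        if self.prev_distance is not None:
            self.nodes.append(1 if distance < self.prev_distance else 0)
        self.prev_distance = distance
        p = sum(self.nodes)            # number of 1s in the window
        q = len(self.nodes) - p        # number of 0s in the window
        return p, q


# Example: n = 9 gave the best results in scene A of the experiments reported later.
seq = DisciplineSequence(n=9)
for d in [0.9, 0.85, 0.83, 0.86, 0.8]:
    print(seq.update(d))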

B. TARGET APPROACHING REWARD FUNCTION
In the target approaching subtask, if each action invariably brings the robot manipulator closer to the target, the agent gets the maximum reward, as shown in Fig. 6. P is the number of 1s in the discipline sequence, while Q is the number of 0s. When P is equal to n, the robot manipulator gets the maximum reward; in this article, this value is set to 5. If P is equal to Q, the output of the subtask reward function is 0, and if Q goes to n, the maximal negative reward is output. The arithmetic expression of the target approaching subtask reward function is summarized as (10).

FIGURE 6. Target approaching subtask reward function.

R_{Target\text{-}approaching}(P, Q) = Maximum \times \frac{P - Q}{n}. \quad (10)

C. OBSTACLE AVOIDANCE REWARD FUNCTION
When the relative distance between the robot manipulator and the obstacle is less than the set threshold, the agent switches to the obstacle avoidance subtask. It should be emphasized that the relative distance used in the obstacle avoidance subtask is different from that in the target approaching subtask. In the process of obstacle avoidance, we hope the robot manipulator takes a smooth and appropriate trajectory. An appropriate trajectory means the robot manipulator can bypass the obstacle safely while taking as few detours as possible. A desired obstacle avoidance trajectory is shown in Fig. 7, in which the relative distance between the robot manipulator and the obstacle oscillates. For this purpose, we design the obstacle avoidance subtask reward function as in Fig. 8.

FIGURE 7. Desired obstacle avoidance trajectory.

FIGURE 8. Obstacle avoidance subtask reward function.

Similarly, P indicates the number of 1s and Q the number of 0s. First, we compute the absolute value of the difference between P and Q. If the result is 0, the trajectory is closest to our expectation, and the agent gets the maximum reward. When the result equals n/2, the reward function outputs 0. In the worst situation, the result equals n, which means the robot manipulator moves straight toward the obstacle or straight away from it, and the maximal negative reward is output. We summarize the obstacle avoidance subtask reward function as (11).

R_{obstacle\text{-}avoidance}(P, Q) = Maximum \times \frac{n - 2 \times |P - Q|}{n}. \quad (11)
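With P and Q taken from the discipline sequence above, (10) and (11) become one-liners, and the threshold-based switching between the two subtasks can be wrapped around them. A brief sketch follows; MAX_REWARD = 5 follows the text for (10), while the switching threshold is a placeholder value of our own:

```python
MAX_REWARD = 5          # maximum subtask reward, set to 5 in the paper
SWITCH_THRESHOLD = 0.2  # endpoint-obstacle distance that activates obstacle avoidance
                        # (placeholder; tuned per task in the paper)


def target_approaching_reward(p, q, n):
    """Eq. (10): maximal when every node is 1, i.e. every action moved closer to the target."""
    return MAX_REWARD * (p - q) / n


def obstacle_avoidance_reward(p, q, n):
    """Eq. (11): maximal when P == Q, i.e. the obstacle distance oscillates around a
    constant value while the manipulator goes around the obstacle."""
    return MAX_REWARD * (n - 2 * abs(p - q)) / n


def subtask_reward(d_eo, p, q, n):
    """Pick the active subtask from the endpoint-obstacle distance and score it."""
    if d_eo < SWITCH_THRESHOLD:
        return obstacle_avoidance_reward(p, q, n)
    return target_approaching_reward(p, q, n)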


IV. IMPLEMENTATION OF REWARD FUNCTION
In this section, we introduce how to implement the proposed reward functions in the mainstream DRL methods. By comparing different DRL methods, it can be noticed that the performance of methods based on an actor network and a critic network (AC frame) is much better than that of methods using an actor network (A frame) or a critic network (C frame) alone. In view of this, this paper mainly introduces the implementation of the reward functions on methods with the AC frame. As shown in Fig. 9, the learning procedure of the robot manipulator is composed of four stages: initialization, action selection, reward calculation, and network training. At the initialization stage, the actor network µ(S, L|θ^µ) and the critic network Q(S, L, a|θ^Q) are initialized randomly. The critic network is responsible for judging the value of the action, and the actor network is used to predict which action will be performed. The weights of the actor network and the critic network are denoted as θ^µ and θ^Q, and L indicates the subtask. In the action selection stage, the environment state S describes the relative distances between the robot manipulator and the obstacle or target. By considering the environment state S, the subtask L, and the value given by the critic network, the actor network computes the joint torques (action) and puts them into effect. The next stage is reward calculation: the reward for the current action is computed by the subtask reward function and the azimuth reward function, and the result is sent to the critic network for training and evaluation.

FIGURE 9. Diagram of the training process for DRL with AC frame.

The main work in the network training stage is updating the weights of the networks. In this phase, the environmental status S, subtask L, action a, and reward R are taken into account comprehensively. The overall process is summarized as Algorithm 1, where M is the maximum number of episodes (training time) and T is the maximal number of training steps in each episode.

Algorithm 1 Trajectory Planning Algorithm With Subtask Reward Function and Azimuth Reward Function
Require: Environmental status S, subtask L.
Ensure: Action a.
1: Initialize actor network µ(S, L|θ^µ) and critic network Q(S, L, a|θ^Q)
2: for episode = 1 to M do
3:   for t = 1 to T do
4:     a_t ← µ(S, L|θ^µ)
5:     R_subtask ← F(S, L)
6:     Compute R_azimuth
7:     R = R_azimuth + R_subtask
8:     Update the weights of the actor network θ^µ
9:     Update the weights of the critic network θ^Q
10:   end for
11: end for
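Algorithm 1 maps onto a conventional actor-critic training loop, with the only change being how the scalar reward is assembled at each step. The skeleton below is a hedged sketch rather than the authors' implementation: env, actor, and critic stand for a simulator wrapper and the two networks of whichever AC-frame method is used (DDPG, A3C, or DPPO), their methods are assumed interfaces, and the reward helpers are the sketches from the previous sections.

```python
def train(env, actor, critic, M=1000, T=200, n=9):
    """Schematic version of Algorithm 1: AC-frame training with the combined reward.

    env is assumed to wrap the simulation scene and expose reset(), step(action), and
    the relative distances/vectors used by the reward sketches above (assumed helpers).
    """
    for episode in range(M):                                   # episode = 1 .. M
        state = env.reset()
        seq = DisciplineSequence(n)                            # fresh discipline sequence
        for t in range(T):                                     # t = 1 .. T
            d_eo, d_et, motion_vec = env.relative_state()      # distances and last motion
            subtask = "avoid" if d_eo < SWITCH_THRESHOLD else "approach"

            action = actor.select_action(state, subtask)       # a_t <- mu(S, L | theta_mu)
            next_state, done = env.step(action)

            # Azimuth reward: position/orientation terms weighted by the hazard area.
            r_orient = orientation_reward(env.endpoint(), env.obstacle(),
                                          env.target(), motion_vec)
            r_azimuth = azimuth_reward(d_eo, d_et, r_orient)

            # Subtask-level reward from the discipline sequence of the active subtask.
            p, q = seq.update(d_eo if subtask == "avoid" else d_et)
            r_subtask = subtask_reward(d_eo, p, q, n)

            reward = r_azimuth + r_subtask                     # R = R_azimuth + R_subtask
            critic.update(state, subtask, action, reward, next_state)   # update theta_Q
            actor.update(state, subtask, critic)                         # update theta_mu
            state = next_state
            if done:
                break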

V. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, three sets of experiments are conducted to test the performance of the proposed reward functions. The performance of our method is evaluated with three indicators: convergence rate, mean value, and standard deviation of convergence. Convergence rate and mean value are used to verify the learning efficiency, and the standard deviation is used for robustness. In the first set of experiments, we apply our azimuth reward function to the state-of-the-art DRL methods, including Distributed Proximal Policy Optimization (DPPO) [21], Deep Deterministic Policy Gradient (DDPG) [19], and Asynchronous Advantage Actor-Critic (A3C) [20]. The effectiveness of the azimuth reward function can be verified by comparing the three evaluation indicators. In the second set of experiments, the subtask-level reward function is applied. We further discuss the effect of the length of the discipline sequence n, and the good performance of the subtask reward function is verified as well. In the last set of experiments, the azimuth reward function and the subtask-level reward function are used together.
Simulation experiments are conducted in V-REP [30], [31]. To ensure the versatility of the experiments, two random unstructured environments with obstacles are initialized as in Fig. 10. In scene A, there is only one obstacle in the working environment, and the task is relatively simple. Scene B is used to simulate a more difficult trajectory planning task: there are two obstacles in the environment and their interference is much more serious. The maximal reward in all experiments is set to 2000; when the reward reaches 90% of this upper limit, the trajectory planning task is considered completed. The computer configuration used in the experiments is summarized in TABLE 1.

FIGURE 10. Simulation environments for robot manipulator.

TABLE 1. Configuration used in the experiments.


A. AZIMUTH REWARD FUNCTION
In this part, four kinds of reward functions, namely basic (a sparse reward function), position, orientation, and azimuth, are applied to three mainstream DRL methods, respectively. During the experiment, we initialize the same working environment 30 times. After all the methods have converged, we calculate the average convergence rate, mean value, and standard deviation, as summarized in TABLE 2. Please note that DDPG in the complex working environment B still cannot converge after a long time of training. The changing process of the reward for each method is visualized in Fig. 11.

TABLE 2. Results with azimuth reward function.

FIGURE 11. Diagram of convergence process with azimuth reward function.

From TABLE 2, we can observe that DPPO with the azimuth reward function achieves the best learning efficiency in both scene A and scene B. Compared to DPPO with the basic reward function, the convergence rate increases by at least 43.9% and the mean value at convergence is improved by 2.55%. The effects of the position and orientation reward functions are inferior to the azimuth reward function; however, in comparison to the basic reward function, their convergence rates are still improved by 16.6% and 29.2%, respectively. From Fig. 11(a), we can find that the reward curves of the orientation and position reward functions have a cross point: the reward of the position reward function is lower than that of the orientation reward function in the early stage. As training proceeds, the position reward gradually overtakes. This is because the robot manipulator hits the obstacle frequently in the early stage of exploration; under this condition, the orientation reward function plays an important role. Later, the robot manipulator has learned to avoid the obstacle skillfully, and the position reward function becomes more dominant in reaching the target. Therefore, compared to the orientation reward function, DPPO with the position reward function is capable of faster convergence.
Similarly, A3C with the azimuth reward function improves remarkably in both convergence rate and mean value compared to the basic reward function. A3C with the azimuth reward function also has better performance in robustness: compared to A3C with the basic reward function, the standard deviation declines by 28.7% in scene A and 54.6% in scene B, respectively. This shows that our azimuth reward function can also improve the stability of exploration. By comparing the results of A3C and DPPO, an obvious phenomenon can be found: in working environment A, A3C achieves the best result, while in the complicated working environment B, DPPO performs much better. This is mainly related to the learning rates of the two methods. The fixed learning rate of A3C may have advantages in simple tasks. DPPO is quite different, because its learning rate is dynamic; in complex tasks, by adjusting the learning rate dynamically, DPPO can get a better result.
As shown in TABLE 2 and Fig. 11(c), there is no doubt that the benefit of the azimuth reward function for DDPG is tremendous. Compared to the basic reward function, not only is the training cost reduced by half, but the convergent mean also increases by at least 13%, and the standard deviation is significantly reduced by 10.8%. In summary, the proposed azimuth reward function achieves favorable results in all three experiments, which shows that the reward function has good effect and robustness to a certain extent.


B. SUBTASK-LEVEL REWARD FUNCTION
In this part, the first thing to discuss is n, the length of the discipline sequence. This parameter represents the strictness of the critic network, and it has a direct impact on the training results. Generally speaking, if the critic network is too tolerant (n is too small), the network may not achieve the optimal training result. On the contrary, if the critic network is too strict (n is too large), the robot manipulator may fail to finish the task, because it has too many requirements to meet. To explore the effect of n, we conduct 8 sets of experiments with different n. DPPO is selected as the test method. The results are summarized in TABLE 3, and the changing curves of the three indicators are shown in Fig. 12.

TABLE 3. Comparison of different n in DPPO method.

FIGURE 12. The results with different n in DPPO.

From the information above, we can notice that when the value of n is set to 9, DPPO achieves the best performance in scene A. However, in scene B, the most suitable value of n is 7. It is not hard to see that the optimal length of the discipline sequence is related to the complexity of the working environment: when the working environment is complicated, the critic network should relax its requirements on the discipline sequence. If n is greater than 15, the robot manipulator can hardly complete the task, which means that the stringency of the critic network has exceeded the limits of the environment or task. Therefore, the value of n is not invariable; it should be adjusted according to the specific constraints of the task and environment.
Then, we apply our subtask-level reward function to DPPO, DDPG, and A3C, and their convergence results are shown in TABLE 4. All three methods show great improvement in both convergence performance and robustness with our subtask-level reward function in scene A and scene B. The convergent mean rises by up to 7.5%, and the convergence rate is accelerated by at least 34.7%. For robustness, the standard deviation is reduced by 14.7%-35.3%. Although the standard deviation is visibly improved, it is not hard to see that the standard deviation with the subtask-level reward function is slightly worse than the result with the azimuth reward function. This is mainly because the subtask-level reward function is responsible for providing global guidance, while the azimuth reward function models the position and orientation constraints; obviously, the azimuth reward function has more advantages in robustness.

TABLE 4. Result with subtask-level reward function.

It is worth noting that the most obvious improvement brought by the subtask-level reward function is in the early stage of training; the reason is that the subtask planning can free the robot manipulator from blind exploration earlier. For methods with the basic reward function, we can find that the reward value stays at 0 for a few episodes in the early stage of exploration. This is because the robot manipulator often hits obstacles during the initial blind exploration. The results in Fig. 13 show that our subtask-level reward function can solve this problem in most cases.

FIGURE 13. Diagram of convergence process with subtask-level reward function.


C. SUBTASK-LEVEL AND AZIMUTH REWARD FUNCTION
In the last set of experiments, the experimental group uses both the azimuth reward function and the subtask-level reward function (referred to as the SA reward function hereinafter). Meanwhile, the other groups use the basic reward function, the subtask-level reward function, and the azimuth reward function, respectively. As shown in TABLE 5, the results of SA are superior to the others in all cases. Compared to the basic reward function, the convergence rate is accelerated by 97.8% in DDPG, and this improvement is even more than 200% in DPPO and A3C. For the convergent mean value, the improvement is between 1.9% and 11.3%, and it is most obvious in DDPG. The performance in robustness is excellent at the same time: the standard deviation is decreased by 29.2%-74.5%, which shows that our SA reward function has good stability and robustness.

TABLE 5. Results with azimuth reward function and subtask-level reward function.

FIGURE 14. Diagram of convergence process with SA reward function.

On analyzing the reward curves of the different reward functions in Fig. 14, a pattern can be found. In the early stages of exploration, the reward of the subtask reward function is always greater than that of the azimuth reward function, and this situation reverses as training goes on. The reason is not difficult to explain: the foremost role of the subtask reward function is to provide global guidance for the agent, so it frees the robot manipulator from blind exploration in the early stages of exploration. Thus, the subtask reward function plays a more important role in the beginning. In the later stages of exploration, the azimuth reward function shows more advantages; in addition, the safety guarantee in obstacle avoidance is also an important duty of the azimuth reward function. The two reward functions work cooperatively to complete the task.


VI. CONCLUSIONS
To cope with the inefficiency and blindness of DRL-based methods in the trajectory planning task for robot manipulators, this paper proposed two brand-new dense reward functions, the azimuth reward function and the subtask-level reward function, to replace the traditional sparse reward function. The former reduces the local uncertainty of trajectory planning in an unstructured environment with obstacles, while the latter provides global guidance for the agent during exploration, which further reduces the blindness of DRL methods. Experimental results demonstrate that state-of-the-art DRL methods using the proposed reward functions can improve the convergence rate and trajectory planning quality dramatically with respect to accuracy and robustness.
In future work, we plan to extend this method to the multi-objective robotic trajectory planning task, which will further increase the universality of the method. Experiments with a real robot manipulator will be performed at the same time.

REFERENCES
[1] L. Milica, A. Nǎstase, and G. Andrei, "Optimal path planning for a new type of 6RSS parallel robot based on virtual displacements expressed through Hermite polynomials," Mech. Mach. Theory, vol. 126, pp. 14-33, Aug. 2018.
[2] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. Robot. Automat., vol. 12, no. 4, pp. 566-580, Aug. 1996.
[3] S. M. LaValle, "Rapidly-exploring random trees: A new tool for path planning," Dept. Comput. Sci., Iowa State Univ., Ames, IA, USA, 1998.
[4] R. Menasri, A. Nakib, B. Daachi, H. Oulhadj, and P. Siarry, "A trajectory planning of redundant manipulators based on bilevel optimization," Appl. Math. Comput., vol. 250, pp. 934-947, Jan. 2015.
[5] X. Broquère, D. Sidobre, and K. Nguyen, "From motion planning to trajectory control with bounded jerk for service manipulator robots," in Proc. ICRA, Anchorage, AK, USA, May 2010, pp. 4505-4510.
[6] K. Katyal, I.-J. Wang, and P. Burlina, "Leveraging deep reinforcement learning for reaching robotic tasks," in Proc. CVPR, Honolulu, HI, USA, Jul. 2017, pp. 8-19.
[7] X. Lei, Z. Zhang, and P. Dong, "Dynamic path planning of unknown environment based on deep reinforcement learning," J. Robot., vol. 2018, no. 12, 2018, Art. no. 5781591.
[8] C. Wang, L. Wei, M. Song, and N. Mahmoudian, "Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation," Sensors, vol. 18, no. 11, pp. 3571-3859, 2018.
[9] A. H. Qureshi, Y. Nakamura, H. Ishiguro, and Y. Yoshikawa, "Robot gains social intelligence through multimodal deep reinforcement learning," in Proc. IEEE-RAS, Cancun, Mexico, Nov. 2016, pp. 745-751.
[10] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, "Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation," in Proc. ICRA, Brisbane, QLD, Australia, May 2018, pp. 1-8.
[11] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in Proc. IROS, Vancouver, BC, Canada, Sep. 2017, pp. 31-36.
[12] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in Proc. IROS, Madrid, Spain, Oct. 2018, pp. 3052-3059.
[13] X. Chen, A. Ghadirzadeh, J. Folkesson, M. Björkman, and P. Jensfelt, "Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments," in Proc. IROS, Madrid, Spain, Oct. 2018, pp. 3110-3116.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 2015.
[15] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. Corke, "Towards vision-based deep reinforcement learning for robotic motion control," 2015, arXiv:1511.03791.
[16] L. Tai and M. Liu, "A robot exploration strategy based on Q-learning network," in Proc. RCAR, Angkor Wat, Cambodia, Jun. 2016, pp. 57-62.
[17] D. Zhao, H. Wang, K. Shao, and Y. Zhu, "Deep reinforcement learning with experience replay based on SARSA," in Proc. IEEE-SSCI, Orlando, FL, USA, Dec. 2016, pp. 1-6.
[18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Proc. AAAI, New Orleans, LA, USA, Feb. 2018, pp. 220-232.
[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971.
[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. ICML, New York, NY, USA, Jun. 2016, pp. 1928-1937.
[21] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver, "Emergence of locomotion behaviours in rich environments," 2017, arXiv:1707.02286.
[22] W. D. Smart and L. P. Kaelbling, "Effective reinforcement learning for mobile robots," in Proc. ICRA, Washington, DC, USA, May 2016, pp. 3404-3410.
[23] B. Badnava and N. Mozayani, "A new potential-based reward shaping for reinforcement learning agent," 2019, arXiv:1902.06239.
[24] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," in Proc. ICRA, Brisbane, QLD, Australia, May 2018, pp. 6292-6299.
[25] K. Macek, I. Petrovi, and N. Peric, "A reinforcement learning approach to obstacle avoidance of mobile robots learning with demonstrations," in Proc. AMC, Maribor, Slovenia, Jul. 2018, pp. 462-466.
[26] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," in Proc. CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 1335-1344.
[27] S. Zhang, Q. Zhang, X. Wei, Y. Zhang, and Y. Xia, "Person re-identification with triplet focal loss," IEEE Access, vol. 11, pp. 78092-78099, 2018.
[28] W. Cheng, X. Chen, J. Zhang, J. Wang, and K. Huang, "Beyond triplet loss: A deep quadruplet network for person re-identification," in Proc. CVPR, Honolulu, HI, USA, Jul. 2017, pp. 403-412.
[29] J. C. Alexander and J. H. Maddocks, "On the kinematics of wheeled mobile robots," Int. J. Robot. Res., vol. 8, pp. 15-27, Oct. 1989.
[30] E. Rohmer, S. P. N. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in Proc. IROS, Tokyo, Japan, Nov. 2013, pp. 1321-1326.
[31] M. Freese, S. P. Singh, F. Ozaki, and N. Matsuhira, "Virtual robot experimentation platform V-REP: A versatile 3D robot simulator," in Proc. ICS, Darmstadt, Germany, Nov. 2010, pp. 536-541.

JIEXIN XIE was born in Xinxiang, China, in 1990. He received the B.S. degree in communication engineering from Henan University, Kaifeng, China, in 2012. He is currently pursuing the master's degree with the College of Information Engineering, Capital Normal University, China. His research interests include surgical robotics and machine learning.

ZHENZHOU SHAO received the B.E. and M.E. degrees from the Department of Information Engineering, Northeastern University, China, in 2007 and 2009, respectively, and the Ph.D. degree from the Department of Mechanical, Aerospace, and Biomedical Engineering, The University of Tennessee, USA, in 2013. He is currently with the College of Information Engineering, Capital Normal University, China. His research interests include surgical robotics, machine learning, and human-robot interaction.

YUE LI was born in Shijiazhuang, China, in 1992. He received the B.S. degree in computer science and technology from Hebei GEO University, Shijiazhuang, China, in 2016. He is currently pursuing the master's degree with the College of Information Engineering, Capital Normal University, China. His research interests include deep reinforcement learning and trajectory planning.

YONG GUAN received the Ph.D. degree from the College of Mechanical Electronic and Information Engineering, China University of Mining and Technology, China, in 2004. He is currently a Professor with Capital Normal University. His research interests include formal verification, PHM for power, and embedded system design. He is a member of the Chinese Institute of Electronics Embedded Expert Committee. He is also a member of the Beijing Institute of Electronics Professional Education Committee and a Standing Council Member of the Beijing Society for Information Technology in Agriculture.

JINDONG TAN received the Ph.D. degree in electrical and computer engineering from Michigan State University, East Lansing, MI, USA, in 2002. He is currently a Professor with the Engineering College, University of Tennessee. His research interests include distributed robotics, wireless sensor networks, human-robot interaction, biosensing and signal processing, and surgical robots and navigation. He is a member of the ACM and Sigma Xi.
