Deep Reinforcement Learning With Optimized Reward Functions For Robotic Trajectory Planning
ABSTRACT This paper aims to improve the efficiency of deep reinforcement learning (DRL)-based methods for robotic trajectory planning in unstructured working environments with obstacles. Different from the traditional sparse reward function, this paper presents two new dense reward functions. First, an azimuth reward function is proposed to accelerate the learning process locally and produce a more reasonable trajectory by modeling the position and orientation constraints, which reduces the blindness of exploration dramatically. To further improve the efficiency, a subtask-level reward function is proposed to provide global guidance for the agent in DRL. The subtask-level reward function is designed under the assumption that the task can be divided into several subtasks, which greatly reduces invalid exploration. Extensive experiments show that the proposed reward functions improve the convergence rate by up to three times with state-of-the-art DRL methods. The convergence mean increases by 2.25%–13.22%, and the standard deviation decreases by 10.8%–74.5%.
INDEX TERMS Deep reinforcement learning, robot manipulator, trajectory planning, reward function.
I. INTRODUCTION
Trajectory planning is a fundamental problem for the motion control of robot manipulators. Conventional trajectory planning methods are usually appropriate for structured environments [1]–[5]. In practice, however, the working environment of a robot manipulator may vary across complex tasks. In recent years, trajectory planning with Deep Reinforcement Learning (DRL) has paved an alternative way to solve this problem [6]–[8]. It enables the robot manipulator to autonomously learn and plan an optimal trajectory in an unstructured working environment. As shown in Fig. 1, the agent in DRL explores the possible actions with a "Trial and Error" mechanism [9], [10], according to the current state of the robot manipulator. By maximizing the accumulated reward with the optimization strategy, the robot manipulator can finish the trajectory planning task in an unstructured environment [11]–[13].

FIGURE 1. Scheme of Deep Reinforcement Learning.

In DRL, the typical optimization strategies include Deep Q Network (DQN) [14]–[16], Deep SARSA (State Action Reward State Action) [17], Rainbow [18], and so on. However, the action spaces produced by those methods
are discrete, which cannot be applied to tasks with continuous action spaces, such as the trajectory planning of a robot manipulator. To solve this problem, Deep Deterministic Policy Gradient (DDPG) [19] and Asynchronous Advantage Actor-Critic (A3C) [20] were put forward. Through nonlinear approximation, DDPG makes the action space continuous, and Tai et al. [11] further improved the performance of DDPG with an asynchronous execution strategy. However, the efficiency of DDPG is low due to the operation of experience replay. This weakness is addressed by the asynchronous update in A3C, whose multithreaded implementation improves the learning efficiency greatly. The drawback of A3C, however, is its fixed learning rate; especially in complicated working environments, the robustness of A3C may degrade badly. Recently, Distributed Proximal Policy Optimization (DPPO) was proposed [21]. DPPO introduces a penalty term that provides a more reasonable update proportion, thereby reducing the impact of an unreasonable learning rate.

Nevertheless, randomness and blindness are still problems in DRL methods. In particular, when considering an unstructured working environment with obstacles, this matter becomes much more prominent. The core of this problem is the reward function. To the best of our knowledge, all the reward functions used in robot manipulator trajectory planning tasks are sparse reward functions. The value of a sparse reward function is zero everywhere, except for a few places (the obstacles and the target in the trajectory planning task) [22]. This kind of reward function always leads to a lot of ineffective exploration, which decreases the efficiency of the algorithm severely [23]–[25]. To cope with this problem, we present an optimized method for robotic trajectory planning. The primary contributions of this paper are summarized as follows:

1) Considering the features of the trajectory planning task and the working environment, two new dense reward functions are proposed. A dense reward function gives non-zero rewards most of the time, which is quite different from a sparse reward function. It provides more information after each action, which reduces the blindness of exploration of DRL methods in the trajectory planning task.

2) First, the azimuth reward function is proposed. It consists of a position reward function and an orientation reward function. The position reward function is built from a Gaussian distribution and the Triplet Loss function; the orientation reward function is modeled by Coulomb's law. The azimuth reward function gives the robot manipulator a reasonable constraint on position and orientation during exploration, thereby reducing invalid exploration.

3) To further improve learning efficiency, we propose another reward function at the subtask level to provide global guidance for the agent. The subtask-level reward function is built on the idea of serialization. With this structure, we model the characteristics of each subtask accurately while greatly reducing the computation overhead.

The rest of this paper is organized as follows. The structure of the azimuth reward function is presented in Section II. In Section III, the subtask-level reward function is introduced. The implementation of the reward functions is illustrated in Section IV, which mainly discusses how to implement the proposed reward functions on the current mainstream DRL methods. Next, experimental results are demonstrated and discussed in Section V. Finally, the conclusions are drawn in Section VI.

II. AZIMUTH REWARD FUNCTION
For DRL-based methods, blindness of exploration in unstructured environments is a major problem. As a consequence, the trajectory planning task always suffers from inefficiency and poor robustness. To cope with this problem, we replace the traditional sparse reward function with a new dense reward function. A dense reward function gives more information after each action, but is much more difficult to construct than a sparse one. In this paper, we select the position and orientation as constraints to build the azimuth reward function for DRL-based methods. The azimuth reward function uses the relative position and relative orientation of the endpoint of the robot manipulator, the obstacle, and the target, referred to as the position reward function and the orientation reward function, respectively. The azimuth reward function improves the learning process locally with a more reasonable trajectory by giving each action a more accurate and comprehensive evaluation, which helps the robot manipulator reduce the blindness of exploration effectively and improves work efficiency.

A. POSITION REWARD FUNCTION
In an unstructured environment with obstacles, the robot manipulator must avoid obstacles and navigate itself to the destination in real time. Therefore, our position reward function consists of two items: obstacle avoidance and target guidance. The obstacle avoidance term is responsible for alerting the robot manipulator to keep a certain safety distance from the obstacles, while the target guidance term is used to motivate the robot manipulator to reach the target as soon as possible.

1) OBSTACLE AVOIDANCE
A Gaussian distribution is used to model obstacle avoidance. Let E denote the endpoint of the robot manipulator, let the obstacle be denoted by O, and let D_EO be the relative distance between E and O. The risk of collision increases as D_EO decreases, and the robot manipulator receives more punishment accordingly. Obstacle avoidance is modeled by the Gaussian function f_obstacle(D_EO) in (1).

$$f_{obstacle}(D_{EO}) = -\frac{1}{\sqrt{2\pi}}\, e^{-\frac{D_{EO}^{2}}{2}}. \tag{1}$$

2) TARGET GUIDANCE
Inspired by the idea of the Triplet loss function [26], [27], we describe target guidance as shown in (2).

$$f_{triplet}(D_{EO}, D_{ET}) = \left[ D_{EO}^{2} - D_{ET}^{2} - \alpha \right]_{+}, \tag{2}$$

where [·]_+ denotes the hinge operation: the value of the function is the bracketed result when that result is non-negative; otherwise, the result is 0. D_ET indicates the relative distance between the endpoint of the robot manipulator E and the target T, and α is the margin [28] between D_ET and D_EO. The value of α needs to be adjusted according to the actual working environment. In this paper, we set the value of α to 0.095 empirically.

By combining obstacle avoidance and target guidance, we describe the position reward function as (3):

$$R_{location}(D_{EO}, D_{ET}) = f_{obstacle}(D_{EO}) + f_{triplet}(D_{EO}, D_{ET}). \tag{3}$$
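To make the position reward concrete, the following is a minimal sketch of (1)–(3), assuming Cartesian coordinates for the endpoint, obstacle, and target and Euclidean distances; the function names mirror the notation above, but the NumPy-based implementation is our own illustration rather than the authors' code.

```python
import numpy as np

ALPHA = 0.095  # margin between D_ET and D_EO, set empirically in the paper

def f_obstacle(d_eo: float) -> float:
    """Obstacle-avoidance term, Eq. (1): a negative Gaussian of the
    endpoint-obstacle distance, i.e. stronger punishment when closer."""
    return -1.0 / np.sqrt(2.0 * np.pi) * np.exp(-d_eo ** 2 / 2.0)

def f_triplet(d_eo: float, d_et: float, alpha: float = ALPHA) -> float:
    """Target-guidance term, Eq. (2): hinge of D_EO^2 - D_ET^2 - alpha,
    positive only when the endpoint is closer to the target than to the
    obstacle by at least the margin."""
    return max(d_eo ** 2 - d_et ** 2 - alpha, 0.0)

def r_location(endpoint, obstacle, target) -> float:
    """Position reward, Eq. (3): sum of the two terms."""
    d_eo = np.linalg.norm(np.asarray(endpoint) - np.asarray(obstacle))
    d_et = np.linalg.norm(np.asarray(endpoint) - np.asarray(target))
    return f_obstacle(d_eo) + f_triplet(d_eo, d_et)
```

Because f_obstacle is a negative Gaussian, the punishment grows as the endpoint approaches the obstacle, while the hinge term only rewards states that are closer to the target than to the obstacle by at least the margin α.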
B. ORIENTATION REWARD FUNCTION
How to avoid obstacles safely and rapidly is a crucial issue in an unstructured environment with obstacles. In practice, for the endpoint of the robot manipulator, the directions of relative movement toward the target and away from the obstacle often overlap, which increases the difficulty of obstacle avoidance. In this case, it is necessary to design a strategy for choosing a reasonable direction.

Motivated by the attraction and repulsion between electric charges, we model the orientation reward function for trajectory planning according to Coulomb's law [29]. The relation between the obstacle and the endpoint of the robot manipulator can be described as like charges repelling each other. Similarly, the relation between the target and the endpoint of the robot manipulator can be expressed as unlike charges attracting each other.

FIGURE 2. Scheme of orientation reward function.

The orientation reward function is illustrated in Fig. 2, where ET' is the attraction vector toward the target and EO' is the repulsion vector away from the obstacle. The arithmetic expressions of ET' and EO' are formulated in (4) and (5).

$$\overrightarrow{ET'} = \frac{Q_{E} Q_{T}}{r_{1}^{2}}\,\frac{\overrightarrow{ET}}{|\overrightarrow{ET}|}, \tag{4}$$

$$\overrightarrow{EO'} = -\frac{Q_{E} Q_{O}}{r_{2}^{2}}\,\frac{\overrightarrow{EO}}{|\overrightarrow{EO}|}, \tag{5}$$

where r_1 is the relative distance between the target and the endpoint of the robot manipulator, and r_2 is the relative distance between the obstacle and the endpoint of the robot manipulator. Q_E denotes the charge of the endpoint of the robot manipulator, Q_O is the charge of the obstacle, and Q_T represents the charge of the target. In practice, the attraction of the target to the robot manipulator should be greater than the repulsion of the obstacle; otherwise, the robot manipulator may never reach the target because it keeps avoiding obstacles. In this paper, we set the value of Q_T to twice Q_O. EB indicates the desired direction of relative movement, and EC is the actual motion vector. The angle between EB and EC is written as φ, which is used to measure the similarity between the current motion vector and the desired motion vector produced by the agent. A smaller φ means a higher similarity between the two vectors; φ can be calculated by (6).

$$\varphi = \arccos\frac{(\overrightarrow{EO'} + \overrightarrow{ET'}) \cdot \overrightarrow{EC}}{|\overrightarrow{EO'} + \overrightarrow{ET'}| \cdot |\overrightarrow{EC}|}. \tag{6}$$

Combining all the factors above, the orientation reward function we propose is shown in (7), where τ is a compensating parameter. In this paper, we set τ to 0.785 according to experimental experience.

$$R_{orientation}(\varphi) = \tau - \frac{\varphi \times \pi}{180}. \tag{7}$$
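A minimal sketch of (4)–(7) is given below, assuming 3-D Cartesian position vectors; the charge magnitudes are illustrative placeholders (only the ratio Q_T = 2·Q_O is taken from the text), and the helper name is our own rather than the authors' code.

```python
import numpy as np

TAU = 0.785          # compensating parameter from Eq. (7)
Q_E, Q_O = 1.0, 1.0  # illustrative charge magnitudes (assumptions)
Q_T = 2.0 * Q_O      # the paper sets the target charge to twice the obstacle charge

def orientation_reward(endpoint, obstacle, target, motion_vec) -> float:
    """Orientation reward, Eqs. (4)-(7): build the attraction/repulsion
    vectors, measure the angle to the actual motion vector, and map it
    to a reward."""
    E, O, T = map(np.asarray, (endpoint, obstacle, target))
    EC = np.asarray(motion_vec)

    ET = T - E
    EO = O - E
    r1 = np.linalg.norm(ET)   # endpoint-target distance
    r2 = np.linalg.norm(EO)   # endpoint-obstacle distance

    ET_prime = (Q_E * Q_T / r1**2) * ET / np.linalg.norm(ET)   # Eq. (4), attraction
    EO_prime = -(Q_E * Q_O / r2**2) * EO / np.linalg.norm(EO)  # Eq. (5), repulsion

    desired = ET_prime + EO_prime
    cos_phi = desired @ EC / (np.linalg.norm(desired) * np.linalg.norm(EC))
    phi_deg = np.degrees(np.arccos(np.clip(cos_phi, -1.0, 1.0)))  # angle in degrees

    return TAU - phi_deg * np.pi / 180.0   # Eq. (7): tau minus the angle in radians
```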
C. MODELING OF AZIMUTH REWARD FUNCTION
In the process of trajectory planning, position and orientation are two key factors to be considered comprehensively. However, the working environment of the robot manipulator is complex, and the weights of the two items in the azimuth reward function differ across scenarios. To solve this problem, we introduce a weight vector λ = [λ_location, λ_orientation] to build the azimuth reward function. As shown in Fig. 3, the workspace around the obstacle is divided into three parts: the safety, warning, and danger areas. In order to improve the learning efficiency, λ is adjusted dynamically in the different working areas. In the safety area, the position reward function plays the leading role; in the warning area, as the robot manipulator approaches the obstacle, the weight of the position reward function decreases and that of the orientation reward function increases; in the danger area, the orientation reward function begins to dominate. The adjustment strategy of λ is summarized in (8).

$$\lambda = \begin{cases} \left[\lambda_{location}=0,\ \lambda_{orientation}=1\right]^{T}, & E \in \text{danger} \\[1ex] \left[\lambda_{location}=\dfrac{D_{EO}-d_{w}}{d_{m}-d_{w}},\ \lambda_{orientation}=\dfrac{d_{m}-D_{EO}}{d_{m}-d_{w}}\right]^{T}, & E \in \text{warning} \\[1ex] \left[\lambda_{location}=1,\ \lambda_{orientation}=0\right]^{T}, & E \in \text{safety} \end{cases} \tag{8}$$

where d_w and d_m are the radii of the danger area and the warning area, as shown in Fig. 3. Combined with the weight λ, the final expression of the azimuth reward function is defined in (9).

$$R = \begin{bmatrix} \lambda_{location} \\ \lambda_{orientation} \end{bmatrix}^{T} \begin{bmatrix} R_{location}(D_{EO}, D_{ET}) \\ R_{orientation}(\varphi) \end{bmatrix}. \tag{9}$$
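As an illustration of (8)–(9), the sketch below blends the two rewards with area-dependent weights; the numerical radii d_w and d_m are placeholders we chose for the example, since their actual values depend on the working environment.

```python
import numpy as np

D_W, D_M = 0.10, 0.25   # placeholder radii of the danger and warning areas (meters)

def azimuth_reward(d_eo: float, r_location: float, r_orientation: float) -> float:
    """Azimuth reward, Eqs. (8)-(9): weight the position and orientation
    rewards according to the area the endpoint currently occupies."""
    if d_eo <= D_W:                              # danger area
        lam = np.array([0.0, 1.0])
    elif d_eo <= D_M:                            # warning area
        lam = np.array([(d_eo - D_W) / (D_M - D_W),
                        (D_M - d_eo) / (D_M - D_W)])
    else:                                        # safety area
        lam = np.array([1.0, 0.0])
    return float(lam @ np.array([r_location, r_orientation]))   # Eq. (9)
```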
III. SUBTASK-LEVEL REWARD FUNCTION
Although the azimuth reward function proposed in Section II is able to reduce the local blindness of exploration of DRL methods, it mainly focuses on local exploration at a single moment and lacks global guidance for the trajectory planning task. Therefore, another reward function at the subtask level is designed to provide macroscopic guidance. Like the azimuth reward function, the subtask-level reward function is also a dense reward function; it works at all times during exploration and gives each action a meaningful evaluation at the subtask level. The trajectory planning task of the robot manipulator can be divided into several simple subtasks, mainly target approaching and obstacle avoidance. Considering the characteristics of the subtasks, the reward functions should be designed specifically. Thus, to further improve the learning efficiency of trajectory planning, the corresponding subtask-level reward functions, namely the target approaching and obstacle avoidance reward functions, are proposed in this paper. The robot manipulator switches between the two subtasks during trajectory planning, as shown in Fig. 4. Initially, the robot manipulator moves toward the target with the target approaching reward function. During exploration, if the relative distance between the robot manipulator and an obstacle is less than a threshold, the obstacle avoidance subtask is activated. When the robot manipulator bypasses the obstacle, it switches back to the target approaching subtask. The subtask switching is determined by the threshold on the relative distance, which can be adjusted according to actual requirements, as different thresholds will result in different motion trajectories. The details are introduced as follows.

A. DETERMINATION OF SUBTASK REWARD FUNCTION
In this paper, the determination of the subtask reward function is implemented based on the time nodes in a discipline sequence, as shown in Fig. 5. Each small box represents a time node, and the value at each time node can only be set to 0 or 1. The values are calculated from the relative distances, which are the distances from the robot manipulator to the target and to the obstacle in the target approaching and obstacle avoidance subtasks, respectively. The value of each node depends only on the relative distances at the current time node and at the previous adjacent time node. Specifically, if the relative distance at the current time node is smaller than at the previous adjacent one, the value of the node is 1; otherwise it equals 0.

The parameter n in the blue box indicates the number of time nodes in the discipline sequence to be considered for the reward calculation. A larger n means a stricter requirement of the critic network on actions. The output of the reward function is completely determined by the values of the time nodes in the discipline sequence. From the characteristics of the discipline

FIGURE 7. Desired obstacle avoidance trajectory.
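The node bookkeeping described above can be sketched as follows; the node rule (1 if the tracked relative distance decreased, else 0) and the window length n come from the text, while the final aggregation into a scalar reward (here, the fraction of 1-valued nodes) is our own assumption, since the exact mapping is defined around Fig. 5 of the paper rather than reproduced in this excerpt.

```python
from collections import deque

class DisciplineSequence:
    """Sliding window over the last n time nodes; a node is 1 if the tracked
    relative distance decreased compared with the previous time node, else 0."""

    def __init__(self, n: int):
        self.n = n
        self.nodes = deque(maxlen=n)
        self.prev_distance = None

    def update(self, distance: float) -> None:
        # Record the node value for the current time step.
        if self.prev_distance is not None:
            self.nodes.append(1 if distance < self.prev_distance else 0)
        self.prev_distance = distance

    def reward(self) -> float:
        # Illustrative aggregation (our assumption): the fraction of 1-valued
        # nodes in the window, so sustained progress toward the subtask goal
        # earns a higher subtask-level reward.
        return sum(self.nodes) / len(self.nodes) if self.nodes else 0.0

# Track the endpoint-target distance in the target-approaching subtask, or
# the endpoint-obstacle distance in the obstacle-avoidance subtask.
seq = DisciplineSequence(n=9)
for d in [0.50, 0.46, 0.44, 0.45, 0.41]:
    seq.update(d)
print(seq.reward())  # 0.75: three of the four recorded nodes are 1
```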
TABLE 1. Configuration used in the experiments.

network for training and evaluation. The groundwork of the network training stage is updating the weights of the network. In this phase, the environmental status S, the subtask L, the action a, and the reward R are taken into account comprehensively. The overall process is summarized in Algorithm 1, where M is the maximum number of episodes (training times) and T is the maximum number of training steps in each episode.
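Since Algorithm 1 itself is not reproduced in this excerpt, the following self-contained toy loop only illustrates the episode/step structure and the threshold-based subtask switching; the 2-D point-mass "endpoint", the random stand-in policy, and the placeholder rewards are all our own assumptions, not the authors' implementation.

```python
import numpy as np

# Toy 2-D stand-ins for the manipulator endpoint, target, and obstacle.
M, T = 50, 100                  # maximum episodes and maximum steps per episode
TARGET = np.array([1.0, 1.0])
OBSTACLE = np.array([0.5, 0.5])
SWITCH_THRESHOLD = 0.2          # obstacle distance that activates obstacle avoidance
rng = np.random.default_rng(0)

for episode in range(M):
    state = np.zeros(2)                        # environmental status S
    subtask = "target_approaching"             # subtask L
    episode_return = 0.0
    for t in range(T):
        action = rng.uniform(-0.05, 0.05, 2)   # action a (random stand-in policy)
        state = state + action
        d_et = np.linalg.norm(TARGET - state)
        d_eo = np.linalg.norm(OBSTACLE - state)
        # Threshold-based subtask switching described in Section III.
        subtask = "obstacle_avoidance" if d_eo < SWITCH_THRESHOLD else "target_approaching"
        if subtask == "obstacle_avoidance":
            reward = d_eo - SWITCH_THRESHOLD   # placeholder: grows as the endpoint backs away
        else:
            reward = -d_et                     # placeholder: grows as the endpoint nears the target
        episode_return += reward
        # A real method (DPPO/DDPG/A3C) would store the transition and update
        # its network weights here.
        if d_et < 0.05:
            break
```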
many requirements to meet. To explore the resultant effect of n in experiments, we conduct 8 sets of experiments with different n. DPPO is selected as the test method. The results are summarized in TABLE 3, and the changing curves of the three indicators are shown in Fig. 12.

FIGURE 12. The results with different n in DPPO.

From the information above, we can notice that when the value of n is set to 9, DPPO achieves the best performance in scene A. However, in scene B, the most suitable value of n is 7. It is not hard to see that the optimal length of the discipline sequence is related to the complexity of the working environment. When the working environment is complicated, the critic network should relax its requirements on the discipline sequence. If n is greater than 15, the robot manipulator can hardly complete the task, which means that the stringency of the critic network has exceeded the limits of the environment or task. Therefore, the value of n is not invariable; it should be adjusted according to the specific constraints of the task and environment.

Then, we apply our subtask-level reward function to DPPO, DDPG, and A3C; their convergence results are shown in TABLE 4. All three methods show great improvement both in convergence performance and robustness when using our subtask-level reward function in scene A or scene B. The convergence mean rises by up to 7.5%, and the convergence rate is accelerated by at least 34.7%. For robustness, the standard deviation is reduced by 14.7%–35.3%. Although the standard deviation is visibly improved, it is not hard to see that the standard deviation with the subtask-level reward function is slightly worse than that with the azimuth reward function. This is mainly because the subtask-level reward function is responsible for providing global guidance, while the azimuth reward function models the position and orientation constraints. Obviously, the azimuth reward function has more advantages in robustness.

It is worth noting that the most obvious improvement brought by the subtask-level reward function appears in the early stage of training; the reason is that the subtask planning can make the robot manipulator get rid of blind exploration earlier. For methods with the basic reward function, we can find that the reward value stays at 0 for a few episodes in the early stage of exploration. This is because the robot manipulator often hits obstacles during the initial blind exploration. Results in Fig. 13 show that our subtask-level reward function can solve this problem in most cases.

C. SUBTASK-LEVEL AND AZIMUTH REWARD FUNCTION
In the last set of experiments, the experimental group uses both the azimuth reward function and the subtask-level reward function (referred to as the SA reward function hereinafter). Meanwhile, the other groups use the basic reward function, the subtask-level reward function, and the azimuth reward function, respectively. As shown in TABLE 5, the results of SA are superior to the others in all cases. Compared to the basic reward
TABLE 5. Results with azimuth reward function and subtask-level reward function.
in the unstructured environment with obstacles, while the latter provides global guidance for the agent during exploration, which further reduces the blindness of DRL methods. Experimental results demonstrate that state-of-the-art DRL methods using the proposed reward functions can improve the convergence rate and trajectory planning quality dramatically with respect to accuracy and robustness.

In future work, we plan to extend this method to the multi-objective robotic trajectory planning task, which will further increase the universality of the method. Experiments with a real robot manipulator will be performed at the same time.
REFERENCES
[1] L. Milica, A. Nǎstase, and G. Andrei, "Optimal path planning for a new type of 6RSS parallel robot based on virtual displacements expressed through Hermite polynomials," Mech. Mach. Theory, vol. 126, pp. 14–33, Aug. 2018.
[2] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. Robot. Automat., vol. 12, no. 4, pp. 566–580, Aug. 1996.
[3] S. M. LaValle, "Rapidly-exploring random trees: A new tool for path planning," Dept. Comput. Sci., Iowa State Univ., Ames, IA, USA, 1998.
[4] R. Menasri, A. Nakib, B. Daachi, H. Oulhadj, and P. Siarry, "A trajectory planning of redundant manipulators based on bilevel optimization," Appl. Math. Comput., vol. 250, pp. 934–947, Jan. 2015.
[5] X. Broquère, D. Sidobre, and K. Nguyen, "From motion planning to trajectory control with bounded jerk for service manipulator robots," in Proc. ICRA, Anchorage, AK, USA, May 2010, pp. 4505–4510.
[6] K. Katyal, I.-J. Wang, and P. Burlina, "Leveraging deep reinforcement learning for reaching robotic tasks," in Proc. CVPR, Honolulu, HI, USA, Jul. 2017, pp. 8–19.
[7] X. Lei, Z. Zhang, and P. Dong, "Dynamic path planning of unknown environment based on deep reinforcement learning," J. Robot., vol. 2018, no. 12, 2018, Art. no. 5781591.
[8] C. Wang, L. Wei, M. Song, and N. Mahmoudian, "Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation," Sensors, vol. 18, no. 11, pp. 3571–3859, 2018.
[9] A. H. Qureshi, Y. Nakamura, H. Ishiguro, and Y. Yoshikawa, "Robot gains social intelligence through multimodal deep reinforcement learning," in Proc. IEEE-RAS, Cancun, Mexico, Nov. 2016, pp. 745–751.
[10] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, "Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation," in Proc. ICRA, Brisbane, QLD, Australia, May 2018, pp. 1–8.
[11] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in Proc. IROS, Vancouver, BC, Canada, Sep. 2017, pp. 31–36.
[12] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in Proc. IROS, Madrid, Spain, Oct. 2018, pp. 3052–3059.
[13] X. Chen, A. Ghadirzadeh, J. Folkesson, M. Björkman, and P. Jensfelt, "Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments," in Proc. IROS, Madrid, Spain, Oct. 2018, pp. 3110–3116.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[15] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. Corke, "Towards vision-based deep reinforcement learning for robotic motion control," 2015, arXiv:1511.03791. [Online]. Available: https://arxiv.org/abs/1511.03791
[16] L. Tai and M. Liu, "A robot exploration strategy based on Q-learning network," in Proc. RCAR, Angkor Wat, Cambodia, Jun. 2016, pp. 57–62.
[17] D. Zhao, H. Wang, K. Shao, and Y. Zhu, "Deep reinforcement learning with experience replay based on SARSA," in Proc. IEEE-SSCI, Orlando, FL, USA, Dec. 2016, pp. 1–6.
[18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Proc. AAAI, New Orleans, LA, USA, Feb. 2018, pp. 220–232.
[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971. [Online]. Available: https://arxiv.org/abs/1509.02971
[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. ICML, New York, NY, USA, Jun. 2016, pp. 1928–1937.
[21] N. Heess, D. Tb, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver, "Emergence of locomotion behaviours in rich environments," 2017, arXiv:1707.02286. [Online]. Available: https://arxiv.org/pdf/1707.02286.pdf
[22] W. D. Smart and L. P. Kaelbling, "Effective reinforcement learning for mobile robots," in Proc. ICRA, Washington, DC, USA, May 2016, pp. 3404–3410.
[23] B. Badnava and N. Mozayani, "A new potential-based reward shaping for reinforcement learning agent," 2019, arXiv:1902.06239. [Online]. Available: https://arxiv.org/abs/1902.06239
[24] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," in Proc. ICRA, Brisbane, QLD, Australia, May 2018, pp. 6292–6299.
[25] K. Macek, I. Petrovi, and N. Peric, "A reinforcement learning approach to obstacle avoidance of mobile robots learning with demonstrations," in Proc. AMC, Maribor, Slovenia, Jul. 2018, pp. 462–466.
[26] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," in Proc. CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 1335–1344.
[27] S. Zhang, Q. Zhang, X. Wei, Y. Zhang, and Y. Xia, "Person re-identification with triplet focal loss," IEEE Access, vol. 6, pp. 78092–78099, 2018.
[28] W. Cheng, X. Chen, J. Zhang, J. Wang, and K. Huang, "Beyond triplet loss: A deep quadruplet network for person re-identification," in Proc. CVPR, Honolulu, HI, USA, Jul. 2017, pp. 403–412.
[29] J. C. Alexander and J. H. Maddocks, "On the kinematics of wheeled mobile robots," Int. J. Robot. Res., vol. 8, pp. 15–27, Oct. 1989.
[30] E. Rohmer, S. P. N. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in Proc. IROS, Tokyo, Japan, Nov. 2013, pp. 1321–1326.
[31] M. Freese, S. P. Singh, F. Ozaki, and N. Matsuhira, "Virtual robot experimentation platform V-REP: A versatile 3D robot simulator," in Proc. ICS, Darmstadt, Germany, Nov. 2010, pp. 536–541.

JIEXIN XIE was born in Xinxiang, China, in 1990. He received the B.S. degree in communication engineering from Henan University, Kaifeng, China, in 2012. He is currently pursuing the master's degree with the College of Information Engineering, Capital Normal University, China. His research interests include surgical robotics and machine learning.

ZHENZHOU SHAO received the B.E. and M.E. degrees from the Department of Information Engineering, Northeastern University, China, in 2007 and 2009, respectively, and the Ph.D. degree from the Department of Mechanical, Aerospace, and Biomedical Engineering, The University of Tennessee, USA, in 2013. He is currently with the College of Information Engineering, Capital Normal University, China. His research interests include surgical robotics, machine learning, and human–robot interaction.

YUE LI was born in Shijiazhuang, China, in 1992. He received the B.S. degree in computer science and technology from Hebei GEO University, Shijiazhuang, China, in 2016. He is currently pursuing the master's degree with the College of Information Engineering, Capital Normal University, China. His research interests include deep reinforcement learning and trajectory planning.

JINDONG TAN received the Ph.D. degree in electrical and computer engineering from Michigan State University, East Lansing, MI, USA, in 2002. He is currently a Professor with the Engineering College, University of Tennessee. His research interests include distributed robotics, wireless sensor networks, human–robot interaction, biosensing and signal processing, and surgical robots and navigation. He is a member of the ACM and Sigma Xi.