
robotics

Article
A Collision Avoidance Method Based on Deep
Reinforcement Learning
Shumin Feng 1 , Bijo Sebastian 2 and Pinhas Ben-Tzvi 1, *

1 Robotics and Mechatronics Lab, Department of Mechanical Engineering, Virginia Polytechnic Institute and
State University, Blacksburg, VA 24061, USA; [email protected]
2 Torc Robotics, Blacksburg, VA 24060, USA; [email protected]
* Correspondence: [email protected]; Tel.: +1-(540)-231-6938

Abstract: This paper sets out to investigate the usefulness of solving collision avoidance problems with the help of deep reinforcement learning in an unknown environment, especially in compact spaces, such as a narrow corridor. This research aims to determine whether a deep reinforcement learning-based collision avoidance method is superior to traditional methods, such as potential field-based methods and the dynamic window approach. In addition, the proposed obstacle avoidance method was developed as one of the capabilities to enable each robot in a novel robotic system, the Self-reconfigurable and Transformable Omni-Directional Robotic Modules (STORM), to navigate intelligently and safely in an unknown environment. A well-conceived hardware and software architecture with features that enable further expansion and parallel development, designed for the ongoing STORM projects, is also presented in this work. A virtual STORM module with skid-steer kinematics was simulated in Gazebo to reduce the gap between the simulations and the real-world implementations. Moreover, comparisons among multiple training runs of the neural networks with different parameters related to balancing exploitation and exploration during the training process, as well as tests and experiments conducted in both simulation and the real world, are presented in detail. Directions for future research are also provided in the paper.


Keywords: collision avoidance; neural network; deep reinforcement learning

Citation: Feng, S.; Sebastian, B.; Ben-Tzvi, P. A Collision Avoidance Method Based on Deep Reinforcement Learning. Robotics 2021, 10, 73. https://doi.org/10.3390/robotics10020073

Received: 14 April 2021; Accepted: 14 May 2021; Published: 19 May 2021

1. Introduction

Artificial Intelligence (AI) has been a topic of great interest in a wide range of fields, such as gaming, computer vision, and robotics. One of the impressive applications is the AI-based AlphaZero [1], which has superhuman performance in the games of chess and shogi, as well as Go, with the help of Deep Reinforcement Learning (DRL) based techniques. DRL [2,3] is a combination of reinforcement learning and deep learning techniques. It allows an agent to learn to behave in an environment based on feedback rewards or punishments. A neural network is used to approximate the associated value function, which allows the proposed architecture to evaluate the possible actions based on an inherent understanding of the situation. This enables an agent to learn from scratch by itself, without experience or supervision from human beings, which may offer possibilities for the agent to discover superhuman behaviors and achieve better performance. The exciting achievements of DRL-based programs in gaming encourage more and more researchers to investigate real-world applications. For example, a DRL algorithm based on off-policy training of deep Q-functions successfully learned a complex door-opening skill on real robotic manipulators [4].

A vast majority of current research in robotics is concerned with mobile robots [5,6], especially mobile robot navigation. This is another popular area where DRL has been applied to seek valuable solutions [7,8]. Our interest focuses on investigating whether a DRL-based solution is beneficial for improving robot behavior when solving one of the autonomous navigation problems, namely obstacle avoidance, in addition to analyzing and comparing some state-of-the-art autonomous navigation methods, especially obstacle avoidance approaches, theoretically and experimentally.
In this work, we solve the obstacle avoidance problem using DRL-based methods to address the shortcomings of existing state-of-the-art local path planning techniques. Specifically, the proposed work aims to improve the robot's performance in problematic situations [9], as mentioned below.
It is a common problem with existing state-of-the-art methods to trigger the obstacle avoidance maneuver within a specific distance, commonly referred to as the triggering distance. When an obstacle or multiple obstacles are detected within that distance, the local path planners will send commands to the robot to steer away from the obstacles. It should be noted that the triggering distance is different from the maximum range of the sensor. This is illustrated in Figure 1. As described in the figure, the robot has information about the environment within the maximum range of the sensor, denoted in red, but chooses to act solely based on the information available within the triggering distance, denoted in blue. As a result, this waste of information leads to an improper inference of the nature of the environment, and the robot will choose to steer toward direction A or B with equal probability. However, only direction A is the correct option.

Figure 1. Robot in a narrow corridor. The blue circle indicates the trigger distance. Both A and B are the open areas detected by the autonomous robot at the current time step.

The major limitation of the proposed technique is the fact that a large dataset is required to train the neural network architecture. To address this issue, a simulated environment was developed in Gazebo [10] to collect the data. Several different features, especially scenarios with compact spaces and problematic situations, as mentioned above, were constructed in the simulation environments to train the neural networks.
Another objective of the proposed DRL-based obstacle avoidance technique is to enable autonomous navigation of the Self-reconfigurable and Transformable Omni-Directional Robotic Modules (STORM) [11–13]. Several related projects [14–16] of the STORM system were developed in parallel, which increases the requirements on the hardware and software architecture of STORM. In order to efficiently handle the complexity while allowing for further expansion and parallel development, the software and electronic architecture of the STORM system are required to be even more advanced. A well-conceived hardware and software architecture was developed for STORM, as also detailed in this paper. In addition, a virtual STORM module with skid-steer kinematics according to the mechanical design of the locomotion module of STORM was developed in Gazebo in order to reduce the gap between the simulations and the real-world experiments.
In summary, the major contributions of this paper can be listed as follows: (1) developing a DRL-based solution to the obstacle avoidance problem, exploring and evaluating the influence of selecting different critical parameters for the training process; (2) addressing the compact spaces and problematic scenarios when solving the obstacle avoidance problem and comparing the performance with the traditional dynamic window-based method in simulation; (3) developing a virtual STORM module in Gazebo in order to reduce the gap between the simulations and the real-world experiments; (4) developing electronic and software architectures for STORM to enable the parallel design of different controllers for the STORM system.

2. Related Works
2.1. Autonomous Navigation
In this section, we review and compare the state-of-the-art autonomous navigation techniques theoretically as background. Several common drawbacks of the previous methods are pointed out, with the intention of developing and improving our method to avoid these issues.
In general, the autonomous navigation problem can be subdivided into several sub-
tasks, including localization, path planning, and obstacle avoidance. A global path planner
is commonly used to plan an optimal collision-free path from the start to the goal position
for the robot in advance, provided prior knowledge of the environment is available in the
form of a map. Some of the commonly used global path planners include the A* searching
algorithm [17] or the visibility graph method [18]. Despite the ability to find optimal paths,
global path planning methods suffer from a major drawback. Prior knowledge of the
environment is always required for global path planning. As a result, the derived path
will not be reliable if the map changes or if the map is not accurate enough. To this extent,
local path planners were developed for mobile robots to avoid obstacles based on real-time
sensory information. Methods such as potential fields (PF) [19] and dynamic window
approach [20] can be utilized to avoid collisions in a dynamic environment.
Although potential field-based methods are straightforward and capable of handling unknown scenarios, they have some significant drawbacks [21]: the robot can get trapped in a local minimum away from the goal, the robot may not be able to find a path between two closely spaced obstacles, and oscillations can occur when the robot attempts to
move through narrow passages. To overcome these shortcomings, modified versions of the
potential fields method were developed, such as the Virtual Force Field (VFF) method [21]
and the Vector Field Histogram (VFH) [22]. Although the above-mentioned drawbacks are
handled by these methods, deciding upon the direction of motion can be difficult for
these methods, particularly in some problematic scenarios [9]. Some fuzzy logic-based
methods [23,24] also exist to teach a mobile robot the proper action based on sensory
information to improve obstacle avoidance performance. However, the selection of the
linguistic variables used for the fuzzy rules and the definition of membership functions
can become complicated in the presence of a large number of sensor readings, such as in
the case of data emanating from a LIDAR.
For instance, in the case of existing algorithms such as the Vector Field Histogram
(VFH) based path planner [9] that rely solely on information available within the triggering
distance, the two openings, A and B, are equally appropriate. This leads to a problematic
scenario if the planner chooses to steer in the direction of B, as mentioned in Section 1.
The robot will crash into the wall unless it stops or moves backward. This problem was
previously pointed out by Ulrich and Borenstein in [9] when they developed the VFH*
method. They designed a hybrid path planner by combining the A* search algorithm
and VFH to prevent failure in the above-mentioned scenario. However, it requires prior
knowledge of the environment to improve the performance of the local planner.
A better solution would be to create a technique where the robot uses all the infor-
mation available within the maximum sensing range to infer more about the nature of
the environment. This in turn, creates an additional layer of autonomy where the robot
attempts to understand the environment before making obstacle avoidance decisions,
thereby resulting in reliable, intelligent behaviors. As soon as the robot sees the wall near
B, the obstacle avoidance technique should infer that the robot is in a corridor and should
turn towards A.
As such, we explore the use of DRL for developing an intelligent obstacle avoidance
approach. This enables the proposed algorithm to develop an inherent ability to understand
the environment based on sensor readings and thereby make the right decision without
error from preprocessing of the sensory data and waste of the information. To emphasize
the advantages of the proposed DRL based method, its features are compared with state-
of-the-art classic methods, as shown in Table 1.

Table 1. Comparison of path planning approaches.

Methods | Strategy | Prior Knowledge | Real-Time Capability | Comments
A* [17] | Search-based approach | Yes | Offline planning | Avoids expanding paths that are already expensive
Dijkstra's Algorithm [25] | Search-based approach | Yes | Offline planning | Explores all the possible paths and takes more time
Greedy Best First Search [26] | Search-based approach | Yes | Offline planning | Finds the optimal path fast but does not always work
Probabilistic roadmap (PRM) [27] | Sample-based planning | Yes | Offline planning | Selects a random node from the C-space and performs multi-query
RRT [28] | Sample-based planning | Yes | Offline planning | Selects a random node from the C-space and incrementally adds new nodes
PF [19] | Potential field-based approach | No | Online planning | Has several drawbacks: local minima, no passage through narrow paths, oscillations in narrow passages
VFF [21] | Potential field-based approach | No | Online planning | Overcomes the local minima problem, but the other two shortcomings still exist
VFH [9], VFH+ [29] | Potential field-based approach | No | Online planning | Overcomes the drawbacks, but problematic scenarios exist
VFH* [9] | Potential field-based approach | Yes | Online planning | Overcomes the problematic scenarios but relies on information from a map
FLC [23] | Fuzzy logic-based method | No | Online planning | Generates various decisions according to different sensory information
Proposed Approach | DRL-based approach | No | Online planning | Generates optimal decisions from a well-trained neural network. Complex scenarios can be overcome as long as the robot has sufficient experiences during training

2.2. Deep Reinforcement Learning in Robotics


More and more current research attempts to adopt various DRL methods to solve real-
world problems in robotics. Reducing the training time and improving the sample efficiency are of great importance when training on physical robots. For instance, deep deterministic policy gradient (DDPG) and normalized advantage functions (NAF) have been shown to be suitable for real robotic systems. An asynchronous DRL approach with a parallel NAF algorithm, demonstrated in [4], was proved to achieve sample-efficient training on a cluster of robotic manipulators. Another work [30] related to robotic manipulation employs model-free DRL with a soft Q-learning (SQL) based method that leverages the compositionality of maximum entropy policies; in both simulated and real-world tasks, it is more sample efficient than prior model-free DRL methods.
Various DRL methods were employed and extended to solve autonomous navigation problems. For example, an actor-critic model, whose policy is a function of both the goal and the current state so that it generalizes across targets and scenes, was adopted in [7] to solve the problem of navigating a room to find a target object. Autonomous navigation of unmanned aerial vehicles was formulated as a partially observable Markov decision process in [31], with a solution based on the recurrent deterministic policy gradient (RDPG) algorithm. A motion planner for mapless navigation of mobile robots was trained through an asynchronous DDPG to improve the effectiveness of the training process [8].

3. Collision Avoidance Approach


In our proposed approach, the collision avoidance problem is defined as a Markov
Decision Process which can be solved using DRL. Three different DRL methods were
compared in our previous work [32] as the basis for selecting the DRL approach in this
work for further analysis and investigation. With regard to obstacle avoidance applications, as compared to existing methods such as the potential field method [19] and the dynamic window approach (DWA) [20], DRL does not require a complex mathematical model. Instead, an effective model is derived automatically by updating the parameters of the neural network based on the observations obtained from the sensors as inputs. In addition, this method allows the robot to operate efficiently without a precise
map or a high-resolution sensor. The setup of the simulated training environment and
implementation details of the DRL-based obstacle avoidance approach is presented in the
following subsections.

3.1. Problem Formulation


In order to plan a collision-free path, the robot needs to extract information from
the environment using sensor data and thereby generate commands to avoid obstacles.
The relationship between the observations from the sensors and action can be represented
by a mathematical model. In general, the problem can be formulated as

(vt, ωt) = f(Ot)    (1)

where vt , ωt and Ot are the linear velocity, angular velocity, and the observation from the
sensor at each time step.

3.2. DRL Methods


DRL is a combination of reinforcement learning and deep learning techniques. Re-
inforcement learning (RL) aims to enable an agent to learn the optimal behavior when
interacting with an environment by means of trial-and-error search and delayed reward.
The main elements of an RL problem include an agent, environment states, policy, re-
ward function, and a value function [33]. DRL has drawn more and more research interest in recent years. Several efficient DRL methods were discussed and analyzed in this work to
investigate the feasibility and capability of using DRL to solve collision avoidance problems.
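To make the roles of these elements concrete, the generic agent-environment interaction loop below is a minimal sketch (illustrative Python, not code from this project); env, agent, and their methods are assumed placeholders.

```python
# Generic RL interaction loop: the agent's policy picks actions, the environment
# returns the next state and a reward, and the agent learns from the transition.
def run_episode(env, agent, max_steps=500):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)                # policy
        next_state, reward, done = env.step(action)        # state transition + reward
        agent.observe(state, action, next_state, reward)   # value/policy update
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```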
Deep Q-network (DQN) [34] is a well-known method tested on several Atari games. It is proved that this method is capable of evaluating a fixed policy using a nonlinear function approximator. The DQN can be represented according to the following equation:

Li(θi) = (r + γ max_a' Q(s', a'; θi−) − Q(s, a; θi))^2    (2)

where Li(θi) is a sequence of loss functions for updating the neural network, r + γ max_a' Q(s', a'; θi−) is the target for iteration i, and θi− are the weights from the previous iteration. To improve the computational efficiency, minibatch updates are applied to samples of experience randomly selected from the memory.

A major drawback of DQN is overestimation. To improve the results, researchers introduced Double Deep Q-networks (DDQN). The main difference between DDQN and DQN is that DDQN has two sets of weights, θi and θi−. At each time step, the action is selected from the neural network with weights θi, but the evaluation of that action is from the target network with weights θi−.

Another attempt to improve the training includes extending the basic DRL methods using Prioritized Experience Replay (PER). This method samples the experience based on the priority of each experience in the replay memory. The training efficiency was proved to be increased with PER, but more parameters were introduced that increased the complexity of tuning the model.

An evaluation of the three DRL methods was performed experimentally in our previous work [32]. DQN, DDQN, and DDQN with PER were applied successively to a virtual Turtlebot running in a simulation environment developed in Gazebo [10]. The average Q-values and accumulated rewards were analyzed and demonstrated in Figures 2 and 3. Firstly, the average Q-values of DQN and DDQN were compared, and the curves (Figure 2) show that the Q-values of DQN were higher in the beginning, but the values dropped due to overestimation. Besides, the learning process with DDQN is more stable. To decide if PER is suitable for the proposed method, the DDQN-based algorithm was extended. The comparison between the DDQN with PER and the original DDQN-based method, as shown in Figure 3, indicated that the reward of DDQN with PER converged faster than that of the original DDQN but achieved a lower value in the end. Therefore, our obstacle avoidance method was developed based on DDQN due to the results and tuning complexity.

Figure 2. Average Q-values estimated by DQN and DDQN [32].

Figure 3. Rewards from the training with the Turtlebot employing DDQN and DDQN with PER, respectively [32].

3.3. Collision Avoidance with DRL

In the proposed approach, a neural network is utilized as the nonlinear function approximator to estimate the action-value function, which enhances the reinforcement learning process to a DRL problem.

The collision avoidance problem was defined as an MDP, represented by the tuple ⟨S, A, P, R, γ⟩, where the possible states s form the state space S (s ∈ S), A is an action space that contains the possible actions a (a ∈ A), P is the state transition model, R is the reward function, and γ is the discount factor. A neural network with two hidden layers is used to estimate the action-value function Q(s, a; θ) with weights θ.

To train the neural network, large-scale interaction data representing the transition (st, at, st+1, rt+1) is needed. A simulation environment with a virtual STORM prototype was developed to acquire the data and train the neural network. At each time step t, the virtual robot sends the current state st to the neural network as the input. The output is the value of each action in the current state. The virtual robot is then made to choose an action according to the decaying ε-greedy policy. The value of ε decreases from 1 with a decay rate β, as given by,

εk+1 = β × εk    (3)

where k is the epoch. ε will stop decaying when it reaches 0.05 and remain fixed. After performing the chosen action at, the robot transitions into the new state st+1 and receives a reward signal rt+1. These transitions are recorded for updating the neural network.

3.4. Implementation Details

In our proposed approach, the state space S is formed based on the observations from the LIDAR, and at each time step, the states are equal to the observation (st = Ot). The action space A consists of 11 possible actions ai = (v, ωm) with the same linear velocity v but different angular velocities ωm = −0.8 + 0.16 × m (m = 0, 1, 2, ..., 10).

The state transition model P is not required in the proposed approach, and the immediate reward at each time step is assigned based upon the following equation,

rt+1 = 5 (without collision); −1000 (collision)    (4)

When the robot experiences a collision, it receives a reward of −1000. The robot was trained for 3000 epochs. Once the time step reaches the maximum value or the robot crashes into the wall, the simulation resets with the robot at a random position, and a new epoch starts.

The neural network was implemented using Keras with TensorFlow as the backend. The inputs of the network are the preprocessed laser data, which represent the current state of the environment. Two hidden layers, each with 300 neurons, were used in the network with a ReLU activation function. The outputs are the Q-values of all 11 actions.
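The short sketch below (an illustration, not the authors' released code) collects the pieces just described: the 11-action set with ωm = −0.8 + 0.16 × m, the reward of Equation (4), and the decaying ε-greedy selection of Equation (3). The constant linear velocity value V is an assumption, since its numerical value is not stated here.

```python
import random

V = 0.25                                             # assumed linear velocity [m/s]
ACTIONS = [(V, -0.8 + 0.16 * m) for m in range(11)]  # omega_m = -0.8 + 0.16*m, m = 0..10

def reward(collided: bool) -> float:
    """Immediate reward of Equation (4)."""
    return -1000.0 if collided else 5.0

def select_action(q_values, epsilon):
    """Decaying epsilon-greedy selection over the 11 discrete actions."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda i: q_values[i])

def decay_epsilon(epsilon, beta=0.999):
    """Equation (3), with the 0.05 floor described in Section 3.3."""
    return max(0.05, beta * epsilon)
```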
To update the weights of the neural network, a mini-batch gradient descent was performed
on the loss function,
L(θ) = (1/2n) ∑_{i=1}^{n} (yi − Q(si, ai; θ))^2    (5)
where yi is the current target outputs of the action-value function, and n is the size of the
mini-batch. To avoid overestimation, the target yi was updated as,

yi = ri+1 + γ Q(si+1, max_a Q(si+1, a; θ); θ−)    (6)

where ri+1 is the immediate reward after taking action ai , and γ is the discount factor.
Another neural network Q′(s, a; θ−) is initialized with random weights θ−. This allows the action a with the maximum value to be chosen from the neural network Q(s, a; θ), while the value of that action is evaluated by the target network Q′. This in turn prevents overestimation of the action. A
similar approach was used by H. van Hasselt, A. Guez, and D. Silver in [35]. The proposed
approach is summarized in Algorithm 1.

Algorithm 1. Pseudocode of the Proposed Obstacle Avoidance Method Implemented in the Gazebo Simulator.
1: Initialize the Gazebo simulator; initialize the memory D and the Q-network with random weights θ; initialize the target Q-network with random weights θ−
2: for episode = 1, K do
3:   Put the virtual robot at a random position in the 3D world; get the first state s1
4:   for t = 1, T do
5:     With probability ε select a random action at
6:     Otherwise, select at = argmax_a Q(st, a; θ)
7:     Take action at; get reward rt+1 and state st+1
8:     Store transition (st, at, st+1, rt+1) in D
9:     Sample a random mini-batch of transitions (si, ai, si+1, ri+1)
10:    if ri+1 = −1000 then
11:      yi = ri+1
12:    else
13:      yi = ri+1 + γ Q(si+1, max_a Q(si+1, a; θ); θ−)
14:    end if
15:    Update θ by performing mini-batch gradient descent on the loss L(θ) = (yi − Q(si, ai; θ))^2 over the mini-batch
16:    Replace the target network parameters θ− ← θ every N steps
17:  end for
18: end for
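As a hedged illustration of lines 9–15 of Algorithm 1, the sketch below implements the double-DQN target of Equation (6) and a mini-batch update in Keras, which the paper names as its framework. The layer sizes follow Section 3.4 (50 preprocessed laser readings in, two hidden layers of 300 ReLU units, 11 Q-values out); the optimizer, discount factor, and all identifiers are assumptions rather than the authors' code.

```python
import numpy as np
from tensorflow import keras

STATE_DIM, N_ACTIONS, GAMMA = 50, 11, 0.99   # gamma is an assumed value

def build_q_network():
    return keras.Sequential([
        keras.layers.Input(shape=(STATE_DIM,)),
        keras.layers.Dense(300, activation="relu"),
        keras.layers.Dense(300, activation="relu"),
        keras.layers.Dense(N_ACTIONS, activation="linear"),
    ])

q_net = build_q_network()                     # weights theta
target_net = build_q_network()                # weights theta^- (synced every N steps, line 16)
target_net.set_weights(q_net.get_weights())
q_net.compile(optimizer="adam", loss="mse")   # optimizer/loss scaling are assumptions

def ddqn_train_step(batch):
    """One mini-batch update following lines 9-15 of Algorithm 1."""
    s, a, s_next, r = (np.asarray(x) for x in batch)    # a holds integer action indices
    # Select the next action with the online network, evaluate it with the target network.
    best_next = np.argmax(q_net.predict(s_next, verbose=0), axis=1)
    q_next = target_net.predict(s_next, verbose=0)[np.arange(len(r)), best_next]
    y = np.where(r == -1000, r, r + GAMMA * q_next)     # collision is terminal (lines 10-11)
    targets = q_net.predict(s, verbose=0)
    targets[np.arange(len(r)), a] = y                   # only the taken action's target changes
    q_net.train_on_batch(s, targets)
```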

4. Robot Platform Description


This section provides detailed information about the robot platform. It is a locomotion
module that belongs to a robotic system named Self-reconfigurable and Transformable Omni-
Directional Robotic Modules (STORM) [13,36]. A virtual STORM module was developed
according to the mechanical design of the locomotion module, as described in Section 4.1.
The electronic system and the software architecture designed for further extension of the
ongoing research STORM are presented in Sections 4.2 and 4.3, respectively.

4.1. Mechanical System


Multi-directional mobility was taken into consideration when designing the mechani-
cal system to improve the spatial alignment capabilities, thereby improving autonomous
docking capabilities [12]. As such, the locomotion module essentially consists of two
tracked units, two-wheeled units, and a vertical translational mechanism (VTM) that
toggles between the two modes of locomotion, as shown in Figure 4.
The wheeled units can be translated down to engage them, allowing for higher speed and better performance in position control. The tracked units can be engaged by retracting the wheeled units to allow for better traction while operating in an unstructured environment. Since the different modes are aligned perpendicular to each other, switching between the modes allows the robot to move sideways as well. Furthermore, the vertical translation structure housing the wheeled unit provides vertical mobility as well. Having varied modes of locomotion provides a wide variety of capabilities to the locomotion module, including translation along the X, Y, and Z axes. This improves the spatial alignment capabilities of STORM for autonomous docking applications. The main specifications [11] of the prototype are summarized in Table 2.

Figure 4. The architecture of the STORM locomotion module. The tracked unit, wheeled unit, and the vertical translational mechanism are indicated in the figure. (a) Tracked mobility mode. (b) Wheeled mobility. Note that the tracks are lifted by the vertical translational mechanism. (c) Various sensors available on STORM.

Table 2. Robot parameters.

Characteristic | Parameters
Outer dimensions | 410 × 305 × 120 mm³
Vertical translation | 50 mm
Robot mass | 8.40 kg
TU mass | 2 × 2.81 kg
WU mass | 2 × 0.76 kg
VTU | 0.73 kg

4.2. Electronic System

The electronic system is designed to enable the robot to perform diverse tasks. The electronic hardware architecture is presented in Figure 5.

Figure 5. The electrical architecture of STORM.

The internal electronic system can be divided into three subsystems: control, sensing,
and actuation. All the components built-in or mounted on the robot platform are shown
inside the dashed line (Figure 5), classified as the internal hardware group. A worksta-
tion computer and a wireless router belong to the external hardware group. Optionally,
an Operator Control Unit (OCU), such as a joystick, is able to control the motions of the
robot module via Bluetooth. The core of the mobile robot is the control system which
contains a single-board computer (ODROID-XU4) and two microcontrollers (TEENSY 2.0).
The ODROID-XU4 has an ARM CPU and 2 GB of high-speed RAM. The TEENSY 2.0 micro-
controller has an 8-Bit AVR Processor running at 16 MHz (ATMEGA32U4). The electronic
control system is in charge of sending commands, acquiring sensor data and communicat-
ing with the external hardware or other robots if required. The main components of the
actuation system are five motors. Each tracked unit is actuated by a high-torque Maxon
EC motor with an encoder. Maxon Flat EC motors with hall sensors drive the wheeled
units. A DC motor is used to drive the VTM to provide vertical mobility. A potentiometer
is attached to monitor the position of the VTM and gives feedback to a PID controller for
precise height control. The sensing system has four ultrasonic sensors, two USB cameras,
a Hokuyo LIDAR, and a 9-axis motion tracking device. The LIDAR is mounted on top of
the VTU to assist with autonomous navigation.
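To make the height-control loop mentioned above concrete, the following is a generic PID sketch rather than the STORM firmware; the gains, update rate, and units are assumptions.

```python
class PID:
    """Simple PID controller: potentiometer height reading in, motor command out."""
    def __init__(self, kp=2.0, ki=0.1, kd=0.05, dt=0.01):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: drive the VTM toward a 30 mm extension given a 12 mm potentiometer reading
# command = PID().update(setpoint=0.030, measurement=0.012)
```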

4.3. Software System


A three-layer software architecture is designed to manage different controllers for
various subtasks of the STORM system. These controllers can be developed in parallel
without the complete knowledge of the software architecture, such as the previously
developed autonomous docking controller [16], the path planner [14], and the trajectory
tracking controller [15].
The software architecture is developed in a publish-subscribe pattern based on the Robot Operating System (ROS). As shown in Figure 6, the software architecture consists of three layers: the actuation layer, the platform layer, and the control layer. This architecture divides the overall software system into smaller modules. This feature, together with the publish-subscribe pattern, enables modular design and parallel development.

Figure 6. Software architecture: The various programs are divided into three different layers.

The software in the actuation layer is embedded in the microcontroller. The commands from the higher-level controllers can be published to the command manager in the platform layer. The command manager converts them into an 8-bit unsigned integer array for sending the enable signal for each of the individual sensors as well as speed information for the motors. All the components in the actuation layer are programmed as state machines operating on
the commands from the command manager. They transit between standby and execution
states based on the commands.
Due to the different properties of the various sensors, the software for the sensing
system is separated into the actuation layer and the platform layer. The ultrasonic sensors,
potentiometer, and the 9-axis IMU are interfaced with the microcontrollers. The raw
data from these sensors are published to the status reporter together with the feedback
information about the motors. The cameras, LIDAR, and an optional localization device
named Pozyx [37] are directly connected to the single-board computer. The software
processes to interface these sensors are in the platform layer. Each sensor publishes out a
topic for acquiring and sending the raw data. The high-level controllers can then subscribe
to the status reporter and the sensor topics for any required information.
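As a small illustration of this publish-subscribe pattern, the rospy node below subscribes to a laser topic and publishes velocity commands; the topic names, message types, and the placeholder decision rule are assumptions, not the actual STORM controllers.

```python
import rospy
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

def scan_callback(scan: LaserScan) -> None:
    # Placeholder rule: slow down and turn if anything is closer than 0.5 m.
    cmd = Twist()
    too_close = min(scan.ranges) < 0.5
    cmd.linear.x = 0.1 if too_close else 0.25
    cmd.angular.z = 0.4 if too_close else 0.0
    cmd_pub.publish(cmd)

rospy.init_node("obstacle_avoidance_controller")
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
rospy.Subscriber("/scan", LaserScan, scan_callback)
rospy.spin()
```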
All the software in the platform layer and the actuation layer is designed to be reusable.
The implementation of the control layer represents the extensible nature of the software
architecture. The architecture allows for different control methods to be developed in parallel. Each method can choose the required information provided by the status reporter and the sensor processes running on the single-board computer. It is possible that some high-level controllers may require more computing power than what is provided by the single-board computer. For instance, the neural network in the proposed obstacle avoidance approach cannot be implemented on the single-board computer. Instead, it is implemented on a workstation, and the platform layer is made to subscribe to the topics published by the workstation through ROS.

5. Simulations and Results

To train and test the neural network, various simulation environments and a virtual STORM locomotion module were developed in the Gazebo simulator [10]. It is critical to map the obtained linear and angular velocities from the high-level commands to the velocities of each track or wheel of the STORM locomotion module in the simulation as well as in the real-world experiments. As such, a skid-steering kinematic model of the STORM locomotion module was developed and presented in the following subsection.

5.1. Kinematic Model of the Virtual STORM Module

The locomotion module can be regarded as a tracked mobile robot when the VTU is lifted. When the VTU is lowered, the robot is considered to be in wheeled mobile robot mode. However, in both cases, it operates as a skid-steer mobile robot, as shown in Figure 7. The kinematics for this system can be represented as follows:

(vL, vR) = f(v, ω)    (7)

where v = (vx, vy) is the linear velocity of the robot; vx and vy are the components of v along the X and Y axes, respectively; ω is the angular velocity of the robot; vL and vR are the linear velocities of the left and right tracked or wheeled mobility units with respect to the local frame of reference.

Figure 7. Schematic figures of the locomotion module: (a) tracked mobility mode with ICR locations and (b) wheeled mobility with ICR locations are shown.
Any planar movement of a rigid body can be regarded as a rotation about a single
point, defined as the instantaneous center of rotation (ICR). The skid-steer mobile robot
during a turning maneuver can be considered as a rigid body that is rotating about its ICR.
The ICRs of the left and right treads are different from the ICR of the body of the robot.
According to Kennedy's Theorem, all three ICRs have to be aligned in a straight line [38].
In Figure 7, C1 = ( xc1 , yc1 ) is the ICR of the STORM locomotion module, and C2 = ( xc2 , yc2 )
and C3 = ( xc3 , yc3 ) are the ICRs of the left and right wheeled or tracked units, respectively.
Based on the calculated ICRs, the linear velocities of the robot can be calculated as follows:

vL = −|xc2| ω + vy,  vR = |xc3| ω + vy    (8)
In the above model, the local x coordinates xc2 and xc3 were estimated via experiments in simulation. Known values of vL and vR were given to the robot, and the resulting ω and vy were recorded. Using Equation (8), the average values of xc2 and xc3 were calculated from the recorded data. The estimated values are as follows: in the tracked mobility mode, |xc2| = 0.29 m and |xc3| = 0.3 m; in the wheeled mobility mode, |xc2| = 0.14 m and |xc3| = 0.15 m.
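A small sketch of the mapping in Equation (8), using the tracked-mode ICR offsets estimated above, might look as follows (illustrative code, not the authors' implementation):

```python
def body_to_track_velocities(v_y: float, omega: float,
                             x_c2: float = 0.29, x_c3: float = 0.30):
    """Map forward speed v_y [m/s] and yaw rate omega [rad/s] to left/right
    track speeds per Equation (8); defaults are the tracked-mode ICR offsets."""
    v_left = -abs(x_c2) * omega + v_y
    v_right = abs(x_c3) * omega + v_y
    return v_left, v_right

# Example: 0.2 m/s forward with a gentle left turn
# print(body_to_track_velocities(0.2, 0.3))
```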
With the knowledge of the kinematic relations of the locomotion module, the virtual STORM module with a 2D laser scanner was simulated in Gazebo, as shown in Figure 8. The LIDAR takes 512 measurements in a 270° span. A set of 50 range measurements are selected evenly from the sensor. The distance measurements are preprocessed to fall within a range of 0.00 m to 5.00 m.

Figure 8. Simulation environment. A virtual STORM locomotion module with a 2D range sensor in the simulation environment.
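A minimal sketch of this preprocessing step (an assumed implementation, not the authors' code) is shown below:

```python
import numpy as np

def preprocess_scan(ranges):
    """Turn a raw 512-beam scan into the 50-dimensional network input."""
    scan = np.asarray(ranges, dtype=np.float32)           # shape (512,)
    scan = np.nan_to_num(scan, nan=5.0, posinf=5.0)       # treat no-return as max range
    idx = np.linspace(0, len(scan) - 1, 50).astype(int)   # 50 evenly spaced beams
    return np.clip(scan[idx], 0.0, 5.0)                   # clamp to [0.00, 5.00] m
```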

The sensor data and state information were obtained from the simulator and then provided to the neural network as inputs. The commands to the virtual robot were sent based on the outputs of the neural network. Communication was done through publishing and subscribing to ROS topics.

The proposed collision avoidance method is general in the sense that it can be applied to both the tracked mobility mode and the wheeled mobility mode of the locomotion module. The trainings and experiments in both the simulation environments and the real-world implementation were conducted with the robot in the tracked mode.

5.2. Training Parameter Selection and Analysis

The proposed approach uses an ε-greedy policy to choose the appropriate action. This policy allows the selection of a random action with probability ε. Otherwise, the robot will select the action with the highest value from the neural network. At the beginning of the training, ε is set to 1, and the robot always selects a random action to acquire varied training data for updating the neural network. This is defined as the exploration stage. However, the robot cannot always select a random action, as it also needs to exploit the actions. With a probability of (1 − ε), the robot will follow the output of the current neural network and choose the action with the highest value in the current state. If the robot can receive a similar reward to what it gets in the recorded transitions, the neural network is understood as well-trained. If not, the neural network needs more data to exploit. The decay rate β of the probability ε is used for balancing the exploration and the exploitation. A lower β means that the robot chooses fewer random actions when it meets the same environment state. It relies more on the previous experiences it had already obtained. The value of ε will stop decaying after it reaches 0.05, which means that the robot will end up choosing a random action with a probability of 5%. To demonstrate the influence of the parameters β and ε, as well as to evaluate the performance of the virtual STORM, a test of the virtual robot was set up in the simulation as follows:
1. The training process was performed three times in Map 2, as shown in Figure 9b,
with the decay rate β set to 0.997, 0.998, and 0.999, respectively. Each neural network
was trained for 3000 epochs for the following reasons: (a) All the three tests reach
the lowest value of ε to choose a random action after 3000 epochs, for example,
when β = 0.999, ε reaches 0.05 after log_0.999(0.05) ≈ 2994 epochs (a short numerical check follows this list). (b) The current work did
not include the improvement of the training efficiency. Training the neural network
for 3000 epochs keeps the entire test within an acceptable training time, while at the
same time ensuring that reasonable results can be obtained according to our previous
experiences when training the neural network in [32].
2. The three neural networks were applied to the virtual STORM successively to navigate the test map with similar features to Map 2, as shown in Figure 10, for five minutes. The metric for evaluating the performance of the neural networks is chosen to be the number of collisions undergone within five minutes of simulation.
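For reference, the epoch count quoted in item 1 can be verified with a short computation (illustrative only):

```python
import math

beta = 0.999
epochs_to_floor = math.log(0.05) / math.log(beta)  # epsilon reaches 0.05 after ~2994 epochs
print(round(epochs_to_floor))  # -> 2994
```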

Figure 9. The training maps. (a) Map 1: a simple corridor-like path with some problematic scenarios. (b) Map 2: a complex circuit with different environmental features such as straight tracks, 90-degree turns, and acute and obtuse angle corners.

5.3. Comparison with State-of-the-Art Method


To demonstrate the advantages of the proposed DRL based planner, it was compared
with the DWA local planner [30] in the virtual world, as shown in Figure 9a. There are
mainly two reasons why this map was developed. The first one is to serve as the testbed
for comparing the proposed method with the dynamic window-based method. The second
one is to address the problem stated in the introduction (Figure 1) by adding features in each
corner of the map, which aims to demonstrate that the proposed method has the ability to
avoid the problem mentioned in the introduction section.

Figure 10. The test map with similar features to that in Map 2.
Robotics 2021, 10, 73 14 of 19
Figure 9. The training maps. (a) Map 1: a simple corridor-like path with some problematic scenarios. (b) Map 2: a complex
circuit with different environmental features such as straight tracks, 90-degree turns, and acute and obtuse angle corners.

Figure 10. The test map with similar features to that in Map 2.
Figure 10. The test map with similar features to that in Map 2.

The proposed method allows the robot to avoid obstacles based on the information from the sensor directly. In contrast, the DWA local planner is based on a dynamic window approach: besides the data from the laser sensor, a costmap, generated according to odometry information, is required for the DWA planner to predict safe motion. The feasibility of the path from the DWA local planner therefore relies heavily on the precision of the provided costmap. This is especially true when the robot is navigating through narrow passageways. Consequently, drift in the odometry data may cause collisions under the DWA local planner. As such, the proposed method is superior to the DWA planner since it relies only on the sensor data as input.
5.4. Simulation Results
The results of the simulation are summarized and analyzed in this section. Figure 11
shows the training results to compare the performances of the neural network with different
decay rates.
From the results, it can be observed that with a lower value of β in Equation (3), the neural network tended to obtain higher rewards faster. However, the rewards obtained during training with a low value of β were not stable. This is because the robot chose the action with the highest Q-value according to the current neural network with a high probability, yet such an action was often overestimated early on due to the insufficient experience acquired by that point, and its Q-value decreased with further exploitation. A higher value of β caused the robot to explore more actions at first. This allowed the robot to try different actions when it encountered the same situation and determine which one was better. The reward curves show that the learning process with the highest β was slow but increased steadily. It should be pointed out that, during training, the robot cannot achieve the highest possible score of navigating the training map for one full round without any collisions, as shown in the results, because it still selected a random action with a probability of 5%. However, after training, the robot with each neural network was able to navigate the training map without collisions when it always took the action with the highest value according to the outputs of that neural network, i.e., without the probability of choosing random actions. In conclusion, a faster decay of the probability of choosing a random action (i.e., a lower β) causes the robot to learn quickly, but the action values output by the neural network are not stable. One can decrease the value of β to quickly evaluate whether the actions already learned are proper, or increase it to obtain more stable results at the cost of a longer training time.
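The ε-greedy action selection described above can be summarized in the short sketch below; the function name and the use of a NumPy vector of Q-values are assumptions for illustration rather than the actual implementation. During training, ε decays toward 0.05, whereas at evaluation time the purely greedy action (ε = 0) is used, matching the post-training behavior described above.

```python
import random

import numpy as np


def select_action(q_values: np.ndarray, epsilon: float) -> int:
    """Epsilon-greedy selection: with probability epsilon pick a random action (exploration), otherwise pick the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return int(np.argmax(q_values))             # exploit
```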
The performance of the robot with the different neural networks over a duration of five minutes in the test map is shown in Table 3. It proved that the DRL-based obstacle avoidance method has the potential to deal with similar situations it had already learned. Further tests of the robustness of the proposed controller were conducted in the real world.

Figure 11. Performance comparisons of the neural network with different decay rates: in the first three plots, the blue lines indicate the reward values, whereas the red line is the data processed by a moving average filter with a window size of 500. The blue, red, and green lines in the last plot stand for the trainings with different decay rates.
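The reward smoothing mentioned in the caption of Figure 11 can be reproduced with a simple moving-average filter such as the sketch below; since the exact filter variant (trailing or centered window) is not specified, this 'valid'-mode convolution is only one plausible choice.

```python
import numpy as np


def smooth_rewards(rewards, window: int = 500) -> np.ndarray:
    """Moving-average filter with a 500-sample window, as described in the Figure 11 caption."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")
```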

Table 3. Performance of obstacle avoidance approach.

Decay Rate β        Number of Collisions in 5 min
0.999               0
0.997               1
0.995               2
Figure 12 shows a sequence of frames captured from the simulated environment with a virtual STORM locomotion module under the DRL-based obstacle avoidance controller. For the simulated case, the virtual robot maneuvers through the narrow corridor-like maze successfully without any collisions. In comparison, another sequence of frames captured from the simulation is shown in Figure 13 with the locomotion module under the DWA local planner. The trajectories of the robot under the action of the two different planners are compared in Figure 14. As seen from Figure 13b, the robot goes out of the maze
due to the lack of environmental information in order to plan a global path in advance.
After a recovery behavior, the robot comes up with a new plan and goes in the correct
direction, as shown in Figure 13c,d. It should be noted that at one of the corners of the
maze, as shown in Figure 13e,f, the robot almost bumped into the wall and failed to come
up with a reliable local path. The reason is that the costmap generated from the odometer
data was not precise at that specific time step, as shown in Figure 14b. This causes the robot
to spin around and collect more information and thereby plan a new path. In conclusion,
the proposed DRL-based method is capable of planning a proper motion for the mobile
robot solely based on the data collected from the laser sensor directly. This in turn reduces
the inaccuracy caused by other processes, such as costmap generation.

Figure 12. STORM locomotion module with the proposed DRL-based planner initialized at the (a) starting position, successfully passes the corners with special features shown in (b,d,e) and the corridors shown in (c,f,g), and finally goes to the goal position in the virtual environment.

Figure 13. STORM locomotion module with the DWA local planner initialized at the (a) starting position goes out of the maze due to the lack of environmental information. In a recovery behavior, the robot comes up with a new plan and goes in the correct direction, as shown in (b–d). At one of the corners of the maze, as shown in (e–g), the robot almost bumped into the wall and failed to come up with a reliable local path. It finally goes to the (h) goal position.

Figure 14. (a) Comparison between the trajectories of the proposed DRL-based planner and the DWA local planner. (b) The costmap fed into the DWA planner at the moment the robot fails to produce a path.

5.5. Real-World Implementation


The obstacle avoidance controller for the STORM prototype was embedded in the control layer. The neural network trained with β = 0.999 was selected as it offered the best performance. As mentioned before, the trained neural network was made to run on a workstation since it was beyond the computational power of the onboard single-board computer. The neural network generated the command a_i = (v, ω_m) with an update rate of 10 Hz. These commands were subscribed to by the command manager on board the STORM locomotion module and converted into the proper form for the actuation layer.
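To make this data flow concrete, the sketch below shows one plausible form of the conversion performed by the command manager, assuming an idealized differential-drive (skid-steer) relation; the track width, function name, and example values are illustrative assumptions, and the actual command manager may apply calibrated, slip-compensated kinematics instead.

```python
def to_track_speeds(v: float, omega: float, track_width: float = 0.4):
    """Map a body-frame command (v, omega) to left/right track speeds using the ideal differential-drive relation."""
    v_left = v - 0.5 * omega * track_width
    v_right = v + 0.5 * omega * track_width
    return v_left, v_right


# One 10 Hz command from the neural network (values purely illustrative):
v_l, v_r = to_track_speeds(v=0.3, omega=0.5)
```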
As mentioned before, the neural network found optimal actions based on the sensor
observations in the simulated world, but the map was the same as that used for training.
In order to validate the ability of the robot to travel without collisions, not only in the
same map as the training case but also in environments with different features, three maps
different from the training scenarios were built to test the robot. One of the maps had a
circular feature with different diameters as compared to the simulated training environment.
Figure 15 shows the three test maps and the trajectories followed by the robot when
traveling through the maps. The robot traveled in each of the three maps for five minutes
without collision. This proved that the proposed approach could handle situations different from the training scenario provided in the 3D simulation.

Figure 15. The test maps and the trajectories of the robot.
6. Conclusions and Future Work


The DRL-based method shows the potential to improve the performance of the robot when avoiding obstacles in highly occupied environments and problematic scenarios. The proposed DRL architecture was trained inside a simulated Gazebo environment to allow for sufficient data collection over varied features without any damage to the real robot. The results from the simulations show that more exploration during training leads to steadily increasing rewards at the cost of a slower training speed. The trained DRL architecture was tested in real-world scenarios with the STORM locomotion module. The experimental results show that the proposed approach performed well in previously unseen scenarios that differ considerably from the training scenario, thereby demonstrating the generalized nature of the trained architecture.
Future work will incorporate navigation to a specified goal position without a prior map, along with the obstacle avoidance capabilities. The maps developed in the Gazebo simulator aimed to emphasize compact spaces and problematic scenarios; the training environment should become more complex and involve more varied features to enable robust performance of the robot. Further training and investigations will include multi-stage training with a shared memory set and maps of gradually increasing difficulty, in contrast to learning directly from complex environments. In addition, training efficiency will be taken into consideration when improving the proposed obstacle avoidance method. To a larger extent, this work is a step toward enabling multi-robot exploration and navigation in unknown environments for the STORM system. As part of future work, new simulation environments with moving obstacles and uneven terrain will be considered to enrich the experience of the trained neural network.

Author Contributions: Methodology, S.F.; project administration, P.B.-T.; supervision, P.B.-T.; valida-
tion, S.F.; writing—original draft, S.F.; writing—review and editing, B.S. and P.B.-T. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al.
Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815.
2. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found.
Trends Mach. Learn. 2018, 11, 219–354. [CrossRef]
3. Moreira, I.; Rivas, J.; Cruz, F.; Dazeley, R.; Ayala, A.; Fernandes, B. Deep reinforcement learning with interactive feedback in a
human-robot environment. Appl. Sci. 2020, 10, 5574. [CrossRef]
4. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-
policy updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore,
29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3389–3396. [CrossRef]
5. Pirník, R.; Hruboš, M.; Nemec, D.; Mravec, T.; Božek, P. Integration of inertial sensor data into control of the mobile platform.
Adv. Intell. Syst. Comput. 2017, 511, 271–282.
6. Kilin, A.; Bozek, P.; Karavaev, Y.; Klekovkin, A.; Shestakov, V. Experimental investigations of a highly maneuverable mobile
omniwheel robot. Int. J. Adv. Robot. Syst. 2017, 14, 1–9. [CrossRef]
7. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using
deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA),
Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 3357–3364.
8. Tai, L.; Paolo, G.; Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation.
In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada,
24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 31–36.
9. Ulrich, I.; Borenstein, J. VFH*: Local obstacle avoidance with look-ahead verification. In Proceedings of the Proceedings 2000 ICRA.
Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065),
San Francisco, CA, USA, 24–28 April 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 3, pp. 2505–2511.
10. Gazebo. Available online: http://gazebosim.org/ (accessed on 12 May 2017).


11. Kumar, P.; Saab, W.; Ben-Tzvi, P. Design of a Multi-Directional Hybrid-Locomotion Modular Robot With Feedforward Stability
Control. In Proceedings of the Volume 5B: 41st Mechanisms and Robotics Conference, Cleveland, OH, USA, 6–9 August 2017;
ASME: New York, NY, USA, 2017; p. V05BT08A010. [CrossRef]
12. Ben-Tzvi, P.; Saab, W. A hybrid tracked-wheeled multi-directional mobile robot. J. Mech. Robot. 2019, 11, 1–10. [CrossRef]
13. Moubarak, P.M.; Alvarez, E.J.; Ben-Tzvi, P. Reconfiguring a modular robot into a humanoid formation: A multi-body dynamic
perspective on motion scheduling for modules and their assemblies. In Proceedings of the 2013 IEEE International Conference
on Automation Science and Engineering (CASE), Madison, WI, USA, 17–21 August 2013; IEEE: Piscataway, NY, USA, 2013;
pp. 687–692. [CrossRef]
14. Sebastian, B.; Ben-Tzvi, P. Physics Based Path Planning for Autonomous Tracked Vehicle in Challenging Terrain. J. Intell. Robot.
Syst. Theory Appl. 2018, 1–16. [CrossRef]
15. Sebastian, B.; Ben-Tzvi, P. Active Disturbance Rejection Control for Handling Slip in Tracked Vehicle Locomotion. J. Mech. Robot.
2018, 11, 021003. [CrossRef]
16. Sohal, S.S.; Saab, W.; Ben-Tzvi, P. Improved Alignment Estimation for Autonomous Docking of Mobile Robots. In Proceedings of the
Volume 5A: 42nd Mechanisms and Robotics Conference, Quebec City, QC, Canada, 26–29 August 2018; ASME: New York, NY, USA,
2018; p. V05AT07A072. [CrossRef]
17. Hart, P.; Nilsson, N.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci.
Cybern. 1968, 4, 100–107. [CrossRef]
18. Lozano-Pérez, T.; Wesley, M.A. An Algorithm for Planning Collision-Free Paths Among Polyhedral Obstacles. Commun. ACM
1979, 22, 560–570. [CrossRef]
19. Khatib, O. Real-Time Obstacle Avoidance for Manipulators and Mobile Robots. Int. J. Rob. Res. 1986, 5, 90–98. [CrossRef]
20. Brock, O.; Khatib, O. High-speed navigation using the global dynamic window approach. In Proceedings of the Proceedings 1999
IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C), Detroit, MI, USA, 10–15 May 1999; IEEE:
Piscataway, NJ, USA, 1999; Volume 1, pp. 341–346.
21. Koren, Y.; Borenstein, J. Potential field methods and their inherent limitations for mobile robot navigation. In Proceedings of the
Proceedings. 1991 IEEE International Conference on Robotics and Automation, Sacramento, CA, USA, 7–12 April 1991; IEEE:
Piscataway, NJ, USA, 2016; Volume 11, pp. 1398–1404.
22. Borenstein, J.; Koren, Y. The vector field histogram-fast obstacle avoidance for mobile robots. IEEE Trans. Robot. Autom. 1991, 7,
278–288. [CrossRef]
23. Faisal, M.; Hedjar, R.; Al Sulaiman, M.; Al-Mutib, K. Fuzzy logic navigation and obstacle avoidance by a mobile robot in an
unknown dynamic environment. Int. J. Adv. Robot. Syst. 2013, 10, 37. [CrossRef]
24. Pothal, J.K.; Parhi, D.R. Navigation of multiple mobile robots in a highly clutter terrains using adaptive neuro-fuzzy inference
system. Rob. Auton. Syst. 2015, 72, 48–58. [CrossRef]
25. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [CrossRef]
26. Dechter, R.; Pearl, J. Generalized Best-First Search Strategies and the Optimality of A. J. ACM 1985, 32, 505–536. [CrossRef]
27. Kavraki, L.E.; Svestka, P.; Latombe, J.-C.; Overmars, M.H. Probabilistic roadmaps for path planning in high-dimensional
configuration spaces. IEEE Trans. Robot. Autom. 1996, 12, 566–580. [CrossRef]
28. Elbanhawi, M.; Simic, M. Sampling-based robot motion planning: A review. IEEE Access 2014, 2, 56–77. [CrossRef]
29. Ulrich, I.; Borenstein, J. VFH+: Reliable obstacle avoidance for fast mobile robots. Proc. IEEE Int. Conf. Robot. Autom. 1998, 2, 1572–1577.
30. Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; Levine, S. Composable deep reinforcement learning for robotic manipulation.
arXiv 2018, arXiv:1803.06773v1.
31. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous Navigation of UAVs in Large-Scale Complex Environments: A Deep
Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [CrossRef]
32. Feng, S.; Ren, H.; Wang, X.; Ben-Tzvi, P. Mobile robot obstacle avoidance base on deep reinforcement learning. Proc. ASME Des.
Eng. Tech. Conf. 2019, 5A-2019, 1–8.
33. Sutton, R.S.; Barto, A.G. Chapter 1 Introduction. Reinf. Learn. An Introd. 1988. [CrossRef]
34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep
Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
35. Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. Assoc. Adv. Artif. Intell. 2016, 30,
2094–2100.
36. Saab, W.; Ben-Tzvi, P. A Genderless Coupling Mechanism with 6-DOF Misalignment Capability for Modular Self-Reconfigurable
Robots. J. Mech. Robot. 2016, 8, 1–9. [CrossRef]
37. POZYX Positioning System. Available online: https://www.pozyx.io/ (accessed on 23 February 2018).
38. Mandow, A.; Martinez, J.L.; Morales, J.; Blanco, J.L.; Garcia-Cerezo, A.; Gonzalez, J. Experimental kinematics for wheeled
skid-steer mobile robots. IEEE Int. Conf. Intell. Robot. Syst. 2007, 1222–1227. [CrossRef]
