Article
A Collision Avoidance Method Based on Deep
Reinforcement Learning
Shumin Feng 1, Bijo Sebastian 2 and Pinhas Ben-Tzvi 1,*
1 Robotics and Mechatronics Lab, Department of Mechanical Engineering, Virginia Polytechnic Institute and
State University, Blacksburg, VA 24061, USA; [email protected]
2 Torc Robotics, Blacksburg, VA 24060, USA; [email protected]
* Correspondence: [email protected]; Tel.: +1-(540)-231-6938
Abstract: This paper sets out to investigate the usefulness of solving collision avoidance problems
with the help of deep reinforcement learning in an unknown environment, especially in compact
spaces, such as a narrow corridor. This research aims to determine whether a deep reinforcement
learning-based collision avoidance method is superior to traditional methods, such as potential
field-based methods and the dynamic window approach. In addition, the proposed obstacle avoidance
method was developed as one of the capabilities to enable each robot in a novel robotic system,
namely the Self-reconfigurable and Transformable Omni-Directional Robotic Modules (STORM),
to navigate intelligently and safely in an unknown environment. A well-conceived hardware and
software architecture with features that enable further expansion and parallel development, designed
for the ongoing STORM projects, is also presented in this work. A virtual STORM module with
skid-steer kinematics was simulated in Gazebo to reduce the gap between the simulations and the
real-world implementations. Moreover, comparisons among multiple training runs of the neural
networks with different parameters related to balancing exploitation and exploration during the
training process, as well as tests and experiments conducted in both simulation and the real world,
are presented in detail. Directions for future research are also provided in the paper.
navigation problems, namely obstacle avoidance, in addition to analyzing and comparing
some state-of-the-art autonomous navigation methods, especially obstacle avoidance
approaches, theoretically and experimentally.
In this work, we solve the obstacle avoidance problem using DRL-based methods to
address the shortcomings of existing state-of-the-art local path planning techniques.
Specifically, the proposed work aims to improve the robot's performance in problematic
situations [9], as mentioned below.
It is a common problem with existing state-of-the-art methods to trigger the obstacle
avoidance maneuver within a specific distance, commonly referred to as the triggering
distance. When an obstacle or multiple obstacles are detected within that distance, the
local path planners will send commands to the robot to steer away from the obstacles.
It should be noted that the triggering distance is different from the maximum range of
the sensor. This is illustrated in Figure 1. As described in the figure, the robot has
information about the environment within the maximum range of the sensor, denoted in
red, but chooses to act solely based on the information available within the triggering
distance, denoted in blue. As a result, this information waste leads to an improper
inference of the nature of the environment, and the robot will choose to steer to
directions A and B with equal probability. However, only direction A is the correct option.
Figure 1. Robot in a narrow corridor. The blue circle indicates the trigger distance. Both A and B are
the open areas detected by the autonomous robot at the current time step.
The major limitation of the proposed technique is the fact that to train the neural
network architecture, a large dataset is required. To address this issue, a simulated
environment was developed in Gazebo [10] to collect the data. Several different features,
especially scenarios with compact spaces and problematic situations, as mentioned above,
were constructed in the simulation environments to train the neural networks.
Another objective of the proposed DRL-based obstacle avoidance technique is to
enable autonomous navigation of the Self-reconfigurable and Transformable Omni-Directional
Robotic Modules (STORM) [11–13]. Several related projects [14–16] of the STORM system
were developed in parallel, which increases the requirements of the hardware and software
architecture of STORM. In order to efficiently handle the complexity while allowing for
further expansion and parallel development, the software and electronic architecture of
the STORM system are required to be even more advanced. A well-conceived hardware and
software architecture was developed for STORM, as also detailed in this paper. In addition,
a virtual STORM module with skid-steer kinematics according to the mechanical design of
the locomotion module of STORM was developed in Gazebo in order to reduce the gap
between the simulations and the real-world experiments.
In summary, the major contributions of this paper can be listed as follows: (1) developing
a DRL-based solution to the obstacle avoidance problem, exploring and evaluating the
influence of selecting different critical parameters for the training process; (2) addressing
the compact spaces and problematic scenarios when solving the obstacle avoidance problem
and comparing the performance with the traditional dynamic window-based method
in simulation; (3) developing a virtual STORM module in Gazebo in order to reduce the
gap between the simulations and the real-world experiments; (4) developing electronic
and software architectures for STORM to enable the parallel design of different controllers
for the STORM system.
2. Related Works
2.1. Autonomous Navigation
In this section, we review and compare the state-of-the-art autonomous navigation
techniques theoretically as background. Several common drawbacks of the previous
methods are pointed out, with the intention of developing and improving our methods to
avoid these issues.
In general, the autonomous navigation problem can be subdivided into several sub-
tasks, including localization, path planning, and obstacle avoidance. A global path planner
is commonly used to plan an optimal collision-free path from the start to the goal position
for the robot in advance, provided prior knowledge of the environment is available in the
form of a map. Some of the commonly used global path planners include the A* searching
algorithm [17] or the visibility graph method [18]. Despite the ability to find optimal paths,
global path planning methods suffer from a major drawback. Prior knowledge of the
environment is always required for global path planning. As a result, the derived path
will not be reliable if the map changes or if the map is not accurate enough. To this extent,
local path planners were developed for mobile robots to avoid obstacles based on real-time
sensory information. Methods such as potential fields (PF) [19] and dynamic window
approach [20] can be utilized to avoid collisions in a dynamic environment.
Although potential field-based methods are straightforward and capable of over-
coming unknown scenarios, they have some significant drawbacks [21]: the robot can get
trapped in a local minimum away from the goal, the robot may not be able to find a path
between two closely spaced obstacles, and oscillations can occur when the robot attempts to
move through narrow passages. To overcome these shortcomings, modified versions of the
potential fields method were developed, such as the Virtual Force Field (VFF) method [21]
and the Vector Field Histogram (VFH) [22]. Although the above-mentioned drawbacks are
handled by these methods, deciding upon the direction of motion can be difficult for
these methods, particularly in some problematic scenarios [9]. Some fuzzy logic-based
methods [23,24] also exist to teach a mobile robot the proper action based on sensory
information to improve obstacle avoidance performance. However, the selection of the
linguistic variables used for the fuzzy rules and the definition of membership functions
can become complicated in the presence of a large number of sensor readings, such as in
the case of data emanating from a LIDAR.
For instance, in the case of existing algorithms such as the Vector Field Histogram
(VFH) based path planner [9] that rely solely on information available within the triggering
distance, the two openings, A and B, are equally appropriate. This leads to a problematic
scenario if the planner chooses to steer in the direction of B, as mentioned in Section 1.
The robot will crash into the wall unless it stops or moves backward. This problem was
previously pointed out by Ulrich and Borenstein in [9] when they developed the VFH*
method. They designed a hybrid path planner by combining the A* search algorithm
and VFH to prevent failure in the above-mentioned scenario. However, it requires prior
knowledge of the environment to improve the performance of the local planner.
A better solution would be to create a technique where the robot uses all the infor-
mation available within the maximum sensing range to infer more about the nature of
the environment. This in turn, creates an additional layer of autonomy where the robot
attempts to understand the environment before making obstacle avoidance decisions,
thereby resulting in reliable, intelligent behaviors. As soon as the robot sees the wall near
B, the obstacle avoidance technique should infer that the robot is in a corridor and should
turn towards A.
As such, we explore the use of DRL for developing an intelligent obstacle avoidance
approach. This enables the proposed algorithm to develop an inherent ability to understand
the environment based on sensor readings and thereby make the right decision without
errors introduced by preprocessing of the sensory data or by discarding available information. To emphasize
the advantages of the proposed DRL based method, its features are compared with state-
of-the-art classic methods, as shown in Table 1.
(vt, ωt) = f(Ot)  (1)
where vt , ωt and Ot are the linear velocity, angular velocity, and the observation from the
sensor at each time step.
proved that this method is capable of evaluating a fixed policy using a nonlinear function
approximator. The DQN can be represented according to the following equation:

Li(θi) = (r + γ max_a′ Q(s′, a′; θi−) − Q(s, a; θi))²  (2)

where Li(θi) is a sequence of loss functions for updating the neural network,
r + γ max_a′ Q(s′, a′; θi−) is the target for iteration i, and θi− are the weights from the
previous iteration. To improve the computational efficiency, minibatch updates are applied
to samples of experience, randomly selected from the memory.

A major drawback of DQN is overestimation. To improve the results, researchers
introduced Double Deep Q-networks (DDQN). The main difference between DDQN and
DQN is that DDQN has two sets of weights, θi and θi−. At each time step, the action is
selected by the neural network with weights θi, but the evaluation of that action comes
from the target network with weights θi−.

Another attempt to improve the training includes extending the basic DRL methods
using Prioritized Experience Replay (PER). This method samples the experience based on
the priority of each experience in the replay memory. The training efficiency was shown
to increase with PER, but more parameters were introduced, which increased the complexity
of tuning the model.

An evaluation of the three DRL methods was performed experimentally in our previous
work [32]. DQN, DDQN, and DDQN with PER were applied successively to a virtual
Turtlebot running in a simulation environment developed in Gazebo [10]. The average
Q-values and accumulated rewards were analyzed and are shown in Figures 2 and 3.
First, the average Q-values of DQN and DDQN were compared; the curves (Figure 2)
show that the Q-values of DQN were higher in the beginning, but the values dropped due
to overestimation. In addition, the learning process with DDQN is more stable. To decide
whether PER is suitable for the proposed method, the DDQN-based algorithm was extended.
The comparison between DDQN with PER and the original DDQN-based method,
as shown in Figure 3, indicated that the reward of DDQN with PER converged faster than
the original DDQN but achieved a lower value in the end. Therefore, our obstacle avoidance
method was developed based on DDQN, considering both the results and the tuning complexity.
Figure 2. Average Q-values estimated by DQN, DDQN [32].
Figure 3. Rewards from the training with the Turtlebot employing DDQN and DDQN with PER,
respectively [32].

The action space consists of 11 actions with the same linear velocity but different
angular velocities ωm = 0.8 − 0.16m (m = 0, 1, 2, . . . , 10). The state transition model P is
not required in the proposed approach, and the immediate reward at each time step is
assigned based upon the following equation:

rt = 1 (without collision); rt = −1000 (collision)  (4)

That is, when the robot experiences a collision, it receives a reward of −1000. The robot
was trained for 3000 epochs. Once the time step reaches the maximum value or the robot
crashes into the wall, the simulation resets with the robot at a random position, and a new
epoch starts. The neural network was implemented using Keras with TensorFlow as the
backend. The inputs of the network are the 50 preprocessed laser readings, which represent
the current state of the environment. Two hidden layers, each with 300 neurons, were used
in the network with the ReLU activation function. The outputs are the Q-values of all 11 actions.
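To make the description above concrete, the following is a minimal sketch of such a Q-network in Keras with the TensorFlow backend. The paper specifies only the layer sizes, the activation, and the loss structure of Equation (5); the optimizer choice here is an assumption.

```python
# Minimal sketch of the Q-network described above: 50 preprocessed range
# readings in, 11 Q-values out, two hidden layers of 300 ReLU units.
# The Adam optimizer is an assumption; Equation (5) corresponds to a mean
# squared error loss up to a constant factor.
from keras.models import Sequential
from keras.layers import Dense

def build_q_network(n_inputs=50, n_actions=11):
    model = Sequential([
        Dense(300, activation="relu", input_shape=(n_inputs,)),
        Dense(300, activation="relu"),
        Dense(n_actions, activation="linear"),  # one Q-value per action
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```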
To update the weights of the neural network, a mini-batch gradient descent was performed
on the loss function,
L(θ) = (1/2n) ∑_{i=1}^{n} (yi − Q(si, ai; θ))²  (5)
where yi is the current target output of the action-value function, and n is the size of the
mini-batch. To avoid overestimation, the target yi was updated as

yi = ri+1 + γ Q′(si+1, argmax_a Q(si+1, a; θ); θ−)  (6)
where ri+1 is the immediate reward after taking action ai, and γ is the discount factor.
Another neural network Q′(s, a; θ−) is initialized with random weights θ−. This allows the
action a with the maximum value to be chosen from the online neural network Q, while the
value of a is decided by the target network Q′. This in turn prevents overestimation of the
action. A similar approach was used by H. van Hasselt, A. Guez, and D. Silver in [35].
The proposed approach is summarized in Algorithm 1.
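As a sketch of how Equations (5) and (6) combine in practice, the snippet below computes DDQN targets for a minibatch: the online network selects the greedy next action, while the target network evaluates it. The variable names, batch layout, and discount value are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def ddqn_targets(q_net, target_net, batch, gamma=0.99):
    # batch: (states, actions, rewards, next_states, done) as numpy arrays;
    # gamma is an assumed discount factor.
    states, actions, rewards, next_states, done = batch
    best_next = np.argmax(q_net.predict(next_states, verbose=0), axis=1)  # select with theta
    q_next = target_net.predict(next_states, verbose=0)                   # evaluate with theta-
    targets = q_net.predict(states, verbose=0)
    for i in range(len(states)):
        y = rewards[i]
        if not done[i]:  # no bootstrapping past a terminal collision
            y += gamma * q_next[i, best_next[i]]
        targets[i, actions[i]] = y  # Equation (6) target for the taken action
    return targets
```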
Figure 4. The architecture of the STORM locomotion module. The tracked unit, wheeled unit, and the vertical translational
mechanism are indicated in the figure. (a) Tracked mobility mode. (b) Wheeled mobility. Note that the tracks are lifted by
the vertical translational mechanism. (c) Various sensors that are available on STORM.
The wheeled units can be translated down to enable them, allowing for higher speed
and better performance in position control. The tracked units can be engaged by retracting
the wheeled units to allow for better traction while operating in an unstructured environment.
Since the different modes are aligned perpendicular to each other, switching between
the modes allows the robot to move sideways as well. Furthermore, the vertical translation
structure housing the wheeled unit provides vertical mobility as well. Having varied modes
of locomotion provides a wide variety of capabilities to the locomotion module, including
translation along the X, Y, and Z axes. This improves the spatial alignment capabilities of
STORM for autonomous docking applications. The main specifications [11] of the prototype
are summarized in Table 2.
Table 2. Robot parameters.

Characteristic          Parameters
Outer dimensions        410 × 305 × 120 mm³
Vertical translation    50 mm
Robot mass              8.40 kg
TU mass                 2 × 2.81 kg
WU mass                 2 × 0.76 kg
VTU mass                0.73 kg

4.2. Electronic System

The electronic system is designed to enable the robot to perform the diverse tasks.
The electronic hardware architecture is presented in Figure 5.
Figure 5. The electrical architecture of STORM.
The internal electronic system can be divided into three subsystems: control, sensing,
and actuation. All the components built-in or mounted on the robot platform are shown
inside the dashed line (Figure 5), classified as the internal hardware group. A worksta-
tion computer and a wireless router belong to the external hardware group. Optionally,
an Operator Control Unit (OCU), such as a joystick, is able to control the motions of the
robot module via Bluetooth. The core of the mobile robot is the control system which
contains a single-board computer (ODROID-XU4) and two microcontrollers (TEENSY 2.0).
The ODROID-XU4 has an ARM CPU and 2 GB of high-speed RAM. The TEENSY 2.0 micro-
controller has an 8-Bit AVR Processor running at 16 MHz (ATMEGA32U4). The electronic
control system is in charge of sending commands, acquiring sensor data and communicat-
ing with the external hardware or other robots if required. The main components of the
actuation system are five motors. Each tracked unit is actuated by a high-torque Maxon
EC motor with an encoder. Maxon Flat EC motors with hall sensors drive the wheeled
units. A DC motor is used to drive the VTM to provide vertical mobility. A potentiometer
is attached to monitor the position of the VTM and gives feedback to a PID controller for
precise height control. The sensing system has four ultrasonic sensors, two USB cameras,
a Hokuyo LIDAR, and a 9-axis motion tracking device. The LIDAR is mounted on top of
the VTU to assist with autonomous navigation.
Figure 6. Software architecture: The various programs are divided into three different layers.
The software in the actuation layer is embedded in the microcontroller. The commands
from the higher-level controllers can be published to the command manager in the platform
layer. The command manager converts them into an 8-bit unsigned integer array carrying
the enable signal for each of the individual sensors as well as speed information for the motors.
All the components in the actuation layer are programmed as state machines operating on
the commands from the command manager. They transit between standby and execution
states based on the commands.
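As an illustration of this conversion step, the sketch below packs a high-level command into a ROS UInt8MultiArray. The byte layout and topic name are hypothetical; the paper only states that commands become an 8-bit unsigned integer array carrying sensor enable signals and motor speed information.

```python
import rospy
from std_msgs.msg import UInt8MultiArray

def pack_command(sensor_enable_byte, left_speed, right_speed):
    # Hypothetical layout: [sensor enables, left motor speed, right motor speed],
    # each clamped into the 0-255 range of an unsigned byte.
    clamp = lambda v: max(0, min(255, int(v)))
    return UInt8MultiArray(data=[clamp(sensor_enable_byte),
                                 clamp(left_speed),
                                 clamp(right_speed)])

rospy.init_node("command_manager_sketch")
pub = rospy.Publisher("/storm/actuation_command", UInt8MultiArray, queue_size=1)
pub.publish(pack_command(0b00001111, 120, 135))  # enable 4 sensors, set speeds
```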
Due to the different properties of the various sensors, the software for the sensing
system is separated into the actuation layer and the platform layer. The ultrasonic sensors,
potentiometer, and the 9-axis IMU are interfaced with the microcontrollers. The raw
data from these sensors are published to the status reporter together with the feedback
information about the motors. The cameras, LIDAR, and an optional localization device
named Pozyx [37] are directly connected to the single-board computer. The software
processes to interface these sensors are in the platform layer. Each sensor publishes out a
topic for acquiring and sending the raw data. The high-level controllers can then subscribe
to the status reporter and the sensor topics for any required information.
All the software in the platform layer and the actuation layer is designed to be reusable.
The implementation of the control layer represents the extensible nature of the software
architecture. The architecture allows for different control methods to be developed in
parallel. Each method can choose the required information provided by the status reporter
and the sensor processes running on the single-board computer. It is possible that some
high-level controllers may require more computing power than what is provided by the
single-board computer. For instance, the neural network in the proposed obstacle avoidance
approach cannot be implemented on the single-board computer. Instead, it is implemented
on a workstation, and the platform layer is made to subscribe to the topics published by
the workstation through ROS.
5. Simulations and Results

To train and test the neural network, various simulation environments and a virtual
STORM locomotion module were developed in the Gazebo simulator [10]. It is critical to map
the obtained linear and angular velocities from the high-level commands to the velocities
of each track or wheel of the STORM locomotion module in the simulation as well as
in the real-world experiments. As such, a skid-steering kinematic model of the STORM
locomotion module was developed and presented in the following subsection.
5.1. Kinematic Model of the Virtual STORM Module

The locomotion module can be regarded as a tracked mobile robot when the VTU is
lifted. When the VTU is lowered, the robot is considered to be in wheeled mobile robot
mode. However, in both cases, it operates as a skid-steer mobile robot, as shown in Figure 7.
The kinematics for this system can be represented as follows:

(vL, vR) = f(v, ω)  (7)

where v = (vx, vy) is the linear velocity of the robot; vx and vy are the components of
v along the X and Y axes, respectively; ω is the angular velocity of the robot; and vL and
vR are the linear velocities of the left and right tracked or wheeled mobility units with
respect to the local frame of reference.
Figure 7. Schematic figures of the locomotion module: (a) tracked mobility mode with ICR locations
and (b) wheeled mobility with ICR locations are shown.
Any planar movement of a rigid body can be regarded as a rotation about a single
point, defined as the instantaneous center of rotation (ICR). The skid-steer mobile robot
during a turning maneuver can be considered as a rigid body that is rotating about its ICR.
The ICRs of the left and right treads are different from the ICR of the body of the robot.
According to Kennedy's Theorem, all three ICRs have to be aligned in a straight line [38].
In Figure 7, C1 = ( xc1 , yc1 ) is the ICR of the STORM locomotion module, and C2 = ( xc2 , yc2 )
and C3 = ( xc3 , yc3 ) are the ICRs of the left and right wheeled or tracked units, respectively.
Based on the calculated ICRs, the linear velocities of the robot can be calculated
as follows:
vL = −|xc2|ω + vy;  vR = |xc3|ω + vy  (8)
In the above model, the local x coordinates xc2 and xc3 were estimated via experiments
in simulation. Known values of vL and vR were given to the robot, and the resulting ω and
vy were recorded. Using Equation (8), the average values of xc2 and xc3 were calculated from
the recorded data. The estimated values are as follows: in the tracked mobility mode,
|xc2| = 0.29 m and |xc3| = 0.3 m; in the wheeled mobility mode, |xc2| = 0.14 m and |xc3| = 0.15 m.
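A short sketch of how Equation (8) maps body velocities to the speeds of the two mobility units, using the ICR offsets estimated above for the tracked mode; the function itself is an illustrative reconstruction, not the authors' code.

```python
def body_to_unit_speeds(v_y, omega, x_c2=0.29, x_c3=0.30):
    # Equation (8) with the tracked-mode estimates |x_c2| = 0.29 m, |x_c3| = 0.3 m
    v_left = -x_c2 * omega + v_y
    v_right = x_c3 * omega + v_y
    return v_left, v_right

# Example: 0.4 m/s forward while turning at 0.2 rad/s
print(body_to_unit_speeds(0.4, 0.2))  # (0.342, 0.46)
```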
With the knowledge of the kinematic relations of the locomotion module, the virtual
STORM module with a 2D laser scanner was simulated in Gazebo, as shown in Figure 8.
The LIDAR takes 512 measurements in a 270° span. A set of 50 range measurements are
selected evenly from the sensor. The distance measurements are preprocessed to fall within
a range of 0.00 m to 5.00 m.
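The preprocessing just described can be sketched as follows; the treatment of invalid returns (inf/NaN) is an assumption, since the text only specifies the down-sampling and the 0–5 m range.

```python
import numpy as np

def preprocess_scan(ranges, n_out=50, max_range=5.0):
    # Replace invalid returns, pick n_out evenly spaced beams out of 512,
    # and clip everything into the 0.00-5.00 m band used as network input.
    r = np.nan_to_num(np.asarray(ranges, dtype=float),
                      nan=max_range, posinf=max_range)
    idx = np.linspace(0, len(r) - 1, n_out).astype(int)
    return np.clip(r[idx], 0.0, max_range)
```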
Figure 8. Simulation environment. A virtual STORM locomotion module with a 2D range sensor in
the simulation environment.
The sensor data and state information were obtained from the simulator and then
provided to the neural network as inputs. The commands to the virtual robot were sent
based on the outputs of the neural network. Communication was done through publishing
and subscribing to ROS topics.
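Closing the loop between the simulator and the network could look like the sketch below, reusing preprocess_scan and the network builder from the earlier sketches. Topic names and the constant linear speed are assumptions; the 11 angular velocities follow ωm = 0.8 − 0.16m from Section 5.2.

```python
import rospy
import numpy as np
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

LINEAR_V = 0.4                                   # assumed constant forward speed
ANGULAR_V = [0.8 - 0.16 * m for m in range(11)]  # the 11 actions of Section 5.2
model = build_q_network()  # from the earlier sketch; load trained weights in practice

def on_scan(msg):
    state = preprocess_scan(msg.ranges).reshape(1, -1)
    action = int(np.argmax(model.predict(state, verbose=0)[0]))
    cmd = Twist()
    cmd.linear.x = LINEAR_V
    cmd.angular.z = ANGULAR_V[action]
    cmd_pub.publish(cmd)

rospy.init_node("drl_avoidance_sketch")
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
rospy.Subscriber("/scan", LaserScan, on_scan)
rospy.spin()
```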
The proposed collision avoidance method is general in the sense that it can be applied
to both the tracked mobility mode and wheeled mobility mode of the locomotion module.
The trainings and experiments in both the simulation environments and the real-world
implementation were conducted with the robot in the tracked mode.
5.2. Training Parameter Selection and Analysis

The proposed approach uses an ε-greedy policy to choose the appropriate action. This
policy allows the selection of a random action with probability ε. Otherwise, the robot will
select the action with the highest value from the neural network. At the beginning of the
training, ε is set to 1, and the robot always selects a random action to acquire varied training
data for updating the neural network. This is defined as the exploration stage. However,
the robot cannot always select a random action, as it also needs to exploit the actions.
With a probability of (1 − ε), the robot will follow the output of the current neural network
and choose the action with the highest value in the current state. If the robot can receive a
similar reward to what it gets in the recorded transitions, the neural network is understood
as well-trained. If not, the neural network needs more data to exploit. The decay rate β
of the probability ε is used for balancing the exploration and the exploitation. A lower β
means that the robot chooses fewer random actions when it meets the same environment
state. It relies more on the previous experiences it had already obtained. The value of ε
will stop decaying after it reaches 0.05, which means that the robot will end up choosing a
random action with a probability of 5%. To demonstrate the influences of the parameters β
and ε, as well as to evaluate the performances of the virtual STORM, a test of the virtual
robot was set up in the simulation as follows:
1. The training process was performed three times in Map 2, as shown in Figure 9b,
with the decay rate β set to 0.997, 0.998, and 0.999, respectively. Each neural network
was trained for 3000 epochs for the following reasons: (a) All the three tests reach
the lowest value of ε to choose a random action after 3000 epochs, for example,
when β = 0.999, ε reaches 0.05 at log0.999 0.05 = 2994 epochs. (b) The current work did
not include the improvement of the training efficiency. Training the neural network
for 3000 epochs keeps the entire test within an acceptable training time, while at the
same time ensuring that reasonable results can be obtained according to our previous
experiences when training the neural network in [32].
2. The three neural networks were applied to the virtual STORM successively to
navigate the test map with similar features to Map 2, as shown in Figure 10, for five
minutes. The metric for evaluating the performance of the neural networks is chosen
to be the number of collisions undergone within five minutes of simulation.
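A quick check of the epoch counts quoted in point 1: if ε starts at 1 and is multiplied by β once per epoch with a floor of 0.05 (the schedule implied by the text; the exact form of Equation (3) is assumed here), the floor is reached after log_β(0.05) epochs.

```python
import math

def epochs_until_floor(beta, eps_min=0.05):
    # Smallest k with beta**k <= eps_min, assuming epsilon_k = beta**k
    return math.ceil(math.log(eps_min) / math.log(beta))

for beta in (0.997, 0.998, 0.999):
    print(beta, epochs_until_floor(beta))
# beta = 0.999 gives ~2995, matching log_0.999(0.05) = 2994 quoted above.
```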
Figure 9. The training maps. (a) Map 1: a simple corridor-like path with some problematic scenarios. (b) Map 2: a complex
circuit with different environmental features such as straight tracks, 90-degree turns, and acute and obtuse angle corners.
Figure 10. The test map with similar features to that in Map 2.
5.3. Comparison with State-of-the-Art Method

To demonstrate the advantages of the proposed DRL based planner, it was compared
with the DWA local planner [30] in the virtual world, as shown in Figure 9a. There are
mainly two reasons why this map was developed. The first one is to serve as the testbed
for comparing the proposed method with the dynamic window-based method. The second
one is to address the problem stated in the introduction (Figure 1) by adding narrow
passageways in each corner of the map, which aims to demonstrate that the proposed
method has the ability to avoid the problem mentioned in the introduction section.

The proposed method allows the robot to avoid obstacles based on the information
from the sensor directly. In contrast, the DWA local planner is based on a dynamic window
approach. Besides the data from the laser sensor, a cost map, generated according to
odometry information, is required for the DWA planner to predict the safe motion. The
feasibility of the path from the DWA local planner relies very much on the precision of
the provided costmap. This is especially true in the case when the robot is navigating
through narrow passageways. Consequently, the drift in odometer data may cause collisions
under the DWA local planner. As such, the proposed method is superior to the DWA
planner since it only relies on the sensor data as input.

5.4. Simulation Results
The results of the simulation are summarized and analyzed in this section. Figure 11
shows the training results to compare the performances of the neural network with different
decay rates.
From the results, it can be observed that with a lower value of β in Equation (3),
the neural network tended to get higher rewards fast. However, the rewards obtained
during the training with a low value of β were not stable. This is because the robot
chose an action with the highest Q-value according to the current neural network with
a high probability. Nevertheless, this action was overestimated in the beginning due
to insufficient experiences acquired at that time. The Q-value of that action decreased
with more exploitations. A higher value of β caused the robot to explore more actions
at first. This allowed the robot to take different actions when it met the same situation
to determine which one was better. The reward curves show that the learning process
with the highest β was slow but stably increased. It should be pointed out that the robot
cannot achieve the highest score by navigating the training map for one round without any
collisions, as shown in the results, because it still had the learning capabilities by selecting a
random action with a probability of 5%. However, after training, the robot with each neural
network was able to navigate the training map without collisions when it took action with
the highest values according to the outputs of that neural network without the probabilities
of choosing random actions. In conclusion, a higher decaying rate of the probability of
choosing a random action causes the robot to learn fast, but the value of the actions as the
outputs of the neural networks was not stable. One can decrease the value of β to evaluate
whether the actions that are already learned are proper or not, or increase the value of β to
obtain more stable results at the cost of a longer training time.
The performance of the robot with the different neural networks over a duration of
five minutes in the test map is shown in Table 3. It proved that the DRL-based obstacle
avoidance method has the potential to deal with similar situations
it had already learned. Further tests of the robustness of the proposed controller were
conducted in the real world.
Figure 11. Performance comparisons of the neural network with different decay rates: in the first
three plots, the blue lines indicate the reward values, whereas the red line is the data processed by
a moving average filter with a window size of 500. The blue, red, and green lines in the last plot
stand for the trainings with different decay rates.
Table 3. Performance of obstacle avoidance approach.

Decay Rate β        Number of Collisions in 5 min
0.999               0
0.997               1
0.995               2

Figure 12 shows a sequence of frames captured from the simulated environment with
a virtual STORM locomotion module under the DRL-based obstacle avoidance controller.
For the simulated case, the virtual robot maneuvers through the narrow corridor-like maze
successfully without any collisions. In comparison, another sequence of frames captured
from the simulation is shown in Figure 13 with the locomotion module under the DWA
local planner. The trajectories of the robot under the action of the two different planners
are compared in Figure 14. As seen from Figure 13b, the robot goes out of the maze
due to the lack of environmental information in order to plan a global path in advance.
After a recovery behavior, the robot comes up with a new plan and goes in the correct
direction, as shown in Figure 13c,d. It should be noted that at one of the corners of the
maze, as shown in Figure 13e,f, the robot almost bumped into the wall and failed to come
up with a reliable local path. The reason is that the costmap generated from the odometer
data was not precise at that specific time step, as shown in Figure 14b. This causes the robot
to spin around and collect more information and thereby plan a new path. In conclusion,
the proposed DRL-based method is capable of planning a proper motion for the mobile
robot solely based on the data collected from the laser sensor directly. This in turn reduces
the inaccuracy caused by other processes, such as costmap generation.
Figure 12. STORM locomotion module with the proposed DRL-based planner initialized at (a) starting
position, successfully passes the corners with special features shown in (b,d,e), and the corridors
as shown in (c,f,g); finally goes to the goal position in the virtual environment.

Figure 13. STORM locomotion module with the DWA local planner initialized at the (a) starting
position goes out of the maze due to the lack of environmental information. In a recovery behavior,
the robot comes up with a new plan and goes in the correct direction, as shown in (b–d). At one of
the corners of the maze, as shown in (e–g), the robot almost bumped into the wall and failed to come
up with a reliable local path. It finally goes to the (h) goal position.

Figure 14. (a) Comparison between the trajectories of the proposed DRL-based planner and the
DWA local planner. (b) The costmap feeds into the DWA planner at the moment the robot fails to
produce a path.
5.5. Real-World Implementation

The obstacle avoidance controller for the STORM prototype was embedded in the
control layer. The neural network trained with β = 0.999 was selected as it offered the best
performance. As mentioned before, the trained neural network was made to run on a
workstation.
Figure 15. The test maps and the trajectories of the robot.
6. Conclusions and Future Work
The DRL-based method shows the potential to improve the performance of the robot
when avoiding obstacles in some highly occupied environments and problematic scenarios.
The proposed DRL architecture was trained inside a simulated Gazebo environment
Author Contributions: Methodology, S.F.; project administration, P.B.-T.; supervision, P.B.-T.; valida-
tion, S.F.; writing—original draft, S.F.; writing—review and editing, B.S. and P.B.-T. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al.
Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815.
2. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found.
Trends Mach. Learn. 2018, 11, 219–354. [CrossRef]
3. Moreira, I.; Rivas, J.; Cruz, F.; Dazeley, R.; Ayala, A.; Fernandes, B. Deep reinforcement learning with interactive feedback in a
human-robot environment. Appl. Sci. 2020, 10, 5574. [CrossRef]
4. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-
policy updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore,
29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3389–3396. [CrossRef]
5. Pirník, R.; Hruboš, M.; Nemec, D.; Mravec, T.; Božek, P. Integration of inertial sensor data into control of the mobile platform.
Adv. Intell. Syst. Comput. 2017, 511, 271–282.
6. Kilin, A.; Bozek, P.; Karavaev, Y.; Klekovkin, A.; Shestakov, V. Experimental investigations of a highly maneuverable mobile
omniwheel robot. Int. J. Adv. Robot. Syst. 2017, 14, 1–9. [CrossRef]
7. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using
deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA),
Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 3357–3364.
8. Tai, L.; Paolo, G.; Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation.
In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada,
24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 31–36.
9. Ulrich, I.; Borenstein, J. VFH*: Local obstacle avoidance with look-ahead verification. In Proceedings of the Proceedings 2000 ICRA.
Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065),
San Francisco, CA, USA, 24–28 April 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 3, pp. 2505–2511.