
Proceeding Paper

A Deep Reinforcement Learning Algorithm for Robotic Manipulation Tasks in Simulated Environments †

Carlos Calderon-Cordova * and Roger Sarango

Department of Computer Science and Electronics, Universidad Técnica Particular de Loja, Loja 1101608, Ecuador; [email protected]
* Correspondence: [email protected] or [email protected]
† Presented at the XXXI Conference on Electrical and Electronic Engineering, Quito, Ecuador, 29 November–1 December 2023.

Abstract: Industrial robots are used in a variety of industrial process tasks, and due to the complexity of the environments in which these systems are deployed, more robust and accurate control methods are required. Deep reinforcement learning (DRL) emerges as a comprehensive approach that directly maps sensor data to motion actions for the robot. In this work, we propose a robotic system implemented in a semi-photorealistic simulator whose motion control is based on the A2C algorithm in a DRL agent; the task to be performed is to reach a goal within a work area. The evaluation is executed in a simulation scenario in which a target is kept at a fixed position while the agent (a robotic manipulator) tries to reach it with its end-effector from an initial position. The trained agent fulfills the established task, as demonstrated by the results obtained in the training and evaluation processes: the reward value increases as the measured distance between the end-effector and the target decreases.

Keywords: robotics manipulation; DRL algorithms; deep learning; reinforcement learning; robotic
simulator; CoppeliaSim

1. Introduction
In recent decades, Industry 4.0 (I4.0) has promoted the upgrade of automation and the digitization of industrial processes, using emerging technologies such as cloud computing, artificial intelligence (AI), the internet of things (IoT), cyber-physical systems, etc. [1]. Some industrial processes, such as welding, manufacturing, and object manipulation, require robotic systems that enable decision making to execute specific tasks within production lines [2].
The integration of artificial intelligence improves the efficiency of factory automation; in particular, machine learning (ML) is implemented, which gives systems the ability to learn to execute a task and has been demonstrated to improve control in many applications [3]. The learning process is limited by several factors: the volume of data required for the system, the elements of scenarios in industrial areas, deployment time, the cost of equipment, and others. Currently, there are several simulation platforms [4] that allow for the simulation of various scenarios, generating synthetic data and robotic components that enable the development of an end-to-end system based on artificial intelligence.
In machine learning, Deep Reinforcement Learning (DRL) has been of great interest to researchers with academic and industrial approaches in several application areas [5]. In the field of robotic manipulation, there are several frameworks and simulation platforms that allow for the implementation and evaluation of various control algorithms for the application of RL in robotics [6]. Despite the success of DRL, there are also methods that allow for the optimization of DRL algorithms by combining CPU and GPU computation, thus allowing larger data batches to be used without affecting the final performance [7].

In the literature, a large number of papers present reviews of DRL algorithms for robotic manipulation; however, these papers do not provide details of the implementation process for a specific problem. The contribution of this work is to implement a robotic system in a simulation platform; the end-to-end control approach is based on DRL, and its performance in the classical task of reaching a goal using a robotic manipulator is evaluated.
The structure of the paper is described as follows: Section 2 describes the main
fundamentals and structure involved in deep reinforcement learning applied to robotic
manipulation. Section 3 describes the components and architecture of the robotic system
implemented in simulation. Section 4 provides the details of the training and evaluation
processes of the robotic system whose control is based on DRL. Section 5 presents some
results obtained with the simulation implementation of the DRL algorithms. Finally, some
conclusions obtained from the complete development process of this work are presented.

2. Deep Reinforcement Learning


Deep reinforcement learning algorithms are based on the iterative trial-and-error
learning process. The interaction between the agent and the environment in which the task
is performed is modeled as a Markov Decision Process (MDP). With this method, the
interaction is reduced to three signals: the current state of the environment (observations),
a decision made by the agent based on the state (action), and feedback with positive
or negative value depending on the action performed by the agent (reward) [8]. The
mathematical foundations of DRL approaches are based on the Markov Decision Process
(MDP) [9], which consists of five elements:

$$\mathrm{MDP} = (S, A, P, R, \gamma), \quad (1)$$

where S is the set of states of the agent and the environment, A is the set of actions executed
by the agent, P is the model of the system—in other words, it is the transition probability
of a state—R is the reward function, and γ is a discount factor [10]. The DRL objective function has two forms. The first is the state-value function, which defines the expectation of the accumulated reward: $V^\pi(s)$ represents the estimate of the state-value function under the policy for the MDP; given a policy $\pi$, the expected return is

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s \,\right], \quad (2)$$

The second is the action-value function, known as the Q function; it indicates that after performing an action a under the policy π in state s, a cumulative reward is generated. $Q^\pi(s, a)$ represents the estimate of the action-value function under the policy for the MDP; given a policy $\pi$, the expected return is

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\ a_t = a \,\right], \quad (3)$$
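As a small numerical illustration of the expected return in Equations (2) and (3) (our own sketch, not code from the paper), the discounted sum of an observed reward sequence can be computed as follows:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: three rewards observed after taking action a in state s.
print(discounted_return([-0.8, -0.5, -0.1], gamma=0.9))  # -0.8 - 0.45 - 0.081 = -1.331
```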

The Advantage Actor-Critic (A2C) algorithm combines two elements of reinforcement learning: a policy gradient (actor) and a learned value function (critic). The Actor network learns a parameterized policy, and the Critic network learns a value function that evaluates state–action pairs. The Critic network provides a reinforcement signal to the Actor network. In this algorithm, the Actor network chooses an action at each time step, and the Critic network evaluates that action based on the Q value of the input state. As the Critic network learns which states are better, the Actor network uses that information to steer the agent toward better states.

$$A_t = R_t + \gamma V(s_{t+1}) - V(s_t), \quad (4)$$

where $A_t$ is the advantage at time step t, $R_t$ is the reward obtained in a given state $S_t$, $R_t = R(S_t)$, and R is the reward function. To obtain a correct advantage estimate, the TD error is used as a better estimator of the decision-making process when choosing an action in a given state [9]. Agents can be trained using A2C to estimate the advantage function by combining the estimated reward value and the observed reward value.
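A minimal sketch of the advantage estimate in Equation (4), assuming the critic supplies value estimates for the current and next states (this is our own illustration, not the authors' implementation):

```python
def td_advantage(reward, value_s, value_s_next, gamma=0.99, terminal=False):
    """One-step advantage A_t = R_t + gamma * V(s_{t+1}) - V(s_t).

    A positive value means the chosen action performed better than the
    critic expected for that state; A2C scales the actor's policy-gradient
    update with this signal.
    """
    bootstrap = 0.0 if terminal else gamma * value_s_next
    return reward + bootstrap - value_s

# Example with assumed critic outputs: the tip moved closer to the target,
# so the advantage comes out positive.
print(td_advantage(reward=-0.3, value_s=-1.2, value_s_next=-0.8))  # 0.108
```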

3. Proposed System
The robotic system proposed in this work was developed using free software tools for robotics research. The agent to be evaluated is a robotic manipulator with 7 degrees of freedom, and the task to be performed is to reach a target within its working area in a semi-photorealistic scene. The training data are obtained directly from the simulator, and by means of the API integrated into this platform, the data flow and the control of the robot are managed.

3.1. System Architecture
The robotic manipulator control approach is based on the general DRL scheme; the software system architecture is of the client–server type. The client contains the components to train and evaluate the DRL agent located on the server side, which consists of the simulation platform. Since this is an end-to-end control approach, the input data to the system consist of the positions of the robot joints, the position (x, y, z) of a tip located at the end of the end-effector, and the measured distance between the end-effector and the target sphere. The control actions are angles for each of the joints of the 7-DoF robot. This configuration is shown in Figure 1, which describes the process of learning an agent using DRL for a robotic manipulator in CoppeliaSim.

Figure 1. DRL process architecture for the robotic manipulator.

When using an A2C algorithm with an Actor-Critic policy, the model structure consists of two networks: (1) the Actor network recommends the action to be taken by the agent and gives the probability distribution over actions for the current state, and (2) the Critic network gives an estimate of the reward value for the executed actions; it yields the estimated total future reward for the given state. The neural network architecture is shown in Figure 2 and consists of the two networks, Actor and Critic; the network layers have sizes of 256, 512, and 256, respectively, with ReLU activation functions.
3.2. Reward Function
The reward function should drive the robotic manipulator to reach the target position; its end-effector should approach a sphere located in the working area. After executing an action, an observation is obtained, and the total reward is calculated with the following expression:

$$R_t = r_{dist} + r_{ctrl} + r_{pa} + r_l + r_c, \quad (5)$$

where $r_{dist}$ represents the distance reward, which quantifies the Euclidean distance between the end of the robotic manipulator and the target sphere. This reward has a more negative value when the end of the manipulator is farther away from the target. On the other hand, $r_{ctrl}$ corresponds to the control reward, which is calculated as the squared Euclidean norm of the action taken by the agent. This reward penalizes the agent when it makes excessively large movements. In addition, the reward $r_{pa}$ is obtained by comparing the previous distance with the current distance after performing an action. If the previous distance is less than the current distance, the agent receives a negative reward, as this indicates that the end-effector of the manipulator is moving away from the target position. The difference reward is calculated by the following function:

$$r_{pa} = d_p - d_a, \quad (6)$$

where $d_p$ is the previous distance between the end-effector and the target to be reached (this distance value is calculated before the robot executes an action), and $d_a$ indicates the current distance between the end-effector position and the sphere position. The arrival reward $r_l$ of the manipulator is defined as a positive value given when the end-effector is within a threshold distance of the target. The collision reward $r_c$ is a positive value given to the agent when the end-effector touches the target object. The total reward is composed of the five elements already mentioned and is defined by the following function:
$$R_t = \begin{cases} r_{dist}: & -\sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \\ r_{ctrl}: & -\sum_{i=1}^{n} (\mathrm{Action}_i)^2 \\ r_{pa}: & d_p - d_a \\ r_l: & \text{if } |d_a| < d_u \\ r_c: & \text{if collision} = \mathrm{True} \end{cases} \quad (7)$$

where Action is a vector composed of the angular positions of the seven joints of the robot. The term $d_u$ is the threshold distance that determines that the end-effector is approaching the target, and the term "collision" has a logical value that represents the collision between these two objects in the simulation; once this condition is fulfilled, the agent is given a positive value for reaching the target. These rewards are essential components in guiding the agent's on-task behavior, encouraging precise movements toward the target sphere and discouraging abrupt or distant actions.
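A compact sketch of how the total reward in Equation (7) could be computed at each simulation step. This is our own illustration under stated assumptions: the helper name, the threshold value, and the bonus magnitudes are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def compute_reward(tip_pos, target_pos, action, prev_dist,
                   d_u=0.05, reach_bonus=1.0, collision=False, collision_bonus=1.0):
    """Total reward R_t = r_dist + r_ctrl + r_pa + r_l + r_c (Equation (7))."""
    curr_dist = float(np.linalg.norm(np.asarray(target_pos) - np.asarray(tip_pos)))

    r_dist = -curr_dist                          # more negative the farther the tip is
    r_ctrl = -float(np.sum(np.square(action)))   # penalize large joint commands
    r_pa = prev_dist - curr_dist                 # positive when the tip moved closer
    r_l = reach_bonus if curr_dist < d_u else 0.0        # arrival reward
    r_c = collision_bonus if collision else 0.0          # end-effector touches the target

    return r_dist + r_ctrl + r_pa + r_l + r_c, curr_dist
```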

Figure 2. Neural network structure for A2C.

3.3. Action and Observation Space
The robotic manipulator consists of seven joints (7 DoF); as a result, the action space is a vector containing the angles of each joint of the robot. The observation space is a vector of 1 × 18 elements containing the position of the end-effector, the angles of each joint, and the distance between the end-effector and the target sphere.
3.4. Simulation
The implementation of the robotic manipulator occurs in the CoppeliaSim simulator [11]; this platform provides an API in Python that allows for communication with external software, data acquisition, and control functions. The base architecture for this system consists of two elements: the first is the server, in which the robotic manipulator (Franka Emika Panda) and a static element (a red sphere) that will be the target to reach are deployed. The second is the client, which consists of Python scripts for the creation of a customized environment with Gymnasium [12] and for the training and evaluation of the agent with Stable Baselines 3 [13].
The simulation environment is presented in Figure 3; the objective of the agent is to move the end-effector to a target position, which is marked by a red sphere. At the end of the end-effector, a mark is placed in order to set a parameter for the observation space; the working area is inside the white colored area.

Figure 3. Simulation scene for the robotic manipulator.
4. Training and Evaluation Details
The robotic system was trained and evaluated in a simulation environment, and the following software components were used: Python as the base programming language, CoppeliaSim as the simulation platform, and Stable Baselines 3 as the DRL framework that allowed for the integration of a deep neural network (DNN) and the A2C algorithm in the test agent. The hardware used for the implementation of this system has the following features: Ryzen 7 CPU, NVIDIA GeForce RTX 3070 8 GB GPU, and 16 GB RAM.
During the training phase, the robotic manipulator seeks to approach the position of the target located at a point in the scene; the coordinates of the robot joints towards the target are given randomly and updated after several episodes, making the robot perform its task better. This makes the DRL training efficient and robust.
Figure 4 presents the main software elements used for this robotic system. In (a), the custom environment for CoppeliaSim is presented: the class is created, and the Gymnasium methods for RL, the control functions, and the data acquisition of the simulator are established. (b) is the part of the client in which the agent training is performed and the model is saved. (c) is also part of the client; here the trained agent is evaluated, which involves loading the model and running it on the simulation platform to verify that the assigned task is being performed.

Figure 4. The flowcharts of the software system: (a) Customized environment for CoppeliaSim; (b) Flowchart for client training; (c) Flowchart for the evaluation of the trained model.
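The following sketch summarizes how the three components of Figure 4 could be wired together with Gymnasium and Stable Baselines 3: a customized environment exposing the 7-element action vector and the 1 × 18 observation vector, a training script with the 256-512-256 policy network reported below, and an evaluation loop for the saved model. This is only a sketch under our own assumptions; the calls into CoppeliaSim are left as placeholder comments, names such as ReachTargetEnv are hypothetical, and the training budget is not the paper's exact setting.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import A2C


class ReachTargetEnv(gym.Env):
    """Skeleton of a customized CoppeliaSim environment (Figure 4a).

    The observation is a 1 x 18 vector (end-effector position, joint angles,
    end-effector-to-target distance) and the action is a 7-element vector of
    joint angles, as described in Sections 3.1 and 3.3.
    """

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(18,), dtype=np.float32)
        # Here the client would open a connection to the CoppeliaSim remote API (server side).

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Here the simulation would be restarted and the robot moved to its initial pose.
        observation = np.zeros(18, dtype=np.float32)
        return observation, {}

    def step(self, action):
        # Here the joint targets would be sent to the simulator and one step advanced,
        # after which the new observation and the reward of Equation (7) are computed.
        observation = np.zeros(18, dtype=np.float32)
        reward = 0.0
        terminated = False   # e.g., target reached or collision detected
        truncated = False    # e.g., episode step limit exceeded
        return observation, reward, terminated, truncated, {}


if __name__ == "__main__":
    env = ReachTargetEnv()

    # Training (Figure 4b): A2C with a shared 256-512-256 MLP.
    model = A2C("MlpPolicy", env,
                policy_kwargs=dict(net_arch=[256, 512, 256]),
                verbose=1)
    model.learn(total_timesteps=100_000)   # assumed budget
    model.save("a2c_reach_target")

    # Evaluation (Figure 4c): load the saved model and run it on the environment.
    model = A2C.load("a2c_reach_target")
    obs, _ = env.reset()
    episode_reward = 0.0
    for _ in range(200):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        if terminated or truncated:
            break
    print("cumulative reward:", episode_reward)
```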
The selected algorithm is the Advantage Actor-Critic (A2C) [14] because it meets the requirements of the action and observation spaces used for Gymnasium that are supported by the DRL framework, and it is one of the algorithms used for robotic manipulation [15]. The input state of the deep neural network consists of a vector with these elements: the position of the tip point of the end-effector, the angles of the joints, and the distance from the end-effector to the target. The action performed by the system is a vector containing seven elements corresponding to each of the robot's joints.
For additional details on the development and implementation of the robotic system in the simulation platform, a link to the repository is provided here: https://github.com/RogerSgo/DRL-Sim-Reach-Target (accessed on 2 October 2023).

5. Results
This section presents the main results obtained from the experimentation carried out in the CoppeliaSim EDU 4.5 simulation environment. The model was trained for 1000 episodes with a standard Actor-Critic neural network architecture of 3 layers of sizes 256, 512, and 256 in each network. The resulting test model took about 1 h to train.
Table 1 shows the reward values obtained during the training of the agent with the A2C algorithm over 1000 episodes; it can be seen that the training worked according to the criteria established for the assigned task. In this case, it is observed that as the distance (da) between the end-effector and the target decreases, the reward value (Rt) increases, which indicates that the agent is learning to perform this movement action.

Table 1. Reward values.

Episode    dp        da        rdist      rctrl      rpa       rl        rc       Rt
1          0.467     0.465     −0.465     −0.350     0.002     0.000     0.000    −0.813
500        0.158     0.145     −0.145     −0.552     0.013     0.145     0.00     −0.538
1000       0.092     0.090     −0.090     −0.333     0.001     0.090     0.00     −0.332

The learning curve of the A2C agent is based on the training reward. To obtain the learning curve, the sum of the rewards that the agent obtains during the set of episodes determined for training is measured. A high cumulative reward indicates that the agent is successful in the task. The learning curve is presented in Figure 5.

Figure 5. Agent learning curve.

To verify the performance of the reach-to-target task using the A2C algorithm, the trained model is evaluated by performing multiple simulation experiments in CoppeliaSim. Figure 6 demonstrates that the robotic manipulator can execute the task of reaching a target using the DRL control method. The robot starts from an initial position and, with the use of the trained model, performs motion predictions until it reaches the target located at a position in the working area of the environment.

Figure 6. The sequence of manipulator movement to reach the target: (a) The initial position of the robotic manipulator; (b) End-effector moves about half the distance to the target; (c) End-effector arrives at the target goal.

The performance evaluation of the A2C method for this robotics application is based on the average reward; this parameter is a common metric for the evaluation of RL algorithms. The task to be performed in this work was the reaching of a target by the end-effector of a robotic manipulator, and according to the reward function designed in this system, the reward value should increase as the measured distance between the end-effector and the target decreases. This shows that the performance is satisfactory.
The evaluation of the trained DRL model was performed by running it for 100 episodes. Two important parameters are presented to measure the agent's performance based on A2C, as shown in Figure 7; the curve in red is the reward value accumulated during that number of episodes, and the curve in blue represents the measured distance of the end-effector tip when approaching the target.

Figure 7. The evaluation parameters of the DRL agent. The red curve is the reward accumulated when the end-effector approaches the target sphere position. The blue curve is the distance between the end-effector and the target sphere.

With the results obtained from the training process in Table 1 and the evaluation process in Figure 7, it is shown that the reward value increases when the measured distance between the tip positioned at the extreme of the end-effector and the target sphere decreases, with each action predicted by the DRL model. This indicates that the trained agent satisfies the execution of the task during the evaluation of the system in simulation.
In addition to the metrics (reward) and measured distance (m) of the robotic system training, the use of the PC hardware resources by some software components was checked. Table 2 shows the hardware resources used by the software components of the robotic manipulator system during the training process and verifies that the highest consumption occurs as follows: the simulation platform consumes 35% of the total GPU, the Python programming language consumes 39% of the RAM memory, and finally, Firefox, used with Anaconda to implement the different scripts, consumes 9% of the RAM.

Table 2. Hardware performance.

Software             CPU (%)    GPU (%)    VRAM (%)    RAM (%)    FPS
CoppeliaSim 4.5.1    2          35         37          1          59
Python 3.11.4        1          0.4        -           39         -
Firefox 118.0.2      0.2        0.1        -           9          -

6. Conclusions
In this paper, an A2C-based deep reinforcement learning method to solve the task of reaching a goal for a robotic manipulator is presented; simulation results in CoppeliaSim validate the performance of the proposed system. In addition to the training and evaluation processes of the DRL model, the impact of this DRL control method on the performance of the PC hardware is verified.
Reward calculation is mainly based on the measured distance between the end-effector position and the target to be reached, as well as a penalty to the agent when it makes movements that are too large. During the training and evaluation processes of the agent, each parameter of the reward function is monitored in order to determine whether the task to be executed is being fulfilled; these data are evidenced in Table 1 and Figure 6. Therefore, as the end-effector-target distance decreases, the reward has a higher value.
The agent is composed of artificial neural networks that are integrated in an Actor-Critic architecture and learns based on sensory information about the end-effector's state with respect to the target. Its policy, based on direct feedback from the environment, is evaluated with the following parameters: position, distance, and motion cost.
In Section 5, some quantitative results were presented, such as the agent's learning curve, the rewards obtained during a certain number of episodes, and the distance measured
between the end-effector and the target. These values were obtained from the training
and evaluation of the system and show that the agent was learning to perform the task of
reaching a target within the work zone in the CoppeliaSim simulation scene.
The simulation experiments performed showed that the training process was completed in a short time due to GPU acceleration. The current implementation used data from various sources, such as joint positions given in angles, data related to the distance from the end-effector to the target, and the distance between the end-effector mark and the target sphere. Processing performance can be affected by scaling the system, i.e., by increasing the number of sensors and/or the complexity of the deep neural network.
The performance of the DRL agent depends on many factors, among which we can emphasize the following: the number of episodes for training the model, the framework selected to implement the DRL algorithm, the design of the neural network architecture, and the process logic for the actions to be taken by the agent at each step.

Author Contributions: Conceptualization, methodology, C.C.-C. and R.S.; software, R.S.; validation,
R.S.; writing—original draft preparation, C.C.-C. and R.S.; writing—review and editing, R.S.; super-
vision, C.C.-C.; project administration and funding acquisition, C.C.-C. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by Universidad Tecnica Particular de Loja, grant number
PROY_ARTIC_CE_2022_3667.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. del Real Torres, A.; Andreiana, D.S.; Ojeda Roldan, A.; Hernandez Bustos, A.; Acevedo Galicia, L.E. A Review of Deep
Reinforcement Learning Approaches for Smart Manufacturing in Industry 4.0 and 5.0 Framework. Appl. Sci. 2022, 12, 12377.
[CrossRef]
2. Bhuiyan, T.; Kästner, L.; Hu, Y.; Kutschank, B.; Lambrecht, J. Deep-Reinforcement-Learning-based Path Planning for Industrial
Robots using Distance Sensors as Observation. In Proceedings of the 2023 8th International Conference on Control and Robotics
Engineering (ICCRE), Niigata, Japan, 21–23 April 2023.
3. Jiang, R.; Wang, Z.; He, B.; Di, Z. Vision-Based Deep Reinforcement Learning For UR5 Robot Motion Control. In Proceedings of
the 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China,
15–17 January 2021; pp. 246–250.
4. Collins, J.; Chand, S.; Vanderkop, A.; Howard, D. A review of physics simulators for robotic applications. IEEE Access 2021, 9,
51416–51431. [CrossRef]
5. Gupta, S.; Singal, G.; Garg, D. Deep reinforcement learning techniques in diversified domains: A survey. Arch. Comput. Methods
Eng. 2021, 28, 4715–4754. [CrossRef]
6. Nguyen, H.; La, H. Review of deep reinforcement learning for robot manipulation. In Proceedings of the 2019 Third IEEE
International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 590–595.
7. Stooke, A.; Abbeel, P. Accelerated methods for deep reinforcement learning. arXiv 2018, arXiv:1803.02811.
8. Gym, O.; Sanghi, N. Deep Reinforcement Learning with Python; Springer: Berlin/Heidelberg, Germany, 2021.
9. Dong, H.; Ding, Z.; Zhang, S. Deep Reinforcement Learning—Fundamentals, Research and Applications; Springer: Berlin/Heidelberg,
Germany, 2020.
10. Liu, L.-L.; Chen, E.-L.; Gao, Z.-G.; Wang, Y. Research on motion planning of seven degree of freedom manipulator based on
DDPG. In Proceedings of the Advanced Manufacturing and Automation VIII 8, Changzhou, China, 20–21 September 2018; pp.
356–367.
11. Robotics, C. Robotics Simulator CoppeliaSim. Available online: https://www.coppeliarobotics.com/ (accessed on 1 July 2023).
12. Towers, M.; Terry, J.K.; Kwiatkowski, A.; Balis, J.U.; Cola, G.d.; Deleu, T.; Goulão, M.; Kallinteris, A.; Arjun, K.G.; Krimmel, M.; et al. Gymnasium. Available online: https://github.com/Farama-Foundation/Gymnasium (accessed on 1 July 2023).
13. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-baselines3: Reliable reinforcement learning
implementations. J. Mach. Learn. Res. 2021, 22, 12348–12355.

14. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep
reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June
2016; pp. 1928–1937.
15. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation.
Sensors 2023, 23, 3762. [CrossRef] [PubMed]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
