A Deep Reinforcement Learning Algorithm for Robotic Manipulation Tasks in Simulated Environments
Department of Computer Science and Electronics, Universidad Técnica Particular de Loja, Loja 1101608, Ecuador;
[email protected]
* Correspondence: [email protected] or [email protected]
† Presented at the XXXI Conference on Electrical and Electronic Engineering, Quito, Ecuador, 29 November–1
December 2023.
Abstract: Industrial robots are used in a variety of industrial process tasks, and due to the complexity
of the environment in which these systems are deployed, more robust and accurate control methods
are required. Deep reinforcement learning (DRL) emerges as a comprehensive approach that directly maps sensor data to motion actions for the robot. In this work, we propose
a robotic system implemented in a semi-photorealistic simulator whose motion control is based on
the A2C algorithm in a DRL agent; the task to be performed is to reach a goal within a work area.
The evaluation is executed in a simulation scenario where a fixed position of a target is maintained
while the agent (robotic manipulator) tries to reach it with the end-effector from an initial position.
Finally, the trained agent fulfills the established task; this is demonstrated by the results obtained in
the training and evaluation processes, where the reward value increases as the measured distance
between the end-effector and the target decreases.
Keywords: robotics manipulation; DRL algorithms; deep learning; reinforcement learning; robotic
simulator; CoppeliaSim
1. Introduction
In recent decades, Industry 4.0 (I4.0) has promoted the upgrade of automation and the digitization of industrial processes, using emerging technologies such as cloud computing, artificial intelligence (AI), the internet of things (IoT), cyber-physical systems, etc. [1]. Some industrial processes, such as welding, manufacturing, and object manipulation, require robotic systems that enable decision making to execute specific tasks within production lines [2].
The integration of artificial intelligence improves the efficiency of factory automation; in particular, machine learning (ML) is implemented, which gives systems the ability to learn to execute a task and has been demonstrated to improve control in many applications [3]. The learning process is limited by several factors: the volume of data required for the system, the elements of scenarios in industrial areas, deployment time, the cost of equipment, and others. Currently, there are several simulation platforms [4] that allow for the simulation of various scenarios, generating synthetic data and robotic components that enable the development of an end-to-end system based on artificial intelligence.
In machine learning, Deep Reinforcement Learning (DRL) has been of great interest to researchers with academic and industrial approaches in several application areas [5]. In the field of robotic manipulation, there are several frameworks and simulation platforms that allow for the implementation and evaluation of various control algorithms for the application of RL in robotics [6]. Despite the success of DRL, there are also methods that optimize DRL algorithms by combining CPU and GPU computation, thus allowing larger data batches to be used without affecting the final performance [7].
In the literature, there are many papers presenting reviews of DRL algorithms for robotic
manipulation; however, these papers do not provide details of the implementation process
for a specific problem. The contribution of this work is to
implement a robotic system in a simulation platform; the end-to-end control approach is
based on DRL, and its performance in the classical task of reaching a goal using a robotic
manipulator is evaluated.
The structure of the paper is described as follows: Section 2 describes the main
fundamentals and structure involved in deep reinforcement learning applied to robotic
manipulation. Section 3 describes the components and architecture of the robotic system
implemented in simulation. Section 4 provides the details of the training and evaluation
processes of the robotic system whose control is based on DRL. Section 5 presents some
results obtained with the simulation implementation of the DRL algorithms. Finally, some
conclusions obtained from the complete development process of this work are presented.
where S is the set of states of the agent and the environment, A is the set of actions executed
by the agent, P is the model of the system—in other words, it is the transition probability
of a state—R is the reward function, and γ is a discount factor [10]. The DRL objective
function has two forms: the first is a value function that defines the expectation of the
accumulated reward. V^π(s) represents the estimate of the state-value function under the policy π for the MDP; given the policy π, we have the expected return:

V^π(s) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + · · · | s_t = s ],  (2)
The second is the action-value function, known as the Q function; this function indicates that after performing an action a according to the policy π in the state s, an accumulated reward (r) is generated. Q^π(s, a) represents the estimate of the action-value function under the policy π for the MDP; given the policy π, we have the following expected return:

Q^π(s, a) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + · · · | s_t = s, a_t = a ],  (3)
where A_t is the action taken in a given state, R_t is the reward obtained in a given state S_t, R_t = R(S_t), and R is the reward function. To obtain a correct advantage function, the TD error is used as a better estimator for the decision-making process when choosing an action in a given state [9]. Agents can be trained using A2C to estimate the advantage function by combining the estimated reward value and the observed reward value.
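For reference, A2C typically estimates this advantage with the one-step temporal-difference (TD) error, which combines the observed reward with the critic's value estimates; the following is the standard formulation rather than an expression reproduced from this paper:

A(s_t, a_t) ≈ δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t).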
3. Proposed System
The robotic system proposed in this work was developed using free software tools for robotics research. The agent to be evaluated is a robotic manipulator with 7 degrees of freedom, and the task to be performed is to reach a target within its working area in a semi-photorealistic scene. The training data are obtained directly from the simulator, and by means of the API integrated in this platform, the data flow and the control of the robot are managed.
3.1. System Architecture
The robotic manipulator control approach is based on the general DRL scheme; the software system architecture is of a client–server type. The client contains the components to train and evaluate the DRL agent located on the server side, which consists of the simulation platform. Since this is an end-to-end control approach, the input data to the system consist of the position of the robot joints, the position (x, y, z) of a tip located at the end of the end-effector, and the measured distance between the end-effector and the target sphere. The control actions are angles for each of the joints of the 7-DoF robot. This configuration is shown in Figure 1, which describes the process of learning an agent using DRL for a robotic manipulator in CoppeliaSim.
Figure 1. DRL process architecture for the robotic manipulator.
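As an illustration of the client–server data flow described above, the following minimal sketch shows how a Python client can read joint and tip states and send joint commands through CoppeliaSim's ZeroMQ remote API; the object paths (/Manipulator/joint1, /Manipulator/tip, /Target) are hypothetical placeholders and not necessarily the names used in the actual scene.

```python
# Minimal sketch of the client-server exchange with CoppeliaSim
# (ZeroMQ remote API client; scene object paths are hypothetical).
import math
from coppeliasim_zmqremoteapi_client import RemoteAPIClient

client = RemoteAPIClient()          # connects to CoppeliaSim on localhost
sim = client.getObject('sim')       # handle to the simulator API

# Hypothetical object paths for a 7-DoF arm, its tip, and the target sphere.
joints = [sim.getObject(f'/Manipulator/joint{i + 1}') for i in range(7)]
tip = sim.getObject('/Manipulator/tip')
target = sim.getObject('/Target')

sim.startSimulation()

# Observation: joint angles, tip position, and tip-target distance.
joint_angles = [sim.getJointPosition(j) for j in joints]
tip_pos = sim.getObjectPosition(tip, sim.handle_world)
target_pos = sim.getObjectPosition(target, sim.handle_world)
distance = math.dist(tip_pos, target_pos)

# Action: command a target angle for each joint.
action = [0.1] * 7                  # example joint angles in radians
for j, angle in zip(joints, action):
    sim.setJointTargetPosition(j, angle)

sim.stopSimulation()
```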
When using an A2C algorithm with an Actor-Critic policy, the model structure consists of two networks: (1) the Actor network recommends the action to be taken by the agent and gives the probability distribution over actions for the current state, and (2) the Critic network gives an estimate of the reward value for the executed actions; it yields the estimated total future reward from the given state. The neural network architecture consists of the structure shown in Figure 2, which comprises the two networks, Actor and Critic; each network has layers of sizes 256, 512, and 256, respectively, with a ReLU activation function.
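As an illustration, Actor and Critic networks of sizes 256, 512, and 256 with ReLU activations can be declared in Stable Baselines 3 through the policy keyword arguments; this is only a sketch of one possible configuration, not the exact code used by the authors, and a stand-in environment is used so the snippet is self-contained.

```python
import gymnasium as gym
import torch
from stable_baselines3 import A2C

# Actor (pi) and Critic (vf) networks of sizes 256, 512, 256 with ReLU,
# matching the structure described for Figure 2.
policy_kwargs = dict(
    net_arch=dict(pi=[256, 512, 256], vf=[256, 512, 256]),
    activation_fn=torch.nn.ReLU,
)

# Stand-in environment; in the paper's system this would be the custom
# CoppeliaSim Gymnasium environment described in Section 3.
env = gym.make("Pendulum-v1")
model = A2C("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=0)
```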
3.2. Reward Function
The reward function should make the robotic manipulator reach the target position,
whose end-effector should approach a sphere located in the working area. After executing
an action, an observation is obtained, and the total reward is calculated, which is achieved
with the following expression:
R_t = r_dist + r_ctrl + r_pa + r_l + r_c,  (5)
where r_dist represents the distance reward, which quantifies the Euclidean distance between
the end of the robotic manipulator and the target sphere. This reward has a more negative
value when the end of the manipulator is farther away from the target. On the other hand,
r_ctrl corresponds to the control reward, which is calculated as the squared Euclidean norm
of the action taken by the agent. This reward penalizes the agent when it makes excessively
large movements. In addition, the reward r_pa is obtained by comparing the previous
distance with the current distance after performing an action. If the previous distance is
less than the current distance, the agent receives a negative reward, as this indicates that
the end-effector of the manipulator is moving away from the target position. The difference
reward is calculated by the following function:
r_pa = d_p − d_a,  (6)
where d_p is the previous distance between the end-effector and the target to be reached; this distance value is calculated before the robot executes an action. d_a indicates the current distance between the end-effector position and the sphere position. The arrival reward r_l of the manipulator is a positive value given when the end-effector is within a threshold distance of the target. The collision reward r_c is a positive value given to the agent when the end-effector touches the target object. The total reward is composed of the five elements already mentioned and is defined by the following function:
R_t =
  r_dist : −√( Σ_{i=1}^{n} (y_i − x_i)^2 )
  r_ctrl : −Σ_{i=1}^{n} (Action_i)^2
  r_pa : d_p − d_a
  r_l : if |d_a| < d_u
  r_c : if collision = True        (7)
where Action is a vector composed of the angular positions of the seven joints of the robot.
The term d_u is the threshold distance that determines that the end-effector is approaching
the target, and the term "collision" has a logical value that represents the collision between
these two objects in the simulation; once this condition is fulfilled, the agent is given a
positive value for reaching the target. These rewards are essential components in guiding
the agent's on-task behavior, encouraging precise movements toward the target sphere and
discouraging abrupt or distant actions.
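To make the composition of Equation (7) concrete, the following sketch computes the five reward terms from the quantities defined above; the numerical values of the arrival bonus, collision bonus, and threshold d_u are illustrative assumptions, since they are not listed in the text.

```python
import numpy as np

def total_reward(tip_pos, target_pos, action, d_prev, d_u=0.05,
                 arrival_bonus=1.0, collision_bonus=1.0, collided=False):
    """Sketch of Equation (7); bonus magnitudes and d_u are assumptions."""
    tip_pos, target_pos, action = map(np.asarray, (tip_pos, target_pos, action))

    d_curr = np.linalg.norm(target_pos - tip_pos)        # Euclidean tip-target distance
    r_dist = -d_curr                                     # distance penalty
    r_ctrl = -np.sum(action ** 2)                        # penalty for large actions
    r_pa = d_prev - d_curr                               # progress since the previous step
    r_l = arrival_bonus if abs(d_curr) < d_u else 0.0    # arrival reward
    r_c = collision_bonus if collided else 0.0           # reward for touching the target

    return r_dist + r_ctrl + r_pa + r_l + r_c
```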
Figure 2. Neural network structure for A2C.
3.3. Action and Observation Space
The robotic manipulator consists of seven joints (7 DoF); as a result, the action space is a vector containing the angles of each joint of the robot. The observation space is a vector of 1 × 18 elements containing the position of the end-effector, the angles of each joint, and the distance between the end-effector and the target sphere. A minimal sketch of how these spaces can be declared is given after Figure 3.
Figure 3. Simulation scene for the robotic manipulator.
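As referenced in Section 3.3, the following sketch shows how such action and observation spaces can be declared with Gymnasium; the joint limits and the exact layout of the 18-element observation vector are assumptions for illustration (the authoritative definitions are in the repository linked in Section 4).

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ReachEnvSpaces(gym.Env):
    """Sketch of the spaces only; simulator coupling and reward are omitted."""

    def __init__(self):
        super().__init__()
        # Action: one target angle per joint of the 7-DoF manipulator
        # (joint limits of +/- pi rad are an assumption).
        self.action_space = spaces.Box(low=-np.pi, high=np.pi,
                                       shape=(7,), dtype=np.float32)
        # Observation: 1 x 18 vector with the end-effector position,
        # the joint angles, and the tip-target distance, as described above.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(18,), dtype=np.float32)
```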
4. Training and Evaluation Details
The robotic system was trained and evaluated in a simulation environment, and the following software components were used: Python as the base programming language, CoppeliaSim as the simulation platform, Stable Baselines 3 as the DRL framework that allowed for the integration of a deep neural network (DNN), and the A2C algorithm in the test agent. The hardware used for the implementation of this system has the following features: a Ryzen 7 CPU, an NVIDIA GeForce RTX 3070 8 GB GPU, and 16 GB of RAM.
During the training phase, the robotic manipulator attempts to approach the position of the target located at a point in the scene; the coordinates of the robot joints towards the target are given randomly and updated after several episodes, making the robot perform its task better. This makes the DRL training efficient and robust.
Figure 4 presents the main elements used for this robotic system in software. In (a), the custom environment for CoppeliaSim is presented; the class is created, and the Gymnasium methods for RL, the control functions, and the data acquisition from the simulator are established. (b) is the part of the client in which the agent training is performed and the model is saved. (c) is also part of the client, where the trained agent is evaluated, which involves loading the model and running it on the simulation platform to verify that the assigned task is being performed. A minimal sketch of the training and evaluation steps is given at the end of this section.
The selected algorithm is Advantage Actor-Critic (A2C) [14], because it meets the
requirements of the action and observation spaces used with Gymnasium that are supported
by the DRL framework, and it is one of the algorithms used for robotic manipulation [15].
The input state of the deep neural network consists of a vector with these elements: the
position of the tip point of the end-effector, the angles of the joints, and the distance from
the end-effector to the target. The action performed by the system is a vector containing
seven elements corresponding to each of the robot’s joints.
For additional details on the development and implementation of the robotic system
in the simulation platform, a link to the repository is provided here: https://github.com/
RogerSgo/DRL-Sim-Reach-Target (accessed on 2 October 2023).
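As mentioned above, the following is a minimal sketch of the training (Figure 4b) and evaluation (Figure 4c) steps with Stable Baselines 3; the stand-in environment, file name, and timestep budget are illustrative assumptions rather than the authors' exact scripts.

```python
import gymnasium as gym
from stable_baselines3 import A2C

# A stand-in environment keeps this sketch self-contained; in the actual
# system it would be the custom CoppeliaSim environment of Figure 4a.
env = gym.make("Pendulum-v1")

# (b) Training client: create the agent, train it, and save the model.
model = A2C("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)          # training budget is illustrative
model.save("a2c_reach_target")               # hypothetical file name

# (c) Evaluation client: load the trained model and run one episode.
model = A2C.load("a2c_reach_target", env=env)
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
```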
5. Results
This section presents the main results obtained from the experimentation carried out in the CoppeliaSim EDU 4.5 simulation environment. The model was trained for 1000 episodes with a standard Actor-Critic neural network architecture of 3 layers of sizes 256, 512, and 256 in each network. The resulting test model took about 1 h to train.
Table 1 shows the reward values obtained during the training of the agent with the A2C algorithm over the 1000 episodes; it can be seen that the training worked according to the criteria established for the assigned task. In this case, it is observed that as the distance (d_a) between the end-effector and the target decreases, the reward value (R_t) increases, which indicates that the agent is learning to perform this movement action.

Table 1. Reward values.

Episode   d_p     d_a     r_dist    r_ctrl    r_pa    r_l     r_c     R_t
1         0.467   0.465   −0.465    −0.350    0.002   0.000   0.000   −0.813
500       0.158   0.145   −0.145    0.552     0.013   0.145   0.00    −0.538
1000      0.092   0.090   −0.090    −0.0333   0.001   0.090   0.00    −0.332

The learning curve of the A2C agent is based on the training reward. To obtain the learning curve, the sum of the rewards that the agent obtains during the set of episodes determined for training is measured. A high cumulative reward indicates that the agent is successful in the task. The learning curve is presented in Figure 5.
Figure 5. Agent learning curve.
To verify the performance of the reach-to-target task using the A2C algorithm, the trained model is evaluated by performing multiple simulation experiments in CoppeliaSim. Figure 6 demonstrates that the robotic manipulator can execute the task of reaching a target using the DRL control method. The robot starts from an initial position and, using the trained model, performs motion predictions until it reaches the target located at a position in the working area of the environment.
The performance evaluation of the A2C method for this robotics application is based on the average reward; this parameter is a common metric for the evaluation of RL algorithms. The task to be performed in this work was the reaching of a target by the end-effector of a robotic manipulator, and according to the reward function designed in this system, the reward value should increase as the measured distance between the end-effector and the target decreases. This shows that the performance is satisfactory.
Figure 6. The sequence of manipulator movement to reach the target: (a) The initial position of the robotic manipulator; (b) The end-effector moves about half the distance to the target; (c) The end-effector arrives at the target goal.
The evaluation of the trained DRL model was performed by running it for 100 episodes. Two important parameters are presented to measure the agent's performance based on A2C, as shown in Figure 7; the curve in red is the reward value accumulated during that number of episodes, and the curve in blue represents the measured distance of the end-effector tip when approaching the target.
With the results obtained from the training process in Table 1 and the evaluation process in Figure 7, it is shown that the reward value increases when the measured distance between the tip positioned at the end of the end-effector and the target sphere decreases, with each action predicted by the DRL model. This indicates that the trained agent fulfills the task during the evaluation of the system in simulation. A minimal sketch of this evaluation loop is given after Figure 7.
Figure 7. The evaluation parameters of the DRL agent. The red curve is the reward accumulated when the end-effector approaches the target sphere position. The blue curve is the distance between the end-effector and the target sphere.
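The evaluation described above (100 episodes, accumulating the reward and recording the tip-target distance plotted in Figure 7) can be scripted as in the following sketch; it assumes the custom environment reports the distance through its info dictionary under a hypothetical key, which may differ from the actual implementation.

```python
import numpy as np

def evaluate(model, env, n_episodes=100):
    """Sketch of the evaluation used for Figure 7 (reward and distance per episode)."""
    rewards, distances = [], []
    for _ in range(n_episodes):
        obs, info = env.reset()
        terminated = truncated = False
        episode_reward = 0.0
        while not (terminated or truncated):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += float(reward)
        rewards.append(episode_reward)
        # Hypothetical key: the environment is assumed to expose the final
        # tip-target distance through its info dictionary.
        distances.append(info.get("tip_target_distance", np.nan))
    return rewards, distances
```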
6. Conclusions
The reward value increases as the measured distance decreases between the end-effector and the target. These values were obtained from the training and evaluation of the system and show that the agent was learning to perform the task of reaching a target within the work zone in the CoppeliaSim simulation scene.
The simulation experiments showed that the training process was completed in a short time due to GPU acceleration. The current implementation used data from several sources, such as the joint positions given as angles, data related to the distance from the end-effector to the target, and the distance between the end-effector marker and the target sphere. The processing performance will vary as the system is scaled, for example by increasing the number of sensors or the complexity of the deep neural network.
The performance of the DRL agent depends on many factors, and we can emphasize these: the number of episodes used to train the model, the framework selected to implement the DRL algorithm, the design of the neural network architecture, and the processing logic for the actions taken by the agent at each step.
Author Contributions: Conceptualization, methodology, C.C.-C. and R.S.; software, R.S.; validation,
R.S.; writing—original draft preparation, C.C.-C. and R.S.; writing—review and editing, R.S.; super-
vision, C.C.-C.; project administration and funding acquisition, C.C.-C. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by Universidad Tecnica Particular de Loja, grant number
PROY_ARTIC_CE_2022_3667.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. del Real Torres, A.; Andreiana, D.S.; Ojeda Roldan, A.; Hernandez Bustos, A.; Acevedo Galicia, L.E. A Review of Deep
Reinforcement Learning Approaches for Smart Manufacturing in Industry 4.0 and 5.0 Framework. Appl. Sci. 2022, 12, 12377.
[CrossRef]
2. Bhuiyan, T.; Kästner, L.; Hu, Y.; Kutschank, B.; Lambrecht, J. Deep-Reinforcement-Learning-based Path Planning for Industrial
Robots using Distance Sensors as Observation. In Proceedings of the 2023 8th International Conference on Control and Robotics
Engineering (ICCRE), Niigata, Japan, 21–23 April 2023.
3. Jiang, R.; Wang, Z.; He, B.; Di, Z. Vision-Based Deep Reinforcement Learning For UR5 Robot Motion Control. In Proceedings of
the 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China,
15–17 January 2021; pp. 246–250.
4. Collins, J.; Chand, S.; Vanderkop, A.; Howard, D. A review of physics simulators for robotic applications. IEEE Access 2021, 9,
51416–51431. [CrossRef]
5. Gupta, S.; Singal, G.; Garg, D. Deep reinforcement learning techniques in diversified domains: A survey. Arch. Comput. Methods
Eng. 2021, 28, 4715–4754. [CrossRef]
6. Nguyen, H.; La, H. Review of deep reinforcement learning for robot manipulation. In Proceedings of the 2019 Third IEEE
International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 590–595.
7. Stooke, A.; Abbeel, P. Accelerated methods for deep reinforcement learning. arXiv 2018, arXiv:1803.02811.
8. Gym, O.; Sanghi, N. Deep Reinforcement Learning with Python; Springer: Berlin/Heidelberg, Germany, 2021.
9. Dong, H.; Ding, Z.; Zhang, S. Deep Reinforcement Learning—Fundamentals, Research and Applications; Springer: Berlin/Heidelberg,
Germany, 2020.
10. Liu, L.-L.; Chen, E.-L.; Gao, Z.-G.; Wang, Y. Research on motion planning of seven degree of freedom manipulator based on
DDPG. In Proceedings of the Advanced Manufacturing and Automation VIII 8, Changzhou, China, 20–21 September 2018; pp.
356–367.
11. Robotics, C. Robotics Simulator CoppeliaSim. Available online: https://www.coppeliarobotics.com/ (accessed on 1 July 2023).
12. Towers, M.; Terry, J.K.; Kwiatkowski, A.; Balis, J.U.; Cola, G.d.; Deleu, T.; Goulão, M.; Kallinteris, A.; Arjun, K.G.; Krimmel, M.;
et al. Gymnasium. Available online: https://github.com/Farama-Foundation/Gymnasium (accessed on 1 July 2023).
13. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-baselines3: Reliable reinforcement learning
implementations. J. Mach. Learn. Res. 2021, 22, 12348–12355.
14. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep
reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June
2016; pp. 1928–1937.
15. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation.
Sensors 2023, 23, 3762. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.