Improving Deep Reinforcement Learning
Abstract—Autonomous robotic navigation has become a research hotspot, particularly in complex environments, where inefficient exploration can lead to poor navigation performance. Previous approaches often relied on a wide range of assumptions and prior knowledge. Adaptations of machine learning (ML) approaches, especially deep learning, play a vital role in robotic applications such as navigation, detection, and prediction. Further development is needed due to the fast growth of urban megacities. The main problem of training convergence time in deep reinforcement learning (DRL) for mobile robot navigation refers to the time it takes for the agent to learn an optimal policy through trial and error; it is caused by the need to collect a large amount of data and by the computational demands of training deep neural networks. Meanwhile, the assumption of reward in DRL for navigation is problematic because it can be difficult or impossible to define a clear reward function in real-world scenarios, making it challenging to train the agent to navigate effectively. This paper proposes a neuro-symbolic approach that combines the strengths of deep reinforcement learning and fuzzy logic to address the challenges of DRL for mobile robot navigation in terms of training time and the reward assumption, by incorporating symbolic representations to guide the learning process and to infer the underlying objectives of the task, which is expected to reduce the training convergence time.

Keywords—Autonomous navigation; deep reinforcement learning; mobile robots; neuro-symbolic; Fuzzy Logic

I. INTRODUCTION

Advancements in robot navigation have spurred the development of algorithms that leverage basic rules and environmental mapping to optimize path planning. Rule-based methods, such as Fuzzy logic and Neuro-fuzzy techniques, have been extensively explored to enhance navigation decisions and tracking performance under uncertain conditions [1], [2]. While these methods offer valuable insights, they often require extensive justification and may not fully meet the demands for efficient and accurate path planning.

To address this challenge, researchers have turned to bio-inspired approaches, such as genetic algorithms and swarm optimization, which draw inspiration from biological behavior and incorporate prior knowledge to simulate human cognitive processes [3], [4]. One particularly promising area in navigation research is reinforcement learning (RL), which enables autonomous agents to learn and make sequential decisions in complex environments. Machine learning models, including supervised, unsupervised, and reinforcement learning, have played a pivotal role in robotics research, enabling learning, adaptation, and effective detection and classification. Deep reinforcement learning (DRL), a fusion of RL and deep neural networks, has emerged as a powerful approach for decision-making tasks involving high-dimensional inputs [5], [6]. This article aims to delve into the application of RL techniques, specifically Q-learning and deep Q-networks, for mobile robot path planning. By seamlessly integrating these techniques with widely used frameworks such as ROS, Gazebo, and OpenAI, a robust and autonomous navigation system can be developed, leading to improved performance, optimized routes, and efficient obstacle avoidance in complex environments. The evaluation of this system will contribute to the advancement of autonomous robotics. The trial-and-error learning process inherent in RL offers immense potential for building human-level agents and has been extensively explored in various domains [7], [8]. Deep learning (DL), characterized by its ability to extract meaningful patterns and classifications from raw sensory data through deep neural networks, has revolutionized the field of machine learning. When combined with RL, in the form of DRL, this integration has shown remarkable success in tackling challenges associated with sequential decision-making [9], [10]. Notably, DRL excels in scenarios involving a vast number of states, making it an ideal candidate for addressing navigation complexities. Nevertheless, achieving optimal navigation remains an ongoing challenge, necessitating further optimization and effective handling of high-dimensional data. Reinforcement learning methods offer valuable approaches for learning and planning navigation, empowering agents to interact with their environment and make autonomous decisions. Various studies have proposed agent-based DRL approaches for navigation, successfully simulating diverse scenarios without the need for intricate rule-based systems or laborious parameter tuning. However, there is still room for improvement in terms of achieving the shortest and fastest routes. To enhance navigation performance and optimize evacuation paths, researchers have explored techniques such as look-ahead crowded estimation and Q-learning, which have demonstrated superior results compared to other RL algorithms [6]. Additionally, CNN-based robot-assisted evacuation systems have been developed to maximize pedestrian outflow by extracting specific features from high-dimensional images. Furthermore, iterative and incremental learning strategies,
like vector quantization with Q-learning (VQQL), have been proposed to expedite the learning process and optimize navigation by gradually improving interactions among agents [11], [12]. These advancements in DRL continue to show great promise in addressing the speed of agent learning and optimizing navigation processes. In the realm of task planning, the ability to find a series of steps that transform initial conditions into desired states is crucial. Task planning becomes especially important when atomic actions alone cannot accomplish a task. Neuro-symbolic task planning has emerged as an effective approach, allowing for the incorporation of restrictions, guidelines, and requirements in each activity. However, traditional task planners often rely on detailed hand-coded explanations, limiting their scalability. To overcome this limitation, a combination of deep learning and symbolic planning, known as a neuro-symbolic approach, has shown potential by leveraging visual information instead of hand-coded explanations [3], [13], [14]. However, collecting image data for neuro-symbolic models in robotic applications is a labor-intensive process that involves steps such as creating problem instances, defining initial and goal states, operating robots, and capturing scene images. The challenges associated with data collection have hindered the widespread adoption of neuro-symbolic models in robot task planning. Neuro-symbolic models excel in reasoning, providing explanations and manipulating complex data structures. Conversely, numerical models, such as neuronal models, are preferred for pattern recognition due to their generalization and learning abilities. A unified strategy proposes that the characteristic properties of symbolic artificial intelligence can emerge from distributed local computations performed by neuronal models, spanning cognitive functions from the neuron level to the structural level of the nervous system. By integrating neuro-symbolic and numerical models, a comprehensive framework can be established to leverage the strengths of both approaches in robotics. This integrated approach holds the potential to enable efficient task planning, grounding symbols in perceptual information, and enhancing pattern recognition capabilities. Ultimately, this integration could advance cognitive functions and pave the way for the creation of more sophisticated robotic systems.

This paper is organized as follows. Section II presents the proposed method, which integrates reinforcement learning (RL) and fuzzy logic for mobile robot path planning, aiming to create a robust autonomous navigation system that optimizes routes and efficiently avoids obstacles in complex environments. Section III illustrates the simulation set-up, while Section IV provides an evaluation of the training process of the policy optimization. Finally, Section V presents the evaluation and verification of the developed policy based on the proposed method, followed by the conclusion.

II. METHODS

The methodology for this project involves the utilization of simulation tools, namely Gazebo, ROS (Robot Operating System), and OpenAI Gym. Gazebo provides a realistic environment for simulating the mobile robot path planning system, while ROS serves as a comprehensive framework for controlling the robot and interfacing with its sensors and actuators. OpenAI Gym is used to train and evaluate the reinforcement learning algorithms. The main focus of this project is to apply reinforcement learning techniques to mobile robot path planning. Unlike traditional approaches that rely on SLAM or mapping techniques, the project aims to enable the robot to learn the optimal path through a reward and punishment system. By using reinforcement learning algorithms such as Q-learning, SARSA, and DQN, the robot can learn to navigate its environment efficiently and safely. To facilitate communication between the simulation and the robot, ROS integration is implemented. This integration allows the robot to receive sensor data, send control commands, and interact with the simulation environment seamlessly. By leveraging the capabilities of ROS, the reinforcement learning algorithms can effectively interface with the robot's actions and observations [15]–[17]. The reinforcement learning algorithms receive feedback through a reward and punishment system based on the robot's performance in reaching the goal while avoiding collisions and obstacles. The training aims to optimize the robot's decision-making and path planning abilities. Performance analysis is conducted to assess the effectiveness of the trained reinforcement learning models. Metrics such as the time taken to reach the goal, collision occurrences with static and dynamic obstacles, and the number of pathing alterations are measured and analyzed. These metrics provide insights into the path planning efficiency, collision avoidance capabilities, and adaptability of the reinforcement learning approach. In conclusion, the methodology of this project involves using simulation tools (Gazebo, ROS, and OpenAI Gym) to evaluate the application of reinforcement learning algorithms (Q-learning, SARSA, and DQN) in mobile robot path planning. The integration of ROS ensures seamless communication between the simulation environment and the robot, while the OpenAI Gym environment provides a standardized framework for training and evaluating the algorithms. The methodology enables rigorous testing and analysis of the robot's performance in terms of path planning, collision avoidance, and adaptability to dynamic environments. The following subsection discusses the mathematical model of the Q-learning with fuzzy logic approach for navigation problems and the experimental setup used in this work.
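To make the setup above concrete, the following sketch shows a minimal Gym-style grid-world environment that mirrors the reward-and-punishment scheme described in this methodology: a positive reward for reaching the goal, a penalty for collisions, and a small per-step cost that favours short paths. It is a hedged illustration rather than the project's actual stack; the class name GridNavEnv, the grid size, and the reward values are assumptions, and in the real system the states and commands would flow through ROS topics and the Gazebo simulation rather than an internal grid.

class GridNavEnv:
    """Minimal Gym-style grid-world stand-in for the robot navigation task.

    States are cells of an N x N grid and actions are the four cardinal moves.
    Rewards: +100 for reaching the goal, -100 for hitting an obstacle,
    -1 per step to encourage short paths (all values are illustrative)."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10, obstacles=((3, 3), (3, 4), (6, 7)), goal=(9, 9)):
        self.size = size
        self.obstacles = set(obstacles)
        self.goal = goal
        self.start = (0, 0)
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        if self.state in self.obstacles:
            return self.state, -100.0, True, {}   # collision: punish and end the episode
        if self.state == self.goal:
            return self.state, 100.0, True, {}    # goal reached: reward and end the episode
        return self.state, -1.0, False, {}        # step cost encourages shorter paths

Because the interface follows the OpenAI Gym convention of reset() and step(), tabular agents such as Q-learning or SARSA and a DQN agent can be evaluated against the same environment without modification.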
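The performance metrics listed above (time taken to reach the goal, collision occurrences, and pathing alterations) can be collected with a small per-episode recorder. The helper below is an illustrative sketch, not part of the paper's toolchain; it assumes the reward scheme of the previous sketch, in which an episode ends with a positive reward at the goal and a negative reward on collision, and it counts a pathing alteration whenever the selected action changes between consecutive steps.

import time

class EpisodeMetrics:
    """Records per-episode navigation metrics: elapsed time, collision flag,
    goal flag, and the number of pathing alterations (action changes)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self._t0 = time.time()
        self._last_action = None
        self.steps = 0
        self.path_alterations = 0
        self.collided = False
        self.reached_goal = False

    def record_step(self, action, reward, done):
        self.steps += 1
        if self._last_action is not None and action != self._last_action:
            self.path_alterations += 1      # a heading change counts as a pathing alteration
        self._last_action = action
        if done:
            self.collided = reward < 0      # episode ended with punishment: collision
            self.reached_goal = reward > 0  # episode ended with reward: goal reached

    @property
    def elapsed_seconds(self):
        return time.time() - self._t0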
In the context of agents utilizing visual SLAM, traditional algorithms are still employed for final path planning on the map. However, RL offers numerous applications, and in mobile robot navigation it can replace the path planning component. The RL model, after training, can effectively make decisions, enabling the agent to select its path from one location to another based on interactions with the environment [18], [19]. The environment is abstracted into a grid map representation, with each position on the map corresponding to an agent state. Transitioning from one state to another reflects the actual movement of the entity, while the agent's behavioral decision-making is represented by its state choice at each step in the RL model. The reward value plays a pivotal role in guiding path selection. Early Q-learning recorded reward values between position states in a table, guiding the next state selection. With the emergence of deep reinforcement learning, a DL model is integrated, replacing the table with a neural network that provides the corresponding decision result for an input state [20], [21]. The weighting parameters in the neural network influence the choice of the next state. On the other hand, when incorporating fuzzy logic into the RL model, the decision-making process becomes more nuanced and interpretable. Fuzzy logic allows for handling uncertainties and imprecise information, enabling the agent to reason with vague input and output values. By combining RL and fuzzy logic, the agent can make more human-like decisions, considering both the environment's precise measurements and the agent's subjective understanding of the surroundings. This fusion can enhance path planning in complex and dynamic environments by considering various factors and optimizing the decision-making process.

A. Q-Learning Algorithm

RL defines any decision maker as an agent and everything outside the agent as the environment. The agent aims to maximize the accumulated reward and obtains a reward value as a feedback signal for training through interaction with the environment. Beyond the agent (which performs actions) and the environment (which is made up of states), there are three major elements of a reinforcement learning system:

Policy π: It formalizes the agent's decisions and determines the agent's behaviour at a given time. A policy π is a function that maps a perceived state to the action taken in that state.

Reward r: The agent receives feedback known as a reward, r_{t+1}, for each action at time step t, indicating the inherent desirability of that state. The main goal of the agent is to maximize the cumulative reward over time. The total sum of the rewards (the return) is:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T,   T: final time step

The agent-environment interaction breaks into episodes, where each episode ends in a state called the terminal state, followed by a reset to a standard starting state. In some cases the episodes continue indefinitely, the final time step would be T = ∞, and the return becomes infinite. So, a discount factor γ is introduced. The discounted return is defined as:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},   0 < γ < 1

Rewards can be sparse (given after a long sequence of actions), given at every time step, or given at the end of an episode.

Value function: Most RL algorithms are based on estimating value functions (of states or state-action pairs). A value function is used to estimate how good a certain state is for the agent to be in (state value function), or how good a certain action is to perform in a specific state (state-action value function). The state value function under the policy π, denoted V_π(s), is the expected return,

V_π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

The state-action value function under policy π, denoted Q_π(s, a), is the expected accumulated return from state s taking action a; Q_π is also known as the action value function and is the quantity estimated by the Q-learning algorithm.

Q_π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }

Q_π(s, a) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a ]

Reinforcement learning is about finding an optimal policy that achieves a large amount of reward over the long term. A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states:

π ≥ π′  ⇔  V_π(s) ≥ V_π′(s), for all states s

The optimal value functions must satisfy the following conditions:

V*(s) = max_π V_π(s), for all states

Q*(s, a) = max_π Q_π(s, a), for all states and actions

We obtain the optimal policy by solving Q*(s, a) to find the action that gives the optimal state-action value,

π*(s) = argmax_a Q*(s, a)

The Q-learning algorithm is an off-policy, value-based RL algorithm that is very effective in unknown environments [6], [21], [22]. The value of a state-action pair can be decomposed into the immediate reward plus the value of the successor state-action pair Q_π(s′, a′), discounted by a factor γ:

Q_π(s, a) = E_π[ r + γ Q_π(s′, a′) | s, a ]

And according to the Bellman optimality equation, the optimal value function can be expressed as:

Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) | s, a ]

The value function is updated iteratively to obtain the optimal value function,

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ],   α: learning rate

Q(s, a) converges to Q*(s, a) as t → ∞.

Algorithm 1 illustrates the overall framework of the proposed Q-learning to generate the shortest route for navigation mapping.

Algorithm 1. Overall framework of the Q-Learning
Initialize Q(s, a) arbitrarily
repeat for each episode:
    Initialize s
    for each step of the episode do
        Choose a from s using an ε-greedy policy
        Take action a, observe r and s′
        Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
        s ← s′
until s is terminal
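The update rule and Algorithm 1 translate almost line for line into code. The sketch below is a plain tabular Q-learning loop with ε-greedy exploration, written against the GridNavEnv sketch from earlier in this section; the hyperparameter values (α, γ, ε, episode budget) are illustrative assumptions rather than the settings used in the reported experiments.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=200):
    """Tabular Q-learning following Algorithm 1 with epsilon-greedy exploration."""
    n_actions = len(env.ACTIONS)
    Q = defaultdict(lambda: [0.0] * n_actions)   # Q(s, a) initialized arbitrarily (zeros)

    for _ in range(episodes):
        s = env.reset()                          # Initialize s
        for _ in range(max_steps):               # for each step of the episode
            if random.random() < epsilon:        # Choose a from s using epsilon-greedy policy
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])

            s_next, r, done, _ = env.step(a)     # Take action a, observe r and s'

            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            td_target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (td_target - Q[s][a])

            s = s_next                           # s <- s'
            if done:                             # until s is terminal
                break
    return Q

# Usage sketch: extract the greedy policy, pi*(s) = argmax_a Q(s, a)
# env = GridNavEnv()
# Q = q_learning(env)
# policy = {s: max(range(len(env.ACTIONS)), key=lambda i: q[i]) for s, q in Q.items()}

Extracting the navigation policy from the learned table is then exactly π*(s) = argmax_a Q(s, a), as shown in the commented usage lines.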
Fig. 1. Visual range.
The simulations captured the nuanced behaviors of the agent, its path planning strategies, and obstacle avoidance mechanisms in diverse settings. The improved method consistently demonstrated superior performance, efficiently finding optimal routes to reach the target point while navigating around obstacles effectively. The simulations offered valuable insights into the agent's behavior, path planning, and obstacle avoidance, elucidating fundamental aspects of autonomous robot navigation.

Furthermore, the deliberate choice of Fig. 5(b) as a test run was made to rigorously assess the proposed method's robustness in scenarios with increased complexity and multiple target points. This strategic selection adds an additional layer of validation, demonstrating the algorithm's efficacy in handling intricate navigation tasks.

The stability in parameter learning, particularly evident in the fuzzy logic (FL) configuration employed in our model, facilitates faster convergence to the optimal values. This not only enhances the efficiency of path planning but also showcases the model's robustness in navigating complex environments.
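To illustrate how the fuzzy-logic guidance described in Section II could interact with the Q-learning loop, the sketch below biases the ε-greedy action selection with a fuzzy preference computed from the sensed distance to the nearest obstacle in each candidate direction. This is a hedged sketch of one possible integration, not the paper's actual rule base: the membership breakpoints, the rule weights, and the helper names (fuzzy_preference, guided_action) are assumptions introduced for illustration.

import random

def tri(x, a, b, c):
    """Triangular membership function with peak at b and support (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_preference(distance):
    """Map a sensed obstacle distance (m) to a preference score in [-1, 1].

    Illustrative rule base: Near gives a strong penalty, Medium a mild penalty,
    Far a bonus; the score is a weighted average of the rule outputs
    (a simple Sugeno-style defuzzification)."""
    near = max(0.0, min(1.0, (0.6 - distance) / 0.6))     # left shoulder: fully Near at 0 m
    medium = tri(distance, 0.3, 0.9, 1.5)                 # triangle centred at 0.9 m
    far = max(0.0, min(1.0, (distance - 1.0) / 1.5))      # right shoulder: fully Far beyond 2.5 m
    total = near + medium + far
    if total == 0.0:
        return 0.0
    return (-1.0 * near - 0.3 * medium + 1.0 * far) / total

def guided_action(Q, state, distances, epsilon=0.1, beta=10.0):
    """Epsilon-greedy choice over Q-values shaped by the fuzzy preference.

    distances[a] is the sensed distance to the nearest obstacle if action a
    were taken; beta scales the fuzzy bonus against the Q-values."""
    n_actions = len(distances)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    scores = [Q[state][a] + beta * fuzzy_preference(distances[a]) for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: scores[a])

Used in place of the plain ε-greedy choice inside the Q-learning loop, this leaves the update rule itself untouched; the symbolic rules steer exploration toward safer, more promising actions, which is one way to read the claim that symbolic guidance reduces training convergence time.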
V. EVALUATION AND VERIFICATION OF THE DEVELOPED POLICY

This work conducted three comprehensive tests to rigorously evaluate the performance of the proposed method. Each simulation aimed to assess the effectiveness of the respective algorithm in enabling the mobile robot to learn and navigate its environment autonomously.

To verify the practical performance of the model, physical tests were conducted on a robotic machine based on the Robot Operating System (ROS). The TurtleBot machine car was employed for these experiments to ensure consistency and reliability. The test environment comprised an obstacle zone constructed in the laboratory terrain, with the ideal distance from the starting point to the target point set at 8.3 meters. Fig. 5(a) and Fig. 5(b) depict the laser environment after its construction. It is important to note that the use of TurtleBot in these experiments is not meant to directly reduce errors in the algorithm. Instead, TurtleBot provides a standardized platform for testing, ensuring consistency and reliability across multiple trials. The choice of TurtleBot contributes to the creation of a controlled and reproducible testing environment, minimizing potential errors arising from variations in hardware and environmental conditions. This emphasis on error reduction pertains to establishing a robust and reliable basis for evaluating the proposed method's performance in real-world scenarios rather than directly mitigating errors in the algorithm or system. Following the integration of the trained model into the navigation function package, a meticulous series of verification tests was carried out to assess its performance. The evaluation consisted of five testing rounds, with three experiments conducted within each round to ensure the robustness of the evaluation process. For instance, in the first round of experiments, the robot's performance was tested through three individual trials: the first trial covered a distance of 8.8 meters in 77 seconds, the second trial covered 9.0 meters in 78 seconds, and the third trial spanned 8.6 meters in 73 seconds. By calculating the mean of these results, we obtained an average performance of 8.8 meters covered in 76 seconds.

Table III presents the detailed results of the first round with the fuzzy logic approach, where the robot covered distances of 8.8 meters in 65 seconds, 8.6 meters in 53 seconds, and 8.7 meters in 56 seconds during the three tests. The calculated mean for the first round of Table III was 8.7 meters covered in 58 seconds. Notably, Table III exhibited a higher learning rate compared to Table II, indicating improved efficiency in path planning and execution.

TABLE II. THE EXAMPLE OF RL ALGORITHM

Examples        Length/Time
                Test 1        Test 2        Test 3        Mean
First Round     8.8 m/77 s    9.0 m/78 s    8.6 m/73 s    8.8 m/76 s
Second Round    9.3 m/86 s    9.1 m/83 s    8.9 m/74 s    9.1 m/81 s
Third Round     8.9 m/68 s    8.6 m/63 s    9.2 m/70 s    8.9 m/67 s
Fourth Round    9.1 m/78 s    8.9 m/73 s    8.7 m/71 s    8.9 m/74 s
Fifth Round     9.2 m/80 s    9.2 m/77 s    8.6 m/77 s    9.0 m/78 s

TABLE III. THE EXAMPLE OF REINFORCEMENT LEARNING WITH FUZZY LOGIC ALGORITHM

Examples        Length/Time
                Test 1        Test 2        Test 3        Mean
First Round     8.8 m/65 s    8.6 m/53 s    8.7 m/56 s    8.7 m/58 s
Second Round    8.8 m/63 s    8.9 m/69 s    8.7 m/60 s    8.8 m/64 s
Third Round     8.7 m/66 s    8.7 m/70 s    8.5 m/68 s    8.6 m/68 s
Fourth Round    8.7 m/73 s    8.8 m/65 s    8.6 m/66 s    8.7 m/68 s
Fifth Round     8.4 m/71 s    8.7 m/73 s    8.4 m/69 s    8.5 m/69 s

The overarching analysis of these comprehensive tests reveals that the Fuzzy Logic approach consistently outperforms the other methods in terms of both time consumption and path length, particularly in the scenario represented in Fig. 5(b). It consistently finds shorter paths in less time, highlighting its superior efficiency. Additionally, the Fuzzy Logic method demonstrates remarkable stability in locating multiple paths, underscoring its prowess in complex environment path-finding.

However, it is important to acknowledge certain limitations associated with the Fuzzy Logic-based approach. While it excels in various aspects of path planning, it may face challenges when confronted with highly dynamic and rapidly changing environments. Fuzzy Logic, being rule-based and reliant on predetermined membership functions, might struggle to adapt swiftly to unpredictable obstacles or situations. Additionally, its performance could be impacted by the complexity and size of the environment, as processing a vast amount of data can introduce computational overhead. Therefore, while the Fuzzy Logic approach proves highly effective in many scenarios, it may not be the optimal choice for applications demanding real-time adaptability in extremely dynamic settings. Exploring its boundaries and considering alternative approaches for such specific scenarios remains a valuable avenue for future research and development.

VI. CONCLUSION

This research introduced a novel navigation method based on Q-learning and fuzzy logic for efficient path planning of agents in diverse environments. The proposed approach combines the strengths of deep learning with symbolic reasoning, specifically Fuzzy Logic, to overcome the challenges faced by traditional DRL methods in mobile robot navigation, reducing the global path search time by 6-9% and shortening the average path search length by 4-10% compared to pure Q-learning. The incorporation of symbolic representations in the learning process leads to reduced training convergence time and more practical path planning results. The experimental results demonstrate its efficiency and effectiveness in complex environments, making it a promising solution for autonomous robotic navigation in urban megacities. As future work, the effectiveness of new RL algorithms will be explored in even more challenging environments, further advancing the field of autonomous robotic navigation.