Enhancing Q-Learning Speed Using Selective Signal Injection
Sadredin Hokmi
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]

Mohammad Haeri
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]
Abstract— Reinforcement learning algorithms, especially model-free algorithms like Q-learning, have shown reliable results in finding optimal solutions for many real-time applications. However, challenges such as exploration in real time and the convergence rate need to be addressed, and many studies have proposed algorithms to tackle these challenges. Algorithms like Speedy Q-learning, Zap Q-learning, algorithms based on adding a regularization term, and many others have been introduced. In this paper, an algorithm based on signal injection is presented which, according to the numerical results, can significantly reduce unnecessary explorations that are risky in real time. Additionally, the proposed algorithm does not depend on the learning rate (α), the discount factor (γ), or changes in coefficients. However, for it to be effective and increase the convergence rate, the parameters of the algorithm should be chosen within the correct range. The results of applying the proposed algorithm are compared with those of two reliable algorithms, Speedy Q-learning and standard Q-learning, in a 9x9 maze, where the agent must reach the target in fewer iterations.

Keywords— Q-learning, Exploration, Signal injection, Convergence rate, Maze.
I. INTRODUCTION
Reinforcement Learning (RL) is a framework in which an agent, interacting with a dynamic environment, learns the optimal sequence of actions, or policy, to achieve a specified goal. This interaction is typically modeled as an infinite-horizon Markov decision process. One of the most popular and promising reinforcement learning algorithms is Q-learning, presented in [1]. It is a widely recognized model-free RL algorithm designed to estimate the optimal action-value function, and it combines aspects of dynamic programming, specifically the value iteration algorithm, with stochastic approximation techniques. In finite state-action problems, Q-learning has been proven to converge to the optimal action-value function [2]. However, it suffers from slow convergence. Speedy Q-learning (SQL), introduced in [3], was developed to overcome this slow-convergence issue. In each iteration, SQL uses two consecutive estimates of the Q-function along with a more aggressive learning rate in its update mechanism; by adjusting the learning rate across the two consecutive estimates, it achieves faster convergence. This allows SQL to converge more rapidly and provides a stronger finite-time performance bound than standard Q-learning. Optimized-weighted-speedy Q-learning [4] and generalized speedy Q-learning [5] have been introduced in the same spirit to reduce learning time and increase the convergence rate. Zap Q-learning [6] is another method developed to enhance Watkins' original Q-learning algorithm. It is a matrix-gain algorithm specifically designed to minimize the asymptotic variance, and numerical experiments with this approach have demonstrated fast convergence even in non-ideal scenarios. The mentioned algorithms are mostly based on modifying coefficients, learning rates, and gains at different timescales.
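For reference, and only as a sketch in the notation commonly used for these methods (state x, action a, sampled next state y_k, reward r(x,a), step size \alpha_k, and discount factor \gamma), the standard Q-learning update of [1] and the Speedy Q-learning update of [3] take roughly the following forms:

\[ Q_{k+1}(x,a) = Q_k(x,a) + \alpha_k \left( r(x,a) + \gamma \max_{b} Q_k(y_k,b) - Q_k(x,a) \right), \]

\[ Q_{k+1}(x,a) = Q_k(x,a) + \alpha_k \left( \mathcal{T}_k Q_{k-1}(x,a) - Q_k(x,a) \right) + (1-\alpha_k) \left( \mathcal{T}_k Q_k(x,a) - \mathcal{T}_k Q_{k-1}(x,a) \right), \]

where \( \mathcal{T}_k Q(x,a) = r(x,a) + \gamma \max_{b} Q(y_k,b) \) denotes the empirical Bellman operator and, in [3], \( \alpha_k = 1/(k+1) \). The second update is the one that combines the two consecutive estimates \( Q_{k-1} \) and \( Q_k \) mentioned above.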
Some methods are based on perturbing and injecting noise into the action space [7] and the parameter space [8]. Perturbing the target action with noise smooths biased Q-value estimates and prevents overfitting to specific actions. This encourages exploration of different actions, reduces overestimation bias, and improves generalization, and the improved exploration in turn leads to faster convergence. An alternative approach is to inject noise directly into the agent's parameters, promoting more consistent exploration and a broader range of behaviors.
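As a minimal, illustrative sketch of these two styles of perturbation (it is not the specific formulation of [7] or [8]; the linear Q-function, noise scales, and function names below are assumptions made only for this example), noise can be added either to the Q-values used to pick the (target) action or directly to the parameters used for acting:

import numpy as np

rng = np.random.default_rng(0)

def noisy_action(q_values, sigma=0.1):
    # Action-space-style perturbation: add noise to the Q-value estimates
    # before the argmax, so the selected (target) action is smoothed.
    return int(np.argmax(q_values + rng.normal(0.0, sigma, size=q_values.shape)))

def perturb_parameters(theta, sigma=0.02):
    # Parameter-space-style perturbation: perturb the parameters themselves
    # and act with the perturbed copy, giving temporally consistent exploration.
    return theta + rng.normal(0.0, sigma, size=theta.shape)

# Illustrative linear Q-function Q(s, a) = theta[a] . phi(s)
theta = rng.normal(size=(4, 8))   # 4 actions, 8 state features (assumed)
phi_s = rng.normal(size=8)        # feature vector of the current state

a_action_noise = noisy_action(theta @ phi_s)
a_param_noise = int(np.argmax(perturb_parameters(theta) @ phi_s))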
Regularized Q-learning [9] is another family of algorithms, which demonstrates how adding a suitable regularization term ensures the algorithm's convergence. Additionally, experimental results show that regularized Q-learning converges in environments where Q-learning with linear function approximation is known to diverge. This algorithm operates on a single timescale, resulting in faster convergence rates in experiments. In [10], regularization through learned Fourier features is presented, prioritizing the learning of low-frequency functions and enhancing the learning process by reducing the network's susceptibility to noise during optimization, particularly during Bellman updates. Regularization can also guarantee generalization in Bayesian reinforcement learning [11], and by adding regularization the optimal policy becomes stable in a meaningful way, which ensures fast convergence rates for mirror descent in regularized Markov Decision Processes (MDPs) [12].
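As a generic illustration of how such a term enters the learning objective (this is not the specific scheme of [9], [10], or [12]; the squared empirical Bellman error and the penalty weight \lambda are assumptions made for the example), the parameters \theta of a Q-function can be fit by minimizing

\[ \min_{\theta} \; \sum_{(x,a,r,y)} \left( r + \gamma \max_{b} Q_{\theta^{-}}(y,b) - Q_{\theta}(x,a) \right)^{2} + \lambda \lVert \theta \rVert_2^2, \]

where \( \lambda > 0 \) sets the strength of the regularization and \( \theta^{-} \) denotes fixed target parameters.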
In this paper, the goal is to achieve a faster convergence rate for Q-learning compared to other existing algorithms. The proposed algorithm involves selective periodic signal injection, where regular and periodic perturbation of the Q-values, along with improved exploration, reduces learning time and increases the convergence rate. The proposed algorithm has been applied to a benchmark problem, a 9x9 maze (Fig. 1)¹.

In Fig. 1, the agent starts moving from the coordinates (2, 4) and must reach the yellow target at the coordinates (8, 8). The results of applying the proposed algorithm and comparing it with those of the other algorithms are shown in Section IV.
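The precise selection rule and injected signal of the proposed algorithm are given later in the paper; the following is only a rough sketch of the general idea of layering a periodic perturbation of selected Q-values on top of the standard tabular update in a 9x9 grid. The open grid (no walls), step sizes, injection period, and amplitude below are all assumptions made for this example, not the paper's settings.

import numpy as np

rng = np.random.default_rng(1)

N, n_actions = 9, 4                   # 9x9 grid; actions: up, down, left, right
alpha, gamma, eps = 0.5, 0.9, 0.1     # illustrative values only
period, amplitude = 20, 0.5           # illustrative injection period and size
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
start, goal = (1, 3), (7, 7)          # the paper's (2, 4) and (8, 8), 0-based

Q = np.zeros((N, N, n_actions))

def step(s, a):
    # Open grid without walls, used only for this sketch.
    r_, c_ = s[0] + moves[a][0], s[1] + moves[a][1]
    s2 = (min(max(r_, 0), N - 1), min(max(c_, 0), N - 1))
    return s2, (1.0 if s2 == goal else 0.0), s2 == goal

for episode in range(200):
    s = start
    for t in range(1, 501):           # cap episode length for the sketch
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Standard tabular Q-learning update.
        Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
        # Illustrative "signal injection": every `period` steps, add a small
        # perturbation to the Q-values of the visited state.
        if t % period == 0:
            Q[s] += amplitude * rng.standard_normal(n_actions)
        s = s2
        if done:
            break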
¹ Applying the standard Q-learning algorithm to a 9x9 maze is available as open-source MATLAB code online, which is used in this paper.

II. PRELIMINARIES

Reinforcement learning is a machine learning method that operates without prior knowledge, relying on feedback from the environment. Through continuous interaction and trial and
Fig. 5. Reward plots of 100 trials over iterations (Algorithm 2).

Fig. 6. Histogram plots of iterations over 100 trials (Algorithm 1).

Fig. 7. Reward plots of 100 trials over iterations (Algorithm 1).

iteration value is 1518, which is approximately five times the corresponding value for the proposed algorithm. Also, in Fig. 5, in the worst case, a reward of 1 is received after approximately 650 iterations. For this algorithm, the mean of the histogram values over the trials is 74.95.

Finally, the results obtained from applying standard Q-learning are shown in Figs. 6 and 7. In Fig. 6, the maximum number of iterations is 3726, and in Fig. 7, a reward of 1 is received after at most 1500 iterations, which is significantly more than for the other two algorithms. Also, the mean of the histogram values is 127.47.

According to the results, the proposed algorithm can achieve convergence in fewer iterations over the trials. As

V. CONCLUSION

The convergence rate is one of the most important criteria in reinforcement learning, and many papers have proposed techniques and algorithms to increase this rate and reduce the learning time. Most of this research focuses on changing the learning rate, injecting noise into the action and parameter spaces, and regularization. In this paper, an algorithm based on the selective injection of a signal into the Q-values was presented. The results in Section IV demonstrate that, by choosing appropriate algorithm parameters, the convergence rate increases in comparison to the other algorithms. One of the important and challenging topics in reinforcement learning is non-stationarity, and it would be interesting to take this into account in the proposed algorithm. Additionally, considering other scenarios would allow the strengths and weaknesses of the algorithm to be examined.

REFERENCES

[1] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[2] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Advances in Neural Information Processing Systems, Denver, CO, USA, pp. 703-710, Dec. 1994.
[3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen, "Speedy Q-learning," in Advances in Neural Information Processing Systems, vol. 24, pp. 2411-2419, 2011.
[4] Y. Cao and X. Fang, "Optimized-weighted-speedy Q-learning algorithm for multi-UGV in static environment path planning under anti-collision cooperation mechanism," Mathematics, vol. 11, pp. 1-28, May 2023.
[5] I. John, C. Kamanchi, and S. Bhatnagar, "Generalized speedy Q-learning," IEEE Control Systems Letters, vol. 4, no. 3, pp. 524-529, July 2020.
[6] A. M. Devraj and S. P. Meyn, "Zap Q-learning," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 2232-2241, 2017.
[7] Y. Zhang, J. Liu, C. Li, Y. Niu, Y. Yang, Y. Liu, and W. Ouyang, "A perspective of Q-value estimation on offline-to-online reinforcement learning," in The 38th AAAI Conference on Artificial Intelligence (AAAI-24), 2024.
[8] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," arXiv:1706.01905, 2017.
[9] H.-D. Lim, D. W. Kim, and D. Lee, "Regularized Q-learning," arXiv:2202.05404, 2022.
[10] A. Li and D. Pathak, "Functional regularization for reinforcement learning via learned Fourier features," in Advances in Neural Information Processing Systems, vol. 34, pp. 19046-19055, 2021.
[11] A. Tamar, D. Soudry, and E. Zisselman, "Regularization guarantees generalization in Bayesian reinforcement learning through algorithmic stability," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, pp. 8423-8431, 2022.
[12] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, "Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence," arXiv:2105.11066, 2021.
[13] C. Szepesvari, "The asymptotic convergence-rate of Q-learning," in Advances in Neural Information Processing Systems, vol. 10, pp. 1064-1070, 1998.