
Enhancing Q-learning Speed Using Selective Signal Injection

Sadredin Hokmi
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]

Mohammad Haeri
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]

Abstract— Reinforcement learning algorithms, especially model-free algorithms like Q-learning, have shown reliable results in finding optimal solutions for many real-time applications. However, challenges such as exploration in real time and the convergence rate need to be addressed, and many studies have proposed algorithms to tackle these challenges. Algorithms like Speedy Q-learning, Zap Q-learning, algorithms based on adding a regularization term, and many others have been introduced. In this paper, an algorithm based on signal injection is presented, which, according to the numerical results, can significantly reduce unnecessary explorations that are risky in real time. Additionally, the proposed algorithm is not dependent on the learning rate (α), the discount factor (γ), or changes in coefficients. However, for it to be effective and increase the convergence rate, the parameters of the algorithm should be chosen within the correct range. The results of applying the proposed algorithm have been compared with two reliable algorithms, speedy Q-learning and standard Q-learning, in a 9x9 maze in which the agent must reach the target in fewer iterations.

Keywords— Q-learning, Exploration, Signal injection, Convergence rate, Maze.

I. INTRODUCTION

Reinforcement Learning (RL) is a framework in which an agent, interacting with a dynamic environment, learns the optimal sequence of actions, or policy, to achieve a specified goal. This interaction is typically modeled as an infinite-horizon Markov decision process. One of the most popular and promising reinforcement learning algorithms is Q-learning, presented in [1]. It is a widely recognized model-free RL algorithm designed to estimate the optimal action-value function. It combines aspects of dynamic programming, specifically the value iteration algorithm, with stochastic approximation techniques. In finite state-action problems, Q-learning has been proven to converge to the optimal action-value function [2]. However, it faces challenges with slow convergence. Speedy Q-learning (SQL), introduced in [3], was developed to overcome the slow convergence issue. In each iteration, SQL utilizes two consecutive estimates of the Q-function along with a rapid learning rate in its update mechanism. In fact, by adjusting the learning rate over two consecutive estimates, it achieves faster convergence. This approach allows SQL to converge more rapidly and provides a stronger finite-time performance bound compared to standard Q-learning. Optimized-weighted-speedy Q-learning [4] and generalized speedy Q-learning [5] have been introduced in this regard to reduce learning time and increase the convergence rate. Zap Q-learning [6] is another method developed to enhance Watkins' original Q-learning algorithm. It is a matrix-gain algorithm specifically designed to minimize the asymptotic variance. Numerical experiments using this approach have demonstrated fast convergence, even in non-ideal scenarios. The mentioned algorithms are mostly based on modifying coefficients, learning rates, and gains at different timescales.

Some methods are based on perturbing and injecting noise into the action space [7] and the parameter space [8]. Perturbing the target action with noise smooths biased Q-value estimates and prevents overfitting to specific actions. This encourages exploration of different actions, reduces overestimation bias, and improves generalization. Exploration with different actions improves the quality of exploration, leading to faster convergence. An alternative approach is to inject noise directly into the agent's parameters, promoting more consistent exploration and a broader range of behaviors.

Regularized Q-learning [9] is another class of algorithms, demonstrating how adding a suitable regularization term ensures the algorithm's convergence. Additionally, experimental results show that regularized Q-learning converges in environments where Q-learning with linear function approximation is known to diverge. This algorithm operates on a single time scale, resulting in faster convergence rates in experimental results. In [10], regularization through learned Fourier features is presented, prioritizing the learning of low-frequency functions and enhancing the learning process by reducing the network's susceptibility to noise during optimization, particularly during Bellman updates. Regularization can also guarantee generalization in Bayesian reinforcement learning [11]. Moreover, adding regularization makes the optimal policy stable in a meaningful way and ensures fast convergence rates for mirror descent in regularized Markov Decision Processes (MDPs) [12].

In this paper, the goal is to achieve a faster convergence rate for Q-learning compared to other existing algorithms. The proposed algorithm involves selective periodic signal injection, where regular and periodic perturbation of the Q-values, along with improved exploration, reduces learning time and increases the convergence rate. The proposed algorithm has been applied to a benchmark problem, a 9x9 maze (Fig. 1)¹. In Fig. 1, the agent starts moving from the coordinates (2, 4) and must reach the yellow target at the coordinates (8, 8). The results of applying the proposed algorithm and comparing it with the results of other algorithms are shown in Section IV.
¹ An open-source MATLAB implementation of the standard Q-learning algorithm applied to a 9x9 maze is available online and is used in this paper.



Fig. 1. A visualization of a 9x9 maze.

II. PRELIMINARIES

Reinforcement learning is a machine learning method that operates without prior knowledge, relying on feedback from the environment. Through continuous interaction and trial and error, it aims to achieve a specific goal or maximize overall rewards. Unlike supervised learning, it does not require labeled data but instead evaluates each action based on rewards or punishments from the environment. In reinforcement learning, agents solve sequential decision problems typically modeled by an MDP. The MDP is defined by ⟨S, A, P, R⟩, where S is the set of states, A is the set of actions, R is the reward function, and P is the state transition probability. After taking action A_t in state S_t, the agent receives a reward R_{t+1} = R(s_t, a_t), and the next state S_{t+1} follows the transition probability P(S_{t+1}, R_{t+1} | S_t, A_t). During learning, the agent aims to maximize the cumulative expected discounted return

  E[ Σ_{t≥0} γ^t R_{t+1} ],  (1)

where γ ∈ [0, 1) is a discount factor. The agent follows a policy π(a|s), which gives the probability of selecting each possible action from a given state; it denotes the probability that the agent in state s chooses action a. The state-value function V^π(s) represents the expected discounted return if the agent follows the policy π, given by

  V^π(s) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | s_t = s ].  (2)

Likewise, the state-action value function Q^π(s, a) quantifies the expected return associated with beginning in state s and performing action a. This function can be represented by the following relation.

  Q^π(s, a) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | s_t = s, a_t = a ]  (3)

The optimal state-action value function is denoted as Q*(s, a) = max_π Q^π(s, a). Additionally, V^π(s) is recursively expressed, known as the Bellman equation, as follows.

  V^π(s) = E_π[ R_{t+1} + γ V^π(s_{t+1}) | s_t = s ]  (4)

Reinforcement learning algorithms employ these equations to iteratively update the agent's policy. The following provides a brief explanation of two algorithms, Q-learning and Speedy Q-learning, which use the state-action value function to refine the strategy based on received feedback, ultimately aiming to achieve the optimal strategy.
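To make these definitions concrete, the short Python sketch below (an illustration added here, not part of the original paper; the array shapes and names are assumptions) evaluates the discounted return of (1) for a finite reward sequence and extracts a greedy policy from a tabular Q-function.

```python
import numpy as np

def discounted_return(rewards, gamma=0.5):
    """Finite-horizon evaluation of the discounted return in (1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_policy(Q):
    """Greedy policy derived from a tabular Q of shape (n_states, n_actions)."""
    return np.argmax(Q, axis=1)

# Example: rewards 0, 0, 1 with gamma = 0.5 give a return of 0.25.
print(discounted_return([0, 0, 1]))        # 0.25
print(greedy_policy(np.zeros((81, 4))))    # all zeros for an all-zero Q-table
```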
A. Q-learning

The Q-learning algorithm, introduced as a solution for reinforcement learning tasks [1], employs a state-action value function Q(s, a). The core concept of the algorithm is that the agent continually explores an unknown environment. The process for updating the value function Q(s, a) in the Q-learning algorithm is outlined as

  Q(s, a) ← Q(s, a) + α[ R + γ max_{a'} Q(s', a') − Q(s, a) ],  (5)

where s', a', and α denote the next state, the next action, and the learning rate, respectively. In state s, the agent selects action a using the epsilon-greedy strategy, executes it, and receives a reward R. The agent then moves to the next state s' and selects the action a' that maximizes Q(s', a') to update the value function. The action in the next state s' is selected again using the epsilon-greedy strategy after updating the value function. The Q-learning algorithm is as follows.

Algorithm 1 Q-learning
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0)
2: Repeat for each episode:
     Initialize the state s
3:   Repeat for each step of the episode:
       Derive π from Q
       Choose a from s using π derived from Q
       Take action a and observe R and s'
4:     Update the state-action value function:
       Q(s, a) ← Q(s, a) + α[ R + γ max_{a'} Q(s', a') − Q(s, a) ]
       s ← s'
     Until s is terminal
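A minimal tabular implementation of Algorithm 1 might look like the following Python sketch. It is illustrative only: the environment interface (reset/step) and the epsilon value are assumptions, while the learning-rate schedule α = 1/(i + 1) and γ = 0.5 follow the settings reported in Section IV.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=100, gamma=0.5, eps=0.1):
    """Tabular Q-learning (Algorithm 1) with an epsilon-greedy policy.

    env is assumed to expose reset() -> state and step(a) -> (next_state, reward, done);
    this interface and the value of eps are illustrative assumptions.
    """
    Q = np.zeros((n_states, n_actions))
    t = 0                                      # iteration (step) counter
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection derived from Q
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            alpha = 1.0 / (t + 1)              # learning-rate schedule reported in Section IV
            # Update rule (5)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s, t = s_next, t + 1
    return Q
```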
B. Speedy Q-learning

Although Q-learning is a foundational reinforcement learning algorithm, it has limitations. When the state-action space is finite, it can converge to the optimal strategy, but as the discount factor γ approaches 1, its convergence slows down significantly [13]. To address this, speedy Q-learning was proposed in 2011 [3], which accelerates convergence by adopting a more rapid learning rate. This larger learning rate replaces the standard rate in Q-learning, speeding up updates. By using the current Q value instead of the historical Q value, speedy Q-learning selects a more efficient estimator, improving the algorithm's convergence rate [3].

Here, Q_{k−1}(s', a') represents the maximum value function of the next state, taken from when the Q-value of the state-action pair (s, a) was updated in the previous iteration. The update of the speedy Q-learning algorithm's Q value is given by

  Q_{k+1}(s, a) = (1 − α_k) Q_k(s, a) + α_k [ R + γ Q_{k−1}(s', a') ] + (1 − α_k) γ [ max_{a'} Q_k(s', a') − Q_{k−1}(s', a') ].  (6)

The speedy Q-learning algorithm is as follows.

Algorithm 2 Speedy Q-learning
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0)
2: for i = 1 to episodes
     Initialize the state s
3:   Repeat:
       Use a policy (e.g., epsilon-greedy) based on Q
       Perform action a to receive an immediate reward R and transition to the next state s'
4:     Update the state-action value function:
       Q_{k+1}(s, a) = (1 − α_k) Q_k(s, a) + α_k [ R + γ Q_{k−1}(s', a') ] + (1 − α_k) γ [ max_{a'} Q_k(s', a') − Q_{k−1}(s', a') ]
5:     s ← s'
     Until s is terminal
   End for
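For comparison with the Q-learning sketch above, one straightforward reading of the speedy Q-learning update (6) in Python is given below; keeping a copy of the previous iterate Q_prev is an implementation assumption, not something prescribed by the paper.

```python
import numpy as np

def speedy_q_update(Q, Q_prev, s, a, r, s_next, alpha, gamma=0.5):
    """One speedy Q-learning update, Eq. (6): blends targets built from the
    current iterate Q and the previous iterate Q_prev with a large (1 - alpha) gain."""
    v_prev = np.max(Q_prev[s_next])            # max_a' Q_{k-1}(s', a')
    v_curr = np.max(Q[s_next])                 # max_a' Q_k(s', a')
    return ((1.0 - alpha) * Q[s, a]
            + alpha * (r + gamma * v_prev)
            + (1.0 - alpha) * gamma * (v_curr - v_prev))

# Usage: keep the previous Q-table, e.g.
#   Q_prev = Q.copy()
#   Q[s, a] = speedy_q_update(Q, Q_prev, s, a, r, s_next, alpha)
# and roll Q_prev forward after each iteration.
```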
As mentioned in Section I, various methods have been proposed to increase the convergence rate. In this section, two common and reliable algorithms, Q-learning and speedy Q-learning, have been briefly explained for comparison with the proposed algorithm in Section III. The results obtained from applying these algorithms are presented in Section IV.

III. THE PROPOSED ALGORITHM

In the proposed algorithm, adding alternating, regular values based on the current Q value and continuously perturbing the Q values help to reduce unnecessary explorations and increase the convergence rate. Within the algorithm, based on the Q value in each iteration, a range from Q − ε₂ to Q + ε₁ is selected and divided into m subintervals (the value of m is typically chosen as 2, and the value of r determines the precision of the Q value assigned in that iteration). Then, the midpoints of the last subinterval and of the first subinterval are determined in each iteration, and based on the periodicity T, one of these two values is selected. The criterion for executing this algorithm is the convergence of the Q values: after convergence, the standard Q-learning algorithm (Algorithm 1) is executed. Furthermore, r can be decreased if the Q values move toward convergence during the algorithm execution (the Q values become progressively closer), and r can be increased if the convergence slows down. In short, the basis of the algorithm is the selective injection of a signal into the Q values in the manner described. The pseudocode of the proposed algorithm is as follows (⌊a⌋ denotes the floor of a).

Algorithm 3 Proposed algorithm
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0) and the parameters ε₁, ε₂, m, r, and T
   n = 1
2: for i = 1 to episodes
     Initialize the state s
3:   Repeat for each step of the episode:
       Derive a strategy π (e.g., epsilon-greedy) based on Q
       Choose a from s using π derived from Q
       Take action a and observe R and s'
4:     Update the state-action value function with the selected injection:
       Q(s, a) ← [ Q(s, a) + α(R + γ max_{a'} Q(s', a') − Q(s, a)) + Δ⁺ ] c_n + [ Q(s, a) + α(R + γ max_{a'} Q(s', a') − Q(s, a)) − Δ⁻ ] (1 − c_n),
       where Δ⁺ and Δ⁻ are the offsets, expressed through ε₁, ε₂, m, and r, that place the update at the midpoint of the last and the first subinterval, respectively, and c_n ∈ {0, 1} switches between the two according to the periodicity T
       n ← n + ⌊i/T⌋
       s ← s'
     Until s is terminal

Remark 1. For a smooth transition with the periodicity T, instead of a hard switch, the two perturbed values can be blended, i.e.,

  Q(s, a) ← (Q(s, a) + Δ⁺) w(i) + (Q(s, a) − Δ⁻)(1 − w(i)),  (7)

where w(i) is a sine-based weight with period T.

In Section IV, the results of applying the three algorithms described in Sections II and III are presented.
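The following sketch illustrates the injection mechanism of Algorithm 3 under explicit assumptions: the interval [Q − ε₂, Q + ε₁] is split into m**r subintervals, and a hard period-T alternation (or, for Remark 1, a sinusoidal weight) selects between the midpoints of the last and the first subinterval. These readings fill in details that could not be recovered from the typeset formula, so the code is an interpretation rather than the authors' implementation.

```python
import numpy as np

def injected_q_target(q_std, eps1=0.1, eps2=0.0, m=2, r=4, i=0, T=51):
    """Selective signal injection around a standard Q-learning target q_std.

    The interval [q_std - eps2, q_std + eps1] is split into m**r subintervals
    (assumption: the precision parameter r is read as m**r pieces). The update
    alternates, with period T, between the midpoints of the last and first subinterval.
    """
    width = (eps1 + eps2) / (m ** r)            # width of one subinterval
    upper_mid = q_std + eps1 - width / 2.0      # midpoint of the last subinterval
    lower_mid = q_std - eps2 + width / 2.0      # midpoint of the first subinterval
    use_upper = (i // T) % 2 == 0               # hard period-T switch (assumed rule)
    return upper_mid if use_upper else lower_mid

def injected_q_target_smooth(q_std, eps1=0.1, eps2=0.0, m=2, r=4, i=0, T=51):
    """Remark 1: blend the two midpoints with a sinusoidal weight instead of a hard switch."""
    width = (eps1 + eps2) / (m ** r)
    upper_mid = q_std + eps1 - width / 2.0
    lower_mid = q_std - eps2 + width / 2.0
    w = 0.5 * (1.0 + np.sin(2.0 * np.pi * i / T))   # assumed smooth weight in [0, 1]
    return w * upper_mid + (1.0 - w) * lower_mid
```

Under this reading and the Section IV settings (ε₁ = 0.1, ε₂ = 0, m = 2, r = 4), both candidate values lie slightly above the standard target, so the injection alternates between a larger and a smaller optimistic shift.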
IV. SIMULATION

In this section, the results of the proposed algorithm are examined and compared with the results of the other two algorithms. As briefly mentioned in Section I, the goal of the agent in Fig. 1 is to reach the target at the coordinates (8, 8). If an algorithm can achieve this result with fewer trials, it has better performance. Additionally, if it reaches the goal in each trial with fewer iterations and less exploration, it also demonstrates better efficiency. The results are plotted in two kinds of figures: reward over the number of iterations, and the number of iterations over the number of trials. The reward assignment is as follows: bumping into a wall incurs a penalty of -1, reaching the path to the target yields a reward of 1, and otherwise the reward is 0. The learning rate α is 1/(i + 1), where i is the number of iterations, and γ is 0.5. ε₁, ε₂, T, and the initial r are chosen as 0.1, 0, 51, and 4, respectively.
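A compact sketch of the reward scheme and the trial loop used for the comparison could look like the following Python; the wall encoding, the simplification of "reaching the path to the target" to reaching the target cell, and the train_fn interface are assumptions.

```python
import numpy as np

GAMMA = 0.5
EPS1, EPS2, PERIOD_T, INIT_R = 0.1, 0.0, 51, 4   # parameters reported above

def maze_reward(next_cell, walls, target=(8, 8)):
    """Reward scheme of Section IV: -1 for bumping into a wall, 1 on reaching the
    target (simplifying 'reaching the path to the target'), and 0 otherwise."""
    if next_cell in walls:
        return -1
    if next_cell == target:
        return 1
    return 0

def run_trials(train_fn, n_trials=100):
    """Run 100 independent trials and collect iterations-to-target per trial,
    the quantity shown in the histogram figures and in Table I."""
    iterations = [train_fn(seed=k) for k in range(n_trials)]
    return np.max(iterations), np.mean(iterations)
```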
Fig. 2. Histogram plots of iterations over 100 trials (Algorithm 3).

Fig. 3. Reward plots of 100 trials over iterations (Algorithm 3).

The results of applying the proposed algorithm are shown in Figs. 2 and 3. According to Fig. 2, the agent is able to learn the path to the target with the fewest possible iterations and minimal extra exploration after only 10 initial trials. The mean of the histogram values over the trials is 39.64. The maximum number of iterations observed in Fig. 2 is 286, which is compared with the results of the other algorithms in the following. Additionally, Fig. 3 shows the changes in reward over the iterations for each of the agent's 100 attempts to reach the target; in the worst case, a reward of 1 was received after 125 iterations.

Fig. 4. Histogram plots of iterations over 100 trials (Algorithm 2).

Fig. 5. Reward plots of 100 trials over iterations (Algorithm 2).

The results obtained from applying the speedy Q-learning algorithm are shown in Figs. 4 and 5. In Fig. 4, the maximum iteration value is 1518, which is approximately five times that of the proposed algorithm. Also, in Fig. 5, a reward of 1 is received, in the worst case, after approximately 650 iterations. For this algorithm, the mean of the histogram values over the trials is 74.95.

Fig. 6. Histogram plots of iterations over 100 trials (Algorithm 1).

Fig. 7. Reward plots of 100 trials over iterations (Algorithm 1).

Finally, the results obtained from applying Q-learning are shown in Figs. 6 and 7. In Fig. 6, the maximum number of iterations is 3726, and in Fig. 7, a reward of 1 is received at most after 1500 iterations, which is significantly more than for the other two algorithms. Also, the mean of the histogram values is 127.47.

According to the results, the proposed algorithm can achieve convergence in fewer iterations over the trials. As mentioned in Section I, most fast learning algorithms are based on adjusting coefficients and the learning rate across different timescales. Algorithms similar to the proposed one, in that they do not require changing coefficients and can essentially be injected or added, can contribute to generalization. The numerical results of the simulation over 100 trials are summarized in Table I.

TABLE I. NUMERICAL RESULTS IN 100 TRIALS

Algorithm            Maximum number of iterations   Mean value of iterations   Execution time (seconds)
Q-learning           3726                           127.47                     115
Speedy Q-learning    1518                           74.95                      94
Proposed algorithm   286                            39.64                      45
V. CONCLUSION

The convergence rate is one of the most important criteria in reinforcement learning, and many papers have proposed techniques and algorithms to increase this rate and reduce the learning time. Most of these studies focus on changing the learning rate, injecting noise into the action and parameter spaces, and regularization. In this paper, an algorithm based on the selective injection of a signal into the Q values is presented. The results in Section IV demonstrate that, by choosing appropriate algorithm parameters, the convergence rate increases in comparison to the other algorithms. One of the important and challenging topics in reinforcement learning is non-stationarity, and it would be interesting to consider this criterion in the proposed algorithm. Additionally, by taking other scenarios into account, the strengths and weaknesses of the algorithm can be examined further.
REFERENCES

[1] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[2] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Advances in Neural Information Processing Systems, Denver, CO, USA, pp. 703-710, Dec. 1994.
[3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen, "Speedy Q-learning," in Advances in Neural Information Processing Systems, vol. 24, pp. 2411-2419, 2011.
[4] Y. Cao and X. Fang, "Optimized-weighted-speedy Q-learning algorithm for multi-UGV in static environment path planning under anti-collision cooperation mechanism," Mathematics, vol. 11, pp. 1-28, May 2023.
[5] I. John, C. Kamanchi, and S. Bhatnagar, "Generalized speedy Q-learning," IEEE Control Systems Letters, vol. 4, no. 3, pp. 524-529, July 2020.
[6] A. M. Devraj and S. P. Meyn, "Zap Q-learning," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 2232-2241, 2017.
[7] Y. Zhang, J. Liu, C. Li, Y. Niu, Y. Yang, Y. Liu, and W. Ouyang, "A perspective of Q-value estimation on offline-to-online reinforcement learning," in The 38th AAAI Conference on Artificial Intelligence (AAAI-24), 2024.
[8] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," arXiv:1706.01905, 2017.
[9] H.-D. Lim, D. W. Kim, and D. Lee, "Regularized Q-learning," arXiv:2202.05404, 2022.
[10] A. Li and D. Pathak, "Functional regularization for reinforcement learning via learned Fourier features," in Advances in Neural Information Processing Systems, vol. 34, pp. 19046-19055, 2021.
[11] A. Tamar, D. Soudry, and E. Zisselman, "Regularization guarantees generalization in Bayesian reinforcement learning through algorithmic stability," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, pp. 8423-8431, 2022.
[12] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, "Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence," arXiv:2105.11066, 2021.
[13] C. Szepesvari, "The asymptotic convergence-rate of Q-learning," in Advances in Neural Information Processing Systems, vol. 10, pp. 1064-1070, 1998.
