
Enhancing Q-learning Speed Using Selective Signal Injection

Sadredin Hokmi
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]

Mohammad Haeri
Department of Electrical Engineering
Sharif University of Technology
Tehran, Iran
[email protected]

Abstract— Reinforcement learning algorithms, especially model-free algorithms like Q-learning, have shown reliable results in finding optimal solutions for many real-time applications. However, challenges such as exploration in real time and the convergence rate need to be addressed, and many studies have proposed algorithms to tackle these challenges. Algorithms like Speedy Q-learning, Zap Q-learning, algorithms based on adding a regularization term, and many others have been introduced. In this paper, an algorithm based on signal injection is presented, which, according to the numerical results, can significantly reduce unnecessary explorations that are risky in real time. Additionally, the proposed algorithm is not dependent on the learning rate (α), the discount factor (γ), or changes in coefficients. However, for it to be effective and increase the convergence rate, the parameters of the algorithm should be chosen within the correct range. The results of applying the proposed algorithm have been compared with two reliable algorithms, speedy Q-learning and standard Q-learning, in a 9x9 maze in which the agent must reach the target in fewer iterations.

Keywords— Q-learning, Exploration, Signal injection, Convergence rate, Maze.

I. INTRODUCTION

Reinforcement Learning (RL) is a framework in which an agent, interacting with a dynamic environment, learns the optimal sequence of actions, or policy, to achieve a specified goal. This interaction is typically modeled as an infinite-horizon Markov decision process. One of the most popular and promising reinforcement learning algorithms is Q-learning, presented in [1]. It is a widely recognized model-free RL algorithm designed to estimate the optimal action-value function. It combines aspects of dynamic programming, specifically the value iteration algorithm, with stochastic approximation techniques. In finite state-action problems, Q-learning has been proven to converge to the optimal action-value function [2]. However, it faces challenges with slow convergence. Speedy Q-learning (SQL), introduced in [3], was developed to overcome the slow convergence issue. In each iteration, SQL utilizes two consecutive estimates of the Q-function along with a rapid learning rate in its update mechanism. In fact, by adjusting the learning rate over two consecutive estimates, it achieves faster convergence. This approach allows SQL to converge more rapidly and provides a stronger finite-time performance bound compared to standard Q-learning. Optimized-weighted-speedy Q-learning [4] and generalized speedy Q-learning [5] have been introduced in this regard to reduce learning time and increase the convergence rate. Zap Q-learning [6] is another method developed to enhance Watkins' original Q-learning algorithm. It is a matrix-gain algorithm specifically designed to minimize the asymptotic variance. Numerical experiments using this approach have demonstrated fast convergence, even in non-ideal scenarios. The mentioned algorithms are mostly based on modifying coefficients, learning rates, and gains at different timescales.

Some methods are based on perturbing and injecting noise into the action space [7] and the parameter space [8]. Perturbing the target action with noise smooths biased Q-value estimates and prevents overfitting to specific actions. This encourages exploration of different actions, reduces overestimation bias, and improves generalization. Exploration with different actions improves the quality of exploration, leading to faster convergence. An alternative approach is to inject noise directly into the agent's parameters, promoting more consistent exploration and a broader range of behaviors.

Regularized Q-learning [9] is another class of algorithms, demonstrating how adding a suitable regularization term ensures the algorithm's convergence. Additionally, experimental results show that regularized Q-learning converges in environments where Q-learning with linear function approximation is known to diverge. This algorithm operates on a single time scale, resulting in faster convergence rates in experimental results. In [10], regularization through learned Fourier features is presented, prioritizing the learning of low-frequency functions and enhancing the learning process by reducing the network's susceptibility to noise during optimization, particularly during Bellman updates. Regularization can also guarantee generalization in Bayesian reinforcement learning [11]. Moreover, adding regularization makes the optimal policy stable in a meaningful way and ensures fast convergence rates for mirror descent in regularized Markov Decision Processes (MDPs) [12].

In this paper, the goal is to achieve a faster convergence rate for Q-learning compared to other existing algorithms. The proposed algorithm involves selective periodic signal injection, where regular and periodic perturbation of the Q-values, along with improved exploration, reduces learning time and increases the convergence rate. The proposed algorithm has been applied to a benchmark problem, a 9x9 maze (Fig. 1)¹. In Fig. 1, the agent starts moving from the coordinates (2, 4) and must reach the yellow target at the coordinates (8, 8). The results of applying the proposed algorithm and comparing it with the results of other algorithms are shown in Section IV.
¹ An open-source MATLAB implementation of the standard Q-learning algorithm applied to a 9x9 maze is available online and is used in this paper.



Fig. 1. A visualization of a 9x9 maze.

II. PRELIMINARIES

Reinforcement learning is a machine learning method that operates without prior knowledge, relying on feedback from the environment. Through continuous interaction and trial and error, it aims to achieve a specific goal or maximize overall rewards. Unlike supervised learning, it does not require labeled data but instead evaluates each action based on rewards or punishments from the environment. In reinforcement learning, agents solve sequential decision problems typically modeled by an MDP. The MDP is defined by ⟨S, A, P, R⟩, where S is the set of states, A is the set of actions, R is the reward function, and P is the state transition probability. After taking action A_t in state S_t, the agent receives a reward R_{t+1} = R(s_t, a_t), and the next state S_{t+1} follows the transition probability P(S_{t+1}, R_{t+1} | S_t, A_t). During learning, the agent aims to maximize the cumulative expected discounted return

  E[ Σ_{t≥0} γ^t R_{t+1} ],  (1)

where γ ∈ [0, 1) is a discount factor. The agent follows a policy π(a|s), which gives the probability of selecting each possible action from a given state; it denotes the probability that the agent in state s chooses action a. The state-value function V^π(s) represents the expected discounted return if the agent follows the policy π, given by

  V^π(s) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | s_t = s ].  (2)

Likewise, the state-action value function Q^π(s, a) quantifies the expected return associated with beginning in state s and performing action a. This function can be represented by the following relation.

  Q^π(s, a) = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | s_t = s, a_t = a ]  (3)

The optimal state-action value function is denoted as Q*(s, a) = max_π Q^π(s, a). Additionally, V^π(s) is recursively expressed, known as the Bellman equation, as follows.

  V^π(s) = E_π[ R_{t+1} + γ V^π(s_{t+1}) | s_t = s ]  (4)

Reinforcement learning algorithms employ these equations to iteratively update the agent's policy. The following provides a brief explanation of two algorithms, Q-learning and Speedy Q-learning, which use the state-action value function to refine the strategy based on received feedback, ultimately aiming to achieve the optimal strategy.
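To make these definitions concrete, the short Python sketch below (an illustration added here, not part of the original paper; the array shapes and names are assumptions) evaluates the discounted return of (1) for a finite reward sequence and extracts a greedy policy from a tabular Q-function.

```python
import numpy as np

def discounted_return(rewards, gamma=0.5):
    """Finite-horizon evaluation of the discounted return in (1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_policy(Q):
    """Greedy policy derived from a tabular Q of shape (n_states, n_actions)."""
    return np.argmax(Q, axis=1)

# Example: rewards 0, 0, 1 with gamma = 0.5 give a return of 0.25.
print(discounted_return([0, 0, 1]))        # 0.25
print(greedy_policy(np.zeros((81, 4))))    # all zeros for an all-zero Q-table
```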
A. Q-learning

The Q-learning algorithm, introduced as a solution for reinforcement learning tasks [1], employs a state-action value function Q(s, a). The core concept of the algorithm is that the agent continually explores an unknown environment. The process for updating the value function Q(s, a) in the Q-learning algorithm is outlined as

  Q(s, a) ← Q(s, a) + α[ R + γ max_{a'} Q(s', a') − Q(s, a) ],  (5)

where s', a', and α denote the next state, the next action, and the learning rate, respectively. In state s, the agent selects action a using the epsilon-greedy strategy, executes it, and receives a reward R. The agent then moves to the next state s' and selects the action a' that maximizes Q(s', a') to update the value function. The action in the next state s' is selected again using the epsilon-greedy strategy after updating the value function. The Q-learning algorithm is as follows.

Algorithm 1 Q-learning
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0)
2: Repeat for each episode:
     Initialize the state s
3:   Repeat for each step of the episode:
       Derive π from Q
       Choose a from s using π derived from Q
       Take action a and observe R and s'
4:     Update the state-action value function:
       Q(s, a) ← Q(s, a) + α[ R + γ max_{a'} Q(s', a') − Q(s, a) ]
       s ← s'
     Until s is terminal
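A minimal tabular implementation of Algorithm 1 might look like the following Python sketch. It is illustrative only: the environment interface (reset/step) and the epsilon value are assumptions, while the learning-rate schedule α = 1/(i + 1) and γ = 0.5 follow the settings reported in Section IV.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=100, gamma=0.5, eps=0.1):
    """Tabular Q-learning (Algorithm 1) with an epsilon-greedy policy.

    env is assumed to expose reset() -> state and step(a) -> (next_state, reward, done);
    this interface and the value of eps are illustrative assumptions.
    """
    Q = np.zeros((n_states, n_actions))
    t = 0                                      # iteration (step) counter
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection derived from Q
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            alpha = 1.0 / (t + 1)              # learning-rate schedule reported in Section IV
            # Update rule (5)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s, t = s_next, t + 1
    return Q
```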
B. Speedy Q-learning

Although Q-learning is a foundational reinforcement learning algorithm, it has limitations. When the state-action space is finite, it can converge to the optimal strategy, but as the discount factor γ approaches 1, its convergence slows down significantly [13]. To address this, speedy Q-learning was proposed in 2011 [3], which accelerates convergence by adopting a more rapid learning rate. This larger learning rate replaces the standard rate in Q-learning, speeding up updates. By using the current Q value instead of the historical Q value, speedy Q-learning selects a more efficient estimator, improving the algorithm's convergence rate [3].

Here, Q_{k−1}(s', a') represents the maximum value function of the next state, taken from when the Q-value of the state-action pair (s, a) was updated in the previous iteration. The update of the speedy Q-learning algorithm's Q value is given by

  Q_{k+1}(s, a) = (1 − α_k) Q_k(s, a) + α_k [ R + γ Q_{k−1}(s', a') ] + (1 − α_k) γ [ max_{a'} Q_k(s', a') − Q_{k−1}(s', a') ].  (6)

The speedy Q-learning algorithm is as follows.

Algorithm 2 Speedy Q-learning
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0)
2: for i = 1 to episodes
     Initialize the state s
3:   Repeat:
       Use a policy (e.g., epsilon-greedy) based on Q
       Perform action a to receive an immediate reward R and transition to the next state s'
4:     Update the state-action value function:
       Q_{k+1}(s, a) = (1 − α_k) Q_k(s, a) + α_k [ R + γ Q_{k−1}(s', a') ] + (1 − α_k) γ [ max_{a'} Q_k(s', a') − Q_{k−1}(s', a') ]
5:     s ← s'
     Until s is terminal
   End for
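For comparison with the Q-learning sketch above, one straightforward reading of the speedy Q-learning update (6) in Python is given below; keeping a copy of the previous iterate Q_prev is an implementation assumption, not something prescribed by the paper.

```python
import numpy as np

def speedy_q_update(Q, Q_prev, s, a, r, s_next, alpha, gamma=0.5):
    """One speedy Q-learning update, Eq. (6): blends targets built from the
    current iterate Q and the previous iterate Q_prev with a large (1 - alpha) gain."""
    v_prev = np.max(Q_prev[s_next])            # max_a' Q_{k-1}(s', a')
    v_curr = np.max(Q[s_next])                 # max_a' Q_k(s', a')
    return ((1.0 - alpha) * Q[s, a]
            + alpha * (r + gamma * v_prev)
            + (1.0 - alpha) * gamma * (v_curr - v_prev))

# Usage: keep the previous Q-table, e.g.
#   Q_prev = Q.copy()
#   Q[s, a] = speedy_q_update(Q, Q_prev, s, a, r, s_next, alpha)
# and roll Q_prev forward after each iteration.
```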
As mentioned in Section I, various methods have been proposed to increase the convergence rate. In this section, two common and reliable algorithms, Q-learning and speedy Q-learning, have been briefly explained for comparison with the proposed algorithm in Section III. The results obtained from applying these algorithms are presented in Section IV.

III. THE PROPOSED ALGORITHM

In the proposed algorithm, adding alternating, regular values based on the current Q value and continuously perturbing the Q values help to reduce unnecessary explorations and increase the convergence rate. Within the algorithm, based on the Q value in each iteration, a range from Q − ε₂ to Q + ε₁ is selected and divided into m subintervals (the value of m is typically chosen as 2, and the value of r determines the precision of the Q value assigned in that iteration). Then, the midpoints of the last subinterval and of the first subinterval are determined in each iteration, and based on the periodicity T, one of these two values is selected. The criterion for executing this algorithm is the convergence of the Q values: after convergence, the standard Q-learning algorithm (Algorithm 1) is executed. Furthermore, r can be decreased if the Q values move toward convergence during the algorithm execution (the Q values become progressively closer), and r can be increased if the convergence slows down. In short, the basis of the algorithm is the selective injection of a signal into the Q values in the manner described. The pseudocode of the proposed algorithm is as follows (⌊a⌋ denotes the floor of a).

Algorithm 3 Proposed algorithm
Input: Set of states (S), set of actions (A), and reward function (R)
1: Initialize the action-value function Q arbitrarily (Q(s, a) = 0) and the parameters ε₁, ε₂, m, r, and T
   n = 1
2: for i = 1 to episodes
     Initialize the state s
3:   Repeat for each step of the episode:
       Derive a strategy π (e.g., epsilon-greedy) based on Q
       Choose a from s using π derived from Q
       Take action a and observe R and s'
4:     Update the state-action value function with the selected injection:
       Q(s, a) ← [ Q(s, a) + α(R + γ max_{a'} Q(s', a') − Q(s, a)) + Δ⁺ ] c_n + [ Q(s, a) + α(R + γ max_{a'} Q(s', a') − Q(s, a)) − Δ⁻ ] (1 − c_n),
       where Δ⁺ and Δ⁻ are the offsets, expressed through ε₁, ε₂, m, and r, that place the update at the midpoint of the last and the first subinterval, respectively, and c_n ∈ {0, 1} switches between the two according to the periodicity T
       n ← n + ⌊i/T⌋
       s ← s'
     Until s is terminal

Remark 1. For a smooth transition with the periodicity T, instead of a hard switch, the two perturbed values can be blended, i.e.,

  Q(s, a) ← (Q(s, a) + Δ⁺) w(i) + (Q(s, a) − Δ⁻)(1 − w(i)),  (7)

where w(i) is a sine-based weight with period T.

In Section IV, the results of applying the three algorithms described in Sections II and III are presented.
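The following sketch illustrates the injection mechanism of Algorithm 3 under explicit assumptions: the interval [Q − ε₂, Q + ε₁] is split into m**r subintervals, and a hard period-T alternation (or, for Remark 1, a sinusoidal weight) selects between the midpoints of the last and the first subinterval. These readings fill in details that could not be recovered from the typeset formula, so the code is an interpretation rather than the authors' implementation.

```python
import numpy as np

def injected_q_target(q_std, eps1=0.1, eps2=0.0, m=2, r=4, i=0, T=51):
    """Selective signal injection around a standard Q-learning target q_std.

    The interval [q_std - eps2, q_std + eps1] is split into m**r subintervals
    (assumption: the precision parameter r is read as m**r pieces). The update
    alternates, with period T, between the midpoints of the last and first subinterval.
    """
    width = (eps1 + eps2) / (m ** r)            # width of one subinterval
    upper_mid = q_std + eps1 - width / 2.0      # midpoint of the last subinterval
    lower_mid = q_std - eps2 + width / 2.0      # midpoint of the first subinterval
    use_upper = (i // T) % 2 == 0               # hard period-T switch (assumed rule)
    return upper_mid if use_upper else lower_mid

def injected_q_target_smooth(q_std, eps1=0.1, eps2=0.0, m=2, r=4, i=0, T=51):
    """Remark 1: blend the two midpoints with a sinusoidal weight instead of a hard switch."""
    width = (eps1 + eps2) / (m ** r)
    upper_mid = q_std + eps1 - width / 2.0
    lower_mid = q_std - eps2 + width / 2.0
    w = 0.5 * (1.0 + np.sin(2.0 * np.pi * i / T))   # assumed smooth weight in [0, 1]
    return w * upper_mid + (1.0 - w) * lower_mid
```

Under this reading and the Section IV settings (ε₁ = 0.1, ε₂ = 0, m = 2, r = 4), both candidate values lie slightly above the standard target, so the injection alternates between a larger and a smaller optimistic shift.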
IV. SIMULATION

In this section, the results of the proposed algorithm are examined and compared with the results of the other two algorithms. As briefly mentioned in Section I, the goal of the agent in Fig. 1 is to reach the target at the coordinates (8, 8). If an algorithm can achieve this result with fewer trials, it has better performance. Additionally, if it reaches the goal in each trial with fewer iterations and less exploration, it also demonstrates better efficiency. The results are plotted in two kinds of figures: reward over the number of iterations, and the number of iterations over the number of trials. The reward assignment is as follows: bumping into a wall incurs a penalty of -1, reaching the path to the target yields a reward of 1, and otherwise the reward is 0. The learning rate α is 1/(i + 1), where i is the number of iterations, and γ is 0.5. ε₁, ε₂, T, and the initial r are chosen as 0.1, 0, 51, and 4, respectively.
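A compact sketch of the reward scheme and the trial loop used for the comparison could look like the following Python; the wall encoding, the simplification of "reaching the path to the target" to reaching the target cell, and the train_fn interface are assumptions.

```python
import numpy as np

GAMMA = 0.5
EPS1, EPS2, PERIOD_T, INIT_R = 0.1, 0.0, 51, 4   # parameters reported above

def maze_reward(next_cell, walls, target=(8, 8)):
    """Reward scheme of Section IV: -1 for bumping into a wall, 1 on reaching the
    target (simplifying 'reaching the path to the target'), and 0 otherwise."""
    if next_cell in walls:
        return -1
    if next_cell == target:
        return 1
    return 0

def run_trials(train_fn, n_trials=100):
    """Run 100 independent trials and collect iterations-to-target per trial,
    the quantity shown in the histogram figures and in Table I."""
    iterations = [train_fn(seed=k) for k in range(n_trials)]
    return np.max(iterations), np.mean(iterations)
```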
Fig. 2. Histogram plots of iterations over 100 trials (Algorithm 3).

Fig. 3. Reward plots of 100 trials over iterations (Algorithm 3).

The results of applying the proposed algorithm are shown in Figs. 2 and 3. According to Fig. 2, the agent is able to learn the path to the target with the fewest possible iterations and minimal extra exploration after only 10 initial trials. The mean of the histogram values over the trials is 39.64. The maximum number of iterations observed in Fig. 2 is 286, which is compared with the results of the other algorithms in the following. Additionally, Fig. 3 shows the changes in reward over the iterations for each of the agent's 100 attempts to reach the target; in the worst case, a reward of 1 was received after 125 iterations.

Fig. 4. Histogram plots of iterations over 100 trials (Algorithm 2).

Fig. 5. Reward plots of 100 trials over iterations (Algorithm 2).

The results obtained from applying the speedy Q-learning algorithm are shown in Figs. 4 and 5. In Fig. 4, the maximum iteration value is 1518, which is approximately five times that of the proposed algorithm. Also, in Fig. 5, a reward of 1 is received, in the worst case, after approximately 650 iterations. For this algorithm, the mean of the histogram values over the trials is 74.95.

Fig. 6. Histogram plots of iterations over 100 trials (Algorithm 1).

Fig. 7. Reward plots of 100 trials over iterations (Algorithm 1).

Finally, the results obtained from applying Q-learning are shown in Figs. 6 and 7. In Fig. 6, the maximum number of iterations is 3726, and in Fig. 7, a reward of 1 is received at most after 1500 iterations, which is significantly more than for the other two algorithms. Also, the mean of the histogram values is 127.47.

According to the results, the proposed algorithm can achieve convergence in fewer iterations over the trials. As mentioned in Section I, most fast learning algorithms are based on adjusting coefficients and the learning rate across different timescales. Algorithms similar to the proposed one, in that they do not require changing coefficients and can essentially be injected or added, can contribute to generalization. The numerical results of the simulation over 100 trials are summarized in Table I.

TABLE I. NUMERICAL RESULTS IN 100 TRIALS

Algorithm            Maximum number of iterations   Mean value of iterations   Execution time (seconds)
Q-learning           3726                           127.47                     115
Speedy Q-learning    1518                           74.95                      94
Proposed algorithm   286                            39.64                      45
V. CONCLUSION

The convergence rate is one of the most important criteria in reinforcement learning, and many papers have proposed techniques and algorithms to increase this rate and reduce the learning time. Most of these studies focus on changing the learning rate, injecting noise into the action and parameter spaces, and regularization. In this paper, an algorithm based on the selective injection of a signal into the Q values is presented. The results in Section IV demonstrate that, by choosing appropriate algorithm parameters, the convergence rate increases in comparison to the other algorithms. One of the important and challenging topics in reinforcement learning is non-stationarity, and it would be interesting to consider this criterion in the proposed algorithm. Additionally, by taking other scenarios into account, the strengths and weaknesses of the algorithm can be examined further.
REFERENCES

[1] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[2] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Advances in Neural Information Processing Systems, Denver, CO, USA, pp. 703-710, Dec. 1994.
[3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen, "Speedy Q-learning," in Advances in Neural Information Processing Systems, vol. 24, pp. 2411-2419, 2011.
[4] Y. Cao and X. Fang, "Optimized-weighted-speedy Q-learning algorithm for multi-UGV in static environment path planning under anti-collision cooperation mechanism," Mathematics, vol. 11, pp. 1-28, May 2023.
[5] I. John, C. Kamanchi, and S. Bhatnagar, "Generalized speedy Q-learning," IEEE Control Systems Letters, vol. 4, no. 3, pp. 524-529, July 2020.
[6] A. M. Devraj and S. P. Meyn, "Zap Q-learning," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 2232-2241, 2017.
[7] Y. Zhang, J. Liu, C. Li, Y. Niu, Y. Yang, Y. Liu, and W. Ouyang, "A perspective of Q-value estimation on offline-to-online reinforcement learning," in The 38th AAAI Conference on Artificial Intelligence (AAAI-24), 2024.
[8] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," arXiv:1706.01905, 2017.
[9] H.-D. Lim, D. W. Kim, and D. Lee, "Regularized Q-learning," arXiv:2202.05404, 2022.
[10] A. Li and D. Pathak, "Functional regularization for reinforcement learning via learned Fourier features," in Advances in Neural Information Processing Systems, vol. 34, pp. 19046-19055, 2021.
[11] A. Tamar, D. Soudry, and E. Zisselman, "Regularization guarantees generalization in Bayesian reinforcement learning through algorithmic stability," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, pp. 8423-8431, 2022.
[12] W. Zhan, S. Cen, B. Huang, Y. Chen, J. D. Lee, and Y. Chi, "Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence," arXiv:2105.11066, 2021.
[13] C. Szepesvari, "The asymptotic convergence-rate of Q-learning," in Advances in Neural Information Processing Systems, vol. 10, pp. 1064-1070, 1998.
