Intelligent Algorithmic Trading Strategy Using Reinforcement Learning and Directional Change
ABSTRACT Designing a profitable trading strategy plays a critical role in algorithmic trading, where the algorithm can manage and execute automated trading decisions. Determining a specific trading rule to apply at a particular time is a critical research problem in financial market trading. However, an intelligent and dynamic algorithmic trading strategy driven by the current patterns of a given price time-series may help deal with this issue. Thus, Reinforcement Learning (RL) can achieve optimal dynamic algorithmic trading by considering the price time-series as its environment. A comprehensive representation of the environment states is vital for proposing dynamic algorithmic trading using RL. Therefore, we propose a representation of the environment states using the Directional Change (DC) event approach with a dynamic DC threshold. We refer to the proposed algorithmic trading approach as the DCRL trading strategy. In addition, the proposed DCRL trading strategy was trained using the Q-learning algorithm to find an optimal trading rule. We evaluated the DCRL trading strategy on real stock market data (S&P500, NASDAQ, and Dow Jones, over a five-year period from 2015 to 2020), and the results demonstrate that the DCRL state representation policies obtained more substantial trading returns and improved Sharpe Ratios in a volatile stock market. In addition, a series of performance analyses demonstrates the robust performance and extensive applicability of the proposed DCRL trading strategy.
INDEX TERMS Machine learning, reinforcement learning, Q-learning, directional change event, algorithmic trading, stock market.
RL is a machine learning method used for sequential decision-making problems [5]. It achieves policy improvement through continuous interaction with, and ongoing evaluation of, its environment. An RL agent performs a sequence of actions based on the environment states to receive a predefined reward. In contrast to supervised machine learning, which requires historical labeled data, the RL agent learns the environment's states and performs actions through continuous evaluation of the dynamic environment. The RL algorithm has several advantages, e.g., self-learning, ongoing behavior enhancements, and adaptivity to the environment states. RL has been applied effectively in different domains, e.g., job scheduling [6], pattern recognition [7], and algorithmic trading [8]–[11].

Despite the effectiveness and robustness of the RL algorithm, employing an algorithmic trading strategy remains a challenge in real-world trading for three reasons. First, using a physically fixed time interval (e.g., hourly data) to represent the environment states makes the flow of the price time-series irregularly spaced, because prices are transacted at irregular times and at different magnitudes and directions [12]. Physical time employs a point-based system, where a single time unit for observing price changes ranges from seconds to hours or even days; thus, time is homogeneous. Under intrinsic time, the Directional Change (DC) event approach emerges as an alternative approach for price time-series analysis that can capture periodic patterns in the price time-series. Second, selecting appropriate features and data to represent the environment states can be difficult. For example, manual selection of features and data is challenging due to the large search space (e.g., fundamental and technical indicator data) [9]. Finally, machine learning algorithms have a complex structure and a large number of different parameters [4]. Reducing the number of parameters simplifies the tracking and interpretation of the trading performance results.

This study extends the work of Alkhamees and Aloud [7], where a DCRL model was introduced to detect directional price changes in price time-series. The proposed DCRL model is considered an alternative to traditional time-series analytical approaches for environment state representation. These traditional approaches are based on fixed time interval analysis; in contrast, the DCRL model samples the price time-series under intrinsic time. The DCRL model also learns the states of the price time-series to find the optimal dynamic threshold for DC event analysis. The dynamic DC threshold was introduced in [13] to replace the fixed DC threshold, which is used to identify DC events (i.e., directional price changes).

This paper develops an intelligent and dynamic algorithmic trading strategy using the proposed DCRL model. Specifically, we present two algorithmic trading strategies: the first is a direct RL approach, and the second additionally incorporates the Q-learning algorithm. Essentially, the proposed DCRL algorithmic trading employs the DC event approach with the dynamic DC threshold to derive the state representation in RL. In addition, it uses the RL decision-making algorithm to make decisions and take the most appropriate trading action.

The DCRL algorithmic trading strategies were evaluated using real financial market data for stock trading. We conducted a series of systematic experiments to confirm the effectiveness and interpretability of the trading performance results. Therefore, we selected three common US stock indices to evaluate the performance of the DCRL algorithmic trading strategies (with and without Q-learning) and compare their performance against Zero-Intelligence (ZI) trading agents. The experimental results demonstrate that the DCRL algorithmic trading strategies are effective in different market situations and can potentially generate profits.

Our primary contributions are summarized as follows. First, we contribute to the financial market literature by designing and developing an algorithmic trading strategy that is suitable for stock markets by improving the RL environment state representation and action decision-making to ensure stable trading returns even in the case of volatile price time-series. Second, we contribute to the application of the DC event approach for the representation of the environmental states in RL. The proposed algorithmic trading considers sequential DC event recognition in the price time-series using the dynamic DC threshold. This model can support decision-makers in determining optimal trading opportunities to maximize profits. Finally, we contribute to the literature by using the Q-learning algorithm to improve the learning process via previously gained experience, and we capture long-term learning and continuous improvements via the Q-learning algorithm to achieve optimal policies under different market states.

The remainder of this paper is organized as follows. Section II presents a brief discussion of literature related to the RL algorithm in financial trading. Section III provides a brief description of the DC event approach and the definition of the dynamic DC threshold. Section IV describes DCRL algorithmic trading and the Q-learning algorithm. Section V presents the datasets, experiment settings, and profitability results, and discusses the empirical results. Section VI concludes the paper and presents suggestions for potential future work.

II. RELATED WORKS
Several works in the financial and machine learning literature have exploited RL in different financial market studies, e.g., financial signal representation [4], [7], [14], building algorithmic trading [4], [8]–[10], [15], [16], portfolio management [11], [17], [18], optimizing trade execution [19], Foreign Exchange (FX) asset allocations [20], changes in market regimes [11], and stock market modelling [21], [22]. Building algorithmic trading using RL has been the focus of many studies for a range of market settings. Some studies have used direct RL [23], while others have employed a value-based RL approach with a Q-Learning matrix to realize algorithmic trading [15], [23], [24]. In addition, other studies have used the Recurrent RL (RRL) approach [10], [11], [25]
or applied a Q-learning algorithm to the design of trading strategies [9], [26], [27]. Furthermore, several recent studies have employed deep RL for financial portfolio management [17], [18].

Regarding the literature on algorithmic trading using direct RL, Bertsimas and Lo [23] examined an application of the RL algorithm for trading a large block of equity over a fixed time horizon to minimize the expected cost of executing trades. They identify optimal trading rules (i.e., executed actions) as a strategy that evolves over a few days. Their experimental results demonstrated that the RL strategy saved between 25% and 40% of execution costs compared to the naïve strategy. However, this study's main drawback was the assumption that the quantity of each buy order is sufficiently high to increase the price of the traded security. The work in [22] designed a next-generation multi-agent system (MAS) stock market simulator. Each agent learns price forecasting and stock trading autonomously via RL. The results demonstrate that agent learning allows accurate simulation of the market microstructure.

Several studies in the literature utilized a value-based RL approach with a Q-Learning matrix for algorithmic trading. Gao and Laiwan [24] and Pendharkar and Cusatis [15] employed a value-based RL approach with a Q-Learning matrix to develop algorithmic trading methods. Here, the core idea is to approximately calculate each state's value function (or state-action pair) and subsequently select the greedy trading action based on the value function. Reference [24] used two performance functions, i.e., absolute profit and relative risk-adjusted profit, to train the algorithmic trading model. The authors in [15] proposed several RL agents for trading portfolio assets. They designed on-policy (SARSA(λ)) and off-policy (Q-learning) discrete state and discrete action agents. Here, the goal is to maximize one of two values: the portfolio returns or the differential Sharpe ratio. They examined the impact of RL and trading frequencies. The results demonstrate that a continuous adaptive action RL trading strategy consistently performs best in forecasting portfolio allocations in the following period. The learning frequency of RL algorithmic trading is essential in determining trading performance. The work by [9] and [20] demonstrated the effectiveness of the policy-based model over the value-based function model relative to performance and applicability.

With regard to the adoption of Q-learning, Neuneier [26] applied a Q-learning algorithm to optimize a trading portfolio. Neuneier constructed an Artificial Neural Network (ANN) to forecast price movement and then used the Q-learning algorithm to find an optimal policy. Another study [27] proposed a portfolio optimization technique using the RL Q-learning approach. This method improved the Q-learning algorithm for optimal asset allocation introduced in [26]. This model simplifies the previous model [26] by using one value function for several assets, facilitating model-free policy iteration. Another study [9] used a direct RL alternative and compared their algorithm to Q-learning and temporal difference algorithms using real data. Their results demonstrated that the differential Sharpe ratio RRL system outperformed the Q-learning algorithm. Carapuço et al. [28] developed an RL trading system to trade in the foreign exchange market. They used ANNs with three hidden layers, where the neurons were trained as RL agents under the Q-learning algorithm using a simulated market environment framework. The framework was tested using EUR/USD market data from 2010 to 2017 with more than 10 tests with different initial conditions, and an average total profit of 114.0% ± 19.6% was achieved.

Other studies in the literature have used the Recurrent RL (RRL) approach. Moody et al. [10] proposed an application of the RRL approach. RRL is an unconstrained RL algorithm that addresses the problem of dimensionality. Several studies have extended the RRL model. For example, Zhang and Maringer [25] used technical analysis indicators, fundamental analysis, and econometric study with RRL to improve trading decisions. The analytical indicators were filtered using a genetic algorithm evolutionary process. Reference [8] combined RRL and a particle swarm with a Calmar ratio-based objective function for portfolio trading. They evaluated their method using S&P100 index stocks, and the results demonstrated that the proposed portfolio trading system outperformed benchmark trading strategies, particularly under high transaction cost conditions. In addition, the results demonstrated that the Calmar ratio was the best fitness function for particle swarm algorithms.

In recent years, RL research has clustered around deep RL. The work in [17] used a financial-model-free RL framework to deliver a deep RL solution to the portfolio management problem. The central part of the deep RL framework is the Ensemble of Identical Independent Evaluators (EIIE) topology. An EIIE is a neural network designed to examine the historical data of an asset and evaluate its potential growth. In their work, the portfolio weights identify the action for the RL agent. The reward of the RL agent is the explicit average value of the recurring logarithmic returns. In a similar context, [18] offers a portfolio management approach using deep RL on markets with a dynamic number of assets. The neural network architecture is employed and trained using deep RL. Their design was tested on a historical dataset of one of the largest world cryptocurrency markets. The results outperform state-of-the-art methods in the literature, accomplishing average daily returns of over 24%.

The main advantage of the algorithmic trading strategies proposed in this paper is their continuous adaptability to new market conditions using a learning process resulting from dynamic DC events. In addition, existing RL algorithmic trading modules do not consider an event-based system, where an event is the basic unit for studying the price time-series. Thus, the representation of environmental states (i.e., market states) in RL algorithmic trading must be improved to realize adaptability to continuous changes in market behaviour.
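For concreteness, the following minimal Python sketch illustrates the classic fixed-threshold DC event detection on which the dynamic-threshold variant of [13] builds: a directional change event is confirmed once the price reverses from the last observed extreme by at least a threshold theta. The function name, interface, and the 1% example threshold are illustrative assumptions, not the authors' implementation.

def detect_dc_events(prices, theta=0.01):
    """Classic fixed-threshold DC detection (illustrative sketch).

    A downturn (upturn) DC event is confirmed when the price falls (rises)
    by at least `theta` relative to the last observed extreme.
    Returns a list of (index, direction) tuples, direction in {"up", "down"}.
    """
    events = []
    extreme = prices[0]          # last confirmed extreme price
    trend = "up"                 # current assumed run direction
    for i, p in enumerate(prices[1:], start=1):
        if trend == "up":
            if p > extreme:
                extreme = p      # new maximum extends the upward run
            elif (extreme - p) / extreme >= theta:
                events.append((i, "down"))   # downturn DC event confirmed
                trend, extreme = "down", p
        else:
            if p < extreme:
                extreme = p      # new minimum extends the downward run
            elif (p - extreme) / extreme >= theta:
                events.append((i, "up"))     # upturn DC event confirmed
                trend, extreme = "up", p
    return events

# Example: a 1% threshold on a toy daily closing-price series.
print(detect_dc_events([100, 101, 102, 100.5, 99, 100.2, 101.5], theta=0.01))

The dynamic-threshold variant replaces the fixed theta with a value adapted to recent volatility, as defined in [13]; that adaptation is not reproduced here.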
These functions estimate the effectiveness of an agent's action in a given state. The notion of "effectiveness" in RL is defined according to future rewards, i.e., the expected return in a financial trading context. Thus, these value functions are determined based on specified policies. The value of state s following policy π (denoted v_π(s)) is the expected return when starting in s and following π through the specified period. For the MDP, we can define the state-value function v_π(s) for policy π as follows in Eq. 1:

v_π(s) = E_π[G_t | S_t = s]    (1)

where E_π[·] is the expected value following policy π, and t is any time. Here, G_t is the cumulative discounted return for state s at time t, which is defined as follows in Eq. 2:

G_t(s) = [ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]    (2)

where gamma (γ) is a discount factor that takes a value between 0 and 1. The discount factor (γ) defines the importance of future rewards and weighs recent rewards more heavily. In algorithmic trading, a higher discount factor value implies that the agent will become more long-term investment oriented. For example, in the extreme case of γ = 1, the agent considers each reward equally through the market run. In contrast, for γ = 0, the agent is biased because it only reflects the current reward and discards future rewards.

Similarly, we define the function of a state-action pair Q(s, a). The value of taking action a in state s following policy π (denoted Q_π(s, a)) is defined as the expected return starting from s, taking action a, and subsequently following policy π. The action value function for policy π is expressed as follows:

Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (3)

where G_t is the cumulative discounted return for all actions in state s at time t, which is defined as follows:

G_t(s) = [ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]    (4)

In RL, there are two main algorithms designed to find the optimal action a_{t+1} to take given the current state s_{t+1}. The first is the off-policy algorithm, where the Q(s, a) function does not depend on the agent's learning policy; thus, it learns from taking different actions (e.g., random actions). The second is the on-policy algorithm, where the Q(s, a) function is dependent on the agent's learning policy; thus, the agent learns from actions it has taken using the current policy π(a|s).

B. DCRL STATES
The principle of RL is that an agent continuously interacts with the environment and learns the optimal trading rule to improve its trading strategy. For stock market trading, the environment comprises the current stock price data and the historical price series, including a variety of fundamental data and technical analysis indicators. Therefore, selecting the set of data inputs is a prerequisite for trading agents to learn the stock market environment and discover trading rules. The underlying challenge of stock market trading is capturing market states at a specific time. For price time-series, the data commonly employed in the financial forecasting literature represent the price sequence at regular time intervals (e.g., daily data). In this study, we used the daily data of stock market indexes, i.e., the opening, closing, high, and low prices for each day.

The market state variable of each trading day is represented by a pair of the DC price trend direction (an upward or downward trend) and the type of detected event (an overnight event, a previous day event, or no event, i.e., Neutral). This gives six states for our research problem. A lookup table (Table 1) is established for state representation of the environment, where each state is signified with a single action associated with an expected reward.

TABLE 1. Lookup table for DCRL algorithmic trading.

The agent uses an RL algorithm to change from state s_t to s_{t+1}, which is based on learning the dynamics of the environment. Thus, if state s_t were Overnight or PreviousDay with an Upward trend, the action would be Sell, because we consider that the price increase that occurred due to an overnight or previous day price change was high. The same applies if s_t was Overnight or PreviousDay with a Downward trend, i.e., the action would be Buy, because we consider that prices have fallen sharply due to a sudden overnight or previous day change in price. The Overnight or PreviousDay states are satisfied if the five-day moving average is greater than the overnight or previous day's price change. However, if the detected state is Neutral, which indicates that no significant event was identified in the price time-series between time t − 1 and t, we use the optimal state-action value function to select the optimal policy for t + 1.
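As an illustration of the six-state representation and the lookup-table rule described above, the following Python sketch enumerates the states as (event type, trend direction) pairs and maps the Overnight and PreviousDay states to their fixed Sell/Buy actions, deferring the Neutral states to a learned state-action value table. The names and the q_table hook are illustrative assumptions, not the authors' code.

from enum import Enum

class Event(Enum):
    OVERNIGHT = "Overnight"
    PREVIOUS_DAY = "PreviousDay"
    NEUTRAL = "Neutral"

class Trend(Enum):
    UP = "Upward"
    DOWN = "Downward"

class Action(Enum):
    BUY = "Buy"
    SELL = "Sell"
    HOLD = "Hold"

# Six environment states: (event type, DC trend direction).
STATES = [(e, d) for e in Event for d in Trend]

def lookup_action(state, q_table=None):
    """Rule-based action for Overnight/PreviousDay states (as in Table 1).

    For the Neutral states the paper defers to the learned state-action
    values; here `q_table` stands in for that learned component.
    """
    event, trend = state
    if event in (Event.OVERNIGHT, Event.PREVIOUS_DAY):
        # A sharp overnight/previous-day rise is sold; a sharp fall is bought.
        return Action.SELL if trend is Trend.UP else Action.BUY
    if q_table is not None:
        return max(Action, key=lambda a: q_table.get((state, a), 0.0))
    return Action.HOLD

print(lookup_action((Event.OVERNIGHT, Trend.UP)))    # Action.SELL
print(lookup_action((Event.NEUTRAL, Trend.DOWN)))    # Action.HOLD (no Q-table supplied)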
C. DCRL ACTIONS
At each time step t, the agent observes the environment's state s_t and executes a trading action following policy π(s, a). Here, the agent actions are buy, sell, or hold, i.e., A = {Buy, Sell, Hold}. An agent receives a reward after it takes an action. An action a_t may have an impact on the agent's portfolio value, specifically the cash and share values, given that a trading action executes at the current market closing price p_t.

Two experimental design constraints are assumed regarding the quantity of traded shares Q_t at time t. First, for the Buy action, the amount of shares to be bought by the trading agent is based on all available cash that the agent has at time t. For the Sell action, the agent sells all available shares at time t. In other words, the agent spends 100% of its cash when buying and 100% of its shares when selling. Second, there is no transaction cost in this simulation. By making these simplifying assumptions, the complexity of the trading strategy is reduced to a level that can be explored and examined within the scope of this study. Simplicity is essential to understand an agent's trading behaviors and the trading rules generated by the agent, because assigning variable quantities may result in a more complicated analysis. Note that relaxing these assumptions does not affect the generality or the accuracy of the obtained results. Nevertheless, we are aware of the share quantity's critical role as a choice variable for the generated trading rules (especially with risk aversion).

D. REWARD FUNCTION
An agent designed based on the RL algorithm learns the optimal policy to trade to achieve maximum profit; therefore, the reward function design is critical when designing trading strategies based on the RL algorithm. In the stock market trading literature, several studies have used the Rate of Return (RoR) as a reward function [15].

In this study, we used two immediate reward criteria for the DCRL agent. The first criterion is for the Buy action, where the Relative Return (RR) is used (Eq. 5). Here, p_Sell and p_Buy are the selling and buying prices, respectively. The RR is defined as the difference between the absolute price return at time t and the return reached by the target time. The second immediate reward criterion is for the Sell action, where the RoR is used (Eq. 6). The RoR is the net gain (or loss) of a single trade over a particular period based on the trade's initial cost.

RR = (p_t − p_{t−1}) / p_{t−1}    (5)

RoR = (p_Sell − p_Buy) / p_Buy    (6)

Here, p_t and p_{t−1} are the current price at time t and the price at time t − 1, respectively. To assess the action taken (i.e., the executed trading action), we employ two reward functions so that we can consider the different impacts of both the Sell and Buy actions. The authors of [28] used two reward functions, i.e., the trade profit was used for closing a position (sell action), and the variation of unrealized profit was employed for opening (buy action) or holding a position.

E. Q-LEARNING ALGORITHM
Q-learning is an off-policy RL algorithm that seeks to maximize the total reward. Quality in the RL approach signifies how effective an executed action a_t at time t was relative to achieving a particular future reward. In the Q-learning algorithm, we create a Q-table or matrix that follows policy π(s, a) and randomly initialize the values in the matrix. Then, for each iteration of the market run, the Q-values are updated and stored in the matrix. Accordingly, the Q-matrix turns into a reference matrix for the agent to determine the optimal action based on the maximum Q-value. The Q-function uses the Bellman equation, which takes two inputs, i.e., state s_t and an action under policy π(s, a). Given the current state s_t of the environment at time t and the taken action a_{t+1}, we can formulate the action value function following policy π as follows:

Q(s, a) = Q(s, a) + α [ R(s, a) + γ max_{a'} Q(s', a') − Q(s, a) ]    (7)

where Q(s, a) is the new Q-value for state s_t and action a_t, α is the learning rate satisfying 0 ≤ α ≤ 1, R(s, a) is the reward for taking action a_t at state s_t, γ is the discount factor (also referred to as the discount rate) satisfying 0 ≤ γ ≤ 1, and max_{a'} Q(s', a') is the maximum expected reward for the new state s' over all possible actions at state s'. Low alpha (α) values imply a slower learning rate, while higher alpha values indicate more rapid learning of Q-value updates.

For simplicity, we refer to DCRL with the Q-learning algorithm as QDCRL. A QDCRL agent learns an optimal state-action value function Q* for the Neutral state, where an update process considers a quintuple Q(s_t, a_t, r_t, s_{t+1}, a_{t+1}) of the environment. For the six states and three actions, we create a matrix Q ∈ R^{6×3} initialized with random values. Therefore, Q(s, a) represents the Q-value for state s and action a. The initial random Q(s, a) values are subsequently updated in the simulation run by identifying new states and actions using the dataset, where a reward r(s, a) is assigned for each selected action. The structure of QDCRL algorithmic trading is shown in Figure 2.

V. EXPERIMENT AND RESULTS
In this section, we discuss a series of experiments conducted with the proposed DCRL (with and without Q-learning) algorithmic trading strategies, including the datasets used, performance evaluation metrics, benchmarks, experimental settings, and trading performance results.

We evaluated three aspects of the proposed DCRL and QDCRL algorithmic trading strategies, i.e., trading performance profitability and effectiveness, as well as the adaptability and efficiency of the dynamic threshold DC event approach for the RL environment state representation. Finally, we confirmed the efficacy of the Q-learning algorithm in RL for algorithmic trading.
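The following minimal Python sketch, written under the paper's simplifying assumptions (all-in buy/sell, no transaction costs), ties together the two immediate reward criteria of Eqs. (5) and (6) and the tabular update of Eq. (7). The class layout, the default α and γ, the random initialization scale, and the zero reward for Hold are illustrative choices rather than the paper's implementation; in the paper, the Q-update is applied when the Neutral state is encountered.

import random

ACTIONS = ("Buy", "Sell", "Hold")

def relative_return(p_t, p_prev):
    """Eq. 5: immediate reward used for the Buy action."""
    return (p_t - p_prev) / p_prev

def rate_of_return(p_sell, p_buy):
    """Eq. 6: immediate reward used for the Sell action."""
    return (p_sell - p_buy) / p_buy

class QDCRLSketch:
    def __init__(self, states, alpha=0.1, gamma=0.9):
        self.alpha, self.gamma = alpha, gamma
        # 6 x 3 Q-matrix initialized with random values (one row per state).
        self.q = {s: {a: random.uniform(-0.01, 0.01) for a in ACTIONS} for s in states}
        self.cash, self.shares, self.entry_price = 10_000.0, 0.0, None

    def execute(self, action, price, prev_price):
        """All-in execution with no transaction cost; returns the immediate reward."""
        if action == "Buy" and self.cash > 0:
            self.shares, self.entry_price = self.cash / price, price
            self.cash = 0.0
            return relative_return(price, prev_price)              # Eq. 5
        if action == "Sell" and self.shares > 0:
            reward = rate_of_return(price, self.entry_price)       # Eq. 6
            self.cash, self.shares, self.entry_price = self.shares * price, 0.0, None
            return reward
        return 0.0                                                 # Hold or infeasible action (assumed zero reward)

    def update(self, s, a, reward, s_next):
        """Eq. 7: Q(s,a) = Q(s,a) + alpha * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]."""
        best_next = max(self.q[s_next].values())
        self.q[s][a] += self.alpha * (reward + self.gamma * best_next - self.q[s][a])

Keeping the Q-values in a per-state dictionary mirrors the 6 × 3 Q-matrix of the paper while leaving the state encoding flexible.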
TABLE 2. Descriptive statistics of stock indices.

TABLE 4. ROI and Sharpe Ratio (SR) for different parameter settings (discount factor (γ) and learning rate (α)). Results for the S&P500 stock index are shown.

FIGURE 3. Price curve movement of the S&P500, NASDAQ, and Dow Jones stock indices during the target period.
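For reference, the sketch below computes the ROI and Sharpe Ratio (SR) referred to in the captions above using their standard definitions; the zero risk-free rate and the 252-trading-day annualization factor are assumptions and may differ from the paper's exact conventions.

import statistics

def roi(final_value, initial_value):
    """Return on investment over the whole trading period."""
    return (final_value - initial_value) / initial_value

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe Ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free_rate / periods_per_year for r in daily_returns]
    return (statistics.mean(excess) / statistics.stdev(excess)) * periods_per_year ** 0.5

print(roi(11_500, 10_000))                                        # 0.15
print(round(sharpe_ratio([0.001, -0.002, 0.003, 0.0005, 0.0015]), 2))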
during the trading period, which is measured to avoid exceedingly frequent trading resulting in extremely high risk.

D. BENCHMARK TRADING STRATEGY
To further evaluate the performance of the proposed DCRL and QDCRL trading strategies, we compare them to the ZI agent with a budget constraint for stock trading. The ZI is a benchmark used to evaluate intelligent algorithmic trading models. It is a completely random approach that allows us to assess the intelligence and learning effectiveness of the DCRL and QDCRL. In addition, we benchmark against the Direct RL designed by [9], which is a classical RL model for algorithmic trading. The reason for choosing the Direct RL model as a baseline benchmark is to provide a rational comparison with a minimum level of supervised learning. Besides, this allows us to evaluate the dynamic DC event approach's effectiveness in representing the environment's state. Furthermore, we compare the performance of DCRL and QDCRL with the classic DC event approach (fixed threshold) introduced by [12]. The DC approach provides pattern detection for price time-series without utilizing any machine learning techniques. We employed the DC approach using a variety of fixed thresholds in the range [0.001, 0.01]. The average performance of the different simulation runs was reported.
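As a point of comparison, the ZI benchmark can be sketched as a budget-constrained agent that draws its action uniformly at random at each step, in the spirit of [44]; the specific class interface and initial budget below are illustrative assumptions.

import random

class ZeroIntelligenceAgent:
    """Budget-constrained random trader used as a baseline benchmark."""

    def __init__(self, cash=10_000.0, seed=None):
        self.cash, self.shares = cash, 0.0
        self.rng = random.Random(seed)

    def step(self, price):
        action = self.rng.choice(("Buy", "Sell", "Hold"))
        if action == "Buy" and self.cash > 0:
            self.shares, self.cash = self.cash / price, 0.0      # spend all cash
        elif action == "Sell" and self.shares > 0:
            self.cash, self.shares = self.shares * price, 0.0    # sell all shares
        return action

    def portfolio_value(self, price):
        return self.cash + self.shares * price

agent = ZeroIntelligenceAgent(seed=42)
for p in [100.0, 101.2, 99.8, 100.5]:
    agent.step(p)
print(round(agent.portfolio_value(100.5), 2))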
FIGURE 5. Profit curves of QDCRL, DCRL, and ZI trading strategies for three stock indices.
the total available trading time. The higher the number of trading signals taking place, the more likely they are to lead to negative investment results.

Figure 4 shows the daily portfolio return for the QDCRL, DCRL, and ZI trading agents during the target period (1260 days) for the S&P500, NASDAQ, and Dow Jones indices. We have excluded the portfolio return for the Direct RL given its massive negative returns during the vast majority of trading periods. For the S&P500, the DCRL and ZI initially outperformed the QDCRL. After that, we can clearly see how the learning is well reflected in the QDCRL performance and, hence, how the QDCRL significantly outperformed both DCRL and ZI. The same applies to the Dow Jones (in the third chart), where, initially, ZI and DCRL outperformed QDCRL. However, as learning progresses, the QDCRL significantly outperformed both DCRL and ZI. The same also applies to NASDAQ, where learning has proven to be effective when used with RL and DC. Finally, learning had a remarkable effect on
QDCRL performance, and QDCRL generally outperformed both DCRL and ZI.

VI. CONCLUSION
In this paper, we have proposed two algorithmic trading strategies based on the DCRL model. Our main focus was to improve the environment state representations for the RL algorithm. The dynamic DC threshold event approach was able to precisely represent the environment states. In addition, it was able to efficiently capture stable market states, which led to achieving profitable trading returns under acceptable risk levels in several stock indices. The effectiveness and robustness of the DCRL trading strategies were verified on real stock market data, and the experimental results demonstrate that the proposed DCRL algorithmic trading outperformed the ZI, Direct RL, and classic DC trading strategies with higher total profits and SR, as well as more consistent profit curves.

Our primary contributions are summarized as follows. We defined the environment states in the RL algorithm using the dynamic DC threshold event approach, we developed a simple lookup table for RL algorithmic stock trading, and we employed the Q-learning algorithm to select the optimal policy under the Neutral market state.

Given the dynamic nature of the price time-series, trained and adaptive algorithmic trading must be retrained when the environment states change based on specified preconditions. The learning mechanism based on the dynamic DC threshold event approach is effective in improving the representation of the market's states. The DCRL agents' trading performance (with and without Q-learning) was generally significant and turned a profit within an appropriate level of risk. These results indicate that, to generate proper trading rules and high-performance returns, learning the environment states is required (i.e., adaptive and non-static representations of the price time-series are needed).

We used two reward functions for the DCRL agents, where each reward is associated with a specific action (either a buy or sell action). The relative return reward function was used for the buy action, and the rate of return reward function was used for the sell action. We found that using these reward functions (rather than a single reward function) improved the Q-learning matrix's performance.

There are two reasons why the QDCRL trading algorithm outperformed DCRL. The first is the learning process for the optimal trading policy under specific market conditions. As stated previously, the performance of QDCRL agents depends on the selection of the optimal policy. The learning frequency of algorithmic trading plays a critical role in influencing the agent's trading analytical performance; however, we did not find that large learning rate (α) values are always effective. We consider that the difference between loss and reward in the Neutral state was caused by the fact that Q-learning may effectively model the long-term discounted returns of a particular state. In addition, we restricted the agent to select from a finite action set based on the optimal policy, which may permit the agent to submit more trading signals. The results of this study suggest that adaptive QDCRL agents with Q-learning provide the best performance based on investment profitability and are more promising in practical applications.

This paper can be further extended in several research directions. For example, in the future, we can examine DCRL (with and without Q-learning) on high-frequency trading to explore and confirm the effectiveness of DCRL algorithmic trading, thus further improving and optimising DCRL to fit that trading context. In addition, we can evaluate applying DCRL algorithmic trading to different emerging markets, e.g., the Forex market and cryptocurrencies. Finally, DCRL algorithmic trading can only trade one asset at a time; thus, we can also extend our investigations to managing portfolios involving multiple assets.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their useful comments and suggestions.

DECLARATION OF INTEREST
The authors report no conflicts of interest. They alone are responsible for the content and writing of the article.

REFERENCES
[1] P. Treleaven, M. Galas, and V. Lalchand, "Algorithmic trading review," Commun. ACM, vol. 56, pp. 76–85, Nov. 2013.
[2] B. Bruce, Trading Algorithms, Student-Managed Investment Funds, 2nd ed. Cambridge, U.K.: Academic, 2020, pp. 285–315.
[3] E. Fama and M. Blume, "Filter rules and stock market trading profits," J. Bus., vol. 39, pp. 226–241, Jan. 1966.
[4] K. Lei, B. Zhang, Y. Li, M. Yang, and Y. Shen, "Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading," Expert Syst. Appl., vol. 140, Feb. 2020, Art. no. 112872.
[5] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[6] S. Chinchali, P. Hu, T. Chu, M. Sharma, M. Bansal, R. Misra, M. Pavone, and S. Katti, "Cellular network traffic scheduling with deep reinforcement learning," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 766–774.
[7] N. Alkhamees and M. Aloud, "DCRL: Approach to identify financial events from time series using directional change and reinforcement learning," Int. J. Adv. Comput. Sci. Appl., vol. 12, no. 8, 2020.
[8] F. Bertoluzzo and M. Corazza, "Reinforcement learning for automatic financial trading: Introduction and some applications," Dept. Econ., Ca' Foscari Univ. Venice, Venice, Italy, Work. Paper 2012:33, 2012.
[9] J. Moody and M. Saffell, "Learning to trade via direct reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 4, pp. 875–889, Jul. 2001.
[10] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," J. Forecasting, vol. 17, nos. 5–6, pp. 441–470, Sep. 1998.
[11] S. Almahdi and S. Yang, "An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown," Expert Syst. Appl., vol. 87, pp. 267–279, Nov. 2017.
[12] J. B. Glattfelder, A. Dupuis, and R. B. Olsen, "Patterns in high-frequency FX data: Discovery of 12 empirical scaling laws," Quant. Finance, vol. 11, no. 4, pp. 599–614, Apr. 2011.
[13] N. Alkhamees and M. Fasli, "Event detection from time-series streams using directional change and dynamic thresholds," in Proc. IEEE Int. Conf. Big Data (Big Data), Boston, MA, USA, Dec. 2017, pp. 1882–1891.
[14] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 653–664, Mar. 2017.
[15] P. C. Pendharkar and P. Cusatis, "Trading financial indices with reinforcement learning agents," Expert Syst. Appl., vol. 103, pp. 1–13, Aug. 2018.
[16] L. Weng, X. Sun, M. Xia, J. Liu, and Y. Xu, "Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism," Neurocomputing, vol. 402, pp. 171–182, Aug. 2020.
[17] Z. Jiang, D. Xu, and J. Liang, "A deep reinforcement learning framework for the financial portfolio management problem," 2017, arXiv:1706.10059. [Online]. Available: http://arxiv.org/abs/1706.10059
[18] C. Betancourt and W.-H. Chen, "Deep reinforcement learning for portfolio management of markets with a dynamic number of assets," Expert Syst. Appl., vol. 164, Feb. 2021, Art. no. 114002.
[19] Y. Nevmyvaka, Y. Feng, and M. Kearns, "Reinforcement learning for optimized trade execution," in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 1–8.
[20] M. Dempster and V. Leemans, "An automated FX trading system using adaptive reinforcement learning," Expert Syst. Appl., vol. 30, pp. 543–552, Apr. 2006.
[21] C.-H. Kuo, C.-T. Chen, S.-J. Lin, and S.-H. Huang, "Improving generalization in reinforcement learning–based trading by using a generative adversarial market model," IEEE Access, vol. 9, pp. 50738–50754, 2021.
[22] J. Lussange, I. Lazarevich, S. Bourgeois-Gironde, S. Palminteri, and B. Gutkin, "Modelling stock markets by multi-agent reinforcement learning," Comput. Econ., vol. 57, no. 1, pp. 113–147, Jan. 2021.
[23] D. Bertsimas and A. W. Lo, "Optimal control of execution costs," J. Financial Markets, vol. 1, no. 1, pp. 1–50, Apr. 1998.
[24] X. Gao and C. Laiwan, "An algorithm for trading and portfolio management using Q-learning and Sharpe ratio maximization," in Proc. Int. Conf. Neural Inf. Process., 2000, pp. 832–837.
[25] J. Zhang and D. Maringer, "Indicator selection for daily equity trading with recurrent reinforcement learning," in Proc. 15th Annu. Conf. Companion Genetic Evol. Comput., Jul. 2013, pp. 1757–1758.
[26] R. Neuneier, "Optimal asset allocation using adaptive dynamic programming," in Proc. Adv. Neural Inf. Process. Syst., Cambridge, MA, USA: MIT Press, 1996, pp. 952–958.
[27] R. Neuneier, "Enhancing Q-learning for optimal asset allocation," in Proc. Adv. Neural Inf. Process. Syst., 1998, pp. 936–942.
[28] J. Carapuço, R. Neves, and N. Horta, "Reinforcement learning applied to Forex trading," Appl. Soft Comput., vol. 73, pp. 783–794, Dec. 2018.
[29] M. Aloud, E. Tsang, R. Olsen, and A. Dupuis, "A directional-change events approach for studying financial time series," Econ. Open Access Open Assess. E-J., vol. 6, pp. 1–18, Dec. 2012.
[30] A. Bakhach, E. P. K. Tsang, and H. Jalalian, "Forecasting directional changes in the FX markets," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2016, pp. 1–8.
[31] E. P. K. Tsang, R. Tao, A. Serguieva, and S. Ma, "Profiling high-frequency equity price movements in directional changes," Quant. Finance, vol. 17, no. 2, pp. 217–225, Feb. 2017.
[32] H. Ao and E. Tsang, "Trading algorithms built with directional changes," in Proc. IEEE Conf. Comput. Intell. Financial Eng. Econ. (CIFEr), May 2019, pp. 1–7.
[33] A. M. Bakhach, E. P. K. Tsang, and V. L. Raju Chinthalapati, "TSFDC: A trading strategy based on forecasting directional change," Intell. Syst. Accounting, Finance Manage., vol. 25, no. 3, pp. 105–123, Jul. 2018.
[34] N. Alkhamees and M. Fasli, "An exploration of the directional change based trading strategy with dynamic thresholds on variable frequency data streams," in Proc. Int. Conf. Frontiers Adv. Data Sci. (FADS), Oct. 2017, pp. 108–113.
[35] N. Alkhamees and M. Fasli, "A directional change based trading strategy with dynamic thresholds," in Proc. IEEE Int. Conf. Data Sci. Adv. Anal. (DSAA), Oct. 2017, pp. 283–292.
[36] M. Aloud, "Directional-change event trading strategy: Profit-maximizing learning strategy," in Proc. 7th Int. Conf. Adv. Cogn. Technol. Appl., F. Nice, Ed., 2015, pp. 123–129.
[37] M. Aloud, "Profitability of directional change based trading strategies: The case of Saudi stock market," Int. J. Econ. Financ., vol. 6, no. 1, pp. 87–95, 2016.
[38] M. Aloud, "Investment opportunities forecasting: A genetic programming-based dynamic portfolio trading system under a directional-change framework," J. Comput. Finance, vol. 22, pp. 1–35, Mar. 2017.
[39] M. Aloud and M. Fasli, "Exploring trading strategies and their effects in the foreign exchange market," Comput. Intell., vol. 33, no. 2, pp. 280–307, May 2017.
[40] M. Kampouridis and F. E. B. Otero, "Evolving trading strategies using directional changes," Expert Syst. Appl., vol. 73, pp. 145–160, May 2017.
[41] M. Aloud, "Time series analysis indicators under directional changes: The case of Saudi stock market," Int. J. Econ. Financ., vol. 6, no. 1, pp. 55–64, 2016.
[42] J. Ma, X. Xiong, F. He, and W. Zhang, "Volatility measurement with directional change in Chinese stock market: Statistical property and investment strategy," Phys. A, Stat. Mech. Appl., vol. 471, pp. 169–180, Apr. 2017.
[43] G. Jeong and H. Y. Kim, "Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning," Expert Syst. Appl., vol. 117, pp. 125–138, Mar. 2019.
[44] D. K. Gode and S. Sunder, "Allocative efficiency of markets with zero-intelligence traders: Market as a partial substitute for individual rationality," J. Political Economy, vol. 101, no. 1, pp. 119–137, Feb. 1993.

MONIRA ESSA ALOUD received the B.Sc. degree in information technology and the M.Sc. degree in e-commerce technology from King Saud University, in 2006 and 2008, respectively, and the Ph.D. degree from the School of Computer Science and Electronic Engineering (CSEE), University of Essex, U.K., in 2013. She is currently an Associate Professor with the Department of Management Information Systems, College of Business Administration, King Saud University. She is also a member of the Computational Finance and Economics Research Laboratory, Centre for Computational Finance and Economic Agents (CCFEA), University of Essex. While in CSEE, she worked on research projects with Olsen Ltd. She served as the Dean for the College of Business Administration, Princess Nourah Bint Abdulrahman University, from March 2018 to August 2019. Since her appointment, she has developed and implemented various strategic initiatives, including implementing student engagement and career development programs, launching the Trading Stock Lounge and new Bloomberg Finance Lab, and introducing faculty professional development initiatives and incentives.

NORA ALKHAMEES received the B.Sc. degree in information technology and the M.Sc. degree in information systems from the College of Computer and Information Sciences, King Saud University (KSA), in 2008 and 2011, respectively, and the Ph.D. degree in computer science from the School of Computer Science and Electronic Engineering (CSEE), University of Essex, U.K., in 2019. She is currently working as an Assistant Professor with the Department of Management Information Systems, College of Business Administration, King Saud University.