Article
Reinforcement Learning Pair Trading: A Dynamic
Scaling Approach
Hongshen Yang * and Avinash Malik
Department of ECSE, The University of Auckland, Auckland 1010 , New Zealand; [email protected]
* Correspondence: [email protected]
Abstract: Cryptocurrency is a cryptography-based digital asset with extremely volatile prices. Around
USD 70 billion worth of cryptocurrency is traded daily on exchanges. Trading cryptocurrency
is difficult due to the inherent volatility of the crypto market. This study investigates whether
Reinforcement Learning (RL) can enhance decision-making in cryptocurrency algorithmic trading
compared to traditional methods. In order to address this question, we combined reinforcement
learning with a statistical arbitrage trading technique, pair trading, which exploits the price difference
between statistically correlated assets. We constructed RL environments and trained RL agents to
determine when and how to trade pairs of cryptocurrencies. We developed new reward shaping
and observation/action spaces for reinforcement learning. We performed experiments with the
developed reinforcement learner on pairs of BTC-GBP and BTC-EUR data separated by 1 min
intervals (n = 263,520). The traditional non-RL pair trading technique achieved an annualized
profit of 8.33%, while the proposed RL-based pair trading technique achieved annualized profits
from 9.94% to 31.53%, depending upon the RL learner. Our results show that RL can significantly
outperform manual and traditional pair trading techniques when applied to volatile markets such
as cryptocurrencies.
Keywords: pair trading; reinforcement learning; algorithmic trading; deep learning; cryptocurrency
1. Introduction
Arbitrage is a subdomain of financial trading that profits from price discrepancies in different markets (Dybvig and Ross 1989). Pair trading is one of the well-known arbitrage trading methods in financial markets. Arbitrageurs identify two highly correlated assets to form a pair. When a price discrepancy happens, they buy the underpriced asset and sell the overpriced correlated asset to profit from the mean reversion of the prices. With the rise of high-frequency trading, the ability to conduct fast and accurate analyses has become critical. Arbitrage requires practitioners to constantly analyze market conditions at the fastest speed possible, as arbitrageurs must compete for transitory opportunities (Brogaard et al. 2014). Therefore, we explore how Artificial Intelligence (AI) can enhance the process of pair trading, focusing on the speed and adaptability of decision-making.
Reinforcement Learning (RL) is a captivating domain of AI. The idea of RL is to
let the agent(s) learn to interact with an environment. The agent should learn from the
environment’s responses to optimize its behavior (Sutton and Barto 2018). If we view the
financial market from the perspective of the RL environment, actions in the financial market are investment decisions. By allowing agents to adapt dynamically to market conditions, RL has the potential to overcome the limitations of static, rule-based strategies in volatile and complex financial environments. To gain profits, arbitrageurs are incentivized to train agents to produce lucrative investment decisions, and RL facilitates the agents' learning process through the market's profit/loss feedback.
The combination of RL and various financial trading techniques is still evolving rapidly. There has been some work on RL infrastructure construction (Liu et al. 2021, 2022a, 2022b)
and some experiments in profitable RL agent training (Meng and Khushi 2019; Pricope
2021; Zhang et al. 2020). Trading actions in traditional pair trading follow static rules. In reality, however, the complexity of financial markets calls for more flexibility in the decision-making process. An experienced trader might analyze market conditions to make informed decisions, but no human can produce sound decisions at short, intermittent intervals 24/7. RL algorithms enable a fast-track decision-making process for analyzing trading signals and generating trading actions.
Designing a high-frequency trading system based on RL requires addressing critical
challenges. The first challenge is how to construct an RL environment that accommodates
RL agents for arbitrage. The second challenge involves identifying compatible instruments
with historical correlations to form profitable pairs. The third challenge concerns timing.
Instead of blindly following preset rules, the system requires flexibility in choosing in-
vestment timings to achieve greater profitability. The final challenge involves investment
quantity. Since investment opportunities vary in quality, a critical consideration is whether
RL agents can replicate decision-making capabilities comparable to the scrutiny applied by
experienced traders.
This paper investigates key questions centered around the application of Reinforce-
ment Learning (RL) in pair trading. To address the fast decision-making requirements in a
high-frequency trading environment, we constructed a tailored RL environment for pair
trading and fine-tuned reward shaping to encourage the agent to make profitable decisions.
The contributions of this work are as follows: (1) the construction of an RL environment
specifically designed for quantity-varying pair trading; (2) the proposal of a novel pair trad-
ing method that incorporates adaptive investment quantities to capture opportunities in
highly volatile markets; (3) the use of a grid search technique to fine-tune hyperparameters
for enhanced profitability; (4) the introduction of an RL component for market analysis and
decision-making in pair trading, along with a novel RL model optimized for investment
quantity decisions.
The structure of the paper is arranged as follows: the background and related work are
introduced in Sections 2 and 3. The methodology is presented in Section 4. Experiments and
results are included in Section 5. A discussion of the results and conclusions is provided
in Section 6.
2. Background
First, we define the basic terms of financial trading. A long position is created when an
investor uses cash to buy an asset, and a short position is created when an investor sells a
borrowed asset. The portfolio is the investor’s total holding, including long/short positions and cash. Transaction cost is a percentage fee payable to the broker for any long/short action. Finally, risk is defined as the volatility of the portfolio.
The sum of squared differences (SSD) between the paired price series over the formation period T, Equation (1), measures how closely the two assets co-move:

SSD_{p_i, p_j} = \sum_{t=1}^{T} (p_i - p_j)^2 . (1)
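As an illustration of Equation (1), the following is a minimal NumPy sketch of the SSD computation between two (typically normalized) price series; the function and variable names are placeholders rather than code from the paper.

```python
import numpy as np

def sum_squared_differences(p_i: np.ndarray, p_j: np.ndarray) -> float:
    """Sum of squared differences between two price series (Equation (1)).

    In practice both series are usually normalized (e.g., rebased to 1 at the
    start of the formation period) so that assets with different price levels
    remain comparable; a lower SSD indicates a more tightly co-moving pair.
    """
    return float(np.sum((p_i - p_j) ** 2))
```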
• Observe is the process of market analysis. The prices of the assets in the pair are collected and processed. The price difference (p_i − p_j) is called the spread S. Arbitrageurs observe the current positions and the spread of the market.
• Orient is the process of exploring what could be done. The three possible actions for pair trading are long leg, short leg, and close position, as defined above.
• Decide what action to take. Position opening triggers when the price difference devi-
ates too much. This is indicated by the spread movement beyond an open threshold.
Position closing happens when the spread reverts back to some closing threshold.
Gatev et al. (2006) adopted two times the standard deviation of the spread as the
opening threshold and the price crossing as the closing threshold. In practice, the
threshold varies according to the characteristics of the financial instrument.
• Act once the decision is made. The long leg orders us to buy asset i and sell asset j.
The short leg orders us to sell asset i and buy asset j. Closing a position means clearing
all the active positions to hold cash only.
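The Decide and Act steps above can be summarized as a fixed-threshold rule in the spirit of Gatev et al. (2006). The sketch below assumes the spread has already been converted to a z-score; the threshold values and names are illustrative, not the authors' implementation.

```python
def pair_trading_action(z: float, position: str,
                        open_threshold: float = 2.0,
                        close_threshold: float = 0.0) -> str:
    """Map the current spread z-score and position to a pair trading action.

    position is "flat", "long_leg" (buy asset i / sell asset j), or
    "short_leg" (sell asset i / buy asset j).
    """
    if position == "flat":
        if z <= -open_threshold:
            return "long_leg"    # spread unusually low: buy i, sell j
        if z >= open_threshold:
            return "short_leg"   # spread unusually high: sell i, buy j
        return "hold"
    # An open position is closed once the spread reverts past the close threshold.
    if position == "long_leg" and z >= -close_threshold:
        return "close"
    if position == "short_leg" and z <= close_threshold:
        return "close"
    return "hold"
```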
A graphical visualization of pair trading is presented in Figure 1. Figure 1a shows
the market interactions according to the Spread (S) and thresholds. A position is opened
whenever the spread deviates beyond the open threshold. The position closure happens
when the spread reverts below the close threshold. Figure 1b, which shares the same time
axis with (a), is a stretched view of (a). It presents the corresponding actions with the
crossing of the Spread (S) and zones. The spread deviations are classified into zones based on the Spread (S), the Open-Threshold (OT), and the Close-Threshold (CT) (Equation (2)).
Figure 1. Stretched pair trading view of price distance between pi and p j . Figure (b), which shares
the same time axis with (a), is a stretched view of (a). It presents the same corresponding actions with the crossing of the Spread (S) and zones in two different views.
3. Related Work
3.1. Reinforcement Learning in Algorithmic Trading
Reinforcement learning in AlphaGo captured the world’s attention in 2016 by par-
ticipating in a series of machine versus human competitions on the board game GO
(Silver and Hassabis 2016). Surprisingly, the research regarding RL in the financial market
started long before that. Recurrent reinforcement learning studies were the mainstream
works (Bertoluzzo and Corazza 2007; Gold 2003; Maringer and Ramtohul 2012; Zhang and
Maringer 2016) in the early stage of financial trading. After the upsurge of AlphaGo, some
significant advancements were brought to RL trading as well; Huang (2018) re-described the financial market as a Markov Decision Process (MDP) and framed trading as a game in order to apply RL. Pricope (2021) proposed deep RL agents to develop
profitable high-frequency trading strategies with sequential model-based optimization
tuning the hyperparameters. With the recent development, newer RL models such as Deep
Q-Learning (DQN), Policy Gradients (PG), and Advantage Actor-Critic (A2C) have also
been introduced by researchers (Meng and Khushi 2019; Mohammadshafie et al. 2024;
Zhang et al. 2020) for financial trading. A noteworthy research work is that of the FinRL
group in the infrastructures and ensemble learning mechanism (Liu et al. 2021, 2022a,
2022b).
4. Methodology
In this section, we introduce the architecture of the methodology (Figure 2). The
architecture includes five steps: (1) pair formation for selecting assets to form a tradeable
pair (Section 4.1); (2) spread calculation utilizing the moving-window technique to extract
the spread in a limited retrospective time frame (Section 4.2); (3) parameter selection from an
historical dataset to decide the most suitable hyperparameters for pair trading (Section 4.2);
(4) RL trading by allowing RL to decide the trading timing and quantity in pair trading
(Section 4.4); (5) investment action for taking the actions produced from RL trading into
market execution.
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} , (4)
where ρ X,Y is the correlation coefficient between assets X and Y, cov( X, Y ) is the covariance
of X and Y, and σX and σY are the standard deviations of X and Y, respectively. The
Engle–Granger cointegration test (Burgess 2003; Dunis and Ho 2005) involves two steps.
First, the linear regression is performed:
Y_t = \alpha + \beta X_t + \epsilon_t , (5)
where Yt and Xt are the asset price series, α and β are the regression coefficients, and ϵt is the
residual term. The second step tests the residuals ϵt for stationarity using an Augmented
Dickey-Fuller (ADF) (Dickey and Fuller 1979) test. The ADF test regression is given in
Equation (6):
\Delta \epsilon_t = \gamma \epsilon_{t-1} + \sum_{i=1}^{p} \delta_i \, \Delta \epsilon_{t-i} + \nu_t , (6)
where ∆ϵt is the first difference of the residuals, γ is the coefficient tested for stationarity, p is the number of lagged difference terms included, and νt is the error term. If γ is significantly less than zero (i.e., the unit-root null hypothesis is rejected), the residuals are stationary, indicating co-integration.
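A sketch of the two-step Engle–Granger procedure of Equations (5) and (6) using statsmodels, with the Pearson correlation of Equation (4) included for convenience; the 5% significance level and the returned dictionary format are assumptions of this illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def engle_granger(y: np.ndarray, x: np.ndarray, alpha: float = 0.05) -> dict:
    """Two-step Engle-Granger test: OLS regression, then ADF on the residuals."""
    X = sm.add_constant(x)                        # columns [1, X_t] for alpha + beta * X_t
    ols = sm.OLS(y, X).fit()                      # Equation (5)
    adf_stat, p_value, *_ = adfuller(ols.resid)   # Equation (6): ADF on the residuals
    return {
        "beta": float(ols.params[1]),
        "correlation": float(np.corrcoef(x, y)[0, 1]),   # Equation (4)
        "adf_pvalue": float(p_value),
        "cointegrated": bool(p_value < alpha),
    }

# statsmodels.tsa.stattools.coint wraps the same two steps with critical values
# adjusted for the fact that the residuals come from an estimated regression.
```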
A moving window is applied to historical pricing data, as shown in Figure 3. In
this figure, the blue line represents the historical prices, while the dashed boxes illustrate
the moving window. During the selection phase, averaged correlation and co-integration
batches are employed to ensure that the selected assets exhibit a strong, long-term statisti-
cal relationship.
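A sketch of this batch-wise selection, under the assumption (consistent with Table 2 below) that the reported "coint" value is the fraction of windows whose residuals test as co-integrated and "corr" is the correlation averaged over windows; the window and step sizes are placeholders.

```python
import numpy as np
from statsmodels.tsa.stattools import coint

def batch_pair_statistics(p_i: np.ndarray, p_j: np.ndarray,
                          window: int = 900, step: int = 900,
                          alpha: float = 0.05) -> tuple[float, float]:
    """Average correlation and fraction of co-integrated windows for a candidate pair."""
    corrs, coint_flags = [], []
    for start in range(0, len(p_i) - window + 1, step):
        y = p_i[start:start + window]
        x = p_j[start:start + window]
        corrs.append(np.corrcoef(x, y)[0, 1])
        _, p_value, _ = coint(y, x)          # Engle-Granger test on this window
        coint_flags.append(p_value < alpha)
    return float(np.mean(corrs)), float(np.mean(coint_flags))
```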
The spread is modelled as the residual of a linear regression between the paired prices, Equation (7):

p_i = \beta_0 + \beta_1 \, p_j + s_i , \quad s_i \sim N(0, \sigma^2). (7)
We normalize the spread with the z-score in Equation (8) so that it has zero mean and unit standard deviation within the sliding window. The mean of the spread in the sliding window is denoted s̄ and its standard deviation σs:

Z = \frac{s - \bar{s}}{\sigma_s} . (8)
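A pandas sketch of the spread extraction and z-score normalization in Equations (7) and (8). Whether β0 and β1 are re-estimated inside every window or fitted once over the formation period is a design choice; the sketch fits them once for brevity, and the default window of 900 intervals matches the value selected later by the grid search.

```python
import pandas as pd
import statsmodels.api as sm

def rolling_zscore_spread(p_i: pd.Series, p_j: pd.Series, window: int = 900) -> pd.Series:
    """Spread as the regression residual (Equation (7)), z-scored in a sliding
    window (Equation (8))."""
    X = sm.add_constant(p_j)
    beta = sm.OLS(p_i, X).fit().params           # beta_0, beta_1 of Equation (7)
    spread = p_i - (beta.iloc[0] + beta.iloc[1] * p_j)
    s_bar = spread.rolling(window).mean()        # s-bar over the sliding window
    sigma_s = spread.rolling(window).std()       # sigma_s over the sliding window
    return (spread - s_bar) / sigma_s            # Equation (8)
```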
• Adjust position happens when a previous position is open, and the RL agent wants
to open another position. For example, if the current position is a 70% long-leg and
the new action is A = 0.8, only the extra 10% shall be actioned.
• Action reward means the agent needs to be rewarded for taking a desired action in
the corresponding zone. In general, the agent is free to decide on any action. However,
we use action reward to encourage the agent to choose desired actions. It rewards
the agent for making a desired action in certain zones (Table 1) with some freedom
in neutral zones. The stronger the action reward, the more it resembles traditional
pair trading.
• Transaction punishment is a negative reward for encouraging small adjustments
instead of large changes in the position. The punishment is the difference between the
action and position. If the current position in observation is P and the action is A, the
transaction punishment is given by Equation (10).
where ∆Pt represents the profit or loss from the trade, and c is the transaction cost.
The reward function penalizes the agent for transaction costs while directly linking
rewards to trade profitability.
• The goal is to maximize the cumulative discounted reward:
R_1 = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma_1^t \, r_1(s_t, a_t) \right].
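Putting the reward components described above together, the following is a hedged per-step reward sketch combining the action reward for zone-consistent behaviour, a transaction punishment proportional to |A − P|, and profit net of transaction costs. The weights and the zone-derived desired action are illustrative assumptions, not the paper's exact reward equations.

```python
def step_reward(action: float, position: float, desired_action: float,
                pnl: float, cost_rate: float = 0.0002,
                w_action: float = 0.1, w_turnover: float = 0.1) -> float:
    """Illustrative reward for one environment step.

    action and position lie in [-1, 1]; desired_action comes from the zone
    table (e.g., Table 1); pnl is the mark-to-market profit/loss of the step.
    """
    action_reward = -w_action * abs(action - desired_action)   # encourage zone-consistent actions
    turnover = abs(action - position)                           # size of the position adjustment
    transaction_punishment = -w_turnover * turnover             # discourage large position swings
    transaction_cost = cost_rate * turnover                     # fee proportional to traded size
    return pnl - transaction_cost + action_reward + transaction_punishment
```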
RL2 extends RL1 by shifting focus from trade timing to determining the investment
quantity for a given trade opportunity. It models pair trading as an MDP defined by
(S2 , A2 , T2 , r2 , γ2 ):
• S2 is the state space, which is identical to RL1 .
• A2 = [−1, 1], where the continuous value represents the investment quantity. Here,
0 stands for no involvement, positive values represent buying, and negative values
represent selling.
• r2 (s, a) is the reward function, in which c(| at |) represents transaction costs proportional to the absolute investment size
| at |. This reward structure incentivizes the agent to optimize both the direction and
magnitude of its investment.
• The objective is to maximize the cumulative discounted reward:
R_2 = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma_2^t \, r_2(s_t, a_t) \right].
The primary differences between RL1 and RL2 lie in their action spaces and reward
functions. RL1 operates with a discrete action space and focuses on optimizing directional
timing and trade management. In contrast, RL2 uses a continuous action space A2 = [−1, 1],
enabling it to adjust investment sizes dynamically. The environments are designed to
guide the agents by rewarding profitable actions and penalizing costly ones, encouraging
effective decision-making for timing and quantity. The exact mechanisms driving these
decisions are embedded within the neural network, shaped by the agent’s interactions with
the environment.
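To make the environment design concrete, the following condensed sketch shows how an RL2-style environment could be expressed with the Gymnasium API: a continuous action in [−1, 1] interpreted as the target position, an observation of the z-scored spread and current position, and a reward of spread profit net of a size-proportional cost. This is a simplified illustration under those assumptions, not the authors' implementation.

```python
import gymnasium as gym
import numpy as np

class PairTradingEnv(gym.Env):
    """Simplified RL2-style environment: the action is the target position in [-1, 1]."""

    def __init__(self, zscore: np.ndarray, spread: np.ndarray, cost_rate: float = 0.0002):
        super().__init__()
        self.zscore, self.spread, self.cost_rate = zscore, spread, cost_rate
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.position = 0, 0.0
        return np.array([self.zscore[self.t], self.position], dtype=np.float32), {}

    def step(self, action):
        target = float(np.clip(action[0], -1.0, 1.0))
        cost = self.cost_rate * abs(target - self.position)   # fee only on the adjusted size
        self.position = target
        self.t += 1
        pnl = self.position * (self.spread[self.t] - self.spread[self.t - 1])
        reward = pnl - cost                                    # profit net of transaction cost
        terminated = self.t >= len(self.spread) - 1
        obs = np.array([self.zscore[self.t], self.position], dtype=np.float32)
        return obs, reward, terminated, False, {}
```

A discrete RL1-style variant would instead use gym.spaces.Discrete over actions such as long leg, short leg, close, and hold.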
5. Benchmark Results
Next, we carry out experiments using the proposed methodology. We adopt the
same dataset and the same parameters for non-RL pair trading and RL pair trading for
comparison purposes.
5.1.1. Datasets
We apply our trading methodology on Binance, the largest cryptocurrency market.1 For the best market liquidity, we picked Bitcoin–fiat currency pairs under different trading intervals for pair trading. Pair formation is based on Pearson's correlation augmented with the Engle–Granger two-step cointegration test (Section 4.1), applied to quote currencies that follow a similar trend against the base currency (Figure 5). The formation period
is from October 2023 to November 2023, and the test is in December 2023, with trading
intervals of 1 min (121,500 entries), 3 min (40,500 entries), and 5 min (24,300 entries), respec-
tively. We exhaustively compared correlation and co-integration for the best pair (Table 2).2
Although Binance has quite a few fiat currencies, only the US Dollar (USD), Great British
Pound (GBP), Euro (EUR), and Russian Ruble (RUB) display relatively strong liquidity.
The pair with the strongest correlation and co-integration is BTCEUR and BTCGBP under a
1 min trading interval (Table 2).
Table 2. Co-integration (coint) and correlation (corr) of candidate pairs under 1 min, 3 min, and 5 min trading intervals.

Pairs            1m coint   1m corr   3m coint   3m corr   5m coint   5m corr
BTCEUR-BTCGBP    0.5667     0.8758    0.4667     0.8759    0.4667     0.8754
BTCEUR-BTCRUB    0.3333     0.8417    0.3333     0.8417    0.3167     0.8416
BTCEUR-BTCUSD    0.1667     0.9328    0.2000     0.9327    0.2000     0.9329
BTCGBP-BTCRUB    0.3500     0.7606    0.3333     0.7608    0.3333     0.7603
BTCGBP-BTCUSD    0.4833     0.8404    0.4167     0.8403    0.4000     0.8403
BTCRUB-BTCUSD    0.4000     0.8538    0.3333     0.8539    0.3500     0.8543
The transaction cost in the experiment is set to a 0.02% commission based on Binance's fee scheme.3 The 0.02% transaction cost is a flat percentage charge for transactions in both directions. A pair trading leg, which goes long the first asset and short the second asset, is therefore charged for both the long and the short action.
During the training period, the most profitable parameter set is open threshold = 1.8
z-score, close threshold = 0.4 z-score, and window size = 900 intervals. Some example results
of the grid search are presented in Table 3.
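For reference, a minimal sketch of such a grid search; the candidate grids and the backtest_profit routine are placeholders (the reported optimum of 1.8 / 0.4 / 900 lies on the example grid).

```python
from itertools import product

# Hypothetical candidate grids for the three hyperparameters.
OPEN_THRESHOLDS = [1.2, 1.5, 1.8, 2.0, 2.5]      # z-score
CLOSE_THRESHOLDS = [0.0, 0.2, 0.4, 0.6]          # z-score
WINDOW_SIZES = [300, 600, 900, 1800]             # intervals

def grid_search(backtest_profit):
    """backtest_profit(open_t, close_t, window) -> profit over the training period."""
    best = None
    for open_t, close_t, window in product(OPEN_THRESHOLDS, CLOSE_THRESHOLDS, WINDOW_SIZES):
        profit = backtest_profit(open_t, close_t, window)
        if best is None or profit > best[0]:
            best = (profit, open_t, close_t, window)
    return best   # (profit, open_threshold, close_threshold, window_size)
```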
The setup of RL-based pair trading relies on these parameters. The window size
decides the retrospective length of the spread, and the thresholds decide the zones. Al-
gorithms such as PPO and A2C are applicable to both discrete and continuous action
spaces. Some algorithms, e.g., DQN, can only be used in a discrete space, while DDPG is only applicable in a continuous space. Therefore, we adopt PPO, DQN, and A2C in RL1 pair trading, which decides the timing, and PPO, A2C, and SAC in RL2 pair trading, which decides both the timing and the investment quantity. The algorithms are adopted from the Stable-Baselines3 collection (Raffin et al. 2021).
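A brief sketch of how agents from Stable-Baselines3 could be trained on such an environment; the training budget and the default MlpPolicy hyperparameters are assumptions, and SAC applies only to the continuous-action (RL2) setting, while DQN would require the discrete-action (RL1) variant.

```python
from stable_baselines3 import A2C, PPO, SAC

def train_agents(env, timesteps: int = 200_000) -> dict:
    """Train one agent per algorithm on a (continuous-action) pair trading environment."""
    agents = {}
    for name, algo in {"PPO": PPO, "A2C": A2C, "SAC": SAC}.items():
        model = algo("MlpPolicy", env, verbose=0)
        model.learn(total_timesteps=timesteps)
        agents[name] = model
    return agents
```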
\text{Sharpe Ratio} = \frac{R_p - R_f}{\sigma_p} . (13)
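For concreteness, a small sketch of Equation (13) computed on per-interval portfolio returns; the annualization convention for 1 min data and the use of the 5.5% rate from Note 4 as R_f are assumptions of this illustration.

```python
import numpy as np

def annualized_sharpe(returns: np.ndarray, risk_free_annual: float = 0.055,
                      periods_per_year: int = 365 * 24 * 60) -> float:
    """Annualized Sharpe ratio (Equation (13)) for 1-minute portfolio returns."""
    rf_per_period = risk_free_annual / periods_per_year   # de-annualized risk-free rate
    excess = returns - rf_per_period                      # R_p - R_f per interval
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std())
```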
We also care about the strategies’ activities, such as the order count and win/loss ratio.
The indicators used for comparison are presented in Table 4.
Behavior-wise, PPO, DQN, and SAC tend to conduct excessive transactions that are not profitable. On the contrary, A2C has fewer trades but a higher profit on each trade. RL2 pair trading shows even fewer total actions because of the adjust-position action, where we do not count a position adjustment as a trade until the position is closed. Apart from the
result in Table 5, the portfolio growth trend with the best-performing RL algorithm agent is
presented in Figure 6 (a comparison with Gatev et al. (2006) is provided in Figure A1a in the
Appendix). Most of the pair trading experiments, including Gatev et al. (2006), RL1 , and
RL2 , display a stable upturn, which is ideal from the perspective of pair trading. From the
drawdown graphs, we can observe that RL1 produces fewer drawdowns compared to the
non-RL pair trading method from Gatev et al. (2006) and has a significantly higher win/loss
action ratio due to differences in threshold settings. However, RL1 ’s cumulative profit is
not consistently higher, and when transaction fees are zero, its cumulative profit is slightly
lower than that of the Gatev et al. method. RL2 displays the strongest profitability, despite a
lower win/loss action ratio, due to its progressive trading strategy. In general, all three pair
trading methods show the ability to generate stable income in a volatile trading market.
Figure 6. Comparison of portfolio value trends for (a) RL1 pair trading (A2C) and (b) RL2 pair trading (A2C).
The first adoption, RL1 pair trading, grants the agent the flexibility to decide action timing. The second adoption, RL2 pair trading, further gives the agent control over both the timing and the investment quantity.
We compared our approach to the traditional rule-based pair trading technique (Gatev et al. 2006) and a state-of-the-art RL pair trading technique (Kim and Kim 2019) for December 2023 in the cryptocurrency market for BTCEUR and BTCGBP under a standard 0.02% futures transaction cost. Kim and Kim's method does not perform well in the cryptocurrency
world. Gatev et al.’s method achieved 8.33% per trading period. Our first adoption of
the RL1 method achieved 9.94%, and the second adoption of the RL2 method achieved
31.53% returns during the trading period. The outperformance is generally consistent
across different transaction costs. The evaluation metrics show that RL-based techniques
are generally more active than traditional techniques in the cryptocurrency market under
various transaction costs. In general, our trading methods have greater market participation
than Gatev et al.’s traditional rule-based pair trading and Kim and Kim’s threshold-adaptive
RL pair trading (Tables 5 and 6).
Comparison between RL-based pair trading revealed the relationship between prof-
itability and actions. Because financial trading is a special case of the RL environment, every
action in financial trading is punished by the transaction cost. We notice that profitable RL trading often has a lower total trade count and a higher profit per winning trade. That means the RL agent is better at spotting chances to make higher profits. RL2 pair trading produces higher profits because of higher average wins from the position adjustment mechanism. Even when the right trading method is adopted, market volatility and transaction cost play crucial roles in profitable trading. Variable thresholds might not be adaptive to highly volatile
markets, and fixed-threshold pair trading could lead to missing trading opportunities. RL
with dynamic scaling investment could be a good direction in volatile market conditions if
low transaction costs are achievable.
The techniques presented have certain limitations and offer opportunities for future
work. One limitation is the relatively limited dataset scope, which could be expanded to
include more diverse assets and longer timeframes to improve generalization. Additionally,
focusing only on two-leg strategies restricts the potential for complex arbitrage opportu-
nities; expanding to multi-leg strategies would enhance robustness. The computational
demand during training can also be resource-intensive, requiring system parameter tun-
ing. The model lacks consideration for transaction costs, which might impact real-world
profitability. A lack of direct comparison with traditional models is another shortcoming.
Future work could involve developing the Reinforcement Learning (RL) approach to multi-
leg strategies, integrating pair formation into the trading process, cross-validating across
different environments, and experimenting with alternative reward functions to improve
decision-making and risk management.
Author Contributions: Conceptualization, A.M. and H.Y.; methodology, A.M. and H.Y.; software,
A.M. and H.Y.; validation, H.Y.; formal analysis, H.Y.; investigation, H.Y.; resources, A.M. and
H.Y.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing,
A.M.; visualization, H.Y.; supervision, A.M.; project administration, A.M.; funding acquisition, Not
applicable. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: The original data presented in the study are openly available from
Binance Exchange accessed on 8 November 2024 at (https://data.binance.vision/).
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
Figure A1. Comparison of Pair Trading strategies from (a) Gatev et al. (2006) and (b) Kim and Kim
(2019).
Notes
1 https://www.binance.com/en, accessed on 8 November 2024.
2 While calculating the co-integration and correlation, intervals with low volume trades are exempted from the calculation.
3 https://www.binance.com/en/fee/futureFee, accessed on 8 November 2024.
4 We adopt the Federal Reserve interest rate of 5.5%, which is correct as of 13 June 2024.
References
AlMahamid, Fadi, and Katarina Grolinger. 2021. Reinforcement learning algorithms: An overview and classification. Paper presented
at the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Virtual, September 12–17; pp. 1–7.
Bellman, Richard. 1957. A Markovian Decision Process. Journal of Mathematics and Mechanics 6: 679–84. [CrossRef]
Bertoluzzo, Francesco, and Marco Corazza. 2007. Making Financial Trading by Recurrent Reinforcement Learning. In Knowledge-Based
Intelligent Information and Engineering Systems. Edited by Bruno Apolloni, Robert J. Howlett and Lakhmi Jain. Lecture Notes in
Computer Science. Berlin and Heidelberg: Springer, pp. 619–26. [CrossRef]
Brogaard, Jonathan, Terrence Hendershott, and Ryan Riordan. 2014. High-Frequency Trading and Price Discovery. The Review of
Financial Studies 27: 2267–306. [CrossRef]
Burgess, A. Neil. 2003. Using Cointegration to Hedge and Trade International Equities. In Applied Quantitative Methods for Trading and
Investment. Hoboken: John Wiley & Sons, Ltd., pp. 41–69. [CrossRef]
Dickey, David A., and Wayne A. Fuller. 1979. Distribution of the Estimators for Autoregressive Time Series with a Unit Root. Journal of
the American Statistical Association 74: 427–31. [CrossRef]
Do, Binh, and Robert Faff. 2010. Does Simple Pairs Trading Still Work? Financial Analysts Journal 66: 83–95. [CrossRef]
Dunis, Christian L., and Richard Ho. 2005. Cointegration portfolios of European equities for index tracking and market neutral
strategies. Journal of Asset Management 6: 33–52. [CrossRef]
Dybvig, Philip H., and Stephen A. Ross. 1989. Arbitrage. In Finance. Edited by John Eatwell, Murray Milgate and Peter Newman.
London: Palgrave Macmillan UK, pp. 57–71. [CrossRef]
Fadok, David S., John Boyd, and John Warden. 1995. Air power’s quest for strategic paralysis. Proceedings of the School of Advanced
Airpower Studies. Available online: https://media.defense.gov/2017/Dec/27/2001861508/-1/-1/0/T_0029_FADOK_BOYD_
AND_WARDEN.PDF (accessed on 8 November 2024).
Gatev, Evan, William N. Goetzmann, and K. Geert Rouwenhorst. 2006. Pairs Trading: Performance of a Relative Value Arbitrage Rule.
The Review of Financial Studies 19: 797–827. [CrossRef]
Gold, Carl. 2003. FX trading via recurrent reinforcement learning. Paper presented at the 2003 IEEE International Conference
on Computational Intelligence for Financial Engineering, Hong Kong, China, March 20–23, pp. 363–70, ISBN 9780780376540.
[CrossRef]
Haarnoja, Tuomas, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta,
Pieter Abbeel, and et al. 2019. Soft Actor-Critic Algorithms and Applications. arXiv arXiv:1812.05905. [CrossRef]
Han, Weiguang, Jimin Huang, Qianqian Xie, Boyi Zhang, Yanzhao Lai, and Min Peng. 2023. Mastering Pair Trading with Risk-Aware
Recurrent Reinforcement Learning. arXiv arXiv:2304.00364.
Huang, Chien Yi. 2018. Financial Trading as a Game: A Deep Reinforcement Learning Approach. arXiv arXiv:1807.02787. [CrossRef]
Huck, Nicolas. 2010. Pairs trading and outranking: The multi-step-ahead forecasting case. European Journal of Operational Research 207:
1702–16. [CrossRef]
Kim, Taewook, and Ha Young Kim. 2019. Optimizing the Pairs-Trading Strategy Using Deep Reinforcement Learning with Trading
and Stop-Loss Boundaries. Complexity 2019: e3582516. [CrossRef]
Liu, Xiao-Yang, Hongyang Yang, Jiechao Gao, and Christina Dan Wang. 2021. FinRL: Deep Reinforcement Learning Framework to
Automate Trading in Quantitative Finance. Paper presented at the Proceedings of the Second ACM International Conference on
AI in Finance, Virtual Event, November 3–5, pp. 1–9. [CrossRef]
Liu, Xiao-Yang, Hongyang Yang, Qian Chen, Runjia Zhang, Liuqing Yang, Bowen Xiao, and Christina Dan Wang. 2022a. FinRL: A
Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance. arXiv arXiv:2011.09607. [CrossRef]
Liu, Xiao-Yang, Ziyi Xia, Jingyang Rui, Jiechao Gao, Hongyang Yang, Ming Zhu, Christina Dan Wang, Zhaoran Wang, and Jian
Guo. 2022b. FinRL-Meta: Market Environments and Benchmarks for Data-Driven Financial Reinforcement Learning. arXiv
arXiv:2211.03107. [CrossRef]
Lucarelli, Giorgio, and Matteo Borrotti. 2019. A Deep Reinforcement Learning Approach for Automated Cryptocurrency Trading.
In Artificial Intelligence Applications and Innovations. IFIP Advances in Information and Communication Technology. Edited by
John MacIntyre, Ilias Maglogiannis, Lazaros Iliadis and Elias Pimenidis. Cham: Springer International Publishing, pp. 247–58.
[CrossRef]
Mandelbrot, Benoit. 1967. The Variation of Some Other Speculative Prices. The Journal of Business 40: 393–413. [CrossRef]
Maringer, Dietmar, and Tikesh Ramtohul. 2012. Regime-switching recurrent reinforcement learning for investment decision making.
Computational Management Science 9: 89–107. [CrossRef]
Meng, Terry Lingze, and Matloob Khushi. 2019. Reinforcement Learning in Financial Markets. Data 4: 110. [CrossRef]
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013.
Playing Atari with Deep Reinforcement Learning. arXiv arXiv:1312.5602. [CrossRef]
Mohammadshafie, Alireza, Akram Mirzaeinia, Haseebullah Jumakhan, and Amir Mirzaeinia. 2024. Deep Reinforcement Learning
Strategies in Finance: Insights into Asset Holding, Trading Behavior, and Purchase Diversity. arXiv arXiv:2407.09557. [CrossRef]
Perlin, Marcelo. 2007. M of a Kind: A Multivariate Approach at Pairs Trading. Available online: https://doi.org/10.2139/ssrn.952782
(accessed on 8 November 2024).
Perlin, Marcelo Scherer. 2009. Evaluation of pairs-trading strategy at the Brazilian financial market. Journal of Derivatives & Hedge
Funds 15: 122–36. [CrossRef]
Pricope, Tidor-Vlad. 2021. Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review. arXiv arXiv:2106.00123.
[CrossRef]
Raffin, Antonin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. 2021. Stable-baselines3:
Reliable reinforcement learning implementations. Journal of Machine Learning Research 22: 1–8.
Sarmento, Simão Moraes, and Nuno Horta. 2020. Enhancing a Pairs Trading strategy with the application of Machine Learning. Expert
Systems with Applications 158: 113490. [CrossRef]
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms.
arXiv arXiv:1707.06347. [CrossRef]
Sharpe, William F. 1964. Capital Asset Prices: A Theory of Market Equilibrium Under Conditions of Risk. The Journal of Finance 19:
425–42. [CrossRef]
Silver, David, and Demis Hassabis. 2016. AlphaGo: Mastering the ancient game of Go with Machine Learning. Available online:
https://research.google/blog/alphago-mastering-the-ancient-game-of-go-with-machine-learning/ (accessed on 8 November
2024).
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Cambridge: MIT Press.
Vergara, Gabriel, and Werner Kristjanpoller. 2024. Deep reinforcement learning applied to statistical arbitrage investment strategy on
cryptomarket. Applied Soft Computing 153: 111255. [CrossRef]
Wang, Cheng, Patrik Sandås, and Peter Beling. 2021. Improving Pairs Trading Strategies via Reinforcement Learning. Paper Presented
at the 2021 International Conference on Applied Artificial Intelligence (ICAPAI), Halden, Norway, May 19–21, pp. 1–7. [CrossRef]
Yang, Hongshen, and Avinash Malik. 2024. Optimal market-neutral currency trading on the cryptocurrency platform. arXiv
arXiv:2405.15461. [CrossRef]
Zhang, Jin, and Dietmar Maringer. 2016. Using a Genetic Algorithm to Improve Recurrent Reinforcement Learning for Equity Trading.
Computational Economics 47: 551–67. [CrossRef]
Zhang, Zihao, Stefan Zohren, and Stephen Roberts. 2020. Deep Reinforcement Learning for Trading. The Journal of Financial Data Science 2: 25–40. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.