Reinforcement Learning for Finance: A Review
* Master in Finance. Research professor, Observatorio de Economía y Operaciones Numéricas (ODEON), Universidad Externado de Colombia, Bogotá (Colombia). [diego.leon@uexternado.edu.co], [ORCID ID: 0000-0003-1434-7569].
Article received: April 26, 2023
Accepted: June 26, 2023
Abstract
This paper provides a comprehensive review of the application of Reinforcement Learning (RL) in the domain of finance, shedding light on the progress achieved and the challenges that lie ahead. RL, a subfield of machine learning, has been instrumental in solving complex financial problems by enabling decision-making processes that optimize long-term rewards, and it can be used to train agents to make decisions in complex environments. In finance, RL has been applied to a variety of problems, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising. In this paper, we review recent developments in RL for finance. We begin by introducing RL and Markov decision processes (MDPs), the mathematical framework for RL. We then discuss the various RL algorithms that have been used in finance, with a focus on value-based and policy-based methods, as well as the use of neural networks in RL for finance. Finally, we discuss the results of recent studies that have applied RL to financial problems, and we conclude with the challenges and opportunities for future research in RL for finance.
Key words: Reinforcement learning; machine learning; Markov decision
process; finance.
JEL classification: G10, G12, G13.
Introduction
1. Reinforcement Learning
leads to a low reward, the agent changes the policy so that it generates a different action and ultimately obtains greater benefits.
The third element of RL is the long-term value function estimated by the agent. Reward signals are the immediate results of previous actions, while the value function captures the long-term consequences of the decisions made. The value of a state is the total amount of reward that an agent can expect to accumulate in the future starting from that state, whereas rewards indicate the immediate, intrinsic desirability of environmental states (Sutton & Barto, 2018). The value function becomes central in later studies because, in algorithmic applications, it is the value function that drives the decision-making that maximizes benefits.
The last element is the model of the environment, built from the situations proposed in order to allow low-risk experimentation. It consists of a set of inferences about the possible behavior of the environment: given a state and an action, the model can predict the next state and the next reward.
In this way, the components that make up the actors of RL can be summarized. In addition, a set of criteria related to the balance between exploration and exploitation, the acceleration of the learning process, and generalization influences learning.
A defining characteristic of RL is that the agent is responsible for determining the strategy used to explore the environment and thus controls the training examples through its sequence of actions. The agent must therefore find a balance between exploring new states to obtain new information and exploiting actions that have already been learned and that yield a large reward, which guarantees an accumulated reward (Kaelbling et al., 1996).
Since it is impossible to explore and exploit simultaneously with a single action selection, a conflict arises between exploration and exploitation, and several proposals exist to balance the two (Kaelbling, 1993). These include the ε-greedy strategy, optimistic initial values, action-selection methods based on the Boltzmann distribution, the interval estimation method, the exploration bonus used in Dyna, and competition maps (Sutton, 1990, 1991).
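To make the first of these strategies concrete, the following minimal sketch (ours, in Python with NumPy; the array `q_values` of estimated action values and the exploration rate `epsilon` are illustrative names, not taken from the cited works) shows ε-greedy action selection: with probability ε the agent explores a random action, and otherwise it exploits the action with the highest estimated value.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best estimated action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniformly random action
    return int(np.argmax(q_values))               # exploit: greedy action

# Example: three estimated action values and a 10% exploration rate.
action = epsilon_greedy(np.array([0.1, 0.5, 0.3]), epsilon=0.1)
```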
Regarding the acceleration of learning, the aim is to address the agent's need to repeat actions in order to learn the value function, a weakness that is mitigated by incorporating information predicted by an external observer or by integrating learning with planning. Generalization, in turn, is based on the estimation of optimal values defined over the set of states, which can be represented in tabular form when the state space is small (Thrun & Möller, 1991).
In that sense, the elements that interact in RL, shown in Figure 1, operate within an environment of actions and rewards for the agent: starting from state S_t, the agent performs an action a_t in the environment and, as a result, receives a reward r_(t+1) and transitions to the next state S_(t+1) (Kapoor et al., 2022).
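The interaction just described can be summarized as a simple loop. The sketch below is our illustration, not the exact setup of Figure 1; the two-state environment and the random policy are deliberately trivial placeholders. It shows the agent observing S_t, choosing a_t, and receiving r_(t+1) and S_(t+1) from the environment until the episode ends.

```python
import numpy as np

rng = np.random.default_rng(1)

class ToyEnvironment:
    """Two-state toy environment: only action 1 taken in state 1 yields a reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = int(rng.integers(2))     # next state S_{t+1}
        done = rng.random() < 0.1             # episode terminates with 10% probability
        return self.state, reward, done

env = ToyEnvironment()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = int(rng.integers(2))             # placeholder policy pi(a | S_t): random
    state, reward, done = env.step(action)    # environment returns r_{t+1} and S_{t+1}
    total_reward += reward
```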
One conclusion the evaluator must reach is that the agent under study learns to accept success or failure through the processes described above, and that such learning must focus on understanding the environment and its behavior through rewards or punishments. In addition, two cross-cutting stages are recognized when evaluating any model: prediction and control.
On the one hand, based on the associations between stimuli and their consequences, it is possible to evaluate the future under a given policy, without the need to depend on time. On the other hand, control allows the future to be optimized, applying accurate conjectures to find the best policy. The term optimal control refers to the problem of designing a controller for a dynamic system over time, a problem well known in dynamic programming.
following figures show the relevance this topic has achieved among the academic community in recent years. Figure 2 shows a marked increase in the number of papers published on RL and finance, especially over the last five years, with almost sixty documents published in 2020.
Figure 2. Documents published per year on RL and finance (Source: Scopus).
Figure 3. Documents on RL and finance by type (Source: Scopus).
Figure 3 shows that scientific articles and conference papers account for 85% of the publications on RL and finance, revealing that this is an important topic for cutting-edge research worldwide.
The rapid changes in the finance industry driven by the increasing amount of data have revolutionized data-processing and data-analysis techniques and brought new theoretical and computational challenges. Whereas traditional approaches to financial decision-making rely heavily on model assumptions, reinforcement learning can make full use of large amounts of financial data with fewer model assumptions and thereby improve decisions in complex financial environments. This section reviews recent developments and uses of RL approaches in finance, with a focus on value-based and policy-based methods that do not require an explicit model of the environment. It also discusses the potential benefits of using RL in finance, such as improving decision-making, reducing transaction costs, and capturing complex patterns in financial data.
RL algorithms can be applied to a variety of decision-making problems in finance, including optimal order execution, portfolio optimization, option pricing and hedging, market making, and risk management. They can help develop trading strategies that adapt to changing market conditions and improve the overall performance of a portfolio, and they can reduce transaction and market-impact costs by optimizing the execution of trades. Deep RL algorithms, in particular, can capture complex patterns in financial data and improve the accuracy of predictions (Hambly et al., 2021).
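As a reminder of how the value-based methods referred to here operate, the following sketch (ours; the discretization into ten market states and three actions, sell, hold, and buy, is an illustrative assumption rather than a recipe from the cited survey) shows the tabular Q-learning update that underlies many of these applications.

```python
import numpy as np

n_states, n_actions = 10, 3        # e.g., a discretized market signal x {sell, hold, buy}
alpha, gamma = 0.1, 0.99           # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])

# Example transition: in state 4 the agent buys, earns 0.02, and lands in state 7.
q_update(state=4, action=2, reward=0.02, next_state=7)
```

In deep RL variants, the table Q is replaced by a neural network trained on the same temporal-difference target.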
One of the most promising applications of RL in finance is portfolio management. Hu and Lin (2019) discuss the application of Deep Reinforcement Learning (DRL) to optimizing financial portfolio management. The authors address several research issues related to policy optimization for portfolio management. They propose the use of a deep recurrent neural network (RNN), specifically Gated Recurrent Units (GRUs), to weigh the influence of earlier states and actions on policy optimization in non-Markov decision processes, and they propose a risk-adjusted reward function for the search for an optimal policy.
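A rough sketch of this kind of model is shown below (in Python with PyTorch). It is our simplified reading, not the authors' exact specification: the layer sizes, the long-only softmax weights, and the particular risk adjustment (mean return penalized by volatility and turnover costs) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    """Map a sequence of market features to portfolio weights (long-only, summing to 1)."""
    def __init__(self, n_features, n_assets, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_assets)

    def forward(self, features):                        # features: (batch, time, n_features)
        out, _ = self.gru(features)                     # hidden state for every time step
        return torch.softmax(self.head(out), dim=-1)    # weights: (batch, time, n_assets)

def risk_adjusted_reward(weights, asset_returns, risk_aversion=1.0, cost_rate=0.001):
    """Mean portfolio return penalized by its volatility and by turnover costs."""
    port_returns = (weights * asset_returns).sum(dim=-1)
    turnover = (weights[:, 1:] - weights[:, :-1]).abs().sum(dim=-1).mean()
    return port_returns.mean() - risk_aversion * port_returns.std() - cost_rate * turnover

# One gradient-ascent step on the risk-adjusted reward (random data as a stand-in).
policy = GRUPolicy(n_features=8, n_assets=5)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
features = torch.randn(32, 60, 8)                       # hypothetical feature windows
asset_returns = 0.01 * torch.randn(32, 60, 5)           # hypothetical per-asset returns
loss = -risk_adjusted_reward(policy(features), asset_returns)
loss.backward()
opt.step()
```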
The authors discuss the integration of Reinforcement Learning (RL) and Deep Learning (DL) to leverage their respective capabilities in discovering an optimal policy, and they explore different ways of combining RL approaches with DL methods when solving the policy optimization problem. They also discuss the challenges of applying DRL to finance portfolio management: the impossibility of observing the true state space of the financial world, the need to deal with the non-Markovian dependence on earlier states and actions when learning and estimating future expected rewards, and the need to account for transaction overheads, such as transaction fees and taxes, when computing the risk-adjusted reward function used to obtain total effective rewards.
Finally, they propose using deep RNNs on the DL side and policy gradient methods on the RL side to search for the parameters of the optimal policy function. They also review various DL and RL combinations and argue for one particular DRL approach as the most suitable for optimizing finance portfolio management. The paper concludes with the intention, as future work, of investigating all types of DL and RL combinations, finding the best one, and uncovering its incentives for financial planning.
Millea and Edalat (2022) address portfolio optimization, the process of selecting a combination of assets whose overall value is expected to increase over time: the goal is to partition the available resources so that the portfolio value grows. The paper presents a hierarchical decision-making architecture for portfolio optimization across multiple markets, combining Deep Reinforcement Learning (DRL) with Hierarchical Risk Parity (HRP) and Hierarchical Equal Risk Contribution (HERC) models. Experiments on the cryptocurrency, stock, and foreign exchange markets show excellent robustness and performance of the overall system.
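The hierarchical idea can be illustrated schematically. In the sketch below (ours; the two low-level allocators are crude stand-ins, not real HRP/HERC implementations, and the high-level rule is a simple score-based choice rather than a trained DRL agent), low-level models each propose portfolio weights and a high-level decision-maker selects which proposal to follow in each period.

```python
import numpy as np

rng = np.random.default_rng(7)

def hrp_like_weights(returns):
    """Placeholder for Hierarchical Risk Parity: inverse-volatility weights."""
    inv_vol = 1.0 / (returns.std(axis=0) + 1e-8)
    return inv_vol / inv_vol.sum()

def herc_like_weights(returns):
    """Placeholder for Hierarchical Equal Risk Contribution: equal weights."""
    return np.full(returns.shape[1], 1.0 / returns.shape[1])

low_level_models = [hrp_like_weights, herc_like_weights]

def high_level_choice(recent_scores, epsilon=0.1):
    """Follow the low-level model with the best recent score, exploring occasionally."""
    if rng.random() < epsilon:
        return int(rng.integers(len(low_level_models)))
    return int(np.argmax(recent_scores))

# Example period: pick a model based on recent scores and use its proposed weights.
returns = rng.normal(0.0, 0.01, size=(250, 4))          # hypothetical asset returns
chosen = high_level_choice(recent_scores=[0.8, 0.5])
weights = low_level_models[chosen](returns)
```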
Another area of finance where Reinforcement Learning is applied is option pricing and hedging with derivatives. The QLBS model (a Q-learner in the Black-Scholes world), proposed by Halperin (2019) and extended in Halperin (2020), learns both the option price and the hedging strategy, in a spirit similar to the mean-variance portfolio optimization framework, using Q-learning algorithms.
Buehler et al. (2019) used deep neural networks to approximate an optimal hedging strategy under market frictions, including transaction costs, using convex risk measures. They showed that their method accurately recovers the optimal hedging strategy in the Heston model without transaction costs and that it can be used to study numerically the impact of proportional transaction costs on option prices.
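A stripped-down version of this idea is sketched below. For brevity it simulates geometric Brownian motion instead of the Heston model, uses a small feed-forward network for the hedge ratio at each step, charges proportional transaction costs, and minimizes an entropic-style convex risk measure of the terminal hedging loss; every name and parameter value is an illustrative assumption rather than the authors' setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_paths, n_steps, dt = 5000, 30, 1.0 / 30.0
s0, sigma, strike, cost_rate = 100.0, 0.2, 100.0, 0.002

# Simulate GBM price paths under a zero interest rate.
z = torch.randn(n_paths, n_steps)
log_increments = (-0.5 * sigma**2) * dt + sigma * dt**0.5 * z
paths = s0 * torch.exp(torch.cat([torch.zeros(n_paths, 1),
                                  torch.cumsum(log_increments, dim=1)], dim=1))

hedge_net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(hedge_net.parameters(), lr=1e-3)

for epoch in range(200):
    pnl = -torch.clamp(paths[:, -1] - strike, min=0.0)   # short call payoff at maturity
    prev_delta = torch.zeros(n_paths, 1)
    for t in range(n_steps):
        features = torch.stack([paths[:, t] / s0,
                                torch.full((n_paths,), (n_steps - t) * dt)], dim=1)
        delta = torch.sigmoid(hedge_net(features))        # hedge ratio in [0, 1]
        pnl = pnl + delta.squeeze(1) * (paths[:, t + 1] - paths[:, t])
        pnl = pnl - cost_rate * paths[:, t] * (delta - prev_delta).abs().squeeze(1)
        prev_delta = delta
    loss = torch.logsumexp(-pnl, dim=0)                   # entropic-style convex risk measure
    opt.zero_grad()
    loss.backward()
    opt.step()
```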
Cannelli et al. (2020) formulated the optimal hedging problem as a Risk-averse Contextual Multi-Armed Bandit (R-CMAB) model and proposed a deep CMAB algorithm involving Thompson sampling. They showed that their algorithm outperforms DQN in terms of sample efficiency and hedging error when compared with delta hedging. Cao et al. (2021) considered Q-learning and Deep Deterministic Policy Gradient (DDPG) for the problem of hedging a short position in a call option when there are transaction costs. The objective function is a weighted sum of the expected hedging cost and the standard deviation of the hedging cost. They showed that their approach achieves a markedly lower expected hedging cost, at the price of a slightly higher standard deviation of the hedging cost, when compared with delta hedging.
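Written out, the objective described above (using our own symbols: C for the total hedging cost and c for the trade-off weight) takes roughly the form

```latex
Y \;=\; \mathbb{E}[C] \;+\; c\,\sqrt{\mathbb{E}[C^{2}] - \mathbb{E}[C]^{2}}, \qquad c \ge 0,
```

so that larger values of c place more emphasis on the variability of the hedging cost relative to its expected level.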
For American options, the key challenge is to find the optimal exercise strategy, which determines when to exercise the option and thus determines its price. Li et al. (2009) used the Least-Squares Policy Iteration (LSPI) algorithm and the Fitted Q-learning algorithm to learn exercise policies for American options.
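The exercise-policy problem can be made concrete with a regression-based backward pass. The sketch below is not the LSPI or fitted-Q procedure of Li et al.; it is a simpler least-squares Monte Carlo illustration (our assumptions: an American put on a GBM underlying and polynomial features in the price) of how an exercise rule emerges from comparing the immediate payoff with an estimated continuation value.

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n_steps = 10000, 50
s0, strike, r, sigma, dt = 100.0, 100.0, 0.03, 0.2, 1.0 / 50.0

# Simulate risk-neutral GBM paths for the underlying asset.
z = rng.normal(size=(n_paths, n_steps))
increments = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
paths = np.hstack([np.full((n_paths, 1), s0), s0 * np.exp(np.cumsum(increments, axis=1))])

payoff = lambda s: np.maximum(strike - s, 0.0)            # American put payoff

# Backward induction: regress discounted future value on polynomial features of the price.
value = payoff(paths[:, -1])
for t in range(n_steps - 1, 0, -1):
    s = paths[:, t]
    features = np.column_stack([np.ones_like(s), s, s**2])
    coeffs, *_ = np.linalg.lstsq(features, value * np.exp(-r * dt), rcond=None)
    continuation = features @ coeffs                      # estimated value of not exercising
    exercise_now = payoff(s) > continuation               # learned exercise rule at time t
    value = np.where(exercise_now, payoff(s), value * np.exp(-r * dt))

price_estimate = np.exp(-r * dt) * value.mean()           # Monte Carlo price at t = 0
```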
Regarding algorithmic trading, Sun and Si (2022) discuss the use of Reinforcement Learning in automated trading to generate buy and sell signals in financial markets; the agent is trained to make optimal decisions based on the current state of the market, the positions held, and the available cash. The paper proposes a novel framework, Supervised Actor-Critic Reinforcement Learning with Action Feedback (SACRL-AF), to address the incomplete fulfillment of buy or sell orders that occurs in certain situations. The framework builds on the Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms and reports state-of-the-art performance in terms of profitability.
Théate and Ernst (2021) present a new approach to the algorithmic trading problem using deep reinforcement learning. The proposed Trading Deep Q-Network (TDQN) algorithm is inspired by the popular DQN algorithm and is adapted to the specifics of algorithmic trading. The RL agent is trained on artificial trajectories generated from a limited set of historical stock market data. The paper also proposes a novel performance assessment methodology to evaluate trading strategies objectively, and promising results are reported for the TDQN algorithm.
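One simple way to stretch a limited price history into many training trajectories, in the spirit of what the authors describe (the exact mechanism below is our assumption, not their procedure), is to sample random sub-windows of the historical series and lightly perturb their returns; each resulting trajectory can then serve as one training episode for a DQN-style agent.

```python
import numpy as np

rng = np.random.default_rng(42)

def artificial_trajectories(prices, length, n_trajectories, noise_scale=1e-3):
    """Sample random sub-windows of a price series and jitter their log-returns."""
    prices = np.asarray(prices, dtype=float)
    log_returns = np.diff(np.log(prices))
    out = []
    for _ in range(n_trajectories):
        start = rng.integers(0, len(log_returns) - length)
        window = log_returns[start:start + length]
        window = window + rng.normal(0.0, noise_scale, size=length)   # small perturbation
        out.append(prices[start] * np.exp(np.cumsum(window)))
    return np.array(out)

# Example: 100 artificial 60-step trajectories from a single simulated history.
history = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=1000)))
batch = artificial_trajectories(history, length=60, n_trajectories=100)
```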
3. Conclusions
approaches show how RL can be used to learn optimal strategies for option pricing and hedging, often outperforming traditional methods; in portfolio optimization and algorithmic trading as well, RL has shown remarkable results compared with traditional methods. However, it is important to note that these methods often require careful tuning and may not be applicable under every market condition. DRL algorithms perform well on multiple markets, including the cryptocurrency, stock, and foreign exchange markets; the system can learn when to switch between the low-level models, and its performance is better than that of the individual models. In addition, future work needs to consider transaction costs, which can have a significant impact on the performance of such systems in practice.
References
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. https://doi.org/10.1109/TPAMI.2013.50

Buehler, H., Gonon, L., Teichmann, J., & Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8), 1271-1291. https://doi.org/10.1080/14697688.2019.1571683

Cannelli, L., Nuti, G., Sala, M., & Szehr, O. (2020). Hedging using reinforcement learning: Contextual k-armed bandit versus Q-learning. Working paper, arXiv:2007.01623.

Cao, J., Chen, J., Hull, J., & Poulos, Z. (2021). Deep hedging of derivatives using reinforcement learning. The Journal of Financial Data Science, 3(1), 10-27. https://doi.org/10.3905/jfds.2020.1.052

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL²: Fast reinforcement learning via slow reinforcement learning. Working paper, arXiv:1611.02779.
Errecalde, M. L., Muchut, A., Aguirre, G., & Montoya, C. I. (2000). Aprendizaje por Refuerzo aplicado a la resolución de problemas no triviales. In II Workshop de Investigadores en Ciencias de la Computación.

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A. A., … & Welty, C. (2010). Building Watson: An overview of the DeepQA project. AI Magazine, 31(3), 59-79. https://doi.org/10.1609/aimag.v31i3.2303

Foerster, J., Assael, I. A., De Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 29, 1-9.

Hambly, B., Xu, R., & Yang, H. (2021). Recent advances in reinforcement learning in finance. Working paper, arXiv:2112.04553. https://arxiv.org/abs/2112.04553

Halperin, I. (2019). The QLBS Q-learner goes NuQlear: Fitted Q iteration, inverse RL, and option portfolios. Quantitative Finance, 19(9), 1543-1553. https://doi.org/10.1080/14697688.2019.1622302

Hu, Y. J., & Lin, S. J. (2019). Deep reinforcement learning for optimizing finance portfolio management. In 2019 Amity International Conference on Artificial Intelligence (AICAI) (pp. 14-20). IEEE. https://doi.org/10.1109/AICAI.2019.8701368

Kapoor, A., Gulli, A., Pal, S., & Chollet, F. (2022). Deep Learning with TensorFlow and Keras: Build and deploy supervised, unsupervised, deep, and reinforcement learning models. Packt Publishing Ltd.
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, 2004. https://doi.org/10.1109/ROBOT.2004.1307456

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539

Li, Y., Szepesvari, C., & Schuurmans, D. (2009). Learning exercise policies for American options. In Artificial Intelligence and Statistics (pp. 352-359). PMLR. https://proceedings.mlr.press/v5/li09d.html

Millea, A., & Edalat, A. (2022). Using deep reinforcement learning with hierarchical risk parity for portfolio optimization. International Journal of Financial Studies, 11(1), 10. https://doi.org/10.3390/ijfs11010010

Nath, S., Liu, V., Chan, A., Li, X., White, A., & White, M. (2020). Training recurrent neural networks online by learning explicit state variables. In International Conference on Learning Representations.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. https://doi.org/10.1038/nature16961

Schlegel, M., Chung, W., Graves, D., Qian, J., & White, M. (2019). Importance resampling for off-policy prediction. Advances in Neural Information Processing Systems, 32.

Sun, Q., & Si, Y. W. (2022). Supervised actor-critic reinforcement learning with action feedback for algorithmic trading. Applied Intelligence, 53, 16875-16892. https://doi.org/10.1007/s10489-022-04322-5

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990 (pp. 216-224). https://doi.org/10.1016/B978-1-55860-141-3.50030-4
Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 160-163. https://doi.org/10.1145/122344.122377

Théate, T., & Ernst, D. (2021). An application of deep reinforcement learning to algorithmic trading. Expert Systems with Applications, 173, 114632. https://doi.org/10.1016/j.eswa.2021.114632

Thrun, S. B., & Möller, K. (1991). Active exploration in dynamic environments. Advances in Neural Information Processing Systems, 4. https://proceedings.neurips.cc/paper/1991/hash/e5f6ad6ce374177eef023bf5d0c018b6-Abstract.html

Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 1635-1685. https://doi.org/10.5555/1577069.1755839

Torres Cortés, L. J., Velázquez Vadillo, F., & Turner Barragán, E. H. (2017). El principio de optimalidad de Bellman aplicado a la estructura financiera corporativa. Caso mexicano. Análisis Económico, 32(81), 151-181.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence.