Multi-Period Portfolio Optimization Using A Deep Reinforcement Learning Hyper-Heuristic Approach
Keywords: Portfolio optimization, Deep reinforcement learning, Hyper-heuristic, Decision making, Uncertainty
Portfolio optimization is concerned with periodically allocating limited funds across a variety of potential assets in order to satisfy investors' risk appetites and return goals. Recently, Deep Reinforcement Learning (DRL) has shown promising capabilities in sequential decision-making problems. However, traditional DRL algorithms operate directly in the space of low-level actions. This scales poorly and becomes intractable on real-world problem instances as the dimensionality of the environment increases. To deal with this, a novel DRL hyper-heuristic framework is proposed in this work for the multi-period portfolio optimization problem. Instead of exploiting the entire action domain, our proposed approach searches over well-developed low-level trading strategies, which makes it more effective. In addition, our proposed approach is data-driven and respects the nature of the problem. It takes advantage of expert domain knowledge and provides the agent with multidimensional states that leverage additional, diverse information from alternative views of the environment. The proposed approach is evaluated on five real-world capital market problem instances. Numerous experimental results demonstrate that our proposed method can achieve notable performance gains compared to state-of-the-art trading strategies as well as a traditional DRL baseline method. The data we used are from five stock indices, covering the period from 2012 to 2022. Our study can have salient policy implications for investment strategy formulation and for the establishment of effective regulatory frameworks.
∗ Corresponding author.
E-mail address: [email protected] (T. Cui).
https://doi.org/10.1016/j.techfore.2023.122944
Received 13 March 2023; Received in revised form 7 September 2023; Accepted 20 October 2023
Available online 2 November 2023
0040-1625/© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
T. Cui et al. Technological Forecasting & Social Change 198 (2024) 122944
1. Introduction
Portfolio optimization plays a vital role in investment companies, hedge funds, banks and other financial institutions. Since the pioneering Modern Portfolio Theory (MPT) framework of Markowitz (Markowitz, 1952), it has received sustained attention from both asset-liability professionals and academics. Essentially, it can be considered the process of periodically reallocating limited funds across a variety of financial assets in order to satisfy investors' risk appetites and return goals. Once different practical trading constraints are involved (Woodside-Oriakhi et al., 2011), the problem becomes a classical NP-hard Combinatorial Optimization Problem (COP) in operational research. Existing studies have largely focused on model-driven approaches. The general process is to first establish a scientific model using a mathematical formulation and then apply various optimization algorithms to solve the model. Typically, there are two main families of approaches for solving model-based portfolio optimization problems. Exact algorithms, such as Branch-and-Bound (Bertsimas and Shioda, 2009) and the Lagrangian relaxation framework (Shaw et al., 2008), are based on the clever and complete enumeration of the solution space. These algorithms can eventually obtain the optimal solution, but their exponential time complexity may make them prohibitive for large problem instances. Alternatively, approximation algorithms, such as metaheuristics (Woodside-Oriakhi et al., 2011) and hybridization methods (Cui et al., 2014), can generate good-quality solutions with small computational effort. However, they need to be revised once the problem statement changes even slightly. In fact, one main issue of the model-based approach is that it normally focuses on deterministic variants of the problem, in which some strong assumptions are often pre-set in the model. For example, the classical MPT-based portfolio optimization model often assumes that financial analysts can obtain perfect market information with perfect accuracy (Wu et al., 2022). Because so many potential sources can affect these estimates, it is difficult even to obtain precise estimates in practice (Bonami and Lejeune, 2009). As a result, the algorithms developed using the model-driven approach may be hard to deploy in unpredictable real-world financial markets because of the high
level of uncertainty. Although some modeling techniques such as stochastic programming can partially address this issue (Cui et al., 2020), they often lead to extremely large models that tend to be intractable for most practical problem instances.
With the advances in sequence-to-sequence learning (Sutskever et al., 2014) and the rise of computing power in recent decades, Deep Learning (DL) methods for portfolio optimization have been revisited to exploit the potential laws of the market. The idea is to train a neural network in a supervised manner to predict price movements or trends, using techniques such as textual analysis (Boubaker et al., 2021) and sentiment analysis (Eachempati et al., 2021). The solutions can then be constructed from the problem specification. It has been shown that the trend forecasting approach is not guaranteed to obtain an optimal portfolio, because the prediction loss differs from the overall objective of the problem (Moody and Saffell, 2001). In addition, the performance of these methods depends heavily on prediction accuracy.
Given the efficient market theory (Fama, 1965, 1970), there is no reliable way to predict the future path of stock prices, since all relevant information should already be reflected in the stock price. The theoretical mechanism behind the efficient market theory centers on the concept of competition. In efficient markets, competition among investors ensures that prices quickly adjust to reflect new information (Laffont and Maskin, 1990). Rational investors actively buy and sell securities, which drives prices towards their fundamental values. Any deviation from the fair price would create an arbitrage opportunity. This attracts investors who quickly exploit the discrepancy and bring prices back in line with underlying values. This constant competition and price adjustment mechanism prevents the persistence of mispricing and, as a result, ensures market efficiency (Lamont and Thaler, 2003).
Nevertheless, numerous studies have attempted to use financial variables, such as book-to-market ratios and dividend yields, to predict future stock prices (Thaler, 1999). On this basis, Ang and Bekaert (2007) demonstrate the weak predictive power of dividend yields in stock price prediction. In addition, Avramov (2002) points out that the predictability of stock prices is highly dependent on model specification. More recently, Pyun (2019) reveals that the predictability of stock prices is time-varying, with some sample periods exhibiting higher predictability than others. As a result, accurate market trend prediction is extremely challenging due to the highly noisy, stochastic and chaotic nature of financial markets (Abedin et al., 2021; Efat et al., 2022; Shajalal et al., 2023). This is especially true for portfolio construction, where model specification and sample selection must be considered. Also, the non-stationary nature of the market may induce many further sources of uncertainty (Ding et al., 2023), which often cause a distribution shift between historical and future data. Moreover, DL-based models normally do not interact with the market and thus lack adaptivity to the real-world financial environment. In recent years, the efficient market theory has been applied to cryptocurrency markets (Chu et al., 2019; Le Tran and Leirvik, 2020; Kang et al., 2022), and other scholars have found that market efficiency is time-varying (Okoroafor and Leirvik, 2022).
According to the efficient market theory, an efficient market corrects mispricing, which eliminates arbitrage opportunities. This motivates our study, which aims to formulate effective portfolios that are based on multiple trading strategies and can yield higher returns. Instead of using trend prediction, our study proposes a deep reinforcement learning (DRL) hyper-heuristic approach to narrow down the search space and improve portfolio optimization. By incorporating domain knowledge and low-level trading strategies, the proposed approach goes beyond pure observations and aims to enhance asset allocation decisions in real-world financial markets. By using a DRL hyper-heuristic method, the study demonstrates how different trading strategies can be combined to achieve better performance. The empirical results complement the efficient market theory by showcasing the trading opportunities that can be captured through portfolio construction, which can act as a channel that eliminates market mispricing.
Recently, Reinforcement Learning (RL) algorithms have proven effective in sequential decision-making problems. The integration of DL and RL (DRL) is receiving a lot of attention due to its outstanding accomplishments across several fields (Silver et al., 2017; Vinyals et al., 2019; Jumper et al., 2021). Generally, RL can be viewed as an approximation of Dynamic Programming (DP) (Bellman, 1957). DP is a general divide-and-conquer technique that solves complex problems by decomposing them into sub-problems with a recursive relationship. After each sub-problem has been solved, DP provides a systematic procedure for combining their results to obtain an overall solution. Technically, a COP can be transformed into an equivalent DP problem. A DP problem has two main properties: overlapping subproblems and optimal substructure. The fundamental mathematical model in RL, the Markov Decision Process (MDP), satisfies both properties. Consequently, RL can provide an appropriate paradigm for solving COPs (in their equivalent DP formulations). In fact, DRL has already shown a promising ability to obtain high-quality solutions to some COPs (Kong et al., 2019; Mazyavkina et al., 2021).
Building a successful portfolio within a vast problem space is difficult, since a stock market contains hundreds of assets. Existing DRL-based approaches operate directly in the space of low-level actions (i.e., asset weights). They can therefore easily suffer from the "curse of dimensionality" and become less efficient as the dimensionality of the environment increases. Motivated by the need to solve large, challenging real-world problems, researchers have investigated various techniques that narrow the search space of DRL. These include Monte Carlo Tree Search (MCTS)-based approaches (Silver et al., 2016; Lee et al., 2018), rule-based approaches (Radaideh and Shirvan, 2021) and hyper-heuristic-based approaches (Zhang et al., 2021). Hyper-heuristics (Burke et al., 2019) can be broadly described as "heuristics to select or generate heuristics". Among these techniques, they have been studied with the main goal of achieving general performance across various problem instances and problem domains. Unlike metaheuristics, which operate directly on specific solutions, hyper-heuristics operate on the heuristic space, which makes them capable of handling cross-domain optimization (Pillay and Qu, 2018). A well-designed hyper-heuristic can often be applied to many different problem instances and domains with little or no modification. This is particularly useful in complex or poorly understood environments such as financial markets. Moreover, hyper-heuristics can provide robust solutions, as they can switch between different heuristics as required, which makes them less susceptible to the weaknesses of any single heuristic. The hyper-heuristic design therefore allows the DRL agent to adapt its trading strategy to the current market situation. The hyper-heuristic framework has demonstrated great success in many traditional COPs (Rahimian et al., 2017; Ahmed et al., 2019). Consequently, it can serve as a promising framework for real-world complex online portfolio optimization problems.
In this work, a novel DRL hyper-heuristic framework is proposed to tackle real-world multi-period portfolio optimization problems. The following contributions can be identified:
- In contrast with traditional DRL approaches that directly exploit the entire action domain, our proposed method searches over well-developed low-level trading strategies. Consequently, it can significantly narrow the search space and avoid low-value explorations. To the best of our knowledge, this is the first time that this approach has been used to address the portfolio optimization problem.
- Apart from the traditional publicly available market information, the state in our proposed approach is augmented with market indicators based on expert domain knowledge. These market indicators can
make use of more diversified information from alternative viewpoints of the market environment, as well as additional high-level and robust information, to improve asset allocation decisions. Such information might not be adequately captured when only raw observations are used.
- The proposed DRL hyper-heuristic framework is evaluated on five real-world problem instances. Numerous experimental results demonstrate that our proposed method can achieve notable performance gains compared with state-of-the-art trading strategies as well as the DRL baseline method.
Besides, DRL is a cutting-edge technique that can be applied in financial trading and portfolio construction. Furthermore, portfolio construction is a high-stakes decision for financial institutions and corporations. A reliable technical approach for constructing portfolios is therefore one of the key technological factors that will change financial markets in the future. As a result, how to adopt such a technique credibly could be highly attractive to readers interested in the future technological evolution of financial markets. Moreover, technological innovations drive significant disruptions across industries, creating both challenges and opportunities for investors. Our multi-period portfolio optimization study can shed light on the impact of technological changes on investment decision-making. For instance, the rise of artificial intelligence and automation has transformed traditional industries, creating new investment opportunities in sectors such as machine learning and cybersecurity. By leveraging insights from our multi-period optimization models, investors can identify optimal asset allocations and risk management strategies to capitalize on these emerging trends.
Furthermore, our DRL-based approach concentrates on the optimal allocation of assets over multiple periods. This allows investors to consider the dynamic nature of financial markets, the impact of changing economic conditions, and the need to rebalance portfolios over time. As a result, our multi-period portfolio optimization model has significant policy implications for investors, financial institutions and regulators. For example, our DRL hyper-heuristic framework can enrich regulatory policies by shedding light on the impact of regulations on portfolio decisions and market dynamics. Based on this framework, policymakers can undertake sensitivity analyses that examine the effect of capital requirements, margin regulations or transaction taxes on portfolio optimization strategies. By understanding the implications of such regulations, policymakers can design more effective and efficient regulatory frameworks.
Our application of the DRL hyper-heuristic to multi-period portfolio optimization offers several advantages over existing methods. Firstly, the method can narrow the search space by searching over low-level trading strategies, which can help improve asset allocation decisions in real-world financial markets. Additionally, the proposed framework includes market indicators based on expert domain knowledge, which further refine the original data and enhance the investment decision-making process. By adopting a DRL hyper-heuristic approach, the method also overcomes one of the limitations of trend-prediction-based algorithms: they rely heavily on prediction accuracy, which cannot ensure optimal portfolio decisions. The DRL hyper-heuristic approach, on the other hand, uses deep reinforcement learning to optimize portfolio construction based on observed market dynamics.
Nevertheless, this method still has limitations. One potential limitation is the computational complexity of implementing a DRL hyper-heuristic approach, as training deep reinforcement learning models can require significant computational resources. Additionally, the effectiveness of the method may depend on the availability and quality of the original data and of the market indicators based on expert domain knowledge.
Regarding the utility and application of this paper, our DRL hyper-heuristic framework can also enhance Liability-Driven Investment (LDI). LDI aims to align an investment portfolio with the liabilities of an institution, such as future pension payments. Optimized portfolio construction across multiple periods can help institutional investors match their assets and liabilities over time. It can take into account changing interest rates, inflation and other factors to which liabilities are sensitive. As a result, financial institutions can build a portfolio that provides a reliable income stream for their pension payments while also protecting their investments from market volatility.
The rest of the paper is organized as follows. Related work is provided in Section 2. Section 3 gives the problem description. The proposed solution framework is presented in Section 4. The experimental results are reported in Section 5. The research implications are summarized in Section 6. Section 7 concludes the paper.
2. Related work
The basic MPT optimization model is essentially a parametric quadratic programming problem. As a result, there exist exact algorithms that can efficiently obtain optimal solutions for most particular data sets. However, the inclusion of additional real-world trading constraints considerably increases the problem's complexity. In most cases, adding new restrictions results in a nonconvex search space, which makes it impossible to apply exact methods. Hence, many practitioners and researchers adopt heuristic or metaheuristic optimization techniques to solve the problem. Schaerf (2002) explored the use of local search techniques, mainly Tabu Search (TS), for the problem and investigated different neighborhood relations. Computational results were given for five general market instances involving up to 225 assets. Crama and Schyns (2003) introduced a Simulated Annealing (SA) method for the practical portfolio optimization problem. Real-world trading constraints such as cardinality, turnover and trading constraints (minimum trading size) were also considered. The constraints were handled by a penalty approach, which adds a penalty term to the objective function for each violated constraint. Chang et al. (2000) highlighted the different shapes of the efficient frontier in the presence of a cardinality constraint. They also showed that certain portions of the efficient frontier are disconnected. They then presented three heuristic algorithms, namely the Genetic Algorithm (GA), TS and SA, to solve the cardinality-constrained model. Computational results were reported for five test instances involving up to 225 assets. These instances were later made publicly available via the OR-library (Beasley, 1990) and used as benchmark data sets. Fernández and Gómez (2007) applied a heuristic method based on artificial neural networks. Computational results were reported for the benchmark data set from the OR-library. They compared the results of the artificial neural networks with those of the three heuristic algorithms reported in Chang et al. (2000). They concluded that no single heuristic outperformed the others for all kinds of investment policies. Chang et al. (2009) applied GA to the cardinality-constrained portfolio optimization model with different risk measures. Computational results were reported for three test problems involving up to 99 assets. The authors also verified that investors should consider only one third of the total assets for selection into the portfolio, since they are dominated by those containing more assets. Woodside-Oriakhi et al. (2011) presented three metaheuristic algorithms based on GA, TS and SA. The proposed metaheuristics use a subset optimization step, in which a (small) mixed-integer quadratic optimization problem is solved to optimality. They reported better-quality solutions for the benchmark data set from the OR-library, which indicates that the subset optimization step is a useful strategy for the cardinality-constrained portfolio optimization problem. Cura (2009) applied a Particle Swarm Optimization (PSO) approach in which each particle represents a portfolio. Computational results were reported for the benchmark data set from the OR-library. The PSO results were compared with those obtained by GA, TS and SA. The comparison showed that none of the four algorithms outperformed the others for all kinds of investment policies, and that PSO could obtain better solutions for portfolios with a low risk level.
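The following Python sketch illustrates the penalty approach attributed above to Crama and Schyns (2003). It is a minimal illustration under our own assumptions and does not reproduce any cited study. The objective is a weighted mean-variance trade-off, and each violated constraint (cardinality, minimum holding size) adds a fixed penalty. A plain simulated annealing loop moves weight between pairs of assets, so the weights always stay non-negative and sum to one. All function names, parameter values and the move rule are hypothetical.

```python
import numpy as np


def penalized_objective(w, mu, cov, risk_aversion, max_assets, min_weight, penalty):
    """Mean-variance objective to minimize plus a fixed penalty per violated constraint."""
    objective = risk_aversion * (w @ cov @ w) - (1.0 - risk_aversion) * (mu @ w)
    held = w > 0.0
    violations = 0
    if held.sum() > max_assets:        # cardinality constraint
        violations += 1
    if np.any(w[held] < min_weight):   # minimum holding (trading) size
        violations += 1
    return objective + penalty * violations


def simulated_annealing(mu, cov, risk_aversion=0.5, max_assets=3, min_weight=0.05,
                        penalty=10.0, iters=5000, t0=1.0, cooling=0.999, seed=0):
    """Minimize the penalized objective over weights with w >= 0 and sum(w) = 1."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    args = (mu, cov, risk_aversion, max_assets, min_weight, penalty)
    w = np.full(n, 1.0 / n)            # equal-weight starting portfolio
    f = penalized_objective(w, *args)
    best_w, best_f = w.copy(), f
    temp = t0
    for _ in range(iters):
        i, j = rng.choice(n, size=2, replace=False)
        # Move all (or a random part) of asset i's weight to asset j; the sum stays 1.
        step = w[i] if rng.random() < 0.5 else rng.uniform(0.0, w[i])
        cand = w.copy()
        cand[i] -= step
        cand[j] += step
        fc = penalized_objective(cand, *args)
        # Accept improvements; accept worse moves with probability exp(-delta / temp).
        if fc < f or rng.random() < np.exp((f - fc) / temp):
            w, f = cand, fc
            if f < best_f:
                best_w, best_f = w.copy(), f
        temp *= cooling                # geometric cooling schedule
    return best_w, best_f


# Usage on five synthetic assets (expected returns, volatilities, constant correlation 0.3).
mu = np.array([0.08, 0.10, 0.12, 0.07, 0.09])
sd = np.array([0.10, 0.15, 0.20, 0.08, 0.12])
cov = np.outer(sd, sd) * (0.3 + 0.7 * np.eye(5))
w, f = simulated_annealing(mu, cov)
print(np.round(w, 3), round(f, 4))
```

The move itself preserves the budget constraint, so only the cardinality and minimum-holding constraints need penalties. A larger `penalty` pushes the search harder towards feasible portfolios.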
Fig. 2. State features of the proposed DRL hyper-heuristic framework.
endogenous state features constitute the origin state $s^*_t$ of the examined problem.
However, the origin state may contain a high degree of noise and uncertainty, as well as non-stationary traits. It can therefore cause a distribution shift between training and testing data. To tackle this, we introduce market technical indicators based on expert domain knowledge that summarize market behavior from different perspectives and extract useful patterns. These indicators are heuristics or mathematical estimations based on general raw market data. Some popular technical indicators include Moving Average Convergence Divergence (MACD) (Appel, 2005), Relative Strength Index (RSI) (Wilder, 1978), Stochastic Oscillator (SO) (Cao et al., 2020) and Fibonacci Retracement (FR) (Tsinaslanidis et al., 2022). Specifically, MACD is a trend-following momentum indicator based on two moving averages of a particular asset's price. MACD captures the difference between long-run and short-run trends and can therefore be used as a trend follower. Similarly, RSI is a momentum indicator that quantifies the magnitude of price changes, signaling overbought or oversold conditions for a particular asset. SO also measures overbought or oversold conditions. Its design fully considers the random amplitude of price fluctuations as well as short-run and medium-run fluctuations, which makes it more accurate and effective for short-term market measurement. FR generates an impulse line from each golden section point (0.618) relative to the current stock price. These lines can be extremely helpful in identifying areas where buyers may be accumulating heavy buying pressure after a price drop.
To further leverage additional diverse information, we also include the four aforementioned market indicator vectors as the augmented state $s'_t$. These are $\boldsymbol{\alpha}^t = [\alpha^t_0, \dots, \alpha^t_n]^\top$, $\boldsymbol{\eta}^t = [\eta^t_0, \dots, \eta^t_n]^\top$, $\boldsymbol{\iota}^t = [\iota^t_0, \dots, \iota^t_n]^\top$ and $\boldsymbol{\xi}^t = [\xi^t_0, \dots, \xi^t_n]^\top$, where $\boldsymbol{\alpha}$, $\boldsymbol{\eta}$, $\boldsymbol{\iota}$ and $\boldsymbol{\xi}$ denote MACD, RSI, SO and FR, respectively. Thus, the final state at a single time step $t$, $\boldsymbol{s}_t = [\mathbf{o}^t, \mathbf{h}^t, \mathbf{l}^t, \mathbf{c}^t, \mathbf{v}^t, \boldsymbol{\alpha}^t, \boldsymbol{\eta}^t, \boldsymbol{\iota}^t, \boldsymbol{\xi}^t, r^t, V^t]$, is the concatenation of the origin state $s^*_t$ and the augmented state $s'_t$, with dimension $9n + 2$. In this work, some particular actions (trading strategies) may require a certain time window for execution. We therefore consider a time instance $\tau$ consisting of $m$ time steps $t$, so the state $s_\tau$ at time instance $\tau$ is an $m \times (9n + 2)$ matrix.
4.2. Action
Normally, in a hyper-heuristic framework, the actions are various heuristic rules. In our proposed solution framework, there are two levels of actions. Given the state $s_\tau$, the agent first performs an agent action $a_\tau$ to select a sophisticated trading strategy. The action space is defined as a 12-dimensional vector $[a_1, \dots, a_{12}]$, where each element represents one specific trading strategy. Specifically, we consider the eleven most prevalent trading strategies (Li et al., 2016), namely, Anti-correlation strategy (Anticor), Buy and Hold strategy (BAH), Best Constant Rebalanced Portfolio strategy (BCRP), Nonparametric Nearest Neighbor log-optimal trading strategy (BNN), Constant Rebalanced Portfolio strategy (CRP), Confidence Weighted Mean Reversion (CWMR), Exponential Gradient (EG), On-Line Moving Average Reversion strategy (OLMAR), Online Newton Step (ONS), Passive Aggressive Mean Reversion strategy (PAMR) and Robust Median Reversion strategy (RMR). We also include one baseline DRL trading strategy using DDPG (Jiang et al., 2017). A detailed description can be found in Table 1. Once the agent action $a_\tau$ is selected, the assets' weights at time instance $\tau$ can be obtained instantly by simply executing the corresponding trading strategy. Our proposed framework can easily be extended by incorporating more actions (trading strategies).
Table 1
Brief description of the trading strategies (agent actions) used in the proposed DRL hyper-heuristic framework.
Agent action | Brief description
Anticor | Portfolio construction based on stock correlations and anti-correlations in consecutive windows.
BAH | Buy and hold the selected assets, retaining the investment regardless of short-term ups and downs in market price.
BCRP | Redistribute the investment wealth each trading day based on hindsight.
BNN | A nearest-neighbor-based strategy that exploits histograms from nonparametric statistics.
CRP | On a daily basis, maintain the same wealth distribution among a particular group of assets.
CWMR | Create a Gaussian distribution to represent the portfolio vector, and then update it in accordance with the mean reversion principle.
DDPG | Baseline DRL strategy for constructing portfolios.
EG | Track the best stock and adopt a regularization term to reduce departure from the prior portfolio.
OLMAR | Forecast the next price relatives using the moving average method, and then construct portfolios via online learning techniques.
ONS | Track the best CRP to date and adopt an L2-norm regularization to limit the variability of the portfolio.
PAMR | Adopt the mean reversion model of financial time series based on online passive aggressive learning.
RMR | Construct portfolios based on the median reversion property of financial time series using a robust L1-estimator and passive aggressive online learning.
4.3. State transition
The state transition from $s_\tau$ to $s_{\tau+1}$ is governed by the function $s_{\tau+1} = F(s_\tau, a_\tau, \varphi_\tau)$. The transition may depend not only on the action $a_\tau$ but also on the uncertainties $\varphi_\tau$ present in some state features. In this work, the following transitions are subject to market uncertainties: $O^\tau$ ($m \times n$ opening price matrix), $H^\tau$ ($m \times n$ high price matrix), $L^\tau$ ($m \times n$ low price matrix), $C^\tau$ ($m \times n$ closing price matrix) and $\nu^\tau$ ($m \times n$ volume matrix). On the other hand, the transitions for $\mathbf{r}^\tau$ ($m$-dimensional vector) and $\mathbf{V}^\tau$ ($m$-dimensional vector) are directly affected by the agent's actions. For a given time instance $\tau$, the portfolio is rebalanced after action $a_\tau$ is executed.
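The state construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, and several choices in it are our assumptions. The indicator parameters are not given here, so the code uses conventional windows: 12/26-period EMAs for MACD and 14-period windows for RSI and SO. It also uses a simple-average RSI variant. The FR feature is read as the gap between the close and a 0.618 retracement level of the trailing high-low range. The function names are hypothetical. Inputs are T-by-n arrays of opening, high, low and closing prices and volumes, plus length-T vectors of returns r and portfolio values V.

```python
import numpy as np


def ema(x, span):
    """Exponential moving average along time (axis 0), alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out = np.empty(x.shape, dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out


def trailing(x, k, fn):
    """Apply fn over a trailing window of up to k rows (shorter windows at the start)."""
    return np.stack([fn(x[max(0, t - k + 1): t + 1], axis=0) for t in range(len(x))])


def macd(close):
    """MACD line: fast (12) EMA minus slow (26) EMA of the closing price."""
    return ema(close, 12) - ema(close, 26)


def rsi(close, k=14):
    """Simple-average RSI; 100 * gain / (gain + loss) equals 100 - 100 / (1 + RS)."""
    diff = np.diff(close, axis=0, prepend=close[:1])
    gain = trailing(np.clip(diff, 0.0, None), k, np.mean)
    loss = trailing(np.clip(-diff, 0.0, None), k, np.mean)
    total = gain + loss
    return np.where(total > 0, 100.0 * gain / np.where(total > 0, total, 1.0), 50.0)


def stochastic_k(high, low, close, k=14):
    """Stochastic oscillator %K: position of the close within the trailing high-low range."""
    hh, ll = trailing(high, k, np.max), trailing(low, k, np.min)
    width = hh - ll
    return np.where(width > 0, 100.0 * (close - ll) / np.where(width > 0, width, 1.0), 50.0)


def fib_gap(high, low, close, k=14, ratio=0.618):
    """Hypothetical FR feature: close minus the 0.618 retracement level of the trailing range."""
    hh, ll = trailing(high, k, np.max), trailing(low, k, np.min)
    return close - (hh - ratio * (hh - ll))


def build_state(o, h, l, c, v, r, V, m):
    """Rows [o, h, l, c, v, MACD, RSI, SO, FR, r, V] per step; keep the last m steps."""
    rows = np.hstack([o, h, l, c, v,
                      macd(c), rsi(c), stochastic_k(h, l, c), fib_gap(h, l, c),
                      r[:, None], V[:, None]])
    return rows[-m:]  # shape (m, 9n + 2)


# Usage on synthetic data (n = 3 assets, T = 60 steps, window m = 10).
rng = np.random.default_rng(0)
T, n, m = 60, 3, 10
c = 100.0 * np.cumprod(1.0 + 0.01 * rng.standard_normal((T, n)), axis=0)
h, l, o = c * 1.01, c * 0.99, c.copy()
v = rng.uniform(1e3, 1e4, size=(T, n))
r, V = np.zeros(T), np.ones(T)
s_tau = build_state(o, h, l, c, v, r, V, m)
print(s_tau.shape)  # (10, 29)
```

Each row of `s_tau` is one time step laid out as [o, h, l, c, v, MACD, RSI, SO, FR, r, V]. The column count is therefore 9n + 2, which is 29 for n = 3, matching the dimension stated in the text.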
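The two-level action and the state transition of Section 4 can be illustrated with the hypothetical sketch below. It is not the paper's implementation. It includes only three of the twelve Table 1 strategies: CRP, BAH and a textbook exponentiated-gradient update standing in for EG. It also ignores transaction costs, works with price relatives c_t / c_{t-1}, and replaces the trained DRL policy with a fixed placeholder rule. The agent action picks one strategy for a whole time instance of m steps. That strategy then produces the asset weights at each step, and the portfolio return and value are updated from the realized prices.

```python
import numpy as np


def crp(w_prev, x_hist):
    """CRP: rebalance to uniform weights at every step."""
    return np.full(len(w_prev), 1.0 / len(w_prev))


def bah(w_prev, x_hist):
    """BAH: keep the weights as they have drifted with prices (no rebalancing)."""
    return w_prev


def eg(w_prev, x_hist, eta=0.05):
    """EG: multiplicative update towards assets with high latest price relatives."""
    x = x_hist[-1]
    w = w_prev * np.exp(eta * x / (w_prev @ x))
    return w / w.sum()


STRATEGIES = [crp, bah, eg]  # illustrative subset of the Table 1 actions


def run_time_instance(action, w, value, x_block, x_hist):
    """Execute STRATEGIES[action] over the m steps of one time instance.

    x_block: (m, n) price relatives c_t / c_{t-1}; x_hist: relatives seen so far.
    Returns the drifted weights, the updated portfolio value and the m step returns.
    """
    strategy = STRATEGIES[action]
    returns = np.empty(len(x_block))
    for k, x in enumerate(x_block):
        w = strategy(w, x_hist)      # second-level action: asset weights for this step
        growth = w @ x               # gross return of the rebalanced portfolio
        value *= growth
        returns[k] = growth - 1.0
        w = w * x / growth           # weights drift with prices until the next rebalance
        x_hist = np.vstack([x_hist, x])
    return w, value, returns


# Usage: a placeholder rule stands in for the trained DRL policy.
rng = np.random.default_rng(1)
n, m, instances = 4, 5, 6
x_all = 1.0 + 0.01 * rng.standard_normal((m * instances, n))
w, value, x_hist = np.full(n, 1.0 / n), 1.0, np.ones((1, n))
for tau in range(instances):
    a_tau = tau % len(STRATEGIES)    # hypothetical stand-in for the agent action
    block = x_all[tau * m:(tau + 1) * m]
    w, value, r_tau = run_time_instance(a_tau, w, value, block, x_hist)
    x_hist = np.vstack([x_hist, block])

# Same prices, different low-level strategy: CRP vs BAH over two steps.
x_demo = np.array([[2.0, 1.0], [2.0, 1.0]])
_, v_crp, _ = run_time_instance(0, np.array([0.5, 0.5]), 1.0, x_demo, np.ones((1, 2)))
_, v_bah, _ = run_time_instance(1, np.array([0.5, 0.5]), 1.0, x_demo, np.ones((1, 2)))
print(round(v_crp, 6), round(v_bah, 6))  # 2.25 2.5
```

The final comparison shows that, on identical prices, the choice of low-level strategy alone changes the value path. This is the lever the agent action controls in a hyper-heuristic design.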
4.4. Reward
5. Experimental results
To validate the effectiveness of incorporating expert domain knowledge, we compare the training process of the proposed method with and without state augmentation. The results are illustrated in Fig. 4. The learning curves of the proposed method with raw states grow relatively fast but converge at lower rewards. The DRL
6. Research implications
Our multi-period portfolio optimization based on the DRL hyper-heuristic framework can deliver fruitful research implications. Firstly, the multi-period portfolio optimization model can align with the Intertemporal Capital Asset Pricing Model (ICAPM) proposed by Merton (1973). On this basis, our multi-period portfolio optimization model
Fig. 5. Comparison of portfolio compositions between proposed method and DRL baselines (DDPG, PPO) on five real-world market instances.
to continuously analyze new data and adjust the weights of different assets in the portfolio, investors can ensure that their portfolio remains well-diversified and optimized for their specific risk and return objectives.
This paper showcases the first application of a DRL hyper-heuristic framework to the multi-period portfolio optimization problem. By taking advantage of well-developed low-level trading strategies, the DRL agent can effectively narrow the action space and improve overall solution quality. Moreover, a state augmentation scheme based on expert domain knowledge is used to further improve the performance of the proposed method.
The experimental results show that our proposed framework has a number of benefits. Firstly, it performs better than state-of-the-art trading strategies as well as the DRL baseline method that directly exploits the low-level action space. Secondly, it can provide more diversified portfolio constructions and thus enhance robustness against market uncertainties. Finally, the proposed algorithmic framework reveals that certain patterns exist between market conditions and trading strategies. These patterns allow investors to more easily understand and accept the suggestions provided by the proposed method.
Our study has salient policy implications. Our DRL hyper-heuristic framework can enrich regulatory policies by shedding light on the impact of regulations on portfolio decisions and market dynamics. In addition, regarding the utility and application of this paper, our DRL hyper-heuristic framework can strengthen LDI, such as pension plan investments. Our study of multi-period portfolio optimization using a DRL hyper-heuristic in real stock markets can also benefit different stakeholders, including individual investors, asset managers, financial institutions and regulators. For example, an asset management firm that implements our DRL hyper-heuristic portfolio optimization model may outperform competitors in terms of risk-adjusted returns. It may then attract a larger client base seeking superior investment performance. Our study can also contribute to the theoretical understanding of RL algorithms and their application in financial
adaptive risk management approach allows investors to respond more effectively to changing market conditions, reduce downside risk, and potentially limit losses during periods of market stress or volatility, which can affect stock market trading behavior.
Tianxiang Cui: Conceptualization, Methodology, Investigation, Writing – original draft, Writing – review & editing, Validation, Formal analysis, Supervision. Nanjiang Du: Validation, Formal analysis, Data curation, Investigation, Methodology, Visualization. Xiaoying Yang: Validation, Formal analysis, Data curation, Investigation, Methodology, Visualization. Shusheng Ding: Validation, Investigation, Visualization, Writing – review & editing.
Data availability
Data will be made available on request.
Acknowledgment
This project is supported by Ningbo Natural Science Foundation, China (Project ID 2023J194), and by Ningbo Government, China (Project ID 2021B-008-C).
References
Abedin, M., Moon, M., Hassan, M., Hajek, P., 2021. Deep learning-based exchange rate prediction during the COVID-19. Ann. Oper. Res.
Ahmed, L., Mumford, C., Kheiri, A., 2019. Solving urban transit route design problem using selection hyper-heuristics. European J. Oper. Res. 274 (2), 545–559.
Almahdi, S., Yang, S.Y., 2017. An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Syst. Appl. 87, 267–279.
Ang, A., Bekaert, G., 2007. Stock return predictability: Is it there? Rev. Financ. Stud. 20 (3), 651–707.
Appel, G., 2005. Technical Analysis: Power Tools for Active Investors. FT Press.
Avramov, D., 2002. Stock return predictability and model uncertainty. J. Financ. Econ.
decision-making. The development of DRL based portfolio optimization 64 (3), 423–458.
Beasley, J., 1990. OR-library: distributing test problems by electronic mail. J. Oper.
models that can capture the time-varying nature of risk and return. Our
Res. Soc. 41 (11), 1069–1072.
paper thereby strengthens the theoretical understanding of the financial Bellman, R., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
market dynamics as well as optimal portfolio construction by applying Bertsimas, D., Shioda, R., 2009. Algorithm for cardinality-constrained quadratic
DRL in a real-world investment scenario. optimization. Comput. Optim. Appl. 43 (1), 1–22.
The relevance of our paper to the financial research field lies in Bodnar, T., Parolya, N., Schmid, W., 2018. Estimation of the global minimum variance
portfolio in high dimensions. European J. Oper. Res. 266 (1), 371–390.
its ability to address the dynamic nature of financial markets and
Bonami, P., Lejeune, M.A., 2009. An exact solution approach for portfolio optimization
the need for rebalancing portfolios over time. It has significant pol-
problems under stochastic and integer constraints. Oper. Res. 57 (3), 650–670.
icy implications for investors, financial institutions, and regulators, Boubaker, S., Liu, Z., Zhai, L., 2021. Big data, news diversity and financial market
as mentioned before. Additionally, our paper contributes to the the- crash. Technol. Forecast. Soc. Change 168, 120755.
oretical understanding of reinforcement learning algorithms and their Buehler, H., Gonon, L., Teichmann, J., Wood, B., 2019. Deep hedging. Quant. Finance
application in financial decision-making. This understanding further 19 (8), 1271–1291.
Burke, E.K., Hyde, M.R., Kendall, G., Ochoa, G., Ozcan, E., Woodward, J.R.,
enhances Liability-Driven Investment and aligns portfolio constructions
2019. A classification of hyper-heuristic approaches: Revisited. In: Handbook of
with future liabilities such as pension payments by using the DRL based
Metaheuristics. Springer, pp. 453–477.
method. Campbell, J.Y., Giglio, S., Polk, C., Turley, R., 2018. An intertemporal CAPM with
Our paper further pushes the boundaries of existing theoretical stochastic volatility. J. Financ. Econ. 128 (2), 207–233.
paradigms by introducing a novel DRL hyper-heuristic approach to Cao, A., Lindner, B., Thomas, P.J., 2020. A partial differential equation for the mean–
multi-period portfolio optimization problems in real-world financial return-time phase of planar stochastic oscillators. SIAM J. Appl. Math. 80 (1),
markets. The proposed approach goes beyond traditional DRL methods 422–447.
Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M., 2000. Heuristics for cardinality
by searching for well-developed low-level trading strategies instead of
constrained portfolio optimisation. Comput. Oper. Res. 27 (13), 1271–1302.
directly exploiting the entire action domain. Additionally, the paper Chang, T.-J., Yang, S.-C., Chang, K.-J., 2009. Portfolio optimization problems in
incorporates market indicators based on expert domain knowledge different risk measures using genetic algorithm. Expert Syst. Appl. 36 (7),
to augment the state and provide additional high-level and robust 10529–10537.
information for improving asset allocation decisions. Chu, J., Zhang, Y., Chan, S., 2019. The adaptive market hypothesis in the high
Furthermore, the impacts of our study are also remarkable. Our frequency cryptocurrency market. Int. Rev. Financ. Anal. 64, 221–231.
Crama, Y., Schyns, M., 2003. Simulated annealing for complex portfolio selection
DRL hyper-heuristic framework can continuously learn from historical
problems. European J. Oper. Res. 150 (3), 546–571.
market data, monitor portfolio performance, and dynamically adjust Cui, T., Bai, R., Ding, S., Parkes, A.J., Qu, R., He, F., Li, J., 2020. A hybrid combinatorial
allocations to manage risk. Investors can implement adaptive risk man- approach to a two-stage stochastic portfolio optimization model with uncertain
agement strategies by applying our hyper-heuristic framework. This asset prices. Soft Comput. 24 (4), 2809–2831.
Cui, T., Cheng, S., Bai, R., 2014. A combinatorial algorithm for the cardinality constrained portfolio optimization problem. In: IEEE Congress on Evolutionary Computation. CEC, pp. 491–498.
Cui, T., Ding, S., Jin, H., Zhang, Y., 2023. Portfolio constructions in cryptocurrency market: A CVaR-based deep reinforcement learning approach. Econ. Model. 119, 106078.
Cura, T., 2009. Particle swarm optimization approach to portfolio optimization. Nonlinear Anal. RWA 10 (4), 2396–2406.
Deng, Y., Bao, F., Kong, Y., Ren, Z., Dai, Q., 2017. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst. 28 (3), 653–664.
Ding, S., Cui, T., Bellotti, A.G., Abedin, M.Z., Lucey, B., 2023. The role of feature importance in predicting corporate financial distress in pre and post COVID periods: Evidence from China. Int. Rev. Financ. Anal. 90, 102851.
Eachempati, P., Srivastava, P.R., Kumar, A., Tan, K.H., Gupta, S., 2021. Validating the impact of accounting disclosures on stock market: A deep neural network approach. Technol. Forecast. Soc. Change 170, 120903.
Efat, M., Hajek, P., Abedin, M., Azad, R., Jaber, M., Aditya, S., Hassan, M., 2022. Deep-learning model using hybrid adaptive trend estimated series for modelling and forecasting sales. Ann. Oper. Res.
Fama, E.F., 1965. The behavior of stock-market prices. J. Bus. 38 (1), 34–105.
Fama, E.F., 1970. Efficient capital markets: A review of theory and empirical work. J. Finance 25 (2), 383–417.
Fernández, A., Gómez, S., 2007. Portfolio selection using neural networks. Comput. Oper. Res. 34 (4), 1177–1191.
Gilbert-Saad, A., Siedlok, F., McNaughton, R.B., 2023. Entrepreneurial heuristics: Making strategic decisions in highly uncertain environments. Technol. Forecast. Soc. Change 189, 122335.
Hautsch, N., Kyj, L.M., Malec, P., 2015. Do high-frequency data improve high-dimensional portfolio allocations? J. Appl. Econometrics 30 (2), 263–290.
Jeong, G., Kim, H.Y., 2019. Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Syst. Appl. 117, 125–138.
Jiang, Z., Xu, D., Liang, J., 2017. A deep reinforcement learning framework for the financial portfolio management problem.
Jumper, J.M., Evans, R., et al., 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.
Kang, H.-J., Lee, S.-G., Park, S.-Y., 2022. Information efficiency in the cryptocurrency market: The efficient-market hypothesis. J. Comput. Inf. Syst. 62 (3), 622–631.
Kong, W., Liaw, C., Mehta, A., Sivakumar, D., 2019. A new dog learns old tricks: RL finds classic optimization algorithms. In: International Conference on Learning Representations. ICLR.
Laffont, J.-J., Maskin, E.S., 1990. The efficient market hypothesis and insider trading on the stock market. J. Polit. Econ. 98 (1), 70–93.
Lamont, O.A., Thaler, R.H., 2003. Can the market add and subtract? Mispricing in tech stock carve-outs. J. Polit. Econ. 111 (2), 227–268.
Le Tran, V., Leirvik, T., 2020. Efficiency in the markets of crypto-currencies. Finance Res. Lett. 35, 101382.
Lee, K., Kim, S.-A., Choi, J., Lee, S.-W., 2018. Deep reinforcement learning in continuous action spaces: a case study in the game of simulated curling. In: International Conference on Machine Learning. ICML, pp. 2937–2946.
Li, J., Rao, R., Shi, J., 2018. Learning to trade with deep actor critic methods. In: 2018 11th International Symposium on Computational Intelligence and Design, Vol. 02. ISCID, pp. 66–71.
Li, B., Sahoo, D., Hoi, S.C., 2016. OLPS: A toolbox for on-line portfolio selection. J. Mach. Learn. Res. 17 (35), 1–5.
Li, X., Uysal, A.S., Mulvey, J.M., 2022. Multi-period portfolio optimization using model predictive control with mean-variance and risk parity frameworks. European J. Oper. Res. 299 (3), 1158–1176.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning.
Ma, Y., Ahmad, F., Liu, M., Wang, Z., 2020. Portfolio optimization in the era of digital financialization using cryptocurrencies. Technol. Forecast. Soc. Change 161, 120265.
Markowitz, H., 1952. Portfolio selection. J. Finance 7 (1), 77–91.
Mazyavkina, N., Sviridov, S., Ivanov, S., Burnaev, E., 2021. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 134, 105400.
Merton, R.C., 1973. An intertemporal capital asset pricing model. Econometrica 41 (5), 867–887.
Moody, J., Saffell, M., 2001. Learning to trade via direct reinforcement. IEEE Trans. Neural Netw. 12 (4), 875–889.
Okoroafor, U.C., Leirvik, T., 2022. Time varying market efficiency in the Brent and WTI crude market. Finance Res. Lett. 45, 102191.
Peng, L., Kloeden, P.E., 2021. Time-consistent portfolio optimization. European J. Oper. Res. 288 (1), 183–193.
Pillay, N., Qu, R., 2018. Hyper-Heuristics: Theory and Applications. Springer Nature.
Pun, C.S., 2018. Time-consistent mean-variance portfolio selection with only risky assets. Econ. Model. 75, 281–292.
Pyun, S., 2019. Variance risk in aggregate stock returns and time-varying return predictability. J. Financ. Econ. 132 (1), 150–174.
Radaideh, M.I., Shirvan, K., 2021. Rule-based reinforcement learning methodology to inform evolutionary algorithms for constrained optimization of engineering applications. Knowl.-Based Syst. 217, 106836.
Rahimian, E., Akartunalı, K., Levine, J., 2017. A hybrid integer programming and variable neighbourhood search algorithm to solve nurse rostering problems. European J. Oper. Res. 258 (2), 411–423.
Rogers, L.C.G., Satchell, S.E., 1991. Estimating variance from high, low and closing prices. Ann. Appl. Probab. 1 (4), 504–512.
Schaerf, A., 2002. Local search techniques for constrained portfolio selection problems. Comput. Econ. 20 (3), 177–190.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms.
Shajalal, M., Hajek, P., Abedin, M.Z., 2023. Product backorder prediction using deep neural network on imbalanced data. Int. J. Prod. Res. 61 (1), 302–319.
Sharpe, W.F., 1994. The Sharpe ratio. J. Portfolio Manag. 21 (1), 49–58.
Shaw, D.X., Liu, S., Kopman, L., 2008. Lagrangian relaxation procedure for cardinality-constrained portfolio optimization. Optim. Methods Softw. 23 (3), 411–420.
Shi, S., Li, J., Li, G., Pan, P., Chen, Q., Sun, Q., 2022. GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing 498, 14–27.
Silver, D., Huang, A., et al., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), 484–489.
Silver, D., Schrittwieser, J., et al., 2017. Mastering the game of Go without human knowledge. Nature 550 (7676), 354–359.
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. NIPS, pp. 3104–3112.
Tao, R., Su, C.-W., Xiao, Y., Dai, K., Khalid, F., 2021. Robo advisors, algorithmic trading and investment management: wonders of fourth industrial revolution in financial markets. Technol. Forecast. Soc. Change 163, 120421.
Thaler, R.H., 1999. The end of behavioral finance. Financ. Anal. J. 55 (6), 12–17.
Tsinaslanidis, P., Guijarro, F., Voukelatos, N., 2022. Automatic identification and evaluation of Fibonacci retracements: Empirical evidence from three equity markets. Expert Syst. Appl. 187, 115893.
Vinyals, O., Babuschkin, I., et al., 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), 350–354.
Wilder, J., 1978. New Concepts in Technical Trading Systems. Trend Research.
Woodside-Oriakhi, M., Lucas, C., Beasley, J.E., 2011. Heuristic algorithms for the cardinality constrained efficient frontier. European J. Oper. Res. 213 (3), 538–550.
Wu, X., Chen, H., Wang, J., Troiano, L., Loia, V., Fujita, H., 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Inform. Sci. 538, 142–158.
Wu, Q., Liu, X., Qin, J., Zhou, L., Mardani, A., Deveci, M., 2022. An integrated multi-criteria decision-making and multi-objective optimization model for socially responsible portfolio selection. Technol. Forecast. Soc. Change 184, 121977.
Ye, Y., Pei, H., Wang, B., Chen, P., Zhu, Y., Xiao, J., Li, B., 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In: The Thirty-Fourth Conference on Artificial Intelligence. AAAI, pp. 1112–1119.
Zhang, Y., Bai, R., Qu, R., Tu, C., Jin, J., 2021. A deep reinforcement learning based hyper-heuristic for combinatorial optimisation with uncertainties. European J. Oper. Res.
Dr. Tianxiang Cui is an assistant professor in the School of Computer Science at the University of Nottingham Ningbo China (UNNC) and a senior member of IEEE. Before joining UNNC, he was a senior AI engineer at Huawei and a senior algorithm researcher at PingAn. He has been involved in several frontier industrial projects, including autonomous driving and quantitative trading. His main research interests include computational intelligence, particularly metaheuristics, evolutionary computation and neural networks, as well as machine learning and reinforcement learning. He has published a number of research papers in high-quality academic journals, including Economic Modelling, International Journal of Production Research, International Review of Financial Analysis, Research in International Business and Finance, Resources Policy and Soft Computing.
Nanjiang Du is a PhD student in computer science at the University of Nottingham Ningbo China (UNNC). His main research interests include computational finance, machine learning and federated learning.
Xiaoying Yang is a PhD student in computer science at the University of Nottingham Ningbo China (UNNC). Her main research interests include computational intelligence, operations research, machine learning and reinforcement learning.
Dr. Shusheng Ding holds a Ph.D. in finance and currently works as an assistant professor of finance in the Business School at Ningbo University, China. He previously worked as a postdoctoral research fellow at the University of Nottingham Ningbo China. His research mainly focuses on financial modeling and trading with machine learning and deep learning, blockchain financing, financial risk management, volatility forecasting and financial econometrics. He has published a number of research papers in high-quality academic journals, including British Journal of Management, Soft Computing, Quantitative Finance, Journal of Futures Markets and Economic Modelling.