Backtest Overfitting in The Machine Learning Era - A Comparison of Out-of-Sample Testing Methods in A Synthetic Controlled Environment
… Canada
email: [email protected]
2 BSc student of Applied Mathematics & Economics, Sharif University of Technology, Tehran, Iran
email: [email protected]
3 Professor, University of Toronto, Ontario, Canada
email: [email protected]

The presentation slides and a commentary on this article are available on RiskLab's website at the University of Toronto: risklab.ca/backtesting. The architecture of the codes of this article is explained on risklab.ai/backtesting, in both Python and Julia programming languages. The reproducible results of this paper are based on the authors' Python implementation on RiskLab's GitHub page: github.com/RiskLabAI.

…essential in guiding decision-making processes across a spectrum of financial activities, from asset allocation to risk management, in both buy-side and sell-side institutions.

1.2. Motivation
The impetus for our research stems from a pivotal observation: despite substantial progress in financial modeling and an escalating reliance on machine learning algorithms, there is a glaring shortfall in effectively validating these models within the ambit of financial markets.
This research gap becomes more pronounced when considering the extensive literature on predicting market factors. Yet, there is a conspicuous lack of discussion on tailoring cross-validation algorithms to accurately assess these models (Lopez de Prado [2018, 2020]). Further complicating this landscape is the paucity of research dedicated to critically evaluating backtesting and cross-validation algorithms themselves. We hypothesize that the limited exploration in this domain is attributable to the inherent complexities of financial datasets, which are typically noisy, non-stationary, and characterized by intricate patterns shaped by various variables, from macroeconomic shifts to market sentiment. These unique dataset attributes often render traditional cross-validation methods insufficient or misleading (Lopez de Prado [2018]). The grave consequences of model inaccuracies in this context cannot be overstated, as they can lead to substantial financial losses and pose systemic risks. This highlights the critical need to develop and refine cross-validation methodologies for navigating the nuances of financial data. While hedge funds and investment firms may have practical approaches to address these challenges, there is a stark silence in the academic literature on this imperative issue. Our study seeks to bridge this gap, providing insights and methodologies vital for the rigorous evaluation of financial models, thereby catering to finance's academic and practical realms.

1.3.2. Rising Concerns over Backtest Overfitting and False Discoveries
The evolution of financial modeling has necessitated advanced methodologies to effectively address the challenges of overfitting and false discoveries in strategy evaluation. Pioneering contributions by Bailey et al. [2016] and Bailey and López de Prado [2014b] brought to the fore the need for rigorous evaluation of trading strategies. They introduced quantifiable metrics like the Probability of Backtest Overfitting (PBO) and the Deflated Sharpe Ratio (DSR), which provided a statistical basis to assess the reliability of backtested strategies. Despite these advancements, a significant gap exists in the literature: a comprehensive framework linking backtest overfitting assessment with the effectiveness of out-of-sample testing methodologies. Our study addresses this gap by proposing a novel framework that evaluates out-of-sample testing techniques through the prism of backtest overfitting. By integrating key concepts such as PBO and DSR into our analysis, we aim to provide a holistic evaluation of CV methods, ranging from traditional data science approaches to innovative financial models like those proposed by Lopez de Prado. This approach ensures financial models' robustness and predictive power, filling a critical void in quantitative finance.
…findings, providing insights into the performance and robustness of various cross-validation methods. In the Discussion section 4, we interpret these findings, contextualizing them within the broader landscape of quantitative finance and discussing their implications. The paper culminates with the Conclusion section 5, where we summarize the key takeaways, acknowledge the limitations of our study, and suggest directions for future research.

2. Methodology
The methodology section forms the backbone of our research, presenting a comprehensive and systematic approach to exploring and analyzing financial market dynamics through machine learning and statistical methods. This section outlines the construction and utilization of a Synthetic Controlled Environment, which integrates complex market models such as the Heston Stochastic Volatility and Merton Jump Diffusion models and incorporates regime-switching dynamics through Markov chains. Additionally, it addresses the drift burst hypothesis to model market anomalies like speculative bubbles and flash crashes. The methodology elaborates on developing and evaluating a prototypical financial machine-learning strategy, encompassing event-based sampling, trade directionality, bet sizing, and feature selection. Crucially, the methodology also delves into assessing backtest overfitting through advanced statistical techniques, ensuring the validity and robustness of the proposed trading strategies. The methodologies are meticulously designed to capture the intricate nuances of financial markets, thereby enabling a thorough and accurate analysis of trading strategies within a controlled yet realistic market simulation.

2.1. Synthetic Controlled Environment
In financial analysis, constructing a Synthetic Controlled …
…where the instantaneous variance, 𝜈𝑡, adheres to the Feller square-root or Cox-Ingersoll-Ross (CIR) process:

d\nu_t = \kappa(\theta - \nu_t)\,dt + \xi \sqrt{\nu_t}\, dW_t^{\nu}, (2.2)

with 𝑊𝑡 and 𝑊𝑡^𝜈 representing Wiener processes, exhibiting a correlation of 𝜌.
The model described in Eqn. (2.1) and Eqn. (2.2) uses four main parameters. 𝜃 is the long-term average variance, showing the expected variance that 𝜈𝑡 will approach as 𝑡 increases. 𝜌 describes the correlation between the two Wiener processes in the model. 𝜅 shows how quickly 𝜈𝑡 returns to its long-term average, 𝜃. And 𝜉 is known as the 'volatility of volatility', indicating how much 𝜈𝑡 can vary.
A salient feature of this model is the Feller condition, expressed as 2𝜅𝜃 > 𝜉². Ensuring this inequality guarantees the strict positivity of the variance process, ruling out negative values for variance.

2.1.2. Jumps: The Merton Jump Diffusion Model
The Merton Jump Diffusion model by Merton [1976] enhances the geometric Brownian motion underlying the Black-Scholes model by integrating a discrete jump component to capture abrupt stock price movements. The stock price dynamics are given by:

dS_t = \mu S_t\, dt + \sigma S_t\, dW_t + S_t\, dJ_t. (2.3)

In Eqn. (2.3), 𝜇𝑆𝑡 d𝑡 is the drift term that captures the expected return, 𝜎𝑆𝑡 d𝑊𝑡 embodies the continuous random fluctuations, with 𝜎 being the stock's volatility and d𝑊𝑡 the standard Brownian motion increment, and 𝑆𝑡 d𝐽𝑡 accounts for instantaneous jumps in the stock price.
The jump process 𝐽𝑡 in Eqn. (2.3) is defined as:

J_t = \sum_{i=1}^{N(t)} (Y_i - 1), (2.4)

…
The drift's sudden increase is concisely encapsulated in the equation:

\mu_t^{\mathrm{db}} = a\, |\tau_{\mathrm{db}} - t|^{-\alpha}. (2.5)

In Eqn. (2.5), 𝜇𝑡^db describes the drift at a given time 𝑡 according to its distance relative to the bursting time 𝜏db. The factor 𝑎 sets the scale of the drift, while 1/2 < 𝛼 < 1 measures how intense this drift spike is.
Similarly, the abrupt rise in volatility, or the "volatility burst", is represented as:

\sigma_t^{\mathrm{vb}} = b\, |\tau_{\mathrm{db}} - t|^{-\beta}. (2.6)

In Eqn. (2.6), 𝜎𝑡^vb indicates the volatility at time 𝑡. The parameter 𝑏 quantifies the size of this volatility surge, and 0 < 𝛽 < 1/2 gauges its sharpness.
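To make the burst dynamics concrete, the following sketch evaluates Eqns. (2.5) and (2.6) on a time grid. The default parameters are taken from the bubble parameterization table in Section 3 (a = 0.35, α = 0.75, b = 0.458, β = 0.225); capping |𝜏db − 𝑡| from below by that table's "explosion filter width" is our assumption for keeping the profiles finite near the bursting time.

```python
import numpy as np

def burst_profiles(t, tau_db, a=0.35, alpha=0.75, b=0.458, beta=0.225,
                   filter_width=0.1):
    """Drift burst (Eqn. 2.5) and volatility burst (Eqn. 2.6) profiles.
    Use a = a_before on the pre-burst side and a = a_after afterwards.
    The gap |tau_db - t| is floored at filter_width (our reading of the
    'explosion filter width') so the profiles stay finite at tau_db."""
    gap = np.maximum(np.abs(tau_db - t), filter_width)
    mu_db = a * gap ** (-alpha)      # drift spike, sharper since alpha > 1/2
    sigma_vb = b * gap ** (-beta)    # volatility spike, milder since beta < 1/2
    return mu_db, sigma_vb

# Example: daily grid over a 5-year bubble with the burst at 2.5 years.
t = np.linspace(0.0, 5.0, 5 * 252, endpoint=False)
mu_db, sigma_vb = burst_profiles(t, tau_db=2.5)
```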
2.1.4. Regime Transitions: Markov Chain
A regime-switching time series model is applied to simulate market dynamics, following Hamilton [1994] as mentioned by Lopez de Prado [2020]. The market is segmented into discrete regimes, each with unique characteristics. The market's transition between these regimes at any given time 𝑡 is determined by a Markov chain, where the transition probability 𝑝𝑡,𝑛 depends solely on the state immediately prior. This approach captures the fluid nature of financial markets, which fluctuate between different states, reflecting shifts in volatility and trends. By employing a Markov chain, these transitions are modeled with mathematical precision while maintaining economic plausibility, recognizing that financial markets tend to exhibit a memory of only the most recent events.
A Markov chain is a mathematical system that transitions from one state to another in a state space. It is defined by its set of states and the transition probabilities between these states. The fundamental property of a Markov chain is that the probability of moving to the next state depends only on the present state and not on the sequence of events that preceded it.
Given a finite number of states 𝑆 = {𝑠1, 𝑠2, …, 𝑠𝑛}, the probability of transitioning from state 𝑠𝑖 to state 𝑠𝑗 in one step is denoted by 𝑃𝑖𝑗:

P_{ij} = P(X_{n+1} = s_j \mid X_n = s_i), (2.7)

where 𝑋𝑛 represents the state at time 𝑛, and 𝑃𝑖𝑗 is the entry in the 𝑖-th row and 𝑗-th column of the transition matrix 𝑃. The matrix 𝑃 = [𝑃𝑖𝑗] is called the transition matrix of the Markov chain. Each entry 𝑃𝑖𝑗 represents the one-step transition probability from state 𝑠𝑖 to state 𝑠𝑗 as in Eqn. (2.7):

P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}. (2.8)
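As an illustration of this mechanism, the sketch below samples a regime path from a two-state transition matrix using only the previous state, per Eqn. (2.7). The matrix values are illustrative placeholders, not the calibrated entries of Table 3; the paper's own implementation delegates this step to the QuantEcon library's qe.MarkovChain (see Section 3.1.5).

```python
import numpy as np

def simulate_regimes(P, T, s0=0, rng=None):
    """Sample a path X_0, ..., X_{T-1} from a Markov chain with one-step
    transition matrix P (rows sum to one), starting in state s0."""
    rng = np.random.default_rng(rng)
    n = P.shape[0]
    states = np.empty(T, dtype=int)
    states[0] = s0
    for t in range(1, T):
        # next state depends only on the present state (Markov property)
        states[t] = rng.choice(n, p=P[states[t - 1]])
    return states

# Illustrative calm/volatile matrix: regimes are persistent.
P = np.array([[0.99, 0.01],
              [0.03, 0.97]])
regimes = simulate_regimes(P, T=252, s0=0, rng=42)
```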
2.1.5. Market Synthesis: Discrete Simulation
In our study, we employ a discrete simulation approach to model market dynamics, which can be effectively represented by the Euler-Maruyama method for stochastic differential equations. This method provides a numerical approximation of the continuous market processes in a discrete framework. By applying Ito's Lemma, the approximation is given by:

\Delta S_t \approx \left(\mu - \frac{\nu_t}{2} - \lambda\left(m + \frac{v^2}{2}\right)\right) S_t\, \Delta t + \sqrt{\nu_t}\, S_t\, Z \sqrt{\Delta t} + Y\, \Delta N(t). (2.9)

In Eqn. (2.9), Δ𝑆𝑡 is the change in asset price, 𝜇 represents the drift rate, and √𝜈𝑡 is the volatility factor scaled by the standard normal random variable 𝑍. 𝑌 is a normally distributed jump size with mean 𝑚 and variance 𝑣², and Δ𝑁(𝑡) denotes the jump process increments characterized by a Poisson distribution with intensity 𝜆Δ𝑡.
The variation in instantaneous variance 𝜈𝑡 is captured by Eqn. (2.10):

\Delta \nu_t = \kappa(\theta - \nu_t)\,\Delta t + \xi \sqrt{\nu_t}\left(\rho_{\epsilon}\, \epsilon_t^{P} + \sqrt{1 - \rho_{\epsilon}^2}\; \epsilon_t^{\nu}\right)\sqrt{\Delta t}, (2.10)

where 𝜅 is the rate at which 𝜈𝑡 reverts to its long-term mean 𝜃, and 𝜉 measures the volatility of the variance. The correlated standard normal white noises 𝜖𝑡^𝜈 and 𝜖𝑡^𝑃 introduce randomness with a correlation coefficient 𝜌𝜖. The factor √Δ𝑡 is introduced to scale the model appropriately in the discrete-time setting, reflecting the properties of Brownian motion increments.
Incorporating the Markov chain regime transition model into our discrete simulation, the constants 𝜇, 𝜃, 𝜉, 𝜌𝜖, 𝜆, 𝑚, and 𝑣² are adjusted for each regime. The adjustment is dictated by the state transitions determined by the Markov chain, where each state encapsulates a distinct market regime with its own parameter set. As the market transitions between regimes, these parameters change accordingly, aligning the simulation with the underlying stochastic process that reflects the dynamic financial market environment.
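A minimal sketch of one Euler-Maruyama update implementing Eqns. (2.9) and (2.10). Identifying the price shock 𝑍 with 𝜖𝑡^𝑃 (which induces the correlation 𝜌𝜖 between price and variance noises) and flooring 𝜈𝑡 at zero inside the square roots are our assumptions; the paper does not specify how the discretized variance is kept non-negative.

```python
import numpy as np

def euler_step(S, nu, p, dt, rng):
    """One Euler-Maruyama update of price S and variance nu per
    Eqns. (2.9)-(2.10). p maps parameter names (mu, kappa, theta, xi,
    rho_eps, lam, m, v) to the active regime's values."""
    eps_P = rng.standard_normal()     # price shock, identified with eps_t^P
    eps_nu = rng.standard_normal()
    mix = p["rho_eps"] * eps_P + np.sqrt(1.0 - p["rho_eps"] ** 2) * eps_nu
    nu_pos = max(nu, 0.0)             # truncation floor (our assumption)

    dN = rng.poisson(p["lam"] * dt)   # Poisson jump count over dt
    # Y * dN realized as the sum of dN i.i.d. N(m, v^2) jump sizes
    jump = rng.normal(p["m"] * dN, p["v"] * np.sqrt(dN)) if dN > 0 else 0.0

    drift = p["mu"] - 0.5 * nu_pos - p["lam"] * (p["m"] + 0.5 * p["v"] ** 2)
    S_next = S + drift * S * dt \
             + np.sqrt(nu_pos) * S * eps_P * np.sqrt(dt) + jump
    nu_next = nu_pos + p["kappa"] * (p["theta"] - nu_pos) * dt \
              + p["xi"] * np.sqrt(nu_pos) * mix * np.sqrt(dt)
    return S_next, nu_next
```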
2.2. Prototypical Financial Machine Learning Strategy
Developing a coherent machine-learning strategy in quantitative finance necessitates a meticulous fusion of statistical techniques and market knowledge. Our proposed methodology rigorously combines event-based triggers, trend-following mechanisms, and risk assessment tools to formulate a prototypical financial machine-learning strategy. It commences with precisely identifying market events through CUSUM filtering and progresses to ascertain trade directionality via momentum analysis. The core of the strategy harnesses meta-labeling to assess trade viability and employs an averaging approach to bet sizing sensitive to market conditions and position overlap. Integrating fractionally differentiated features alongside traditional technical indicators forms a robust feature set, ensuring the preservation of temporal dependencies and adherence to stationarity—a prerequisite for the successful application of predictive modeling in financial contexts.
2.2.1. Sampling: CUSUM Filtering
Portfolio management often relies on event-based triggers for investment decisions. These events may include structural breaks, signals, or microstructural changes, often prompted by macroeconomic news, volatility shifts, or significant price deviations. In this context, it is crucial to identify such events accurately, leveraging machine learning (ML) to ascertain the potential for reliable predictive models. The redefinition of significant events or the enhancement of feature sets is a continual process refined upon discovering non-predictive behaviors.
We employ the Cumulative Sum (CUSUM) filter as an event-based sampling technique for methodological rigor, as mentioned by Lopez de Prado [2018]. This method detects deviations in the mean of a quantity, denoting an event when a threshold is crossed. Given independent and identically distributed (IID) observations from a locally stationary process {𝑦𝑡}𝑡=1,…,𝑇, we define the CUSUM as:

S_t = \max\left\{0,\; S_{t-1} + y_t - \mathbb{E}_{t-1}\left[y_t\right]\right\}, (2.11)

with the initial condition 𝑆0 = 0. A signal for action is suggested at the smallest time 𝑡 where 𝑆𝑡 ≥ ℎ, with ℎ being the predefined threshold or filter size. It is notable that 𝑆𝑡 is reset to zero if 𝑦𝑡 ≤ 𝔼𝑡−1[𝑦𝑡] − 𝑆𝑡−1, which intentionally ignores negative shifts.
To encompass both positive and negative shifts, we extend this to a symmetric CUSUM filter:

S_t^{+} = \max\left\{0,\; S_{t-1}^{+} + y_t - \mathbb{E}_{t-1}\left[y_t\right]\right\}, \quad S_0^{+} = 0,
S_t^{-} = \min\left\{0,\; S_{t-1}^{-} + y_t - \mathbb{E}_{t-1}\left[y_t\right]\right\}, \quad S_0^{-} = 0, (2.12)
S_t = \max\left\{S_t^{+}, -S_t^{-}\right\}.

Adopting Lam and Yam [1997]'s strategy, we generate alternating buy-sell signals upon observing a return ℎ relative to a prior peak or trough, akin to the filter trading strategy by Fama and Blume [1966]. Our application of the CUSUM filter using Eqn. (2.12), however, is distinct; we only sample at bar 𝑡 if 𝑆𝑡 ≥ ℎ, subsequently resetting 𝑆𝑡, assuming 𝔼𝑡−1[𝑦𝑡] = 𝑦𝑡−1. We define 𝑦𝑡 as the natural logarithm of the asset's price to capture proportional price movements. The threshold ℎ is not static; instead, it adjusts dynamically with the daily volatility, ensuring sensitivity to market conditions.
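A sketch of the symmetric CUSUM filter of Eqn. (2.12) with 𝔼𝑡−1[𝑦𝑡] = 𝑦𝑡−1, so the increments are log returns. Resetting both accumulators once either side crosses the threshold is one reading of "subsequently resetting 𝑆𝑡"; the per-bar threshold can be fed a daily-volatility series to mirror the dynamic ℎ described above.

```python
import numpy as np

def cusum_events(log_price, h):
    """Symmetric CUSUM filter (Eqn. 2.12) on log prices. h may be a
    scalar or a per-bar array of volatility-scaled thresholds."""
    y = np.asarray(log_price, dtype=float)
    h = np.broadcast_to(np.asarray(h, dtype=float), y.shape)
    events, s_pos, s_neg = [], 0.0, 0.0
    for t in range(1, len(y)):
        dy = y[t] - y[t - 1]                 # E_{t-1}[y_t] = y_{t-1}
        s_pos = max(0.0, s_pos + dy)
        s_neg = min(0.0, s_neg + dy)
        if max(s_pos, -s_neg) >= h[t]:       # threshold crossed: sample bar t
            events.append(t)
            s_pos, s_neg = 0.0, 0.0          # reset after the event
    return np.array(events, dtype=int)
```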
2.2.2. Side Determination: Momentum Strategy
We employ a momentum strategy based on moving averages to determine the direction of trades signaled by the event-based CUSUM filter sampling. Specifically, we calculate two moving averages of the prices, a short-term moving average MAshort(𝑦𝑡) and a long-term moving average MAlong(𝑦𝑡), to identify the prevailing trend. The short-term moving average is responsive to recent price changes, while the long-term moving average captures the underlying trend. These moving averages are formulated as follows:

\mathrm{MA}_{\mathrm{short}}(y_t) = \frac{1}{N_{fast}} \sum_{i=0}^{N_{fast}-1} y_{t-i}, \qquad \mathrm{MA}_{\mathrm{long}}(y_t) = \frac{1}{N_{slow}} \sum_{i=0}^{N_{slow}-1} y_{t-i}, (2.13)

where 𝑁fast and 𝑁slow represent the number of periods for the fast (short-term) and slow (long-term) moving averages, respectively.
A position is taken based on the relative positioning of these moving averages after a CUSUM event. A trade is initiated based on these conditions:
1. Long Position: Triggered when MAshort(𝑦𝑡) surpasses MAlong(𝑦𝑡), signaling upward market momentum.
2. Short Position: Initiated when MAshort(𝑦𝑡) falls below MAlong(𝑦𝑡), indicating downward market momentum.
The strategy thus aligns the position with the current market trend, as indicated by the momentum in prices.
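A small sketch of the side rule: each sampled event receives +1 or −1 by comparing the two moving averages of Eqn. (2.13). The window pair (5, 20) is a placeholder; Section 3.3.1 sweeps several pairs.

```python
import numpy as np

def momentum_side(log_price, events, n_fast=5, n_slow=20):
    """Assign +1 (long) or -1 (short) to each CUSUM event index by
    comparing fast and slow moving averages over log prices."""
    y = np.asarray(log_price, dtype=float)
    sides = []
    for t in events:
        fast = y[max(0, t - n_fast + 1): t + 1].mean()
        slow = y[max(0, t - n_slow + 1): t + 1].mean()
        sides.append(1 if fast > slow else -1)   # ties default to short
    return np.array(sides, dtype=int)
```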
2.2.3. Size Determination: Meta-Labeling via Triple-Barrier Method
In our trading framework, once the side of a position is determined through the momentum strategy, it undergoes a rigorous evaluation via the triple-barrier method to ascertain its potential profitability. This evaluation forms the basis for position sizing, leveraging a meta-labeling approach introduced by Lopez de Prado [2018].
Upon identification of a trade's direction, the triple-barrier method applies three distinct barriers to determine the outcome of the position. The horizontal barriers are set according to a dynamic volatility-adjusted threshold for profit-taking and stop-loss, while the vertical barrier is defined by a predetermined expiration time, denoted as ℎ. The label assignment is as follows: hitting the upper barrier signifies a successful trade, hence labeled 1; conversely, touching the lower barrier first indicates a loss, labeled −1. If the vertical time barrier expires first, the label is determined by the sign of the return, reflecting the result of the trade within the period [𝑡𝑖,0, 𝑡𝑖,0 + ℎ].
The role of meta-labeling in this context is to scrutinize further the trades indicated by the primary momentum model. It confirms or refutes the suggested positions, effectively filtering out false positives and allowing for a calculated decision on the actual size of the investment. The meta-labeling process directly informs the appropriate risk allocation for each position by assigning a confidence level to each potential trade. This methodological step enhances the precision of our strategy and ensures that position sizing is aligned with the evaluated profitability of the trade, as indicated by the outcome of the triple-barrier assessment.
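A simplified sketch of the triple-barrier labeling just described, with the multipliers and horizon quoted later in Section 3.2.3 (0.5𝜎𝑡 profit-taking, 1.5𝜎𝑡 stop-loss, 20 days). Working on the signed cumulative log-return path, rather than on price levels, is our simplification.

```python
import numpy as np

def triple_barrier_label(log_price, t0, side, sigma,
                         pt_mult=0.5, sl_mult=1.5, horizon=20):
    """Label one event: +1 if the profit-taking barrier is hit first,
    -1 for the stop-loss, and the sign of the horizon return if the
    vertical barrier expires first."""
    y = np.asarray(log_price, dtype=float)
    t1 = min(t0 + horizon, len(y) - 1)
    path = side * (y[t0 + 1: t1 + 1] - y[t0])   # signed return path
    upper, lower = pt_mult * sigma, -sl_mult * sigma
    for r in path:
        if r >= upper:
            return 1                             # profit-taking barrier
        if r <= lower:
            return -1                            # stop-loss barrier
    return int(np.sign(path[-1])) if len(path) else 0
```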
2.2.4. Sample Weights: Label Uniqueness
The validity of the Independent and Identically Distributed (IID) assumption is a common shortfall in financial machine learning, as the overlapping intervals in the data often violate it. Specifically, labels 𝑦𝑖 and 𝑦𝑗 may not be IID if there is a shared influence from a common return $r_{t_{j,0},\,\min\{t_{i,1},\,t_{j,1}\}}$, where 𝑡𝑖,1 > 𝑡𝑗,0 for consecutive labels 𝑖 < 𝑗. To address the non-IID nature of financial datasets without compromising model granularity, we utilize sample weights as introduced by Lopez de Prado [2018]. This method recognizes the interconnectedness of data points and adjusts their influence on the model accordingly. By weighing samples based on their unique information and return impact, we enhance model robustness, enabling more accurate analysis of financial time series.
We define concurrent labels at time 𝑡 as those that are influenced by at least one shared return

r_{t-1,t} = \frac{p_t}{p_{t-1}} - 1. (2.14)

The concurrency of labels 𝑦𝑖 and 𝑦𝑗 does not necessitate a complete overlap in period; rather, it is sufficient that there is a partial temporal intersection involving the return at time 𝑡.
To quantify the extent of overlap, we construct a binary indicator array {1𝑡,𝑖}𝑖=1,…,𝐼 for each time 𝑡, where 1𝑡,𝑖 is set to 1 if the interval [𝑡𝑖,0, 𝑡𝑖,1] overlaps with [𝑡 − 1, 𝑡], and 0 otherwise. We then calculate the concurrency count at time 𝑡, given by

c_t = \sum_{i=1}^{I} 1_{t,i}. (2.15)

The uniqueness of a label is inversely proportional to the number of labels concurrent with it (Eqn. (2.15)). Consequently, we assign sample weights by inversely scaling them with the concurrency count while considering the magnitude of returns over the label's lifespan. For label 𝑖, the preliminary weight 𝑤̃𝑖 is computed as the norm of the sum of proportionally attributed returns:

\tilde{w}_i = \left\| \sum_{t=t_{i,0}}^{t_{i,1}} \frac{r_{t-1,t}}{c_t} \right\|. (2.16)

To facilitate a consistent scale for optimization algorithms that default to an assumption of unit sample weights, we normalize these preliminary weights calculated in Eqn. (2.16) so that they sum to the total number of labels 𝐼:

w_i = I \cdot \frac{\tilde{w}_i}{\sum_{j=1}^{I} \tilde{w}_j}. (2.17)

Eqn. (2.17) ensures that $\sum_{i=1}^{I} w_i = I$. Through this weighting scheme, we emphasize observations with greater absolute log returns that are less common, thereby enhancing the model's capacity to learn from unique and significant market events.
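The weighting scheme of Eqns. (2.15)-(2.17) in a few lines; spans is assumed to hold each label's (start, end) indices over the return series, with the absolute value standing in for the norm of Eqn. (2.16).

```python
import numpy as np

def uniqueness_weights(returns, spans):
    """Return-attribution sample weights (Eqns. 2.15-2.17).
    returns[t] is r_{t-1,t}; spans is a list of inclusive (t0, t1)
    label intervals over the returns array."""
    T, I = len(returns), len(spans)
    c = np.zeros(T)                           # concurrency count c_t
    for t0, t1 in spans:
        c[t0: t1 + 1] += 1.0
    w_tilde = np.empty(I)
    for i, (t0, t1) in enumerate(spans):      # Eqn. (2.16)
        w_tilde[i] = abs(np.sum(returns[t0: t1 + 1] / c[t0: t1 + 1]))
    return I * w_tilde / w_tilde.sum()        # Eqn. (2.17): weights sum to I
```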
‖𝜔𝑙∗ ‖ is not less than the threshold 𝜏, and
‖𝜔 ∗ ‖ falls below 𝜏. The adjusted weights 𝜔̃ are
‖ 𝑙 +1 ‖ 𝑘
then defined by: 10. ATR: The Average True Range quantifies market volatil-
{ ity by averaging true ranges over a period, reflecting
𝜔𝑘 if 𝑘 ≤ 𝑙∗ , the degree of price volatility.
𝜔̃ 𝑘 = (2.20)
0 if 𝑘 > 𝑙∗ . 11. Log DPO: The logarithm of the Detrended Price Os-
cillator compares rolling means at different periods to
Applying these truncated weights, the fractionally dif-
identify cyclical patterns in the price data.
ferentiated series 𝑋̃ 𝑡 is obtained through a finite sum:
12. MACD Position: Indicates the position of the MACD
Histogram relative to its signal line, with values above
∗
∑
𝑙
zero suggesting a bullish crossover and below zero a
𝑋̃ 𝑡 = 𝜔̃ 𝑘 𝑋𝑡−𝑘 , for 𝑡 = 𝑇 −𝑙∗ +1, … , 𝑇 . (2.21) bearish crossover.
𝑘=0
13. ADX Strength: Reflects the trend’s strength as mea-
The resultant series in Eqn. (2.21) is a driftless mix- sured by the ADX, categorizing trends as strong if
ture of the original level and noise components, pro- above a threshold value and weak if below.
viding a stationary series despite its non-Gaussian dis- 14. RSI Signal: Categorizes the RSI reading as signal-
tribution that exhibits memory-induced skewness and ing overbought conditions above a high threshold or
kurtosis. oversold conditions below a low threshold.
For a given time series {𝑋𝑡 }𝑡=1,…,𝑇 , the fixed-width 15. CCI Signal: Provides a signal based on the CCI read-
window fractional differentiation (FFD) approach is ing, indicating overbought or oversold conditions when
utilized to determine the order of differentiation 𝑑 ∗ crossing predefined threshold levels.
that achieves stationarity in the series {𝑋̃ 𝑡 }𝑡=𝑙∗ ,…,𝑇 us- 16. Stochastic Signal: Generates a signal from the Stochas-
ing ADF tests. The value of 𝑑 ∗ indicates the memory tic Oscillator, identifying overbought or oversold con-
that must be eliminated to attain stationarity. ditions based on threshold levels.
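A sketch of fixed-width-window fractional differentiation. The weight recursion is algebraically equivalent to Eqn. (2.19), and truncation below the threshold 𝜏 implements Eqn. (2.20); the search for the minimal stationary 𝑑∗ via ADF tests (e.g., looping over 𝑑 with statsmodels' adfuller) is omitted here.

```python
import numpy as np

def ffd_weights(d, tau=1e-4, max_k=10_000):
    """Weights of the fractional differencing operator, truncated once
    |omega_k| drops below tau (Eqns. 2.19-2.20)."""
    w = [1.0]
    for k in range(1, max_k):
        w_k = -w[-1] * (d - k + 1) / k      # recursion form of Eqn. (2.19)
        if abs(w_k) < tau:
            break
        w.append(w_k)
    return np.array(w)

def frac_diff_ffd(x, d, tau=1e-4):
    """Fractionally differentiate series x with a fixed-width window
    (Eqn. 2.21); the leading l* observations are left as NaN."""
    w = ffd_weights(d, tau)
    l = len(w) - 1
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    for t in range(l, len(x)):
        out[t] = np.dot(w, x[t - l: t + 1][::-1])  # newest point gets weight 1
    return out
```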
2. Volatility: Volatility is a fundamental feature that captures the magnitude of price movements and is critical for modeling risk and return in financial markets. The exponentially weighted moving average (EWMA) of volatility gives more weight to recent observations, making it a responsive measure of current market conditions. The EWMA volatility for a given day 𝑡 is calculated as follows:

\sigma_t^{EWMA} = \sqrt{\lambda \sigma_{t-1}^2 + (1 - \lambda) r_t^2}, (2.22)

where 𝑟𝑡 is the log return at time 𝑡, and 𝜆 is the decay factor that determines the weighting of past observations.
3. Z-Score: The Z-Score standardizes the log prices by their deviation from a rolling mean relative to the rolling standard deviation, highlighting price anomalies.
4. Log MACD Histogram: The difference between the logarithmically transformed MACD line and its corresponding signal line indicates momentum shifts.
5. ADX: The Average Directional Index measures the strength of a trend over a given period, with higher values indicating stronger trends.
6. RSI: The Relative Strength Index identifies conditions where the asset is potentially overbought or oversold, often signaling possible reversals.
7. CCI: The Commodity Channel Index detects cyclical trends in asset prices, often used to spot impending market reversals.
8. Stochastic: The Stochastic Oscillator compares the closing price to its price range over a specified period, indicating momentum.
9. ROC: The Rate of Change measures the velocity of price changes, with positive values indicating upward momentum and negative values indicating downward momentum.
10. ATR: The Average True Range quantifies market volatility by averaging true ranges over a period, reflecting the degree of price volatility.
11. Log DPO: The logarithm of the Detrended Price Oscillator compares rolling means at different periods to identify cyclical patterns in the price data.
12. MACD Position: Indicates the position of the MACD Histogram relative to its signal line, with values above zero suggesting a bullish crossover and below zero a bearish crossover.
13. ADX Strength: Reflects the trend's strength as measured by the ADX, categorizing trends as strong if above a threshold value and weak if below.
14. RSI Signal: Categorizes the RSI reading as signaling overbought conditions above a high threshold or oversold conditions below a low threshold.
15. CCI Signal: Provides a signal based on the CCI reading, indicating overbought or oversold conditions when crossing predefined threshold levels.
16. Stochastic Signal: Generates a signal from the Stochastic Oscillator, identifying overbought or oversold conditions based on threshold levels.
17. ROC Momentum: Categorizes the momentum based on the ROC, with positive values indicating an upward momentum and negative values a downward momentum.
18. Kumo Breakout: Identifies price breakouts from the Ichimoku Cloud, suggesting a bullish breakout when the price is above the cloud and bearish when below.
19. TK Position: Indicates the position of the Tenkan-sen relative to the Kijun-sen in the Ichimoku Indicator, with values above one suggesting a bullish crossover and below one a bearish crossover.
20. Price Kumo Position: Categorizes the price position relative to the Ichimoku Cloud, suggesting bullish sentiment when above the cloud and bearish when below.
21. Cloud Thickness: Measures the thickness of the Ichimoku Cloud by taking the logarithm of the ratio between the cloud spans, indicating market volatility and support/resistance strength.
22. Momentum Confirmation: Confirms the momentum indicated by the Ichimoku Indicator, with the Tenkan-sen above the cloud suggesting bullish momentum and below suggesting bearish momentum.
2.2.6. Bet Sizing: Averaging Active Bets
Proper bet sizing is crucial in implementing a successful investment strategy informed by machine learning predictions. We denote by 𝑝[𝑥] the probability of a label 𝑥 occurring, where 𝑥 ∈ {−1, 1}. To determine the appropriateness of a bet, we test the null hypothesis:

Null Hypothesis 1. 𝐻0 ∶ 𝑝[𝑥 = 1] = 1∕2.

Calculating the test statistic:

z = \frac{p[x=1] - \frac{1}{2}}{\sqrt{p[x=1]\left(1 - p[x=1]\right)}} \sim Z, (2.23)

where 𝑧 ∈ (−∞, +∞) and 𝑍 represents the standard normal distribution. The bet size is then derived as

m = 2\, Z[z] - 1, (2.24)

with 𝑚 ∈ [−1, 1] and 𝑍[⋅] being the cumulative distribution function (CDF) of 𝑍 for Eqn. (2.23). This formulation accounts for predictions originating from both meta-labeling and standard labeling estimators.
The process of bet sizing involves determining the size of individual bets based on the probability of outcomes and managing the aggregation of multiple bets that may be active concurrently. To manage multiple concurrent bets, we define a binary indicator {1𝑡,𝑖} for each bet 𝑖 at time 𝑡. This indicator takes the value of 1 if bet 𝑖 is active within the interval (𝑡 − 1, 𝑡], and 0 otherwise. The aggregate bet size at time 𝑡 is then the average of all active bet sizes, as shown in Eqn. (2.25):

m_t = \frac{\sum_{i=1}^{I} m_i\, 1_{t,i}}{\sum_{i=1}^{I} 1_{t,i}}, (2.25)

where 𝑚𝑖 is the individual bet size.
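The bet-sizing rule of Eqns. (2.23)-(2.25) as code; clipping probabilities away from 0 and 1 is our guard against division by zero, and bars with no active bet are assigned size zero by convention.

```python
import numpy as np
from scipy.stats import norm

def bet_size(prob):
    """Map predicted probabilities p[x=1] to bet sizes in [-1, 1]
    via the statistic of Eqn. (2.23) and m = 2*Z[z] - 1 (Eqn. 2.24)."""
    prob = np.clip(np.asarray(prob, dtype=float), 1e-10, 1 - 1e-10)
    z = (prob - 0.5) / np.sqrt(prob * (1.0 - prob))
    return 2.0 * norm.cdf(z) - 1.0

def average_active_bets(sizes, spans, T):
    """Average concurrently active bets (Eqn. 2.25). spans[i] = (t0, t1)
    marks the bars on which bet i is active."""
    num, den = np.zeros(T), np.zeros(T)
    for m_i, (t0, t1) in zip(sizes, spans):
        num[t0: t1 + 1] += m_i
        den[t0: t1 + 1] += 1.0
    return np.divide(num, den, out=np.zeros(T), where=den > 0)
```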
2.3. Strategy Trials
This section presents our strategy trials, which are integral to our financial machine-learning research. We employ a comprehensive methodology, examining machine learning models like k-Nearest Neighbors, Decision Trees, and XGBoost, each with unique parameter settings. Our approach deliberately tests these models under conditions conducive to overfitting to assess their robustness and adaptability. We also introduce the Momentum Cross-Over Strategy, utilizing various moving average window lengths to align trades with market trends. This combination of diverse models and adaptive strategies, processed through a systematic pipeline that includes event-based sampling, meta-labeling, and iterative optimization, is designed to rigorously evaluate the efficacy of trading strategies in complex market scenarios. The trials aim to balance the exploration of machine learning potentials in finance with the pragmatic challenges of real-world market conditions.
2.3.1. Machine Learning Models: An Overview
In our strategic analysis, we leverage various machine learning models, each with a distinct set of parameters. This approach is designed to rigorously test the models under varying conditions, potentially increasing the risk of overfitting. This methodological choice serves a dual purpose: firstly, to rigorously challenge the robustness of the models under extreme parameter conditions, and secondly, to examine the models' performance in scenarios prone to overfitting. This deliberate stress testing provides valuable insights into the resilience and adaptability of the algorithms in complex financial environments. The following models and their respective parameter sets are integral to this analysis (a sketch of the corresponding estimators follows the list):

1. K-Nearest Neighbors (k-NN): The k-NN model is predicated on feature similarity and is highly sensitive to the number of neighbors chosen. By experimenting with small numbers of neighbors, we expose the model to potential overfitting, where it might rely too heavily on immediate, possibly noisy data points. The model used in our study is a custom pipeline integrating standard scaling with the KNeighborsClassifier.
2. Decision Tree: Decision Trees, while interpretable, can easily overfit the training data, especially without constraints on tree depth. Our configuration tests the model in its most unconstrained form, providing insights into its behavior without regularizing parameters. Our implementation uses a Decision Tree Classifier with a predefined random state for reproducibility. The parameters include the maximum depth of the tree, the minimum number of samples required to split an internal node, and the minimum number of samples required to be at a leaf node.
3. XGBoost: XGBoost is an advanced implementation of gradient boosting algorithms known for its efficiency, flexibility, and portability. However, with excessively high values for parameters such as the number of estimators and learning rates, there is a risk of overfitting, where the model becomes overly tailored to the training data. It excels in handling sparse data and scales effectively across multiple cores. In our setup, the XGBoost Classifier is employed with specific parameters like the number of trees, maximum depth of trees, learning rate, and subsampling ratio of the training instances.

Each model is exhaustively assessed across its parameter space to evaluate its efficacy and robustness in various market scenarios. This extensive parameterization is a deliberate strategy to test the models' susceptibility to overfitting, a critical consideration in financial machine-learning applications.
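An illustrative construction of the three estimators named above, using sklearn and xgboost; the concrete hyperparameter values here are placeholders, not the grids actually swept in the trials.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def make_models(n_neighbors=3, random_state=0):
    """Illustrative versions of the three trial estimators."""
    return {
        # k-NN pipeline: standard scaling feeding the classifier
        "knn": Pipeline([("scale", StandardScaler()),
                         ("clf", KNeighborsClassifier(n_neighbors=n_neighbors))]),
        # unconstrained tree with a fixed random state for reproducibility
        "tree": DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                                       min_samples_leaf=1,
                                       random_state=random_state),
        # gradient boosting with deliberately aggressive settings
        "xgb": XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.3,
                             subsample=1.0, random_state=random_state),
    }
```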
2.3.2. Momentum Cross-Over Strategy: An Overview
The Momentum Cross-Over Strategy is a key element of our strategy trials, aiming to align trade directions with market trends detected through moving averages. This strategy's adaptability lies in its various combinations of window lengths for the moving averages, allowing it to capture market momentum over different time frames. By experimenting with multiple window length pairs, the strategy adjusts to various market conditions and introduces flexibility that increases the likelihood of overfitting. This approach ensures a thorough examination of market trends, aiming to optimize trade positions in line with the prevailing market direction.

2.3.3. Trials on Synthesized Data: The Pipeline
Our strategy trials employ a streamlined pipeline to assess the potential for overfitting in various trading strategies. The pipeline integrates event-based sampling, momentum strategy, machine learning models, and meta-labeling to simulate diverse market conditions and test strategy efficacy. The key steps of this pipeline are:

1. CUSUM Sampling: The process begins with the CUSUM filter, identifying significant market shifts based on deviations in log prices. This method generates signals for potential trading opportunities.
2. Momentum Cross-Over Strategy: Following CUSUM signals, the Momentum Cross-Over Strategy is applied. This step involves choosing window sizes for calculating moving averages and determining the trade direction based on their relative positions.
3. Machine Learning Model Selection: A machine learning model, such as k-NN, Decision Tree, or XGBoost, is selected with specific parameters. This stage tests model responses to trading signals, emphasizing the analysis of overfitting risks under varying parameter settings.
4. Meta-Labeling and Sample Weights: Trade signals are processed through meta-labeling using the Triple-Barrier Method while concurrently assigning sample weights to tackle the non-IID nature of financial data, thus enhancing the model's learning efficacy.
5. Model Fitting and Testing: The chosen model is fitted to the data, now with meta-labels and weights, to evaluate its predictive accuracy under synthesized conditions.

This pipeline approach critically examines the interplay between different components of trading strategies, focusing on the risk of overfitting. By simulating complex market scenarios, we aim to validate the robustness and adaptability of these strategies for real-world application.
2.4. Backtesting on Out-of-Sample Data: Cross-Validation
In quantitative finance, the rigor of a trading strategy is often validated through backtesting on out-of-sample data. This process involves assessing the strategy's performance using data not employed during the model's training phase, providing insights into its real-world applicability. Cross-validation (CV) techniques are pivotal, offering structured methods to evaluate the strategy's effectiveness and robustness under various market conditions. The methodologies for backtesting range from conventional approaches like K-Fold Cross-Validation, which divides the data into multiple segments for iterative testing, to more specialized methods like Walk-Forward Cross-Validation and Combinatorial Purged Cross-Validation. Each method has distinct characteristics in handling the data, particularly in addressing the challenges posed by the temporal dependencies and non-stationarity in financial time series. Understanding these methods' nuances in constructing backtest pathways is crucial for accurate model validation and developing robust trading strategies.
2.4.1. Conventional Approach: K-Fold Cross-Validation
K-Fold cross-validation is a widely recognized statistical method for validating the performance of predictive models, particularly in machine learning contexts. It involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set), and validating the analysis on the other subset (the test set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
In financial modeling, especially for backtesting trading strategies, applying K-Fold cross-validation presents unique challenges. Financial data are typically time-series data characterized by temporal dependencies and non-stationarity. These features of financial data violate the fundamental assumption of traditional K-Fold cross-validation, which assumes that the observations are independent and identically distributed (i.i.d.).
The process of K-Fold cross-validation in financial backtesting involves the following steps:

1. The entire dataset is divided into 𝑘 consecutive folds or segments.
2. For each iteration, a different fold is treated as the test set (or validation set), and the remaining 𝑘 − 1 folds are combined to form the training set.
3. The model is trained on the training set and validated on the test set.
4. The performance metric (e.g., Sharpe ratio, annualized return, drawdown) is recorded for each iteration.
5. After iterating through all folds, the performance metrics are aggregated to provide an overall performance estimate.

However, the temporal order of financial data necessitates careful handling. Shuffling or random data partitioning, as commonly done in other domains, can lead to significant biases and erroneous conclusions. For instance, using future data in constructing the training set, even inadvertently, introduces lookahead bias, severely compromising the model's validity.
Moreover, financial markets are influenced by macroeconomic factors and market regimes, leading to structural breaks. These factors can result in model performance that varies significantly across different periods, making it difficult to generalize the results obtained from a conventional K-Fold cross-validation approach.
Despite these limitations, K-Fold cross-validation is often used in preliminary model assessments, given its simplicity and widespread understanding in the statistical community. However, researchers in quantitative finance must supplement or replace this method with more appropriate techniques, such as Combinatorial Purged Cross-Validation, that account for the peculiarities of financial time series data. It is crucial to interpret the results of K-Fold cross-validation in the context of financial markets with caution, understanding that its assumptions may not fully align with the underlying data characteristics.
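A minimal evaluation loop for the steps above, using scikit-learn's KFold with shuffling disabled so the folds stay contiguous in time; n_splits=4 mirrors the configuration reported in Section 3.4.1, and metric is assumed to be any score function of (y_true, y_pred) over numpy arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_metrics(model, X, y, metric, n_splits=4):
    """Plain K-Fold evaluation: fit on k-1 folds, score on the held-out
    fold, and aggregate the per-fold metrics."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(metric(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores), scores
```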
2.4.2. Time-Consistent Validation: Walk-Forward Cross-Validation
Walk-forward cross-validation (WFCV) is a method specifically tailored for time series data, addressing the unique challenges posed by financial market data's temporal dependencies and non-stationarity. Unlike conventional K-Fold cross-validation, which can inadvertently introduce lookahead bias by shuffling data, WFCV respects the chronological order of observations, ensuring a more realistic and robust validation of trading strategies.
The WFCV process involves the following steps:

1. The dataset is divided into an initial training period and a subsequent testing period. The size of these periods can be fixed or expanding.
2. The model is trained on the initial training set and then tested on the subsequent testing period.
3. After the first validation, the training and testing windows are rolled forward. This means expanding or shifting the training period and testing on the new subsequent period.
4. This process is repeated until the entire dataset is traversed, with each iteration using a new testing period immediately following the training period.
5. Performance metrics are recorded for each testing period and aggregated to evaluate the strategy's overall effectiveness.

WFCV's primary advantage lies in its alignment with the practical scenarios encountered in live trading. Training and testing on consecutive data segments closely mimic the real-world situation where a model is trained on past data and deployed on future, unseen data. This sequential approach helps in understanding how a strategy adapts to evolving market conditions and objectively assesses its predictive power and robustness over time.
However, WFCV has its limitations. The repetitive retraining process can be computationally intensive, especially for large datasets and complex models. Additionally, the choice of the size of the training and testing windows can significantly impact the results, requiring careful consideration and sensitivity analysis.
WFCV is particularly pertinent in financial machine learning due to its ability to mitigate overfitting and model decay risks — common challenges in quantitative finance. It ensures that models are continuously updated and validated against the most recent data, reflecting the dynamic nature of financial markets.
Despite its advantages, WFCV should be employed as part of a comprehensive strategy validation framework, alongside other methods like combinatorial purged cross-validation, to fully account for the complexities of financial time series and to ensure robust model validation.
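A sketch of the walk-forward split generator implied by steps 1-4; expanding=True grows the training window while False rolls it forward. scikit-learn's TimeSeriesSplit offers a comparable expanding-window variant.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds=4, expanding=True):
    """Yield (train_idx, test_idx) pairs for walk-forward validation:
    each test window immediately follows its training window in time.
    The first segment serves only as the initial training period."""
    bounds = np.linspace(0, n_samples, n_folds + 1, dtype=int)
    for i in range(1, n_folds):
        start = 0 if expanding else bounds[i - 1]
        train_idx = np.arange(start, bounds[i])
        test_idx = np.arange(bounds[i], bounds[i + 1])
        yield train_idx, test_idx
```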
2.4.3. Leakage-Resistant Validation: Purged K-Fold
Purged K-Fold Cross-Validation is an advanced validation technique developed by Lopez de Prado [2018] to address the issue of information leakage in financial time series, a common pitfall in traditional cross-validation methods. This method is particularly suited for validating financial models, where the integrity of the temporal order of data is crucial for preventing look-ahead biases and ensuring realistic performance estimation.
The Purged K-Fold process involves several key modifications to the standard K-Fold cross-validation:

1. The dataset is partitioned into 𝑘 folds, ensuring that each fold is a contiguous segment of time to maintain the temporal order of observations.
2. Each fold is used once as the validation set, while the remaining folds form the training set. However, unlike standard K-Fold cross-validation, a "purging" process is implemented.
3. The purging process involves removing observations from the training set that occur after the start of the validation period. This is done to eliminate the risk of information leakage from the future (validation period) into the past (training period).
4. Additionally, an "embargo" period is applied after each training fold ends and before the next validation fold starts. This embargo period serves as a buffer zone to further mitigate the risk of leakage due to temporal dependencies that purging might not fully address.
5. The model is trained on the purged and embargoed training data and then validated on the untouched validation fold.
6. Performance metrics are recorded for each fold and aggregated to provide an overall assessment.

This methodology is particularly effective in financial machine learning, where models often capture temporal relationships, and even subtle information leakage can lead to over-optimistic performance estimates. Purged K-Fold Cross-Validation ensures a more robust and realistic evaluation of the model's predictive power by incorporating the purging and embargo mechanisms.
Purged K-Fold is especially relevant for strategies that rely on features extracted from historical data, as it ensures that the model is not inadvertently trained on future data. This method is essential for preventing the common pitfalls of overfitting and selection bias in financial modeling.
While Purged K-Fold Cross-Validation offers significant advantages in maintaining data integrity, it requires careful consideration of the lengths of the purge and embargo periods, which should be tailored to the specific temporal dependencies in the analyzed financial data.
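A sketch of purged splits with an embargo, in the spirit of Lopez de Prado [2018]. It assumes each label i starts at index i and carries an information end time t1[i] (e.g., the triple-barrier exit); training labels whose spans touch the test window are purged, and — as in the common implementation — labels starting immediately after the test window are embargoed. The embargo fraction is a tunable assumption.

```python
import numpy as np

def purged_kfold_splits(t1, n_splits=4, embargo_frac=0.01):
    """Yield purged and embargoed (train_idx, test_idx) pairs.
    t1[i] is the index at which label i's information ends."""
    t1 = np.asarray(t1)
    n = len(t1)
    embargo = int(n * embargo_frac)
    bounds = np.linspace(0, n, n_splits + 1, dtype=int)
    starts = np.arange(n)                     # label i starts at index i
    for i in range(n_splits):
        test_start, test_end = bounds[i], bounds[i + 1]
        test_idx = np.arange(test_start, test_end)
        # purge: training labels whose [start, t1] span overlaps the test window
        overlaps = (starts < test_end) & (t1 >= test_start)
        # embargo: labels starting right after the test window
        embargoed = (starts >= test_end) & (starts < test_end + embargo)
        train_idx = np.where(~overlaps & ~embargoed)[0]
        yield train_idx, test_idx
```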
2.4.4. Multi-Scenario, Leakage-Free Validation: Combinatorial Purged Cross-Validation
Combinatorial Purged Cross-Validation (CPCV) is introduced by Lopez de Prado [2018] as an innovative approach to address the limitations of single-path testing inherent in conventional Walk-Forward and Cross-Validation methods. This method is specifically designed for the complex environment of financial machine learning, where temporal dependencies and non-stationarity are prevalent. CPCV generates multiple backtesting paths and integrates a purging mechanism to eliminate the risk of information leakage from training observations.
The CPCV method is implemented as follows:

1. The dataset, consisting of 𝑇 observations, is partitioned into 𝑁 non-overlapping groups. These groups maintain the chronological order of data, where the first 𝑁 − 1 groups each have a size of ⌊𝑇∕𝑁⌋, and the 𝑁-th group contains the remaining observations.
2. For a selected size 𝑘 of the testing set, CPCV calculates the number of possible training/testing splits as $\binom{N}{N-k}$. Each combination involves 𝑘 groups for testing, and the total number of groups tested is $\binom{N}{N-k} \times k$, ensuring a uniform distribution across all 𝑁 groups.
3. From the combinatorial splits, each group is uniformly included in the testing sets. This process results in a comprehensive series of backtest paths, given by the combinatorial number $\frac{k}{N}\binom{N}{N-k}$.
4. Paths are generated by training classifiers on a portion of the data, specifically $1 - \frac{k}{N}$, for each combination. The algorithm ensures that the portion of data in the training set is balanced against the number of paths and the size of the testing sets.
5. The CPCV backtesting algorithm involves purging and embargoing, as introduced before. Each path results from combining forecasts from different groups and split combinations, ensuring a comprehensive evaluation of the classifier's performance.
6. After processing all paths, the performance metrics from each path are aggregated to assess the overall effectiveness of the model, providing insights into its robustness and consistency across various market conditions.

CPCV's unique combinatorial approach allows for a thorough evaluation of the model under diverse scenarios, addressing the critical overfitting issue. It provides a more nuanced and accurate assessment of a model's predictive capabilities in the dynamic field of financial markets.
While CPCV offers an extensive validation framework, its combinatorial nature can be computationally demanding. Therefore, it is essential to consider computational resources and execution time, particularly for large financial datasets.
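The combinatorics behind steps 2-3, enumerated explicitly: with 𝑁 groups and 𝑘 test groups there are C(N, N−k) splits and (k∕N)·C(N, N−k) backtest paths.

```python
from itertools import combinations
from math import comb

def cpcv_splits(N=6, k=2):
    """Enumerate CPCV train/test group assignments and count paths.
    Example: N=6, k=2 gives C(6, 4) = 15 splits and 2/6 * 15 = 5 paths."""
    n_splits = comb(N, N - k)
    n_paths = k * n_splits // N
    splits = []
    for test_groups in combinations(range(N), k):
        train_groups = [g for g in range(N) if g not in test_groups]
        splits.append((train_groups, list(test_groups)))
    return n_splits, n_paths, splits

n_splits, n_paths, splits = cpcv_splits(6, 2)   # 15 splits, 5 paths
```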
2.4.5. Scenario Creation: Constructing Backtest Pathways
The creation of backtest pathways varies significantly among different cross-validation methods. Traditional Cross-Validation (CV), Walk-Forward (WF) Validation, and Combinatorial Purged Cross-Validation (CPCV) each have distinct methodologies for generating these paths. Understanding these differences is crucial for selecting the appropriate validation method in financial modeling.

1. Traditional Cross-Validation (CV):
(a) In traditional CV, the dataset is divided into 𝑘 folds. Each fold is a validation set once, while the remaining folds constitute the training set.
(b) The backtest path in CV is linear and sequential. Each fold's validation results contribute to a single aggregated performance metric.
(c) This method does not account for the temporal order of data, which can lead to unrealistic backtest paths in financial time series due to potential information leakage and autocorrelation.

2. Walk-Forward (WF) Validation:
(a) WF Validation involves an expanding or rolling window approach. The dataset is sequentially divided into a training set followed by a validation set.
(b) The unique aspect of WF is its chronological alignment. The window rolls forward, ensuring the validation set always follows the training set in time.
(c) WF creates a single backtest path that closely mimics real-world trading scenarios. However, it tests the strategy only once, providing limited insight into its robustness under different market conditions.

3. Combinatorial Purged Cross-Validation (CPCV):
(a) CPCV enhances backtest pathways by introducing a combinatorial approach. The dataset is divided into 𝑁 groups, from which 𝑘 groups are selected in various combinations for training and testing.
(b) This method generates multiple backtest paths, each representing a different combination of training and validation sets. It addresses the issue of single-path dependency seen in WF and traditional CV.
(c) CPCV also incorporates purging and embargoing to prevent information leakage, making each path more realistic and reducing the risk of overfitting.
(d) The key advantage of CPCV is its ability to provide a comprehensive view of the strategy's performance across a range of scenarios, unlike the single scenario tested in WF and traditional CV.

Each CV method's approach to constructing backtest pathways has implications for its utility in financial modeling. Traditional CV's disregard for temporal order limits its applicability for financial time series. WF's single-path approach offers a realistic scenario but lacks robustness testing. CPCV, with its multiple, purged combinatorial paths, offers a comprehensive evaluation of a strategy's performance, making it particularly suitable for complex financial markets where multiple scenarios are critical for understanding a strategy's effectiveness.

2.5. Assessment of Backtest Overfitting
In the quest to develop robust trading strategies within quantitative finance, the assessment of backtest overfitting emerges as a crucial facet. This section delves into the methodologies deployed to evaluate and mitigate the risk of overfitting, a common pitfall where strategies appear effective in retrospective analyses but falter in prospective applications. Two pivotal concepts, the Probability of Backtest Overfitting (PBO) and the Deflated Sharpe Ratio (DSR), are harnessed to scrutinize the reliability of backtested strategies.
PBO is gauged through Combinatorially Symmetric Cross-Validation (CSCV), a technique that rigorously tests strategy performance across diverse market scenarios. Concurrently, DSR offers a refined perspective on strategy efficacy by adjusting the Probabilistic Sharpe Ratio (PSR) for multiple trials, thus enhancing the authenticity of our backtesting results. Together, these methodologies furnish a comprehensive framework for evaluating the integrity of trading strategies, ensuring that they are not merely artifacts of historical data but are genuinely predictive and robust against future market conditions.

2.5.1. Probability of Backtest Overfitting: Combinatorially Symmetric Cross-Validation
Backtest trials are pivotal in the realm of quantitative finance, particularly in the development of trading strategies. Utilizing the methodology outlined in previous sections, we perform multiple backtest trials, ideally selecting the optimal strategy based on its performance in these trials. However, this approach inherently risks backtest overfitting, where a strategy might show exceptional performance in a historical context but fail to generalize to new, unseen data. To quantitatively assess and mitigate this risk, we calculate the Probability of Backtest Overfitting (PBO) using the Combinatorially Symmetric Cross-Validation (CSCV) method as introduced by Bailey et al. [2016]. CSCV provides a more robust measure of a strategy's effectiveness by examining its performance across different segments of market data, allowing us to evaluate the consistency of trial returns both in-sample and out-of-sample.
The CSCV process is outlined in the following steps:

1. Formation of a performance matrix 𝑀 of size 𝑇 × 𝑁, where each column represents the log returns series for a specific model configuration over 𝑇 time observations.
2. Partitioning of 𝑀 into 𝑆 disjoint submatrices 𝑀𝑠 of equal dimensions, each of order (𝑇∕𝑆) × 𝑁.
3. Formation of combinations 𝐶𝑆 of these submatrices, taken in groups of size 𝑆∕2, yielding a total number of combinations calculated as:

\binom{S}{S/2} = \prod_{i=0}^{S/2 - 1} \frac{S - i}{S/2 - i}. (2.26)

4. For each combination 𝑐 ∈ 𝐶𝑆, the following steps are carried out:
(a) Formation of the training set 𝐽 and the testing set 𝐽̄.
(b) Computation of the performance statistic vectors 𝑅 and 𝑅̄ for the training and testing sets, respectively.
(c) Identification of the optimal model 𝑛∗ in the training set and determination of its relative rank 𝜔̄𝑐 in the testing set.
(d) Definition of the logit $\lambda_c = \log \frac{\bar{\omega}_c}{1 - \bar{\omega}_c}$.
5. Finally, the PBO is estimated by calculating the distribution of ranks out-of-sample (OOS) and integrating the probability distribution function 𝑓(𝜆) as:

\mathrm{PBO} = \int_{-\infty}^{0} f(\lambda)\, d\lambda, (2.27)

where the PBO represents the probability of in-sample optimal strategies underperforming out-of-sample.

This rigorous statistical approach leading to Eqn. (2.27) allows us to evaluate the extent of overfitting in our strategy development process, ensuring that selected strategies are robust and not merely tailored to historical market idiosyncrasies.
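A condensed CSCV sketch following steps 1-5; the per-trial performance statistic is a plain mean return rather than a full Sharpe ratio, and the rank is clipped away from 0 and 1 to keep the logit finite — both simplifications of ours.

```python
import numpy as np
from itertools import combinations

def probability_of_backtest_overfitting(M, S=8):
    """CSCV estimate of PBO (Bailey et al. [2016]). M is a T x N matrix
    of trial returns, one column per strategy configuration."""
    T, N = M.shape
    blocks = np.array_split(np.arange(T), S)     # S disjoint submatrices
    logits = []
    for train_ids in combinations(range(S), S // 2):
        train = np.concatenate([blocks[s] for s in train_ids])
        test = np.concatenate([blocks[s] for s in range(S)
                               if s not in train_ids])
        R = M[train].mean(axis=0)                # in-sample statistic per trial
        R_bar = M[test].mean(axis=0)             # out-of-sample statistic
        n_star = np.argmax(R)                    # best trial in-sample
        rank = (R_bar < R_bar[n_star]).sum() / N # relative OOS rank omega_bar
        rank = min(max(rank, 1.0 / N), 1 - 1.0 / N)
        logits.append(np.log(rank / (1 - rank)))
    # PBO: mass of the logit distribution at or below zero (Eqn. 2.27)
    return np.mean(np.array(logits) <= 0)
```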
2.5.2. Probability of False Discovery: The Deflated Sharpe Ratio
In selecting the optimal strategy from multiple backtest trials, a key concern is the probability of false discovery, which refers to the likelihood that the observed performance of a strategy is due to chance rather than true predictive power. To address this, we use the Deflated Sharpe Ratio (DSR), which extends the Probabilistic Sharpe Ratio (PSR) concept to account for the multiplicity of trials.
The PSR, as introduced by Bailey and Lopez de Prado [2012], adjusts the observed Sharpe Ratio ($\widehat{SR}$) by accounting for the distributional properties of returns, such as skewness and kurtosis. It is calculated as:

\widehat{PSR}(SR^*) = Z\left[\frac{\left(\widehat{SR} - SR^*\right)\sqrt{T - 1}}{\sqrt{1 - \hat{\gamma}_3\, \widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\, \widehat{SR}^2}}\right], (2.28)

where 𝑍[⋅] is the cumulative distribution function (CDF) of the standard Normal distribution, 𝑇 is the number of observed returns, 𝛾̂3 is the skewness of the returns, and 𝛾̂4 is the kurtosis of the returns. 𝑆𝑅∗ is a benchmark Sharpe ratio against which the observed Sharpe ratio is compared.
The Deflated Sharpe Ratio (DSR), as introduced by Bailey and López de Prado [2014b], refines the Probabilistic Sharpe Ratio (PSR) as given in Eqn. (2.28) by considering the number of independent trials. This refinement yields a more precise measure of the probability of false discovery when multiple strategies are tested. Specifically, the DSR employs a benchmark Sharpe ratio (𝑆𝑅∗), calculated in Eqn. (2.29), that is influenced by the variance of the estimated Sharpe Ratios ($\widehat{SR}_n$) from the trials and the number of trials (𝑁), and incorporates the Euler-Mascheroni constant (𝛾):

SR^* = \sqrt{V\left\{\widehat{SR}_n\right\}}\left((1 - \gamma)\, Z^{-1}\!\left[1 - \frac{1}{N}\right] + \gamma\, Z^{-1}\!\left[1 - \frac{1}{N} e^{-1}\right]\right). (2.29)
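The PSR and DSR of Eqns. (2.28)-(2.29) as code, using scipy for the Normal CDF and the sample skewness/kurtosis; the variance of the trial Sharpe ratios is taken directly from the supplied array of trial estimates.

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    """PSR of Eqn. (2.28) from a series of per-period returns."""
    r = np.asarray(returns, dtype=float)
    T = len(r)
    sr = r.mean() / r.std(ddof=1)
    g3 = skew(r)
    g4 = kurtosis(r, fisher=False)       # raw kurtosis (3 for a Gaussian)
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * np.sqrt(T - 1) / denom)

def deflated_sharpe_ratio(returns, trial_srs):
    """DSR: PSR evaluated against the multiple-testing benchmark SR*
    of Eqn. (2.29), built from the variance of the trial Sharpe ratios."""
    trial_srs = np.asarray(trial_srs, dtype=float)
    N = len(trial_srs)
    gamma = 0.5772156649015329           # Euler-Mascheroni constant
    sr_star = np.sqrt(trial_srs.var(ddof=1)) * (
        (1 - gamma) * norm.ppf(1 - 1 / N)
        + gamma * norm.ppf(1 - 1 / (N * np.e)))
    return probabilistic_sharpe_ratio(returns, sr_star)
```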
Table 1
Parameterization of the Heston and Merton Jump Diffusion Models for Calm and Volatile Market Regimes

Parameter                                  Calm Regime    Volatile Regime
Heston Stochastic Volatility
  Expected Return (𝜇)                      0.1            0.1
  Mean Reversion Rate (𝜅)                  3.98           3.81
  Long-term Variance (𝜃)                   0.029          0.25056
  Volatility of Variance (𝜉)               0.389645311    0.59176974
  Correlation Coefficient (𝜌)              -0.7           -0.7
Merton Jump Diffusion
  Jump Intensity (𝜆)                       121            121
  Mean of Logarithmic Jump Size (𝑚)        -0.000709      -0.000709
  Variance of Logarithmic Jump Size (𝑣)    0.0119         0.0119

Parameter                                  Value
Bubble Length (𝑇bubble)                    5 × 252 days
Pre-Burst Drift Parameter (𝑎before)        0.35
Post-Burst Drift Parameter (𝑎after)        -0.35
Pre-Burst Volatility Parameter (𝑏before)   0.458
Post-Burst Volatility Parameter (𝑏after)   0.458
Drift Burst Intensity (𝛼)                  0.75
Volatility Burst Intensity (𝛽)             0.225
Explosion Filter Width                     0.1
3.1.4. Market Regime Dynamics: Markov Chain Transition Modeling
In our simulation, the transitions between market regimes are governed by a Markov chain model, drawing insights from the works of Xie and Deng [2022] and Elliott et al. [2016] on regime-switching Heston models, as shown in Table 3. The transition matrix, pivotal to the Markov chain model, is meticulously calibrated based on these references to represent regime shifts accurately, providing a realistic portrayal of market regime dynamics within our synthetic controlled environment.

Figure 3: Regime Transition Diagram

3.1.5. Putting Them All Together: Synthetic Controlled Market Environment
In this section, we present the integration of a comprehensive synthetic market environment, utilizing a blend of the Heston Stochastic Volatility and Merton Jump Diffusion models. Our implementation leverages the Python programming language, numpy for numerical computations, the @jit decorator for performance optimization, and the QuantEcon library's qe.MarkovChain for Markov chain generation. Reproducibility of the stochastic elements is ensured through np.random seeding.
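As an illustration of this machinery, a minimal sketch of daily regime generation with QuantEcon follows; the transition matrix below is a hypothetical placeholder rather than the calibrated matrix referenced in Table 3.

    import numpy as np
    import quantecon as qe

    np.random.seed(0)  # reproducibility of the stochastic elements

    # Hypothetical calm/volatile transition matrix; the paper calibrates its own
    # matrix from Xie and Deng [2022] and Elliott et al. [2016].
    P = np.array([[0.99, 0.01],
                  [0.02, 0.98]])

    mc = qe.MarkovChain(P, state_values=("calm", "volatile"))
    regime_path = mc.simulate(ts_length=5 * 252, init="calm")  # one regime per trading day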
The combined dynamics, discretized for simulation, are:

$$\Delta S_t = \left(\mu - \frac{1}{2}\nu_t - \lambda\left(m + \frac{v^2}{2}\right)\right) S_t\,\Delta t + \sqrt{\nu_t}\,S_t\,Z\sqrt{\Delta t} + Y\,\Delta N(t),$$

$$\Delta \nu_t = \kappa(\theta - \nu_t)\,\Delta t + \xi\sqrt{\nu_t}\left(\rho_\epsilon\,\epsilon_t^P + \sqrt{1-\rho_\epsilon^2}\,\epsilon_t^\nu\right)\sqrt{\Delta t}.$$
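The following sketch shows one way to discretize these dynamics with an Euler scheme, using the parameter names of Table 1 and numba's @jit as described above; it is our simplified illustration (treating v_jump as the jump-size standard deviation), not the authors' exact implementation.

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def simulate_heston_merton(S0, v0, mu, kappa, theta, xi, rho,
                               lam, m, v_jump, dt, n_steps):
        """Euler scheme for the jump-diffusion dynamics above (simplified sketch)."""
        S = np.empty(n_steps + 1)
        V = np.empty(n_steps + 1)
        S[0], V[0] = S0, v0
        for t in range(n_steps):
            z = np.random.normal()                      # price shock Z (also eps^P)
            eps_v = np.random.normal()                  # idiosyncratic variance shock eps^nu
            n_jumps = np.random.poisson(lam * dt)       # Delta N(t)
            Y = np.sum(np.random.normal(m, v_jump, n_jumps)) * S[t]  # aggregate jump term
            drift = (mu - 0.5 * V[t] - lam * (m + 0.5 * v_jump ** 2)) * S[t] * dt
            S[t + 1] = S[t] + drift + np.sqrt(V[t]) * S[t] * z * np.sqrt(dt) + Y
            dV = kappa * (theta - V[t]) * dt + xi * np.sqrt(V[t]) * (
                rho * z + np.sqrt(1.0 - rho ** 2) * eps_v) * np.sqrt(dt)
            V[t + 1] = max(V[t] + dV, 1e-8)             # keep the variance positive
        return S, V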
Table 4
Descriptive Statistics of Log Returns Overall and For Each
Regime
falls below $MA_{long}(y_t)$, indicative of downward momentum. This approach aligns trading actions with the prevailing market trend, as reflected in the price momentum.

3.2.3. Meta-Labeling Strategy
Incorporating Lopez de Prado [2018]'s meta-labeling with the triple-barrier method, our strategy evaluates trades after the momentum-based direction determination. This process crucially informs position sizing decisions and enhances trade selection accuracy. The triple-barrier method applies two horizontal barriers for profit-taking and stop-loss, set with dynamic volatility-adjusted thresholds of $0.5\sigma_t$ and $1.5\sigma_t$ respectively, and a vertical barrier with an expiration time of 20 working days. The outcome of a trade is determined as follows: hitting the upper (profit-taking) barrier first results in a label of 1 for a successful trade, while reaching the lower (stop-loss) barrier first assigns a label of -1 for an unsuccessful trade. If neither horizontal barrier is hit within the vertical time frame, the trade is evaluated based on the sign of the return at the end of this period.
This methodical approach to feature selection, blending fractional differentiation with technical analysis, enables our model to capture intricate market dynamics effectively. The average Pearson correlation between the 22 features extracted from each of the 1000 generated price pathways is demonstrated in Figure 9.
We generate a comprehensive array of trials by alternating between different sets of rolling window sizes in the momentum strategy and a diverse range of hyperparameters in the machine learning models. This approach allows us to assess the performance impact of these variables under varying market conditions and model specifications. Each trial represents a unique combination of these configurations, providing us with a broad spectrum of insights into the dynamics of our financial strategy.
3.3.1. Momentum Cross-Over Strategy Variations
Our financial model evaluates momentum cross-over strategy variations by altering the moving averages' rolling window sizes. We test four distinct configurations: (5, 10), (20, 50), (50, 100), and (70, 140), representing various pairs of fast and slow moving-average window sizes. These trials systematically examine the strategy's performance under diverse temporal dynamics.
3.3.2. Machine Learning Models Variations
Our strategy explores various machine learning models to predict meta-labels, each with specific parameter configurations. Certain hyperparameters are explored with a single candidate value, while others are tested across multiple values, ensuring all cases are comprehensively utilized in our strategy trials. The configurations are strategically chosen to heighten the potential for overfitting. We have the following model configurations (see the sketch after this list):

1. k-Nearest Neighbors (k-NN): Implemented via sklearn.neighbors.KNeighborsClassifier, with the number of neighbors varied as n_neighbors: [1, 2, 3]. The data is standardized using sklearn.preprocessing.StandardScaler within a custom pipeline extending sklearn.pipeline.Pipeline, which incorporates sample weights.

2. Decision Tree: Utilized through sklearn.tree.DecisionTreeClassifier, with parameters set to min_samples_split: [2] and min_samples_leaf: [1].

3. XGBoost: Executed using xgboost.XGBClassifier, with parameters including n_estimators: [1000], max_depth: [1000000000], learning_rate: [1, 10, 100], subsample: [1.0], and colsample_bytree: [1.0].
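To illustrate how these variations combine into trials, the sketch below enumerates a hypothetical grid pairing each momentum window configuration with each candidate model; the 4 × 7 = 28 combinations match the 28 strategy trials analyzed later, though the exact pairing scheme is our assumption.

    from itertools import product
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # Hyperparameter candidates as described in the list above
    model_grids = (
        [KNeighborsClassifier(n_neighbors=k) for k in (1, 2, 3)],
        [DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1)],
        [XGBClassifier(n_estimators=1000, max_depth=1000000000, learning_rate=lr,
                       subsample=1.0, colsample_bytree=1.0) for lr in (1, 10, 100)],
    )
    models = [m for grid in model_grids for m in grid]            # 3 + 1 + 3 = 7 models

    momentum_windows = [(5, 10), (20, 50), (50, 100), (70, 140)]  # fast/slow pairs
    trials = list(product(momentum_windows, models))              # 4 x 7 = 28 trials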
3.4. Out-of-Sample Testing via Cross-Validation
In our quantitative finance framework, we apply a comprehensive suite of cross-validation techniques to conduct out-of-sample testing, employing the robust CrossValidatorController for initializing different validation methods. This includes K-Fold, Walk-Forward, Purged K-Fold, and Combinatorial Purged Cross-Validation, each specifically adapted to the challenges of financial time series data. Utilizing CrossValidator.backtest_predictions, we generate backtest paths for each cross-validation method, comprising probabilities corresponding to the meta-labels. For labels encountered across multiple backtest paths, we average their probabilities, creating a consolidated measure that informs subsequent strategy performance calculations. Integrating traditional and innovative cross-validation methodologies, this meticulous approach ensures robustness and accuracy in comparing our out-of-sample testing procedures.

3.4.1. Implementation of K-Fold Cross-Validation
Our implementation of K-Fold Cross-Validation (KFold) in financial modeling utilizes the KFold class within the CrossValidatorController framework. Configured with n_splits=4, this approach partitions the dataset into four distinct segments, adhering to the conventional methodology of KFold. Each segment sequentially serves as a test set, while the remaining data forms the training set. This structure is pivotal in our financial time series analysis, where it is crucial to avoid look-ahead bias and maintain the chronological integrity of data.

    CrossValidatorController(
        'kfold',
        n_splits=4,
    ).cross_validator

Given the nature of financial data, characterized by temporal dependencies, our KFold implementation is tailored to respect these sequences, ensuring more accurate and realistic model validation. This adherence to the time series structure in our KFold setup underscores our commitment to rigorous, temporally-aware analytical practices in financial modeling.
3.4.2. Implementation of Walk-Forward Cross-Validation
Implemented using the WalkForward class, our Walk-Forward Cross-Validation (WFCV) employs the CrossValidatorController with n_splits=4, indicating a division of the dataset into four sequential segments. This ensures chronological training and testing phases, which is crucial for maintaining temporal integrity in financial data analysis. This approach, emphasizing the sequence and structure of data, mirrors real-world financial market dynamics and is key to achieving a realistic assessment of model performance. The specific parameterization of WFCV underscores our commitment to temporal consistency and robust validation in financial modeling.

    CrossValidatorController(
        'walkforward',
        n_splits=4,
    ).cross_validator

By aligning model evaluation with the chronological progression of market data, this configuration enhances the reliability and relevance of our strategy assessments.
3.4.3. Implementation of Purged K-Fold Cross-Validation
The implementation of Purged K-Fold Cross-Validation in our framework leverages the PurgedKFold class through the CrossValidatorController, specifically tailored for financial time series data. Configured with n_splits=4, an embargo rate of embargo=0.02, and time-based partitioning, this approach rigorously maintains the integrity of the temporal order. The initialization parameters ensure that the dataset is divided into four contiguous segments, each representing a distinct period in time.

    CrossValidatorController(
        'purgedkfold',
        n_splits=4,
        times=times,
        embargo=0.02
    ).cross_validator

This structure is instrumental for mitigating information leakage and look-ahead biases by purging training data that overlaps with the validation period and implementing an embargo period. Such modifications are crucial in financial modeling, where the chronological sequence of data plays a pivotal role in the validity and realism of backtesting results. Our Purged K-Fold setup, therefore, ensures a more authentic and reliable assessment of the model's predictive capabilities.
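The purge-and-embargo logic can be conveyed with a minimal index-splitting sketch; this simplification drops an embargo window of training samples immediately after each test fold but omits the label-overlap purging driven by the times argument.

    import numpy as np

    def purged_kfold_indices(n_samples: int, n_splits: int = 4, embargo: float = 0.02):
        """Yield (train, test) index arrays with an embargo after each test fold."""
        embargo_size = int(n_samples * embargo)
        for test in np.array_split(np.arange(n_samples), n_splits):
            after_start = min(test[-1] + 1 + embargo_size, n_samples)  # skip embargoed samples
            train = np.concatenate([np.arange(0, test[0]),
                                    np.arange(after_start, n_samples)])
            yield train, test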
3.4.4. Implementation of Combinatorial Purged Cross-Validation

    CrossValidatorController(
        'combinatorialpurged',
        n_splits=8,
        n_test_groups=2,
        times=times,
        embargo=0.02
    ).cross_validator
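Given Eqn. (2.26), the n_splits=8 and n_test_groups=2 configuration forms every pairing of two test groups from eight partitions; a quick sketch of the enumeration (names here are illustrative):

    from itertools import combinations
    from math import comb

    n_splits, n_test_groups = 8, 2
    test_group_sets = list(combinations(range(n_splits), n_test_groups))
    assert len(test_group_sets) == comb(n_splits, n_test_groups) == 28  # C(8, 2) splits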
We apply these out-of-sample testing procedures to our collection of 28 strategy trials. The analysis is methodically structured to encompass a holistic evaluation of the entire performance timeline and an annualized, segmented examination. Each year is meticulously analyzed, considering 252 trading days per segment. This dual-faceted analysis offers insights into the strategies' overall and specific yearly performances and serves as a litmus test for the effectiveness of different out-of-sample testing techniques in curbing overfitting. The scatter plot in Figure 10 illustrates a negligible correlation of -0.03 between the Probability of Backtest Overfitting (PBO) and the Best Trial Deflated Sharpe Ratio (DSR) Test Statistic in the overall analysis, signaling their independence as evaluative tools. Their independence is instrumental, as it implies a multi-faceted assessment of backtest validity, combining robustness checks against overfitting with adjustments for multiple hypothesis testing, thereby enriching the strategy selection process with diverse yet complementary reliability metrics.
Table 5
Distributions Comparison for Probability of Backtest Overfitting Values Across Simulations For Each Cross-Validation Method

Test                                      P-Value     Effect Size (η²)
Kruskal-Wallis                            7.05e-09    0.01022

Dunn's Test                               p-Value     Significant
Combinatorial Purged vs. K-Fold           4.20e-06    Yes
Combinatorial Purged vs. Purged K-Fold    3.32e-07    Yes
Combinatorial Purged vs. Walk-Forward     1.09e-06    Yes
K-Fold vs. Purged K-Fold                  1.0         No
K-Fold vs. Walk-Forward                   1.0         No
Purged K-Fold vs. Walk-Forward            1.0         No
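For transparency, the following minimal sketch reproduces the type of comparison reported in these tables, assuming the scipy and scikit_posthocs packages; the input values here are randomly generated placeholders, not the paper's simulation results.

    import numpy as np
    from scipy import stats
    import scikit_posthocs as sp  # assumed dependency for Dunn's post-hoc test

    # Hypothetical PBO values across simulations for each cross-validation method
    rng = np.random.default_rng(0)
    pbo_by_method = {name: rng.uniform(0.0, 1.0, 1000) for name in
                     ("Combinatorial Purged", "K-Fold", "Purged K-Fold", "Walk-Forward")}

    samples = list(pbo_by_method.values())
    H, p = stats.kruskal(*samples)                      # Kruskal-Wallis H test
    k, n = len(samples), sum(len(s) for s in samples)
    eta_squared = (H - k + 1) / (n - k)                 # eta-squared effect size
    dunn_p_values = sp.posthoc_dunn(samples, p_adjust="bonferroni")  # pairwise Dunn's test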
Table 8
Distributions Comparison for Best Trial Deflated Sharpe Ratio Test Statistic Efficiency Ratio Values Across Simulations For Each Cross-Validation Method

Test                                      P-Value     Effect Size (η²)
Kruskal-Wallis                            1.43e-76    0.0888

Dunn's Test                               p-Value     Significant
Combinatorial Purged vs. K-Fold           8.14e-23    Yes
Combinatorial Purged vs. Purged K-Fold    9.29e-22    Yes
Combinatorial Purged vs. Walk-Forward     1.06e-07    Yes
K-Fold vs. Purged K-Fold                  1.00        No
K-Fold vs. Walk-Forward                   2.15e-54    Yes
Purged K-Fold vs. Walk-Forward            9.60e-54    Yes

Figure 15: Comparison of Temporal Probability of Backtest Overfitting and Best Trial Deflated Sharpe Ratio Test Statistic ADF Test Statistic Values Across Simulations For Each Cross-Validation Method

Table 9
Distributions Comparison for Probability of Backtest Overfitting ADF Test Statistic Values Across Simulations For Each Cross-Validation Method

Test                                      P-Value     Effect Size (η²)
Kruskal-Wallis                            0.0         0.55106

Dunn's Test                               p-Value     Significant
Combinatorial Purged vs. K-Fold           1.0         No
Combinatorial Purged vs. Purged K-Fold    1.0         No
Combinatorial Purged vs. Walk-Forward     0.0         Yes
K-Fold vs. Purged K-Fold                  1.0         No
K-Fold vs. Walk-Forward                   0.0         Yes
Purged K-Fold vs. Walk-Forward            0.0         Yes

3.7.3. Temporal Stationarity of Overfitting Assessment
In our annual time series analysis of the Probability of Backtest Overfitting (PBO), the Augmented Dickey-Fuller (ADF) test statistic values were utilized to examine the stationarity of the PBO through time. These values are depicted in Figure 15 and quantitatively analyzed in Table 9. The 'Walk-Forward' method exhibited a markedly higher median ADF value of -2.41, indicating less stationarity and greater trend presence than the other methods. The Kruskal-Wallis test yielded a significant result (p < 0.01, η² = 0.55), implying substantial differences in the time series characteristics among the methods. Dunn's Test revealed that the 'Walk-Forward' method's ADF values were significantly different from those of 'K-Fold', 'Purged K-Fold', and 'Combinatorial Purged' (all with p = 0.0), underscoring its distinct behavior in terms of stationarity. These findings suggest that while 'Walk-Forward' might be more prone to exhibit trends in PBO over time, the other methods did not show significant differences among themselves, indicating similar levels of stationarity in their respective PBO values.

The stationarity of the annual Best Trial Deflated Sharpe Ratio (DSR) Test Statistic values was assessed using the Augmented Dickey-Fuller (ADF) test, with the distributions visualized in Figure 15 and the statistical analysis detailed in Table 10. The 'Walk-Forward' approach demonstrated a higher median ADF value of -3.86, suggesting a weaker presence of stationarity compared to the more negative ADF values of the other methods, which imply a stronger rejection of the unit root and thus a stronger indication of stationarity. The Kruskal-Wallis test provided extremely significant evidence of distributional differences among the methods (p = 2.01e-50, η² = 0.059). Dunn's Test further identified significant differences between 'Walk-Forward' and all other methods, with 'Walk-Forward' being less stationary compared to 'K-Fold' (p = 2.38e-37), 'Purged K-Fold' (p = 1.65e-31), and 'Combinatorial Purged' (p = 9.81e-36). These results indicate that 'Walk-Forward' may be less suitable for strategies that require a consistent DSR over time, pointing to persistent trends or systemic influences in the 'Walk-Forward' method.

Table 10
Distributions Comparison for Best Trial Deflated Sharpe Ratio Test Statistic ADF Test Statistic Values Across Simulations For Each Cross-Validation Method

Test                                      P-Value     Effect Size (η²)
Kruskal-Wallis                            2.01e-50    0.05853

Dunn's Test                               p-Value     Significant
Combinatorial Purged vs. K-Fold           1.0         No
Combinatorial Purged vs. Purged K-Fold    1.0         No
Combinatorial Purged vs. Walk-Forward     9.81e-36    Yes
K-Fold vs. Purged K-Fold                  1.0         No
K-Fold vs. Walk-Forward                   2.38e-37    Yes
Purged K-Fold vs. Walk-Forward            1.65e-31    Yes
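The ADF statistics summarized here can be produced with statsmodels; a minimal sketch on a hypothetical annual PBO series (placeholder data, not the paper's results) is:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(1)
    annual_pbo = rng.uniform(0.2, 0.8, size=30)     # hypothetical annual PBO series
    adf_stat, p_value = adfuller(annual_pbo, autolag="AIC")[:2]
    # A more negative adf_stat rejects the unit root more strongly, indicating stationarity.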
5. Conclusions

Our investigation into cross-validation methodologies in financial modeling has revealed critical insights, especially the superiority of the 'Combinatorial Purged' method in minimizing overfitting risks. This method outperforms traditional approaches like 'K-Fold', 'Purged K-Fold', and notably 'Walk-Forward' in terms of both the Probability of Backtest Overfitting (PBO) and the Deflated Sharpe Ratio (DSR) Test Statistic. 'Walk-Forward', in contrast, shows limitations in preventing false discoveries and exhibits greater temporal variability and weaker stationarity in our temporal assessment of these methodologies using the Efficiency Ratio and the Augmented Dickey-Fuller (ADF) test, raising concerns about its reliability. On the other hand, 'Combinatorial Purged' demonstrates enhanced stability and efficiency, proving to be a more reliable choice for financial strategy development. The choice between 'Purged K-Fold' and 'K-Fold' requires caution, as they show no significant performance difference, and 'Purged K-Fold' may reduce the robustness of training data for out-of-sample testing. These findings significantly contribute to quantitative finance, providing a robust framework for cross-validation that aligns theoretical robustness with practical reliability. They underscore the need for tailored evaluation methods in an era of complex algorithms and large datasets, guiding decision-making in a data-driven financial world. Future research should extend these findings to real-world market conditions to enhance their applicability and generalizability.

References
[1] David H. Bailey and Marcos López de Prado. The Sharpe ratio efficient frontier. Journal of Risk, 15(2):13, 2012.
[2] David H. Bailey and Marcos López de Prado. The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting and non-normality. Journal of Portfolio Management, 40(5):94–107, 2014b.
[3] David H. Bailey, Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu. Pseudomathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the AMS, 61(5):458–471, 2014a.
[4] David H. Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu. The probability of backtest overfitting. Journal of Computational Finance, forthcoming, 2016.
[5] Carlo Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.
[6] Leo Breiman. Classification and Regression Trees. Routledge, 2017.
[7] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[8] Kim Christensen, Roel Oomen, and Roberto Renò. The drift burst hypothesis. Journal of Econometrics, 227(2):461–497, 2022.
[9] Olive Jean Dunn. Multiple comparisons using rank sums. Technometrics, 6(3):241–252, 1964.
[10] Robert J. Elliott, Katsumasa Nishide, and Carlton-James U. Osakwe. Heston-type stochastic volatility with a Markov switching regime. Journal of Futures Markets, 36(9):902–919, 2016.
[11] Eugene F. Fama and Marshall E. Blume. Filter rules and stock-market trading. The Journal of Business, 39(1):226–241, 1966.
[12] Evelyn Fix and Joseph Lawson Hodges. Nonparametric discrimination: Consistency properties. Randolph Field, Texas, Project, pages 21–49, 1951.
[13] James D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
[14] Floyd B. Hanson and Zongwu Zhu. Comparison of market parameters for jump-diffusion distributions using multinomial maximum likelihood estimation. In 2004 43rd IEEE Conference on Decision and Control (CDC), volume 4, pages 3919–3924. IEEE, 2004.
[15] Steven L. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2):327–343, 1993.
[16] Ulrich Homm and Jörg Breitung. Testing for speculative bubbles in stock markets: A comparison of alternative methods. Journal of Financial Econometrics, 10(1):198–231, 2012.
[17] Kin Lam and H. C. Yam. CUSUM techniques for technical trading in financial markets. Financial Engineering and the Japanese Markets, 4:257–274, 1997.
[18] Marcos López de Prado. The future of empirical finance. Journal of Portfolio Management, 41(4), 2015.
[19] Marcos López de Prado. Advances in Financial Machine Learning. John Wiley & Sons, 2018.
[20] Marcos López de Prado. Machine Learning for Asset Managers. Cambridge University Press, 2020.
[21] Robert C. Merton. Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics, 3(1):125–144, 1976.
[22] Andrew Papanicolaou and Ronnie Sircar. A regime-switching Heston model for VIX and S&P 500 implied volatilities. Quantitative Finance, 14(10):1811–1827, 2014.
[23] Michael Schatz and Didier Sornette. Inefficient bubbles and efficient drawdowns in financial markets. International Journal of Theoretical and Applied Finance, 23(07):2050047, 2020.
[24] Laerd Statistics. Kruskal-Wallis H test using SPSS Statistics. Statistical tutorials and software guides, 2015.
[25] Yurong Xie and Guohe Deng. Vulnerable European option pricing in a Markov regime-switching Heston model with stochastic interest rate. Chaos, Solitons & Fractals, 156:111896, 2022.