0% found this document useful (0 votes)
75 views

Backtest Overfitting in The Machine Learning Era - A Comparison of Out-of-Sample Testing Methods in A Synthetic Controlled Environment

Backtest Overfitting in the Machine Learning Era - A Comparison of Out-of-Sample Testing Methods in a Synthetic Controlled Environment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
75 views

Backtest Overfitting in The Machine Learning Era - A Comparison of Out-of-Sample Testing Methods in A Synthetic Controlled Environment

Backtest Overfitting in the Machine Learning Era - A Comparison of Out-of-Sample Testing Methods in a Synthetic Controlled Environment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

Backtest Overfitting in the Machine Learning Era:

A Comparison of Out-of-Sample Testing Methods in a Synthetic Controlled


Environment
Hamid Ariana,∗,1 , Daniel Norouzi Mobarekehb,2 and Luis Secoc,3
c Universityof Toronto, 40 St George St, Toronto, Ontario Canada M5S 2E4
a York University, 4700 Keele St, Toronto, Ontario, Canada M3J 1P3
b Sharif University of Technology, Teymoori Sq, Tehran, Iran 1459973941

ARTICLE INFO ABSTRACT


Keywords: This research explores the integration of advanced statistical models and machine learning in financial
Quantitative Finance analytics, representing a shift from traditional to advanced, data-driven methods. We address a critical
Machine Learning gap in quantitative finance: the need for robust model evaluation and out-of-sample testing method-
Cross-Validation ologies, particularly tailored cross-validation techniques for financial markets. We present a compre-
Probability of Backtest Overfitting hensive framework to assess these methods, considering the unique characteristics of financial data
like non-stationarity, autocorrelation, and regime shifts. Through our analysis, we unveil the marked
superiority of the Combinatorial Purged (CPCV) method in mitigating overfitting risks, outperform-
ing traditional methods like K-Fold, Purged K-Fold, and especially Walk-Forward, as evidenced by
its lower Probability of Backtest Overfitting (PBO) and superior Deflated Sharpe Ratio (DSR) Test
Statistic. Walk-Forward, by contrast, exhibits notable shortcomings in false discovery prevention,
characterized by increased temporal variability and weaker stationarity. This contrasts starkly with
CPCV’s demonstrable stability and efficiency, confirming its reliability for financial strategy devel-
opment. The analysis also suggests that choosing between Purged K-Fold and K-Fold necessitates
caution due to their comparable performance and potential impact on the robustness of training data
in out-of-sample testing. Our investigation utilizes a Synthetic Controlled Environment incorporat-
ing advanced models like the Heston Stochastic Volatility, Merton Jump Diffusion, and Drift-Burst
Hypothesis, alongside regime-switching models. This approach provides a nuanced simulation of
market conditions, offering new insights into evaluating cross-validation techniques. Our study un-
derscores the necessity of specialized validation methods in financial modeling, especially in the face
of growing regulatory demands and complex market dynamics. It bridges theoretical and practical
finance, offering a fresh outlook on financial model validation. Highlighting the significance of ad-
vanced cross-validation techniques like CPCV, our research enhances the reliability and applicability
of financial models in decision-making.

advanced, data-driven approaches signifies a new epoch in


financial analysis. This new era is marked by the capabil-
1. Introduction ity to process and analyze extensive datasets, revealing in-
tricate market patterns previously obscured by the limita-
1.1. Background tions of conventional methods. A key catalyst for this trans-
The financial sector has witnessed a paradigmatic shift formation has been the significant advancements in compu-
by integrating sophisticated statistical models and machine tational technology and the rise of high-frequency trading
learning techniques into its analytical framework. This piv- practices. These developments have led to a fundamental
otal transition from traditional quantitative methods to more change in market dynamics. As a result, the need for robust
⋆ and reliable model evaluation methodologies, particularly in
The authors are listed in alphabetical order.
∗ Corresponding author: Hamid Arian, Assistant Professor of Finance, cross-validation techniques, has gained unprecedented im-
York University, Toronto, Ontario, Canada, email: [email protected] portance. Such methods are integral to maintaining the in-
ORCID : 0000-0002-4624-9421 tegrity and effectiveness of financial models, which are es-
1 Assistant Professor of Finance, York University, Toronto, Ontario,

Canada
sential in guiding decision-making processes across a spec-
email: [email protected] trum of financial activities, from asset allocation to risk man-
2 BSc student of Applied Mathematics & Economics, Sharif University agement, in both buy-side and sell-side institutions.
of Technology, Tehran, Iran
email: [email protected]
3 Professor, University of Toronto, Ontario, Canada 1.2. Motivation
email: [email protected] The impetus for our research stems from a pivotal ob-
The presentation slides and a commentary on this article are available servation: despite substantial progress in financial modeling
on RiskLab’s website at the University of Toronto: risklab.ca/backtesting. and an escalating reliance on machine learning algorithms,
The architecture of the codes of this article is explained on risklab.ai/
there is a glaring shortfall in effectively validating these mod-
backtesting, in both Python and Julia programming languages. The repro-
ducible results of this paper are based on authors’ Python implementation els within the ambit of financial markets. This research gap
on RiskLab’s GitHub page: github.com/RiskLabAI. becomes more pronounced when considering the extensive

Arian, Norouzi, Seco Page 1 of 26


Electronic copy available at: https://ssrn.com/abstract=4686376
Backtest Overfitting in the Machine Learning Era

literature on predicting market factors. Yet, there is a con- 1.3.2. Rising Concerns over Backtest Overfitting and
spicuous lack of discussion on tailoring cross-validation al- False Discoveries
gorithms to accurately assess these models (Lopez de Prado The evolution of financial modeling has necessitated ad-
[2018, 2020]). Further complicating this landscape is the vanced methodologies to effectively address the challenges
paucity of research dedicated to critically evaluating the back- of overfitting and false discoveries in strategy evaluation. Pi-
testing and cross-validation algorithms. We hypothesize that oneering contributions by Bailey et al. [2016] and Bailey and
the limited exploration in this domain is attributable to the López de Prado [2014b], brought to the fore the need for
inherent complexities of financial datasets, which are typ- rigorous evaluation of trading strategies. They introduced
ically noisy, non-stationary, and characterized by intricate quantifiable metrics like the Probability of Backtest Over-
patterns shaped by various variables, from macroeconomic fitting (PBO) and the Deflated Sharpe Ratio (DSR), which
shifts to market sentiments. These unique dataset attributes provided a statistical basis to assess the reliability of back-
often render traditional cross-validation methods insufficient tested strategies. Despite these advancements, a significant
or misleading (Lopez de Prado [2018]). The grave conse- gap exists in the literature: a comprehensive framework link-
quences of model inaccuracies in this context cannot be over- ing backtest overfitting assessment with the effectiveness of
stated, as they can lead to substantial financial losses and out-of-sample testing methodologies. Our study addresses
pose systemic risks. This highlights the critical need to de- this gap by proposing a novel framework that evaluates out-
velop and refine cross-validation methodologies for navigat- of-sample testing techniques through the prism of backtest
ing financial data nuances. While hedge funds and invest- overfitting. By integrating key concepts such as PBO and
ment firms might have practical approaches to address these DSR into our analysis, we aim to provide a holistic evalua-
challenges, there is a stark silence in the academic literature tion of CV methods, ranging from traditional data science
on this imperative issue. Our study seeks to bridge this gap, approaches to innovative financial models like those pro-
providing insights and methodologies vital for the rigorous posed by Lopez De Prado. This approach ensures financial
evaluation of financial models, thereby catering to finance’s models’ robustness and predictive power, filling a critical
academic and practical realms. void in quantitative finance.

1.3. Literature Review 1.3.3. Exploring Market Dynamics with Synthetic


1.3.1. Evolution of Backtesting on Out-of-Sample Controlled Environment
Data Advancements in synthetic data generation within finan-
The evolution of cross-validation (CV) methodologies in cial analysis have seen the integration of sophisticated mod-
quantitative finance has been marked by a significant transi- els that adeptly replicate complex market dynamics. Our
tion from traditional data science approaches to more spe- study’s Synthetic Controlled Environment embraces this com-
cialized techniques tailored for financial market data. Con- plexity by merging the Heston Stochastic Volatility Model,
ventional methods like K-Fold Cross-Validation and Walk- √ by the stochastic differential equation 𝑑𝑆𝑡 =
characterized
Forward Cross-Validation, while effective in various ana- 𝜇𝑆𝑡 𝑑𝑡+ 𝜈𝑡 𝑆𝑡 𝑑𝑊𝑡𝑆 (Heston [1993]), with the Merton Jump
lytical contexts, have shown limitations when applied to fi- Diffusion Model (Merton [1976]), which introduces jumps
nancial markets due to their inability to adequately account in asset prices through 𝑑𝑆𝑡 = 𝜇𝑆𝑡 𝑑𝑡 + 𝜎𝑆𝑡 𝑑𝑊𝑡 + 𝑆𝑡 𝑑𝐽𝑡 .
for the temporal dependencies and non-stationarity inherent Following this, we explore the context of speculative mar-
in financial time series. Recognizing these shortcomings, ket bubbles. Schatz and Sornette [2020] categorizes bub-
Lopez de Prado introduced advanced CV techniques specif- bles into Type-I and Type-II, with Type-I characterized by
ically designed for financial applications. Purged K-Fold an efficient full price process 𝑆 but inefficiencies in both
Cross-Validation, as outlined by Lopez de Prado [2018], en- pre-drawdown 𝑆̃ and drawdown 𝑋 processes, and Type-II
hances the standard K-Fold method by incorporating a ’purg- by an efficient drawdown process 𝑋 but an overall inefficient
ing’ mechanism, eliminating data from the training set that 𝑆. This categorization provides a nuanced understanding of
could inadvertently leak information about the test set. This bubble dynamics in financial markets. Building on this foun-
approach is particularly critical in financial modeling to pre- dation, our environment further incorporates the Drift Burst
vent lookahead biases. Further advancing the field, Lopez Hypothesis (Christensen et al. [2022]), articulating short-
de Prado’s Combinatorial Purged Cross-Validation (CPCV) lived market anomalies through equations 𝜇𝑡db = 𝑎|𝜏db −𝑡|−𝛼
method offers a robust solution for backtesting trading strate- and 𝜎𝑡vb = 𝑏|𝜏db −𝑡|−𝛽 , emphasizing the critical interplay be-
gies. Unlike traditional CV methods, CPCV creates multiple tween drift and volatility during such events. The addition
training and testing combinations, ensuring that each data of the Markov Chain model for regime transitions (Hamil-
segment is used for training and validation, thus providing a ton [1994]), characterized by its transition matrix 𝑃 = [𝑃𝑖𝑗 ],
more comprehensive assessment of a strategy’s performance enables the simulation to adeptly mirror fluid market states
across various market scenarios. This method respects the adeptly, capturing the ephemeral nature of financial mar-
chronological ordering of data and effectively addresses the kets. This innovative amalgamation of stochastic volatil-
risk of overfitting, a prevalent issue in the development of ity, jump-diffusion, bubble dynamics, and regime-switching,
financial models. cohesively combined in our Synthetic Controlled Environ-
ment, sets a groundbreaking precedent in the domain of fi-

Arian, Norouzi, Seco Page 2 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

nancial model testing and validation, presenting a compre- 1.6. Contribution


hensive framework for evaluating out-of-sample testing method- This research significantly contributes to quantitative fi-
ologies in the nuanced and intricate world of quantitative fi- nance by pioneering a comprehensive framework for evalu-
nance. ating out-of-sample testing methodologies, particularly in fi-
nancial modeling. We bridge a notable gap in the existing lit-
1.4. Problem Statement erature by linking the concept of backtest overfitting, as en-
Central to our study is a problem of increasing concern capsulated by metrics like the Probability of Backtest Over-
in financial analytics: developing a robust and reliable out- fitting (PBO) and the Deflated Sharpe Ratio (DSR), with
of-sample testing methodology congruent with the unique the efficacy of out-of-sample testing methods. Our innova-
attributes of financial time series data. This issue is mul- tive approach enhances the accuracy and reliability of cross-
tifaceted. Firstly, financial time series are characterized by validation techniques, addressing the challenges posed by
non-stationarity, autocorrelation, heteroskedasticity, and regime the temporal complexities and non-stationarity of financial
shifts, challenging the applicability of conventional out-of- time series. We leverage the advanced statistical models
sample testing methods. Secondly, the temporal dynamics of the Heston Stochastic Volatility, Merton Jump Diffusion,
of financial data, with intricate lead-lag relationships and and Markov Chain for regime transitions, combined with ex-
evolutionary patterns, demand an out-of-sample testing ap- ploring market dynamics through speculative bubbles and
proach that preserves the chronological sequence of data to the Drift Burst Hypothesis. This synthesis provides a more
avoid look-ahead bias and overfitting, issues frequently en- nuanced simulation of market conditions, offering fresh in-
countered in applying machine learning models in finance. sights and methodologies that can significantly improve decision-
Despite the remarkable advancements in integrating statisti- making processes in various financial applications, from risk
cal models and machine learning techniques into financial management to algorithmic trading. Our work advances the
analysis, a significant gap persists in accurately assessing field by presenting a novel and holistic perspective on model
these models, particularly under the challenges of backtest validation in the ever-evolving quantitative finance domain,
overfitting and the dynamic nature of financial markets. Our thus enhancing both financial practices and academic research.
study specifically targets the inadequacy of existing cross-
validation techniques, which, while robust in traditional data 1.7. Scope and Limitations
science contexts, fall short of fully capturing the temporal This study evaluates cross-validation methods within syn-
dependencies and non-stationarity of financial data. This thetic market environments meticulously engineered to en-
gap is further widened by the lack of a comprehensive frame- compass diverse market conditions. Our research uses so-
work that integrates the assessment of backtest overfitting phisticated statistical models to dissect the intricacies of back-
with the effectiveness of out-of-sample testing methodolo- test overfitting in these rigorously constructed settings. A
gies. The significance of this problem is not limited to the- notable limitation of our approach is the reliance on syn-
oretical modeling but has far-reaching implications in prac- thetic data, which, while providing controlled experimental
tical aspects like risk management, algorithmic trading, and conditions, might not fully capture the complex, often unpre-
portfolio optimization. Inaccuracies in model validation can dictable dynamics of real-world financial markets. Conse-
lead to substantial financial risks and losses, accentuating quently, extrapolating our findings to actual market scenarios
the need for rigorous, tailored validation methods, especially should be cautiously approached, especially when consider-
under increasing regulatory scrutiny. ing applications in live trading environments or risk manage-
ment strategies. Moreover, while comprehensive, the spe-
1.5. Objectives of the Study cific choice of models and simulation parameters implies
The central objective of our study is to develop a compre- certain constraints. This necessitates further empirical val-
hensive evaluation framework for cross-validation methods idation in diverse, real-market contexts to enhance the gen-
in financial modeling, particularly in the context of evolving eralizability of our results. Our study’s primary aim is to
market complexities and the challenges of backtest overfit- enrich the domain of financial model validation, striking a
ting. By incorporating key concepts like the Probability of crucial balance between theoretical depth and practical rel-
Backtest Overfitting (PBO) and the Deflated Sharpe Ratio evance and paving the way for subsequent research to build
(DSR), our framework holistically assesses various cross- upon these foundational insights.
validation approaches, ranging from traditional data science
methods to more sophisticated financial models. The em- 1.8. Organization of the Paper
phasis of this study is not inherently on the incorporation of This paper is systematically structured to explore cross-
sophisticated methodologies; rather, it centers on a critical validation techniques in synthetic market environments com-
assessment of the efficacy of these approaches when con- prehensively. The paper opens with Introduction 1, setting
sidering the distinctive attributes inherent in financial data. the stage by delineating the research background, objectives,
We aim to bridge the gap between theoretical robustness and and the scope of the study. Following this, the Methodology
practical reliability in financial models, enhancing their ap- section 2 explores the details of the statistical models and
plicability in high-stakes financial decision-making, from as- algorithms employed, outlining the framework for synthetic
set allocation to risk management. data generation and analysis. The Empirical Results sec-
tion 3 thoroughly examines our rigorous testing and analysis

Arian, Norouzi, Seco Page 3 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

findings, providing insights into the performance and robust- where the instantaneous variance, 𝜈𝑡 , adheres to the Feller
ness of various cross-validation methods. In the Discussion square-root or Cox-Ingersoll-Ross (CIR) process:
section 4, we interpret these findings, contextualizing them √
within the broader landscape of quantitative finance and dis- 𝑑𝜈𝑡 = 𝜅(𝜃 − 𝜈𝑡 )d𝑡 + 𝜉 𝜈𝑡 d𝑊𝑡𝜈 , (2.2)
cussing their implications. The paper culminates with the
with 𝑊𝑡 and 𝑊𝑡𝜈 representing Wiener processes, exhibiting
Conclusion section 5, where we summarize the key take-
a correlation of 𝜌.
aways, acknowledge the limitations of our study, and suggest
The model described in Eqn. (2.1) and Eqn. (2.2) uses
directions for future research.
four main parameters. 𝜃 is the long-term average variance,
showing the expected variance that 𝜈𝑡 will approach as 𝑡 in-
2. Methodology creases. 𝜌 describes the correlation between the two Wiener
The methodology section forms the backbone of our re- processes in the model. 𝜅 shows how quickly 𝜈𝑡 returns to
search, presenting a comprehensive and systematic approach its long-term average, 𝜃. And 𝜉 is known as the ’volatility of
to exploring and analyzing financial market dynamics through volatility’, indicating how much 𝜈𝑡 can vary.
machine learning and statistical methods. This section out- A salient feature of this model is the Feller condition, ex-
lines the construction and utilization of a Synthetic Con- pressed as 2𝜅𝜃 > 𝜉 2 . Ensuring this inequality guarantees the
trolled Environment, which integrates complex market mod- strict positivity of the process, ensuring no negative values
els such as the Heston Stochastic Volatility and Merton Jump for variance.
Diffusion models and incorporates regime-switching dynam-
ics through Markov chains. Additionally, it addresses the 2.1.2. Jumps: The Merton Jump Diffusion Model
drift burst hypothesis to model market anomalies like spec- The Merton Jump Diffusion model by Merton [1976]
ulative bubbles and flash crashes. The methodology elabo- enhances the geometric Brownian motion proposed by the
rates on developing and evaluating a prototypical financial Black-Scholes model by integrating a discrete jump compo-
machine-learning strategy, encompassing event-based sam- nent to capture abrupt stock price movements. The stock
pling, trade directionality, bet sizing, and feature selection. price dynamics are given by:
Crucially, the methodology also delves into assessing back- 𝑑𝑆𝑡 = 𝜇𝑆𝑡 𝑑𝑡 + 𝜎𝑆𝑡 𝑑𝑊𝑡 + 𝑆𝑡 𝑑𝐽𝑡 , (2.3)
test overfitting through advanced statistical techniques, en-
suring the validity and robustness of the proposed trading In Eqn. (2.3), 𝜇𝑆𝑡 𝑑𝑡 is the drift term that captures the ex-
strategies. The methodologies are meticulously designed to pected return, 𝜎𝑆𝑡 𝑑𝑊𝑡 embodies the continuous random fluc-
capture the intricate nuances of financial markets, thereby tuations with 𝜎 being the stock’s volatility, and 𝑑𝑊𝑡 the stan-
enabling a thorough and accurate analysis of trading strate- dard Brownian motion increment, and 𝑆𝑡 𝑑𝐽𝑡 accounts for in-
gies within a controlled yet realistic market simulation. stantaneous jumps in the stock price.
The jump process 𝐽𝑡 in Eqn. (2.3) is defined as:
2.1. Synthetic Controlled Environment
In financial analysis, constructing a Synthetic Controlled ∑
𝑁(𝑡)

Environment is essential for thoroughly examining market 𝐽𝑡 = 𝑌𝑖 , (2.4)


𝑖=1
dynamics and validating theoretical models. This segment
delineates an integrated simulation architecture synthesiz- where 𝑁(𝑡) is a Poisson process with intensity 𝜆, and 𝑌𝑖
ing the Heston model’s stochastic volatility, Merton’s jump- represents logarithmic jump sizes, normally distributed with
diffusion framework, and the Markov chains’ regime-switching mean 𝑚 and standard deviation 𝑠.
nuance. It also contemplates the drift burst hypothesis to To simulate paths of 𝑆𝑡 , one evolves the stock price us-
capture transient market anomalies. These components con- ing the drift and diffusion terms, determines jumps based
struct a nuanced and comprehensive emulation of the finan- on 𝑁(𝑡), and adjusts the stock price according to the magni-
cial market’s complexity, serving as a critical substrate for tude from the 𝑌𝑖 distribution. By merging continuous price
the exploration and scrutiny of econometric theories. movements with jumps, this model potentially offers a more
accurate representation of real-world stock price behaviors
2.1.1. Random Walk: The Heston Stochastic Volatility than mere geometric Brownian motion.
Model
In modeling the stochastic behavior of the market price, 2.1.3. Speculative Bubbles & Flash Crashes: The
we employ the foundational Heston model, as articulated by Drift Burst Hypothesis
Heston [1993]. This model provides a framework that cap- In the study Christensen et al. [2022], the authors intro-
tures the intrinsic volatility dynamics of a financial asset. duce the drift burst hypothesis to elucidate the short-lived
At the heart of the Heston model lies the premise that the flash crashes evident in high-frequency tick data. This method-
asset price, 𝑆𝑡 , evolves according to the following stochastic ology zeroes in on the complex dance between drift and volatil-
differential equation: ity. They theorize that a sudden uptick in drift is only viable
√ if there’s a simultaneous surge in volatility. To articulate
𝑑𝑆𝑡 = 𝜇𝑆𝑡 d𝑡 + 𝜈𝑡 𝑆𝑡 d𝑊𝑡𝑆 , (2.1) this, they introduce the "volatility burst" concept, denoting a
rapid escalation in market volatility.

Arian, Norouzi, Seco Page 4 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

The drift’s sudden increase is concisely encapsulated in 2.1.5. Market Synthesis: Discrete Simulation
the equation: In our study, we employ a discrete simulation approach
to model market dynamics, which can be effectively repre-
𝜇𝑡db = 𝑎|𝜏db − 𝑡|−𝛼 . (2.5) sented by the Euler-Maruyama method for stochastic dif-
ferential equations. This method provides a numerical ap-
In Eqn. (2.5), 𝜇𝑡db describes the drift at a given time 𝑡 ac- proximation of the continuous market processes in a discrete
cording to its distance relative to the bursting time 𝜏db . The framework. By applying Ito’s Lemma, the approximation is
factor 𝑎 sets the scale of the drift, while 12 < 𝛼 < 1 measures given by:
how intense this drift spike is. ( ( ))
Similarly, the abrupt rise in volatility, or the "volatility 1 𝑣2
Δ𝑆𝑡 ≈ 𝜇 − 𝜈𝑡 − 𝜆 𝑚 + 𝑆𝑡 Δ𝑡
burst", is represented as: 2 2
√ √
𝜎𝑡vb = 𝑏|𝜏db − 𝑡|−𝛽 . (2.6) + 𝜈𝑡 𝑆𝑡 𝑍 Δ𝑡 + 𝑌 Δ𝑁(𝑡). (2.9)
In Eqn. (2.9), Δ𝑆𝑡 √is the change in asset price, 𝜇 represents
In Eqn. (2.6), 𝜎𝑡vb indicates the volatility at time 𝑡. The pa- the drift rate, and 𝜈𝑡 is the volatility factor scaled by the
rameter 𝑏 quantifies the size of this volatility surge, and 0 < standard normal random variable 𝑍. 𝑌 is a normally dis-
𝛽 < 12 , gauges its sharpness. tributed jump size with mean 𝑚 and variance 𝑣2 , and Δ𝑁(𝑡)
denotes the jump process increments characterized by a Pois-
2.1.4. Regime Transitions: Markov Chain son distribution with intensity 𝜆Δ𝑡.
A regime-switching time series model is applied to sim- The variation in instantaneous variance 𝜈𝑡 is captured by
ulate market dynamics, following Hamilton [1994] as men- Eqn. (2.10):
tioned by Lopez de Prado [2020]. The market is segmented
√ √ √
into discrete regimes, each with unique characteristics. The Δ𝜈𝑡 = 𝜅(𝜃 −𝜈𝑡 )Δ𝑡+𝜉 𝜈𝑡 (𝜌𝜖 𝜖𝑡𝑃 + 1 − 𝜌2𝜖 𝜖𝑡𝜈 ) Δ𝑡, (2.10)
market’s transition between these regimes at any given time 𝑡
is determined by a Markov chain, where the transition prob- where 𝜅 is the rate at which 𝜈𝑡 reverts to its long-term mean
ability 𝑝𝑡,𝑛 depends solely on the state immediately prior. 𝜃, and 𝜉 measures the volatility of the variance. The corre-
This approach captures the fluid nature of financial markets, lated standard normal white noises 𝜖𝑡𝜈 and 𝜖𝑡𝑃 introduce ran-

which fluctuate between different states, reflecting shifts in domness with a correlation coefficient 𝜌𝜖 . The factor Δ𝑡 is
volatility and trends. By employing a Markov chain, these introduced to scale the model appropriately in the discrete-
transitions are modeled with mathematical precision while time setting, reflecting the properties of Brownian motion
maintaining economic plausibility, recognizing that finan- increments.
cial markets tend to exhibit a memory of only the most recent Incorporating the Markov chain regime transition model
events. into our discrete simulation, the constants 𝜇, 𝜃, 𝜉, 𝜌𝜖 , 𝜆,
A Markov chain is a mathematical system that transitions 𝑚, and 𝑣2 are adjusted for each regime. The adjustment is
from one state to another in a state space. It is defined by its dictated by the state transitions determined by the Markov
set of states and the transition probabilities between these chain, where each state encapsulates a distinct market regime
states. The fundamental property of a Markov chain is that with its own parameter set. As the market transitions be-
the probability of moving to the next state depends only on tween regimes, these parameters change accordingly, align-
the present state and not on the sequence of events that pre- ing the simulation with the underlying stochastic process that
ceded it. reflects the dynamic financial market environment.
Given a finite number of states 𝑆 = {𝑠1 , 𝑠2 , … , 𝑠𝑛 }, the
probability of transitioning from state 𝑠𝑖 to state 𝑠𝑗 in one 2.2. Prototypical Financial Machine Learning
step is denoted by 𝑃𝑖𝑗 : Strategy
Developing a coherent machine-learning strategy in quan-
𝑃𝑖𝑗 = 𝑃 (𝑋𝑛+1 = 𝑠𝑗 |𝑋𝑛 = 𝑠𝑖 ), (2.7) titative finance necessitates a meticulous fusion of statistical
techniques and market knowledge. Our proposed methodol-
where 𝑋𝑛 represents the state at time 𝑛, and 𝑃𝑖𝑗 is the entry ogy rigorously combines event-based triggers, trend-following
in the 𝑖-th row and 𝑗-th column of the transition matrix 𝑃 . mechanisms, and risk assessment tools to formulate a proto-
The matrix 𝑃 = [𝑃𝑖𝑗 ] is called the transition matrix of the typical financial machine-learning strategy. It commences
Markov chain. Each entry 𝑃𝑖𝑗 represents the one-step tran- with precisely identifying market events through CUSUM
sition probability from state 𝑠𝑖 to state 𝑠𝑗 as in Eqn. (2.7): filtering and progresses to ascertain trade directionality via
momentum analysis. The core of the strategy harnesses meta-
labeling to assess trade viability and employs an averaging
⎡𝑃11 𝑃12 ⋯ 𝑃1𝑛 ⎤
⎢𝑃 𝑃22 ⋯ 𝑃2𝑛 ⎥ approach to bet sizing sensitive to market conditions and
𝑃 = ⎢ 21
⋮ ⎥
. (2.8) position overlap. Integrating fractionally differentiated fea-
⋮ ⋮ ⋱
⎢ ⎥ tures alongside traditional technical indicators forms a ro-
⎣𝑃𝑛1 𝑃𝑛2 ⋯ 𝑃𝑛𝑛 ⎦
bust feature set, ensuring the preservation of temporal de-
pendencies and adherence to stationarity—a prerequisite for

Arian, Norouzi, Seco Page 5 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

the successful application of predictive modeling in financial moving averages are formulated as follows:
contexts.
𝑁𝑓 𝑎𝑠𝑡 −1
1 ∑
2.2.1. Sampling: CUSUM Filtering MAshort (𝑦𝑡 ) = 𝑦𝑡−𝑖 ,
𝑁𝑓 𝑎𝑠𝑡 𝑖=0
Portfolio management often relies on event-based trig- (2.13)
𝑁𝑠𝑙𝑜𝑤 −1
gers for investment decisions. These events may include 1 ∑
structural breaks, signals, or microstructural changes, often MAlong (𝑦𝑡 ) = 𝑦𝑡−𝑖 ,
𝑁𝑠𝑙𝑜𝑤 𝑖=0
prompted by macroeconomic news, volatility shifts, or sig-
nificant price deviations. In this context, it is crucial to iden- where 𝑁𝑓 𝑎𝑠𝑡 and 𝑁𝑠𝑙𝑜𝑤 represent the number of periods for
tify such events accurately, leveraging machine learning (ML) the fast (short-term) and slow (long-term) moving averages,
to ascertain the potential for reliable predictive models. The respectively.
redefinition of significant events or the enhancement of fea- A position is taken based on the relative positioning of
ture sets is a continual process refined upon discovering non- these moving averages post a CUSUM event. A trade is ini-
predictive behaviors. tiated based on these conditions:
We employ the Cumulative Sum (CUSUM) filter as an
event-based sampling technique for methodological rigor, as 1. Long Position: Triggered when MAshort (𝑦𝑡 ) surpasses
mentioned by Lopez de Prado [2018]. This method detects MAlong (𝑦𝑡 ), signaling upward market momentum.
deviations in the mean of a quantity, denoting an event when 2. Short Position: Initiated when MAshort (𝑦𝑡 ) falls be-
a threshold is crossed. Given independent and identically low MAlong (𝑦𝑡 ), indicating downward market momen-
distributed tum.
{ } (IID) observations from a locally stationary pro-
cess 𝑦𝑡 𝑡=1,…,𝑇 , we define the CUSUM as: The strategy, thus, aligns the position with the current mar-
{ ( )} ket trend, as indicated by the momentum in prices.
𝑆𝑡 = max 0, 𝑆𝑡−1 + 𝑦𝑡 − 𝔼𝑡−1 𝑦𝑡 , (2.11)
2.2.3. Size Determination: Meta-Labeling via
with the initial condition 𝑆0 = 0. A signal for action is sug- Triple-Barrier Method
gested at the smallest time 𝑡 where 𝑆𝑡 ≥ ℎ, with ℎ being the In our trading framework, once the side of a position is
predefined threshold( or)filter size. It’s notable that 𝑆𝑡 is resetdetermined through the momentum strategy, it undergoes a
to zero if 𝑦𝑡 ≤ 𝔼𝑡−1 𝑦𝑡 − 𝑆𝑡−1 , which intentionally ignores rigorous evaluation via the triple-barrier method to ascertain
negative shifts. its potential profitability. This evaluation forms the basis for
To encompass both positive and negative shifts, we ex- position sizing, leveraging a meta-labeling approach intro-
tend this to a symmetric CUSUM filter: duced by Lopez de Prado [2018].
{ ( )} Upon identification of a trade’s direction, the triple-barrier
𝑆𝑡+ = max 0, 𝑆𝑡−1 +
+ 𝑦𝑡 − 𝔼𝑡−1 𝑦𝑡 , 𝑆0+ = 0, method applies three distinct barriers to determine the out-
{ ( )}
𝑆𝑡− = min 0, 𝑆𝑡−1−
+ 𝑦𝑡 − 𝔼𝑡−1 𝑦𝑡 , 𝑆0− = 0, (2.12) come of the position. The horizontal barriers are set ac-
{ } cording to a dynamic volatility-adjusted threshold for profit-
𝑆𝑡 = max 𝑆𝑡+ , −𝑆𝑡− .
taking and stop-loss, while the vertical barrier is defined by
Adopting Lam and Yam [1997]’s strategy, we generate al- a predetermined expiration time, denoted as ℎ. The label as-
ternating buy-sell signals upon observing a return ℎ relative signment is as follows: hitting the upper barrier signifies a
to a prior peak or trough, akin to the filter trading strategy successful trade, hence labeled 1; conversely, touching the
by Fama and Blume [1966]. Our application of the CUSUM lower barrier first indicates a loss, labeled −1. If the verti-
filter using Eqn. (2.12), however, is distinct; we only sam- cal time barrier expires first, the label is determined by the
sign of the return, reflecting the result of the trade within the
) 𝑡 if 𝑆𝑡 ≥ ℎ, subsequently resetting 𝑆𝑡 assuming
ple at( bar
period [𝑡 , 𝑡 + ℎ].
𝔼𝑡−1 𝑦𝑡 = 𝑦𝑡−1 . We define 𝑦𝑡 as the natural logarithm of 𝑖,0 𝑖,0
the asset’s price to capture proportional price movements. The role of meta-labeling in this context is to scrutinize
The threshold ℎ is not static; instead, it dynamically adjusts further the trades indicated by the primary momentum model.
directly to the daily volatility, ensuring sensitivity to market It confirms or refutes the suggested positions, effectively fil-
conditions. tering out false positives and allowing for a calculated deci-
sion on the actual size of the investment. The meta-labeling
2.2.2. Side Determination: Momentum Strategy process directly informs the appropriate risk allocation for
We employ a momentum strategy based on moving av- each position by assigning a confidence level to each poten-
erages to determine the direction of trades signaled by the tial trade. This methodological step enhances the precision
event-based CUSUM filter sampling. Specifically, we calcu- of our strategy and ensures that position sizing is aligned
late two moving averages of the prices, a short-term moving with the evaluated profitability of the trade, as indicated by
average MAshort (𝑦𝑡 ) and a long-term moving average MAlong (𝑦𝑡 ), the outcome of the triple-barrier assessment.
to identify the prevailing trend. The short-term moving av-
erage is responsive to recent price changes, while the long- 2.2.4. Sample Weights: Label Uniqueness
term moving average captures the underlying trend. These The validity of the Independent and Identically Distributed
(IID) assumption is a common shortfall in financial machine

Arian, Norouzi, Seco Page 6 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

learning, as the overlapping intervals in the data often vio- 2.2.5. Financial Features: Fractional Differentiation
late it. Specifically, labels 𝑦𝑖 and 𝑦𝑗 may not be IID if there & Technical Analysis
is a shared influence from a common return 𝑟𝑡𝑗,0 ,min{𝑡𝑖,1 ,𝑡𝑗,1 } , In pursuing a robust financial machine-learning model,
where 𝑡𝑖,1 > 𝑡𝑗,0 for consecutive labels 𝑖 < 𝑗. To address the our methodology encompasses diverse features that balance
non-IID nature of financial datasets without compromising memory preservation with the necessity for stationarity. Frac-
the model granularity, we utilize sample weights as intro- tional differentiation of log prices is employed to maintain
duced by Lopez de Prado [2018]. This method recognizes as much informative historical price behavior as possible
the interconnectedness of data points and adjusts their influ- while ensuring the data adheres to the stationary requirement
ence on the model accordingly. By weighing samples based of predictive models (Lopez de Prado [2018]). Addition-
on their unique information and return impact, we enhance ally, we incorporate exponentially weighted moving aver-
model robustness, enabling more accurate analysis of finan- ages (EWMA) of volatility, capturing recent market volatil-
cial time series. ity trends and a suite of technical analysis indicators that pro-
We define concurrent labels at time 𝑡 as those that are vide insights into market sentiment and dynamics. Techni-
both influenced by at least one shared return cal analysis features are extracted from historical price and
volume data and are widely used to capture market senti-
𝑝𝑡 ment and trends, which are indicative of future price move-
𝑟𝑡−1,𝑡 = − 1. (2.14)
𝑝𝑡−1 ments and provide structured information from the other-
wise noisy market data, aiding the machine learning model
The concurrency of labels 𝑦𝑖 and 𝑦𝑗 does not necessitate a
to discern patterns associated with profitable trading oppor-
complete overlap in period; rather, it is sufficient that there
tunities. The features used for this problem are as follows:
is a partial temporal intersection involving the return at time
𝑡.
To quantify{the }
extent of overlap, we construct a binary 1. FracDiff: The fractionally differentiated log price. Fi-
indicator array 1𝑡,𝑖 𝑖=1,…,𝐼 for each time 𝑡, where 1𝑡,𝑖 is set nancial time series are characterized by a low signal-
[ ]
to 1 if the interval 𝑡𝑖,0 , 𝑡𝑖,1 overlaps with [𝑡 − 1, 𝑡], and 0 to-noise ratio and memory, challenging traditional sta-
otherwise. We then calculate the concurrency count at time tionarity transformations like integer differentiation,
𝑡, given by which remove this memory and potentially valuable
predictive signals (Lopez De Prado [2015]). To ad-

𝐼 dress this, fractional differentiation is employed to pre-
𝑐𝑡 = 1𝑡,𝑖 . (2.15) serve memory while ensuring{ } stationarity.
𝑖=1 Consider a time series 𝑋𝑡 and the backshift oper-
The uniqueness of a label is inversely proportional to the ator 𝐵 such that 𝐵 𝑘 𝑋𝑡 = 𝑋𝑡−𝑘 for any non-negative
number of labels concurrent with it (Eqn. (2.15)). Conse- integer 𝑘. The binomial theorem applied to an integer
quently, we assign sample weights by inversely scaling them power can be extended to real powers using the bino-
with the concurrency count while considering the magnitude mial series and applied to the backshift operator:
of returns over the label’s lifespan. For label 𝑖, the prelimi-
∑∞ ( ) ∑∞ ∏𝑘−1
nary weight 𝑤̃ 𝑖 is computed as the norm of the sum of pro- 𝑑 𝑑 𝑘 𝑖=0 (𝑑 − 𝑖)
(1 − 𝐵) = (−𝐵) = (−𝐵)𝑘 .
portionally attributed returns: 𝑘 𝑘!
𝑘=0 𝑘=0
‖∑𝑡 ‖ (2.18)
‖ 𝑖,1 𝑟𝑡−1,𝑡 ‖
𝑤̃ 𝑖 = ‖

‖.
‖ (2.16)
‖𝑡=𝑡𝑖,0 𝑐𝑡 ‖ The expansion in Eqn. (2.18) yields weights 𝜔𝑘 , which
‖ ‖
are applied to past values of the series to compute the
To facilitate a consistent scale for optimization algorithms fractionally differentiated series 𝑋̃ 𝑡 :
that default to an assumption of unit sample weights, we nor-
malize these preliminary weights calculated in Eqn. (2.16) to ∑
∞ ∏
𝑘−1
𝑑−𝑖
sum to the total number of labels 𝐼: 𝑋̃ 𝑡 = 𝜔𝑘 𝑋𝑡−𝑘 , with 𝜔𝑘 = (−1)𝑘 .
𝑘=0 𝑖=0
𝑘!
𝑤̃
𝑤𝑖 = ∑𝐼 𝑖 . (2.17) (2.19)
̃
𝑗=1 𝑤𝑗
An approach to fractional differentiation employs a
∑𝐼 fixed-width window by truncating the infinite series
Eqn. (2.17) ensures that 𝑖=1 𝑤𝑖 = 𝐼. Through this weight-
ing scheme, we emphasize observations with greater abso- based on a threshold criterion for the weights. The
lute log returns that are less common, thereby enhancing the fixed-width window approach can be formalized as
model’s capacity to learn from unique and significant market follows: find the smallest 𝑙∗ such that the modulus of
events. the weights ‖ ‖
‖𝜔𝑙∗ ‖ is not less than the threshold 𝜏, and
‖𝜔 ∗ ‖ falls below 𝜏. The adjusted weights 𝜔̃ are
‖ 𝑙 +1 ‖ 𝑘

Arian, Norouzi, Seco Page 7 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

then defined by: 10. ATR: The Average True Range quantifies market volatil-
{ ity by averaging true ranges over a period, reflecting
𝜔𝑘 if 𝑘 ≤ 𝑙∗ , the degree of price volatility.
𝜔̃ 𝑘 = (2.20)
0 if 𝑘 > 𝑙∗ . 11. Log DPO: The logarithm of the Detrended Price Os-
cillator compares rolling means at different periods to
Applying these truncated weights, the fractionally dif-
identify cyclical patterns in the price data.
ferentiated series 𝑋̃ 𝑡 is obtained through a finite sum:
12. MACD Position: Indicates the position of the MACD
Histogram relative to its signal line, with values above


𝑙
zero suggesting a bullish crossover and below zero a
𝑋̃ 𝑡 = 𝜔̃ 𝑘 𝑋𝑡−𝑘 , for 𝑡 = 𝑇 −𝑙∗ +1, … , 𝑇 . (2.21) bearish crossover.
𝑘=0
13. ADX Strength: Reflects the trend’s strength as mea-
The resultant series in Eqn. (2.21) is a driftless mix- sured by the ADX, categorizing trends as strong if
ture of the original level and noise components, pro- above a threshold value and weak if below.
viding a stationary series despite its non-Gaussian dis- 14. RSI Signal: Categorizes the RSI reading as signal-
tribution that exhibits memory-induced skewness and ing overbought conditions above a high threshold or
kurtosis. oversold conditions below a low threshold.
For a given time series {𝑋𝑡 }𝑡=1,…,𝑇 , the fixed-width 15. CCI Signal: Provides a signal based on the CCI read-
window fractional differentiation (FFD) approach is ing, indicating overbought or oversold conditions when
utilized to determine the order of differentiation 𝑑 ∗ crossing predefined threshold levels.
that achieves stationarity in the series {𝑋̃ 𝑡 }𝑡=𝑙∗ ,…,𝑇 us- 16. Stochastic Signal: Generates a signal from the Stochas-
ing ADF tests. The value of 𝑑 ∗ indicates the memory tic Oscillator, identifying overbought or oversold con-
that must be eliminated to attain stationarity. ditions based on threshold levels.
2. Volatility: Volatility is a fundamental feature that cap- 17. ROC Momentum: Categorizes the momentum based
tures the magnitude of price movements and is critical on the ROC, with positive values indicating an upward
for modeling risk and return in financial markets. The momentum and negative values a downward momen-
exponentially weighted moving average (EWMA) of tum.
volatility gives more weight to recent observations, 18. Kumo Breakout: Identifies price breakouts from the
making it a responsive measure of current market con- Ichimoku Cloud, suggesting a bullish breakout when
ditions. The EWMA volatility for a given day 𝑡 is cal- the price is above the cloud and bearish when below.
culated as follows: 19. TK Position: Indicates the position of the Tenkan-

sen relative to the Kijun-sen in the Ichimoku Indicator,
𝜎𝑡𝐸𝑊 𝑀𝐴 = 𝜆𝜎𝑡−1 2 + (1 − 𝜆)𝑟2 ,
𝑡 (2.22)
with values above one suggesting a bullish crossover
where 𝑟𝑡 is the log return at time 𝑡, and 𝜆 is the decay and below one a bearish crossover.
factor that determines the weighting of past observa- 20. Price Kumo Position: Categorizes the price position
tions. relative to the Ichimoku Cloud, suggesting bullish sen-
3. Z-Score: The Z-Score standardizes the log prices by timent when above the cloud and bearish when below.
their deviation from a rolling mean relative to the rolling 21. Cloud Thickness: Measures the thickness of the Ichimoku
standard deviation, highlighting price anomalies. Cloud by taking the logarithm of the ratio between
4. Log MACD Histogram: The difference between the the cloud spans, indicating market volatility and sup-
logarithmically transformed MACD line and its cor- port/resistance strength.
responding signal line indicates momentum shifts. 22. Momentum Confirmation: Confirms the momentum
5. ADX: The Average Directional Index measures the indicated by the Ichimoku Indicator, with the Tenkan-
strength of a trend over a given period, with higher sen above the cloud suggesting bullish momentum and
values indicating stronger trends. below suggesting bearish momentum.
6. RSI: The Relative Strength Index identifies conditions
where the asset is potentially overbought or oversold, 2.2.6. Bet Sizing: Averaging Active Bets
often signaling possible reversals. Proper bet sizing is crucial in implementing a successful
7. CCI: The Commodity Channel Index detects cyclical investment strategy informed by machine learning predic-
trends in asset prices, often used to spot impending tions. We denote by 𝑝[𝑥] the probability of a label 𝑥 occur-
market reversals. ring, where 𝑥 ∈ {−1, 1}. To determine the appropriateness
8. Stochastic: The Stochastic Oscillator compares the of a bet, we test the null hypothesis:
closing price to its price range over a specified period, Null Hypothesis 1. 𝐻0 ∶ 𝑝[𝑥 = 1] = 21 .
indicating momentum.
9. ROC: The Rate of Change measures the velocity of Calculating the test statistic:
price changes, with positive values indicating upward
𝑝[𝑥 = 1] − 12
momentum and negative values indicating downward 𝑧= √ ∼ 𝑍, (2.23)
momentum. 𝑝[𝑥 = 1](1 − 𝑝[𝑥 = 1])

Arian, Norouzi, Seco Page 8 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

where 𝑧 ∈ (−∞, +∞) and 𝑍 represents the standard normal to the number of neighbors chosen. By experiment-
distribution. The bet size is then derived as ing with small numbers of neighbors, we expose the
model to potential overfitting, where it might rely too
𝑚 = 2𝑍[𝑧] − 1, (2.24) heavily on immediate, possibly noisy data points. The
model used in our study is a custom pipeline integrat-
with 𝑚 ∈ [−1, 1] and 𝑍[⋅] being the cumulative distribution
ing standard scaling with the KNeighborsClassifier.
function (CDF) of 𝑍 for Eqn. (2.23). This formulation ac-
counts for predictions originating from both meta-labeling 2. Decision Tree: Decision Trees, while interpretable,
and standard labeling estimators. can easily overfit the training data, especially without
The process of bet sizing involves determining the size constraints on tree depth. Our configuration tests the
of individual bets based on the probability of outcomes and model in its most unconstrained form, providing in-
managing the aggregation of multiple bets that may be ac- sights into its behavior without regularizing parame-
tive concurrently. To manage ters. Our implementation uses a Decision Tree Clas-
{ }multiple concurrent bets, we sifier with a predefined random state for reproducibil-
define a binary indicator 1𝑡,𝑖 for each bet 𝑖 at time 𝑡. This
indicator takes the value of 1 if bet 𝑖 is active within the in- ity. The parameters include the maximum depth of the
terval (𝑡 − 1, 𝑡], and 0 otherwise. The aggregate bet size at tree, the minimum number of samples required to split
time 𝑡 is then the average of all active bet sizes as shown in an internal node, and the minimum number of samples
Eqn. (2.25): required to be at a leaf node.
3. XGBoost: XGBoost is an advanced implementation
∑𝐼 of gradient boosting algorithms known for its efficiency,
𝑚𝑖 1𝑡,𝑖
𝑚𝑡 = ∑𝑖=1 𝐼
, (2.25) flexibility, and portability. However, with excessively
𝑖=1 1𝑡,𝑖 high values for parameters such as the number of es-
where 𝑚𝑖 is the individual bet size. timators and learning rates, there is a risk of overfit-
ting, where the model becomes overly tailored to the
2.3. Strategy Trials training data. It excels in handling sparse data and
This section presents our strategy trials, which are inte- scales effectively across multiple cores. In our setup,
gral to our financial machine-learning research. We employ the XGBoost Classifier is employed with specific pa-
a comprehensive methodology, examining machine learning rameters like the number of trees, maximum depth of
models like k-Nearest Neighbors, Decision Trees, and XG- trees, learning rate, and subsampling ratio of the train-
Boost, each with unique parameter settings. Our approach ing instances.
deliberately tests these models under conditions conducive Each model is exhaustively assessed across its param-
to overfitting to assess their robustness and adaptability. We eter space to evaluate its efficacy and robustness in various
also introduce the Momentum Cross-Over Strategy, utiliz- market scenarios. This extensive parameterization is a delib-
ing various moving average window lengths to align trades erate strategy to test the models’ susceptibility to overfitting,
with market trends. This combination of diverse models and a critical consideration in financial machine-learning appli-
adaptive strategies, processed through a systematic pipeline cations.
that includes event-based sampling, meta-labeling, and it-
erative optimization, is designed to rigorously evaluate the 2.3.2. Momentum Cross-Over Strategy: An Overview
efficacy of trading strategies in complex market scenarios. The Momentum Cross-Over Strategy is a key element
The trials aim to balance the exploration of machine learn- of our strategy trials, aiming to align trade directions with
ing potentials in finance with the pragmatic challenges of market trends detected through moving averages. This strat-
real-world market conditions. egy’s adaptability lies in its various combinations of window
lengths for the moving averages, allowing it to capture mar-
2.3.1. Machine Learning Models: An Overview ket momentum over different time frames. By experiment-
In our strategic analysis, we leverage various machine ing with multiple window length pairs, the strategy adjusts to
learning models, each with a distinct set of parameters. This various market conditions and introduces flexibility that in-
approach is designed to rigorously test the models under vary- creases the likelihood of overfitting. This approach ensures
ing conditions, potentially increasing the risk of overfitting. a thorough examination of market trends, aiming to optimize
This methodological choice serves a dual purpose: firstly, trade positions in line with the prevailing market direction.
to rigorously challenge the robustness of the models under
extreme parameter conditions, and secondly, to examine the 2.3.3. Trials on Synthesized Data: The Pipeline
models’ performance in scenarios prone to overfitting. This Our strategy trials employ a streamlined pipeline to as-
deliberate stress testing provides valuable insights into the sess the potential for overfitting in various trading strate-
resilience and adaptability of the algorithms in complex fi- gies. The pipeline integrates event-based sampling, momen-
nancial environments. The following models and their re- tum strategy, machine learning models, and meta-labeling to
spective parameter sets are integral to this analysis: simulate diverse market conditions and test strategy efficacy.
1. K-Nearest Neighbors (k-NN): The k-NN model is The key steps of this pipeline are:
predicated on feature similarity and is highly sensitive 1. CUSUM Sampling: The process begins with the CUSUM

Arian, Norouzi, Seco Page 9 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

filter, identifying significant market shifts based on forming the analysis on one subset (the training set), and val-
deviations in log prices. This method generates sig- idating the analysis on the other subset (the test set). To re-
nals for potential trading opportunities. duce variability, multiple rounds of cross-validation are per-
2. Momentum Cross-Over Strategy: Following CUSUM formed using different partitions, and the validation results
signals, the Momentum Cross-Over Strategy is applied. are averaged over the rounds.
This step involves choosing window sizes for calculat- In financial modeling, especially for backtesting trading
ing moving averages and determining the trade direc- strategies, applying K-Fold cross-validation presents unique
tion based on their relative positions. challenges. Financial data are typically time-series data char-
3. Machine Learning Model Selection: A machine learn- acterized by temporal dependencies and non-stationarity. These
ing model, such as k-NN, Decision Tree, or XGBoost, features of financial data violate the fundamental assumption
is selected with specific parameters. This stage tests of traditional K-Fold cross-validation, which assumes that
model responses to trading signals, emphasizing the the observations are independent and identically distributed
analysis of overfitting risks under varying parameter (i.i.d).
settings. The process of K-Fold cross-validation in financial back-
4. Meta-Labeling and Sample Weights: Trade signals testing involves the following steps:
are processed through meta-labeling using the Triple- 1. The entire dataset is divided into 𝑘 consecutive folds
Barrier Method while concurrently assigning sample or segments.
weights to tackle the non-IID nature of financial data, 2. For each iteration, a different fold is treated as the test
thus enhancing the model’s learning efficacy. set (or validation set), and the remaining 𝑘 − 1 folds
5. Model Fitting and Testing: The chosen model is fit- are combined to form the training set.
ted to the data, now with meta-labels and weights, 3. The model is trained on the training set and validated
to evaluate its predictive accuracy under synthesized on the test set.
conditions.
4. The performance metric (e.g., Sharpe ratio, annual-
This pipeline approach critically examines the interplay ized return, drawdown) is recorded for each iteration.
between different components of trading strategies, focusing 5. After iterating through all folds, the performance met-
on the risk of overfitting. By simulating complex market rics are aggregated to provide an overall performance
scenarios, we aim to validate the robustness and adaptability estimate.
of these strategies for real-world application.
However, the temporal order of financial data necessi-
2.4. Backtesting on Out-of-Sample Data: tates careful handling. Shuffling or random data partition-
Cross-Validation ing, as commonly done in other domains, can lead to sig-
In quantitative finance, the rigor of a trading strategy is nificant biases and erroneous conclusions. For instance, us-
often validated through backtesting on out-of-sample data. ing future data in constructing the training set, even inad-
This process involves assessing the strategy’s performance vertently, introduces lookahead bias, severely compromising
using data not employed during the model’s training phase, the model’s validity.
providing insights into its real-world applicability. Cross- Moreover, financial markets are influenced by macroe-
validation (CV) techniques are pivotal, offering structured conomic factors and market regimes, leading to structural
methods to evaluate the strategy’s effectiveness and robust- breaks. These factors can result in model performance that
ness under various market conditions. The methodologies varies significantly across different periods, making it diffi-
for backtesting range from conventional approaches like K- cult to generalize the results obtained from a conventional
Fold Cross-Validation, which divides the data into multi- K-Fold cross-validation approach.
ple segments for iterative testing, to more specialized meth- Despite these limitations, K-Fold cross-validation is of-
ods like Walk-Forward Cross-Validation and Combinatorial ten used in preliminary model assessments, given its sim-
Purged Cross-Validation. Each method has distinct char- plicity and widespread understanding in the statistical com-
acteristics in handling the data, particularly addressing the munity. However, researchers in quantitative finance must
supplement
challenges posed by the temporal dependencies and non-stationarity or replace this method with more appropriate
in financial time series. Understanding these methods’ nu- techniques, such as Combinatorial Purged Cross-Validation,
ances in constructing backtest pathways is crucial for accu- that account for the peculiarities of financial time series data.
rate model validation and developing robust trading strate- It is crucial to interpret the results of K-Fold cross-validation
gies. in the context of financial markets with caution, understand-
ing that its assumptions may not fully align with the under-
2.4.1. Conventional Approach: K-Fold lying data characteristics.
Cross-Validation
K-fold cross-validation is a widely recognized statistical 2.4.2. Time-Consistent Validation: Walk-Forward
method for validating the performance of predictive models, Cross-Validation
particularly in machine learning contexts. It involves par- Walk-forward cross-validation (WFCV) is a method specif-
titioning a sample of data into complementary subsets, per- ically tailored for time series data, addressing the unique

Arian, Norouzi, Seco Page 10 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

challenges posed by financial market data’s temporal depen- The Purged K-Fold process involves several key modifi-
dencies and non-stationarity. Unlike conventional K-Fold cations to the standard K-fold cross-validation:
cross-validation, which can inadvertently introduce looka-
1. The dataset is partitioned into 𝑘 folds, ensuring that
head bias by shuffling data, WFCV respects the chronolog-
each fold is a contiguous segment of time to maintain
ical order of observations, ensuring a more realistic and ro-
the temporal order of observations.
bust validation of trading strategies.
The WFCV process involves the following steps: 2. Each fold is used once as the validation set, while the
remaining folds form the training set. However, unlike
1. The dataset is divided into an initial training period standard K-Fold cross-validation, a "purging" process
and a subsequent testing period. The size of these pe- is implemented.
riods can be fixed or expanded. 3. The purging process involves removing observations
2. The model is trained on the initial training set and then from the training set that occur after the start of the
tested on the subsequent testing period. validation period. This is done to eliminate the risk
3. After the first validation, the training and testing win- of information leakage from the future (validation pe-
dows are rolled forward. This means expanding or riod) into the past (training period).
shifting the training period and testing on the new sub-
4. Additionally, an "embargo" period is applied after each
sequent period.
training fold ends and before the next validation fold
4. This process is repeated until the entire dataset is tra-
starts. This embargo period serves as a buffer zone
versed, with each iteration using a new testing period
to further mitigate the risk of leakage due to temporal
immediately following the training period.
dependencies that purging might not fully address.
5. Performance metrics are recorded for each testing pe-
5. The model is trained on the purged and embargoed
riod and aggregated to evaluate the strategy’s overall
training data and then validated on the untouched val-
effectiveness.
idation fold.
WFCV’s primary advantage lies in its alignment with the 6. Performance metrics are recorded for each fold and
practical scenarios encountered in live trading. Training and aggregated to provide an overall assessment.
testing on consecutive data segments closely mimic the real-
world situation where a model is trained on past data and This methodology is particularly effective in financial
deployed on future, unseen data. This sequential approach machine learning, where models often capture temporal re-
helps understand how a strategy adapts to evolving market lationships, and even subtle information leakage can lead
conditions and objectively assesses its predictive power and to over-optimistic performance estimates. Purged K-Fold
robustness over time. Cross-Validation ensures a more robust and realistic evalu-
However, WFCV has its limitations. The repetitive re- ation of the model’s predictive power by incorporating the
training process can be computationally intensive, especially purging and embargo mechanisms.
for large datasets and complex models. Additionally, the Purged K-Fold is especially relevant for strategies that
choice of the size of the training and testing windows can rely on features extracted from historical data, as it ensures
significantly impact the results, requiring careful considera- that the model is not inadvertently trained on future data.
tion and sensitivity analysis. This method is essential for preventing the common pitfalls
WFCV is particularly pertinent in financial machine learn- of overfitting and selection bias in financial modeling.
ing due to its ability to mitigate overfitting and model de- While Purged K-Fold Cross-Validation offers significant
cay risks — common challenges in quantitative finance. It advantages in maintaining data integrity, it requires careful
ensures that models are continuously updated and validated consideration of the lengths of the purge and embargo pe-
against the most recent data, reflecting the dynamic nature riods, which should be tailored to the specific temporal de-
of financial markets. pendencies in the analyzed financial data.
Despite its advantages, WFCV should be employed as
2.4.4. Multi-Scenario, Leakage-Free Validation:
part of a comprehensive strategy validation framework, along-
Combinatorial Purged Cross-Validation
side other methods like combinatorial purged cross-validation,
Combinatorial Purged Cross-Validation (CPCV) is intro-
to fully account for the complexities of financial time series
duced by Lopez de Prado [2018] as an innovative approach
to ensure robust model validation.
to address the limitations of single-path testing inherent in
2.4.3. Leakage-Resistant Validation: Purged K-Fold conventional Walk-Forward and Cross-Validation methods.
Purged K-Fold Cross-Validation is an advanced valida- This method is specifically designed for the complex envi-
tion technique developed by Lopez de Prado [2018] to ad- ronment of financial machine learning, where temporal de-
dress the issue of information leakage in financial time se- pendencies and non-stationarity are prevalent. CPCV gen-
ries, a common pitfall in traditional cross-validation meth- erates multiple backtesting paths and integrates a purging
ods. This method is particularly suited for validating finan- mechanism to eliminate the risk of information leakage from
cial models where the integrity of the temporal order of data training observations.
is crucial for preventing look-ahead biases and ensuring re- The CPCV method is implemented as follows:
alistic performance estimation.

Arian, Norouzi, Seco Page 11 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

1. The dataset, consisting of 𝑇 observations, is partitioned (c) This method does not account for the temporal
into 𝑁 non-overlapping groups. These groups main- order of data, which can lead to unrealistic back-
tain the chronological order of data, where the first test paths in financial time series due to potential
𝑁 − 1 groups each have a size of ⌊𝑇 ∕𝑁⌋, and the information leakage and autocorrelation.
𝑁-th group contains the remaining observations. 2. Walk-Forward (WF) Validation:
2. For a selected size k of the testing set, CPCV calcu- (a) WF Validation involves an expanding and rolling
lates window approach. The dataset is sequentially
( 𝑁 )the number of possible training/testing splits as
. Each combination involves k groups for test- divided into a training set followed by a valida-
𝑁−𝑘 ( 𝑁 ) tion set.
ing, and the total number of groups tested is 𝑁−𝑘 ×𝑘,
(b) The unique aspect of WF is its chronological align-
ensuring a uniform distribution across all 𝑁 groups.
ment. The window rolls forward, ensuring the
3. From the combinatorial splits, each group is uniformly validation set always follows the training set in
included in the testing sets. This process results in a time.
comprehensive series of ( backtest
) paths, given by the (c) WF creates a single backtest path that closely
combinatorial number 𝑁𝑘 . mimics real-world trading scenarios. However,
4. Paths are generated by training classifiers on a portion it tests the strategy only once, providing limited
of the data, specifically 1 − 𝑁𝑘 , for each combination. insight into its robustness under different market
The algorithm ensures that the portion of data in the conditions.
training set is balanced against the number of paths 3. Combinatorial Purged Cross-Validation (CPCV):
and size of the testing sets. (a) CPCV enhances backtest pathways by introduc-
5. The CPCV backtesting algorithm involves purging and ing a combinatorial approach. The dataset is di-
embargoing as introduced before. Each path results vided into 𝑁 groups, from which 𝑘 groups are
from combining forecasts from different groups and selected in various combinations for training and
split combinations, ensuring a comprehensive evalua- testing.
tion of the classifier’s performance. (b) This method generates multiple backtest paths,
6. After processing all paths, the performance metrics each representing a different combination of train-
from each path are aggregated to assess the overall ing and validation sets. It addresses the issue of
effectiveness of the model, providing insights into its single-path dependency seen in WF and tradi-
robustness and consistency across various market con- tional CV.
ditions. (c) CPCV also incorporates purging and embargo-
ing to prevent information leakage, making each
CPCV’s unique combinatorial approach allows for a thor- path more realistic and reducing the risk of over-
ough evaluation of the model under diverse scenarios, ad- fitting.
dressing the critical overfitting issue. It provides a more nu- (d) The key advantage of CPCV is its ability to pro-
anced and accurate assessment of a model’s predictive capa- vide a comprehensive view of the strategy’s per-
bilities in the dynamic field of financial markets. formance across a range of scenarios, unlike the
While CPCV offers an extensive validation framework, single scenario tested in WF and traditional CV.
its combinatorial nature can be computationally demanding.
Each CV method’s approach to constructing backtest path-
Therefore, it’s essential to consider computational resources
ways has implications for its utility in financial modeling.
and execution time, particularly for large financial datasets.
Traditional CV’s disregard for temporal order limits its ap-
2.4.5. Scenario Creation: Constructing Backtest plicability for financial time series. WF’s single-path ap-
Pathways proach offers a realistic scenario but lacks robustness testing.
The creation of backtest pathways varies significantly CPCV, with its multiple, purged combinatorial paths, of-
among different cross-validation methods. Traditional Cross- fers a comprehensive evaluation of a strategy’s performance,
Validation (CV), Walk-Forward (WF) Validation, and Com- making it particularly suitable for complex financial mar-
binatorial Purged Cross-Validation (CPCV) each have dis- kets where multiple scenarios are critical for understanding
tinct methodologies for generating these paths. Understand- a strategy’s effectiveness.
ing these differences is crucial for selecting the appropriate
2.5. Assessment of Backtest Overfitting
validation method in financial modeling.
In the quest to develop robust trading strategies within
1. Traditional Cross-Validation (CV): quantitative finance, the assessment of backtest overfitting
(a) In traditional CV, the dataset is divided into 𝑘 emerges as a crucial facet. This section delves into the method-
folds. Each fold is a validation set once, while ologies deployed to evaluate and mitigate the risk of overfit-
the remaining folds constitute the training set. ting, a common pitfall where strategies appear effective in
(b) The backtest path in CV is linear and sequential. retrospective analyses but falter in prospective applications.
Each fold’s validation results contribute to a sin- Two pivotal concepts, the Probability of Backtest Overfit-
gle aggregated performance metric. ting (PBO) and the Deflated Sharpe Ratio (DSR), are har-
nessed to scrutinize the reliability of backtested strategies.

Arian, Norouzi, Seco Page 12 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

PBO is gauged through Combinatorially Symmetric Cross- 5. Finally, the PBO is estimated by calculating the distri-
Validation (CSCV), a technique that rigorously tests strat- bution of ranks out-of-sample (OOS) and integrating
egy performance across diverse market scenarios. Concur- the probability distribution function 𝑓 (𝜆) as:
rently, DSR offers a refined perspective on strategy efficacy
0
by adjusting the Probabilistic Sharpe Ratio (PSR) for multi-
PBO = 𝑓 (𝜆)𝑑𝜆. (2.27)
ple trials, thus enhancing the authenticity of our backtesting ∫−∞
results. Together, these methodologies furnish a comprehen-
sive framework for evaluating the integrity of trading strate- where the PBO represents the probability of in-sample
gies, ensuring that they are not merely artifacts of historical optimal strategies underperforming out-of-sample.
data but are genuinely predictive and robust against future This rigorous statistical approach leading to Eqn. (2.27)
market conditions. allows us to evaluate the extent of overfitting in our strat-
egy development process, ensuring that selected strategies
2.5.1. Probability of Backtest Overfitting: are robust and not merely tailored to historical market id-
Combinatorially Symmetric Cross-Validation iosyncrasies.
Backtest trials are pivotal in the realm of quantitative fi-
nance, particularly in the development of trading strategies. 2.5.2. Probability of False Discovery: The Deflated
Utilizing the methodology outlined in previous sections, we Sharpe Ratio
perform multiple backtest trials, ideally selecting the optimal In selecting the optimal strategy from multiple backtest
strategy based on its performance in these trials. However, trials, a key concern is the probability of false discovery,
this approach inherently risks backtest overfitting, where a which refers to the likelihood that the observed performance
strategy might show exceptional performance in a histori- of a strategy is due to chance rather than true predictive power.
cal context but fails to generalize to new, unseen data. To To address this, we use the Deflated Sharpe Ratio (DSR),
quantitatively assess and mitigate this risk, we calculate the which extends the Probabilistic Sharpe Ratio (PSR) concept
Probability of Backtest Overfitting (PBO) using the Combi- to account for the multiplicity of trials.
natorially Symmetric Cross-Validation (CSCV) method as The PSR, as introduced by Bailey and Lopez de Prado
introduced by Bailey et al. [2016]. CSCV provides a more [2012], adjusts the observed Sharpe Ratio (𝑆𝑅) ̂ by account-
robust measure of a strategy’s effectiveness by examining its ing for the distributional properties of returns, such as skew-
performance across different segments of market data, al- ness and kurtosis. It is calculated as:
lowing us to evaluate the consistency of trial returns both
in-sample and out-of-sample. ⎛ √ ⎞
The CSCV process is outlined in the following steps: ⎜ (𝑆𝑅
̂ − 𝑆𝑅 ∗) 𝑇 − 1 ⎟
𝑆𝑅(𝑆𝑅∗ ) = 𝑍 ⎜ √
𝑃̂ ⎟ , (2.28)
1. Formation of a performance matrix 𝑀 of size 𝑇 × 𝑁, ⎜ 2 ⎟
⎜ 1 − 𝛾̂3 𝑆𝑅 𝛾̂
̂ + 4 𝑆𝑅−1 ̂ ⎟
where each column represents the log returns series ⎝ 4 ⎠
for a specific model configuration over 𝑇 time obser-
vations. where 𝑍[.] is the cumulative distribution function (CDF) of
2. Partitioning of 𝑀 into 𝑆 disjoint submatrices 𝑀𝑠 of the standard Normal distribution, 𝑇 is the number of ob-
equal dimensions, with each submatrix being of order served returns, 𝛾̂3 is the skewness of the returns, and 𝛾̂4 is
𝑇
× 𝑁. the kurtosis of the returns. 𝑆𝑅∗ is a benchmark Sharpe ratio
𝑆 ̂ is compared.
3. Formation of combinations 𝐶𝑆 of these submatrices, against which 𝑆𝑅
The Deflated Sharpe Ratio (DSR), as introduced by Bai-
taken in groups of size 𝑆2 , yielding a total number of
ley and López de Prado [2014b], refines the Probabilistic
combinations calculated as:
Sharpe Ratio (PSR) as given in Eqn. (2.28) by considering
( )
𝑆 ∏
𝑆∕2−1
𝑆 −𝑖
the number of independent trials. This refinement yields a
= . (2.26) more precise measure of the probability of false discovery
𝑆∕2 𝑆∕2 − 𝑖
𝑖=0 when multiple strategies are tested. Specifically, the DSR
employs a benchmark Sharpe ratio (𝑆𝑅∗ ) which is calcu-
4. For each combination 𝑐 ∈ 𝐶𝑆 , the following steps are
lated in Eqn. (2.29), that is influenced by the variance of the
carried out:
estimated Sharpe Ratios (𝑆𝑅 ̂ 𝑛 ) from the trials, the number of
(a) Formation of the training set 𝐽 and the testing
trials (𝑁), and incorporates the Euler-Mascheroni constant
set 𝐽̄.
(𝛾):
(b) Computation of the performance statistic vectors
𝑅 and 𝑅̄ for the training and testing sets, respec- √ ( )
tively. 𝑆𝑅∗ = 𝑉 {𝑆𝑅 ̂ 𝑛}
(c) Identification of the optimal model 𝑛∗ in the train-
( ( ) ( ))
ing set and determination of its relative rank 𝜔̄ 𝑐 1 1
(1 − 𝛾)𝑍 −1 1 − + 𝛾𝑍 −1 1 − 𝑒−1 ,
in the testing set. ( ) 𝑁 𝑁
𝜔̄ (2.29)
(d) Definition of the logit 𝜆𝑐 = log 1−𝜔𝑐̄ .
𝑐

Arian, Norouzi, Seco Page 13 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

where 𝑍 −1 is the inverse of the cumulative distribution func-


tion (CDF) of the standard normal distribution 𝑍. This ad-
justment is based on the expectation of the maximum of a
sample of IID random variables from the standard normal
distribution, which is delineated in Eqn. (2.30):
[ ]
1
E[max{𝑥𝑖 }𝑖=1,…,𝑙 ] ≈ (1 − 𝛾)𝑍 −1 1 −
𝐼
[ ] √
−1 1 −1
+ 𝛾𝑍 1− 𝑒 ≤ 2log [𝐼], (2.30)
𝐼
with 𝛾 ≈ 0.57721566 representing the Euler-Mascheroni
constant, and 𝐼 ≫ 1 indicating a large number of trials.
This formulation, known as the "False Strategy Theorem"
Lopez de Prado [2020], informs the calculation of 𝑆𝑅∗ in
the DSR methodology, providing a benchmark against which
the observed Sharpe Ratios can be evaluated. The DSR,
computed using this adjusted 𝑆𝑅∗ within the PSR frame-
work, offers a comprehensive assessment of a strategy’s true
performance by correcting for the inflationary effect of mul-
tiple testing and helps distinguish genuine skill from statis-
tical flukes.

3. Empirical Results Figure 1: Flow Chart of the Empirical Results Simulation


In this pivotal section of our study, we delve into a com-
prehensive empirical investigation designed to perform a com-
parative analysis of cross-validation techniques within a syn- offer a faithful representation of historical market behavior,
thetic controlled environment. Our empirical endeavor is ensuring the robustness of our simulation. We detail the con-
meticulously constructed to evaluate the robustness of these figuration of these models, their integration within a cohe-
techniques against backtest overfitting—a critical pitfall in sive framework, and the parameter sets governing their be-
the development and validation of financial models. The ex- havior, which are crucial for capturing the complex dynam-
ploration unfolds across a series of simulations replicating ics of financial markets.
market conditions, informed by sophisticated models like
the Heston Stochastic Volatility and Merton Jump Diffusion 3.1.1. Base Model: The Heston Stochastic Volatility
models, offering a rich tapestry of tranquil and turbulent mar- Model
ket scenarios. The heart of our inquiry lies in the robustness The Heston Stochastic Volatility Model in our study is
check against backtest overfitting, ensuring the strategies we parameterized using "Calm" and "Volatile" market regimes,
assess are not merely artifacts of hindsight bias but can stand based on the empirical analysis of the S&P 500 during 2008
the test of uncharted market dynamics. Through our Syn- and 2011 by Papanicolaou and Sircar [2014] as shown in Ta-
thetic Controlled Market Environment lens, 28 strategic tri- ble 1. These parameter sets are chosen for their robustness in
als in 1000 simulations dissect the strengths and weaknesses replicating the real-world market volatility observed in these
of a spectrum of out-of-sample testing methodologies. Each periods.
technique’s ability to identify and negate overfitting is rig-
3.1.2. Price Jumps: The Merton Jump Diffusion
orously examined, thereby serving as a crucible for deter-
Model
mining the most reliable approach to cross-validation. The
Incorporating the Merton Jump Diffusion Model into our
results presented here are a testament to the analytical rigor
simulation, we have calibrated parameters for "Calm" and
of the study but also form a cornerstone for the selection and
"Volatile" market regimes based on insights from the study
implementation of cross-validation techniques in the ever-
of the S&P 500 market by Hanson and Zhu [2004] as shown
evolving domain of quantitative finance.
in Table 1. This parameterization is critical for accurately
3.1. Implementation and Parameterization of replicating the jump behavior in asset prices characteristic
of varied market volatility conditions as observed in the em-
Synthetic Data Models
pirical data.
This subsection delineates the development of a synthetic
market environment, utilizing the Heston Stochastic Volatil- 3.1.3. Modeling Market Anomalies: The Drift Burst
ity Model and the Merton Jump Diffusion Model, with pa- Hypothesis
rameters reflecting market conditions during tranquil and tu- Adopting the Drift Burst Hypothesis model for simu-
multuous times. Parameters derived from empirical studies lating market anomalies and speculative bubbles, our study

Arian, Norouzi, Seco Page 14 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

Table 1
Parameterization of the Heston and Merton Jump Diffusion
Models for Calm and Volatile Market Regimes
Parameter Calm Regime Volatile Regime
Heston Stochastic Volatility
Expected Return (𝜇) 0.1 0.1
Mean Reversion Rate (𝜅) 3.98 3.81
Long-term Variance (𝜃) 0.029 0.25056
Volatility of Variance (𝜉) 0.389645311 0.59176974
Correlation Coefficient (𝜌) -0.7 -0.7
Merton Jump Diffusion
Jump Intensity (𝜆) 121 121
Mean of Logarithmic Jump Size (𝑚) -0.000709 -0.000709
Variance of Logarithmic Jump Size (𝑣) 0.0119 0.0119

aligns with the parameters delineated in the foundational work


by Christensen et al. [2022] as shown in Table 2. This model
uniquely influences the synthetic market environment by im-
posing a fixed-length regime characterized by predefined ar-
rays of drift and volatility values. During this regime, the
Heston Stochastic Volatility Model operates under these con-
ditions, featuring non-stochastic volatility and the absence of
jumps. After the drift burst period, the simulation mandates Figure 2: Speculative Bubble Simulated Using Drift Burst Hy-
a transition to a different market regime, ensuring a realis- pothesis
tic representation of abrupt market transitions. To ensure
computational stability and circumvent potential zero divi-
Table 3
sion errors, the drift and volatility values are constant at a
Markov Chain Transition Matrix for Market Regimes
specific fraction of the entire duration, corresponding to the
explosion filter width. From/To Calm Volatile Speculative Bubble
Calm 1 − Δ𝑡 Δ𝑡 − 0.00001 0.00001
Volatile 20Δ𝑡 1 − 20Δ𝑡 − 0.00001 0.00001
Table 2 Speculative Bubble 1 − Δ𝑡 Δ𝑡 0.0
Parameters for the Drift Burst Hypothesis Model

Parameter Value
Bubble Length (𝑇bubble ) 5 × 252 days
Pre-Burst Drift Parameter (𝑎before ) 0.35
Post-Burst Drift Parameter (𝑎after ) -0.35
Pre-Burst Volatility Parameter (𝑏before ) 0.458
Post-Burst Volatility Parameter (𝑏after ) 0.458
Drift Burst Intensity (𝛼) 0.75
Volatility Burst Intensity (𝛽) 0.225
Explosion Filter Width 0.1

3.1.4. Market Regime Dynamics: Markov Chain Figure 3: Regime Transition Diagram
Transition Modeling
In our simulation, the transitions between market regimes
are governed by a Markov Chain model, drawing insights
3.1.5. Putting Them All Together: Synthetic
from the works of Xie and Deng [2022] and Elliott et al.
Controlled Market Environment
[2016] on regime-switching Heston models as shown in Ta-
In this section, we present the integration of a compre-
ble 3. The transition matrix, pivotal to the Markov chain
hensive synthetic market environment, utilizing a blend of
model, is meticulously calibrated based on these references
the Heston Stochastic Volatility and Merton Jump Diffusion
to represent regime shifts accurately, providing a realistic
models. Our implementation leverages the Python program-
portrayal of market regime dynamics within our synthetic
ming language, numpy for numerical computations, the @ji ⌋
controlled environment.
t decorator for performance optimization, and the QuantE-
con library’s qe.MarkovChain for Markov chain generation.
Stochastic elements’ reproducibility is ensured through np. ⌋

Arian, Norouzi, Seco Page 15 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

( ( ))
1 𝑣2
Δ𝑆𝑡 = 𝜇 − 𝜈𝑡 − 𝜆 𝑚 + 𝑆𝑡 Δ𝑡
2 2
√ √
+ 𝜈𝑡 𝑆𝑡 𝑍 Δ𝑡 + 𝑌 Δ𝑁(𝑡),
√ √ √
Δ𝜈𝑡 = 𝜅(𝜃 − 𝜈𝑡 )Δ𝑡 + 𝜉 𝜈𝑡 (𝜌𝜖 𝜖𝑡𝑃 + 1 − 𝜌2𝜖 𝜖𝑡𝜈 ) Δ𝑡.

Figure 4: Price Series with Market Regimes

Figure 6: Simulated Log Prices

The synthesized log returns, encapsulating 1000 path-


ways over 40 years of market dynamics, are comprehensively
summarized in Table 4, presenting the mean, standard devi-
ation, skewness, and excess kurtosis of returns across dif-
ferent market regimes. Across all regimes, the returns ex-
hibit a slight negative skewness and a notable excess kurto-
sis, suggesting a leptokurtic distribution more prone to ex-
treme events than a normal distribution. The ’Calm’ regime
presents a relatively higher mean and lower volatility, indi-
cating more stable market conditions. Conversely, the ’Volatile’
and ’Bubble’ regimes manifest heightened volatility and neg-
ative means, with the ’Bubble’ regime showing the largest
standard deviation and negative mean, characterizing peri-
Figure 5: Markov Regime Transition Matrix Heatmap From ods of significant market stress and potential downturns.
Simulated Data The Q-Q plots of log returns in Figure 7 illustrate regime-
specific distributions against a theoretical normal distribu-
tion. The ’All’ category shows significant tail deviations,
random.default_rng(). indicating outlier presence. The ’Calm’ regime aligns more
1000 price paths are generated, each simulating 40 years closely with normality except in the tails, hinting at occa-
of market data, equivalent to 40 × 252 business days. The sional extremes. The ’Volatile’ regime’s plot diverges more
1
time step for each simulation is Δ𝑡 = 252 . Initially, 1000 noticeably in the tails, typical of unstable market periods.
unique random seeds are generated, which is the foundation The ’Speculative Bubble’ displays steep slopes and marked
for the price path simulations. The simulations adhere to the tail divergence, characteristic of the rapid price swings dur-
following equations, as detailed in Eqn. (2.9) and Eqn. (2.10): ing speculative phases. These plots underscore the distinct

Arian, Norouzi, Seco Page 16 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

distributional features of each regime, from relative stability


in ’Calm’ conditions to the pronounced tail risks in ’Volatile’
and ’Speculative Bubble’ scenarios.

Table 4
Descriptive Statistics of Log Returns Overall and For Each
Regime

Regimes Mean Std. Skewness Excess Kurtosis


All 0.000 230 0.018 216 −0.124 832 5.164 035
Calm 0.000 306 0.014 699 −0.096 815 3.487 078
Volatile−0.000 131 0.033 051 −0.011 123 0.248 862
Bubble−0.000 753 0.040 550 −0.054 046 1.120 036

Figure 8: Density Distribution of Log Returns Overall and for


Each Regime

bet sizing, is methodically implemented and parameterized,


ensuring a harmonious integration that fortifies our model’s
predictive accuracy and adaptability to the nuanced dynam-
ics of financial markets.

3.2.1. Volatility Assessment and Event-Based


Sampling
Our quantitative analysis adopts an Exponentially Weighted
Moving Average (EWMA) approach to assess daily volatil-
ity, 𝜎𝑡 . Utilizing pandas.Series.ewm with a span of 100 days,
we accurately capture the evolving market volatility. This
Figure 7: Q-Q Plot of Log Returns Overall and For Each calculated 𝜎𝑡 forms the basis for our dynamic threshold in
Regime the symmetric Cumulative Sum (CUSUM) filter Lopez de
Prado [2018] applied to log prices. Specifically, we set the
threshold at 1.8𝜎𝑡 , which is instrumental in resampling the
3.2. Implementation of the Financial Machine data for identifying position opening days. This methodol-
ogy ensures a data-driven, responsive sampling process, ef-
Learning Strategy Components
fectively aligning our trading strategy with prevailing market
Our comprehensive financial machine-learning strategy
volatility and capturing significant price movements.
systematically integrates analytical components for robust
and dynamic market analysis. This encompasses a metic- 3.2.2. Determining Trade Directionality
ulous assessment of market volatility, precise event-based In our trading strategy, trade directionality is determined
sampling, strategic determination of trade directionality, ap- using a momentum strategy based on simple moving aver-
plication of advanced meta-labeling techniques, allocation ages, calculated via pandas.Series.rolling. We define a short-
of sample weights, and selection of financial features. Our ∑𝑁𝑓 𝑎𝑠𝑡 −1
methodology incorporates Fractional Differentiation to achieve term moving average MAshort (𝑦𝑡 ) = 𝑁 1 𝑖=0
𝑦𝑡−𝑖 and
𝑓 𝑎𝑠𝑡
stationarity without compromising memory, alongside the 1 ∑𝑁𝑠𝑙𝑜𝑤 −1
a long-term moving average MAlong (𝑦𝑡 ) = 𝑁 𝑖=0
𝑦𝑡−𝑖 ,
utilization of technical analysis indicators for enhanced mar- 𝑠𝑙𝑜𝑤
where 𝑁𝑓 𝑎𝑠𝑡 and 𝑁𝑠𝑙𝑜𝑤 represent the window sizes for the
ket insight. The strategy intricately balances risk and oppor-
respective averages. Trade positions are initiated based on
tunity through optimal bet sizing, grounded in probabilis-
the crossover of these averages post a CUSUM event: a long
tic assessments from meta-labeling and the uniqueness of
position when MAshort (𝑦𝑡 ) exceeds MAlong (𝑦𝑡 ), suggesting
trade labels. Each component, from volatility assessment to
upward momentum, and a short position when MAshort (𝑦𝑡 )

Arian, Norouzi, Seco Page 17 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

falls below MAlong (𝑦𝑡 ), indicative of downward momentum. This methodical approach to feature selection, blending
This approach aligns trading actions with the prevailing mar- fractional differentiation with technical analysis, enables our
ket trend, as reflected in the price momentum. model to capture intricate market dynamics effectively. The
average Pearson correlation between the 22 features extracted
3.2.3. Meta-Labeling Strategy from each of the 1000 generated price pathways is demon-
Incorporating Lopez de Prado [2018]’s meta-labeling with strated in Figure 9.
the triple-barrier method, our strategy evaluates trades post-
momentum-based direction determination. This process cru-
cially informs position sizing decisions and enhances trade
selection accuracy. The triple-barrier method applies two
horizontal barriers for profit-taking and stop-loss, set with
dynamic volatility-adjusted thresholds of 0.5𝜎𝑡 and 1.5𝜎𝑡 re-
spectively, and a vertical barrier with a 20 working days
expiration time. The outcome of a trade is determined as
follows: hitting the upper (profit-taking) barrier results in
a label of 1 for successful trades while reaching the lower
(stop-loss) barrier first assigns a label of −1 for unsuccessful
trades. If neither horizontal barrier is hit within the vertical
time frame, the trade is evaluated based on the sign of the
return at the end of this period.

3.2.4. Sample Weight Allocation


Our financial model calculates sample weights based on
the uniqueness and magnitude of log returns within the meta-
labeled data. For each label 𝑖, a concurrency count 𝑐𝑡 and a
binary indicator array {1𝑡,𝑖 } are used to determine overlap-
ping intervals. The preliminary weight 𝑤̃ 𝑖 is computed as
‖∑𝑡𝑖,1 𝑟𝑡−1,𝑡 ‖
‖ ‖
‖ 𝑡=𝑡𝑖,0 𝑐𝑡 ‖. These weights are then normalized, giving
‖ ‖
𝑤̃ Figure 9: Average Feature Correlation Matrix
𝑤𝑖 = ∑𝐼 𝑖 , ensuring a balanced impact of each observa-
̃𝑗
𝑗=1 𝑤
tion on the model. This approach emphasizes learning from
distinct market events, thus refining the model’s accuracy.
3.2.6. Optimal Bet Sizing
3.2.5. Feature Selection for Financial Modeling In our model, bet sizing is calibrated using probabilities
We meticulously select features in our financial model- from the meta-labeling strategy and the uniqueness of each
ing process to ensure robust predictive capability. The Frac- label. For each label 𝑥, we test the hypothesis 𝐻0 ∶ 𝑝[𝑥 =
tional Differentiation (FracDiff) feature is pivotal in this en- 1 𝑝[𝑥=1]− 12
1] = using the statistic 𝑧 = √ . The bet size
deavor. We used the fixed-width window fractional differ- 2 𝑝[𝑥=1](1−𝑝[𝑥=1])
entiation approach to set the weight-loss threshold at 0.01. 𝑚 is determined as 𝑚 = 2𝑍[𝑧] − 1, where 𝑍[⋅] is the cumu-
The differentiation order is incrementally determined using lative distribution function of the standard normal distribu-
steps of size 0.1, with a p-value threshold of 0.05 for the tion. To mitigate look-ahead bias, we shift the bet size time
ADF test implemented using the statsmodels.tsa.stattools series by one day, then calculate the daily strategy return by
module with a maximum lag of maxlag=1, balancing mem- multiplying each bet size with the corresponding daily re-
ory retention with the attainment of stationarity in the se- turn and position side. The bet size is then readjusted for
ries. We leverage the ta Python library to construct Techni- the next trading day based on the forthcoming meta-label or
cal Analysis features, utilizing its default configurations to liquidated as needed. Aggregate bet sizing at any timestamp
∑𝐼
derive a spectrum of indicators. We apply specific thresh- 𝑖=1 𝑚𝑖 1𝑡,𝑖
𝑡 is computed as 𝑚𝑡 = ∑𝐼 , averaging all active bets
olds for some indicators to enhance their interpretability: 𝑖=1 1𝑡,𝑖
at that time. This approach ensures dynamic adaptation of
1. ADX Strength: A threshold of 25 distinguishes be- bet sizes to the evolving market conditions, aligning with the
tween strong and weak trends. probabilities and uniqueness of trade signals.
2. RSI Signal: Thresholds of 30 and 70 identify over-
bought and oversold conditions. 3.3. Design and Parameter Dynamics of Trial
3. CCI Signal: Thresholds of -100 and 100 signal po- Simulations
tential market reversals. In our empirical exploration, we methodically design 28
4. Stochastic Signal: Thresholds of 20 and 80 indicate strategy trials by manipulating two core components: the pa-
overbought and oversold conditions. rameters of the momentum cross-over strategy and the con-
figurations of various machine learning models. We create

Arian, Norouzi, Seco Page 18 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

a comprehensive array of trials by alternating between dif- this meticulous approach ensures robustness and accuracy
ferent sets of rolling window sizes in the momentum strat- in comparing our out-of-sample testing procedures.
egy and a diverse range of hyperparameters in the machine
learning models. This approach allows us to assess the per- 3.4.1. Implementation of K-Fold Cross-Validation
formance impact of these variables under varying market Our implementation of K-Fold Cross-Validation (KFold)
conditions and model specifications. Each trial represents in financial modeling utilizes the KFold class within the Cros ⌋
a unique combination of these configurations, providing us sValidatorController framework. Configured with n_split ⌋
with a broad spectrum of insights into the dynamics of our s=4, this approach partitions the dataset into four distinct seg-
financial strategy. ments, adhering to the conventional methodology of KFold.
Each segment sequentially serves as a test set, while the re-
3.3.1. Momentum Cross-Over Strategy Variations maining data forms the training set. This structure is piv-
Our financial model evaluates momentum cross-over strat- otal in our financial time series analysis, where it is crucial
egy variations by altering the moving averages’ rolling win- to avoid look-ahead bias and maintain the chronological in-
dow sizes. We test four distinct configurations: (5, 10), (20, tegrity of data.
50), (50, 100), and (70, 140), representing various pairs of
CrossValidatorController(
fast and slow-moving average window sizes. These trials
'kfold',
systematically examine the strategy’s performance under di-
n_splits=4,
verse temporal dynamics.
).cross_validator
3.3.2. Machine Learning Models Variations Given the nature of financial data, characterized by tem-
Our strategy explores various machine learning models poral dependencies, our KFold implementation is tailored to
to predict meta-labels, each with specific parameter configu- respect these sequences, ensuring more accurate and realistic
rations. Certain hyperparameters are explored with a single model validation. This adherence to the time series structure
candidate value, while others are tested across multiple val- in our KFold setup underscores our commitment to rigorous,
ues, ensuring all cases are comprehensively utilized in our temporally-aware analytical practices in financial modeling.
strategy trials. The configurations are strategically chosen
to heighten the potential for overfitting. We have 3.4.2. Implementation of Walk-Forward
Cross-Validation
1. k-Nearest Neighbors (k-NN): Implemented via skl ⌋ Implemented using the WalkForward class, our Walk-Forward
earn.neighbors.KNeighborsClassifier, with neighbors
Cross-Validation (WFCV) employs the CrossValidatorCont ⌋
parameter varied as n_neighbors : [1, 2, 3]. The roller with n_splits=4, indicating a division of the dataset
data is standardized using sklearn.preprocessing.St ⌋ into four sequential segments. This ensures chronological
andardScaler within a custom pipeline extending sk ⌋
training and testing phases, which is crucial for maintaining
learn.pipeline.Pipeline, which incorporates sample
temporal integrity in financial data analysis. This approach,
weights. emphasizing the sequence and structure of data, mirrors real-
2. Decision Tree: Utilized through sklearn.tree.Decis ⌋ world financial market dynamics and is key to achieving a
ionTreeClassifier, with parameters set to min_sampl ⌋ realistic assessment of model performance. The specific pa-
es_split : [2] and min_samples_leaf : [1]. rameterization of WFCV underscores our commitment to
3. XGBoost: Executed using xgboost.XGBClassifier, with temporal consistency and robust validation in financial mod-
parameters including n_estimators : [1000], max_de ⌋ eling.
pth : [1000000000], learning_rate : [1, 10, 100],
subsample : [1.0], and colsample_bytree : [1.0]. CrossValidatorController(
'walkforward',
3.4. Out-of-Sample Testing via Cross-Validation n_splits=4,
In our quantitative finance framework, we apply a com- ).cross_validator,
prehensive suite of cross-validation techniques to conduct By aligning model evaluation with the chronological pro-
out-of-sample testing, employing the robust CrossValidator ⌋ gression of market data, this configuration enhances the re-
Controller for initializing different validation methods. This liability and relevance of our strategy assessments.
includes K-Fold, Walk-Forward, Purged K-Fold, and Com-
binatorial Purged Cross-Validation, each specifically adapted 3.4.3. Implementation of Purged K-Fold
to the challenges of financial time series data. Utilizing ⌋ Cross-Validation
CrossValidator.backtest_predictions, we generate backtest The implementation of Purged K-Fold Cross-Validation
paths for each cross-validation method, comprising prob- in our framework leverages the PurgedKFold class through the
abilities corresponding to the meta-labels. For labels en- CrossValidatorController, specifically tailored for financial
countered across multiple backtest paths, we average their time series data. Configured with n_splits=4, an embargo
probabilities, creating a consolidated measure that informs rate of embargo=0.02, and time-based partitioning, this ap-
subsequent strategy performance calculations. Integrating proach rigorously maintains the integrity of the temporal or-
traditional and innovative cross-validation methodologies, der. The initialization parameters ensure that the dataset is

Arian, Norouzi, Seco Page 19 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

divided into four contiguous segments, each representing a our collection of 28 strategy trials. The analysis is method-
distinct period in time. ically structured to encompass a holistic evaluation of the
entire performance timeline and an annualized, segmented
CrossValidatorController( examination. Each year is meticulously analyzed, consider-
'purgedkfold', ing 252 trading days per segment. This dual-faceted analysis
n_splits=4, offers insights into the strategies’ overall and specific yearly
times=times, performances and serves as a litmus test for the effective-
embargo=0.02 ness of different out-of-sample testing techniques in curbing
).cross_validator overfitting. The scatter plot in Figure 10 illustrates a negligi-
ble correlation of -0.03 between the Probability of Backtest
This structure is instrumental for mitigating information Overfitting (PBO) and the Best Trial Deflated Sharpe Ratio
leakage and lookahead biases by purging training data that (DSR) Test Statistic in the overall analysis, signaling their
overlaps with the validation period and implementing an em- independence as evaluative tools. Their independence is in-
bargo period. Such modifications are crucial in financial strumental, as it implies a multi-faceted assessment of back-
modeling, where the chronological sequence of data plays test validity, combining robustness checks against overfitting
a pivotal role in the validity and realism of backtesting re- with adjustments for multiple hypothesis testing, thereby en-
sults. Our Purged K-Fold setup, therefore, ensures a more riching the strategy selection process with diverse yet com-
authentic and reliable assessment of the model’s predictive plementary reliability metrics.
capabilities.

3.4.4. Implementation of Combinatorial Purged


Cross-Validation
Our implementation of Combinatorial Purged Cross-Validation
(CPCV) is realized using the CombinatorialPurged class, or-
chestrated through the CrossValidatorController. Tailored
for financial time series analysis, CPCV is initialized with ⌋
n_splits=8 and n_test_groups=2, signifying that the dataset is
divided into eight non-overlapping groups with two groups
designated for testing in each combinatorial split. Addition-
ally, an embargo rate of embargo=0.02 is applied to mitigate
information leakage further. This setup is encapsulated in
the following configuration:

CrossValidatorController(
'combinatorialpurged',
n_splits=8,
n_test_groups=2,
times=times,
embargo=0.02
).cross_validator,

The CPCV approach, with its combinatorial nature, en-


Figure 10: Probability of Backtest Overfitting vs Best Trial
sures a thorough and diversified examination of the model’s
Deflated Sharpe Ratio Test Statistic with the Correlation of
performance across multiple backtest paths, effectively ad- -0.03
dressing overfitting concerns prevalent in financial model-
ing. The integration of purging and embargoing within this
framework further bolsters the temporal integrity of the val-
3.5.1. Implementation of Combinatorially Symmetric
idation process, making CPCV a robust tool for assessing
Cross-Validation (CSCV)
predictive models in the dynamic environment of financial
For assessing the Probability of Backtest Overfitting (PBO)
markets. The selection of parameters in our implementation
in our 28 strategy trials, we implement the Combinatorially
reflects a deliberate balance between comprehensive back-
Symmetric Cross-Validation (CSCV) using Python’s numpy
testing and computational feasibility.
library. The CSCV method is applied to a matrix of strat-
egy returns, which represents the log returns for different
3.5. Comparative Assessment of Out-of-Sample
model configurations across various time observations. We
Testing Techniques utilize n_partitions = 16 to divide the performance matrix
In our study, we conduct a detailed comparative assess-
into an equal number of disjoint submatrices, ensuring a bal-
ment of various out-of-sample testing methodologies, focus-
anced evaluation across multiple data segments. Our evalua-
ing on reducing the likelihood of backtest overfitting within
tion metric, the Sharpe ratio, is computed through a custom

Arian, Norouzi, Seco Page 20 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

function to measure the performance of strategies over an


annual risk-free rate of 0.05. The probability_of_backtes ⌋
t_overfitting function synthesizes this data, estimating the
PBO and producing an array of logit values. This procedure
thoroughly compares each strategy’s in-sample and out-of-
sample performance, which is crucial for identifying over-
fitting and ensuring the robustness of our trading strategies
against future market scenarios.

3.5.2. Utilization of the Deflated Sharpe Ratio in False


Discovery Analysis
In our analysis of 28 trading strategy trials, we employ
the Deflated Sharpe Ratio (DSR) to critically assess the like-
lihood of false discoveries. This evaluation begins with com-
puting the Sharpe Ratios for each strategy, using an annual
risk-free rate of 0.05, to identify the best-performing trial.
Subsequently, we calculate the DSR, considering the skew-
ness, kurtosis of log returns, and the variance of Sharpe Ra-
tios across all trials. Crucially, we focus on the test statistic
derived from the DSR calculation rather than its value post-
application in the normal cumulative distribution function
(CDF). This approach allows us to more accurately discern
between strategies that exhibit predictive skill and those that Figure 11: Distribution of Probability of Backtest Overfitting
may have performed well by chance, ensuring a more robust Values Across Simulations For Each Cross-Validation Method
and reliable selection of the optimal trading strategy.

3.6. Analytical Approaches for Out-of-Sample


Testing Results Evaluation
A meticulous statistical examination of out-of-sample test-
ing results is vital for validating the robustness of cross-validation
methods against overfitting. In this subsection, we imple-
ment a comprehensive suite of non-parametric statistical tests,
augmented by multivariate analysis, to rigorously compare
the distributional characteristics of metric values derived from
different cross-validation techniques. Utilizing Python li-
braries scipy.stats, scikit_posthocs, and sklearn, we per-
form the Kruskal-Wallis H Test for a global understanding
of distributional differences, followed by pairwise compar-
isons via Dunn’s Test for specific methodological distinc-
tions. Furthermore, we conduct Principal Component Anal-
ysis (PCA) to assess the independence of simulation out-
puts, which provides a deeper insight into the interdependen-
cies of the backtest overfitting metrics across our simulation
trials. Together, these analytical strategies ensure a robust
evaluation of the cross-validation methods’ stability, relia-
bility, and independence, painting a comprehensive picture
of their performance in financial strategy validation.
Figure 12: Distribution of Best Trial Deflated Sharpe Ra-
3.6.1. Assessing Distributional Variance Across tio Test Statistic Values Across Simulations For Each Cross-
Methods Validation Method
The Kruskal-Wallis H Test, a non-parametric method for
determining stochastic dominance among multiple groups,
evaluates the null hypothesis that the distributions of metric calculate the effect size using 𝜂 2 = 𝐻−(𝑘−1)
𝑁−𝑘
, where 𝐻 is the
values across different cross-validation methods are identi- Kruskal-Wallis statistic, 𝑘 represents the number of groups,
cal. Unlike parametric counterparts, it does not necessitate and 𝑁 is the total number of observations. This statistic de-
the assumption of normally distributed data. Significant re- lineates the proportion of total variance in the metric values
sults from this test suggest at least one group’s distribution the cross-validation method explains, providing insights be-
differs from others. Should the test yield significance, we

Arian, Norouzi, Seco Page 21 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

yond statistical significance to the magnitude of differences


observed.

3.6.2. Delineating Distinct Distributions via Pairwise


Comparisons
We proceed with Dunn’s Test for pairwise comparisons
upon detecting significant variance in distributions with the
Kruskal-Wallis H Test. This method pinpoints the specific
cross-validation methods with statistically discernible differ-
ences. Dunn’s Test is adept for multiple comparisons, ap-
plying the Bonferroni correction to adjust the significance
threshold, thus controlling the family-wise error rate and re-
inforcing the validity of the inferential distinctions among
pairs.

3.6.3. Principal Component Analysis for Simulation


Independence
To evaluate the correlation and ascertain the indepen-
dence of our 1000 Simulation, we conducted a Principal Com-
ponent Analysis (PCA) on the annualized metric values, as-
sessing backtest overfitting. By examining the Cumulative
Explained Variance by PCA Components for each CV method,
we could discern the degree of correlation among trials. A Figure 13: Comparison of Overall Probability of Backtest
higher explained variance by fewer principal components in- Overfitting and Best Trial Deflated Sharpe Ratio Test Statistic
dicates a stronger correlation and less independence between Values Across Simulations For Each Cross-Validation Method
the trials, which is critical in understanding the diversifica-
tion benefits of our strategy portfolio. Our PCA implemen-
tation utilized Python’s sklearn.pipeline.Pipeline, incorpo- est median PBO value of 0.451437, suggesting a potentially
rating a sklearn.preprocessing.StandardScaler to normalize higher risk of overfitting than other methods. The Kruskal-
the data, a simple average imputer sklearn.impute.SimpleI ⌋ Wallis test indicated significant discrepancies among the groups
mputer for handling missing values, and a sklearn.decompo ⌋ (𝑝 = 7.05 × 10−9 , 𝜂 2 = 0.01022), underscoring the pres-
sition.PCA to perform the decomposition, thereby providing ence of at least one method with a distinct PBO distribu-
a quantitative assessment of the simulations’ interdependen- tion. Dunn’s pairwise comparison, as presented in Table 5,
cies. further corroborated significant distinctions: ’Combinato-
rial Purged’ demonstrated a markedly lower PBO compared
3.7. Disclosure of Empirical Findings to both ’K-Fold’ (𝑝 = 4.20×10−6 ) and ’Purged K-Fold’ (𝑝 =
In this subsection, we unveil the empirical findings de- 3.32 × 10−7 ), as well as against ’Walk-Forward’ (𝑝 = 1.09 ×
rived from our meticulous analysis of backtest overfitting 10−6 ), implying a superior efficacy in mitigating the risk of
within various cross-validation frameworks. Our investiga- overfitting. Meanwhile, ’K-Fold’ and ’Purged K-Fold’ were
tion meticulously scrutinizes the Probability of Backtest Over- statistically indistinguishable from ’Walk-Forward’, suggest-
fitting (PBO) and the Best Trial Deflated Sharpe Ratio (DSR) ing similar PBO profiles between these methods. These in-
Test Statistic, employing robust statistical methods to dis- sights are essential for strategically selecting cross-validation
cern significant disparities and temporal variabilities across methodologies in quantitative finance models, aiming to re-
multiple validation techniques. The ensuing results illumi- duce the probability of overfitting while ensuring robust pre-
nate the comparative resilience of these methods to the per- dictive performance.
ils of overfitting and offer a nuanced understanding of their Our simulations’ non-parametric analysis of the Best Trial
temporal behavior, thereby guiding the strategic selection of Deflated Sharpe Ratio (DSR) Test Statistic values revealed
the most robust and stable cross-validation approaches in fi- distinct statistical characteristics among the various cross-
nancial strategy development. This disclosure is anchored validation methods. As illustrated in Figure 13, the distri-
in a profound commitment to empirical rigor to bolster the bution of DSR values for the ’Walk-Forward’ methodology
integrity of model validation processes within quantitative markedly differed from those of other methods, with a no-
finance. tably lower median value of 0.160818. The Kruskal-Wallis
test confirmed significant disparities across the distributions
3.7.1. Overall Assessment of Backtest Overfitting (𝑝 = 5.0367 × 10−15 , 𝜂 2 = 0.017), suggesting at least one
Our comparative analysis of the Probability of Backtest method deviates from the others in terms of DSR values.
Overfitting (PBO) across various cross-validation techniques Subsequent pairwise comparisons using Dunn’s Test, detailed
revealed significant statistical differences, as visualized in in Table 6, identified ’Walk-Forward’ as significantly differ-
Figure 13. The ’Walk-Forward’ approach exhibited the high-

Arian, Norouzi, Seco Page 22 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

Table 5
Distributions Comparison for Probability of Backtest Over-
fitting Values Across Simulations For Each Cross-Validation
Method
Test P-Value Effect Size (𝜂 2 )
Kruskal Wallis 7.05e-09 0.01022
Dunn’s Test p-Value Significant
Combinatorial Purged vs. K-Fold 4.20e-06 Yes
Combinatorial Purged vs. Purged K-Fold 3.32e-07 Yes
Combinatorial Purged vs. Walk-Forward 1.09e-06 Yes
K-Fold vs. Purged K-Fold 1.0 No
K-Fold vs. Walk-Forward 1.0 No
Purged K-Fold vs. Walk-Forward 1.0 No

ent from both ’K-Fold’ and ’Purged K-Fold’ (𝑝 = 8.32 ×


10−12 and 𝑝 = 1.42 × 10−11 , respectively), while ’Combi-
natorial Purged’ did not exhibit significant differences from
’K-Fold’ and ’Purged K-Fold’. These statistical insights are
crucial for recognizing each cross-validation method’s rela-
tive effectiveness and characteristics in minimizing overfit-
ting while striving for optimal Sharpe ratio performance. Figure 14: Comparison of Temporal Probability of Backtest
Overfitting and Best Trial Deflated Sharpe Ratio Test Statistic
Efficiency Ratio Values Across Simulations For Each Cross-
Table 6 Validation Method
Distributions Comparison for Best Trial Deflated Sharpe Ra-
tio Test Statistic Values Across Simulations For Each Cross-
Validation Method
K-Fold’ and ’Walk-Forward’ (𝑝 = 9.60 × 10−54 ). These
Test P-Value Effect Size (𝜂 2 )
findings underscore the importance of considering the Effi-
Kruskal Wallis 5.0367e-15 0.0174
ciency Ratio when evaluating the consistency of PBO over
Dunn’s Test p-Value Significant time, with ’Walk-Forward’ showing the greatest variability
Combinatorial Purged vs. K-Fold 1.0 No and, thus, potentially, the least stability in PBO values year
Combinatorial Purged vs. Purged K-Fold 1.0 No over year.
Combinatorial Purged vs. Walk-Forward 3.21-09 Yes
K-Fold vs. Purged K-Fold 1.0 No
K-Fold vs. Walk-Forward 8.32e-12 Yes Table 7
Purged K-Fold vs. Walk-Forward 1.42e-11 Yes Distributions Comparison for Probability of Backtest Over-
fitting Efficiency Ratio Values Across Simulations For Each
Cross-Validation Method
Test P-Value Effect Size (𝜂 2 )
Kruskal Wallis 1.426e-76 0.08876
3.7.2. Temporal Variability of Overfitting Assessment
2 Dunn’s Test p-Value Significant
The annual Efficiency Ratio, defined as 𝜇𝜎 2 , was calcu-
Combinatorial Purged vs. K-Fold 8.14e-23 Yes
lated for the Probability of Backtest Overfitting (PBO) across
Combinatorial Purged vs. Purged K-Fold 9.29e-22 Yes
various cross-validation methods to assess the relative vari- Combinatorial Purged vs. Walk-Forward 1.06e-07 Yes
ability of PBO through time. As depicted in Figure 14, ’Walk- K-Fold vs. Purged K-Fold 1.0 No
Forward’ displayed the highest median Efficiency Ratio value K-Fold vs. Walk-Forward 2.15e-54 Yes
of 0.224821, suggesting a higher variance to the mean PBO Purged K-Fold vs. Walk-Forward 9.60e-54 Yes
value than the other methods. The Kruskal-Wallis test re-
vealed highly significant differences in the Efficiency Ratio
distributions across methods (𝑝 = 1.426×10−76 , 𝜂 2 = 0.09), 2
indicating varying levels of PBO stability. Dunn’s Test re- The annual Efficiency Ratio 𝜇𝜎 2 for the Best Trial De-
sults, presented in Table 7, showed significant differences flated Sharpe Ratio (DSR) Test Statistic values was scruti-
between ’Combinatorial Purged’ and both ’K-Fold’ (𝑝 = nized to evaluate the variability of the DSR through time
8.14×10−23 ) and ’Purged K-Fold’ (𝑝 = 9.29×10−22 ), as well for each cross-validation method. Figure 14 illustrates the
as ’Walk-Forward’ (𝑝 = 1.06×10−7 ). Additionally, ’K-Fold’ distributions of these ratios, with ’Combinatorial Purged’
and ’Walk-Forward’ demonstrated a significant variance in showing a notably lower median Efficiency Ratio of 34.30,
their Efficiency Ratios (𝑝 = 2.15 × 10−54 ), as did ’Purged suggesting greater efficiency in DSR performance. In stark

Arian, Norouzi, Seco Page 23 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

contrast, ’K-Fold’ and ’Purged K-Fold’ showed higher me-


dian values of 84.36 and 82.59, respectively, indicating less
efficiency. The Kruskal-Wallis test underscored significant
differences in efficiency across methods (𝑝 = 1.43 × 10−76 ,
𝜂 2 = 0.0888). According to Dunn’s Test results shown in Ta-
ble 8, ’Combinatorial Purged’ demonstrated statistically sig-
nificant higher efficiency when compared to both ’K-Fold’
(𝑝 = 8.14 × 10−23 ) and ’Purged K-Fold’ (𝑝 = 9.29 × 10−22 ),
as well as ’Walk-Forward’ (𝑝 = 1.06 × 10−7 ). Conversely,
’K-Fold’ and ’Walk-Forward’ showed no significant differ-
ence in their efficiency (𝑝 = 2.15×10−54 ), similar to ’Purged
K-Fold’ versus ’Walk-Forward’ (𝑝 = 9.60 × 10−54 ). These
findings are instrumental for discerning the most efficient
cross-validation method regarding DSR variability, which is
crucial for achieving stable performance in financial machine-
learning applications.

Table 8
Distributions Comparison for Best Trial Deflated Sharpe Ratio
Test Statistic Efficiency Ratio Values Across Simulations For
Each Cross-Validation Method
Test P-Value Effect Size (𝜂 2 )
Kruskal Wallis 1.43e-76 0.0888 Figure 15: Comparison of Temporal Probability of Backtest
Overfitting and Best Trial Deflated Sharpe Ratio Test Statistic
Dunn’s Test p-Value Significant ADF Test Statistic Values Across Simulations For Each Cross-
Combinatorial Purged vs. K-Fold 8.14e-23 Yes Validation Method
Combinatorial Purged vs. Purged K-Fold 9.29e-22 Yes
Combinatorial Purged vs. Walk-Forward 1.06e-07 Yes
K-Fold vs. Purged K-Fold 1.00 No Table 9
K-Fold vs. Walk-Forward 2.15e-54 Yes Distributions Comparison for Probability of Backtest Overfit-
Purged K-Fold vs. Walk-Forward 9.60e-54 Yes ting ADF Test Statistic Values Across Simulations For Each
Cross-Validation Method
Test P-Value Effect Size (𝜂 2 )
Kruskal Wallis 0.0 0.55106
3.7.3. Temporal Stationarity of Overfitting Assessment Dunn’s Test p-Value Significant
In our annual time series analysis of the Probability of Combinatorial Purged vs. K-Fold 1.0 No
Backtest Overfitting (PBO), the Augmented Dickey-Fuller Combinatorial Purged vs. Purged K-Fold 1.0 No
(ADF) test statistic values were utilized to examine the sta- Combinatorial Purged vs. Walk-Forward 0.0 Yes
tionarity of the PBO through time. These values are depicted K-Fold vs. Purged K-Fold 1.0 No
in Figure 15 and quantitatively analyzed in Table 9. The K-Fold vs. Walk-Forward 0.0 Yes
’Walk-Forward’ method exhibited a markedly higher median Purged K-Fold vs. Walk-Forward 0.0 Yes
ADF value of -2.41, indicating less stationarity and greater
trend presence than other methods. The Kruskal-Wallis test
yielded a significant result (𝑝 < 0.01, 𝜂 2 = 0.55), imply-
ing substantial differences in the time series characteristics Table 10. The ’Walk-Forward’ approach demonstrated a higher
among the methods. Dunn’s Test revealed that the ’Walk- median ADF value of -3.86, suggesting a weaker presence of
Forward’ method’s ADF values were significantly different stationarity compared to the more negative ADF values of
from those of ’K-Fold’, ’Purged K-Fold’, and ’Combinato- the other methods, which implies a stronger rejection of the
rial Purged’ (all with 𝑝 = 0.0), underscoring its distinct be- unit root and thus a stronger indication of stationarity. The
havior in terms of stationarity. These findings suggest that Kruskal-Wallis test provided extremely significant evidence
while ’Walk-Forward’ might be more prone to exhibit trends of distributional differences among the methods (𝑝 = 2.01 ×
in PBO over time, the other methods did not show signifi- 10−50 , 𝜂 2 = 0.059). Dunn’s Test further identified signifi-
cant differences among themselves, indicating similar levels cant differences between ’Walk-Forward’ and all other meth-
of stationarity in their respective PBO values. ods, with ’Walk-Forward’ being less stationary compared to
The stationarity of the annual Best Trial Deflated Sharpe ’K-Fold’ (𝑝 = 2.38 × 10−37 ), ’Purged K-Fold’ (𝑝 = 1.65 ×
Ratio (DSR) Test Statistic values was assessed using the Aug- 10−31 ), and ’Combinatorial Purged’ (𝑝 = 9.81 × 10−36 ).
mented Dickey-Fuller (ADF) test, with the distributions vi- These results indicate that ’Walk-Forward’ may be less suit-
sualized in Figure 15 and the statistical analysis detailed in able for strategies that require a consistent DSR over time,

Arian, Norouzi, Seco Page 24 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

while the other cross-validation methods do not exhibit sig-


nificant differences regarding stationarity in their DSR val-
ues.

Table 10
Distributions Comparison for Best Trial Deflated Sharpe Ratio
Test Statistic ADF Test Statistic Values Across Simulations
For Each Cross-Validation Method
Test P-Value Effect Size (𝜂 2 )
Kruskal Wallis 2.01e-50 0.05853
Dunn’s Test p-Value Significant
Combinatorial Purged vs. K-Fold 1.0 No
Combinatorial Purged vs. Purged K-Fold 1.0 No
Combinatorial Purged vs. Walk-Forward 9.81e-36 Yes
K-Fold vs. Purged K-Fold 1.0 No
K-Fold vs. Walk-Forward 2.38e-37 Yes
Purged K-Fold vs. Walk-Forward 1.65e-31 Yes

3.7.4. Correlation of Overfitting Assessments Across


Simulations
Figure 16: Temporal Probability of Backtest Overfitting and
Our Principal Component Analysis (PCA) investigation Best Trial Deflated Sharpe Ratio Test Statistic Cumulative
into the correlation between different overfitting metrics across Explained Variance by PCA Components For Each Cross-
simulations revealed notable patterns of dependency. As de- Validation Method
picted in the PCA cumulative explained variance plots (Fig-
ure 16), both the Probability of Backtest Overfitting (PBO)
and the Best Trial Deflated Sharpe Ratio (DSR) Test Statistic of Backtest Overfitting (PBO) revealed ’Walk-Forward’ as
values for the ’Walk-Forward’ method are characterized by a having the highest median value, indicating greater tempo-
higher explained variance with fewer principal components. ral variability and reduced stability. ’Combinatorial Purged’,
This pattern indicates a lower level of simulation result in- however, displayed a notably lower Efficiency Ratio, sug-
dependence, suggesting that the performance metrics of the gesting enhanced temporal stability and consistency in per-
’Walk-Forward’ method are more interrelated than those of formance. When evaluating the Efficiency Ratio for the DSR
other cross-validation methods. Such a correlation structure Test Statistic, ’Combinatorial Purged’ exhibited a notably
within the ’Walk-Forward’ simulations may imply an inher- lower median value, implying greater efficiency and stabil-
ent bias or systemic influence affecting the simulations, an ity in its DSR performance over time. This contrasted with
essential consideration for strategy validation and the selec- ’K-Fold’ and ’Purged K-Fold’, which showed higher median
tion of robust cross-validation methodologies. values, indicating reduced efficiency and potential variabil-
ity in DSR performance.
4. Discussion The temporal stationarity analysis of PBO, using the Aug-
mented Dickey-Fuller (ADF) test, revealed that the ’Walk-
In assessing backtest overfitting, we observed notable Forward’ method exhibited less stationarity, indicating a greater
disparities across various cross-validation techniques. The presence of trends in its PBO over time. Other methods, in-
’Walk-Forward’ approach exhibited the highest Probability cluding ’Combinatorial Purged’, displayed more consistent
of Backtest Overfitting (PBO), signaling a heightened risk of stationarity levels, suggesting more reliable performance. In
overfitting. In contrast, the ’Combinatorial Purged’ method assessing the stationarity of the DSR Test Statistic values,
significantly outperformed others like ’K-Fold’ and ’Purged ’Walk-Forward’ demonstrated weaker stationarity, as indi-
K-Fold’, demonstrating its effectiveness in reducing overfit- cated by its higher median ADF value. This contrasted with
ting risks. The Deflated Sharpe Ratio (DSR) Test Statistic other methods, which showed stronger indications of station-
evaluation highlighted distinct performance variations among arity, implying a more stable and consistent rejection of unit
the methods. ’Walk-Forward’ showed a markedly lower me- root in their DSR values over time.
dian DSR, suggesting heightened false discovery probabil- Our Principal Component Analysis (PCA) on the cor-
ity. In comparison, ’Combinatorial Purged’ aligned closely relation between different overfitting metrics across simula-
with ’K-Fold’ and ’Purged K-Fold’, indicating a more bal- tions highlighted a unique pattern for the ’Walk-Forward’
anced approach in achieving optimal performance while mit- method, characterized by a higher explained variance with
igating overfitting. fewer principal components. This pattern suggests a lower
Our analysis of the Efficiency Ratio for the Probability level of result independence, indicating potential biases or

Arian, Norouzi, Seco Page 25 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376


Backtest Overfitting in the Machine Learning Era

systemic influences in the ’Walk-Forward’ method. market trading. The Journal of Business, 39(1):226–241, 1966. ISSN
00219398, 15375374.
[12] Evelyn Fix and Joseph Lawson Hodges. Nonparametric discrimina-
5. Conclusions tion: consistency properties. Randolph Field, Texas, Project, pages
21–49, 1951.
Our investigation into cross-validation methodologies in [13] James D. Hamilton. Time Series Analysis. Princeton University Press,
financial modeling has revealed critical insights, especially 1994.
the superiority of the ’Combinatorial Purged’ method in min- [14] Floyd B Hanson and Zongwu Zhu. Comparison of market parameters
imizing overfitting risks. This method outperforms tradi- for jump-diffusion distributions using multinomial maximum likeli-
hood estimation. In 2004 43rd IEEE Conference on Decision and
tional approaches like ’K-Fold’, ’Purged K-Fold’, and no- Control (CDC)(IEEE Cat. No. 04CH37601), volume 4, pages 3919–
tably ’Walk-Forward’ in terms of both the Probability of Back- 3924. IEEE, 2004.
test Overfitting (PBO) and the Deflated Sharpe Ratio (DSR) [15] Steven L. Heston. A closed-form solution for options with stochastic
Test Statistic. ’Walk-Forward’, in contrast, shows limita- volatility with applications to bond and currency options. The Review
tions in preventing false discovery and exhibits greater tem- of Financial Studies, 6(2):327–343, 1993.
[16] Ulrich Homm and Jörg Breitung. Testing for speculative bubbles in
poral variability and weaker stationarity from temporal as- stock markets: a comparison of alternative methods. Journal of Fi-
sessment of these methodologies using the Efficiency Ratio nancial Econometrics, 10(1):198–231, 2012.
and the Augmented Dickey-Fuller (ADF) test, raising con- [17] Kin Lam and HC Yam. Cusum techniques for technical trading in
cerns about its reliability. On the other hand, ’Combinato- financial markets. Financial Engineering and the Japanese Markets,
rial Purged’ demonstrates enhanced stability and efficiency, 4:257–274, 1997.
[18] Marcos Lopez De Prado. The future of empirical finance. Journal of
proving to be a more reliable choice for financial strategy Portfolio Management, 41(4), 2015.
development. The choice between ’Purged K-Fold’ and ’K- [19] Marcos Lopez de Prado. Advances in financial machine learning.
Fold’ requires caution, as they show no significant perfor- John Wiley & Sons, 2018.
mance difference, and ’Purged K-Fold’ may reduce the ro- [20] Marcos Lopez de Prado. Machine learning for asset managers. Cam-
bustness of training data for out-of-sample testing. These bridge University Press, 2020.
[21] Robert C. Merton. Option pricing when underlying stock returns are
findings significantly contribute to quantitative finance, pro- discontinuous. Journal of Financial Economics, 3(1):125–144, 1976.
viding a robust framework for cross-validation that aligns [22] Andrew Papanicolaou and Ronnie Sircar. A regime-switching heston
theoretical robustness with practical reliability. They un- model for vix and s&p 500 implied volatilities. Quantitative Finance,
derscore the need for tailored evaluation methods in an era 14(10):1811–1827, 2014.
of complex algorithms and large datasets, guiding decision- [23] Michael Schatz and Didier Sornette. Inefficient bubbles and efficient
drawdowns in financial markets. International Journal of Theoretical
making in a data-driven financial world. Future research and Applied Finance, 23(07):2050047, 2020.
should extend these findings to real-world market conditions [24] Laerd Statistics. Kruskal-wallis h test using spss statistics. Statistical
to enhance their applicability and generalizability. tutorials and software guides, 2015.
[25] Yurong Xie and Guohe Deng. Vulnerable european option pricing in
a markov regime-switching heston model with stochastic interest rate.
References Chaos, Solitons & Fractals, 156:111896, 2022.
[1] David H Bailey and Marcos Lopez de Prado. The sharpe ratio efficient
frontier. Journal of Risk, 15(2):13, 2012.
[2] David H Bailey and Marcos López de Prado. The deflated sharpe
ratio: Correcting for selection bias, backtest overfitting and non-
normality. Journal of Portfolio Management, 40(5):94–107, 2014b.
[3] David H Bailey, Jonathan M Borwein, Marcos López de Prado, and
Qiji Jim Zhu. Pseudomathematics and financial charlatanism: The
effects of backtest over fitting on out-of-sample performance. Notices
of the AMS, 61(5):458–471, 2014a.
[4] David H Bailey, Jonathan Borwein, Marcos Lopez de Prado, and
Qiji Jim Zhu. The probability of backtest overfitting. Journal of Com-
putational Finance, forthcoming, 2016.
[5] Carlo Bonferroni. Teoria statistica delle classi e calcolo delle proba-
bilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche
e Commericiali di Firenze, 8:3–62, 1936.
[6] Leo Breiman. Classification and regression trees. Routledge, 2017.
[7] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting
system. In Proceedings of the 22nd acm sigkdd international confer-
ence on knowledge discovery and data mining, pages 785–794, 2016.
[8] Kim Christensen, Roel Oomen, and Roberto Renò. The drift burst
hypothesis. Journal of Econometrics, 227(2):461–497, 2022. ISSN
0304-4076.
[9] Olive Jean Dunn. Multiple comparisons using rank sums. Techno-
metrics, 6(3):241–252, 1964.
[10] Robert J Elliott, Katsumasa Nishide, and Carlton-James U Osakwe.
Heston-type stochastic volatility with a markov switching regime.
Journal of Futures Markets, 36(9):902–919, 2016.
[11] Eugene F. Fama and Marshall E. Blume. Filter rules and stock-

Arian, Norouzi, Seco Page 26 of 26

Electronic copy available at: https://ssrn.com/abstract=4686376

You might also like