Data Pre-Processing
Avoiding key mistakes and getting the most out of your data
QUANT ARB
JAN 10, 2024 ∙ PAID
Introduction
Before any research can be conducted, we must prepare our data and, in doing
so, transform it into the most useful version of itself. There are two reasons
why we need to pre-process our data:
1. To ensure the data is accurate and is not subtly misleading us.
2. To transform and reduce the data into a form that is practical to work with.
In this article, we walk through best practices for each of these reasons
behind data pre-processing, with the aim of improving readers' research
abilities and helping them avoid common mistakes that beginners, and even more
experienced practitioners, often make.
Index
1. Introduction
2. Index
3. Ensuring Accuracy
4. Transformation and Reduction of Data
5. Final Remarks
Ensuring Accuracy
Accuracy takes many forms beyond the data's values being correct: we also need
to ensure that the data is not subtly misleading us. This section therefore
goes beyond outright errors and covers common issues that arise from misuse of
the data, or from the data itself being misleading.
Bid/Ask Bounce:
This phenomenon occurs when the traded price bounces between the bid and ask
prices. Most providers form their candlesticks from trade prices rather than
from the mid-price. On illiquid assets, this leads to the close price
oscillating up and down by large amounts, which is easy to mistake for
mean-reversion when it is merely prints alternating between the bid and the ask.
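As a minimal illustration (assuming a quotes DataFrame indexed by timestamp with hypothetical bid and ask columns), building bars from the mid-price rather than from trade prices removes most of the bounce:

```python
import pandas as pd

def mid_price_bars(quotes: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """Resample quotes into OHLC bars of the mid-price.

    Bars built this way do not oscillate between bid and ask the way
    trade-price candles on illiquid assets do."""
    mid = (quotes["bid"] + quotes["ask"]) / 2
    return mid.resample(freq).ohlc()
```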
Some exchanges, especially for digital assets, include OTC trades in the main
trade feed, so errors can occur when these trade prices do not match the
calculated best bid/ask. This usually creates issues when the trade feed is
used to infer the current state of the orderbook: trades are typically
disseminated before quotes, so when trades occur, people will use that
information to update their view of the book. That breaks down when the trades
never actually touched the orderbook due to their OTC nature, creating a
momentary false perception of the orderbook.
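One way to guard against this (a rough sketch, assuming timestamp-indexed trades with a price column and quotes with bid/ask columns) is to flag prints that fall outside the prevailing quotes and exclude them from any book-state updates:

```python
import pandas as pd

def flag_off_book_trades(trades: pd.DataFrame, quotes: pd.DataFrame,
                         tol: float = 0.0) -> pd.DataFrame:
    """Mark trades that print outside the prevailing best bid/ask.

    Both frames are assumed to be indexed by timestamp; trades has a 'price'
    column and quotes has 'bid' and 'ask' columns. Flagged trades (often
    OTC/block prints) should not be used to update book state."""
    # Attach the most recent quote known at each trade time (backward as-of join).
    merged = pd.merge_asof(trades.sort_index(), quotes.sort_index(),
                           left_index=True, right_index=True,
                           direction="backward")
    merged["off_book"] = (merged["price"] < merged["bid"] - tol) | \
                         (merged["price"] > merged["ask"] + tol)
    return merged
```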
Trade Aggregation:
Whilst this does not necessarily create issues when used properly, mistakes are
easily made when researchers do not differentiate between individual trades and
aggregate trades. If I send an order for 1000 units and 100 units match at the
first level, I create one trade print of 100 at the first level of the book,
and so on for every price level my order sweeps through. Some exchanges /
vendors will show each individual print, whereas others will aggregate them all
into a single trade at the average price.
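If your feed delivers individual prints, a simple way to recover the aggregate view (a sketch, assuming hypothetical timestamp, side, price and size columns, where prints from one taker order share an exact timestamp and side) is:

```python
import pandas as pd

def aggregate_prints(trades: pd.DataFrame) -> pd.DataFrame:
    """Collapse individual prints from one taker order into a single aggregate trade."""
    trades = trades.assign(notional=trades["price"] * trades["size"])
    agg = trades.groupby(["timestamp", "side"], sort=False).agg(
        size=("size", "sum"),
        notional=("notional", "sum"),
        levels_hit=("price", "nunique"),   # how many price levels were swept
    )
    # Volume-weighted average price across the swept levels.
    agg["vwap"] = agg["notional"] / agg["size"]
    return agg.drop(columns="notional").reset_index()
```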
We may have situations where both views are relevant, such as estimating the
probability of getting filled at a given level of the orderbook. We certainly
need the individual trades to determine which levels of the book get hit, but
we also need our simulation to understand that all of these prints arrive at
exactly the same time. If our model calculates that 10 of the individual prints
from an aggregate order would have filled us, and we then make a hedge trade
because of those fills, we cannot assume that the same liquidity can be re-used.
Often, cryptocurrency exchanges add fake trades to the trade feed to try to
prop up their volume statistics. When you actually go to quote in those books,
you will find that these trades match at prices worse than yours - even if you are
the best level in the book, all without you getting a single fill. Why is this? Fake
flow. That’s why. It can lead to drastic overestimations of your ability to get
filled in a book or the ability to slowly execute size through the book without
being detected/moving the market. On the liquidity side, dirty exchanges will
have fake quotes that get pulled immediately after one of them gets hit. You can
still hit them, but the second you do, they’ll all disappear. Not all liquidity you
see is really there with the intent of liquidity provision - some is there to meet
the terms of an agreement and will flash away the second you test it. With this
sort of liquidity, it is best to hit it all in one go and skip the VWAP/TWAP
algorithms - take it before you lose it.
Adjusted Data:
Especially with financial data, values are often edited after they are
released. Perhaps there was an error or, worse, an accounting fraud took place,
so after the fact the data was corrected to reflect its true value. The asset's
price for that period will not reflect this, of course, nor will the
information that was available to traders at
the time. Thus, you end up with look-ahead bias as your algorithm prices in the
accounting fraud/error before it is even made public.
This is not just the case with financial data; many datasets are adjusted later on
for various reasons. It is critical to ensure that the data is as it was when it was
released and not in its post-correction form.
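In practice this means working from a point-in-time (vintaged) version of the dataset. A minimal sketch, assuming hypothetical series_id, value_date, publish_date and value columns where revisions carry a later publish_date:

```python
import pandas as pd

def as_reported(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Return the latest vintage of each observation that was public at `as_of`.

    Filtering on the publication date removes look-ahead bias from post-hoc
    corrections and restatements."""
    visible = df[df["publish_date"] <= as_of]
    # Keep the most recent publication for each (series, observation date) pair.
    latest = (visible.sort_values("publish_date")
                     .groupby(["series_id", "value_date"], as_index=False)
                     .last())
    return latest
```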
Alternative data frequently has issues where the measurements from one provider
are drastically different from another's. An example where I have seen this
consistently is on-chain / fundamental data for cryptocurrencies: vendors will
provide wildly different estimates of various metrics, such as global volume,
with some filtering out wash volume and others including it.
There are ways to get around this, with the best being to use multiple sources
and triangulate the correct value using three or more vendors.
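A simple version of this (assuming you have the same metric from each vendor as a timestamp-indexed series) is to take the cross-vendor median and flag points where the vendors disagree badly:

```python
import pandas as pd

def triangulate(vendors: dict[str, pd.Series]) -> pd.DataFrame:
    """Combine the same metric from several vendors.

    `vendors` maps a vendor name to a timestamp-indexed series of the metric.
    The cross-vendor median is the working estimate; large dispersion suggests
    one vendor is including (or filtering) something the others are not."""
    panel = pd.DataFrame(vendors)
    return pd.DataFrame({
        "median": panel.median(axis=1),
        "dispersion": (panel.max(axis=1) - panel.min(axis=1)) / panel.median(axis=1),
    })
```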
Flow data commonly has this issue as well, where estimates of OTC flow will
vary due to the inclusion of different markets or even a lack of care from the
vendors themselves.
Spot Borrow:
It is a mistake to believe that every asset can be shorted, let alone at a
reasonable cost. For perpetuals, we must consider funding rates, which can of
course be handled by acquiring the correct data; far less often considered is
the ability to get short in spot and the borrow costs associated with doing so.
You can usually assume that anything in the S&P 500 can be shorted, and liquid,
large-market-capitalization assets generally can be too, but when dealing with
less liquid assets this is an important consideration and an important dataset
to acquire.
This isn't necessarily a data error but an error arising from not including critical
information in your dataset before starting research. These datasets are VERY
expensive in equities and unavailable in digital assets as far as I’ve seen - best to
get scraping before you end up needing it!
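Once you do have borrow data, the sketch below (hypothetical column names throughout) shows one way to fold it into research: zero out short signals in names with no locate or a punitive borrow fee before any backtesting happens:

```python
import pandas as pd

def apply_borrow_constraints(signals: pd.DataFrame, borrow: pd.DataFrame,
                             max_fee_bps: float = 300.0) -> pd.DataFrame:
    """Drop short signals in names that cannot realistically be borrowed.

    Assumes `signals` has columns ['date', 'ticker', 'target_weight'] and
    `borrow` has ['date', 'ticker', 'borrow_fee_bps', 'shares_available']."""
    merged = signals.merge(borrow, on=["date", "ticker"], how="left")
    unshortable = (
        merged["target_weight"].lt(0)
        & (merged["shares_available"].fillna(0).le(0)
           | merged["borrow_fee_bps"].fillna(float("inf")).gt(max_fee_bps))
    )
    merged.loc[unshortable, "target_weight"] = 0.0
    return merged
```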
Two main kinds of data for HFT are almost impossible to get without collecting
them yourself: latency data and fill data. You need to place orders at the best
bid/ask and start ping-ponging to collect fill / adversity-related data, and
for latency data you also need to have been actively trading during the period.
The latency that really matters is, of course, matching-engine latency (for
crypto; other markets differ), and that can only be measured by actively
trading the market at the time and recording the data.
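Collecting it is not complicated, it just has to be done live. A minimal sketch, assuming the exchange acknowledgement carries a hypothetical matching-engine timestamp field:

```python
import time
import pandas as pd

def record_latency(order_log: list[dict], send_ns: int, ack: dict) -> None:
    """Append one latency observation from a live order.

    `send_ns` is our local clock (time.time_ns()) captured just before the
    order hit the socket; `ack` is the exchange acknowledgement, assumed to
    carry a hypothetical 'transact_time_ns' from the matching engine. Clock
    offset matters, so the local round-trip is recorded as well."""
    recv_ns = time.time_ns()
    order_log.append({
        "send_ns": send_ns,
        "engine_ns": ack.get("transact_time_ns"),
        "round_trip_us": (recv_ns - send_ns) / 1_000,
    })

# Later: pd.DataFrame(order_log) gives a latency dataset you own and can analyse.
```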
Resampling:
Whilst some applications warrant the most granular data possible, we don't
always require that level of detail and instead have to reduce our data to make
it practical to use.
Price data is typically resampled into bars, the most common being OHLCV
(Open, High, Low, Close, Volume) bars. There are four main types of these bars,
which differ in their sampling method:
1. Time
2. Tick
3. Volume
   a. Sampling every X amount of base asset traded (for BTC/USDT, the base
      asset refers to BTC, especially in crypto where some traders work in
      coin terms. This can be useful, but in most markets, the quote asset is
      better)
4. Quote (dollar) volume
For the most part, timestamp-based sampling is enough and is the easiest to
work with. If part of your dataset involves financial or any other form of
timestamp-sampled data, things get tricky if your price data is not also
sampled on timestamps.
There are more complicated methods, but in my view they are a waste. All you
need is time and quote-asset volume bars. In most cases, time bars are fine if
you continuously evaluate your alpha (your alpha can include volume anyway; you
can bake in a volume adjustment where relevant instead of making the data hard
to work with by creating an irregular frequency with volume bars). Some alphas
only work with volume bars, and conversely some only work with time bars (think
seasonality), so there is a use for both, but time bars are what most brokers
provide, and spending lots of time creating your own custom bars should only be
done if there is good reason for it.
Adding regularity to your data by resampling from trade + quotes into bars
makes working with it a LOT faster and easier overall.
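A minimal sketch of the two bar types recommended above (time bars and quote-volume bars), assuming a trades DataFrame indexed by timestamp with hypothetical price and size columns:

```python
import pandas as pd

def time_bars(trades: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """OHLCV time bars resampled from raw trades."""
    bars = trades["price"].resample(freq).ohlc()
    bars["volume"] = trades["size"].resample(freq).sum()
    return bars

def quote_volume_bars(trades: pd.DataFrame, bar_notional: float) -> pd.DataFrame:
    """Bars sampled every `bar_notional` of quote-asset (dollar) volume."""
    notional = (trades["price"] * trades["size"]).cumsum()
    bar_id = (notional // bar_notional).astype(int)   # which bar each trade falls in
    grouped = trades.groupby(bar_id)
    return pd.DataFrame({
        "open": grouped["price"].first(),
        "high": grouped["price"].max(),
        "low": grouped["price"].min(),
        "close": grouped["price"].last(),
        "volume": grouped["size"].sum(),
    })
```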
Reduction:
Often, there is a lot of unnecessary data, and it can considerably slow us down.
Resampling can be an effective method of reduction, as described above, but we
can also reduce data size in other ways. One way is to remove new quotes where
the data that is relevant to us has not changed by more than a certain margin.
If we have quotes data and create mid-prices from it, we often end up with rows
where the mid-price has not actually changed from point to point, because only
the quoted sizes changed and not the prices themselves.
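A simple filter for this (a sketch, assuming hypothetical bid and ask columns) keeps a quote row only when the mid has moved by at least some threshold since the last row we kept:

```python
import pandas as pd

def thin_quotes(quotes: pd.DataFrame, min_move: float) -> pd.DataFrame:
    """Drop quote updates where the mid-price moved less than `min_move`
    (in price units) since the last retained row; size-only updates vanish."""
    if quotes.empty:
        return quotes
    mid = ((quotes["bid"] + quotes["ask"]) / 2).to_numpy()
    keep = [0]                      # always keep the first row
    last = mid[0]
    for i in range(1, len(mid)):
        if abs(mid[i] - last) >= min_move:
            keep.append(i)
            last = mid[i]
    return quotes.iloc[keep]
```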
Removing unnecessary string data, such as the asset name that appears
(duplicated) in every row, a common inclusion in many datasets, will improve
the storage size and the read/write speed of the file.
Parquet files tend to strike the best balance between read/write speed and
storage size. They also preserve your Pandas data types, instead of turning
strings into objects and causing the other annoying issues you get with CSVs.
HDF5 will be the fastest for read/write if all the data is numeric (Parquet is
faster if there are strings).
CSVs sometimes have to be used because Pandas cannot write the data to Parquet,
but they should generally be avoided.
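As a small example of both points together (hypothetical symbol column; to_parquet requires pyarrow or fastparquet to be installed), converting repeated strings to a categorical dtype and storing to Parquet keeps both the file and the dtypes tidy:

```python
import pandas as pd

def store_efficiently(df: pd.DataFrame, path: str = "trades.parquet") -> pd.DataFrame:
    """Round-trip a frame through Parquet with repeated strings as categoricals."""
    df = df.copy()
    df["symbol"] = df["symbol"].astype("category")   # repeated strings -> small codes
    df.to_parquet(path)                              # needs pyarrow or fastparquet
    return pd.read_parquet(path)                     # dtypes come back intact
```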
Final Remarks
There are many elements of pre-processing I've covered here, but also some that
do not make sense to cover because they are specific to the data you have and
the applications for that data. It typically takes time playing around with
data to solve real-world problems before you develop a strong understanding
here.
Often, our research tasks are clear-cut (which parameter is optimal?) and don't
necessarily require any novel method, simply applying what we know; but there
is still a lot to be said for optimizing our time. Can we get roughly the same
result with OHLCV bars instead of using quote data?
If we can, let's do it! The quote data will take ages to download, clean up,
and compute on because of its size, so it is best avoided if possible.
It always comes down to the minimum-work form of data that can get me the
answer. Frankly, if you need more data later, you can grab it, but in most
cases you will overshoot as a beginner, and that costs you FAR more time than
undershooting on data size/granularity does.
The information in these two articles should cover the rest of pre-processing,
and it would be a shame to repeat myself on it.
As always, reading is no substitute for actually doing. Please feel free to go out
there and start wrangling data if you want a better understanding of pre-
processing.