Demystifying Apache Arrow
robinlinacre
Originally posted: 2020-10-22.
In my work as a data scientist, I've come across Apache Arrow in a range of seemingly unrelated circumstances. However, I've always struggled to describe exactly what it is and what it does.

The project describes itself in quite abstract terms, and for good reason. The project is extremely ambitious, and aims to provide the backbone for a wide range of data processing tasks. This means it sits at a low level, providing building blocks for higher-level, user-facing analytics tools like pandas or dplyr.
As a result, the importance of the project can be hard to understand for users who run into it
occasionally in their day to day work, because much of what it does is behind the scenes.
In this post I describe some of the user-facing features of Apache Arrow which I have run into
in my work, and explain why they are all facets of more fundamental problems which Apache
Arrow aims to solve.
By connecting these dots it becomes clear why Arrow is not just a useful tool to solve some
practical problems today, but one of the most exciting emerging tools, with the potential to
be the engine behind large parts of future data science workflows.
A striking feature of Arrow is that it can read csvs into pandas more than 10x faster than pandas.read_csv.
This is actually a two-step process: Arrow reads the data into memory as an Arrow table, which is really just a collection of record batches, and then converts the Arrow table into a pandas dataframe.
Arrow has its own in-memory storage format. When we use Arrow to load data into
Pandas, we are really loading data into the Arrow format (an in-memory format for
data frames), and then translating this into the pandas in-memory format. Part of the
speedup in reading csvs therefore comes from the careful design of the Arrow
columnar format itself.
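As a rough sketch in Python (the file name is hypothetical), the two steps look something like this:

```python
from pyarrow import csv

# Step 1: Arrow parses the csv (using multiple threads) into an Arrow table,
# i.e. a collection of record batches in Arrow's columnar format.
table = csv.read_csv("example.csv")  # hypothetical file

# Step 2: the Arrow table is converted into a pandas dataframe.
df = table.to_pandas()
print(df.head())
```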
Running Python user defined functions in Apache Spark has historically been very slow — so
slow that the usual recommendation was not to do it on datasets of any significant size.
More recently, Apache Arrow has made it possible to efficiently transfer data between JVM
and Python processes. Combined with vectorised UDFs, this has resulted in huge speedups.
Prior to the introduction of Arrow, the process of translating data from the Java representation to the Python representation and back was slow, involving serialisation and deserialisation. It also happened one row at a time, which is slower because many optimised data processing operations work most efficiently on data held in a columnar format, with the data in each column held in contiguous sections of memory.
The use of Arrow almost completely eliminates the serialisation and deserialisation step, and
also allows data to be processed in columnar batches, meaning more efficient vectorised
algorithms can be used.
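A minimal sketch of a vectorised (pandas) UDF in recent PySpark versions, which relies on Arrow to move columnar batches between the JVM and Python; the column name and the calculation are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Arrow-based transfer between the JVM and Python (enabled by default in
# newer Spark versions, shown here for clarity).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def add_vat(price: pd.Series) -> pd.Series:
    # Receives a whole batch of rows as a pandas Series, rather than
    # being called once per row.
    return price * 1.2

df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
df.withColumn("price_inc_vat", add_vat("price")).show()
```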
More fundamentally, this is an example of a broader goal: algorithms implemented in one language being usable by data scientists working in other languages.
Arrow is able to read a folder containing many data files into a single dataframe. Arrow also
supports reading folders where data is partitioned into subfolders.
This functionality is illustrative of the higher-level design of Arrow around record batches. If
everything is a record batch, then it matters little whether the source data is stored in a single
file or multiple files.
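A hedged sketch using pyarrow's datasets API; the folder name and the Hive-style partitioning (subfolders such as year=2020) are hypothetical:

```python
import pyarrow.dataset as ds

# Treat a whole folder of Parquet files, partitioned into subfolders,
# as a single logical dataset.
dataset = ds.dataset("taxi_data/", format="parquet", partitioning="hive")

# Materialise it as one Arrow table, and on to pandas if desired.
table = dataset.to_table()
df = table.to_pandas()
```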
Arrow can be used to read parquet files into data science tools like Python and R, correctly
representing the data types in the target tool. Since parquet is a self-describing format, with
the data types of the columns specified in the schema, getting data types right may not seem
difficult. But a process of translation is still needed to load data into different data science
tools, and this is surprisingly complex.
For example, historically the pandas integer type has not allowed integers to be null, despite this being possible in parquet files. Similar quirks apply to other languages. Arrow handles this process of translation for us. Data is always loaded into the Arrow format first, but Arrow provides translators that are then able to convert this into language-specific in-memory formats like the pandas dataframe. This means that for each target format, only a single translator between Arrow and that format is needed, a huge improvement over needing a separate translator for each pair of tools.
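A small sketch of the nullable-integer quirk; the column name is made up, and the types_mapper option shown is one way recent versions of pyarrow and pandas can preserve integer semantics via pandas' nullable Int64 extension type:

```python
import pandas as pd
import pyarrow as pa

# An Arrow column of integers containing a null.
table = pa.table({"counts": pa.array([1, 2, None], type=pa.int64())})

# Default conversion: pandas historically had no nullable integer type,
# so the column typically comes back as float64 with NaN for the null.
print(table.to_pandas().dtypes)

# Requesting pandas' nullable Int64 extension type keeps the values as integers.
print(table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get).dtypes)
```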
This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally
there would be a single in-memory format for dataframes rather than each tool having its
own representation.
Arrow makes it easy to write data held in-memory in tools like pandas and R to disk in
Parquet format.
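A small sketch of the pandas-to-Parquet round trip via Arrow; the dataframe contents and output path are made up:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"city": ["Leeds", "York"], "population": [793_000, 210_000]})

# Convert the pandas dataframe to an Arrow table, then write it as Parquet.
table = pa.Table.from_pandas(df)
pq.write_table(table, "cities.parquet")  # hypothetical output path
```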
Why not just persist the data to disk in Arrow format, and thus have a single, cross-language
data format that is the same on-disk and in-memory? One of the biggest reasons is that
Parquet generally produces smaller data files, which is more desirable if you are IO-bound.
This will especially be the case if you are loading data from cloud storage such as AWS S3.
Julien LeDem explains this further in a blog post discussing the two formats:
The trade-offs for columnar data are different for in-memory. For data on disk,
usually IO dominates latency, which can be addressed with aggressive
compression, at the cost of CPU. In memory, access is much faster and we want to
optimise for CPU throughput by paying attention to cache locality, pipelining, and
SIMD instructions.
This makes it desirable to have close cooperation in the development of the in-memory and
on-disk formats, with predictable and speedy translation between the two representations —
and this is what is offered by the Arrow and Parquet formats.
Whilst very useful, the user-facing features discussed so far are not revolutionary. The real
power is in the potential of the underlying building blocks to enable new approaches to data
science tooling. At the moment (autumn 2020), some of this is still work in progress by the
team behind Arrow.
In Apache Arrow, dataframes are composed of record batches. If stored on disk in Parquet
files, these batches can be read into the in-memory Arrow format very quickly.
We’ve seen this can be used to read a csv file into memory. But more broadly this concept
opens the door to entire data science workflows which are parallelised and operate on record
batches, thereby eliminating the need for the full table to be stored in memory.
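A rough sketch of batch-at-a-time processing with pyarrow's datasets API; the folder and column names are hypothetical:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("sales/", format="parquet")  # hypothetical folder of files

total = 0
# Stream the data one record batch at a time; only the "amount" column is
# read, and the full table is never materialised in memory.
for batch in dataset.to_batches(columns=["amount"]):
    total += pc.sum(batch.column(0)).as_py()

print(total)
```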
Currently, the close coupling of data science tools and their in-memory representations of
dataframes means that algorithms written in one language are not easily portable to other
languages. This means the same standard operations, such as filters and aggregations, get
endlessly re-written and optimisations do not easily translate between tools.
Arrow provides a standard for representing dataframes in memory, and a mechanism for allowing multiple languages to refer to the same in-memory data.
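As a small illustration of this idea (just the Python side, not the full cross-language machinery), an Arrow array can be exposed to numpy without copying the underlying buffer:

```python
import pyarrow as pa

# A primitive Arrow array with no nulls can be viewed from numpy with
# zero copy; zero_copy_only raises an error if a copy would be required.
arr = pa.array([1.0, 2.0, 3.0])
view = arr.to_numpy(zero_copy_only=True)
print(view)  # backed by the same Arrow buffer, not a copy
```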
The Arrow developers are also working on an Arrow-native columnar query execution engine in C++, with intended use not only from C++ but also from user-space languages like Python, R, and Ruby.
This would provide a single, highly optimised codebase, available for use from multiple
languages. The pool of developers who have an interest in maintaining and contributing to
this library would be much larger because it would be available for use across a wide range of
tools.
These ideas come together in the description of Arrow as an 'API for data'. The idea is that Arrow provides a cross-language in-memory data format and an associated query execution engine, which together provide the building blocks for analytics libraries. Like any good API, Arrow provides a performant solution to common problems without the user needing to fully understand the implementation.
This paves the way for a step change in data processing capabilities, with analytics libraries no longer needing to work with the constraint that the full dataset must fit into memory, whilst breaking down barriers to sharing data between tools via faster ways to transmit data, either on a single machine or over the network.
We can get a glimpse of the potential of this approach in the vignette in the Arrow package in R, in which a 37GB dataset with 2 billion rows is processed and aggregated on a laptop in under 2 seconds.
Where does this leave existing high-level data science tools and languages like SQL, pandas
and R?
As the vignette illustrates, the intention is not for Arrow to compete directly with these tools.
Instead, higher level tools can use Arrow under the hood, imposing little constraint on their
user-facing API. It therefore allows for continued innovation in the expressiveness and
features of different tools whilst improving performance and saving authors the task of
reinventing the common components.
Further reading
https://arrow.apache.org/overview/
https://arrow.apache.org/use_cases/
https://arrow.apache.org/powered_by/
https://arrow.apache.org/blog/
https://ursalabs.org/blog/
https://wesmckinney.com/archives.html
https://www.youtube.com/watch?v=fyj4FyH3XdU
Filesystems API
Datasets API
DataFrame API