Demystifying Apache Arrow
robinlinacre
Originally posted: 2020-10-22.
In my work as a data scientist, I've come across Apache Arrow in a range of seemingly unrelated circumstances. However, I've always struggled to describe exactly what it is and what it does.

The project describes itself in quite abstract terms, and for good reason. The project is extremely ambitious, and aims to provide the backbone for a wide range of data processing tasks. This means it sits at a low level, providing building blocks for higher-level, user-facing analytics tools like pandas or dplyr.
As a result, the importance of the project can be hard to understand for users who run into it
occasionally in their day to day work, because much of what it does is behind the scenes.
In this post I describe some of the user-facing features of Apache Arrow which I have run into
in my work, and explain why they are all facets of more fundamental problems which Apache
Arrow aims to solve.
By connecting these dots it becomes clear why Arrow is not just a useful tool to solve some
practical problems today, but one of the most exciting emerging tools, with the potential to
be the engine behind large parts of future data science workflows.
A striking feature of Arrow is that it can read csvs into pandas more than 10x faster than pandas.read_csv.
This is actually a two-step process: Arrow reads the data into memory as an Arrow table, which is really just a collection of record batches, and then converts the Arrow table into a pandas dataframe.
Arrow has its own in-memory storage format. When we use Arrow to load data into
Pandas, we are really loading data into the Arrow format (an in-memory format for
data frames), and then translating this into the pandas in-memory format. Part of the
speedup in reading csvs therefore comes from the careful design of the Arrow
columnar format itself.
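As a rough sketch in Python (the file name is hypothetical), the two steps look something like this:

```python
from pyarrow import csv

# Step 1: Arrow parses the csv (using multiple threads) into an Arrow table,
# i.e. a collection of record batches in Arrow's columnar format.
table = csv.read_csv("example.csv")  # hypothetical file

# Step 2: the Arrow table is converted into a pandas dataframe.
df = table.to_pandas()
print(df.head())
```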
Running Python user defined functions in Apache Spark has historically been very slow — so
slow that the usual recommendation was not to do it on datasets of any significant size.
More recently, Apache Arrow has made it possible to efficiently transfer data between JVM
and Python processes. Combined with vectorised UDFs, this has resulted in huge speedups.
Prior to the introduction of Arrow, the process of translating data from the Java representation to the Python representation and back was slow, involving serialisation and deserialisation. It also happened one row at a time, which is slower because many optimised data processing operations work most efficiently on data held in a columnar format, with the data in each column held in contiguous sections of memory.
The use of Arrow almost completely eliminates the serialisation and deserialisation step, and
also allows data to be processed in columnar batches, meaning more efficient vectorised
algorithms can be used.
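A minimal sketch of a vectorised (pandas) UDF in recent PySpark versions, which relies on Arrow to move columnar batches between the JVM and Python; the column name and the calculation are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Arrow-based transfer between the JVM and Python (enabled by default in
# newer Spark versions, shown here for clarity).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def add_vat(price: pd.Series) -> pd.Series:
    # Receives a whole batch of rows as a pandas Series, rather than
    # being called once per row.
    return price * 1.2

df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
df.withColumn("price_inc_vat", add_vat("price")).show()
```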
More fundamentally, this is an example of a broader goal: algorithms implemented in one language being usable by data scientists working in other languages.
Arrow is able to read a folder containing many data files into a single dataframe. Arrow also
supports reading folders where data is partitioned into subfolders.
This functionality is illustrative of the higher-level design of Arrow around record batches. If
everything is a record batch, then it matters little whether the source data is stored in a single
file or multiple files.
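A hedged sketch using pyarrow's datasets API; the folder name and the Hive-style partitioning (subfolders such as year=2020) are hypothetical:

```python
import pyarrow.dataset as ds

# Treat a whole folder of Parquet files, partitioned into subfolders,
# as a single logical dataset.
dataset = ds.dataset("taxi_data/", format="parquet", partitioning="hive")

# Materialise it as one Arrow table, and on to pandas if desired.
table = dataset.to_table()
df = table.to_pandas()
```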
Arrow can be used to read parquet files into data science tools like Python and R, correctly
representing the data types in the target tool. Since parquet is a self-describing format, with
the data types of the columns specified in the schema, getting data types right may not seem
difficult. But a process of translation is still needed to load data into different data science
tools, and this is surprisingly complex.
For example, historically the pandas integer type has not allowed integers to be null, despite this being possible in parquet files. Similar quirks apply to other languages. Arrow handles this process of translation for us. Data is always loaded into the Arrow format first, but Arrow provides translators that are then able to convert this into language-specific in-memory formats like the pandas dataframe. This means that for each target format, only a single translator between Arrow and that format is needed, a huge improvement over needing a separate translator for each pair of tools.
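A small sketch of the nullable-integer quirk; the column name is made up, and the types_mapper option shown is one way recent versions of pyarrow and pandas can preserve integer semantics via pandas' nullable Int64 extension type:

```python
import pandas as pd
import pyarrow as pa

# An Arrow column of integers containing a null.
table = pa.table({"counts": pa.array([1, 2, None], type=pa.int64())})

# Default conversion: pandas historically had no nullable integer type,
# so the column typically comes back as float64 with NaN for the null.
print(table.to_pandas().dtypes)

# Requesting pandas' nullable Int64 extension type keeps the values as integers.
print(table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get).dtypes)
```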
This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally
there would be a single in-memory format for dataframes rather than each tool having its
own representation.
Arrow makes it easy to write data held in-memory in tools like pandas and R to disk in
Parquet format.
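A small sketch of the pandas-to-Parquet round trip via Arrow; the dataframe contents and output path are made up:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"city": ["Leeds", "York"], "population": [793_000, 210_000]})

# Convert the pandas dataframe to an Arrow table, then write it as Parquet.
table = pa.Table.from_pandas(df)
pq.write_table(table, "cities.parquet")  # hypothetical output path
```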
Why not just persist the data to disk in Arrow format, and thus have a single, cross-language
data format that is the same on-disk and in-memory? One of the biggest reasons is that
Parquet generally produces smaller data files, which is more desirable if you are IO-bound.
This will especially be the case if you are loading data from cloud storage such as AWS S3.
Julien LeDem explains this further in a blog post discussing the two formats:
The trade-offs for columnar data are different for in-memory. For data on disk,
usually IO dominates latency, which can be addressed with aggressive
compression, at the cost of CPU. In memory, access is much faster and we want to
optimise for CPU throughput by paying attention to cache locality, pipelining, and
SIMD instructions.
This makes it desirable to have close cooperation in the development of the in-memory and
on-disk formats, with predictable and speedy translation between the two representations —
and this is what is offered by the Arrow and Parquet formats.
Whilst very useful, the user-facing features discussed so far are not revolutionary. The real
power is in the potential of the underlying building blocks to enable new approaches to data
science tooling. At the moment (autumn 2020), some of this is still work in progress by the
team behind Arrow.
In Apache Arrow, dataframes are composed of record batches. If stored on disk in Parquet
files, these batches can be read into the in-memory Arrow format very quickly.
We’ve seen this can be used to read a csv file into memory. But more broadly this concept
opens the door to entire data science workflows which are parallelised and operate on record
batches, thereby eliminating the need for the full table to be stored in memory.
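A rough sketch of batch-at-a-time processing with pyarrow's datasets API; the folder and column names are hypothetical:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("sales/", format="parquet")  # hypothetical folder of files

total = 0
# Stream the data one record batch at a time; only the "amount" column is
# read, and the full table is never materialised in memory.
for batch in dataset.to_batches(columns=["amount"]):
    total += pc.sum(batch.column(0)).as_py()

print(total)
```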
Currently, the close coupling of data science tools and their in-memory representations of
dataframes means that algorithms written in one language are not easily portable to other
languages. This means the same standard operations, such as filters and aggregations, get
endlessly re-written and optimisations do not easily translate between tools.
Arrow provides a standard for representing dataframes in memory, and a mechanism for allowing multiple languages to refer to the same in-memory data.
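As a small illustration of this idea (just the Python side, not the full cross-language machinery), an Arrow array can be exposed to numpy without copying the underlying buffer:

```python
import pyarrow as pa

# A primitive Arrow array with no nulls can be viewed from numpy with
# zero copy; zero_copy_only raises an error if a copy would be required.
arr = pa.array([1.0, 2.0, 3.0])
view = arr.to_numpy(zero_copy_only=True)
print(view)  # backed by the same Arrow buffer, not a copy
```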
The Arrow developers are also working on an Arrow-native columnar query execution engine in C++, with intended use not only from C++ but also from user-space languages like Python, R, and Ruby.
This would provide a single, highly optimised codebase, available for use from multiple
languages. The pool of developers who have an interest in maintaining and contributing to
this library would be much larger because it would be available for use across a wide range of
tools.
These ideas come together in the description of Arrow as an 'API for data'. The idea is that Arrow provides a cross-language in-memory data format and an associated query execution engine, which together provide the building blocks for analytics libraries. Like any good API, Arrow provides a performant solution to common problems without the user needing to fully understand the implementation.
This paves the way for a step change in data processing capabilities, with analytics libraries no longer needing to work with the constraint that the full dataset must fit into memory, whilst breaking down barriers to sharing data between tools via faster ways to transmit data, either on a single machine or over the network.
We can get a glimpse of the potential of this approach in the vignette in the Arrow package in R, in which a 37GB dataset with 2 billion rows is processed and aggregated on a laptop in under 2 seconds.
Where does this leave existing high-level data science tools and languages like SQL, pandas
and R?
As the vignette illustrates, the intention is not for Arrow to compete directly with these tools.
Instead, higher level tools can use Arrow under the hood, imposing little constraint on their
user-facing API. It therefore allows for continued innovation in the expressiveness and
features of different tools whilst improving performance and saving authors the task of
reinventing the common components.
Further reading
https://arrow.apache.org/overview/
https://arrow.apache.org/use_cases/
https://arrow.apache.org/powered_by/
https://arrow.apache.org/blog/
https://ursalabs.org/blog/
https://wesmckinney.com/archives.html
https://www.youtube.com/watch?v=fyj4FyH3XdU
Filesystems API
Datasets API
DataFrame API