2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Database systems.pdf

Apache Arrow and DataFusion:
Changing the Game for Implementing Database Systems
Andrew Lamb, InfluxData
June 23, 2022
The Data Thread

Today: IOx Team at InfluxData; Apache Arrow PMC Member
Past life 1: Query Optimizer @ Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups

Proliferation of Databases
DB

What is going on?
COTS → Totally Custom
IT: “Buy and Operate”
● Buy software from vendors
● Operate on your own hardware, with sysadmins
FANG: “Build and Operate”
● Write software for, and operate all components
● Optimized for exact needs
Current Trend ✓: “Assemble and Operate”
● Assemble from open source technologies
● Operate on resources in a public cloud

Part of a long term trend in DB Specialization
Data Model: Relational, Key-Value, Timeseries, Graph, Array / Scientific, Document, Stream
Deployment: Embedded / Edge, Cloud, Single-Node, Hybrid
Ecosystem: Hadoop, Java, JSON / JavaScript, AWS, GCP, Azure, Apple Cloud
Use Case: Transactions, Analytics, Streaming, Batch / ETL, ...
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1

Implementation timeline for a new Database system
● Client API
● In memory storage
● In-Memory filter + aggregation
● Durability / persistence
● Metadata Catalog + Management
● Query Language Parser
● Optimized / Compressed storage
● Execution on Compressed Data
● Joins!
● Additional Client Languages
● Outer Joins
● Subquery support
● More advanced analytics
● Cost based optimizer
● Out of core algorithms
● Storage Rearrangement
● Heuristic Query Planner
● Arithmetic expressions
● Date / time Expressions
● Concurrency Control
● Data Model / Type System
● Distributed query execution
● Resource Management
● Online recovery
● Window functions
“Let's Build a Database” 🤔
“Ok now this is pretty good” 😐
“Look mom! I have a database!” 😃

DataFusion: A Query Engine
“DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.”
- DataFusion Website

DataFusion: A Query Engine
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
OR
DataFrame
ctx.read_table("http")?
    .filter(...)?
    .aggregate(..)?;
RecordBatches (input data and query results)
Catalog information: tables, schemas, etc
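To make concrete what the SQL query on this slide asks an engine to compute, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion's API: the `HttpRequests` struct and the `count_by_status` function are hypothetical names, and the two `Vec` columns are only a stand-in for Arrow RecordBatches.

```rust
use std::collections::BTreeMap;

/// A toy columnar table: one Vec per column, all of equal length.
/// (Hypothetical stand-in for an Arrow RecordBatch.)
pub struct HttpRequests {
    pub path: Vec<String>,
    pub status: Vec<u16>,
}

/// Rough equivalent of:
///   SELECT status, COUNT(1)
///   FROM http_api_requests_total
///   WHERE path = '/api/v2/write'
///   GROUP BY status;
pub fn count_by_status(table: &HttpRequests, wanted_path: &str) -> BTreeMap<u16, u64> {
    let mut counts = BTreeMap::new();
    // Walk both columns in lockstep and keep only rows that pass the filter.
    for (path, status) in table.path.iter().zip(table.status.iter()) {
        if path.as_str() == wanted_path {
            // Group by status and count one row per match.
            *counts.entry(*status).or_insert(0u64) += 1;
        }
    }
    counts
}
```

A real engine splits this single loop into separate filter and aggregate operators working on batches of rows, which is the part DataFusion supplies.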

DataFusion: LLVM-like Infrastructure for Databases
Query FrontEnds: SQL, DataFrame
Plan Representations (DataFlow Graphs): LogicalPlans, ExecutionPlan
Expression Eval
Optimizations / Transformations (for each plan representation)
Optimized Execution Operators (Arrow Based): HashAggregate, Sort, Join, …
Data Sources: Parquet, CSV, …
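The "Plan Representations" and "Optimizations / Transformations" boxes can be pictured with a small sketch: a plan is a tree of operators, and an optimization is a function from one plan to an equivalent plan. The `Plan` enum and `push_down_filters` rule below are hypothetical and far simpler than DataFusion's own plan types; predicates are plain strings purely for illustration.

```rust
/// A toy logical plan (hypothetical; not DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Read a table, applying any pushed-down predicates while scanning.
    Scan { table: String, pushed_filters: Vec<String> },
    /// Keep only rows matching `predicate`.
    Filter { predicate: String, input: Box<Plan> },
    /// Keep at most `n` rows.
    Limit { n: usize, input: Box<Plan> },
}

/// One rewrite rule: fold a Filter that sits directly on a Scan into the scan.
/// A Filter above a Limit is left alone, since moving it would change results.
pub fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => {
            // Rewrite the child first, then try to merge this filter into it.
            match push_down_filters(*input) {
                Plan::Scan { table, mut pushed_filters } => {
                    pushed_filters.push(predicate);
                    Plan::Scan { table, pushed_filters }
                }
                other => Plan::Filter { predicate, input: Box::new(other) },
            }
        }
        Plan::Limit { n, input } => Plan::Limit {
            n,
            input: Box::new(push_down_filters(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

Because the rule takes a plan and returns a plan, rules of this shape compose: an optimizer can be little more than a list of such functions applied in order.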

DataFusion: Totally Customizable
Extend ✅ at each layer:
Query FrontEnds: SQL, DataFrame
Plan Representations (DataFlow Graphs): LogicalPlans, ExecutionPlan
Expression Eval
Optimizations / Transformations (for each plan representation)
Optimized Execution Operators (Arrow Based): HashAggregate, Sort, Join, …
Data Sources: Parquet, CSV
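One way to picture an "Extend ✅" point is a registry where user code sits beside built-ins. The sketch below shows that idea for scalar functions using only the standard library. `ScalarFn` and `FunctionRegistry` are hypothetical names for illustration and do not reflect how DataFusion itself registers functions.

```rust
use std::collections::HashMap;

/// Hypothetical signature for a scalar function over one f64 column.
pub type ScalarFn = Box<dyn Fn(&[f64]) -> Vec<f64>>;

/// A toy function registry: built-ins and user-defined functions live side by side.
pub struct FunctionRegistry {
    functions: HashMap<String, ScalarFn>,
}

impl FunctionRegistry {
    /// Create a registry that already contains one built-in, `abs`.
    pub fn new() -> Self {
        let mut functions: HashMap<String, ScalarFn> = HashMap::new();
        functions.insert(
            "abs".to_string(),
            Box::new(|col: &[f64]| -> Vec<f64> { col.iter().map(|v| v.abs()).collect() }),
        );
        FunctionRegistry { functions }
    }

    /// Extension point: callers add their own functions by name.
    pub fn register(&mut self, name: &str, f: ScalarFn) {
        self.functions.insert(name.to_string(), f);
    }

    /// Evaluate a function by name over a whole column; None if the name is unknown.
    pub fn call(&self, name: &str, column: &[f64]) -> Option<Vec<f64>> {
        self.functions.get(name).map(|f| f(column))
    }
}
```

The engine side only ever looks functions up by name, so a user-registered function is evaluated through exactly the same path as a built-in one.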

Cube.js / Cube Store
https://cube.dev/
● Overview:
○ Headless Business Intelligence
○ Cube.js pre-aggregation storage layer.
● Use of DataFusion (fork)
○ SQL API (with custom extensions)
○ Custom Logical and Physical Operators
○ UDFs: custom functions
○ Optimized native plan execution

InfluxDB IOx
https://github.com/influxdata/influxdb_iox
● Overview:
○ In-memory columnar store using object storage, future core of InfluxDB; supports SQL, InfluxQL, and Flux
○ Query and data reorganization built with DataFusion
● Use of DataFusion:
○ Table Provider: Custom data sources
○ SQL API
○ PlanBuilder API: Plans for custom query language
○ User-defined Logical and Execution Plans
○ UDFs: to implement the precise semantics of influxRPC
○ Optimized native plan execution
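The "Table Provider: Custom data sources" item rests on a simple idea: the engine talks to data through a trait, so any type implementing it can act as a table. The following is a minimal std-only sketch of that idea. `TableSource`, `CounterSource`, and `sum_column` are hypothetical names, and the trait is much smaller than any real table provider interface.

```rust
/// Hypothetical, much-simplified analogue of a table provider.
pub trait TableSource {
    /// Column names, in order.
    fn schema(&self) -> Vec<String>;
    /// Return the requested columns (by index) as column vectors.
    fn scan(&self, projection: &[usize]) -> Vec<Vec<i64>>;
}

/// A custom source that generates its rows instead of reading a file.
pub struct CounterSource {
    pub rows: i64,
}

impl TableSource for CounterSource {
    fn schema(&self) -> Vec<String> {
        vec!["n".to_string(), "n_squared".to_string()]
    }

    fn scan(&self, projection: &[usize]) -> Vec<Vec<i64>> {
        projection
            .iter()
            .map(|&idx| {
                // Only materialize the columns the caller asked for.
                let column: Vec<i64> = match idx {
                    0 => (0..self.rows).collect(),
                    1 => (0..self.rows).map(|n| n * n).collect(),
                    _ => panic!("column index {} out of range", idx),
                };
                column
            })
            .collect()
    }
}

/// The "engine" side only sees the trait, so any source can be plugged in.
pub fn sum_column(source: &dyn TableSource, column: &str) -> Option<i64> {
    let idx = source
        .schema()
        .iter()
        .position(|name| name.as_str() == column)?;
    let columns = source.scan(&[idx]);
    Some(columns[0].iter().sum())
}
```

Passing a projection into `scan` lets the source skip columns the query never reads, which matters most for columnar stores.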

FLOCK
https://github.com/flock-lab/flock
● Overview:
○ Low-Cost Streaming Query Engine on FaaS Platforms
○ Project from UMD Database Group, runs streaming queries
on AWS Lambda (x86 and arm64/graviton2).
● Use of DataFusion
○ SQL API
○ DataFrame API: To build plans

VegaFusion
https://vegafusion.io/
● Overview:
○ Accelerates execution of (interactive) data
visualizations
○ Compiles Vega data transforms into
DataFusion query plans.
● Use of DataFusion:
○ DataFrame API: To build plans
○ UDFs: to implement some Vega expressions
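"DataFrame API: To build plans" means the caller describes a query with chained method calls, and nothing runs until the resulting plan is handed to an engine. Here is a minimal sketch of that builder pattern in plain Rust. `Frame` and `Step` are hypothetical names, predicates and aggregates are plain strings for illustration, and DataFusion's real DataFrame API differs.

```rust
/// One step in a linear query plan (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    Scan(String),
    Filter(String),
    Aggregate { group_by: Vec<String>, aggregates: Vec<String> },
}

/// A DataFrame-like builder: each method appends a step and returns the builder.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    steps: Vec<Step>,
}

impl Frame {
    /// Start a plan by naming the table to read.
    pub fn read_table(name: &str) -> Self {
        Frame { steps: vec![Step::Scan(name.to_string())] }
    }

    /// Append a filter step; no rows are touched yet.
    pub fn filter(mut self, predicate: &str) -> Self {
        self.steps.push(Step::Filter(predicate.to_string()));
        self
    }

    /// Append a grouping and aggregation step.
    pub fn aggregate(mut self, group_by: &[&str], aggregates: &[&str]) -> Self {
        self.steps.push(Step::Aggregate {
            group_by: group_by.iter().map(|s| s.to_string()).collect(),
            aggregates: aggregates.iter().map(|s| s.to_string()).collect(),
        });
        self
    }

    /// Hand the finished plan to whatever will optimize and execute it.
    pub fn into_plan(self) -> Vec<Step> {
        self.steps
    }
}
```

A tool that compiles another language into query plans, as VegaFusion does for Vega transforms, can target a builder like this instead of generating SQL text.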

We ❤ Our Contributors
● Active and Welcoming Community
● Contributions at all levels are encouraged and
welcomed.
● We have Database Internals experts, novices looking
for experience writing Rust, and everything in
between.

Learn More + Join Us
Project site:
● https://arrow.apache.org/datafusion
● https://github.com/apache/arrow-datafusion
Architecture Slides
● DataFusion: An Embeddable Query Engine Written in Rust (google
slides) (slideshare)

Thank You
Andrew Lamb: andrew@nerdnetworks.org
