Apache Arrow and DataFusion:
Changing the Game for Implementing Database Systems
Andrew Lamb, InfluxData
June 23, 2022
The Data Thread
Today: IOx Team at InfluxData; Apache Arrow PMC Member
Past life 1: Query Optimizer @ Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups
Proliferation of Databases
What is going on?
COTS → Totally Custom
IT: “Buy and Operate”
● Buy software from vendors
● Operate on your own hardware, with sysadmins
FANG: “Build and Operate”
● Write software for, and operate all components
● Optimized for exact needs
✓ Current Trend: “Assemble and Operate”
● Assemble from open source technologies
● Operate on resources in a public cloud
Part of a long term trend in DB Specialization
Data Model
● Relational
● Key-Value
● Timeseries
● Graph
● Array / Scientific
● Document
● Stream
Deployment
● Embedded / Edge
● Cloud
● Single-Node
● Hybrid
Ecosystem
● Hadoop
● Java
● JSON / JavaScript
● AWS
● GCP
● Azure
● Apple Cloud
Use Case
● Transactions
● Analytics
● Streaming
● Batch / ETL
● ...
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI: https://doi.org/10.1109/ICDE.2005.1
What is DataFusion?
Implementation timeline for a new Database system
● Client API
● In memory storage
● In-Memory filter + aggregation
● Durability / persistence
● Metadata Catalog + Management
● Query Language Parser
● Optimized / Compressed storage
● Execution on Compressed Data
● Joins!
● Additional Client Languages
● Outer Joins
● Subquery support
● More advanced analytics
● Cost based optimizer
● Out of core algorithms
● Storage Rearrangement
● Heuristic Query Planner
● Arithmetic expressions
● Date / time Expressions
● Concurrency Control
● Data Model / Type System
● Distributed query execution
● Resource Management
● Online recovery
● Window functions
“Let's Build a Database” 🤔
“Ok now this is pretty good” 😐
“Look mom! I have a database!” 😃
“DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.”
- DataFusion Website
DataFusion: A Query Engine
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
RecordBatches
OR
DataFrame
ctx.read_table("http")?
    .filter(...)?
    .aggregate(..)?;
RecordBatches
Catalog information: tables, schemas, etc.
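To make the query on this slide concrete, here is a minimal sketch of what the SQL (or the equivalent DataFrame chain) computes: filter rows on `path`, then count rows per `status`. It uses only the Rust standard library and does not call DataFusion; `RequestBatch` and `count_by_status` are hypothetical names invented for illustration, standing in for Arrow RecordBatches and the engine's filter and aggregate operators.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one columnar batch of `http_api_requests_total`.
/// Each field is a column; row `i` is made of the `i`-th entry of every column.
/// (DataFusion itself works on Arrow RecordBatches, not on this type.)
pub struct RequestBatch {
    pub status: Vec<u16>,
    pub path: Vec<String>,
}

/// Evaluates the equivalent of
/// `SELECT status, COUNT(1) ... WHERE path = <wanted_path> GROUP BY status`
/// over a sequence of batches.
pub fn count_by_status(batches: &[RequestBatch], wanted_path: &str) -> BTreeMap<u16, u64> {
    let mut counts: BTreeMap<u16, u64> = BTreeMap::new();
    for batch in batches {
        // Walk the two columns in lockstep, one row at a time.
        for (status, path) in batch.status.iter().zip(batch.path.iter()) {
            // WHERE path = wanted_path
            if path.as_str() == wanted_path {
                // GROUP BY status, COUNT(1)
                *counts.entry(*status).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

Calling `count_by_status(&batches, "/api/v2/write")` returns one entry per distinct status among the matching rows. A real engine would evaluate the filter and the aggregation as vectorized operators over whole columns rather than row by row.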
But for Databases
🤔
DataFusion: LLVM-like Infrastructure for Databases
● Query FrontEnds: SQL, DataFrame
● Plan Representations (DataFlow Graphs): LogicalPlans, ExecutionPlan
● Optimizations / Transformations (for each plan representation)
● Expression Eval
● Optimized Execution Operators (Arrow Based): HashAggregate, Sort, Join, …
● Data Sources: Parquet, CSV, …
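The "Plan Representations" and "Optimizations / Transformations" boxes can be illustrated with a small sketch: a plan is a tree of operators, and an optimization is a function from one plan to an equivalent plan, much like a compiler pass over an intermediate representation. The types below are hypothetical and greatly simplified; they are not DataFusion's actual `LogicalPlan` or expression types, and the rule shown (merging adjacent filters) is just one easy example of a rewrite.

```rust
/// Hypothetical, simplified expression type (not DataFusion's).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(String),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Hypothetical, simplified logical plan: a tree of relational operators.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPlan {
    Scan { table: String },
    Filter { predicate: Expr, input: Box<LogicalPlan> },
    Aggregate { group_by: Vec<String>, input: Box<LogicalPlan> },
}

/// One plan-to-plan transformation: collapse `Filter(p1, Filter(p2, x))`
/// into a single `Filter(p1 AND p2, x)`, recursing through the whole tree.
pub fn merge_filters(plan: LogicalPlan) -> LogicalPlan {
    match plan {
        LogicalPlan::Scan { table } => LogicalPlan::Scan { table },
        LogicalPlan::Aggregate { group_by, input } => LogicalPlan::Aggregate {
            group_by,
            input: Box::new(merge_filters(*input)),
        },
        LogicalPlan::Filter { predicate, input } => match merge_filters(*input) {
            // The (already rewritten) child is itself a filter: fuse the two.
            LogicalPlan::Filter { predicate: inner, input: inner_input } => {
                LogicalPlan::Filter {
                    predicate: Expr::And(Box::new(predicate), Box::new(inner)),
                    input: inner_input,
                }
            }
            // Any other child: keep the filter where it is.
            other => LogicalPlan::Filter { predicate, input: Box::new(other) },
        },
    }
}
```

Because each pass consumes a plan and returns a plan, passes compose freely, which is what lets an engine ship a default set of rewrites and still accept user-supplied ones.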
DataFusion: Totally Customizable
● Query FrontEnds: SQL, DataFrame (Extend ✅)
● Plan Representations (DataFlow Graphs): LogicalPlans, ExecutionPlan (Extend ✅)
● Optimizations / Transformations (Extend ✅)
● Expression Eval (Extend ✅)
● Optimized Execution Operators (Arrow Based): HashAggregate, Sort, Join, … (Extend ✅)
● Data Sources: Parquet, CSV (Extend ✅)
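As a rough illustration of what an "Extend ✅" mark means for data sources, the sketch below shows the general pattern: the engine talks to tables only through a trait, so a user can register a custom source without changing the engine. Everything here is hypothetical and std-only; `TableSource`, `Catalog`, and `SquaresTable` are invented for this example and are not DataFusion's API.

```rust
use std::collections::HashMap;

/// Hypothetical extension point: anything that can produce rows for a scan.
pub trait TableSource {
    /// Column names, in order.
    fn column_names(&self) -> Vec<String>;
    /// All rows; each row holds one value per column.
    fn scan(&self) -> Vec<Vec<i64>>;
}

/// Hypothetical catalog mapping table names to registered sources.
pub struct Catalog {
    tables: HashMap<String, Box<dyn TableSource>>,
}

impl Catalog {
    pub fn new() -> Self {
        Catalog { tables: HashMap::new() }
    }

    /// Register a source under a table name, replacing any previous one.
    pub fn register(&mut self, name: &str, source: Box<dyn TableSource>) {
        self.tables.insert(name.to_string(), source);
    }

    /// Sum one column of a registered table.
    /// Returns `None` if the table or the column is unknown.
    pub fn sum_column(&self, table: &str, column: &str) -> Option<i64> {
        let source = self.tables.get(table)?;
        let index = source
            .column_names()
            .iter()
            .position(|c| c.as_str() == column)?;
        Some(source.scan().iter().map(|row| row[index]).sum::<i64>())
    }
}

/// A user-supplied source that generates rows `(n, n * n)` for `n` in `0..len`.
pub struct SquaresTable {
    pub len: i64,
}

impl TableSource for SquaresTable {
    fn column_names(&self) -> Vec<String> {
        vec!["n".to_string(), "square".to_string()]
    }

    fn scan(&self) -> Vec<Vec<i64>> {
        (0..self.len).map(|n| vec![n, n * n]).collect()
    }
}
```

After `catalog.register("squares", Box::new(SquaresTable { len: 4 }))`, the catalog can answer `sum_column("squares", "square")` without knowing how the rows are produced. The same trait-object pattern applies to the other extension points on the slide, such as custom operators or functions.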
Example Uses
Cube.js / Cube Store
https://cube.dev/
● Overview:
○ Headless Business Intelligence
○ Cube.js pre-aggregation storage layer
● Use of DataFusion (fork):
○ SQL API (with custom extensions)
○ Custom Logical and Physical Operators
○ UDFs: custom functions
○ Optimized native plan execution
InfluxDB IOx
https://github.com/influxdata/influxdb_iox
● Overview:
○ In-memory columnar store using object storage, future core of InfluxDB; supports SQL, InfluxQL, and Flux
○ Query and data reorganization built with DataFusion
● Use of DataFusion:
○ Table Provider: Custom data sources
○ SQL API
○ PlanBuilder API: Plans for custom query language
○ User-defined (UD) Logical and Execution Plans
○ UDFs: to implement the precise semantics of influxRPC
○ Optimized native plan execution
FLOCK
https://github.com/flock-lab/flock
● Overview:
○ Low-Cost Streaming Query Engine on FaaS Platforms
○ Project from UMD Database Group, runs streaming queries on AWS Lambda (x86 and arm64/Graviton2).
● Use of DataFusion:
○ SQL API
○ DataFrame API: To build plans
○ Optimized native plan execution
VegaFusion
https://vegafusion.io/
● Overview:
○ Accelerates execution of (interactive) data visualizations
○ Compiles Vega data transforms into DataFusion query plans.
● Use of DataFusion:
○ DataFrame API: To build plans
○ UDFs: to implement some Vega expressions
○ Optimized native plan execution
We ❤ Our Contributors
● Active and Welcoming Community
● Contributions at all levels are encouraged and welcomed.
● We have Database Internals experts, novices looking for experience writing Rust, and everything in between.
Learn More + Join Us
Project site:
● https://arrow.apache.org/datafusion
● https://github.com/apache/arrow-datafusion
Architecture Slides
● DataFusion: An Embeddable Query Engine Written in Rust (google slides) (slideshare)
Thank You
Andrew Lamb: andrew@nerdnetworks.org
