Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark

The document discusses the implementation of real-time data pipelines using Structured Streaming in Apache Spark, highlighting its advantages over traditional batch processing. It emphasizes the ability to treat streams as unbounded tables, enabling simpler queries and faster data processing, while also addressing challenges such as messy data and historical queries. The presentation outlines the architecture, components, and features of Structured Streaming, including fault tolerance, watermarking, and stateful processing.


Real-time Data Pipelines with Structured Streaming in Apache Spark

Tathagata “TD” Das
@tathadas

DataEngConf 2018
18 April 2018, San Francisco

About Me

Started the Spark Streaming project in AMPLab, UC Berkeley

Currently focused on building Structured Streaming

PMC Member of Apache Spark

Engineer on the StreamTeam @ Databricks
"we make all your streams come true"

Apache Spark: a unified processing engine

Applications: Streaming, SQL, ML, Graph
Environments: EC2, YARN, and others
Data Sources: a wide range of external data sources

Data Pipelines – 10,000 ft view

unstructured data streams → Dump → Data Lake (unstructured data dump)
→ ETL → Data Warehouse (structured data) → Data Analytics

Data Pipeline @ Fortune 100 Company

Data sources (trillions of records):
- Security Infra: IDS/IPS, DLP, antivirus, load balancers, proxy servers
- Cloud Infra & Apps: AWS, Azure, Google Cloud
- Servers Infra: Linux, Unix, Windows
- Network Infra: routers, switches, WAPs, databases, LDAP

Everything is dumped into data lakes (DATALAKE1, DATALAKE2) as messy data not ready
for analytics, then pushed through complex ETL into separate warehouses for each
type of analytics: DW1 for Incident Response, DW2 for Alerting, DW3 for Reports.

Problems:
- Hours of delay in accessing data
- Very expensive to scale
- Proprietary formats
- No advanced analytics (ML)

New Pipeline @ Fortune 100 Company

Dump and Complex ETL are both done with STRUCTURED STREAMING into DELTA; SQL, ML,
and STREAMING analytics on top drive Incident Response, Alerting, and Reports.

Benefits:
- Data usable in minutes/seconds
- Easy to scale
- Open formats
- Enables advanced analytics

STRUCTURED STREAMING

You should not have to reason about streaming.

You should write simple queries
&
Spark should continuously update the answer.

Treat Streams as Unbounded Tables

data stream = unbounded input table

new data in the data stream
=
new rows appended to an unbounded table
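
The sketch below is not from the deck; it is a minimal illustration of the
unbounded-table model, assuming Spark 2.2+ and the built-in "rate" test source. The
same DataFrame operations you would run on a static table are applied to the growing
stream, and Spark keeps the answer up to date.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("unbounded-table-sketch").getOrCreate()
import spark.implicits._

// The "rate" source continuously appends (timestamp, value) rows to the input table.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Same query you would write on a bounded table.
val counts = stream.groupBy(($"value" % 10).as("bucket")).count()

// Spark continuously updates the result table and prints it to the console.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
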
Anatomy of a Streaming Query

Example (ETL):
- Read JSON data from Kafka
- Parse the nested JSON
- Store in a structured Parquet table
- Get end-to-end failure guarantees

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Source
- Specify where to read data from; returns a DataFrame
- Built-in support for Files / Kafka / Kinesis*
- Can include multiple sources of different types using join() / union()

*Available only on Databricks Runtime
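
As a hedged illustration of the last point (the topic name, landing path, and the
eventSchema value are assumptions, not from the deck), two sources of different
types can be unioned into one streaming DataFrame as long as their schemas line up:

val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("cast(value as string) as json")

val fileEvents = spark.readStream
  .format("json")
  .schema(eventSchema)            // file sources need an explicit schema
  .load("/data/landing/events/")  // hypothetical landing directory
  .toJSON                         // back to a single string column
  .toDF("json")

val allEvents = kafkaEvents.union(fileEvents)  // one logical stream from two sources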


DataFrame ⇔ Table

static data = bounded table
streaming data = unbounded table

Single API!

DataFrame/Dataset/SQL

SQL:
spark.sql("""
  SELECT type, sum(signal)
  FROM devices
  GROUP BY type
""")

DataFrame:
val df: DataFrame =
  spark.table("device-data")
    .groupBy("type")
    .sum("signal")

Dataset:
val ds: Dataset[(String, Double)] =
  spark.table("device-data")
    .as[DeviceData]
    .groupByKey(_.type)
    .mapValues(_.signal)
    .reduceGroups(_ + _)

SQL is most familiar to BI analysts and supports SQL-2003 and HiveQL.
DataFrames are great for data scientists familiar with Pandas and R dataframes.
Datasets are great for data engineers who want compile-time type safety.

Same semantics, same performance.
Choose your hammer for whatever nail you have!

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Kafka DataFrame schema:

key       value     topic    partition  offset  timestamp
[binary]  [binary]  "topic"  0          345     1486087873
[binary]  [binary]  "topic"  3          2890    1486086721

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))

Transformations
- Cast the bytes from the Kafka records to a string, parse the string as JSON,
  and generate nested columns
- 100s of built-in, optimized SQL functions like from_json
- User-defined functions, lambdas, and function literals with map, flatMap, ...
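
A minimal sketch of mixing a user-defined function into the same pipeline. The UDF,
the column names, and the broker address are assumptions for illustration; schema is
the same placeholder used in the deck's snippets.

import org.apache.spark.sql.functions.{from_json, udf}
import spark.implicits._

// Hypothetical UDF normalizing a device-type string taken from the parsed JSON.
val normalizeType = udf((t: String) => if (t == null) "unknown" else t.trim.toLowerCase)

val parsed = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))

val enriched = parsed.withColumn("device_type", normalizeType($"data.type"))
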
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")

Sink
- Write transformed output to external storage systems
- Built-in support for Files / Kafka
- Use foreach to execute arbitrary code with the output data
- Some sinks are transactional and exactly-once (e.g. files)
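
A hedged sketch of the foreach sink mentioned above: a ForeachWriter runs arbitrary
per-record code (here it only prints; a real writer would open a connection in
open() and release it in close()). The parsed DataFrame and the checkpoint path are
assumptions.

import org.apache.spark.sql.{ForeachWriter, Row}

val query = parsed.writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true  // e.g. open a connection
    def process(record: Row): Unit = println(record)            // send the record somewhere
    def close(errorOrNull: Throwable): Unit = ()                // release resources
  })
  .option("checkpointLocation", "/tmp/checkpoints/foreach-demo")
  .start()
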
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

Processing Details
- Trigger: when to process data
  - fixed-interval micro-batches
  - as-fast-as-possible micro-batches
  - continuously (new in Spark 2.3)
- Checkpoint location: for tracking the progress of the query
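
A hedged sketch of how the three trigger modes are expressed with the Spark 2.3
streaming API; the parsed query body and the checkpoint paths are assumptions.

import org.apache.spark.sql.streaming.Trigger

// 1. As-fast-as-possible micro-batches: simply omit .trigger(...) on the writer.

// 2. Fixed-interval micro-batches:
parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/tmp/checkpoints/etl-1min")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// 3. Continuous processing (experimental in Spark 2.3; supports only map-like
//    queries and a subset of sources and sinks):
parsed.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/etl-continuous")
  .trigger(Trigger.Continuous("1 second"))
  .start()
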
Spark automatically streamifies!

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

The query is planned once: read from the Kafka Source, project (device, signal),
filter (signal > 15), write to the Parquet Sink, compiled into optimized operators
(codegen, off-heap memory, etc.). At every trigger (t=1, t=2, t=3, ...) the new data
is processed incrementally.

DataFrames, Datasets, SQL → Logical Plan → Optimized Plan → Series of Incremental Execution Plans

Spark SQL converts a batch-like query into a series of incremental execution plans
operating on new batches of data.

Fault-tolerance with Checkpointing

Checkpointing: at every trigger, the processed offset info is saved to a write-ahead
log on stable storage.
- Saved as JSON for forward-compatibility
- Allows recovery from any failure
- Can resume after limited changes to your streaming transformations
  (e.g. adding new filters to drop corrupted data)

End-to-end exactly-once guarantees.
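
A minimal sketch of what recovery looks like in practice (the parsed query and the
checkpoint path are assumptions): as long as a restarted query uses the same
checkpointLocation, it resumes from the offsets recorded in the write-ahead log
instead of reprocessing or losing data.

// Run this same code again after a crash or restart; the checkpoint directory
// is what ties the new run to the old query's progress.
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/kafka-to-parquet")  // stable storage (e.g. HDFS/S3)
  .start()
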
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

ETL: raw data from Kafka is available as structured data in seconds, ready for querying.

Performance: Benchmark

Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine.

40-core throughput (millions of records/s):
- Kafka Streams: 700K
- Apache Flink: 22M
- Structured Streaming: 65M (about 3x faster than Flink, and cheaper)

More details in our blog post.

Business Logic independent of Execution Mode

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

The business logic remains unchanged.

Business Logic independent of Execution Mode

spark.read.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .write
  .format("parquet")
  .option("path", "/parquetTable/")
  .save()

The business logic remains unchanged; only the peripheral code decides whether it is
a batch or a streaming query.

Business Logic independent of Execution Mode

.selectExpr("cast(value as string) as json")
.select(from_json($"json", schema).as("data"))

Batch: high latency (hours/minutes), execute on-demand
Micro-batch Streaming: low latency (seconds), efficient resource allocation, high throughput
Continuous Streaming**: ultra-low latency (milliseconds), static resource allocation, high throughput

**experimental release in Spark 2.3, read our blog

Event-time Aggregations

Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:
parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 mins"))
  .avg("signal")

UDAFs are supported too!

Stateful Processing for Aggregations

Aggregates have to be saved as distributed state between triggers. Each trigger
reads the previous state and writes updated state (src → process → state → sink at
t=1, t=2, t=3, ...).

The state is stored in executor memory and backed by a write-ahead log in HDFS:
state updates are written to the write-ahead log for checkpointing.

Fault-tolerant, exactly-once guarantees!

Automatically handles Late Data

Keeping state allows late data to update the counts of old windows: in the example
timeline, the count for the 12:00 - 13:00 window keeps changing as late records for
that window arrive hours later (red = state updated with late data).

But the size of the state increases indefinitely if old windows are not dropped.

Watermarking

Watermark: a moving threshold of how late data is expected to be and when to drop
old state.

It trails behind the max event time seen by the engine by a configurable gap, the
watermark delay. For example, with a max event time of 12:30 PM and a trailing gap
of 10 minutes, the watermark is 12:20 PM; data older than the watermark is not
expected.

Watermarking

Data newer than the watermark may be late, but it is still allowed to aggregate.

Data older than the watermark is "too late" and is dropped.

Windows older than the watermark are automatically deleted to limit the amount of
intermediate state.

Watermarking

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

Watermarks are useful only in stateful operations; they are ignored in non-stateful
streaming queries and in batch queries.

Watermarking

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

Example timeline: the system tracks the max observed event time; at each trigger the
watermark for the next trigger is updated to (max event time - 10 min). When the max
event time is 12:14, the watermark becomes 12:04 and all window state older than
12:04 is deleted. Late data with event times newer than the watermark (e.g. a record
with event time 12:15 arriving late) is still counted, while data older than the
watermark is too late, ignored in the counts, and its state is dropped.

More details in my blog post.
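
An end-to-end sketch of the same watermarked query, written out so it can run. This
is a minimal sketch assuming a SparkSession named spark and that parsedData has a
TimestampType column called "timestamp"; the checkpoint path is hypothetical.

import org.apache.spark.sql.functions.window
import spark.implicits._

val windowedCounts = parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

// "update" mode emits only the window counts that changed in each trigger;
// windows older than the watermark are finalized and their state dropped.
windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/tmp/checkpoints/windowed-counts")
  .start()
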
Other Interesting Operations

Streaming deduplication:
parsedData.dropDuplicates("eventId")

Joins (stream-batch and stream-stream):
stream1.join(stream2, "device")

Arbitrary stateful processing with [map|flatMap]GroupsWithState:
ds.groupByKey(_.id)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

See my previous Spark Summit talk and blog posts (here and here).
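
As a hedged illustration of mapGroupsWithState (the event and state types, the
counting logic, and the 30-minute timeout below are assumptions, not from the talk),
this keeps a running count per key and cleans up state for idle keys:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(id: String, value: Long)
case class RunningCount(count: Long)

// events: Dataset[Event] parsed from the stream (assumed).
val runningCounts = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (id: String, newEvents: Iterator[Event], state: GroupState[RunningCount]) =>
      if (state.hasTimedOut) {                   // no data for this key in 30 minutes
        val last = state.get.count
        state.remove()                           // drop the per-key state
        (id, last)
      } else {
        val previous = state.getOption.map(_.count).getOrElse(0L)
        val updated  = RunningCount(previous + newEvents.size)
        state.update(updated)                    // persisted for the next trigger
        state.setTimeoutDuration("30 minutes")   // reset the inactivity timeout
        (id, updated.count)                      // record emitted for this key
      }
  }

The resulting Dataset is written out like any other streaming query, typically with
the update output mode.
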
Data Pipeline with Structured Streaming and Delta

Dump and Complex ETL run as STRUCTURED STREAMING jobs into DELTA; SQL, ML, and
STREAMING analytics on top drive Incident Response, Alerting, and Reports.

ETL @

Evolution of a Cutting-Edge Data Pipeline

Events feed two consumers: Streaming Analytics, and a Data Lake used for Reporting.
The question is how to build one pipeline that serves both well.

Challenge #1: Historical Queries?
1 λ-arch: run a parallel batch path (a lambda architecture) over the data lake
alongside the streaming path.

Challenge #2: Messy Data?
2 Validation: add validation steps on both the streaming and batch paths.

Challenge #3: Mistakes and Failures?
3 Reprocessing: keep the data lake partitioned so that bad or failed loads can be
reprocessed.

Challenge #4: Query Performance?
4 Compaction: compact the small files in the data lake, with jobs scheduled to avoid
running during compaction.

Each fix bolts another component onto the pipeline: λ-arch (1), Validation (2),
Reprocessing (3), and Compaction (4).

Let’s try it instead with DELTA

DELTA: the SCALE of a data lake, the RELIABILITY & PERFORMANCE of a data warehouse,
and the LOW-LATENCY of streaming.

THE GOOD OF DATA WAREHOUSES:
• Pristine Data
• Transactional Reliability
• Fast Queries

THE GOOD OF DATA LAKES:
• Massive scale on cloud storage
• Open Formats (Parquet, ORC)
• Predictions (ML) & Streaming

Databricks Delta Combines the Best
MASSIVE SCALE Decouple Compute & Storage

RELIABILITY ACID Transactions & Data Validation

PERFORMANCE Data Indexing & Caching (10-100x)

OPEN Data stored as Parquet, ORC, etc.

LOW-LATENCY Integrated with Structured Streaming


The Canonical Data Pipeline

Events stream into DELTA tables via Structured Streaming; downstream DELTA tables
feed Streaming Analytics and Reporting.

1 λ-arch: not needed, since Delta handles both short-term and long-term data
2 Validation: easy, as short-term and long-term data live in one location
3 Reprocessing: easy and seamless with Delta's transactional guarantees
4 Compaction
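
A hedged sketch of wiring Structured Streaming to Delta tables (the broker address
and paths are hypothetical; "delta" is the format provided by Databricks Delta): raw
events stream into a Delta table, and a downstream job can read that same table as a
streaming source.

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("cast(value as string) as json")

// Continuously append the raw events to a Delta table.
raw.writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/events/_checkpoints/ingest")
  .start("/delta/events")

// Downstream queries can treat the same Delta table as another streaming source.
val refined = spark.readStream
  .format("delta")
  .load("/delta/events")
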
Accelerate Innovation with the Databricks Unified Analytics Platform

DATABRICKS COLLABORATIVE NOTEBOOKS (explore data, train models, serve models):
increase data science productivity by 5x, with open, extensible APIs.

DATABRICKS RUNTIME: improves performance (including I/O) by 10-20x over Apache
Spark. DATABRICKS DELTA: higher performance and reliability for your data lake.

DATABRICKS MANAGED SERVICE: removes DevOps and infrastructure complexity, with
serverless operation, SLAs, and Databricks Enterprise Security.

Works across IoT / streaming data, cloud storage data, and Hadoop storage.


Data Pipelines with STRUCTURED STREAMING and DELTA

STRUCTURED STREAMING: fast, scalable, fault-tolerant stream processing with
high-level, user-friendly APIs.

DELTA: a data storage solution with the reliability of data warehouses and the
scalability of data lakes.

Dump and Complex ETL run as STRUCTURED STREAMING jobs into DELTA; SQL, ML, and
STREAMING analytics on top drive Incident Response, Alerting, and Reports.

More Info

Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Databricks blog posts for more focused discussions on streaming
https://databricks.com/blog/category/engineering/streaming

Databricks Delta
https://databricks.com/product/databricks-delta

Thank you!
@tathadas
