Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark
Tathagata “TD” Das
@tathadas
DataEngConf 2018
18th April, San Francisco
About Me
PMC Member of Apache Spark
[Diagram: Apache Spark, a unified processing engine that runs in many environments (e.g. EC2, YARN) and reads many data sources]
Data Pipelines – 10,000 ft view
[Diagram: unstructured data dumps and data streams flow into a Data Lake, which serves structured data]
Data Pipeline @ Fortune 100 Company
[Diagram: Trillions of records from Security Infra (IDS/IPS, DLP, antivirus, load balancers, proxy servers), Cloud Infra & Apps (AWS, Azure, Google Cloud), Servers Infra (Linux, Unix, Windows), and Network Infra (routers, switches, WAPs, databases, LDAP) are dumped into DATALAKE1/DATALAKE2, then pushed through complex ETL into separate warehouses (DW1, DW2, DW3) feeding Incident Response, Alerting, and Reports]
Pain points:
• Messy data not ready for analytics
• Separate warehouses for each type of analytics
• Hours of delay in accessing data
• Very expensive to scale
• Proprietary formats
• No advanced analytics (ML)
New Pipeline @ Fortune 100 Company
[Diagram: data sources → STRUCTURED STREAMING → DELTA → STRUCTURED STREAMING → Incident Response, Reports]
Single API!
DataFrame/Dataset
SQL: most familiar to BI analysts; supports SQL-2003, HiveQL
DataFrame: great for data scientists familiar with Pandas and R dataframes
Dataset: great for data engineers who want compile-time type safety
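A minimal sketch (not from the slides) of the same filter written in all three APIs; the "devices" table and the DeviceData case class are hypothetical:

  // Hypothetical input: a "devices" table with a string `device` and an int `signal` column
  case class DeviceData(device: String, signal: Int)
  import spark.implicits._

  val df = spark.table("devices")

  spark.sql("SELECT * FROM devices WHERE signal > 15")   // SQL
  df.where($"signal" > 15)                               // DataFrame
  df.as[DeviceData].filter(_.signal > 15)                // Dataset (typed, checked at compile time)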
Kafka DataFrame
Columns: key | value | topic | partition | offset | timestamp
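A minimal sketch (broker address and topic name are placeholders) showing that the Kafka source is just an unbounded DataFrame with these columns:

  val kafkaDF = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")   // placeholder broker
    .option("subscribe", "topic")                      // placeholder topic
    .load()

  kafkaDF.printSchema()
  // key and value arrive as binary; cast value to string before parsing, e.g.
  // kafkaDF.selectExpr("cast (value as string) as json")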
[Diagram: as new data arrives at the source, each trigger incrementally processes it]
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
[Diagram: the query is compiled into an optimized plan (e.g. Filter signal > 15, with codegen, off-heap memory, etc.) whose results are written to Parquet via the Parquet sink]
Checkpointing
[Diagram: each trigger processes new data and records its progress in a write-ahead log]
Saves processed offset info to stable storage
Saved as JSON for forward-compatibility
Allows recovery from any failure, with end-to-end exactly-once guarantees
Can resume after limited changes to your streaming transformations (e.g. adding new filters to drop corrupted data, etc.)
Anatomy of a Streaming Query
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
ETL: raw data from Kafka available as structured data in seconds, ready for querying
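A self-contained version of the same query, for reference; the broker address, schema fields, and checkpoint path are assumptions, not from the slides:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.from_json
  import org.apache.spark.sql.streaming.Trigger
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
  import spark.implicits._

  // Assumed JSON payload: {"device": ..., "signal": ..., "timestamp": ...}
  val schema = new StructType()
    .add("device", StringType)
    .add("signal", IntegerType)
    .add("timestamp", TimestampType)

  val query = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")        // assumed broker
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")
    .select(from_json($"json", schema).as("data"))
    .writeStream
    .format("parquet")
    .option("path", "/parquetTable/")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .option("checkpointLocation", "/tmp/checkpoints/etl")   // assumed path
    .start()

  query.awaitTermination()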
Performance: Benchmark
Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine.
[Chart: 40-core throughput in millions of records/s: Kafka Streams ~0.7M, Apache Flink ~22M, Structured Streaming ~65M, i.e. roughly 3x faster than Flink, and cheaper]
More details in our blog post
Business Logic independent of Execution Mode
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
Business logic remains unchanged
Business Logic independent of Execution Mode
spark.read.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .write
  .format("parquet")
  .option("path", "/parquetTable/")
  .save()
Business logic remains unchanged
Peripheral code decides whether it’s a batch or a streaming query (see the sketch below)
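One way to make this concrete (a sketch, not from the slides): put the shared transformation in a function and apply it to either input. Assumes the SparkSession, implicits, and schema from the earlier sketch; the broker address is a placeholder.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.from_json

  // Shared business logic: parse the Kafka value column into structured data
  def parse(raw: DataFrame): DataFrame =
    raw.selectExpr("cast (value as string) as json")
       .select(from_json($"json", schema).as("data"))

  val streamingRaw = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")   // assumed broker
    .option("subscribe", "topic").load()

  val batchRaw = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "topic").load()

  val streamingParsed = parse(streamingRaw)   // same logic, executed incrementally, forever
  val batchParsed     = parse(batchRaw)       // same logic, executed once over current data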
Business Logic independent of Execution Mode
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))   // parsedData

parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 minutes"))
  .avg("signal")
Average signal strength of each device every 10 mins
Supports UDAFs!
Stateful Processing for Aggregations
[Diagram: triggers t=1, t=2, t=3 each read new data from the source, read the previous state, and write updated state]
Aggregates have to be saved as distributed state between triggers
Each trigger reads the previous state and writes updated state
Late data updates the counts of old windows
[Diagram: window counts growing as late records arrive, e.g. 12:00 - 13:00 from 1 to 3 to 5, and 13:00 - 14:00 from 1 to 2]

Watermarking
[Diagram: event time vs. processing time, with the watermark trailing a 10 min gap behind the max event time seen by the engine]
Watermark delay = trailing gap behind the max event time seen by the engine
Data older than the watermark is not expected: it is too late, ignored in the counts, and its state is dropped
Data newer than the watermark may be late, but is still allowed to update aggregates
The system tracks the max observed event time; e.g. with max event time 12:14 and a 10 min delay, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state for windows older than 12:04 is deleted
More details in my blog post
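A minimal sketch (not from the slides) of the windowed average above with a watermark, so the engine can bound its state; parsedData is assumed to have device, signal, and event-time timestamp columns:

  import org.apache.spark.sql.functions.window

  val windowedAvg = parsedData
    .withWatermark("timestamp", "10 minutes")   // tolerate data up to 10 min late
    .groupBy(
      $"device",
      window($"timestamp", "10 minutes"))       // tumbling 10-minute windows
    .avg("signal")                              // state older than the watermark is dropped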
Other Interesting Operations

Streaming deduplication
  parsedData.dropDuplicates("eventId")

Joins: stream-batch joins and stream-stream joins
  stream1.join(stream2, "device")

Arbitrary stateful processing with [map|flatMap]GroupsWithState
  ds.groupByKey(_.id)
    .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

See my previous Spark Summit talk and blog posts (here and here)
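A minimal sketch (not from the slides) of mapGroupsWithState keeping a running event count per key; the Event case class and its fields are hypothetical:

  import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

  case class Event(id: String, ts: java.sql.Timestamp)   // hypothetical input type

  // ds: Dataset[Event]; requires import spark.implicits._ for the encoders
  val runningCounts = ds
    .groupByKey(_.id)
    .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
      (id: String, events: Iterator[Event], state: GroupState[Long]) =>
        val count = state.getOption.getOrElse(0L) + events.size  // old count + new events
        state.update(count)                                      // persist for the next trigger
        (id, count)                                              // emit the running count
    }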
Data Pipeline with DELTA
[Diagram: STRUCTURED STREAMING → DELTA → STRUCTURED STREAMING → Incident Response, Reports]
Evolution of a Cutting-Edge Data Pipeline
[Diagram, built up in steps: events feed both a streaming analytics path and a partitioned data lake used for reporting, accumulating (1) a λ-arch, (2) validation, and (3) reprocessing along the way]
Challenge #4: Query Performance?
[Diagram: on top of (1) λ-arch, (2) validation, and (3) reprocessing of partitioned data, the pipeline now needs (4) compaction of small files, with jobs scheduled to avoid compaction, between the events, streaming analytics, data lake, and reporting components]
Let’s try it instead with DELTA
The SCALE of a data lake
The RELIABILITY & PERFORMANCE of a data warehouse
The LOW-LATENCY of streaming
THE GOOD OF DATA WAREHOUSES
• Pristine Data
• Transactional Reliability
• Fast Queries

THE GOOD OF DATA LAKES
• Massive scale on cloud storage
• Open Formats (Parquet, ORC)
• Predictions (ML) & Streaming

Databricks Delta Combines the Best
MASSIVE SCALE: Decouple Compute & Storage
[Diagram: the Delta-based pipeline serving Reporting]
1 λ-arch: Not needed, Delta handles both short-term and long-term data
2 Validation: Easy, as short-term and long-term data live in one location
3 Reprocessing: Easy and seamless with Delta's transactional guarantees
4 Compaction
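A minimal sketch (paths are assumptions, not from the slides) of writing the parsed stream into a Delta table instead of raw Parquet files, which is what provides the transactional guarantees above:

  val query = parsedData.writeStream
    .format("delta")                                             // Databricks Delta sink
    .outputMode("append")
    .option("checkpointLocation", "/delta/events/_checkpoints")  // assumed path
    .start("/delta/events")                                      // assumed table path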
Accelerate Innovation with Databricks Unified Analytics Platform
Increases data science productivity by 5x
Open, extensible APIs
DATABRICKS COLLABORATIVE NOTEBOOKS: Explore Data, Train Models, Serve Models
[Diagram: STRUCTURED STREAMING → DELTA → STRUCTURED STREAMING → Reports]
Databricks Delta
https://databricks.com/product/databricks-delta
Thank you!
@tathadas