Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark

The document discusses the implementation of real-time data pipelines using Structured Streaming in Apache Spark, highlighting its advantages over traditional batch processing. It emphasizes the ability to treat streams as unbounded tables, enabling simpler queries and faster data processing, while also addressing challenges such as messy data and historical queries. The presentation outlines the architecture, components, and features of Structured Streaming, including fault tolerance, watermarking, and stateful processing.


Real-time Data Pipelines with Structured Streaming in Apache Spark

Tathagata “TD” Das
@tathadas

DataEngConf 2018
18 April 2018, San Francisco

About Me

Started the Spark Streaming project in AMPLab, UC Berkeley

Currently focused on building Structured Streaming

PMC Member of Apache Spark

Engineer on the StreamTeam @ Databricks
"we make all your streams come true"

Apache Spark: a unified processing engine

Applications: Streaming, SQL, ML, Graph
Environments: EC2, YARN, and others
Data Sources: a wide range of external data sources

Data Pipelines – 10,000 ft view

unstructured data streams → Dump → Data Lake (unstructured data dump)
→ ETL → Data Warehouse (structured data) → Data Analytics

Data Pipeline @ Fortune 100 Company

Data sources (trillions of records):
- Security Infra: IDS/IPS, DLP, antivirus, load balancers, proxy servers
- Cloud Infra & Apps: AWS, Azure, Google Cloud
- Servers Infra: Linux, Unix, Windows
- Network Infra: routers, switches, WAPs, databases, LDAP

Everything is dumped into data lakes (DATALAKE1, DATALAKE2) as messy data not ready
for analytics, then pushed through complex ETL into separate warehouses for each
type of analytics: DW1 for Incident Response, DW2 for Alerting, DW3 for Reports.

Problems:
- Hours of delay in accessing data
- Very expensive to scale
- Proprietary formats
- No advanced analytics (ML)

New Pipeline @ Fortune 100 Company

Dump and Complex ETL are both done with STRUCTURED STREAMING into DELTA; SQL, ML,
and STREAMING analytics on top drive Incident Response, Alerting, and Reports.

Benefits:
- Data usable in minutes/seconds
- Easy to scale
- Open formats
- Enables advanced analytics

STRUCTURED STREAMING

You should not have to reason about streaming.

You should write simple queries
&
Spark should continuously update the answer.

Treat Streams as Unbounded Tables

data stream = unbounded input table

new data in the data stream
=
new rows appended to an unbounded table
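
The sketch below is not from the deck; it is a minimal illustration of the
unbounded-table model, assuming Spark 2.2+ and the built-in "rate" test source. The
same DataFrame operations you would run on a static table are applied to the growing
stream, and Spark keeps the answer up to date.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("unbounded-table-sketch").getOrCreate()
import spark.implicits._

// The "rate" source continuously appends (timestamp, value) rows to the input table.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Same query you would write on a bounded table.
val counts = stream.groupBy(($"value" % 10).as("bucket")).count()

// Spark continuously updates the result table and prints it to the console.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
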
Anatomy of a Streaming Query

Example (ETL):
- Read JSON data from Kafka
- Parse the nested JSON
- Store in a structured Parquet table
- Get end-to-end failure guarantees

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Source
- Specify where to read data from; returns a DataFrame
- Built-in support for Files / Kafka / Kinesis*
- Can include multiple sources of different types using join() / union()

*Available only on Databricks Runtime
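
As a hedged illustration of the last point (the topic name, landing path, and the
eventSchema value are assumptions, not from the deck), two sources of different
types can be unioned into one streaming DataFrame as long as their schemas line up:

val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("cast(value as string) as json")

val fileEvents = spark.readStream
  .format("json")
  .schema(eventSchema)            // file sources need an explicit schema
  .load("/data/landing/events/")  // hypothetical landing directory
  .toJSON                         // back to a single string column
  .toDF("json")

val allEvents = kafkaEvents.union(fileEvents)  // one logical stream from two sources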


DataFrame ⇔ Table

static data = bounded table
streaming data = unbounded table

Single API!

DataFrame/Dataset/SQL

SQL:
spark.sql("""
  SELECT type, sum(signal)
  FROM devices
  GROUP BY type
""")

DataFrame:
val df: DataFrame =
  spark.table("device-data")
    .groupBy("type")
    .sum("signal")

Dataset:
val ds: Dataset[(String, Double)] =
  spark.table("device-data")
    .as[DeviceData]
    .groupByKey(_.type)
    .mapValues(_.signal)
    .reduceGroups(_ + _)

SQL is most familiar to BI analysts and supports SQL-2003 and HiveQL.
DataFrames are great for data scientists familiar with Pandas and R dataframes.
Datasets are great for data engineers who want compile-time type safety.

Same semantics, same performance.
Choose your hammer for whatever nail you have!

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Kafka DataFrame schema:

key       value     topic    partition  offset  timestamp
[binary]  [binary]  "topic"  0          345     1486087873
[binary]  [binary]  "topic"  3          2890    1486086721

Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))

Transformations
- Cast the bytes from the Kafka records to a string, parse the string as JSON,
  and generate nested columns
- 100s of built-in, optimized SQL functions like from_json
- User-defined functions, lambdas, and function literals with map, flatMap, ...
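
A minimal sketch of mixing a user-defined function into the same pipeline. The UDF,
the column names, and the broker address are assumptions for illustration; schema is
the same placeholder used in the deck's snippets.

import org.apache.spark.sql.functions.{from_json, udf}
import spark.implicits._

// Hypothetical UDF normalizing a device-type string taken from the parsed JSON.
val normalizeType = udf((t: String) => if (t == null) "unknown" else t.trim.toLowerCase)

val parsed = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))

val enriched = parsed.withColumn("device_type", normalizeType($"data.type"))
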
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")

Sink
- Write transformed output to external storage systems
- Built-in support for Files / Kafka
- Use foreach to execute arbitrary code with the output data
- Some sinks are transactional and exactly-once (e.g. files)
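
A hedged sketch of the foreach sink mentioned above: a ForeachWriter runs arbitrary
per-record code (here it only prints; a real writer would open a connection in
open() and release it in close()). The parsed DataFrame and the checkpoint path are
assumptions.

import org.apache.spark.sql.{ForeachWriter, Row}

val query = parsed.writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true  // e.g. open a connection
    def process(record: Row): Unit = println(record)            // send the record somewhere
    def close(errorOrNull: Throwable): Unit = ()                // release resources
  })
  .option("checkpointLocation", "/tmp/checkpoints/foreach-demo")
  .start()
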
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

Processing Details
- Trigger: when to process data
  - fixed-interval micro-batches
  - as-fast-as-possible micro-batches
  - continuously (new in Spark 2.3)
- Checkpoint location: for tracking the progress of the query
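
A hedged sketch of how the three trigger modes are expressed with the Spark 2.3
streaming API; the parsed query body and the checkpoint paths are assumptions.

import org.apache.spark.sql.streaming.Trigger

// 1. As-fast-as-possible micro-batches: simply omit .trigger(...) on the writer.

// 2. Fixed-interval micro-batches:
parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/tmp/checkpoints/etl-1min")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// 3. Continuous processing (experimental in Spark 2.3; supports only map-like
//    queries and a subset of sources and sinks):
parsed.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/etl-continuous")
  .trigger(Trigger.Continuous("1 second"))
  .start()
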
Spark automatically streamifies!

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

The query is planned once: read from the Kafka Source, project (device, signal),
filter (signal > 15), write to the Parquet Sink, compiled into optimized operators
(codegen, off-heap memory, etc.). At every trigger (t=1, t=2, t=3, ...) the new data
is processed incrementally.

DataFrames, Datasets, SQL → Logical Plan → Optimized Plan → Series of Incremental Execution Plans

Spark SQL converts a batch-like query into a series of incremental execution plans
operating on new batches of data.

Fault-tolerance with Checkpointing

Checkpointing: at every trigger, the processed offset info is saved to a write-ahead
log on stable storage.
- Saved as JSON for forward-compatibility
- Allows recovery from any failure
- Can resume after limited changes to your streaming transformations
  (e.g. adding new filters to drop corrupted data)

End-to-end exactly-once guarantees.
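
A minimal sketch of what recovery looks like in practice (the parsed query and the
checkpoint path are assumptions): as long as a restarted query uses the same
checkpointLocation, it resumes from the offsets recorded in the write-ahead log
instead of reprocessing or losing data.

// Run this same code again after a crash or restart; the checkpoint directory
// is what ties the new run to the old query's progress.
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/kafka-to-parquet")  // stable storage (e.g. HDFS/S3)
  .start()
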
Anatomy of a Streaming Query

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

ETL: raw data from Kafka is available as structured data in seconds, ready for querying.

Performance: Benchmark

Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine.

40-core throughput (millions of records/s):
- Kafka Streams: 700K
- Apache Flink: 22M
- Structured Streaming: 65M (about 3x faster than Flink, and cheaper)

More details in our blog post.

Business Logic independent of Execution Mode

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()

The business logic remains unchanged.

Business Logic independent of Execution Mode

spark.read.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .write
  .format("parquet")
  .option("path", "/parquetTable/")
  .save()

The business logic remains unchanged; only the peripheral code decides whether it is
a batch or a streaming query.

Business Logic independent of Execution Mode

.selectExpr("cast(value as string) as json")
.select(from_json($"json", schema).as("data"))

Batch: high latency (hours/minutes), execute on-demand
Micro-batch Streaming: low latency (seconds), efficient resource allocation, high throughput
Continuous Streaming**: ultra-low latency (milliseconds), static resource allocation, high throughput

**experimental release in Spark 2.3, read our blog

Event-time Aggregations

Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:
parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 mins"))
  .avg("signal")

UDAFs are supported too!

Stateful Processing for Aggregations

Aggregates have to be saved as distributed state between triggers. Each trigger
reads the previous state and writes updated state (src → process → state → sink at
t=1, t=2, t=3, ...).

The state is stored in executor memory and backed by a write-ahead log in HDFS:
state updates are written to the write-ahead log for checkpointing.

Fault-tolerant, exactly-once guarantees!

Automatically handles Late Data

Keeping state allows late data to update the counts of old windows: in the example
timeline, the count for the 12:00 - 13:00 window keeps changing as late records for
that window arrive hours later (red = state updated with late data).

But the size of the state increases indefinitely if old windows are not dropped.

Watermarking

Watermark: a moving threshold of how late data is expected to be and when to drop
old state.

It trails behind the max event time seen by the engine by a configurable gap, the
watermark delay. For example, with a max event time of 12:30 PM and a trailing gap
of 10 minutes, the watermark is 12:20 PM; data older than the watermark is not
expected.

Watermarking

Data newer than the watermark may be late, but it is still allowed to aggregate.

Data older than the watermark is "too late" and is dropped.

Windows older than the watermark are automatically deleted to limit the amount of
intermediate state.

Watermarking

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

Watermarks are useful only in stateful operations; they are ignored in non-stateful
streaming queries and in batch queries.

Watermarking

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

Example timeline: the system tracks the max observed event time; at each trigger the
watermark for the next trigger is updated to (max event time - 10 min). When the max
event time is 12:14, the watermark becomes 12:04 and all window state older than
12:04 is deleted. Late data with event times newer than the watermark (e.g. a record
with event time 12:15 arriving late) is still counted, while data older than the
watermark is too late, ignored in the counts, and its state is dropped.

More details in my blog post.
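
An end-to-end sketch of the same watermarked query, written out so it can run. This
is a minimal sketch assuming a SparkSession named spark and that parsedData has a
TimestampType column called "timestamp"; the checkpoint path is hypothetical.

import org.apache.spark.sql.functions.window
import spark.implicits._

val windowedCounts = parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

// "update" mode emits only the window counts that changed in each trigger;
// windows older than the watermark are finalized and their state dropped.
windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/tmp/checkpoints/windowed-counts")
  .start()
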
Other Interesting Operations

Streaming deduplication:
parsedData.dropDuplicates("eventId")

Joins (stream-batch and stream-stream):
stream1.join(stream2, "device")

Arbitrary stateful processing with [map|flatMap]GroupsWithState:
ds.groupByKey(_.id)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

See my previous Spark Summit talk and blog posts (here and here).
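
As a hedged illustration of mapGroupsWithState (the event and state types, the
counting logic, and the 30-minute timeout below are assumptions, not from the talk),
this keeps a running count per key and cleans up state for idle keys:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(id: String, value: Long)
case class RunningCount(count: Long)

// events: Dataset[Event] parsed from the stream (assumed).
val runningCounts = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (id: String, newEvents: Iterator[Event], state: GroupState[RunningCount]) =>
      if (state.hasTimedOut) {                   // no data for this key in 30 minutes
        val last = state.get.count
        state.remove()                           // drop the per-key state
        (id, last)
      } else {
        val previous = state.getOption.map(_.count).getOrElse(0L)
        val updated  = RunningCount(previous + newEvents.size)
        state.update(updated)                    // persisted for the next trigger
        state.setTimeoutDuration("30 minutes")   // reset the inactivity timeout
        (id, updated.count)                      // record emitted for this key
      }
  }

The resulting Dataset is written out like any other streaming query, typically with
the update output mode.
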
Data Pipeline with Structured Streaming and Delta

Dump and Complex ETL run as STRUCTURED STREAMING jobs into DELTA; SQL, ML, and
STREAMING analytics on top drive Incident Response, Alerting, and Reports.

ETL @

Evolution of a Cutting-Edge Data Pipeline

Events feed two consumers: Streaming Analytics, and a Data Lake used for Reporting.
The question is how to build one pipeline that serves both well.

Challenge #1: Historical Queries?
1 λ-arch: run a parallel batch path (a lambda architecture) over the data lake
alongside the streaming path.

Challenge #2: Messy Data?
2 Validation: add validation steps on both the streaming and batch paths.

Challenge #3: Mistakes and Failures?
3 Reprocessing: keep the data lake partitioned so that bad or failed loads can be
reprocessed.

Challenge #4: Query Performance?
4 Compaction: compact the small files in the data lake, with jobs scheduled to avoid
running during compaction.

Each fix bolts another component onto the pipeline: λ-arch (1), Validation (2),
Reprocessing (3), and Compaction (4).

Let’s try it instead with DELTA

DELTA: the SCALE of a data lake, the RELIABILITY & PERFORMANCE of a data warehouse,
and the LOW-LATENCY of streaming.

THE GOOD OF DATA WAREHOUSES:
• Pristine Data
• Transactional Reliability
• Fast Queries

THE GOOD OF DATA LAKES:
• Massive scale on cloud storage
• Open Formats (Parquet, ORC)
• Predictions (ML) & Streaming

Databricks Delta Combines the Best
MASSIVE SCALE Decouple Compute & Storage

RELIABILITY ACID Transactions & Data Validation

PERFORMANCE Data Indexing & Caching (10-100x)

OPEN Data stored as Parquet, ORC, etc.

LOW-LATENCY Integrated with Structured Streaming


The Canonical Data Pipeline

Events stream into DELTA tables via Structured Streaming; downstream DELTA tables
feed Streaming Analytics and Reporting.

1 λ-arch: not needed, since Delta handles both short-term and long-term data
2 Validation: easy, as short-term and long-term data live in one location
3 Reprocessing: easy and seamless with Delta's transactional guarantees
4 Compaction
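
A hedged sketch of wiring Structured Streaming to Delta tables (the broker address
and paths are hypothetical; "delta" is the format provided by Databricks Delta): raw
events stream into a Delta table, and a downstream job can read that same table as a
streaming source.

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("cast(value as string) as json")

// Continuously append the raw events to a Delta table.
raw.writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/events/_checkpoints/ingest")
  .start("/delta/events")

// Downstream queries can treat the same Delta table as another streaming source.
val refined = spark.readStream
  .format("delta")
  .load("/delta/events")
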
Accelerate Innovation with the Databricks Unified Analytics Platform

DATABRICKS COLLABORATIVE NOTEBOOKS (explore data, train models, serve models):
increase data science productivity by 5x, with open, extensible APIs.

DATABRICKS RUNTIME: improves performance (including I/O) by 10-20x over Apache
Spark. DATABRICKS DELTA: higher performance and reliability for your data lake.

DATABRICKS MANAGED SERVICE: removes DevOps and infrastructure complexity, with
serverless operation, SLAs, and Databricks Enterprise Security.

Works across IoT / streaming data, cloud storage data, and Hadoop storage.


Data Pipelines with STRUCTURED STREAMING and DELTA

STRUCTURED STREAMING: fast, scalable, fault-tolerant stream processing with
high-level, user-friendly APIs.

DELTA: a data storage solution with the reliability of data warehouses and the
scalability of data lakes.

Dump and Complex ETL run as STRUCTURED STREAMING jobs into DELTA; SQL, ML, and
STREAMING analytics on top drive Incident Response, Alerting, and Reports.

More Info

Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Databricks blog posts for more focused discussions on streaming
https://databricks.com/blog/category/engineering/streaming

Databricks Delta
https://databricks.com/product/databricks-delta

Thank you!
@tathadas
