Spark Streaming
Tathagata “TD” Das
Whoami
• Core committer on Apache Spark
• Lead developer on Spark Streaming
• On leave from the PhD program at UC Berkeley
Big Data
Big Streaming Data
Big Streaming Data Processing
Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
How to Process Big Streaming Data
[Diagram: raw data streams → distributed processing system → processed data]
Scales to hundreds of nodes
Achieves low latency
Efficiently recovers from failures
Integrates with batch and interactive processing
What have people been doing?
> Build two stacks – one for batch, one for streaming
- Often both process same data
> Existing frameworks cannot do both
- Either, stream processing of 100s of MB/s with low latency
- Or, batch processing of TBs of data with high latency
> Extremely painful to maintain two different stacks
- Different programming models
- Doubles implementation effort
- Doubles operational effort
Fault-tolerant Stream Processing
> Traditional processing model
  - Pipeline of nodes
  - Each node maintains mutable state
  - Each input record updates the state and new records are sent out
[Diagram: input records flow through node 1, node 2, and node 3; each node holds mutable state]
> Mutable state is lost if a node fails
> Making stateful stream processing fault-tolerant is challenging!
Existing Streaming Systems
> Storm
- Replays record if not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
> Trident – Use transactions to update state
- Processes each record exactly once
- Per-state transaction to external database is slow
What is Spark Streaming?
> Receive data streams from input sources, process them in a cluster, push out to databases/dashboards
> Scalable, fault-tolerant, second-scale latencies
[Diagram: Kafka, Flume, HDFS/S3, Kinesis, and Twitter feed Spark Streaming, which pushes results to HDFS, databases, and dashboards]
How does Spark Streaming work?
> Chop up data streams into batches of a few seconds
> Spark treats each batch of data as RDDs and
processes them using RDD operations
> Processed results are pushed out in batches
[Diagram: data streams → receivers → batches as RDDs → Spark → results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
- Represents a stream of data
- Implemented as a sequence of RDDs
> DStreams API very similar to RDD API
- Functional APIs in Scala, Java
- Create input DStreams from different sources
- Apply parallel operations
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the input DStream "tweets"; each batch (@ t, @ t+1, @ t+2) is stored in memory as an RDD]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream; new RDDs ([#cat, #dog, ...]) are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage
[Diagram: flatMap applied to each batch of tweets produces hashTags; a save operation writes every batch to HDFS]
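For reference, the fragments above can be assembled into one runnable program. This is a minimal sketch, not slide material: the application name, the getTags helper, the output path, the use of saveAsTextFiles as the output operation, and reading Twitter credentials from twitter4j system properties are all assumptions.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter.TwitterUtils
  import twitter4j.Status

  object HashTagApp {
    // assumed helper: extract words starting with '#'
    def getTags(status: Status): Seq[String] =
      status.getText.split(" ").filter(_.startsWith("#")).toSeq

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("HashTagApp")
      val ssc  = new StreamingContext(conf, Seconds(1))    // 1-second batches

      val tweets   = TwitterUtils.createStream(ssc, None)  // None => twitter4j OAuth from system properties
      val hashTags = tweets.flatMap(status => getTags(status))

      // plain-DStream counterpart of the slide's output operation
      hashTags.saveAsTextFiles("hdfs://.../hashtags", "txt")

      ssc.start()             // start receiving and processing
      ssc.awaitTermination()  // block until the streaming job is stopped
    }
  }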
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreach: do whatever you want with the processed data
[Diagram: flatMap applied to each batch of tweets produces hashTags; a foreach operation runs on every batch: write to a database, update an analytics UI, do whatever you want]
Languages
Scala API
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java API (Function objects instead of closures)

JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
Python API
...soon
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length (1 minute), sliding interval (5 seconds)
[Diagram: a window of the given length slides over the DStream of data at the given interval]
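The same counts can also be computed incrementally with countByValueAndWindow, which folds the windowing and counting into one step. This is a sketch; the checkpoint directory (required because incremental window operators keep intermediate state) is an assumed path.

  ssc.checkpoint("hdfs://.../checkpoints")  // assumed location for window/state checkpoints
  val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(5))
  tagCounts.print()                         // print a few counts on every 5-second slide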
Arbitrary Stateful Computations
Specify a function to generate new state based on the previous state and new data
  - Example: Maintain per-user mood as state, and update it with their tweets
def updateMood(newTweets, lastMood) => newMood

val moods = tweetsByUser.updateStateByKey(updateMood _)
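The snippet above is slide pseudocode; a concrete sketch of the same idea follows. The Mood type, the sentiment scoring, the tweetsByUser DStream of (user, tweet text) pairs, and the checkpoint path are all assumptions made for illustration.

  case class Mood(score: Double)

  // updateStateByKey calls this once per key and batch, with the batch's new values and the previous state
  def updateMood(newTweets: Seq[String], lastMood: Option[Mood]): Option[Mood] = {
    val previous = lastMood.getOrElse(Mood(0.0))
    // assumed scoring: +1 for a ":)" in the tweet, -1 for a ":("
    val delta = newTweets.map { t =>
      if (t.contains(":)")) 1.0 else if (t.contains(":(")) -1.0 else 0.0
    }.sum
    Some(Mood(previous.score + delta))
  }

  ssc.checkpoint("hdfs://.../checkpoints")  // stateful operators require checkpointing
  val moods = tweetsByUser.updateStateByKey(updateMood _)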
Arbitrary Combinations of Batch and
Streaming Computations
Inter-mix RDD and DStream operations!
- Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamFile).filter(...)
})
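A slightly more concrete sketch of that transform-plus-join pattern, assuming (since the slide elides them) that spamFile is a static RDD keyed by user id and that tweetsByUser is a DStream of (user id, tweet text) pairs:

  // static RDD of known-spam user ids, loaded once from HDFS
  val spamFile = ssc.sparkContext.textFile("hdfs://.../spam_users.txt").map(id => (id, true))

  val cleanTweets = tweetsByUser.transform { tweetsRDD =>
    tweetsRDD.leftOuterJoin(spamFile)
             .filter { case (_, (_, isSpam)) => isSpam.isEmpty }  // keep tweets whose user is not in the spam list
             .mapValues { case (text, _) => text }
  }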
DStreams + RDDs = Power
> Combine live data streams with historical data
- Generate historical data models with Spark, etc.
- Use data models to process live data stream
> Combine streaming with MLlib, GraphX algos
- Offline learning, online prediction
  - Online learning and prediction
[Diagram: the unified Apache Spark stack with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)]
> Interactively query streaming data using SQL
- select * from table_from_streaming_data
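One way to back that SQL query with live data is to re-register each batch as a temporary table. This is a sketch under assumptions: Spark SQL is on the classpath, the Tag case class is introduced here for illustration, and the table name matches the query above.

  import org.apache.spark.sql.SQLContext

  case class Tag(tag: String)

  val sqlContext = new SQLContext(ssc.sparkContext)
  import sqlContext.implicits._

  hashTags.foreachRDD { rdd =>
    // overwrite the temp table with the latest batch of hashtags
    rdd.map(Tag(_)).toDF().registerTempTable("table_from_streaming_data")
  }

  // from a shell sharing this SQLContext:
  //   sqlContext.sql("select * from table_from_streaming_data").show()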
Advantage of a Unified Stack
> Explore data interactively to identify problems

  $ ./spark-shell
  scala> val file = sc.hadoopFile("smallLogs")
  scala> val filtered = file.filter(_.contains("ERROR"))
  scala> val mapped = filtered.map(...)
  ...

> Use the same code in Spark for processing large logs

  object ProcessProductionData {
    def main(args: Array[String]) {
      val sc = new SparkContext(...)
      val file = sc.hadoopFile("productionLogs")
      val filtered = file.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }

> Use similar code in Spark Streaming for realtime processing

  object ProcessLiveStream {
    def main(args: Array[String]) {
      val ssc = new StreamingContext(...)
      val stream = KafkaUtils.createStream(...)
      val filtered = stream.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency
[Charts: cluster throughput (GB/s) vs. number of nodes in the cluster, for Grep and WordCount, with 1-second and 2-second batch intervals]
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance
> Data lost due to worker failure can be recomputed from the replicated input data
> All transformations are fault-tolerant and provide exactly-once semantics
[Diagram: the tweets RDD (input data replicated in memory) flows through flatMap into the hashTags RDD; lost partitions are recomputed on other workers]
Input Sources
• Out of the box, we provide
- Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.
• Very easy to write a custom receiver (see the sketch below)
  - Define what to do when the receiver is started and stopped
• Also, generate your own sequence of RDDs, etc. and push them in as a "stream"
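A minimal custom receiver sketch using the Receiver API (onStart / onStop / store); the line-over-socket source and the class name are just illustrations.

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.Socket
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // start a background thread that reads lines and hands them to Spark
      new Thread("Line Receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = {
      // nothing to clean up here: the reading thread exits when the connection closes or the receiver is stopped
    }

    private def receive(): Unit = {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)               // push each record into Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to reconnect")  // let Spark restart the receiver after the source closes
    }
  }

  // usage: val lines = ssc.receiverStream(new LineReceiver("localhost", 9999))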
Output Sinks
• HDFS, S3, etc. (Hadoop API compatible filesystems)
• Cassandra (using Spark-Cassandra connector)
• HBase (integrated support coming to Spark soon)
• Directly push the data anywhere
Conclusion