Spark Streaming
Tathagata “TD” Das
Whoami
• Core committer on Apache Spark
• Lead developer on Spark Streaming
• On leave from the PhD program at UC Berkeley
Big Data
Big Streaming Data
Big Streaming Data Processing
Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
How to Process Big Streaming Data
[Diagram: raw data streams → distributed processing system → processed data]
Scales to hundreds of nodes
Achieves low latency
Efficiently recovers from failures
Integrates with batch and interactive processing
What have people been doing?
> Build two stacks – one for batch, one for streaming
- Often both process same data
> Existing frameworks cannot do both
- Either, stream processing of 100s of MB/s with low latency
- Or, batch processing of TBs of data with high latency
> Extremely painful to maintain two different stacks
- Different programming models
- Doubles implementation effort
- Doubles operational effort
Fault-tolerant Stream Processing
> Traditional processing model
  - Pipeline of nodes
  - Each node maintains mutable state
  - Each input record updates the state and new records are sent out
[Diagram: input records flow through node 1, node 2, and node 3; each node holds mutable state]
> Mutable state is lost if a node fails
> Making stateful stream processing fault-tolerant is challenging!
Existing Streaming Systems
> Storm
- Replays record if not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
> Trident – Use transactions to update state
- Processes each record exactly once
- Per-state transaction to external database is slow
What is Spark Streaming?
> Receive data streams from input sources, process them in a cluster, push out to databases/dashboards
> Scalable, fault-tolerant, second-scale latencies
[Diagram: Kafka, Flume, HDFS/S3, Kinesis, and Twitter feed Spark Streaming, which pushes results to HDFS, databases, and dashboards]
How does Spark Streaming work?
> Chop up data streams into batches of a few seconds
> Spark treats each batch of data as RDDs and
processes them using RDD operations
> Processed results are pushed out in batches
[Diagram: data streams → receivers → batches as RDDs → Spark → results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
- Represents a stream of data
- Implemented as a sequence of RDDs
> DStreams API very similar to RDD API
- Functional APIs in Scala, Java
- Create input DStreams from different sources
- Apply parallel operations
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the input DStream "tweets"; each batch (@ t, @ t+1, @ t+2) is stored in memory as an RDD]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream; new RDDs ([#cat, #dog, ...]) are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage
[Diagram: flatMap applied to each batch of tweets produces hashTags; a save operation writes every batch to HDFS]
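For reference, the fragments above can be assembled into one runnable program. This is a minimal sketch, not slide material: the application name, the getTags helper, the output path, the use of saveAsTextFiles as the output operation, and reading Twitter credentials from twitter4j system properties are all assumptions.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter.TwitterUtils
  import twitter4j.Status

  object HashTagApp {
    // assumed helper: extract words starting with '#'
    def getTags(status: Status): Seq[String] =
      status.getText.split(" ").filter(_.startsWith("#")).toSeq

    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("HashTagApp")
      val ssc  = new StreamingContext(conf, Seconds(1))    // 1-second batches

      val tweets   = TwitterUtils.createStream(ssc, None)  // None => twitter4j OAuth from system properties
      val hashTags = tweets.flatMap(status => getTags(status))

      // plain-DStream counterpart of the slide's output operation
      hashTags.saveAsTextFiles("hdfs://.../hashtags", "txt")

      ssc.start()             // start receiving and processing
      ssc.awaitTermination()  // block until the streaming job is stopped
    }
  }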
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreach: do whatever you want with the processed data
[Diagram: flatMap applied to each batch of tweets produces hashTags; a foreach operation runs on every batch: write to a database, update an analytics UI, do whatever you want]
Languages
Scala API
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java API (Function objects instead of closures)

JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
Python API
...soon
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length (1 minute), sliding interval (5 seconds)
[Diagram: a window of the given length slides over the DStream of data at the given interval]
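The same counts can also be computed incrementally with countByValueAndWindow, which folds the windowing and counting into one step. This is a sketch; the checkpoint directory (required because incremental window operators keep intermediate state) is an assumed path.

  ssc.checkpoint("hdfs://.../checkpoints")  // assumed location for window/state checkpoints
  val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(5))
  tagCounts.print()                         // print a few counts on every 5-second slide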
Arbitrary Stateful Computations
Specify a function to generate new state based on the previous state and new data
  - Example: Maintain per-user mood as state, and update it with their tweets
def updateMood(newTweets, lastMood) => newMood

val moods = tweetsByUser.updateStateByKey(updateMood _)
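The snippet above is slide pseudocode; a concrete sketch of the same idea follows. The Mood type, the sentiment scoring, the tweetsByUser DStream of (user, tweet text) pairs, and the checkpoint path are all assumptions made for illustration.

  case class Mood(score: Double)

  // updateStateByKey calls this once per key and batch, with the batch's new values and the previous state
  def updateMood(newTweets: Seq[String], lastMood: Option[Mood]): Option[Mood] = {
    val previous = lastMood.getOrElse(Mood(0.0))
    // assumed scoring: +1 for a ":)" in the tweet, -1 for a ":("
    val delta = newTweets.map { t =>
      if (t.contains(":)")) 1.0 else if (t.contains(":(")) -1.0 else 0.0
    }.sum
    Some(Mood(previous.score + delta))
  }

  ssc.checkpoint("hdfs://.../checkpoints")  // stateful operators require checkpointing
  val moods = tweetsByUser.updateStateByKey(updateMood _)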
Arbitrary Combinations of Batch and
Streaming Computations
Inter-mix RDD and DStream operations!
- Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamFile).filter(...)
})
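A slightly more concrete sketch of that transform-plus-join pattern, assuming (since the slide elides them) that spamFile is a static RDD keyed by user id and that tweetsByUser is a DStream of (user id, tweet text) pairs:

  // static RDD of known-spam user ids, loaded once from HDFS
  val spamFile = ssc.sparkContext.textFile("hdfs://.../spam_users.txt").map(id => (id, true))

  val cleanTweets = tweetsByUser.transform { tweetsRDD =>
    tweetsRDD.leftOuterJoin(spamFile)
             .filter { case (_, (_, isSpam)) => isSpam.isEmpty }  // keep tweets whose user is not in the spam list
             .mapValues { case (text, _) => text }
  }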
DStreams + RDDs = Power
> Combine live data streams with historical data
- Generate historical data models with Spark, etc.
- Use data models to process live data stream
> Combine streaming with MLlib, GraphX algos
- Offline learning, online prediction
  - Online learning and prediction
[Diagram: the unified Apache Spark stack with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)]
> Interactively query streaming data using SQL
- select * from table_from_streaming_data
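One way to back that SQL query with live data is to re-register each batch as a temporary table. This is a sketch under assumptions: Spark SQL is on the classpath, the Tag case class is introduced here for illustration, and the table name matches the query above.

  import org.apache.spark.sql.SQLContext

  case class Tag(tag: String)

  val sqlContext = new SQLContext(ssc.sparkContext)
  import sqlContext.implicits._

  hashTags.foreachRDD { rdd =>
    // overwrite the temp table with the latest batch of hashtags
    rdd.map(Tag(_)).toDF().registerTempTable("table_from_streaming_data")
  }

  // from a shell sharing this SQLContext:
  //   sqlContext.sql("select * from table_from_streaming_data").show()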
Advantage of a Unified Stack
> Explore data interactively to identify problems

  $ ./spark-shell
  scala> val file = sc.hadoopFile("smallLogs")
  scala> val filtered = file.filter(_.contains("ERROR"))
  scala> val mapped = filtered.map(...)
  ...

> Use the same code in Spark for processing large logs

  object ProcessProductionData {
    def main(args: Array[String]) {
      val sc = new SparkContext(...)
      val file = sc.hadoopFile("productionLogs")
      val filtered = file.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }

> Use similar code in Spark Streaming for realtime processing

  object ProcessLiveStream {
    def main(args: Array[String]) {
      val ssc = new StreamingContext(...)
      val stream = KafkaUtils.createStream(...)
      val filtered = stream.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency
[Charts: cluster throughput (GB/s) vs. number of nodes in the cluster, for Grep and WordCount, with 1-second and 2-second batch intervals]
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance
> Data lost due to worker failure can be recomputed from the replicated input data
> All transformations are fault-tolerant and provide exactly-once semantics
[Diagram: the tweets RDD (input data replicated in memory) flows through flatMap into the hashTags RDD; lost partitions are recomputed on other workers]
Input Sources
• Out of the box, we provide
- Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.
• Very easy to write a custom receiver (see the sketch below)
  - Define what to do when the receiver is started and stopped
• Also, generate your own sequence of RDDs, etc. and push them in as a "stream"
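A minimal custom receiver sketch using the Receiver API (onStart / onStop / store); the line-over-socket source and the class name are just illustrations.

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.Socket
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // start a background thread that reads lines and hands them to Spark
      new Thread("Line Receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = {
      // nothing to clean up here: the reading thread exits when the connection closes or the receiver is stopped
    }

    private def receive(): Unit = {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)               // push each record into Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to reconnect")  // let Spark restart the receiver after the source closes
    }
  }

  // usage: val lines = ssc.receiverStream(new LineReceiver("localhost", 9999))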
Output Sinks
• HDFS, S3, etc. (Hadoop API compatible filesystems)
• Cassandra (using Spark-Cassandra connector)
• HBase (integrated support coming to Spark soon)
• Directly push the data anywhere
Conclusion