Spark: Fast, Interactive, Language-Integrated Cluster Computing
www.spark-project.org
UC BERKELEY
Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
Iterative algorithms (machine learning, graphs)
Interactive data mining

Enhance programmability:
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: acyclic data flow — Input feeds Map tasks, whose outputs feed Reduce tasks, producing Output]
Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (machine learning, graphs)
Interactive data mining tools (R, Excel, Python)
Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce
Outline
Spark programming model
Implementation
Demo
User applications
Programming Model

Resilient distributed datasets (RDDs)
Immutable, partitioned collections of objects
Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
Can be cached for efficient reuse

Actions on RDDs (e.g., count, collect, save) return a result to the driver program
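The transformation/action split above deliberately mirrors Scala's own collection methods. A minimal sketch of the pattern on plain Scala collections (no Spark needed; the log lines and variable names are illustrative, and Spark's transformations are lazy whereas these collection calls run eagerly):

```scala
// Sketch: transformations (filter, map) followed by an action (count),
// shown on local Scala collections. Illustrative data, not a real job.
object RddStyle {
  def main(args: Array[String]): Unit = {
    val lines = List("ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown")

    // Transformations: derive new collections from existing ones
    val errors   = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(1))

    // Action: produce a concrete result for the caller
    val n = messages.count(_.contains("disk"))
    println(n)  // 1
  }
}
```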
Example: Log Mining

errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

[Diagram: the driver dispatches tasks to workers, each of which caches a block (Block 1-3) of the file in memory]
RDD Fault Tolerance

Ex:
HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD
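The point of the lineage chain above is that a lost partition can be recomputed from its source by replaying the deriving functions, instead of being replicated. A toy Scala sketch of that idea (illustrative names and structure, not Spark's actual internals):

```scala
// Toy model of lineage-based recovery: each derived dataset remembers
// its parent and the function that produced it, so a "lost" in-memory
// copy can be rebuilt by recomputing along the chain.
sealed trait Lineage[T] { def compute(): Seq[T] }

case class Source[T](data: Seq[T]) extends Lineage[T] {
  def compute(): Seq[T] = data
}

case class Derived[A, B](parent: Lineage[A], f: Seq[A] => Seq[B]) extends Lineage[B] {
  def compute(): Seq[B] = f(parent.compute())  // replay the lineage
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val file     = Source(Seq("ERROR\ta", "INFO\tb", "ERROR\tc"))
    val filtered = Derived[String, String](file, _.filter(_.contains("ERROR")))
    val mapped   = Derived[String, String](filtered, _.map(_.split("\t")(1)))

    // Even if the cached result of `mapped` is lost, it can be
    // recomputed from the source by replaying filter then map:
    println(mapped.compute())  // List(a, c)
  }
}
```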
Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter plot of two point classes, showing a random initial line converging to the target separating line]
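The slide's algorithm is iterative: each pass rescans the same working set of points to refine the separator, which is exactly the reuse pattern RDD caching targets. A hedged local sketch in plain Scala (made-up 1-D data, learning rate, and iteration count; in Spark the point set would be a cached RDD):

```scala
// Sketch: batch gradient descent for logistic regression on a tiny
// 1-D dataset. Each iteration rescans the same `points` working set.
// Data and hyperparameters are illustrative only.
object LogRegSketch {
  def main(args: Array[String]): Unit = {
    // (feature, label) pairs; labels are +1 / -1
    val points = Seq((1.0, 1.0), (2.0, 1.0), (-1.0, -1.0), (-2.0, -1.0))
    var w = 0.0  // weight defining the separating line

    for (_ <- 1 to 100) {  // every iteration reuses `points`
      // gradient of the logistic loss, summed over all points
      val gradient = points.map { case (x, y) =>
        x * y / (1.0 + math.exp(y * w * x))
      }.sum
      w += 0.1 * gradient  // step toward the separator
    }
    println(w > 0)  // true: positive points now score positive
  }
}
```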
Logistic Regression Performance

[Bar chart: Running Time (s) for Hadoop vs Spark; Hadoop: 127 s / iteration]
Spark Applications

In-memory data mining on Hive data (Conviva)
Predictive analytics (Quantifind)
City traffic prediction (Mobile Millennium)
Twitter spam classification (Monarch)
Collaborative filtering via matrix factorization
Conviva GeoReport

[Bar chart: report time in hours — Hive: 20, Spark: 0.5]
Frameworks Built on
Spark
Pregel on Spark (Bagel)
Implementation

Runs on Apache Mesos to share resources with Hadoop & other apps

[Diagram: Spark, Hadoop, and MPI running side by side on Mesos]
Spark Scheduler

Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles

[Diagram: example job DAG over RDDs A-G, split into stages — Stage 1: groupBy; Stage 2: map, union; Stage 3: join]
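"Pipelines functions within a stage" means consecutive per-element transformations run as a single pass over the data rather than materializing each intermediate dataset. A plain-Scala sketch of the idea using a lazy view (an analogy for the scheduler's behavior, not its implementation):

```scala
// Sketch: fusing map and filter into one pass with a lazy view,
// analogous to pipelining narrow transformations within a stage.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val data = (1 to 10).toList

    // Eager: materializes an intermediate list after map, then another
    // after filter.
    val eager = data.map(_ * 2).filter(_ > 10)

    // Pipelined: each element flows through map then filter in one
    // pass; nothing is materialized until the final toList.
    val pipelined = data.view.map(_ * 2).filter(_ > 10).toList

    println(pipelined)  // List(12, 14, 16, 18, 20)
    assert(eager == pipelined)  // same result, fewer intermediates
  }
}
```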
Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
Demo
Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps
Download our open source release:
www.spark-project.org
[email protected]
Related Work
DryadLINQ, FlumeJava
Relational databases
Lineage/provenance, logical logging, materialized views
[Chart: iteration time (s) as the fraction of the working set cached varies from cache disabled to fully cached; times fall from 58.1 s down to 11.5 s]
Fault Recovery Results

[Chart: iteration time (s) for iterations 1-10, with and without failure; most iterations take 56-59 s, with outliers of 119 s and 81 s]
Spark Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
collect, reduce, count, save, lookupKey
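The key-based transformations in the table have close analogues on ordinary Scala collections. A local sketch of what groupByKey and reduceByKey compute on key-value pairs (illustrative data; plain collections, not Spark, and without Spark's distributed partitioning):

```scala
// Sketch of groupByKey vs reduceByKey semantics on key-value pairs,
// expressed with plain Scala collections. Illustrative only.
object KeyOpsSketch {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // groupByKey: collect all values for each key
    // "a" -> List(1, 3), "b" -> List(2)
    val grouped = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // reduceByKey(_ + _): combine the values per key
    // "a" -> 4, "b" -> 2
    val reduced = grouped.map { case (k, vs) => (k, vs.sum) }

    println(grouped)
    println(reduced)
  }
}
```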