DevOps Slides
http://spark-summit.org/east/training/devops
www.linkedin.com/in/blueplastic
making big data simple
Some slides will be skipped

[Diagram: the Spark ecosystem – Spark Streaming, SQL, MLlib, Tachyon, and Apps on top of the Spark core; distributions: CDH, HDP, MapR, DSE]
[Timeline: General Batch Processing (2004 – 2013) → Specialized Systems (2007 – 2015?): Pregel, Giraph, Tez, Drill, Dremel, Mahout, Storm, S4, Impala, GraphLab (iterative, interactive, ML, streaming, graph, SQL, etc.) → General Unified Engine (2014 – ?)]
Aug 2009
Source: openhub.net
10x – 100x

[Diagram: bandwidth hierarchy – CPUs: 10 GB/s to memory; local disk: 100 MB/s (600 MB/s SSD); nodes in same rack: 1 Gb/s or 125 MB/s over the network; nodes in another rack: 0.1 – 1 Gb/s (125 MB/s)]
June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.”
April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
Coming soon!
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)
https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
http://shop.oreilly.com/product/0636920028512.do
$30 @ Amazon:
http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624
http://tinyurl.com/dsesparklab
- 102 pages
- DevOps style
- Includes:
- Spark Streaming
- Dangers of
GroupByKey vs.
ReduceByKey
http://tinyurl.com/cdhsparklab
- 109 pages
- DevOps style
- Includes:
- PySpark
- Spark SQL
- Spark-submit
Next slide is only for on-site students…
http://tinyurl.com/sparknycsurvey
(Scala & Python only)
[Diagram: the Driver Program talks to Worker Machines; each Worker runs an Executor (Ex) that holds RDD partitions]

more partitions = more parallelism

[Diagram: an RDD with 4 partitions spread across the Executors of three Workers]
logLinesRDD (4 partitions):
(Error, ts, msg1) (Info, ts, msg8) (Error, ts, msg3)
(Error, ts, msg4) (Warn, ts, msg2) (Warn, ts, msg2)
(Info, ts, msg5) (Warn, ts, msg9) (Error, ts, msg1)
(Info, ts, msg8) (Info, ts, msg5) (Error, ts, msg1)
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc) – see the sketch after this block

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

- Take an existing in-memory collection and pass it to SparkContext’s parallelize method
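For the second option, a minimal sketch of creating an RDD from an external source; the HDFS path and S3 bucket below are hypothetical:

# Read data from an external source (HDFS, S3, a local file, ...)
logLinesRDD = sc.textFile("hdfs:///logs/app.log")      # hypothetical HDFS path
s3LinesRDD = sc.textFile("s3n://my-bucket/logs/*.gz")  # hypothetical S3 bucket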
.filter( )

errorsRDD:
(Error, ts, msg1) (Error, ts, msg3) (Error, ts, msg4) (Error, ts, msg1) (Error, ts, msg1)

errorsRDD   .coalesce( 2 )   .collect( )
[Diagram sequence: calling .collect( ) tells the Driver to execute the DAG – logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2 ) → cleanedRDD → .collect( ); the filter and coalesce steps are pipelined together into Stage-1, and the resulting data is shipped back to the Driver. A sketch of this pipeline in code follows.]
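A minimal PySpark sketch of the lineage above, assuming the log lines live in a hypothetical text file whose lines start with their level (Error/Warn/Info):

logLinesRDD = sc.textFile("hdfs:///logs/app.log", 4)                    # hypothetical path, 4 partitions
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
cleanedRDD = errorsRDD.coalesce(2)                                      # shrink to 2 partitions without a shuffle
result = cleanedRDD.collect()                                           # action: ships the data back to the Driver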
.filter( ) → errorsRDD

[The resulting filtered RDD just records its function (func = _.contains(…)) and shouldCache = false]
1) Create some input RDDs from external data or parallelize a
collection in your driver program.
- Most transformations are element-wise (they work on one element at a time), but this is not
true for all transformations
Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), countByKey(), foreach(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), saveToCassandra(), ...
• HadoopRDD • DoubleRDD • CassandraRDD (DataStax)
• FilteredRDD • JdbcRDD • GeoRDD (ESRI)
• MappedRDD • JsonRDD • EsSpark (ElasticSearch)
• PairRDD • SchemaRDD
• ShuffledRDD • VertexRDD
• UnionRDD • EdgeRDD
• PythonRDD
* Set of partitions (“splits”); dependencies = none; partitioner = none
* Partitions = same as parent RDD; partitioner = none
* Partitions = one per reduce task; partitioner = HashPartitioner(numTasks)
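A small PySpark sketch (the data is made up) showing how to peek at these properties – the partition count and the partitioner that appears after a shuffle:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.getNumPartitions())     # number of partitions ("splits")
print(pairs.partitioner)            # None: no partitioner before a shuffle
summed = pairs.reduceByKey(lambda x, y: x + y, numPartitions=4)
print(summed.getNumPartitions())    # one partition per reduce task: 4
print(summed.partitioner)           # a hash partitioner after the shuffle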
[Diagram: Spark Executor → Spark-C* Connector → C* Java Driver → Cassandra Keyspace / Table]
- Open Source
- Implemented mostly in Scala
- Scala + Java APIs
- Does automatic type conversions
https://github.com/datastax/spark-cassandra-connector
“Simple things
should be simple,
complex things
should be possible”
- Alan Kay
DEMO:
- Local
- Standalone Scheduler
- YARN
- Mesos
(Static Partitioning vs. Dynamic Partitioning)
History:
[Diagram: Hadoop MapReduce v1 – a JobTracker (JT) and NameNode (NN) on the master; each worker node runs a DataNode (D) and TaskTracker (TT) on the OS, with fixed Map (M) and Reduce (R) task slots]
JVM: Ex + Driver

CPUs: 3 options:
- local
- local[N]
- local[*]
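A minimal sketch of starting PySpark in local mode with all cores (the app name is arbitrary):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("LocalDemo")  # one JVM: Executor + Driver
sc = SparkContext(conf=conf)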
- SPARK_WORKER_CORES

[Diagram: Spark Standalone – the Driver and the Spark Master coordinate four Worker (W) machines; each Worker launches an Executor (Ex) whose internal threads hold RDD partitions (P1, P4, P5, P7), backed by an OS Disk and SSDs]
- SPARK_WORKER_CORES

[Diagram: the same Standalone cluster with multiple Spark Masters – the Master is HA via ZooKeeper, and more Masters can be added live]
[Diagram: two applications sharing the Standalone cluster – each Driver gets its own Executors on the Workers, all allocated by the Spark Master]
Standalone settings (conf/spark-env.sh):

SPARK_WORKER_CORES: [default: ALL] # of cores to allow Spark applications to use on the machine
SPARK_WORKER_MEMORY: [default: TOTAL RAM – 1 GB] Total memory to allow Spark applications to use on the machine
SPARK_DAEMON_MEMORY: [default: 512 MB] Memory to allocate to the Spark master and worker daemons themselves
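For illustration only, a hypothetical conf/spark-env.sh for a 16-core / 64 GB worker might look like this (the values are assumptions, not recommendations):

# conf/spark-env.sh (hypothetical values)
export SPARK_WORKER_CORES=16      # let Spark applications use all 16 cores
export SPARK_WORKER_MEMORY=60g    # leave some RAM for the OS and the daemons
export SPARK_DAEMON_MEMORY=1g     # memory for the Master / Worker daemons themselves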
[Diagram: YARN – Client #1 and Client #2 submit applications to the Resource Manager (HA via ZooKeeper), which contains the Scheduler and the Apps Master; Executors launched in containers hold the RDD partitions]
Client #1 (cluster mode)
[Diagram: the Client submits to the Resource Manager, which starts an App Master container; in cluster mode the Driver runs inside the App Master, and the Executors run in their own containers holding RDD partitions]
YARN settings – Dynamic Allocation:

spark.dynamicAllocation.enabled
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (N)
spark.dynamicAllocation.schedulerBacklogTimeout (M)
spark.dynamicAllocation.executorIdleTimeout (K)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
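A hedged sketch of turning dynamic allocation on from PySpark (the executor counts are arbitrary; on YARN this also needs the external shuffle service on the NodeManagers):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")    # arbitrary lower bound
        .set("spark.dynamicAllocation.maxExecutors", "20")   # arbitrary upper bound
        .set("spark.shuffle.service.enabled", "true"))       # external shuffle service
sc = SparkContext(conf=conf)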
YARN resource manager UI: http://<ip address>:8088
[Diagram: an Executor JVM whose internal threads hold RDD partitions (RDD, P1)]
Max Executor heap size depends… maybe 40 GB (watch GC)
Storage levels:

RDD.persist(MEMORY_ONLY) – deserialized, in the Executor JVM heap
RDD.persist(MEMORY_ONLY_SER) – serialized, in the JVM heap
.persist(MEMORY_AND_DISK) – deserialized in the JVM, spilling to OS Disk / SSD when it doesn’t fit
.persist(MEMORY_AND_DISK_SER) – serialized in the JVM + spilled to disk
.persist(DISK_ONLY) – on disk only
RDD.persist(MEMORY_ONLY_2) – deserialized, replicated in the JVMs of two Executors
.persist(MEMORY_AND_DISK_2) – memory + disk, replicated on two machines
.persist(OFF_HEAP) – serialized in Tachyon, off-heap and shared across JVMs / applications (e.g. JVM-1 / App-1, JVM-2 / App-1, JVM-7 / App-2)
.unpersist() – remove the RDD from the cache
- If RDD fits in memory, choose MEMORY_ONLY
- Use replicated storage levels sparingly and only if you want fast
fault recovery (maybe to serve requests from a web app)
Remember!
PySpark: stored objects will always be serialized with Pickle library, so it does
not matter whether you choose a serialized level.
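A small PySpark sketch of choosing a storage level explicitly (MEMORY_AND_DISK is just an example choice, applied to the hypothetical logLinesRDD from the earlier sketch):

from pyspark import StorageLevel

logLinesRDD.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if it doesn't fit
logLinesRDD.count()                                # the first action materializes the cache
logLinesRDD.unpersist()                            # drop it when no longer needed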
Default Memory Allocation in Executor JVM:
- Cached RDDs: 60% (spark.storage.memoryFraction)
- Shuffle memory: 20% (spark.shuffle.memoryFraction)
- User Programs (remainder): 20%
Spark uses memory for:
RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of
memory used when caching to a certain fraction of the JVM’s overall heap, set by
spark.storage.memoryFraction
Shuffle and aggregation buffers: When performing shuffle operations, Spark will
create intermediate buffers for storing shuffle output data. These buffers are used to
store intermediate results of aggregations in addition to buffering data that is going
to be directly output as part of the shuffle.
User code: Spark executes arbitrary user code, so user functions can themselves
require substantial memory. For instance, if a user application allocates large arrays
or other objects, these will contend for overall memory usage. User code has access
to everything “left” in the JVM heap after the space for RDD storage and shuffle
storage is allocated.
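As an illustration (the fractions are assumptions, not recommendations), these defaults can be shifted through SparkConf, e.g. for a job that caches heavily but shuffles little:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.storage.memoryFraction", "0.7")   # more room for cached RDDs
        .set("spark.shuffle.memoryFraction", "0.1"))  # less room for shuffle buffers
sc = SparkContext(conf=conf)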
1. Create an RDD
INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)
Serialization is used when:
Broadcasting variables
Java serialization vs. Kryo serialization
Java serialization:
• Works with any class you create that implements java.io.Serializable
• You can control the performance of serialization more closely by extending java.io.Externalizable
• Flexible, but quite slow
• Leads to large serialized formats for many classes

Kryo serialization:
• Use Kryo version 2 for speedy serialization (10x) and more compactness
• Does not support all Serializable types
• Requires you to register the classes you’ll use in advance
• If set, will be used for serializing shuffle data between nodes and also serializing RDDs to disk
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
To register your own custom classes with Kryo, use the
registerKryoClasses method:
High churn
To measure GC impact:
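One common approach (per the Spark tuning guide) is to add GC logging flags to the executor JVM options; from PySpark that could look like this:

from pyspark import SparkConf

conf = SparkConf().set(
    "spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  # GC details show up in each executor's stderr log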
Parallel (aka “throughput”) collector:
- Uses multiple threads to do young gen GC
- Will default to Serial on single core machines
- Good for when a lot of work is needed and long pauses are acceptable
- Use cases: batch processing

ParallelOld collector:
- Uses multiple threads to do both young gen and old gen GC
- Also a multithreaded compacting collector
- HotSpot does compaction only in old gen

Concurrent Mark Sweep (CMS), aka the “concurrent low pause collector”:
- Tries to minimize pauses due to GC by doing most of the work concurrently with application threads
- Uses the same algorithm on young gen as the parallel collector
- Use cases: ?

Garbage First (G1):
- Available starting Java 7
- Designed to be a long term replacement for CMS
- Is a parallel, concurrent and incrementally compacting low-pause GC
[Diagram: a .collect( ) action launches Job #1, which is split into Stage 1 … Stage 5; each Stage consists of Tasks (Task #1, Task #2, Task #3, …)]
RDD Objects → DAG Scheduler → Task Scheduler → Executor

RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…) – build the operator DAG
DAG Scheduler: split the graph into stages of tasks; submit each stage as ready
Task Scheduler: launches individual tasks; retries failed or straggling tasks
Executor: execute tasks; store and serve blocks (Block manager)
[Diagram: a lineage graph of RDDs (C, D, E, F, …) connected by groupBy and join, split into stages (Stage 1, …); legend: box = RDD, filled partition = cached partition, empty partition = lost partition]
scala> input.toDebugString
res85: String =
(2) data.text MappedRDD[292] at textFile at <console>:13
 |  data.text HadoopRDD[291] at textFile at <console>:13

scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
 +-(2) MappedRDD[295] at map at <console>:17
    |  FilteredRDD[294] at filter at <console>:15
    |  MappedRDD[293] at map at <console>:15
    |  data.text MappedRDD[292] at textFile at <console>:13
    |  data.text HadoopRDD[291] at textFile at <console>:13
How do you know if a shuffle will be called on a Transformation?
- All operations that shuffle data over the network will benefit from partitioning (see the sketch below)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L302
Link
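A minimal PySpark sketch (made-up data) of pre-partitioning a pair RDD so that later key-based operations reuse the partitioning instead of re-shuffling:

kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = kv.partitionBy(8).cache()               # hash-partition once and keep it in memory
totals = partitioned.reduceByKey(lambda x, y: x + y)  # reuses the existing partitioner: no full re-shuffle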
Source: Cloudera
How many Stages will this code require?
sc.textFile("someFile.txt").
map(mapFunc).
flatMap(flatMapFunc).
filter(filterFunc).
count()
Source: Cloudera
How many Stages will this DAG require?
Source: Cloudera
How many Stages will this DAG require?
Source: Cloudera
[Diagram: a broadcast value x = 5 cached on every Executor]
• Broadcast variables – Send a large read-only lookup table to all the nodes, or
send a large feature vector in a ML algorithm to all nodes
• Accumulators – count events that occur during job execution for debugging
purposes. Example: How many lines of the input file were blank? Or how many
corrupt records were in the input dataset?
Spark supports 2 types of shared variables:
• Accumulators – allow you to aggregate values from worker nodes back to the
driver program. Can be used to count the # of errors seen in an RDD of lines
spread across 100s of nodes. Only the driver can access the value of an
accumulator; tasks cannot. For tasks, accumulators are write-only.
Broadcast variables let the programmer keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks.
Python:

broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
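A small sketch of actually using the broadcast value inside a transformation (the RDD below is made up):

numbersRDD = sc.parallelize([1, 2, 3, 4, 5])
hits = numbersRDD.filter(lambda n: n in broadcastVar.value)  # tasks read the cached copy; nothing is re-shipped per task
hits.collect()  # [1, 2, 3]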
Link
History:
[Diagram: broadcasting a 20 MB file to the Executors – the older implementation uses HTTP (each Executor fetches from the driver); the newer one uses a BitTorrent-like protocol (Executors exchange pieces with one another)]
Source: Scott Martin
Only the driver program can read an accumulator’s value, not the
tasks
Scala:

val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)
accum.value
PySpark at a Glance
[Diagram: PySpark architecture – the Python driver talks to the Spark Core Engine (Scala) through the Java API; on each Executor JVM a PythonRDD pipes data through local disk to daemon.py worker processes that run the user’s Python functions F(x), exchanging batches of pickled objects (100 KB – 1 MB each); MLlib, SQL and the shuffle stay inside the JVM (MappedRDD / RDD[Array[Byte]])]
Choose Your Python Implementation
(applies to the Spark Context on the Driver Machine and to the workers on each Worker Machine)

- CPython (default python)
- pypy: JIT, so faster; less memory; CFFI support
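A hedged sketch of pointing Spark at pypy instead of CPython; PYSPARK_PYTHON is the usual knob, and the interpreter path is hypothetical:

# in conf/spark-env.sh (or exported before launching) – hypothetical path
export PYSPARK_PYTHON=/usr/local/bin/pypy   # Python implementation used to run the workers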
rdd = sc.textFile("text")

def wordcount():
    rdd.flatMap(lambda x: x.split('/')) \
       .map(lambda x: (x, 1)) \
       .reduceByKey(lambda x, y: x + y) \
       .collectAsMap()

def sort():
    rdd.sortBy(lambda x: x, 1).count()

def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
https://github.com/apache/spark/pull/2144
100TB Daytona Sort Competition 2014
More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
- Stresses “shuffle” which underpins everything from SQL to MLlib
- OpenJDK 1.7
= sortByKey, reduceByKey (spark.shuffle.spill=false)
[Diagram: the per-machine network path – Executor → Linux kernel buffer → NIC buffer]
Map() phase:
- Entirely bounded by I/O: reading from HDFS and writing out locally sorted files (TimSort)
- Notice that each map has to keep 3 file handles open

Reduce() phase:
- Mostly network bound
- 250,000+ reducers! (= 3.6 GB)
- MergeSort! / TimSort on the fetched data

- RF = 2 on the actual final run
- Fully saturated the 10 Gbit link
...
SchemaRDD
- RDD of Row objects, each representing a record
[Diagram: the input data stream is chopped into batches of processed data – a Discretized Stream (DStream); with a batch interval of 5 seconds, the Input DStream produces blocks #1, #2, #3 during each 5 sec batch (T = 5, T = +5, …)]
Materialize!
[Diagram: each batch of linesDStream materializes as a linesRDD; applying flatMap() yields wordsDStream]
# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)
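A minimal sketch of how the rest of NetworkWordCount usually looks (the host and port are placeholders):

lines = ssc.socketTextStream("localhost", 9999)       # placeholder host/port
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()         # print a few elements of every batch

ssc.start()             # start receiving and processing
ssc.awaitTermination()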
[Diagram sequence: a Receiver (R) runs as a long-lived task inside one Executor and buffers the incoming data into blocks (block P1, P2, P3, …) every 200 ms; each block is replicated to a second Executor for fault tolerance while the Driver keeps track of them; at every batch interval the blocks are materialized into RDD partitions (RDD P1, P2, P3, …) on those Executors. With a second Receiver, each input DStream produces its own RDDs, and union() combines them into a single RDD with partitions P1 – P6.]
DStream transformations: filter( ), repartition(numPartitions), transform( RDD → RDD ), count(), countByValue(), reduceByKey( func, [numTasks] ), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])
pairs – e.g. (word, 1), (cat, 1)
runningCounts = pairs.updateStateByKey(updateFunction)
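A minimal sketch of an updateFunction for a running count per key, mirroring the standard Spark Streaming example:

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  # add this batch's counts to the running total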
- Functionality to join every batch in a data stream with another dataset is not directly exposed in the DStream API; transform( ) lets you apply arbitrary RDD-to-RDD functions (including MLlib and GraphX operations) to each batch.

[Diagram: the Original DStream as a sequence of RDDs – RDD 1, RDD 2, Batch 3, Batch 4, Batch 5, Batch 6]
reduceByWindow( func, windowLength, slideInterval )

API classes: DStream, PairDStreamFunctions, JavaDStream, JavaPairDStream
Output operations: foreachRDD( func ), saveAsTextFiles(prefix, [suffix]), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix])
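A small sketch of foreachRDD, writing each batch of the counts DStream from the earlier sketch to an external location (the output path is a placeholder):

def saveBatch(time, rdd):
    rdd.saveAsTextFile("hdfs:///streaming/wordcounts-%s" % time)  # placeholder output path

counts.foreachRDD(saveBatch)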
Next slide is only for on-site students…
http://tinyurl.com/sparknycpostsurvey