
DEVOPS ADVANCED CLASS

March 2015: Spark Summit East 2015

http://spark-summit.org/east/training/devops

www.linkedin.com/in/blueplastic
making big data simple

• Founded in late 2013


• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~50 employees
• We’re hiring! (http://databricks.workable.com)

• Level 2/3 support partnerships with


• Cloudera
• Hortonworks
• MapR
• DataStax

Databricks Cloud: "A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products."
The Databricks team contributed more than 75% of the code added to Spark in the past year
AGENDA
Before Lunch:

• History of Spark
• RDD fundamentals
• Spark Runtime Architecture – Integration with Resource Managers (Standalone, YARN)
• GUIs
• Lab: DevOps 101

After Lunch:

• Memory and Persistence
• Jobs -> Stages -> Tasks
• Broadcast Variables and Accumulators
• PySpark
• DevOps 102
• Shuffle
• Spark Streaming
Some slides will be skipped

Please keep Q&A low during class

(5pm – 5:30pm for Q&A with instructor)

2 anonymous surveys: Pre and Post class

Lunch: noon – 1pm

2 breaks (before lunch and after lunch)


Algorithms
Machines
People @

• AMPLab project was launched in Jan 2011, 6 year planned duration

• Personnel: ~65 students, postdocs, faculty & staff

• Funding from Government/Industry partnership, NSF Award, DARPA, DoE, 20+ companies

• Created BDAS, Mesos, SNAP. Upcoming projects: Succinct & Velox.

“Unknown to most of the world, the University of California, Berkeley’s AMPLab


has already left an indelible mark on the world of information technology, and
even the web. But we haven’t yet experienced the full impact of the
group[…] Not even close”
- Derrick Harris, GigaOm, Aug 2014
[Diagram: the Spark stack – Spark Core (scheduling, monitoring, distributing) with SQL / DataFrames API, Streaming, MLlib and GraphX on top, Apps above those, and Tachyon below; reads from RDBMS and any Hadoop Input Format; shipped in distributions: CDH, HDP, MapR, DSE]
[Diagram: evolution of big data engines – General Batch Processing (2004 – 2013); Specialized Systems (2007 – 2015?) for iterative, interactive, ML, streaming, graph, SQL, etc: Pregel, Giraph, Tez, Drill, Dremel, Mahout, Storm, S4, Impala, GraphLab; General Unified Engine (2014 – ?)]
Aug 2009

...in June 2013

Source: openhub.net
10x – 100x

[Diagram: relative speeds of storage, memory and network on a cluster node]
- CPUs to RAM: 10 GB/s
- Local disk: ~100–125 MB/s sequential, 3-12 ms random access, $0.05 per GB
- Local SSD: ~600 MB/s, 0.1 ms random access, $0.45 per GB
- Network to nodes in the same rack: 1 Gb/s (125 MB/s)
- Network to nodes in another rack: 0.1–1 Gb/s
“The main abstraction in Spark is that of a resilient dis-
tributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of
machines that can be rebuilt if a partition is lost.

Users can explicitly cache an RDD in memory across


machines and reuse it in multiple MapReduce-like
parallel operations.

RDDs achieve fault tolerance through a notion of


lineage: if a partition of an RDD is lost, the RDD has
enough information about how it was derived from
other RDDs to be able to rebuild just that partition.”

June 2010

http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications


that current computing frameworks handle
inefficiently: iterative algorithms and interactive data
mining tools.

In both cases, keeping data in memory can improve


performance by an order of magnitude.”

“Best Paper Award and Honorable Mention for Community Award”


- NSDI 2012

- Cited 392 times!

April 2012

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
TwitterUtils.createStream(...)  
       .filter(_.getText.contains("Spark"))  
       .countByWindow(Seconds(5))  

- The 2 Spark Streaming papers have been cited 138 times


Seamlessly mix SQL queries with Spark programs.

Coming soon!
sqlCtx  =  new  HiveContext(sc)  
results  =  sqlCtx.sql(  
   "SELECT  *  FROM  people")  
names  =  results.map(lambda  p:  p.name)  

(Will be published in the upcoming


weeks for SIGMOD 2015)
graph  =  Graph(vertices,  edges)  
messages  =  spark.textFile("hdfs://...")  
graph2  =  graph.joinVertices(messages)  {  
   (id,  vertex,  msg)  =>  ...  
}  

https://amplab.cs.berkeley.edu/wp-content/
uploads/2013/05/grades-graphx_with_fonts.pdf
https://www.cs.berkeley.edu/~sameerag/
blinkdb_eurosys13.pdf
http://shop.oreilly.com/product/0636920028512.do

PDF, ePub, Mobi, DAISY


eBook: $33.99
Print: $39.99 Shipping now!

$30 @ Amazon:
http://www.amazon.com/Learning-Spark-Lightning-
Fast-Data-Analysis/dp/1449358624
http://tinyurl.com/dsesparklab

- 102 pages

- DevOps style

- For complete beginners

- Includes:
- Spark Streaming
- Dangers of
GroupByKey vs.
ReduceByKey
http://tinyurl.com/cdhsparklab

- 109 pages

- DevOps style

- For complete beginners

- Includes:
- PySpark
- Spark SQL
- Spark-submit
Next slide is only for on-site students…
http://tinyurl.com/sparknycsurvey
(Scala & Python only)
[Diagram: a Driver Program connected to two Worker Machines; each Worker runs an Executor (Ex) holding RDD partitions]
more partitions = more parallelism

[Diagram: an RDD of item-1 … item-25 split into 5 partitions of 5 items each, spread across 3 Workers, each running an Executor that holds RDD partitions]
RDD  w/  4  partitions  

Error,  ts,  msg1   Info,  ts,  msg8   Error,  ts,  msg3   Error,  ts,  msg4  
Warn,  ts,  msg2   Warn,  ts,  msg2   Info,  ts,  msg5   Warn,  ts,  msg9  
Error,  ts,  msg1     Info,  ts,  msg8     Info,  ts,  msg5     Error,  ts,  msg1    

logLinesRDD  

An RDD can be created 2 ways:

- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));

- Take an existing in-memory collection and pass it to SparkContext's parallelize method

- Not generally used outside of prototyping and testing since it requires the entire dataset in memory on one machine
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");

- There are other methods to read data from HDFS, C*, S3, HBase, etc.
Error,  ts,  msg1   Info,  ts,  msg8   Error,  ts,  msg3   Error,  ts,  msg4  
Warn,  ts,  msg2   Warn,  ts,  msg2   Info,  ts,  msg5   Warn,  ts,  msg9  
Error,  ts,  msg1     Info,  ts,  msg8     Info,  ts,  msg5     Error,  ts,  msg1    
logLinesRDD  
(input/base  RDD)  

.filter(            )  

Error,  ts,  msg1   Error,  ts,  msg3   Error,  ts,  msg4  


   
Error,  ts,  msg1     Error,  ts,  msg1    

errorsRDD  
Error,  ts,  msg1   Error,  ts,  msg3   Error,  ts,  msg4  
   
Error,  ts,  msg1     Error,  ts,  msg1    
errorsRDD  

.coalesce(  2  )  

Error,  ts,  msg1   Error,  ts,  msg4  


Error,  ts,  msg3    
Error,  ts,  msg1     Error,  ts,  msg1    
cleanedRDD  

.collect(    )  

Driver
Execute DAG!

.collect(    )  

Driver
logLinesRDD  

.collect(    )  

Driver
logLinesRDD  
.filter(            )  

errorsRDD  

.coalesce(  2  )  

cleanedRDD  

.collect(    )  

Error,  ts,  msg1   Error,  ts,  msg4  


Error,  ts,  msg3    
Error,  ts,  msg1     Error,  ts,  msg1    

Driver
logLinesRDD  
.filter(            )  

errorsRDD  

.coalesce(  2,  shuffle=  False)  

Pipelined  
cleanedRDD  
Stage-­‐1  
.collect(    )  

data  
Driver
logLinesRDD  

errorsRDD  

cleanedRDD  

Driver
data  
Driver
logLinesRDD  

errorsRDD  

.saveToCassandra(    )   Error,  ts,  msg1   Error,  ts,  msg4  


Error,  ts,  msg3    
Error,  ts,  msg1     Error,  ts,  msg1    
cleanedRDD  

.filter(            )  

Error,  ts,  msg1    


.count(    )      
Error,  ts,  msg1     Error,  ts,  msg1    
5
errorMsg1RDD  
.collect(    )  
logLinesRDD  

errorsRDD  

.saveToCassandra(    )   Error,  ts,  msg1   Error,  ts,  msg4  


Error,  ts,  msg3    
Error,  ts,  msg1     Error,  ts,  msg1    
cleanedRDD  

.filter(            )  

Error,  ts,  msg1    


.count(    )      
Error,  ts,  msg1     Error,  ts,  msg1    
5
errorMsg1RDD  
.collect(    )  
Dataset-level view:                    Partition-level view:

logLinesRDD (HadoopRDD)                Partitions P-1, P-2, P-3, P-4
Path = hdfs://. . .                    processed by Task-1, Task-2, Task-3, Task-4

errorsRDD (filteredRDD)                Partitions P-1, P-2, P-3, P-4
func = _.contains(…)
shouldCache=false
1) Create some input RDDs from external data or parallelize a
collection in your driver program.

2) Lazily transform them to define new RDDs using


transformations like filter() or map()  

3) Ask Spark to cache() any intermediate RDDs that will need to


be reused.

4) Launch actions such as count() and collect() to kick off a


parallel computation, which is then optimized and executed
by Spark.
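Putting the four steps together, here is a minimal PySpark sketch (not from the slides; the file path and filter condition are made up for illustration):

from pyspark import SparkContext

sc = SparkContext("local[2]", "WorkflowDemo")

# 1) Create an input RDD from external data
logLinesRDD = sc.textFile("/path/to/logs.txt")

# 2) Lazily transform it to define new RDDs
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))

# 3) Ask Spark to cache() the intermediate RDD that will be reused
errorsRDD.cache()

# 4) Launch actions to kick off the parallel computation
print(errorsRDD.count())   # first action materializes and caches errorsRDD
print(errorsRDD.take(3))   # reuses the cached data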
(lazy)

map()   intersection()   cartesian()  

flatMap()   distinct()   pipe()  


 
filter()     groupByKey()   coalesce()  

mapPartitions()   reduceByKey()   repartition()  

mapPartitionsWithIndex()   sortByKey()   partitionBy()  

sample()   join()   ...  

union()   cogroup()   ...  

- Most transformations are element-wise (they work on one element at a time), but this is not
true for all transformations
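To make that distinction concrete, here is a minimal PySpark sketch (not from the slides): map() is called once per element, while mapPartitions() receives an iterator over a whole partition.

from pyspark import SparkContext

sc = SparkContext("local[2]", "TransformationsDemo")

nums = sc.parallelize([1, 2, 3, 4, 5, 6], 3)      # 3 partitions

squares = nums.map(lambda x: x * x)               # element-wise

def sum_partition(iterator):
    # runs once per partition, not once per element
    yield sum(iterator)

partialSums = nums.mapPartitions(sum_partition)

print(squares.collect())        # [1, 4, 9, 16, 25, 36]
print(partialSums.collect())    # one partial sum per partition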
reduce()   takeOrdered()  

collect()   saveAsTextFile()  

count()   saveAsSequenceFile()  

first()   saveAsObjectFile()  

take()   countByKey()  

takeSample()   foreach()  

saveToCassandra()   ...  
• HadoopRDD • DoubleRDD • CassandraRDD (DataStax)
• FilteredRDD • JdbcRDD • GeoRDD (ESRI)
• MappedRDD • JsonRDD • EsSpark (ElasticSearch)
• PairRDD • SchemaRDD
• ShuffledRDD • VertexRDD
• UnionRDD • EdgeRDD
• PythonRDD
* 1) Set of partitions (“splits”)

* 2) List of dependencies on parent RDDs

* 3) Function to compute a partition given parents

* 4) Optional preferred locations

* 5) Optional partitioning info for k/v RDDs (Partitioner)

This captures all current Spark operations!


* Partitions = one per HDFS block

* Dependencies = none

* Compute (partition) = read corresponding block

* preferredLocations (part) = HDFS block location

* Partitioner = none
* Partitions = same as parent RDD

* Dependencies = “one-to-one” on parent

* Compute (partition) = compute parent and filter it

* preferredLocations (part) = none (ask parent)

* Partitioner = none
* Partitions = One per reduce task

* Dependencies = “shuffle” on each parent

* Compute (partition) = read and join shuffled data

* preferredLocations (part) = none

* Partitioner = HashPartitioner(numTasks)
val cassandraRDD = sc
    .cassandraTable("ks", "mytable")     // Keyspace, Table
    .select("col-1", "col-3")            // server-side column selection
    .where("col-5 = ?", "blue")          // server-side row selection
                                         // (for dealing with wide rows)

Start the Spark shell by passing in a custom cassandra.input.split.size:

ubuntu@ip-10-0-53-24:~$ dse spark -Dspark.cassandra.input.split.size=2000
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
Creating SparkContext...
Created spark context..
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.

scala>

"The cassandra.input.split.size parameter defaults to 100,000. This is the approximate
number of physical rows in a single Spark partition. If you have really wide rows
(thousands of columns), you may need to lower this value. The higher the value, the
fewer Spark tasks are created. Increasing the value too much may limit the parallelism
level."
https://github.com/datastax/spark-cassandra-connector

Spark Executor

Spark-C*
Connector

C* Java Driver

- Open Source
- Implemented mostly in Scala
- Scala + Java APIs
- Does automatic type conversions
https://github.com/datastax/spark-cassandra-connector
“Simple things
should be simple,
complex things
should be possible”
- Alan Kay
DEMO:

Static Partitioning:
- Local
- Standalone Scheduler

Dynamic Partitioning:
- YARN
- Mesos
History:

[Diagram: Hadoop MapReduce v1 – a JobTracker (JT) and NameNode (NN); each machine runs a DataNode (D) and TaskTracker (TT) daemon on the OS, with Map (M) and Reduce (R) task slots]
3 options:
- local
- local[N]
- local[*]

[Diagram: local mode – a single Worker Machine runs one JVM holding the Executor + Driver, with RDD partitions (P1, P2, P3), task slots, internal threads, CPUs and a local Disk]

> ./bin/spark-shell --master local[12]

> ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar

val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("MyFirstApp")
  .set("spark.executor.memory", "3g")

val sc = new SparkContext(conf)
different spark-env.sh per machine:
- SPARK_WORKER_CORES
- SPARK_LOCAL_DIRS

[Diagram: Spark Standalone – 4 Worker machines (W), each running one Executor (Ex) holding RDD partitions and internal threads; the Driver and the Spark Master; each machine has an OS disk and SSDs]

> ./bin/spark-submit --name "SecondApp"
      --master spark://host1:port1
      myApp.jar
different spark-env.sh per machine:
- SPARK_WORKER_CORES
- SPARK_LOCAL_DIRS

[Diagram: the same Standalone cluster with multiple Spark Masters – the Master is HA via ZooKeeper, and more Masters can be added live]

> ./bin/spark-submit --name "SecondApp"
      --master spark://host1:port1,host2:port2
      myApp.jar
[Diagram: (multiple apps) – the 4 Worker machines each run one Executor per application, with one Driver per app and a single Spark Master]

[Diagram: (single app) – with multiple worker instances per machine, 8 Workers each run one Executor for the same application, driven by a single Driver and Spark Master]
conf/spark-env.sh settings:

SPARK_WORKER_INSTANCES: [default: 1] # of worker instances to run on each machine

SPARK_WORKER_CORES: [default: ALL] # of cores to allow Spark applications to use on the machine

SPARK_WORKER_MEMORY: [default: TOTAL RAM – 1 GB] Total memory to allow Spark applications to use on the machine

SPARK_DAEMON_MEMORY: [default: 512 MB] Memory to allocate to the Spark master and worker daemons themselves
Standalone settings

- Apps submitted will run in FIFO mode by default

spark.cores.max: maximum number of CPU cores to request for the application from across the cluster

spark.executor.memory: memory for each executor
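As a hedged sketch (not from the slides), the same two settings can also be set programmatically on the SparkConf; the master URL and values are illustrative:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://host1:7077")       # hypothetical standalone master
        .setAppName("MyFirstApp")
        .set("spark.cores.max", "24")          # total cores requested across the cluster
        .set("spark.executor.memory", "3g"))   # memory per executor
sc = SparkContext(conf=conf)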


[Diagram: YARN application flow (steps 1–7) – a Client submits to the Resource Manager, which starts an App Master in a container on a NodeManager; the App Master then requests containers on other NodeManagers for the application]

[Diagram: the Resource Manager (HA via ZooKeeper; contains the Scheduler and Applications Master) serving Client #1 and Client #2 – each application gets its own App Master and containers on the NodeManagers]
[Diagram: YARN client mode – the Driver runs in the Client's own process; the App Master and the Executors (holding RDD partitions) run in containers on the NodeManagers]

[Diagram: YARN cluster mode – the Driver runs inside the App Master's container on a NodeManager; Executors run in other containers]

- Does not support Spark Shells
YARN settings

--num-executors: controls how many executors will be allocated

--executor-memory: RAM for each executor

--executor-cores: CPU cores for each executor

Dynamic Allocation:

spark.dynamicAllocation.enabled
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (N)
spark.dynamicAllocation.schedulerBacklogTimeout (M)
spark.dynamicAllocation.executorIdleTimeout (K)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
YARN resource manager UI: http://<ip  address>:8088

(No apps running)


[ec2-user@ip-10-0-72-36 ~]$ spark-submit --class
org.apache.spark.examples.SparkPi --deploy-mode client --master yarn /opt/
cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-
hadoop2.5.0-cdh5.2.1.jar 10
App running in client mode
[ec2-user@ip-10-0-72-36 ~]$ spark-submit --class
org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/
cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-
hadoop2.5.0-cdh5.2.1.jar 10
App running in cluster mode
App running in cluster mode
App running in cluster mode
Spark Central Master Who starts Executors? Tasks run in

Local [none] Human being Executor

Standalone Standalone Master Worker JVM Executor

YARN YARN App Master Node Manager Executor

Mesos Mesos Master Mesos Slave Executor


spark-submit provides a uniform interface for submitting jobs across all cluster managers

bin/spark-submit --master spark://host:7077 \
                 --executor-memory 10g \
                 my_script.py

Source: Learning Spark

[Diagram: an Executor (Ex) holding RDD partitions and internal threads]

- Recommended to use at most only 75% of a machine's memory for Spark

- Minimum Executor heap size should be 8 GB

- Max Executor heap size depends… maybe 40 GB (watch GC)

- Memory usage is greatly affected by storage level and serialization format
Storage levels:

RDD.cache()  ==  RDD.persist(MEMORY_ONLY)
  - deserialized, in the JVM; most CPU-efficient option

RDD.persist(MEMORY_ONLY_SER)
  - serialized, in the JVM

.persist(MEMORY_AND_DISK)
  - deserialized in the JVM; partitions that don't fit spill to OS disk / SSD

.persist(MEMORY_AND_DISK_SER)
  - serialized in the JVM; spills to disk

.persist(DISK_ONLY)
  - on disk only

RDD.persist(MEMORY_ONLY_2)
  - deserialized, replicated in the JVMs of two nodes (Node X and Node Y)

.persist(MEMORY_AND_DISK_2)
  - deserialized + disk, replicated across two JVMs

.persist(OFF_HEAP)
  - serialized in Tachyon, shared across JVMs and applications (JVM-1 / App-1, JVM-2 / App-1, JVM-7 / App-2)

.unpersist()
  - removes the RDD from storage
- If RDD fits in memory, choose MEMORY_ONLY

- If not, use MEMORY_ONLY_SER w/ fast serialization library

- Don’t spill to disk unless functions that computed the datasets


are very expensive or they filter a large amount of data.
(recomputing may be as fast as reading from disk)

- Use replicated storage levels sparingly and only if you want fast
fault recovery (maybe to serve requests from a web app)
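A minimal PySpark sketch (not from the slides) of acting on this guidance; the path is made up, and for PySpark the data is pickled regardless of level (see the note that follows):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "PersistDemo")

errors = sc.textFile("/path/to/logs.txt") \
           .filter(lambda line: "Error" in line)

# Fits in memory: MEMORY_ONLY (this is what cache() uses)
errors.persist(StorageLevel.MEMORY_ONLY)

# If it doesn't fit deserialized, prefer a serialized in-memory level instead:
#   errors.persist(StorageLevel.MEMORY_ONLY_SER)
# Spill to disk only if recomputing is expensive:
#   errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()       # first action materializes and caches the RDD
errors.unpersist()   # release it when no longer needed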
Remember!

Intermediate data is automatically persisted during shuffle operations


=

PySpark: stored objects will always be serialized with Pickle library, so it does
not matter whether you choose a serialized level.
Default Memory Allocation in Executor JVM:

- Cached RDDs: 60% (spark.storage.memoryFraction)
- Shuffle memory: 20% (spark.shuffle.memoryFraction)
- User Programs (remainder): 20%
Spark uses memory for:

RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of
memory used when caching to a certain fraction of the JVM’s overall heap, set by
spark.storage.memoryFraction  

Shuffle and aggregation buffers: When performing shuffle operations, Spark will
create intermediate buffers for storing shuffle output data. These buffers are used to
store intermediate results of aggregations in addition to buffering data that is going
to be directly output as part of the shuffle.

User code: Spark executes arbitrary user code, so user functions can themselves
require substantial memory. For instance, if a user application allocates large arrays
or other objects, these will contend for overall memory usage. User code has access
to everything "left" in the JVM heap after the space for RDD storage and shuffle
storage are allocated.
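A hedged sketch (not from the slides) of adjusting those two fractions; 0.6 and 0.2 are the defaults shown above, the other values are illustrative:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MemoryFractionsDemo")
        .set("spark.executor.memory", "8g")
        .set("spark.storage.memoryFraction", "0.5")    # cap for cached RDDs (default 0.6)
        .set("spark.shuffle.memoryFraction", "0.3"))   # cap for shuffle buffers (default 0.2)
sc = SparkContext(conf=conf)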
1. Create an RDD

2. Put it into cache

3. Look at the SparkContext logs on the driver program or the Spark UI

The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD:

INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)
Serialization is used when:

Transferring data over the network

Spilling data to disk

Caching to memory serialized

Broadcasting variables
Java serialization vs. Kryo serialization

Java serialization:
• Uses Java's ObjectOutputStream framework
• Works with any class you create that implements java.io.Serializable
• You can control the performance of serialization more closely by extending java.io.Externalizable
• Flexible, but quite slow
• Leads to large serialized formats for many classes

Kryo serialization:
• Recommended serialization for production apps
• Use Kryo version 2 for speedy serialization (10x) and more compactness
• Does not support all Serializable types
• Requires you to register the classes you'll use in advance
• If set, will be used for serializing shuffle data between nodes and also serializing RDDs to disk

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
To register your own custom classes with Kryo, use the
registerKryoClasses method:

val conf = new SparkConf().setMaster(...).setAppName(...)


conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

- If your objects are large, you may need to increase


spark.kryoserializer.buffer.mb config property

- The default is 2, but this value needs to be large enough to


hold the largest object you will serialize.
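A hedged sketch (not from the slides) of setting these properties from PySpark; Kryo here affects JVM-side serialization (shuffle data, RDDs written to disk), while Python objects themselves are still pickled, as noted earlier:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("KryoDemo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.mb", "64"))   # raise from the default of 2
                                                        # if you serialize large objects
sc = SparkContext(conf=conf)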
[Diagram: two Executors holding RDDs – one with high churn of Java objects, one with low churn]

Cost of GC is proportional to the # of Java objects

(so use an array of Ints instead of a LinkedList)

[Diagram: an Executor with high churn]
To measure GC impact:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
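A hedged sketch (not from the slides) of passing those flags to every executor via spark.executor.extraJavaOptions; the GC output then appears in each executor's stdout/stderr logs:

from pyspark import SparkConf, SparkContext

gc_flags = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

conf = (SparkConf()
        .setAppName("GCLoggingDemo")
        .set("spark.executor.extraJavaOptions", gc_flags))
sc = SparkContext(conf=conf)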


Parallel GC (-XX:+UseParallelGC, -XX:ParallelGCThreads=<#>)
- Uses multiple threads to do young gen GC
- Will default to Serial on single core machines
- Aka "throughput collector"
- Good for when a lot of work is needed and long pauses are acceptable
- Use cases: batch processing

Parallel Old GC (-XX:+UseParallelOldGC)
- Uses multiple threads to do both young gen and old gen GC
- Also a multithreading compacting collector
- HotSpot does compaction only in old gen

CMS GC (-XX:+UseConcMarkSweepGC, -XX:ParallelCMSThreads=<#>)
- Concurrent Mark Sweep aka "Concurrent low pause collector"
- Tries to minimize pauses due to GC by doing most of the work concurrently with application threads
- Uses same algorithm on young gen as parallel collector
- Use cases:

G1 GC (-XX:+UseG1GC)
- Garbage First is available starting Java 7
- Designed to be long term replacement for CMS
- Is a parallel, concurrent and incrementally compacting low-pause GC
?
.collect( )

[Diagram: a Job is broken into Stages, and each Stage into Tasks – Job #1 -> Stage 1 … Stage 5 -> Task #1, Task #2, Task #3, …]

Scheduling pipeline:

RDD Objects  (rdd1.join(rdd2).groupBy(…).filter(…))
  - Build operator DAG

DAG Scheduler  (DAG -> TaskSet)
  - Split graph into stages of tasks
  - Submit each stage as ready
  - Agnostic to operators

Task Scheduler  (TaskSet -> Task)
  - Launches individual tasks
  - Retry failed or straggling tasks
  - Doesn't know about stages

Executor  (Task threads, Block manager)
  - Execute tasks
  - Store and serve blocks
“One of the challenges in providing RDDs as an abstraction is
choosing a representation for them that can track lineage across a
wide range of transformations.”

“The most interesting question in designing this interface is how to


represent dependencies between RDDs.”

“We found it both sufficient and useful to classify dependencies


into two types:
• narrow dependencies, where each partition of the parent RDD
is used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may
depend on it.”
Requires shuffle

[Figure: examples of narrow and wide dependencies – Stage 1 (groupBy: A -> B), Stage 2 (map, filter over C, D, E), Stage 3 (join producing F). Each box is an RDD, with partitions shown as shaded rectangles; the legend marks cached and lost partitions]

Dependencies: Narrow vs Wide

“This distinction is useful for two reasons:

1) Narrow dependencies allow for pipelined execution on one cluster node,


which can compute all the parent partitions. For example, one can apply a
map followed by a filter on an element-by-element basis.

In contrast, wide dependencies require data from all parent partitions to be


available and to be shuffled across the nodes using a MapReduce-like
operation.

2) Recovery after a node failure is more efficient with a narrow dependency, as


only the lost parent partitions need to be recomputed, and they can be
recomputed in parallel on different nodes. In contrast, in a lineage graph with
wide dependencies, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution.”
To display the lineage of an RDD, Spark provides a toDebugString method:

scala>  input.toDebugString  
 
res85:  String  =  
(2)  data.text  MappedRDD[292]  at  textFile  at  <console>:13  
 |  data.text  HadoopRDD[291]  at  textFile  at  <console>:13  
 
scala>  counts.toDebugString  
res84:  String  =  
(2)  ShuffledRDD[296]  at  reduceByKey  at  <console>:17  
 +-­‐(2)  MappedRDD[295]  at  map  at  <console>:17  
       |  FilteredRDD[294]  at  filter  at  <console>:15  
       |  MappedRDD[293]  at  map  at  <console>:15  
       |  data.text  MappedRDD[292]  at  textFile  at  <console>:13  
       |  data.text  HadoopRDD[291]  at  textFile  at  <console>:13  
How do you know if a shuffle will be called on a Transformation?

- repartition , join, cogroup, and any of the *By or *ByKey transformations


can result in shuffles

- If you declare a numPartitions parameter, it’ll probably shuffle

- If a transformation constructs a shuffledRDD, it’ll probably shuffle

- combineByKey calls a shuffle (so do other transformations like


groupByKey, which actually end up calling combineByKey)

Note that repartition just calls coalesce w/ shuffle = true:

RDD.scala:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = {
  coalesce(numPartitions, shuffle = true)
}
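A minimal PySpark sketch (not from the slides) showing the difference; toDebugString() on the repartitioned RDD shows the extra shuffle in its lineage:

from pyspark import SparkContext

sc = SparkContext("local[4]", "ShuffleDemo")
rdd = sc.parallelize(range(1000), 8)

narrowed = rdd.coalesce(2)       # narrow dependency, no shuffle
widened = rdd.repartition(16)    # coalesce(16, shuffle=True) under the hood

print(narrowed.toDebugString())
print(widened.toDebugString())   # lineage shows the extra shuffle stage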
How do you know if a shuffle will be called on a Transformation?

Transformations that use “numPartitions” like distinct will probably shuffle:

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
- An extra parameter you can pass a k/v transformation to let Spark know
that you will not be messing with the keys at all

- All operations that shuffle data over network will benefit from partitioning

- Operations that benefit from partitioning:


cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey,
combineByKey, lookup, . . .

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L302
Link
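A hedged PySpark sketch (not from the slides) of the idea: hash-partition a key/value RDD once with partitionBy() and cache it, so the key-based operations listed above can work against a known partitioning (the reuse of the partitioner is most direct in the Scala API; in PySpark the benefit depends on the version):

from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionByDemo")

userData = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")]) \
             .partitionBy(8) \
             .cache()          # keep the partitioned RDD around for repeated use

events = sc.parallelize([(1, "click"), (3, "view"), (1, "view")])

print(userData.join(events).collect())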

Source: Cloudera
Source: Cloudera
How many Stages will this code require?

sc.textFile("someFile.txt").
map(mapFunc).
flatMap(flatMapFunc).
filter(filterFunc).
count()

Source: Cloudera
How many Stages will this DAG require?

Source: Cloudera
How many Stages will this DAG require?

Source: Cloudera
[Diagram: the driver broadcasts x=5 once; each Executor keeps its own local read-only copy]
• Broadcast variables – Send a large read-only lookup table to all the nodes, or
send a large feature vector in a ML algorithm to all nodes

• Accumulators – count events that occur during job execution for debugging
purposes. Example: How many lines of the input file were blank? Or how many
corrupt records were in the input dataset?

+ + +
Spark supports 2 types of shared variables:

• Broadcast variables – allows your program to efficiently send a large, read-only


value to all the worker nodes for use in one or more Spark operations. Like
sending a large, read-only lookup table to all the nodes.

• Accumulators – allows you to aggregate values from worker nodes back to the
driver program. Can be used to count the # of errors seen in an RDD of lines
spread across 100s of nodes. Only the driver can access the value of an
accumulator, tasks cannot. For tasks, accumulators are write-only.

+ + +
Broadcast variables let programmer keep a read-
only variable cached on each machine rather than
shipping a copy of it with tasks

For example, to give every node a copy of a large


input dataset efficiently

Spark also attempts to distribute broadcast variables


using efficient broadcast algorithms to reduce
communication cost
Scala:
val  broadcastVar  =  sc.broadcast(Array(1,  2,  3))  
broadcastVar.value  

Python:
broadcastVar  =  sc.broadcast(list(range(1,  4)))  
broadcastVar.value  
Link
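A minimal PySpark sketch (not from the slides) of the typical use: ship a small lookup table once and reference it inside tasks via .value:

from pyspark import SparkContext

sc = SparkContext("local[2]", "BroadcastDemo")

countryNames = {"US": "United States", "NO": "Norway", "IN": "India"}
broadcastVar = sc.broadcast(countryNames)

codes = sc.parallelize(["US", "IN", "US", "NO"])
fullNames = codes.map(lambda c: broadcastVar.value.get(c, "unknown"))

print(fullNames.collect())   # ['United States', 'India', 'United States', 'Norway']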
History:

[Diagram: the original broadcast implementation used HTTP – the driver served the whole 20 MB file to each Executor]

[Diagram: the current implementation uses a BitTorrent-like protocol – the 20 MB file is split into 4 MB pieces that Executors exchange among themselves]

Source: Scott Martin
+ + +

Accumulators are variables that can only be “added” to through


an associative operation

Used to implement counters and sums, efficiently in parallel

Spark natively supports accumulators of numeric value types and


standard mutable collections, and programmers can extend
for new types

Only the driver program can read an accumulator’s value, not the
tasks
Scala:
+ + +
val accum = sc.accumulator(0)

sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value

Python:

accum  =  sc.accumulator(0)  
rdd  =  sc.parallelize([1,  2,  3,  4])  
def  f(x):  
     global  accum  
     accum  +=  x  
 
rdd.foreach(f)  
 
accum.value  
PySpark at a Glance

- Write Spark jobs in Python
- Run interactive jobs in the shell
- Supports C extensions

41 files
8,100 loc
6,300 comments

[Diagram: PySpark sits on top of the Java API, which sits on the Spark Core Engine (Scala), which runs on Local, Standalone Scheduler, YARN or Mesos]

[Diagram: on the Driver Machine, the Python SparkContext controls the JVM SparkContext over a Py4J socket; on each Worker Machine the Executor JVM (running MLlib, SQL, shuffle) pipes data and functions F(x) to daemon.py Python worker processes, spilling to Local Disk as needed]

Data is stored as Pickled objects in an RDD[Array[Byte]]
(100 KB – 1 MB each pickled object)

HadoopRDD -> MappedRDD -> PythonRDD
Choose Your Python Implementation

[Diagram: the Python implementation is chosen on both the Driver Machine and the Worker Machines]

CPython (default python)

pypy
• JIT, so faster
• less memory
• CFFI support

$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/pyspark

OR

$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py


The performance speed up will depend on work load (from 20% to 3000%).

Here are some benchmarks:

Job CPython 2.7 PyPy 2.3.1 Speed up


Word Count 41 s 15 s 2.7 x
Sort 46 s 44 s 1.05 x
Stats 174 s 3.6 s 48 x

Here is the code used for benchmark:

rdd = sc.textFile("text")

def wordcount():
    rdd.flatMap(lambda x: x.split('/')) \
       .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collectAsMap()

def sort():
    rdd.sortBy(lambda x: x, 1).count()

def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()

https://github.com/apache/spark/pull/2144
100TB Daytona Sort Competition 2014

Spark sorted the same data 3X faster


using 10X fewer machines
than Hadoop MR in 2013.

All the sorting took place on disk (HDFS) without


using Spark’s in-memory cache!

More info:

http://sortbenchmark.org

http://databricks.com/blog/2014/11/05/spark-
officially-sets-a-new-record-in-large-scale-sorting.html

Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
- Stresses “shuffle” which underpins everything from SQL to Mllib

- Sorting is challenging b/c there is no reduction in data

- Sort 100 TB = 500 TB disk I/O and 200 TB network

Engineering Investment in Spark:

- Sort-based shuffle (SPARK-2045)


- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)

Clever Application level Techniques:

- GC and cache friendly memory layout


- Pipelining
EC2: i2.8xlarge (206 workers)

- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD and RAID 0 setup formatted with /ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- 32 slots per machine
- 6,592 slots total
- Speculative Execution off

- Increased Locality Wait to infinite

- Compression turned off for input, output & network

- Used Unsafe to put all the data off-heap and managed


it manually (i.e. never triggered the GC)
groupByKey

= sortByKey

reduceByKey
spark.shuffle.spill=false

(Affects reducer side and keeps all the data in memory)


- Must turn this on for dynamic allocation in YARN
- Worker JVM serves files
- Node Manager serves files
- Was slow because it had to copy the data 3 times

[Diagram: the Executor reads the map output file from the local dir into a Linux kernel buffer, then a user-space buffer, then the NIC buffer]

- Uses a technique called zero-copy
- Is a map-side optimization to serve data very quickly to requesting reducers

[Diagram: with zero-copy, the map output file on the local dir goes to the NIC buffer directly]
[Diagram: < 10,000 reducers (= 5 blocks) – Map() tasks TimSort their output and Reduce() tasks fetch it]
- Entirely bounded by I/O: reading from HDFS and writing out locally sorted files
- Notice that each map has to keep 3 file handles open
- Mostly network bound

[Diagram: 250,000+ reducers! (= 3.6 GB, 28,000 unique blocks, RF = 2)]
- Only one file handle open at a time

[Diagram: 250,000+ reducers! (28,000 unique blocks, RF = 2) – 5 waves of maps and 5 waves of reduces, with MergeSort on the map output and TimSort on the reduce side]

- Actual final run
- Fully saturated the 10 Gbit link

Sustaining 1.1GB/s/node during shuffle


Link
UserID Name Age Location Pet
28492942 John Galt 32 New York Sea Horse
95829324 Winston Smith 41 Oceania Ant
92871761 Tom Sawyer 17 Mississippi Raccoon
37584932 Carlos Hinojosa 33 Orlando Cat
73648274 Luis Rodriguez 34 Orlando Dogs
JDBC/ODBC      Your App      ...

SchemaRDD
- RDD of Row objects, each representing a record

- Row objects = type + col. name of each

- Stores data very efficiently by taking advantage of the schema

- SchemaRDDs are also regular RDDs, so you can run


transformations like map() or filter()

- Allows new operations, like running SQL on objects


https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Warning!

Only looks at first row


Link
TwitterUtils.createStream(...)  
       .filter(_.getText.contains("Spark"))  
       .countByWindow(Seconds(5))  
Spark Streaming is:
- Scalable
- High-throughput
- Fault-tolerant

Input sources: TCP socket, Kafka, Flume, HDFS, S3, Kinesis, Twitter
Output sinks: HDFS, Cassandra, Dashboards, Databases

Complex algorithms can be expressed using:
- Spark transformations: map(), reduce(), join(), etc
- MLlib + GraphX
- SQL
Batch Realtime

One unified API


Tathagata Das (TD)
- Lead developer of Spark Streaming + Committer on
Apache Spark core

- Helped re-write Spark Core internals in 2012 to


make it 10x faster to support Streaming use cases

- On leave from UC Berkeley PhD program

- Ex: Intern @ Amazon, Intern @ Conviva, Research


Assistant @ Microsoft Research India

- 1 guy; does not scale

- Scales to 100s of nodes

- Batch sizes as small at half a second

- Processing latency as low as 1 second

- Exactly-once semantics no matter what fails


(live statistics)

Page views Kafka for buffering Spark for processing


(Anomaly Detection)
Smart meter readings

Join 2 live data


sources

Live weather data


Batches every X seconds

Batches of
Input data stream processed data
(Discretized Stream)
Batch interval = 5 seconds

T=5 T = +5
Input
DStream
Block #1 Block #2 Block #3 Block #1 Block #2 Block #3

RDD @ T=5 RDD @ T=+5

One RDD is created every 5 seconds


linesDStream

Block #1 Block #2 Block #3

5 sec
Materialize!

linesDStream

Part. #1 Part. #2 Part. #3

linesRDD
flatMap()

wordsDStream

Part. #1 Part. #2 Part. #3


wordsRDD
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)

# Create a DStream that will connect to hostname:port, like localhost:9999
linesDStream = ssc.socketTextStream("localhost", 9999)

# Split each line into words
wordsDStream = linesDStream.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairsDStream = wordsDStream.map(lambda word: (word, 1))
wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCountsDStream.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

(DStream lineage: linesDStream -> wordsDStream -> pairsDStream -> wordCountsDStream)
Terminal #1 Terminal #2

$ nc -lk 9999 $ ./network_wordcount.py localhost 9999

hello world ...


--------------------------
Time: 2015-04-25 15:25:21
--------------------------
(hello, 2)
(world, 1)
Batch interval = 600 ms

[Diagram sequence: a Receiver task (R) runs inside one Executor and buffers incoming data into blocks (block P1, P2, P3) every 200 ms, replicating each block to a second Executor; the Driver coordinates. When the 600 ms batch interval ends, the blocks are materialized as the partitions of that interval's RDD (RDD P1, P2, P3) and regular tasks (T) process them across the Workers]

[Diagram sequence: with 2 input DStreams, two Receivers run on different Executors, each buffering and replicating its own blocks; at the batch boundary each input DStream materializes its own RDD, and the two RDDs can then be unioned into a single RDD with partitions P1–P6]

Sources directly available in the StreamingContext API: File systems, Socket Connections, Akka Actors

Requires linking against extra dependencies: Kafka, Flume, Twitter

Requires implementing a user-defined receiver: Anywhere
DStream transformations:

map( )                         flatMap( )                        filter( )
repartition(numPartitions)     union(otherStream)                count()
reduce( )                      countByValue()                    reduceByKey( , [numTasks])
join(otherStream, [numTasks])  cogroup(otherStream, [numTasks])
transform( )                   updateStateByKey( ) *

pairs: e.g. (word, 1), (cat, 1)

updateStateByKey( ) * : allows you to maintain arbitrary state while continuously updating it with new information.

To use:

1) Define the state (an arbitrary data type)

2) Define the state update function (specify with a function how to update the state using the previous state and new values from the input stream)

Example – to maintain a running count of each word seen in a text data stream (here the running count is an integer type of state):

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    # add the new values to the previous running count to get the new count
    return sum(newValues, runningCount)

runningCounts = pairs.updateStateByKey(updateFunction)

* Requires a checkpoint directory to be configured
transform( ) : can be used to apply any RDD operation that is not exposed in the DStream API (including MLlib and GraphX operations).

For example:

- Functionality to join every batch in a data stream with another dataset is not directly exposed in the DStream API.

- If you want to do real-time data cleaning by joining the input data stream with pre-computed spam information and then filtering based on it:

spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd:
    rdd.join(spamInfoRDD).filter(...))
Window Length: * 3 time units
Sliding Interval: * 2 time units

* Both of these must be multiples of the batch interval of the source DStream

[Diagram: over time 1 … time 6, the original DStream produces one RDD per batch; the windowed DStream produces an RDD every 2 time units (RDD @ 3, RDD @ 5), each covering the last 3 batches]

# Reduce last 30 seconds of data, every 10 seconds


windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
window(windowLength, slideInterval)
countByValueAndWindow(windowLength, slideInterval, [numTasks])

countByWindow(windowLength, slideInterval) API Docs

- DStream
- PairDStreamFunctions
reduceByWindow( , windowLength, slideInterval)
- JavaDStream
- JavaPairDStream

reduceByKeyAndWindow( , windowLength, slideInterval,[numTasks]) - DStream

reduceByKeyAndWindow( , , windowLength, slideInterval,[numTasks])


print()

foreachRDD( )
saveAsTextFile(prefix, [suffix])

saveAsObjectFiles(prefix, [suffix])

saveAsHadoopFiles(prefix, [suffix])
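A hedged PySpark Streaming sketch (not from the slides) of foreachRDD(), the most general output operation: it hands you each micro-batch as an ordinary RDD so you can push it to any sink (the sink here is just print, as a stand-in):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "OutputOpsDemo")
ssc = StreamingContext(sc, 5)

counts = (ssc.socketTextStream("localhost", 9999)
             .flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda x, y: x + y))

def send_partition(records):
    # hypothetical sink: open one connection per partition here in a real job
    for word, count in records:
        print("%s: %d" % (word, count))

counts.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()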
Next slide is only for on-site students…
http://tinyurl.com/sparknycpostsurvey
