
Spark Details

Spark is an expressive computing system that facilitates in-memory computing to avoid storing intermediate results to disk. It introduces the RDD abstraction of partitioned and distributed datasets that can be cached in memory across a cluster. RDDs support transformations and actions, where transformations build new RDDs and actions trigger execution by returning values or exporting data. Jobs are executed through a DAG scheduler and task scheduler to optimize partitioning and execution across worker nodes.


Spark

Spark ideas
• expressive computing system, not limited to
the map-reduce model
• exploit main memory
– avoid saving intermediate results to disk
– cache data for repetitive queries (e.g. for machine
learning; see the sketch below)
• compatible with Hadoop
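
A minimal sketch of the caching idea: a hypothetical gradient-descent-style loop that reuses one dataset many times (assumes spark-shell, where `sc` is already a SparkContext; the HDFS path is a placeholder):

// cache once, then iterate over the in-memory data
val points = sc.textFile("hdfs://...")
  .map(_.split(",").map(_.toDouble))
  .cache()
var w = 0.0
for (i <- 1 to 10) {
  // each pass reads points from memory instead of re-reading HDFS
  w -= 0.1 * points.map(p => (w * p(0) - p(1)) * p(0)).mean()
}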
RDD abstraction
• Resilient Distributed Datasets
• partitioned collection of records
• spread across the cluster
• read-only
• can be cached in memory
– different storage levels available (see the sketch below)
– fallback to disk possible
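
A short sketch of the storage levels (assumes `sc` from spark-shell; paths are placeholders as in the slides):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
// MEMORY_ONLY is the default level used by cache()
lines.persist(StorageLevel.MEMORY_ONLY)
// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
val lower = lines.map(_.toLowerCase)
lower.persist(StorageLevel.MEMORY_AND_DISK)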
RDD operations
• transformations build new RDDs through
deterministic operations on other RDDs
– transformations include map, filter, join
– lazy: nothing is computed until an action runs
• actions return a value or export data
– actions include count, collect, save
– actions trigger execution (see the sketch below)
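
A sketch of laziness vs. execution (placeholder paths, `sc` as above):

// transformations only build new RDDs; nothing runs yet
val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
val long = words.filter(_.length > 5)
// actions trigger execution
long.count()                      // returns a value to the driver
long.saveAsTextFile("hdfs://...") // exports data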
Job example
Driver:
val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
// actions trigger execution:
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()

[figure: three workers, each holding a cached partition of errors (Cache 1, 2, 3) next to its HDFS block (Block 1, 2, 3)]


RDD partition-level view
Dataset-level view:
– log: HadoopRDD (path = hdfs://...)
– errors: FilteredRDD (func = _.contains(…), shouldCache = true)
Partition-level view:
– one task per partition: Task 1, Task 2, ...

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
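
The lineage from the figure can be inspected in the interpreter; a small sketch (the exact RDD class names printed, e.g. FilteredRDD, depend on the Spark version):

val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
println(errors.toDebugString)     // lineage: a filtered RDD on top of a HadoopRDD
println(errors.partitions.length) // by default one partition per HDFS block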
Job scheduling
RDD Objects → DAGScheduler → TaskScheduler → Worker

• RDD Objects: build the operator DAG, e.g.
rdd1.join(rdd2).groupBy(…).filter(…)
• DAGScheduler: split the graph into stages of tasks;
submit each stage as ready
• TaskScheduler: launch tasks (as a TaskSet) via the cluster manager;
retry failed or straggling tasks
• Worker: execute tasks in threads;
store and serve blocks via the block manager

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
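
The chain from the figure, written as runnable code (the input pair RDDs are made up for illustration); each shuffle (join, groupBy) starts a new stage, while filter stays inside its stage:

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")))
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")))
val result = rdd1.join(rdd2)              // shuffle: stage boundary
  .groupBy { case (k, _) => k % 2 }       // shuffle: another stage
  .filter { case (_, g) => g.nonEmpty }   // narrow: same stage
result.collect()                          // action: DAGScheduler submits the stages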
Available APIs
• You can write in Java, Scala or Python
• interactive interpreter: Scala & Python only
• standalone applications: any of the three (see the sketch below)
• performance: Java & Scala are faster thanks to
static typing
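
A minimal standalone application in Scala (Spark 1.x style, matching the spark-submit example later; the object name and argument handling are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // classic word count: input path in args(0), output path in args(1)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}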
Hands-on – interpreter
• script
http://cern.ch/kacper/spark.txt

• run the Scala Spark interpreter

$ spark-shell

• or the Python interpreter
$ pyspark
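
The exercises themselves are in the script linked above; as a first sanity check, a session might look like this (illustrative only):

scala> val nums = sc.parallelize(1 to 1000)
scala> nums.filter(_ % 2 == 0).count()
res0: Long = 500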
Hands-on – build and submission
• download and unpack the source code
wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
• build definition in
GvaWeather/gvaweather.sbt (a sketch follows after this list)

• source code
GvaWeather/src/main/scala/GvaWeather.scala

• building
cd GvaWeather
sbt package

• job submission
spark-submit --master local --class GvaWeather \
target/scala-2.10/gva-weather_2.10-1.0.jar
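
For reference, a typical gvaweather.sbt for this layout might look as follows; this is a hypothetical reconstruction (the real file ships in the tarball), with versions guessed from the scala-2.10 jar name:

// hypothetical build definition; exact versions may differ
name := "GVA Weather"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"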
Summary
• concept not limited to single-pass map-reduce
• avoid storing intermediate results on disk or
HDFS
• speed up computations when reusing datasets
