Spark and Dask
https://open.spotify.com/album/4sfzcZcVe2oIMkePq1ULTQ
Highlights
● Spark concepts
○ Built around a data structure: Resilient Distributed Datasets (RDDs)
○ Programming model is MapReduce + extensions
In-memory processing
Programming model: RDDs
RDD abstraction
Functional programming API
● Transformations:
○ Return RDDs
○ Are parametrized by user-provided functions
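For instance, in a minimal PySpark sketch (the values are made up), each transformation takes a user-provided function and returns a new RDD:

```python
# Minimal PySpark sketch: transformations take user-provided functions
# and return new RDDs (nothing is computed yet at this point).
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize(range(10))                 # an RDD of 0..9
squares = nums.map(lambda x: x * x)              # transformation -> new RDD
big_squares = squares.filter(lambda x: x > 10)   # transformation -> new RDD
```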
RDD transformations and actions
● Transformations return an RDD; actions return something else.
● (K, V) denotes a key-value pair.
● MapReduce = flatMap + reduceByKey
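For example, a word count in PySpark makes this correspondence concrete (a sketch; the input path is an assumption): flatMap plays the role of the map phase, reduceByKey the role of shuffle + reduce, and collect is an action that returns a value to the driver.

```python
# Word count sketch: MapReduce expressed as flatMap + reduceByKey.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

counts = (sc.textFile("hdfs:///data/input.txt")      # assumed input path
            .flatMap(lambda line: line.split())       # emit words ("map" phase)
            .map(lambda word: (word, 1))              # build (K, V) pairs
            .reduceByKey(lambda a, b: a + b))         # shuffle + "reduce" phase

print(counts.collect()[:10])                          # action: returns a Python list
```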
Lazy evaluation
Lazy evaluation (2)
Computations are costly and memory is limited → Compute only what is necessary
Note: with eager evaluation, the entire file must be read and filtered. Lazy evaluation limits the amount of data that needs to be processed.
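A sketch of this point (the log path is an assumption): because transformations are lazy, the action decides how much data actually flows through the pipeline.

```python
# Lazy evaluation sketch: filter() only records what to do; first() pulls
# data through the filter only until one matching line is found.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

errors = sc.textFile("hdfs:///logs/app.log") \
           .filter(lambda line: line.startswith("ERROR"))   # no work done yet

print(errors.first())   # action: stops as soon as one matching line is produced
```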
Memory management
Example: Log Mining
● The first action on an RDD triggers execution of its lineage.
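The slide's code is not reproduced in this text; the classic log-mining example from the RDD paper looks roughly as follows (the path is an assumption). The first action, count(), triggers execution of the whole lineage; persist() keeps the filtered errors in RAM so that follow-up queries reuse them.

```python
# Log-mining sketch: the first action triggers execution; persist() keeps
# the filtered RDD in memory for subsequent queries.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.textFile("hdfs:///logs/app.log")                      # assumed path
errors = lines.filter(lambda l: l.startswith("ERROR")).persist()

print(errors.count())                                   # action: triggers execution
print(errors.filter(lambda l: "timeout" in l).count())  # reuses cached 'errors'
```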
RDDs vs Distributed Shared Memory
Spark programming interface
● Driver
○ Runs the user program
○ Defines RDDs and invokes actions
○ Tracks the lineage
● Workers
○ Long-lived processes
○ Store RDD partitions in RAM across operations
RDD representation
● HDFS file:
○ partitions(): one partition per HDFS block
○ preferredLocations(p): node where the block is stored
○ iterator(...): reads the block
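A rough Python rendering of that common interface (a sketch: the real implementation is in Scala and reads HDFS; here a local file split into fixed-size chunks stands in for HDFS blocks):

```python
# Simplified sketch of the RDD interface for an input "file" RDD; a local
# file cut into fixed-size chunks stands in for HDFS blocks.
import os

class FileBackedRDD:
    CHUNK = 128 * 1024 * 1024                      # stand-in for the HDFS block size

    def __init__(self, path):
        self.path = path

    def partitions(self):
        # one partition per "block" of the file
        nblocks = (os.path.getsize(self.path) + self.CHUNK - 1) // self.CHUNK
        return list(range(nblocks))

    def preferredLocations(self, p):
        # nodes where partition p is stored (here: only the local machine)
        return ["localhost"]

    def iterator(self, p):
        # read the records (lines) of partition p
        with open(self.path, "rb") as f:
            f.seek(p * self.CHUNK)
            return f.read(self.CHUNK).splitlines()

    def dependencies(self):
        return []                                   # an input RDD has no parents
```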
Narrow and wide dependencies between RDDs
● Narrow dependency
○ Each parent partition has at most one child partition
○ Examples: map, filter
○ Can be pipelined on one node
○ Minimal impact of failures
● Wide dependency
○ A parent partition has more than one child partition
○ Examples: join, groupByKey
○ Requires a data shuffle across nodes
○ High impact of failures
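A PySpark sketch (values are made up): the map/filter chain is pipelined within one stage, while groupByKey introduces a shuffle and hence a stage boundary, visible in the lineage dump.

```python
# Narrow vs wide dependencies in a tiny PySpark job.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10)) \
              .filter(lambda kv: kv[1] > 0)       # narrow: pipelined in one stage
wide = narrow.groupByKey()                        # wide: shuffle, new stage

print(wide.toDebugString().decode())              # lineage with the shuffle boundary
```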
Job Scheduling
When the user runs an action on an RDD:
● The scheduler builds a DAG of stages from the RDD lineage graph.
○ Stages contain as many narrow dependencies as possible.
○ Stage boundaries are shuffle operations or partitions that are already in memory.
● The scheduler then launches tasks.
Scheduling (2)
● For shuffles
○ Intermediate records are spilled to disk on the nodes holding the parent partitions (why?)
○ Like in MapReduce!
Fault tolerance
● Checkpointing
○ Lineage-based recovery of RDDs may be time-consuming (why?)
○ RDDs can be checkpointed to stable storage, through a flag to .persist()
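In the current PySpark API, this is exposed as an explicit checkpoint() call plus a checkpoint directory rather than a persist flag; a minimal sketch, with an assumed directory:

```python
# Checkpointing sketch: saving an RDD to stable storage truncates its lineage,
# so recovery does not have to recompute the whole chain.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setCheckpointDir("hdfs:///checkpoints")         # assumed stable-storage path

squares = sc.parallelize(range(1000)).map(lambda x: x * x)
squares.checkpoint()              # requested now, performed at the next computation
squares.count()                   # first action: computes and writes the checkpoint
print(squares.isCheckpointed())   # True once the checkpoint has been written
```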
Evaluation: Iterative Machine Learning
● Applications
○ kmeans (more compute)
○ Logistic regression (less compute)
● Execution conditions
○ 10 iterations, 100GB dataset
○ 25-100 machines (Amazon EC2, m1.xlarge, 15GB RAM, 4 cores)
○ HDFS with 256MB blocks
● Systems
○ Hadoop (MapReduce), HadoopBinMem
○ Spark RDDs
Results
● First iteration:
○ Spark has less overhead than Hadoop (due to better internal implementation)
○ HadoopBinMem had to convert data to binary
● Subsequent iterations
○ Logistic regression: Spark 25x faster than Hadoop (why?)
○ kmeans: Spark 3x faster (why?)
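The speedups come from the persisted RDD: only the first iteration reads and parses the input from HDFS, and every later iteration scans the cached in-memory partitions; k-means gains less because its per-iteration compute dominates. A minimal sketch of the logistic-regression workload (input path and format are assumptions):

```python
# Iterative logistic regression over a cached RDD (sketch; assumed format:
# whitespace-separated features followed by a label on each line).
import numpy as np
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def parse_point(line):
    vals = np.array([float(v) for v in line.split()])
    return vals[:-1], vals[-1]                     # (features, label)

points = sc.textFile("hdfs:///data/points.txt").map(parse_point).persist()

D = len(points.first()[0])                         # number of features
w = np.zeros(D)
for _ in range(10):                                # 10 iterations, as in the benchmark
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[1] * p[0].dot(w))) - 1.0) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient
# Only the first pass pays for reading/parsing; later passes hit the cache.
```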
Fault recovery
Behavior with insufficient memory
Impact of persistence
Partitioning
Spark DataFrames
Other libraries
Use of libraries
Recap
● General framework (increases usability and performance)
● Rich programming model (transformations, actions, libraries)
● Data locality (as in MapReduce)
● RDDs remain in memory (reduces disk I/O)
● Lazy RDD evaluation (allows transformations to be merged)
● Fault tolerance through lineage (better than data replication)
Introduction
● Dask
○ Leverages blocked algorithms and task scheduling
○ Parallelizes NumPy, in particular
Hardware trends
● Definitions
○ Algorithms that use multiple cores are called parallel
○ Algorithms that efficiently use disks as an extension of RAM are called out-of-core
Internal data model: Dask Graphs
Example: a program and its corresponding Dask graph
Specification
Specification (2)
● A task is a tuple whose first element is a callable; the next elements in the tuple are its arguments. Arguments might be:
○ Any key in the Dask graph
○ Any other literal (1, “a”, etc)
○ Other tasks
○ Lists of arguments
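A hand-written graph following this specification (a sketch; in practice such graphs are generated by the collections described later):

```python
# A Dask graph is a dict mapping keys to tasks; a task is a tuple whose first
# element is a callable and whose remaining elements are its arguments.
from operator import add
from dask.threaded import get          # one of Dask's schedulers

def inc(x):
    return x + 1

dsk = {
    "x": 1,                            # a plain literal
    "y": (inc, "x"),                   # argument is another key in the graph
    "z": (add, (inc, "y"), 10),        # arguments can themselves be tasks
    "total": (sum, ["x", "y", "z"]),   # lists of arguments are allowed
}

print(get(dsk, "z"))        # add(inc(2), 10) = 13
print(get(dsk, "total"))    # 1 + 2 + 13 = 16
```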
Dask Arrays
● Motivation
○ Users are not supposed to write Dask graphs directly
○ Dask arrays re-implement Numpy with Dask graphs
○ Parallel, out-of-core computations
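A sketch (shape and chunk size are arbitrary choices): the expression only builds a graph plus metadata, and compute() hands that graph to the scheduler.

```python
# Dask array sketch: NumPy-like operations on a chunked array build a graph;
# compute() runs it (in parallel, chunk by chunk, possibly out of core).
import dask.array as da

x = da.ones((20000, 20000), chunks=(1000, 1000))   # 20x20 grid of 1000x1000 blocks
y = (x + x.T).mean(axis=0)                          # no computation yet, only a graph

print(x.chunks)              # metadata: chunk sizes along each axis
print(y.shape, y.dtype)      # metadata is tracked without computing values
print(y.compute()[:5])       # triggers the scheduler; returns a NumPy array
```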
Array metadata
● Example
Dask Array is a Numpy clone
Dynamic Task Scheduling
● Desired properties
○ Lightweight, low-latency (ms), mindful of memory use
Dynamic Task Scheduling
● Scheduler
○ Records finished tasks
○ Marks new tasks that can be run
○ Marks which data can be released
○ Selects a task to give to the worker, among ready tasks
Task selection logic
● Main idea
○ Prioritize tasks that release intermediate results (keep a low memory footprint)
○ Avoids spilling intermediate data to disk
● Policy
○ Last in, first out
○ Prioritize task(s) that were made available last
○ “Finish things before starting new things”
○ Break ties using Depth-First Search
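A minimal sketch of this bookkeeping and of the last-in-first-out selection (not Dask's actual scheduler code), run on a tiny hand-written graph:

```python
# Toy single-worker scheduler: tasks become ready when their inputs are done,
# ready tasks are picked last-in-first-out, and intermediate results that no
# remaining task needs are released. Not Dask's real scheduler.
from operator import add

def inc(x):
    return x + 1

dsk = {"x": 1, "y": (inc, "x"), "z": (add, "y", 10)}
goal = "z"

def deps_of(task):
    """Keys of dsk that a task (or argument) refers to."""
    if isinstance(task, tuple):
        return {k for arg in task[1:] for k in deps_of(arg)}
    if isinstance(task, list):
        return {k for arg in task for k in deps_of(arg)}
    return {task} if task in dsk else set()

def run(task, results):
    """Execute a task, substituting keys by their computed values."""
    if isinstance(task, tuple):
        return task[0](*[run(arg, results) for arg in task[1:]])
    if isinstance(task, list):
        return [run(arg, results) for arg in task]
    return results.get(task, task)                  # key or plain literal

deps = {k: deps_of(v) for k, v in dsk.items()}
dependents = {k: {c for c, ds in deps.items() if k in ds} for k in dsk}

results = {}
waiting = {k: set(ds) for k, ds in deps.items()}
ready = [k for k, ds in waiting.items() if not ds]  # used as a LIFO stack

while ready:
    key = ready.pop()                               # last in, first out
    results[key] = run(dsk[key], results)
    for child in dependents[key]:                   # mark newly runnable tasks
        waiting[child].discard(key)
        if not waiting[child]:
            ready.append(child)                     # they go on top of the stack
    for parent in deps[key]:                        # release data nobody still needs
        if parent != goal and all(c in results for c in dependents[parent]):
            results.pop(parent, None)

print(results[goal])                                # 12
```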
Benchmark
Results
Other Collections: Bags and DataFrames
In short: a Bag holds an unordered collection of Python objects (semi-structured data), while a DataFrame holds tabular, pandas-like data split into partitions.
Bag
DataFrame
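The slide bodies are not in this text; an illustrative sketch of the two collections (file names and column names are assumptions):

```python
# Illustrative sketch only; paths and column names are made up.
import dask.bag as db
import dask.dataframe as dd

# Bag: an unordered collection of Python objects (good for semi-structured data)
lines = db.read_text("logs/*.log").map(str.strip).filter(lambda l: "ERROR" in l)
print(lines.count().compute())

# DataFrame: many pandas DataFrames partitioned along the index, pandas-like API
df = dd.read_csv("trips-*.csv")
print(df.groupby("city")["duration"].mean().compute())
```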
Conclusion