
Music by Jacob Sanz-Robinson:

https://open.spotify.com/album/4sfzcZcVe2oIMkePq1ULTQ

Apache Spark & Dask
January 27, 2021
Tristan Glatard
Gina Cody School of Engineering and Computer Science
Department of Computer Science and Software Engineering
1
2
Scope
● Most active open-source project in Big Data Analytics

● Used in thousands of organizations from diverse sectors:


○ Technology, banking, retail, biotech, astronomy

● Deployments with up to 8,000 nodes have been announced


3
Introduction

● We need cluster computing to analyse Big Data


● There has been an explosion of specialized computing models
○ MapReduce
○ Google’s Dremel (SQL) and Pregel (graphs)
○ Storm, Impala, etc. (from the Hadoop ecosystem)

● Big Data applications need to combine many types of processing


○ To handle data variety
○ Users need to stitch systems together
4
Introduction (2)

● The profusion of systems is detrimental to:
○ Usability
○ Performance

● Spark provides a uniform framework for Big Data Analytics

5
Highlights

● Spark concepts
○ Built around a data structure: Resilient Distributed Datasets (RDDs)
○ Programming model is MapReduce + extensions

● Added value compared to MapReduce


○ Easier to program because API is unified
○ Interactive exploration of Big Data is possible (pyspark)
○ More efficient to combine multiple tasks (workflows). Why?

6
In-memory processing

● MapReduce lacks an abstraction to leverage distributed memory


○ Inefficient for applications that reuse intermediate results
○ For instance, iterative algorithms: kmeans, PageRank

● Spark keeps data in memory


○ Reduces disk I/O
○ Spills to disk if not enough space

7
Programming model: RDDs

RDDs are fault-tolerant collections of objects, partitioned across a cluster, that can be manipulated in parallel.

8
9
RDD abstraction

● An RDD is a read-only (why?), partitioned collection of records


● Can be created through operations on:
○ Data in stable storage (e.g., sc.textFile)
○ Other RDDs: transformations
○ Existing collections (sc.parallelize)
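A minimal PySpark sketch of these three creation paths (the HDFS path is a hypothetical placeholder):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation")         # driver entry point
lines = sc.textFile("hdfs:///data/logs.txt")      # from data in stable storage (hypothetical path)
nums = sc.parallelize(range(1000))                # from an existing Python collection
squares = nums.map(lambda x: x * x)               # from another RDD, via a transformation
```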

10
Functional programming API

● Transformations:
○ Return RDDs
○ Are parametrized by user-provided functions

● Actions return results to application or storage system
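A small illustration of the split, reusing the sc and nums objects from the sketch above:

```python
# Transformations: return new RDDs and are evaluated lazily
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: 2 * x)

# Actions: trigger computation and return a result to the application (or to storage)
total = doubled.reduce(lambda a, b: a + b)
first_five = doubled.take(5)
```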

11
RDD transformations and actions
(Table of RDD transformations and actions; (K, V) denotes a key-value pair.)
Transformations return an RDD; actions return something else.
Note: MapReduce = flatMap + reduceByKey
12
Lazy evaluation

● Transformations do not compute RDDs immediately


○ The returned RDD object is a representation of the result

● Actions trigger computations


● This allows optimization in the execution plan
○ Multiple map operations are fused into one pass
○ While preserving modularity (what does this mean?)

13
Lazy evaluation (2)
Computations are costly and memory is limited → compute only what is necessary.

Note: with eager evaluation, the entire file must be read and filtered. Lazy evaluation limits the amount of data that needs to be processed.

(slide from V. Hayot-Sasson)
14
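A sketch of this effect in PySpark (assuming a SparkContext sc; the path is hypothetical). Because take(10) is the only action, Spark can stop scanning the file as soon as ten matching lines are found:

```python
log = sc.textFile("hdfs:///data/app.log")                    # nothing is read yet
errors = log.filter(lambda line: line.startswith("ERROR"))   # still nothing is read

sample = errors.take(10)   # action: triggers execution, reads only as much data as needed
```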


Persistence (or caching)
● By default RDDs are ephemeral
● They might be evicted from memory
● Users can persist RDDs in memory or on disk, by calling .persist()
● persist() accepts priorities for spilling to disk
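A minimal sketch with today's PySpark API, where the persistence level is passed as a StorageLevel:

```python
from pyspark import StorageLevel

errors = log.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if it does not fit

errors.count()   # the first action computes and caches the partitions
errors.count()   # later actions reuse the cached partitions
```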

15
Memory management

● Persisted RDDs might be stored:
○ In memory
○ On disk

● When a new partition is computed and there is not enough space


○ A partition is evicted using the LRU policy
○ Unless the partition is from the same RDD

● Users can also specify a persistence priority

16
Example: Log Mining

(Annotations on the code example shown on this slide:)
● Reads from HDFS
● Nothing has executed so far (why?)
● An action triggers execution
● Other uses of errors won't reload data from the text file
● lines is never stored in memory!
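A PySpark reconstruction of the log-mining example described above (the original in the RDD paper is in Scala; the HDFS path is hypothetical):

```python
lines = sc.textFile("hdfs:///var/logs/app.log")          # reads from HDFS (lazily)
errors = lines.filter(lambda l: l.startswith("ERROR"))   # nothing has executed so far
errors.persist()                                         # keep the errors RDD in memory

errors.count()                                           # action: triggers execution
errors.filter(lambda l: "MySQL" in l).count()            # reuses the cached errors RDD;
                                                         # the text file is not reloaded
```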


17
Lineage graph

An RDD maintains information on how it was derived from other datasets.
From the lineage, the RDD can be completely (re-)computed.
This is used for fault tolerance.
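The lineage can be inspected from the driver; a small sketch (PySpark returns the debug string as bytes):

```python
# Prints the chain of parent RDDs (textFile -> filter -> ...) behind the errors RDD
print(errors.toDebugString().decode())
```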

18
RDDs vs Distributed Shared Memory

RDDs are suitable for batch applications that iterate over a dataset.
RDDs are not suitable for asynchronous, fine-grained updates.

19
Spark programming interface

● Driver
○ Runs user program
○ Defines RDDs and
invokes actions
○ Tracks the lineage
● Workers
○ Long-lived processes
○ Store RDD partitions in
RAM across operations
20
RDD representation
● HDFS file:
○ partitions(): one partition / HDFS block
○ preferredLocations(p): node where the
block is stored
○ iterator(...): reads the block

● MappedRDD, created by map:


○ partitions(): same as parent
○ preferredLocations(p): same as parent
○ iterator(...): applies the function passed to map to the parent's records

21
Narrow and wide dependencies between RDDs

● Narrow dependency
○ Partition has at most 1 child
○ Example: map, filter
○ Can be pipelined on one node
○ Minimal impact of failures

● Wide dependency
○ Partition has more than 1 child
○ Example: join, groupByKey
○ Require data shuffle across nodes
○ High impact of failures
22
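A tiny example of both dependency types (assuming the SparkContext sc):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

scaled = pairs.mapValues(lambda v: v * 10)   # narrow dependency: no shuffle needed
grouped = pairs.groupByKey()                 # wide dependency: requires a shuffle
```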
Job Scheduling
When user runs an action on an RDD:
● Scheduler builds DAG of stages
from RDD lineage graph.
○ Stages contain as many narrow
dependencies as possible.
○ Stage boundaries are shuffle
operations or partitions in
memory.
● Scheduler then launches tasks.
23
Scheduling (2)

● Tasks are assigned to machines based on data locality


○ -> tasks are sent to machines holding their partition in memory
○ -> or to their partition’s preferred locations

● For shuffles
○ Intermediate records are spilled to disks
○ On the nodes holding the parent partitions (why?)
○ Like in MapReduce!

24
Fault tolerance

● When a task fails


○ If parent partitions are available: re-submit to another node
○ Otherwise: resubmit tasks to recompute missing partitions

● Checkpointing
○ Lineage-based recovery of RDDs may be time-consuming (why?)
○ RDDs can be checkpointed to stable storage, through a flag to .persist()
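In today's PySpark API, checkpointing is exposed as a checkpoint() call rather than a persist() flag; a minimal sketch (directory and RDD name are hypothetical):

```python
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # must point to stable storage

long_lineage_rdd.checkpoint()   # mark the RDD to be saved; its lineage is truncated
long_lineage_rdd.count()        # an action materializes the checkpoint
```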

25
Evaluation: Iterative Machine Learning
● Applications
○ kmeans (more compute)
○ Logistic regression (less compute)

● Execution conditions
○ 10 iterations, 100GB dataset
○ 25-100 machines (Amazon EC2, m1.xlarge, 15GB RAM, 4 cores)
○ HDFS with 256MB blocks

● Systems
○ Hadoop (MapReduce), HadoopBinMem
○ Spark RDDs
26
Results

● First iteration:
○ Spark has less overhead than Hadoop (due to better internal implementation)
○ HadoopBinMem had to convert data to binary
● Subsequent iterations:
○ Logistic regression: Spark 25x faster than Hadoop (why?)
○ kmeans: Spark 3x faster (why?)
27
Fault recovery

● Each iteration had 400 tasks working on 100GB of data.
● Spark re-ran failed tasks in parallel; they re-read the input data and reconstructed the lost RDD partitions.

28
Behavior with insufficient memory

Performance degrades gracefully

29
Impact of persistence

30
Partitioning

● Partitioning across cluster machines can be controlled by users
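For example, a pair RDD can be hash-partitioned explicitly (a minimal sketch, assuming sc):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
partitioned = pairs.partitionBy(8)      # hash-partition the pair RDD into 8 partitions
print(partitioned.getNumPartitions())   # 8
```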

31
Spark DataFrames

● A DataFrame is a set of records with a known schema
● A DataFrame is implemented on top of RDDs
● DataFrame operations map to the SQL engine
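A small PySpark DataFrame sketch (input path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-example").getOrCreate()

df = spark.read.json("hdfs:///data/events.json")   # DataFrame with an inferred schema
df.filter(df["level"] == "ERROR") \
  .groupBy("service").count() \
  .show()                                          # executed by the SQL engine
```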

32
Other libraries

Streaming, GraphX, MLlib (see next labs)
Spark can optimize across libraries

33
Use of libraries

34
Recap
General framework (increases usability and performance)
Rich programming model (transformations, actions, libraries)
Data locality (as in MapReduce)
RDDs remain in memory (reduces disk I/O)
Lazy RDD evaluation (allows transformations to be merged)
Fault-tolerance through lineage (better than data replication)

35
36
Introduction

● Scientific Python is great! http://scipy.org


○ Numpy, matplotlib, scipy, pandas
○ Lots of papers and theses use it

● But it doesn’t work for Big Data Analytics


○ It is single-threaded
○ It only works on data that fits in memory

● Dask
○ Leverages blocked algorithms and task scheduling
○ Parallelizes Numpy, in particular
37
Hardware trends

● Most laptops and workstations are now powerful machines


○ Multi-core CPUs
○ Solid-State Drive (SSD): high bandwidth, low latency
○ Workstations rival small clusters but are more convenient

● Definitions
○ Algorithms that use multiple cores are called parallel
○ Algorithms that efficiently use disks as an extension of RAM are called out-of-core

38
Internal data model: Dask Graphs

● Rely on common Python data structures


○ Dictionaries
○ Tuples
○ Callables

● A dictionary mapping keys to tasks or values


○ Keys can be any hashable (as in any Python dict)
○ A task is a tuple with a callable first element
○ A value is any object which is not a task

39
Example
(Figure: a small program and its corresponding Dask graph)

40
Specification

● A Dask graph is a dictionary mapping keys to values or tasks

● A key can be any hashable that is not a task

41
Specification (2)

● A Task is a tuple with a callable first element:

● The next elements in the tuple are arguments. Arguments might be:
○ Any key in the Dask graph
○ Any other literal (1, “a”, etc)
○ Other tasks
○ Lists of arguments
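A minimal graph following this specification, evaluated with one of Dask's get functions:

```python
from dask.threaded import get

def inc(i):
    return i + 1

def add(a, b):
    return a + b

dsk = {
    "x": 1,                # a value
    "y": (inc, "x"),       # a task: a callable followed by its arguments
    "z": (add, "y", 10),   # arguments can be other keys or literals
}

print(get(dsk, "z"))       # -> 12
```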

42
Dask Arrays

● Motivation
○ Users are not supposed to write Dask graphs directly
○ Dask arrays re-implement Numpy with Dask graphs
○ Parallel, out-of-core computations

● Main idea: Blocked Array Algorithms


○ Break large computations into smaller chunks:
○ “Take the sum of a trillion numbers”
○ => “break the trillion numbers into one million chunks of size one million”
○ “sum each chunk, then sum all the intermediate sums”
43
Example
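A Dask Array sketch of the blocked-sum idea (scaled down to a billion numbers so it can actually run):

```python
import dask.array as da

x = da.ones(1_000_000_000, chunks=1_000_000)   # 1000 blocks of one million ones each
total = x.sum()                                # graph: sum each block, then sum the partial sums
print(total.compute())                         # 1000000000.0
```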

44
Array metadata

● Dask Arrays contain


○ A Dask graph (.dask)
○ Information about shape and chunk shape (.chunks)
○ A name referring to a key in the graph (.name)
○ A dtype

● Example
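A sketch of that metadata (the exact .name value differs from run to run):

```python
import dask.array as da

x = da.ones((4000, 4000), chunks=(1000, 1000))   # a 4 x 4 grid of 1000 x 1000 blocks

print(x.shape)    # (4000, 4000)
print(x.chunks)   # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))
print(x.dtype)    # float64
print(x.name)     # key prefix of this array's blocks in the graph x.dask
```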

45
Dask Array is a Numpy clone

● It implements most of Numpy’s operations


○ Slicing
○ sum, add, dot, etc

● Numpy programs work out-of-the-box!


○ Except that actual computation is triggered by .compute()

● But it doesn’t support arrays of arbitrary size
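For instance, a Numpy-style program runs unchanged up to the final .compute() call:

```python
import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))
y = x + x.T - x.mean(axis=0)    # Numpy-style expression; only builds a graph
corner = y[:5, :5].compute()    # .compute() triggers the blocked computation
print(corner)
```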

46
Dynamic Task Scheduling

● Dask has multiple schedulers


○ Multi-core
○ Distributed execution
○ Users can create their own

● Dask uses dynamic scheduling


○ Execution order and mapping to nodes is done during execution
○ Why is it useful?

● Desired properties
○ Lightweight, low-latency (ms), mindful of memory use
47
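In the current Dask API, the scheduler can be chosen per call; a small sketch (distributed execution goes through the separate dask.distributed package):

```python
import dask.array as da

x = da.ones((10000, 10000), chunks=(1000, 1000))

x.sum().compute(scheduler="threads")       # multi-core, shared-memory scheduler (default for arrays)
x.sum().compute(scheduler="processes")     # multiprocessing scheduler
x.sum().compute(scheduler="synchronous")   # single-threaded, useful for debugging
```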
Dynamic Task Scheduling

● Workers report to scheduler


○ Task completion
○ That they are ready to execute a task

● Scheduler
○ Records finished tasks
○ Marks new tasks that can be run
○ Marks which data can be released
○ Selects a task to give to the worker, among ready tasks

48
Task selection logic

● Main idea
○ Prioritize tasks that release intermediate results (keep a low memory footprint)
○ Avoids spilling intermediate data to disk

● Policy
○ Last in, first out
○ Prioritize task(s) that were made available last
○ “Finish things before starting new things”
○ Break ties using Depth-First Search

49
Benchmark

● Application: matrix multiplication on an out-of-core dataset


● Comparison
○ Numpy on a big-memory machine
○ Dask Array in a small amount of memory, 4 threads, pulling data from disk
○ Compare different BLAS implementations, with different numbers of threads

50
Results

● Main conclusion: matrix multiplication benefits from parallelization.
● What do you think of the presentation of the results in the paper?

51
Other Collections: Bags and DataFrames

In short:
dask.array = numpy + threading
dask.bag = toolz + multiprocessing
dask.dataframe = pandas + threading

52
Bag

● Bags are unordered collections with repeats.


● dask.bag API contains functions like map, filter, etc
● Example
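A small dask.bag sketch:

```python
import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)   # an unordered collection
result = (b.filter(lambda x: x % 2 == 0)
           .map(lambda x: x ** 2)
           .sum())
print(result.compute())                          # 0 + 4 + 16 + 36 + 64 = 120
```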

53
DataFrame

dask.dataframe implements a large dataframe out of many Pandas DataFrames
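A sketch over a hypothetical set of CSV files sharing one schema (column names are illustrative):

```python
import dask.dataframe as dd

df = dd.read_csv("data/2021-*.csv")              # one logical dataframe, one pandas partition per file
daily_mean = df.groupby("day")["value"].mean()   # pandas-like API, evaluated lazily
print(daily_mean.compute())
```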

54
Conclusion

Dask extends the scale of convenient data analysis
Dask has a low barrier to entry (can use existing libraries)
Dask is easily installed (pip install “dask[complete]”)

55
