
Music by Jacob Sanz-Robinson:

https://open.spotify.com/album/4sfzcZcVe2oIMkePq1ULTQ

Apache Spark & Dask
January 27, 2021
Tristan Glatard
Gina Cody School of Engineering and Computer Science
Department of Computer Science and Software Engineering
1
2
Scope
● Most active open-source project in Big Data Analytics

● Used in thousands of organizations from diverse sectors:


○ Technology, banking, retail, biotech, astronomy

● Deployments with up to 8,000 nodes have been announced


3
Introduction

● We need cluster computing to analyse Big Data


● There has been an explosion of specialized computing models
○ MapReduce
○ Google’s Dremel (SQL) and Pregel (graphs)
○ Storm, Impala, etc. (from the Hadoop ecosystem)

● Big Data applications need to combine many types of processing


○ To handle data variety
○ Users need to stitch systems together
4
Introduction (2)

● The profusion of systems is detrimental to:
○ Usability
○ Performance

● Spark provides a uniform framework for Big Data Analytics

5
Highlights

● Spark concepts
○ Built around a data structure: Resilient Distributed Datasets (RDDs)
○ Programming model is MapReduce + extensions

● Added value compared to MapReduce


○ Easier to program because API is unified
○ Interactive exploration of Big Data is possible (pyspark)
○ More efficient to combine multiple tasks (workflows). Why?

6
In-memory processing

● MapReduce lacks an abstraction to leverage distributed memory


○ Inefficient for applications that reuse intermediate results
○ For instance, iterative algorithms: kmeans, PageRank

● Spark keeps data in memory


○ Reduces disk I/O
○ Spills to disk if not enough space

7
Programming model: RDDs

RDDs are fault-tolerant collections of objects, partitioned across a cluster, that can be manipulated in parallel.

8
9
RDD abstraction

● An RDD is a read-only (why?), partitioned collection of records


● Can be created through operations on:
○ Data in stable storage (e.g., sc.textFile)
○ Other RDDs: transformations
○ Existing collections (sc.parallelize)
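A minimal PySpark sketch of these three creation paths (the HDFS path is a hypothetical placeholder):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-creation")         # driver entry point
lines = sc.textFile("hdfs:///data/logs.txt")      # from data in stable storage (hypothetical path)
nums = sc.parallelize(range(1000))                # from an existing Python collection
squares = nums.map(lambda x: x * x)               # from another RDD, via a transformation
```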

10
Functional programming API

● Transformations:
○ Return RDDs
○ Are parametrized by user-provided functions

● Actions return results to application or storage system
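A small illustration of the split, reusing the sc and nums objects from the sketch above:

```python
# Transformations: return new RDDs and are evaluated lazily
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: 2 * x)

# Actions: trigger computation and return a result to the application (or to storage)
total = doubled.reduce(lambda a, b: a + b)
first_five = doubled.take(5)
```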

11
RDD transformations and actions
(Table of RDD transformations and actions; (K, V) denotes a key-value pair.)
Transformations return an RDD; actions return something else.
Note: MapReduce = flatMap + reduceByKey
12
Lazy evaluation

● Transformations do not compute RDDs immediately


○ The returned RDD object is a representation of the result

● Actions trigger computations


● This allows optimization in the execution plan
○ Multiple map operations are fused into one pass
○ While preserving modularity (what does this mean?)

13
Lazy evaluation (2)
Computations are costly and memory is limited → compute only what is necessary.

Note: with eager evaluation, the entire file must be read and filtered. Lazy evaluation limits the amount of data that needs to be processed.

(slide from V. Hayot-Sasson)
14
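A sketch of this effect in PySpark (assuming a SparkContext sc; the path is hypothetical). Because take(10) is the only action, Spark can stop scanning the file as soon as ten matching lines are found:

```python
log = sc.textFile("hdfs:///data/app.log")                    # nothing is read yet
errors = log.filter(lambda line: line.startswith("ERROR"))   # still nothing is read

sample = errors.take(10)   # action: triggers execution, reads only as much data as needed
```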


Persistence (or caching)
● By default RDDs are ephemeral
● They might be evicted from memory
● Users can persist RDDs in memory or on disk, by calling .persist()
● persist() accepts priorities for spilling to disk
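A minimal sketch with today's PySpark API, where the persistence level is passed as a StorageLevel:

```python
from pyspark import StorageLevel

errors = log.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if it does not fit

errors.count()   # the first action computes and caches the partitions
errors.count()   # later actions reuse the cached partitions
```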

15
Memory management

● Persisted RDDs might be stored:
○ In memory
○ On disk

● When a new partition is computed and there is not enough space


○ A partition is evicted using the LRU policy
○ Unless the partition is from the same RDD

● Users can also specify a persistence priority

16
Example: Log Mining

(Annotations on the code example shown on this slide:)
● Reads from HDFS
● Nothing has executed so far (why?)
● An action triggers execution
● Other uses of errors won't reload data from the text file
● lines is never stored in memory!
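A PySpark reconstruction of the log-mining example described above (the original in the RDD paper is in Scala; the HDFS path is hypothetical):

```python
lines = sc.textFile("hdfs:///var/logs/app.log")          # reads from HDFS (lazily)
errors = lines.filter(lambda l: l.startswith("ERROR"))   # nothing has executed so far
errors.persist()                                         # keep the errors RDD in memory

errors.count()                                           # action: triggers execution
errors.filter(lambda l: "MySQL" in l).count()            # reuses the cached errors RDD;
                                                         # the text file is not reloaded
```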


17
Lineage graph

An RDD maintains information on how it was derived from other datasets.
From the lineage, the RDD can be completely (re-)computed.
This is used for fault tolerance.
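The lineage can be inspected from the driver; a small sketch (PySpark returns the debug string as bytes):

```python
# Prints the chain of parent RDDs (textFile -> filter -> ...) behind the errors RDD
print(errors.toDebugString().decode())
```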

18
RDDs vs Distributed Shared Memory

RDDs are suitable for batch applications that iterate over a dataset.
RDDs are not suitable for asynchronous, fine-grained updates.

19
Spark programming interface

● Driver
○ Runs user program
○ Defines RDDs and
invokes actions
○ Tracks the lineage
● Workers
○ Long-lived processes
○ Store RDD partitions in
RAM across operations
20
RDD representation
● HDFS file:
○ partitions(): one partition / HDFS block
○ preferredLocations(p): node where the
block is stored
○ iterator(...): reads the block

● MappedRDD, created by map:


○ partitions(): same as parent
○ preferredLocations(p): same as parent
○ iterator(...): applies the function passed to map to the parent's records

21
Narrow and wide dependencies between RDDs

● Narrow dependency
○ Partition has at most 1 child
○ Example: map, filter
○ Can be pipelined on one node
○ Minimal impact of failures

● Wide dependency
○ Partition has more than 1 child
○ Example: join, groupByKey
○ Require data shuffle across nodes
○ High impact of failures
22
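A tiny example of both dependency types (assuming the SparkContext sc):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

scaled = pairs.mapValues(lambda v: v * 10)   # narrow dependency: no shuffle needed
grouped = pairs.groupByKey()                 # wide dependency: requires a shuffle
```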
Job Scheduling
When user runs an action on an RDD:
● Scheduler builds DAG of stages
from RDD lineage graph.
○ Stages contain as many narrow
dependencies as possible.
○ Stage boundaries are shuffle
operations or partitions in
memory.
● Scheduler then launches tasks.
23
Scheduling (2)

● Tasks are assigned to machines based on data locality


○ -> tasks are sent to machines holding their partition in memory
○ -> or to their partition’s preferred locations

● For shuffles
○ Intermediate records are spilled to disks
○ On the nodes holding the parent partitions (why?)
○ Like in MapReduce!

24
Fault tolerance

● When a task fails


○ If parent partitions are available: re-submit to another node
○ Otherwise: resubmit tasks to recompute missing partitions

● Checkpointing
○ Lineage-based recovery of RDDs may be time-consuming (why?)
○ RDDs can be checkpointed to stable storage, through a flag to .persist()
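In today's PySpark API, checkpointing is exposed as a checkpoint() call rather than a persist() flag; a minimal sketch (directory and RDD name are hypothetical):

```python
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # must point to stable storage

long_lineage_rdd.checkpoint()   # mark the RDD to be saved; its lineage is truncated
long_lineage_rdd.count()        # an action materializes the checkpoint
```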

25
Evaluation: Iterative Machine Learning
● Applications
○ kmeans (more compute)
○ Logistic regression (less compute)

● Execution conditions
○ 10 iterations, 100GB dataset
○ 25-100 machines (Amazon EC2, m1.xlarge, 15GB RAM, 4 cores)
○ HDFS with 256MB blocks

● Systems
○ Hadoop (MapReduce), HadoopBinMem
○ Spark RDDs
26
Results

● First iteration:
○ Spark has less overhead than Hadoop (due to better internal implementation)
○ HadoopBinMem had to convert data to binary
● Subsequent iterations:
○ Logistic regression: Spark 25x faster than Hadoop (why?)
○ kmeans: Spark 3x faster (why?)
27
Fault recovery

● Each iteration had 400 tasks working on 100GB of data.
● Spark re-ran failed tasks in parallel; they re-read the input data and reconstructed the lost RDD partitions.

28
Behavior with insufficient memory

Performance degrades gracefully

29
Impact of persistence

30
Partitioning

● Partitioning across cluster machines can be controlled by users
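For example, a pair RDD can be hash-partitioned explicitly (a minimal sketch, assuming sc):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
partitioned = pairs.partitionBy(8)      # hash-partition the pair RDD into 8 partitions
print(partitioned.getNumPartitions())   # 8
```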

31
Spark DataFrames

● A DataFrame is a set of records with a known schema
● A DataFrame is implemented on top of RDDs
● DataFrame operations map to the SQL engine
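A small PySpark DataFrame sketch (input path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-example").getOrCreate()

df = spark.read.json("hdfs:///data/events.json")   # DataFrame with an inferred schema
df.filter(df["level"] == "ERROR") \
  .groupBy("service").count() \
  .show()                                          # executed by the SQL engine
```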

32
Other libraries

Streaming, GraphX, MLlib (see next labs)
Spark can optimize across libraries

33
Use of libraries

34
Recap
General framework (increases usability and performance)
Rich programming model (transformations, actions, libraries)
Data locality (as in MapReduce)
RDDs remain in memory (reduces disk I/O)
Lazy RDD evaluation (allows transformations to be merged)
Fault-tolerance through lineage (better than data replication)

35
36
Introduction

● Scientific Python is great! http://scipy.org


○ Numpy, matplotlib, scipy, pandas
○ Lots of papers and theses use it

● But it doesn’t work for Big Data Analytics


○ It is single-threaded
○ It only works on data that fits in memory

● Dask
○ Leverages blocked algorithms and task scheduling
○ Parallelizes Numpy, in particular
37
Hardware trends

● Most laptops and workstations are now powerful machines


○ Multi-core CPUs
○ Solid-State Drive (SSD): high bandwidth, low latency
○ Workstations rival small clusters but are more convenient

● Definitions
○ Algorithms that use multiple cores are called parallel
○ Algorithms that efficiently use disks as an extension of RAM are called out-of-core

38
Internal data model: Dask Graphs

● Rely on common Python data structures


○ Dictionaries
○ Tuples
○ Callables

● A dictionary mapping keys to tasks or values


○ Keys can be any hashable (as in any Python dict)
○ A task is a tuple with a callable first element
○ A value is any object which is not a task

39
Example
(Figure: a small program and its corresponding Dask graph)

40
Specification

● A Dask graph is a dictionary mapping keys to values or tasks

● A key can be any hashable that is not a task

41
Specification (2)

● A Task is a tuple with a callable first element:

● The next elements in the tuple are arguments. Arguments might be:
○ Any key in the Dask graph
○ Any other literal (1, “a”, etc)
○ Other tasks
○ Lists of arguments
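A minimal graph following this specification, evaluated with one of Dask's get functions:

```python
from dask.threaded import get

def inc(i):
    return i + 1

def add(a, b):
    return a + b

dsk = {
    "x": 1,                # a value
    "y": (inc, "x"),       # a task: a callable followed by its arguments
    "z": (add, "y", 10),   # arguments can be other keys or literals
}

print(get(dsk, "z"))       # -> 12
```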

42
Dask Arrays

● Motivation
○ Users are not supposed to write Dask graphs directly
○ Dask arrays re-implement Numpy with Dask graphs
○ Parallel, out-of-core computations

● Main idea: Blocked Array Algorithms


○ Break large computations into smaller chunks:
○ “Take the sum of a trillion numbers”
○ => “break the trillion numbers into one million chunks of size one million”
○ “sum each chunk, then sum all the intermediate sums”
43
Example
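A Dask Array sketch of the blocked-sum idea (scaled down to a billion numbers so it can actually run):

```python
import dask.array as da

x = da.ones(1_000_000_000, chunks=1_000_000)   # 1000 blocks of one million ones each
total = x.sum()                                # graph: sum each block, then sum the partial sums
print(total.compute())                         # 1000000000.0
```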

44
Array metadata

● Dask Arrays contain


○ A Dask graph (.dask)
○ Information about shape and chunk shape (.chunks)
○ A name referring to a key in the graph (.name)
○ A dtype

● Example
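A sketch of that metadata (the exact .name value differs from run to run):

```python
import dask.array as da

x = da.ones((4000, 4000), chunks=(1000, 1000))   # a 4 x 4 grid of 1000 x 1000 blocks

print(x.shape)    # (4000, 4000)
print(x.chunks)   # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))
print(x.dtype)    # float64
print(x.name)     # key prefix of this array's blocks in the graph x.dask
```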

45
Dask Array is a Numpy clone

● It implements most of Numpy’s operations


○ Slicing
○ sum, add, dot, etc

● Numpy programs work out-of-the-box!


○ Except that actual computation is triggered by .compute()

● But it doesn’t support arrays of arbitrary size
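For instance, a Numpy-style program runs unchanged up to the final .compute() call:

```python
import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))
y = x + x.T - x.mean(axis=0)    # Numpy-style expression; only builds a graph
corner = y[:5, :5].compute()    # .compute() triggers the blocked computation
print(corner)
```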

46
Dynamic Task Scheduling

● Dask has multiple schedulers


○ Multi-core
○ Distributed execution
○ Users can create their own

● Dask uses dynamic scheduling


○ Execution order and mapping to nodes is done during execution
○ Why is it useful?

● Desired properties
○ Lightweight, low-latency (ms), mindful of memory use
47
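In the current Dask API, the scheduler can be chosen per call; a small sketch (distributed execution goes through the separate dask.distributed package):

```python
import dask.array as da

x = da.ones((10000, 10000), chunks=(1000, 1000))

x.sum().compute(scheduler="threads")       # multi-core, shared-memory scheduler (default for arrays)
x.sum().compute(scheduler="processes")     # multiprocessing scheduler
x.sum().compute(scheduler="synchronous")   # single-threaded, useful for debugging
```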
Dynamic Task Scheduling

● Workers report to scheduler


○ Task completion
○ That they are ready to execute a task

● Scheduler
○ Records finished tasks
○ Marks new tasks that can be run
○ Marks which data can be released
○ Selects a task to give to the worker, among ready tasks

48
Task selection logic

● Main idea
○ Prioritize tasks that release intermediate results (keep a low memory footprint)
○ Avoids spilling intermediate data to disk

● Policy
○ Last in, first out
○ Prioritize task(s) that were made available last
○ “Finish things before starting new things”
○ Break ties using Depth-First Search

49
Benchmark

● Application: matrix multiplication on an out-of-core dataset


● Comparison
○ Numpy on a big-memory machine
○ Dask Array in a small amount of memory, 4 threads, pulling data from disk
○ Compare different BLAS implementations, with different numbers of threads

50
Results

● Main conclusion: matrix multiplication benefits from parallelization.
● What do you think of the presentation of the results in the paper?

51
Other Collections: Bags and DataFrames

In short:
dask.array = numpy + threading
dask.bag = toolz + multiprocessing
dask.dataframe = pandas + threading

52
Bag

● Bags are unordered collections with repeats.


● dask.bag API contains functions like map, filter, etc
● Example
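A small dask.bag sketch:

```python
import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)   # an unordered collection
result = (b.filter(lambda x: x % 2 == 0)
           .map(lambda x: x ** 2)
           .sum())
print(result.compute())                          # 0 + 4 + 16 + 36 + 64 = 120
```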

53
DataFrame

dask.dataframe implements a large dataframe out of many Pandas DataFrames
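A sketch over a hypothetical set of CSV files sharing one schema (column names are illustrative):

```python
import dask.dataframe as dd

df = dd.read_csv("data/2021-*.csv")              # one logical dataframe, one pandas partition per file
daily_mean = df.groupby("day")["value"].mean()   # pandas-like API, evaluated lazily
print(daily_mean.compute())
```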

54
Conclusion

Dask extends the scale of convenient data analysis
Dask has a low barrier to entry (can use existing libraries)
Dask is easily installed (pip install “dask[complete]”)

55
