Apache Spark Engine
Fundamentals
Abderrahmane EZ-ZAHOUT
1
Objectives
After completing this course, you should be able to:
Learn the fundamentals of Spark, the technology that is
revolutionizing the analytics and big data world!
Learn how it performs at speeds up to 100 times faster than MapReduce
for iterative algorithms or interactive data mining.
Learn how it provides in-memory cluster computing for lightning fast
speed and supports Java, Python, R, and Scala APIs for ease of
development.
Learn how it can handle a wide range of data processing scenarios by
combining SQL, streaming and complex analytics together seamlessly
in the same application.
Learn how it runs on top of Hadoop, Mesos, standalone, or in the
cloud. It can access diverse data sources such as HDFS, Cassandra,
HBase, or S3.
2
Introduction to Spark
I
What is Spark and what is its purpose?
Components of the Spark unified stack,
Resilient Distributed Dataset (RDD),
Downloading and installing Spark standalone,
Scala and Python overview,
Launching and using Spark’s Scala and
Python shell
3
RDD and DataFrames
II
Understand how to create parallelized
collections and external datasets
Work with Resilient Distributed Dataset
(RDD) operations
Utilize shared variables and key-value pairs
4
Spark application programming
III
Understand the purpose and usage of the
SparkContext
Initialize Spark with the various programming
languages
Describe and run some Spark examples
Pass functions to Spark
Create and run a Spark standalone application
Submit applications to the cluster
5
Spark libraries
IV
Understand and use the various Spark
libraries
6
Spark cluster configuration, monitoring,
and tuning
V
Understand components of the Spark cluster
Configure Spark to modify the Spark properties,
environmental variables, or logging properties
Monitor Spark using the web UIs, metrics, and
external instrumentation
Understand performance tuning considerations
7
Spark Use Cases
VI
Projects
8
Apache Spark
Apache Spark is an open-source cluster-
computing framework.
Originally developed at the University of California,
Berkeley's AMPLab,
the Spark codebase was later donated to the Apache
Software Foundation,
Spark provides an interface for programming entire
clusters with implicit data parallelism and fault
tolerance.
Apache Spark™ is a unified analytics engine
for large-scale data processing.
9
Big Data and Spark
Speed
Run workloads up to 100x faster: 100 times faster than Hadoop’s MapReduce for
in-memory computations and 10 times faster on disk,
Winner of the Daytona GraySort contest, sorting a petabyte 3 times faster and
using 10 times less hardware than Hadoop’s MapReduce
(https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html),
Apache Spark achieves high performance for both batch and streaming data,
using a state-of-the-art DAG scheduler, a query optimizer, and a physical
execution engine,
Well suited for iterative algorithms in machine learning,
Fast, real-time response to user queries on large in-memory data sets,
Low latency data analysis applied to processing live data streams,
10
Big Data and Spark
Ease of Use
Spark offers over 80 high-level operators that make it easy to build parallel
apps. And you can use it interactively from the Scala, Python, R, and SQL
shells.
11
Big Data and Spark
Generality
Combine SQL, streaming, and complex analytics.
Existing libraries and APIs make it easy to write programs combining batch,
streaming, iterative machine learning and complex queries in a single
application,
An interactive shell is available for Python and Scala,
Built for performance and reliability, written in Scala and running on top of the JVM,
Operational and debugging tools from the Java stack are available for
programmers,
Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
12
Big Data and Spark
Runs Everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud. It can access diverse data
sources.
Spark provides APIs for data munging, ETL, machine
learning, graph processing, streaming, interactive and
batch processing. It can replace several SQL, streaming and
complex analytics systems with one unified environment,
Strong integration with a variety of tools in the Hadoop
ecosystem,
Can read and write different data formats and data
sources including HDFS, Cassandra, S3 and HBase,
You can run Spark using its standalone cluster mode,
on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.
Access data in HDFS, Apache Cassandra, Apache
HBase, Apache Hive, and hundreds of other data sources.
13
Spark is NOT a data storage
system
Spark is not a data store but is versatile in reading
from and writing to a variety of data sources,
Can read and write to different data formats and
data sources including HDFS, Cassandra, S3 and
HBase,
Can access traditional BI tools using a server mode
that provides standard JDBC and ODBC connectivity,
The DataFrame API provides a pluggable mechanism
to access structured data using Spark SQL,
API provides tight optimization integration, thus
enhancing the speed of the Spark jobs that process
vast amounts of data
14
Spark History
15
Spark History
Apache Spark started as a research project at the UC Berkeley AMPLab in
2009,
Apache Spark was open sourced in early 2010, and transferred to Apache
in 2013,
It is a top-level Apache project today, developed collaboratively by a
community of hundreds of developers from hundreds of organizations.
2015 was the year of superstardom for Spark (source:
http://fortune.com/2015/09/25/apache-spark-survey/)
Many of the ideas behind the system were presented in various research
papers over the years,
Motivated by MapReduce and the need to apply machine learning in a
scalable fashion,
16
Use cases of Spark
17
Use cases of Spark
Apache Spark use cases in e-commerce
Industry
18
Use cases of Spark
Apache Spark use cases in Healthcare
Analysis of patient records along with past clinical data helps to
identify which patients are likely to face health issues later on.
This helps prevent hospital re-admission, since home services can be
deployed to the identified patients, saving costs for both hospitals and
patients.
Spark is also used in genomic sequencing.
19
Use cases of Spark
Apache Spark use cases in Media &
Entertainment Industry
In gaming, Spark is used to identify patterns from real-time in-game
events and to respond to them in order to harvest lucrative business
opportunities, for example targeted advertising, auto-adjustment of game
level complexity, player retention, etc.
Some video sharing websites use Spark along with MongoDB to show relevant
advertisements to their users based on the videos they view, share and
browse. Some companies using Spark in real time are Yahoo, Netflix,
Pinterest, Conviva, etc.
20
Use cases of Spark
Apache Spark use cases in Travel Industry
It helps users plan a perfect trip by speeding up personalized
recommendations.
It is also used to advise travelers by comparing many websites to find
the best hotel prices.
Hotel reviews are also processed into a readable format using Spark.
Some apps use Spark to provide a platform for real-time online
reservations, managing large numbers of restaurant and dinner
reservations at the same time.
21
Use cases of Spark
Fraud detection: Spark streaming and Machine Learning
applied to prevent fraud
Network Intrusion detection: Machine Learning applied to
detect cyber hacks
Customer segmentation and personalization: Spark SQL and
Machine Learning applied to maximize customer lifetime value
Social media sentiment analysis: Spark streaming, Spark SQL
and Stanford’s
CoreNLP wrapper helps achieve sentiment analysis
Real-time ad targeting: Spark used to maximize online ad
revenues
Predictive healthcare: Spark used to optimize healthcare costs
22
Use cases of Spark
23
Spark at Uber
”It’s fundamentally a data problem” - Head of Data at Uber,
Aaron Schildkrout
Business Problem: Getting people around a city with an army of more than
100,000 drivers, and using data to intelligently perfect the business in
an automated and real-time way.
Accurately paying drivers as per the trips data set
• Maximize profits by positioning cars optimally
• Help drivers avoid accidents
• Calculate surge pricing
Solution: Spark Streaming and Spark SQL used as the ETL system,
Spark MLlib and GraphX for advanced analytics
Reference: Talk by Uber engineers at Apache Spark meetup
https://www.youtube.com/watch?v=zKbds9ZPjLE
24
Spark at Netflix
Netflix uses Apache Spark for real-time stream processing to provide
online recommendations to its customers.
Streaming devices at Netflix send events which capture all member
activities and play a vital role in personalization.
"There are 33 million different versions of Netflix.” – Joris Evers,
Director of Global Communications
It processes 450 billion events per day which flow to server side
applications and are directed to Apache Kafka.
Business Problem: A video streaming service with emphasis on data
quality, agility and availability. Using analytics to help users discover
movies and shows that they like is key to Netflix’s success.
• Streaming applications are long running tasks that need to be resilient in cloud
deployments,
• Optimize content buying
• Renowned personalization algorithms
Solution: Use Spark Streaming in the AWS cloud and Spark GraphX for the
recommender system
Reference: Talk by Netflix engineers at Apache Spark meetup
https://www.youtube.com/watch?v=gqgPtcDmLGs
25
Spark at Pinterest
Pinterest is using Apache Spark to discover trends in high value user
engagement data so that it can react to developing trends in real-
time by getting an in-depth understanding of user behaviour on the
website.
”Data driven decision making is in our company DNA” - Head of
Data Engineering at Pinterest, Krishna Gade
Business Problem: To provide a recommendation and visual
bookmarking tool that lets users discover, save and share ideas and to
get an immediate view of Pinterest engagement activity with high
throughput and minimal latency.
• Real time analytics to process user’s activity
• Process petabytes of data to provide recommendations and personalization
• Apply sophisticated deep learning techniques to a Pin image to suggest related Pins
Solution: Spark Streaming, Spark SQL and MemSQL’s Spark connector used
for real-time analytics, Spark MLlib for machine learning use cases
Reference: https://engineering.pinterest.com/blog/real-time-analytics-pinterest
26
Spark at ADAM, Big Data
Genomics project
“And that’s the promise of precision medicine -- delivering the
right treatments, at the right time, every time to the right
person.“ (President Obama, 2015 State of the Union address)
Business Problem: ADAM is a genomics analysis platform providing
large scale analyses to support population based genomics study that
is essential for precision medicine.
• Parallelize genomics analysis in the cloud
• Replace developing custom distributed computing code
• Support for file formats well suited to genomic data like Apache Parquet and Avro
• Support for languages like R, which are popular in the genomics community
Solution: Use Spark on Amazon EMR
Reference: https://github.com/bigdatagenomics/adam,
http://bdgenomics.org/,
https://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine
27
Spark at Yahoo
”CaffeOnSpark Open Sourced for Distributed Deep Learning on Big
Data Clusters” – Andy Feng, VP Architecture at Yahoo
Business Problem: Deep learning is critical for Yahoo’s product teams
to acquire intelligence from huge amounts of online data. Examples
are image recognition and speech recognition for improved search on
photo sharing service Flickr.
28
Spark Components
29
Spark Components
Apache Spark Core
All the functionality provided by Apache Spark is built on top of Spark Core.
It delivers speed by providing in-memory computation capability.
Spark Core is the foundation for parallel and distributed processing of
huge datasets.
30
Spark Components
The key features of Apache Spark Core are:
31
Spark Components
Spark Core is built around a special collection
called the RDD (resilient distributed dataset), one of Spark's core
abstractions. Spark partitions RDD data across all the nodes in a cluster
and holds it in the cluster's memory pool as a single logical unit.
There are two kinds of operations performed on RDDs: transformations and actions.
Transformation: a function that produces a new RDD from the existing RDDs.
Action: transformations only derive RDDs from each other; when we want to
work with the actual dataset and return a value to the driver, we use an
action.
Refer to the Spark RDD Transformations & Actions API guides and the
different ways to create RDDs in Spark to learn more.
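A minimal Scala sketch of these two kinds of operations (a hedged illustration: sc is an existing SparkContext and README.md is a hypothetical input file):

// Transformations build new RDDs lazily; nothing executes yet.
val lines = sc.textFile("README.md")                  // hypothetical input file
val sparkLines = lines.filter(_.contains("Spark"))    // transformation: RDD => RDD
val lengths = sparkLines.map(_.length)                // transformation

// Actions return a value to the driver and trigger the actual computation.
val n = sparkLines.count()                            // action
val totalChars = lengths.reduce(_ + _)                // action
println(s"$n lines mention Spark, $totalChars characters in total")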
32
Spark Components
Apache Spark SQL
The Spark SQL component is a distributed framework for structured data
processing.
Spark SQL works to access structured and semi-structured information.
It also enables powerful, interactive, analytical applications across both
streaming and historical data.
Spark SQL is the Spark module for structured data processing. Thus, it acts as
a distributed SQL query engine.
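A short illustrative Spark SQL sketch in Scala (people.json and its name/age columns are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlExample").getOrCreate()

// Load structured data into a DataFrame.
val people = spark.read.json("people.json")           // hypothetical input file

// The DataFrame API and plain SQL are interchangeable.
people.filter(people("age") > 21).select("name").show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()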
33
Spark Components
Apache Spark SQL : Features of Spark SQL include
34
Spark Components
Apache Spark Streaming
It is an add-on to core Spark API which allows scalable, high-
throughput, fault-tolerant stream processing of live data streams.
Spark can access data from sources like Kafka, Flume, Kinesis or TCP
sockets.
1. GATHERING
Spark Streaming provides two categories of built-in streaming sources:
Basic sources: sources available directly in the StreamingContext API, for
example file systems and socket connections.
Advanced sources: sources like Kafka, Flume, Kinesis, etc., available through
extra utility classes.
2. PROCESSING
The gathered data is processed using complex algorithms expressed with high-level
functions such as map, reduce, join and window.
3. DATA STORAGE
The Processed data is pushed out to file systems, databases, and live dashboards.
Spark Streaming also provides a high-level abstraction known as a discretized
stream, or DStream.
A DStream signifies a continuous stream of data. DStreams can be formed in two
ways: either from sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams. Internally, a DStream is a sequence
of RDDs.
36
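A minimal Scala sketch of the three steps above (the socket host and port are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1. Gathering: read lines from a TCP socket (a basic source).
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical host/port

// 2. Processing: high-level operations on the DStream.
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// 3. Data storage: here each batch of results is simply printed.
counts.print()

ssc.start()
ssc.awaitTermination()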
Spark Components
Apache Spark MLlib (Machine Learning
Library)
MLlib is Spark's scalable machine learning library, providing both high-quality
algorithms and high speed.
Its goal is to make practical machine learning scalable and easy.
Includes common machine learning algorithms such as classification,
regression, clustering and collaborative filtering.
Provides utilities for feature extraction, transformation,
dimensionality reduction and selection.
Provides tools for constructing Machine Learning pipelines, evaluating
and tuning them.
Supports persistence of models and pipelines.
Includes convenient utilities for linear algebra, statistics, data
handling etc.
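A small illustrative sketch of one MLlib algorithm (k-means clustering) using the DataFrame-based API; the sample points are made up:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()
import spark.implicits._

// Hypothetical two-dimensional points to cluster.
val points = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

// Feature extraction: assemble the raw columns into a feature vector.
val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
val data = assembler.transform(points)

// Clustering with k-means, one of MLlib's common algorithms.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)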
37
Spark Components
Apache Spark MLlib (Machine Learning
Library): Spark MLlib in 2.0
38
Spark Components
Apache Spark MLlib (Machine Learning
Library): Example use cases of Spark Mllib
Fraud detection: Spark streaming and Machine Learning applied to
prevent fraud,
Network Intrusion detection: Machine Learning applied to detect
cyber hacks,
Customer segmentation and personalization: Spark SQL and Machine
Learning applied to maximize customer lifetime value,
Real-time ad targeting: Spark used to maximize online ad revenues,
Predictive healthcare: Spark used to optimize healthcare costs,
Genomics analysis to provide precision medicine;
39
Spark Components
Spark ML Pipelines
Provides a high-level API that helps users create and tune practical
machine learning pipelines.
Allows combining multiple machine learning algorithms and utilities
into a single pipeline
Key concepts in the Pipeline API:
• DataFrame: the ML dataset; it can hold a variety of data types
• Transformer: an algorithm which can transform one DataFrame into
another DataFrame
• Estimator: an algorithm which can be fit on a DataFrame to produce
a Transformer
• Pipeline: a Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow
• Parameter: a common API for specifying parameters on all Transformers
and Estimators
40
Spark Components
Spark ML Pipeline: DataFrames
• Machine learning can be applied to a wide variety of data types such as
images, text, audio clips and numerical data. The DataFrame API supports
a variety of data types and is therefore well suited as the ML dataset
• Conceptually equivalent to a table in a relational database or a data
frame in R/Python
• Columns in a DataFrame are named
• Can be constructed from a wide array of sources such as: structured
data files, tables in Hive, external databases, or existing RDDs
• Contains richer optimizations under the hood for better performance
41
Spark Components
Spark ML Pipeline: Transformer
42
Spark Components
Spark ML Pipeline: Estimator
43
Spark Components
Spark ML Pipeline: Pipeline
44
Spark Components
Spark ML Pipeline: Parameters
45
Spark Components
Spark ML Pipeline Example
A text document classification pipeline has the following workflow:
Training flow: Input is a set of text documents where each document
is labelled. Stages while training the Machine Learning model are:
• Split each text document into words
• Convert each document’s words into a numerical feature vector
• Learn a prediction model using the feature vectors and labels
Test/Predict flow: Input is a set of text documents and the goal is to
predict a label for each document. Stages while testing or making
predictions with the machine learning model are:
• Split each text document into words
• Convert each document’s words into a numerical feature vector
• Use the trained model and make predictions on the feature vector
46
Spark Components
Spark ML Pipeline Training Time
Raw data is processed through the Tokenizer and HashingTF and then
fed into the LogisticRegression algorithm to create a
LogisticRegression Model
47
Spark Components
Spark ML Pipeline Testing Time
The raw data on which predictions need to be made is passed through the
same Tokenizer and HashingTF for feature extraction and then through the
LogisticRegression model to make predictions.
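A condensed Scala sketch of this text classification pipeline using the DataFrame-based Pipeline API; the sample documents and labels are made up:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

// Hypothetical labelled training documents: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "hadoop mapreduce batch jobs", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")      // split into words
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")  // words -> feature vector
val lr = new LogisticRegression().setMaxIter(10)                               // prediction model

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model: PipelineModel = pipeline.fit(training)        // training time

// Testing time: the same feature extraction is applied automatically.
val test = spark.createDataFrame(Seq(
  (2L, "streaming data with spark")
)).toDF("id", "text")
model.transform(test).select("id", "prediction").show()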
48
Spark Components
Spark ML Tuning
An important task in ML is model selection: using data to find the best
model or parameters for making predictions.
Tuning can be done on individual Estimators and on entire Pipelines.
Tools such as CrossValidator and TrainValidationSplit are used for tuning.
• At a high level, these model selection tools work as follows:
They split the input data into separate training and test datasets.
For each (training, test) pair, and for each set of ParamMaps, they
evaluate the model's performance using the Evaluator.
They select the Model produced by the best-performing set of parameters.
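A hedged Scala sketch of tuning with CrossValidator, reusing the hypothetical pipeline, hashingTF, lr and training values from the pipeline sketch above:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate parameter combinations (the ParamMaps).
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                               // tune the whole pipeline
  .setEvaluator(new BinaryClassificationEvaluator())    // the Evaluator
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                                       // 3 (training, test) pairs

// Fits every parameter combination and keeps the best-performing model.
val cvModel = cv.fit(training)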
49
Spark Components
Spark Streaming
Extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams
Data can be ingested from many sources like Kafka, Flume, Twitter,
ZeroMQ, Kinesis, or TCP sockets
Data can further be processed using MLlib, Graph processing or high
level functions like map, reduce, join, window etc
Processed data can be pushed out to file systems, databases and live
dashboards
50
Spark Components
Internal working of Spark Streaming:
Spark Streaming receives live input data streams and divides the data into
batches.
Batches of data are processed by the Spark engine to produce batches of
results.
A high-level abstraction called DStream (discretized stream) represents a
continuous stream of data.
Internally, a DStream is represented as a sequence of RDDs.
DStreams can be created either from input data streams from sources
such as Kafka, Flume, and Kinesis, or by applying high-level
operations on other DStreams.
51
Spark Components
Spark GraphX
GraphX is Apache Spark's API for graphs and graph-parallel computation.
It is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations.
It extends the Spark RDD API by introducing a new Graph abstraction: a
directed multigraph with properties attached to each vertex and edge.
Like RDDs, property graphs are immutable, distributed, and fault-tolerant.
Provides various operators for manipulating graphs (e.g., subgraph
and mapVertices).
Provides a library of common graph algorithms (e.g., PageRank and
triangle counting).
A growing collection of algorithms and builders simplifies graph
analytics tasks.
52
Spark Components
Spark GraphX
Flexibility
Seamlessly work with both graphs and collections.
GraphX unifies ETL, exploratory analysis, and iterative graph
computation within a single system. You can view the same data as
both graphs and collections, transform and join graphs with RDDs
efficiently, and write custom iterative graph algorithms using
the Pregel API.
53
Spark Components
Spark GraphX
Speed
Comparable performance to the fastest specialized graph processing
systems.
GraphX competes on performance with the fastest graph systems
while retaining Spark's flexibility, fault tolerance, and ease of use.
54
Spark Components
Spark GraphX
Algorithms
Choose from a growing library of graph algorithms.
In addition to a highly flexible API, GraphX comes with a variety of
graph algorithms, many of which were contributed by users.
55
Spark Components
Spark GraphX
Community
GraphX is developed as part of the Apache Spark project. It thus gets
tested and updated with each Spark release.
56
Spark Components
Spark GraphX Example:
Define vertex array and edge array
Construct RDDs from vertex and edge arrays
Build a Property graph using RDD of vertices and edges
Perform filter operation on graph. Example: list users at least 30
years of age
Use the graph.triplets view to display relationships in the graph
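A minimal Scala sketch of these steps, with made-up users and "follows" relationships (sc is an existing SparkContext):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Define the vertex array (id, (name, age)) and edge array.
val vertexArray = Array((1L, ("Alice", 28)), (2L, ("Bob", 35)), (3L, ("Charlie", 65)))
val edgeArray = Array(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows"))

// Construct RDDs from the vertex and edge arrays.
val vertexRDD: RDD[(VertexId, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[String]] = sc.parallelize(edgeArray)

// Build the property graph.
val graph: Graph[(String, Int), String] = Graph(vertexRDD, edgeRDD)

// Filter: list users at least 30 years of age.
graph.vertices.filter { case (_, (_, age)) => age >= 30 }
  .collect().foreach { case (_, (name, age)) => println(s"$name is $age") }

// Use the triplets view to display relationships.
graph.triplets.collect().foreach(t => println(s"${t.srcAttr._1} ${t.attr} ${t.dstAttr._1}"))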
57
Spark Components
Spark R:
SparkR is an R package that provides a light-weight frontend to use
Apache Spark from R.
SparkR provides a distributed data frame implementation that
supports operations like selection, filtering, aggregation etc. on large
datasets.
SparkR also supports distributed machine learning using MLlib.
SparkR uses the SparkDataFrame API; SparkDataFrames can be
constructed from a wide array of sources such as structured data
files, tables in Hive, external databases, or existing local R data
frames.
The SparkR shell can be invoked using the ./bin/sparkR command.
You can also connect to SparkR from RStudio or other R IDEs.
58
Spark Components
Spark R:
59
Spark Architecture
The Spark Stack:
60
Spark Architecture
Spark Cluster Managers
Cluster managers allocate resources across applications on a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications.
61
Spark Architecture
Spark Runtime Architecture
• In distributed mode, Spark uses a
master/slave architecture
• The master is called the “driver” and
the slaves are the “executors”
• Drivers and executors run in their own
Java processes
• A driver and its executors together
form a Spark application
• A Spark application is launched using
the cluster manager
62
Spark Architecture
SparkContext
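A minimal Scala sketch of initializing a SparkContext (the application name and the local[*] master are illustrative choices):

import org.apache.spark.{SparkConf, SparkContext}

// Configure the application name and master URL (local[*] runs locally on all cores).
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// sc is the entry point for creating RDDs, accumulators and broadcast variables.
val data = sc.parallelize(1 to 100)
println(data.reduce(_ + _))

sc.stop()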
63
Spark Architecture
The Driver
The process where the main() method of
the Spark program runs
Responsible for converting a user program
into tasks
Driver schedules the tasks on executors
Results from these tasks are delivered
back to the driver
64
Spark Architecture
Executors
65
Spark Architecture
Spark running on clusters
66
Spark Architecture
Execution steps in a Spark program
User submits an application that launches the driver program
Driver program invokes the main() method specified by the user
Driver program contacts cluster manager to ask for resources to
launch executors
Cluster manager launches executors
Driver program divides the user program into tasks and sends them to
the executors
Executors run the tasks, compute and save results, and return results
to the driver
When driver’s main() method exits or SparkContext.stop() is called,
executors are terminated and the cluster manager releases the
resources
67
Spark Architecture
Summary
Spark core is the base engine in the Spark stack. Spark API libraries
like Spark SQL, MLlib, Spark streaming, GraphX, etc. are built on top
of Spark core and inherit all of its features like fault tolerance and
scalability.
Spark core also integrates with pluggable cluster managers. In
addition to the standalone cluster manager built into Spark, Spark
core can integrate with Apache Mesos and Hadoop YARN.
Spark program is made up of a driver program and executor programs.
SparkContext is the entry point to everything Spark and is defined in
the driver program.
Driver programs divide and parallelize tasks and send them to the
executors.
Executors perform the tasks, save results to in-memory or disk
storage, and return results to the driver when the task is complete.
68
RDD: Resilient Distributed
Datasets
Resilient Distributed Dataset (aka RDD) is the primary data
abstraction in Apache Spark and the core of Spark (often referred
to as "Spark Core").
The origins of RDD
The original paper that gave birth to the concept of RDD is Resilient
Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing by Matei Zaharia, et al.
A RDD is a resilient and distributed collection of records spread
over one or many partitions.
Note
One could compare RDDs to collections in Scala, i.e. a RDD is
computed on many JVMs while a Scala collection lives on a single
JVM.
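A small Scala sketch contrasting a local collection with RDDs created from it and from external storage (the HDFS path is hypothetical; sc is an existing SparkContext):

// A local Scala collection lives on a single JVM...
val localNums = Seq(1, 2, 3, 4, 5)

// ...while an RDD built from it is partitioned across the cluster.
val numsRDD = sc.parallelize(localNums, 3)            // 3 partitions
println(numsRDD.partitions.length)                    // 3

// RDDs can also be created from external storage, e.g. a text file on HDFS.
val lines = sc.textFile("hdfs:///path/to/file.txt")   // hypothetical path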
69
RDD: Resilient Distributed
Datasets
RDDs let Spark hide data partitioning and distribution, which in turn
allowed the designers to build a parallel computational framework with a
higher-level programming interface (API) for four mainstream
programming languages.
The features of RDDs (decomposing the name):
Resilient, i.e. fault-tolerant with the help of RDD lineage graph and
so able to recompute missing or damaged partitions due to node
failures.
Distributed with data residing on multiple nodes in a cluster.
Dataset is a collection of partitioned data with primitive values or
values of values, e.g. tuples or other objects (that represent records
of the data you work with).
72
RDD: Resilient Distributed
Datasets
Types of RDDs:
These are some of the most interesting types of RDDs:
ParallelCollectionRDD
CoGroupedRDD
HadoopRDD is an RDD that provides core functionality for reading data
stored in HDFS using the older MapReduce API. The most notable use
case is the return RDD of SparkContext.textFile.
MapPartitionsRDD - a result of calling operations
like map, flatMap, filter, mapPartitions, etc.
CoalescedRDD - a result of repartition or coalesce transformations.
ShuffledRDD - a result of shuffling, e.g. after repartition or
coalesce transformations
73
RDD: Resilient Distributed
Datasets
Types of RDDs:
PipedRDD - an RDD created by piping elements to a forked external
process.
PairRDD (implicit conversion by PairRDDFunctions) that is an RDD of
key-value pairs that is a result of groupByKey and join operations.
DoubleRDD (implicit conversion
as org.apache.spark.rdd.DoubleRDDFunctions) that is an RDD
of Double type.
SequenceFileRDD (implicit conversion
as org.apache.spark.rdd.SequenceFileRDDFunctions) that is an RDD
that can be saved as a SequenceFile.
Appropriate operations of a given RDD type are automatically
available on a RDD of the right type, e.g. RDD[(Int, Int)], through
implicit conversion in Scala.
74
RDD: Resilient Distributed
Datasets
Transformations
Transformations are lazy operations on a RDD that create one or many new RDDs,
e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
transformation: RDD => RDD
transformation: RDD => Seq[RDD]
75
RDD: Resilient Distributed
Datasets
Transformations
There are two kinds of transformations:
Narrow transformations
Wide transformations
76
RDD: Resilient Distributed
Datasets
Narrow transformations
Narrow transformations are the result of operations such as map and filter,
where the data comes from a single partition only, i.e. each output partition
is self-contained.
An output RDD has partitions with records that originate from a single
partition in the parent RDD. Only a limited subset of partitions is used
to calculate the result.
Spark groups narrow transformations into a stage, which is called pipelining.
77
RDD: Resilient Distributed
Datasets
Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.
Note
Wide transformations are also called shuffle transformations as they
may or may not depend on a shuffle.
All of the tuples with the same key must end up in the same partition,
processed by the same task. To satisfy these operations, Spark must
execute RDD shuffle, which transfers data across cluster and results in
a new stage with a new set of partitions.
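A short Scala sketch contrasting the two kinds of transformations (the HDFS path and log format are hypothetical):

// Narrow transformations: each output partition depends on a single parent
// partition, so Spark pipelines them into one stage.
val lines = sc.textFile("hdfs:///path/to/logs")            // hypothetical path
val errors = lines.filter(_.contains("ERROR"))             // narrow
val pairs = errors.map(line => (line.split(" ")(0), 1))    // narrow

// Wide transformation: reduceByKey needs all values for a key in one partition,
// so it triggers a shuffle and starts a new stage.
val counts = pairs.reduceByKey(_ + _)                      // wide

counts.collect()   // action: triggers execution of the whole lineage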
78
RDD: Resilient Distributed
Datasets
Actions
An action is an operation that triggers execution of RDD
transformations and returns a value (to a Spark driver - the user
program).
Actions are RDD operations that produce non-RDD values. They
materialize a value in a Spark program. In other words, a RDD
operation that returns a value of any type but RDD[T] is an action.
action: RDD => a value
79
RDD: Resilient Distributed
Datasets
Actions
Actions in org.apache.spark.rdd.RDD:
aggregate
collect
count
countApprox*
countByValue*
first
fold
foreach
foreachPartition
max
min
reduce
saveAs* actions, e.g. saveAsTextFile, saveAsHadoopFile
take
takeOrdered
takeSample
toLocalIterator
top
treeAggregate
treeReduce
80
RDD: Resilient Distributed
Datasets
Actions
Actions run jobs using SparkContext.runJob or
directly via DAGScheduler.runJob.
scala> words.count
res0: Long = 502
81
RDD: Resilient Distributed
Datasets
Actions
AsyncRDDActions
AsyncRDDActions class offers asynchronous actions that you can use on
RDDs (thanks to the implicit conversion rddToAsyncRDDActions in RDD
class). The methods return a FutureAction.
The following asynchronous methods are available:
countAsync
collectAsync
takeAsync
foreachAsync
foreachPartitionAsync
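A small sketch of one asynchronous action in Scala (the example RDD is made up; sc is an existing SparkContext):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val rdd = sc.parallelize(1 to 1000000)

// countAsync returns a FutureAction instead of blocking the driver thread.
val futureCount = rdd.countAsync()
futureCount.onComplete {
  case Success(n)  => println(s"count = $n")
  case Failure(ex) => println(s"count failed: ${ex.getMessage}")
}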
82