PySpark Tutorial
About the Tutorial
This is an introductory tutorial that covers the basics of PySpark and explains how to deal with its various components and sub-components.
Audience
This tutorial is prepared for professionals who aspire to build a career with a programming language and a real-time processing framework. It is intended to make readers comfortable with getting started with PySpark along with its various modules and submodules.
Prerequisites
Before proceeding with the various concepts given in this tutorial, it is assumed that the readers are already aware of what a programming language and a framework are. In addition, it will be very helpful if the readers have a sound knowledge of Apache Spark, Apache Hadoop, the Scala programming language, the Hadoop Distributed File System (HDFS) and Python.
Copyright and Disclaimer
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents, or any part of the contents, of this e-book in any manner without the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at [email protected]
Table of Contents
About the Tutorial
Audience
Prerequisites
Copyright and Disclaimer
1. PySpark – Introduction
2. PySpark – Environment Setup
3. PySpark – SparkContext
4. PySpark – RDD
5. PySpark – Broadcast & Accumulator
6. PySpark – SparkConf
7. PySpark – SparkFiles
8. PySpark – StorageLevel
9. PySpark – MLlib
10. PySpark – Serializers
1. PySpark – Introduction
In this chapter, we will get ourselves acquainted with what Apache Spark is and how PySpark was developed.
Spark – Overview
Apache Spark is a lightning-fast real-time processing framework. It performs in-memory computations to analyze data in real time. It came into the picture because Apache Hadoop MapReduce performs only batch processing and lacks a real-time processing capability. Hence, Apache Spark was introduced, as it can perform stream processing in real time and can also take care of batch processing.
Apart from real-time and batch processing, Apache Spark also supports interactive queries and iterative algorithms. Apache Spark has its own cluster manager, where it can host its applications. It leverages Apache Hadoop for both storage and processing: it uses HDFS (Hadoop Distributed File System) for storage, and it can run Spark applications on YARN as well.
PySpark – Overview
Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, you can work with RDDs in the Python programming language as well. This is made possible by a library called Py4J.
PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the SparkContext. The majority of data scientists and analytics experts today use Python because of its rich library set, so integrating Python with Spark is a boon to them.
2. PySpark – Environment Setup
Note: This chapter assumes that you already have Java and Scala installed on your computer.
Let us now download and set up PySpark with the following steps.
Step 1: Go to the official Apache Spark download page and download the latest version
of Apache Spark available there. In this tutorial, we are using spark-2.1.0-bin-
hadoop2.7.
Step 2: Now, extract the downloaded Spark tar file. By default, it will be downloaded in the Downloads directory. Before starting PySpark, set the following environment variables so that the shell can find your Spark installation:
export SPARK_HOME=/home/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin
Or, to set the above environment variables globally, put them in the .bashrc file. Then run the following command for the environment variables to take effect.
# source .bashrc
Now that we have all the environment variables set, let us go to the Spark directory and invoke the PySpark shell by running the following command:
# ./bin/pyspark
The PySpark shell starts and displays a welcome banner showing the Spark version (2.1.0 here), followed by the Python prompt.
3. PySpark – SparkContext
SparkContext is the entry point to any Spark functionality. When we run any Spark application, a driver program starts, which has the main function, and your SparkContext gets initiated there. The driver program then runs the operations inside the executors on worker nodes.
The following code block has the details of a PySpark class and the parameters, which a
SparkContext can take.
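As a rough sketch (this mirrors the public constructor of pyspark.SparkContext in Spark 2.x; the defaults shown should be verified against your own Spark version), the class looks like this:
class pyspark.SparkContext (
   master = None,
   appName = None,
   sparkHome = None,
   pyFiles = None,
   environment = None,
   batchSize = 0,
   serializer = PickleSerializer(),
   conf = None,
   gateway = None,
   jsc = None,
   profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)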
Parameters
Following are the parameters of a SparkContext:
Master – The URL of the cluster it connects to.
appName – The name of your job.
sparkHome – The Spark installation directory.
pyFiles – The .zip or .py files to send to the cluster and add to the PYTHONPATH.
Environment – Worker node environment variables.
batchSize – The number of Python objects represented as a single Java object; set 1 to disable batching, 0 to choose the batch size automatically based on object sizes, or -1 to use an unlimited batch size.
Serializer – The RDD serializer.
Conf – An object of SparkConf to set all the Spark properties.
Gateway – Use an existing gateway and JVM; otherwise, a new JVM is initialized.
JSC – The JavaSparkContext instance.
profiler_cls – A class of custom profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).
Among the above parameters, master and appName are mostly used. The first two lines of any PySpark program look as shown below:
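A minimal sketch (the master URL and application name here are only illustrative):
from pyspark import SparkContext
sc = SparkContext("local", "First App")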
Note: When the PySpark shell starts, Spark automatically creates a SparkContext object named sc, so you do not create one yourself in the shell. If you try to create another SparkContext object, you will get the following error: "ValueError: Cannot run multiple SparkContexts at once". In a standalone script such as the following example, however, the SparkContext is created explicitly.
-----------------------------------------------firstapp.py-----------------------------------------------
from pyspark import SparkContext
logFile = "file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md"
sc = SparkContext("local", "first app")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
-----------------------------------------------firstapp.py-----------------------------------------------
Then we will execute the following command in the terminal to run this Python file, which produces the output shown below.
$SPARK_HOME/bin/spark-submit firstapp.py
Output: Lines with a: 62, lines with b: 30
4. PySpark – RDD
Now that we have installed and configured PySpark on our system, we can program in Python on Apache Spark. However, before doing so, let us understand a fundamental concept in Spark: RDD.
RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple nodes to do parallel processing on a cluster. RDDs are immutable elements, which means that once you create an RDD, you cannot change it. RDDs are fault tolerant as well; hence, in case of any failure, they recover automatically. You can apply multiple operations on these RDDs to achieve a certain task.
To apply operations on these RDDs, there are two ways: Transformation and Action.
Transformation: These are the operations that are applied on an RDD to create a new RDD. filter, groupBy and map are examples of transformations.
Action: These are the operations that are applied on an RDD and instruct Spark to perform the computation and send the result back to the driver.
To apply any operation in PySpark, we need to create a PySpark RDD first. The following code block has the details of a PySpark RDD class:
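As a reference sketch (this mirrors the constructor of pyspark.RDD in Spark 2.x; in practice you never call it directly, since RDDs are created through the SparkContext):
class pyspark.RDD (
   jrdd,
   ctx,
   jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)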
Let us see how to run a few basic operations using PySpark. The following code in a Python file creates the RDD words, which stores the set of words mentioned below.
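For reference, the words RDD used throughout the following examples is created like this (the same line appears in each example file):
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)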
count()
Number of elements in the RDD is returned.
-----------------------------------------------count.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
counts = words.count()
print("Number of elements in RDD -> %i" % counts)
-----------------------------------------------count.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit count.py
collect()
All the elements in the RDD are returned.
-----------------------------------------------collect.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Collect app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
coll = words.collect()
print("Elements in RDD -> %s" % coll)
-----------------------------------------------collect.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit collect.py
foreach(f)
foreach applies the given function to each element of the RDD; it does not return a value to the driver. In the following example, we call a print function inside foreach, which prints all the elements in the RDD.
-----------------------------------------------foreach.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "ForEach app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
def f(x): print(x)
words.foreach(f)
-----------------------------------------------foreach.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit foreach.py
scala
java
hadoop
spark
akka
spark vs hadoop
pyspark
pyspark and spark
filter(f)
A new RDD is returned containing only the elements that satisfy the function passed to filter. In the following example, we keep only the strings containing 'spark'.
-----------------------------------------------filter.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Filter app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.collect()
print("Filtered RDD -> %s" % filtered)
-----------------------------------------------filter.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit filter.py
Filtered RDD -> ['spark', 'spark vs hadoop', 'pyspark', 'pyspark and spark']
map(f, preservesPartitioning=False)
A new RDD is returned by applying a function to each element in the RDD. In the following example, we form a key-value pair by mapping every string to the value 1.
-----------------------------------------------map.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Map app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
words_map = words.map(lambda x: (x, 1))
mapping = words_map.collect()
print("Key value pair -> %s" % mapping)
-----------------------------------------------map.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit map.py
Key value pair -> [('scala', 1), ('java', 1), ('hadoop', 1), ('spark', 1),
('akka', 1), ('spark vs hadoop', 1), ('pyspark', 1), ('pyspark and spark', 1)]
reduce(f)
After performing the specified commutative and associative binary operation, the elements of the RDD are reduced to a single value, which is returned. In the following example, we import add from the operator module and apply it on 'nums' to carry out a simple addition operation.
-----------------------------------------------reduce.py-----------------------------------------------
from pyspark import SparkContext
from operator import add
sc = SparkContext("local", "Reduce app")
nums = sc.parallelize([1, 2, 3, 4, 5])
adding = nums.reduce(add)
print("Adding all the elements -> %i" % adding)
-----------------------------------------------reduce.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit reduce.py
join(other, numPartitions=None)
It returns an RDD containing pairs of elements with matching keys, along with all the values for those keys. In the following example, there are two pairs of elements in two different RDDs. After joining these two RDDs, we get an RDD with elements having matching keys and their values.
-----------------------------------------------join.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print("Join RDD -> %s" % final)
-----------------------------------------------join.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit join.py
cache()
Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the
RDD is cached or not.
-----------------------------------------------cache.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Cache app")
words = sc.parallelize(
   ["scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"]
)
words.cache()
caching = words.persist().is_cached
print("Words got cached -> %s" % caching)
-----------------------------------------------cache.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit cache.py
These were some of the most important operations that are done on PySpark RDD.
5. PySpark – Broadcast & Accumulator
For parallel processing, Apache Spark uses shared variables. When the driver sends a task to the executors on the cluster, a copy of the shared variable goes to each node of the cluster, so that it can be used while performing the tasks.
There are two types of shared variables supported by Apache Spark: Broadcast and Accumulator.
Broadcast
Broadcast variables are used to save a copy of data across all nodes. This variable is cached on all the machines rather than being shipped to the machines with every task. The following code block has the details of the Broadcast class for PySpark.
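As a rough sketch of that class (mirroring the pyspark.Broadcast constructor in Spark 2.x; normally you create a broadcast variable with sc.broadcast(...) rather than calling this directly):
class pyspark.Broadcast (
   sc = None,
   value = None,
   pickle_registry = None,
   path = None
)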
The following example shows how to use a Broadcast variable. A Broadcast variable has
an attribute called value, which stores the data and is used to return a broadcasted value.
-----------------------------------------------broadcast.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Broadcast app")
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
data = words_new.value
print("Stored data -> %s" % data)
elem = words_new.value[2]
print("Printing a particular element -> %s" % elem)
-----------------------------------------------broadcast.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit broadcast.py
Accumulator
Accumulator variables are used for aggregating the information through associative and
commutative operations. For example, you can use an accumulator for a sum operation
or counters (in MapReduce). The following code block has the details of an Accumulator
class for PySpark.
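A rough sketch of that class (based on the pyspark.Accumulator constructor; you normally obtain an accumulator via sc.accumulator(...)):
class pyspark.Accumulator (
   aid,
   value,
   accum_param
)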
The following example shows how to use an Accumulator variable. An Accumulator variable has an attribute called value, similar to a broadcast variable; it stores the data and is used to return the accumulator's value, but it is usable only in the driver program.
-----------------------------------------------accumulator.py-----------------------------------------------
from pyspark import SparkContext
sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(10)
def f(x):
    global num
    num += x
rdd = sc.parallelize([20, 30, 40, 50])
rdd.foreach(f)
final = num.value
print("Accumulated value is -> %i" % final)
-----------------------------------------------accumulator.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit accumulator.py
6. PySpark – SparkConf
To run a Spark application on a local machine or on a cluster, you need to set a few configurations and parameters; this is what SparkConf helps with. It provides the configurations needed to run a Spark application. The following code block has the details of the SparkConf class for PySpark.
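A rough sketch of that class (mirroring the pyspark.SparkConf constructor in Spark 2.x; the underscore-prefixed parameters are internal):
class pyspark.SparkConf (
   loadDefaults = True,
   _jvm = None,
   _jconf = None
)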
Initially, we will create a SparkConf object with SparkConf(), which will load the values from the spark.* Java system properties as well. Now you can set different parameters using the SparkConf object, and these parameters will take priority over the system properties.
In the SparkConf class, there are setter methods, which support chaining. For example, you can write conf.setAppName("PySpark App").setMaster("local"). Once we pass a SparkConf object to Apache Spark, it cannot be modified by any user.
Let us consider the following example of using SparkConf in a PySpark program. In this example, we set the Spark application name to PySpark App and set the master URL for the Spark application to spark://master:7077.
The following code block contains the lines that, when added to a Python file, set the basic configurations for running a PySpark application.
----------------------------------------------------------------------------------------------------
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)
----------------------------------------------------------------------------------------------------
7. PySpark – SparkFiles
In Apache Spark, you can upload your files using sc.addFile (sc is your default SparkContext) and get the path of a file on the workers using SparkFiles.get. Thus, SparkFiles resolves the paths to files added through SparkContext.addFile().
SparkFiles contains the following class methods:
get(filename)
It specifies the path of the file that is added through SparkContext.addFile().
getRootDirectory()
It specifies the path to the root directory, which contains the files that are added through SparkContext.addFile().
-----------------------------------------------sparkfile.py-----------------------------------------------
from pyspark import SparkContext
from pyspark import SparkFiles
finddistance = "/home/hadoop/examples_pyspark/finddistance.R"
finddistancename = "finddistance.R"
sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
-----------------------------------------------sparkfile.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit sparkfile.py
8. PySpark – StorageLevel
StorageLevel decides how an RDD should be stored. In Apache Spark, StorageLevel decides whether an RDD should be stored in memory, on disk, or both. It also decides whether to serialize the RDD and whether to replicate the RDD partitions.
To decide the storage of an RDD, there are different storage levels, such as DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2 and OFF_HEAP.
Let us consider the following example of StorageLevel, where we use the storage level MEMORY_AND_DISK_2, which means the RDD partitions are stored in memory and on disk, with a replication factor of 2.
-----------------------------------------------storagelevel.py-----------------------------------------------
from pyspark import SparkContext
import pyspark
sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
-----------------------------------------------storagelevel.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit storagelevel.py
9. PySpark – MLlib
Apache Spark offers a machine learning API called MLlib. PySpark has this machine learning API in Python as well. It supports different kinds of algorithms, spread across packages such as mllib.classification, mllib.clustering, mllib.fpm (frequent pattern mining), mllib.linalg, mllib.recommendation and mllib.regression.
There are other algorithms, classes and functions as part of the mllib package as well. For now, let us walk through a demonstration of pyspark.mllib.
The following example applies collaborative filtering using the ALS (Alternating Least Squares) algorithm to build a recommendation model and evaluate it on the training data. The ratings it uses are stored in a small file, test.data, shown below.
-----------------------------------------------test.data-----------------------------------------------
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0
-----------------------------------------------test.data-----------------------------------------------
-----------------------------------------------recommend.py-----------------------------------------------
from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

if __name__ == "__main__":
    sc = SparkContext(appName="PySpark mllib Example")

    # Load and parse the ratings data
    data = sc.textFile("test.data")
    ratings = data.map(lambda l: l.split(',')) \
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on the training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
    print("Mean Squared Error = " + str(MSE))

    # Save and reload the model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
-----------------------------------------------recommend.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit recommend.py
10. PySpark – Serializers
Serialization is used for performance tuning on Apache Spark. All data that is sent over the network, written to disk, or persisted in memory should be serialized. Serialization therefore plays an important role in these costly operations.
PySpark supports custom serializers for performance tuning. The following two serializers
are supported by PySpark:
MarshalSerializer
Serializes objects using Python’s Marshal Serializer. This serializer is faster than
PickleSerializer, but supports fewer datatypes.
class pyspark.MarshalSerializer
PickleSerializer
Serializes objects using Python’s Pickle Serializer. This serializer supports nearly any
Python object, but may not be as fast as more specialized serializers.
class pyspark.PickleSerializer
Let us see an example of PySpark serialization. Here, we serialize the data using MarshalSerializer.
-----------------------------------------------serializing.py-----------------------------------------------
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "serialization app", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()
-----------------------------------------------serializing.py-----------------------------------------------
$SPARK_HOME/bin/spark-submit serializing.py