Learning Spark Programming Basics: Introduction to RDDs
In This Chapter:
Resilient Distributed Datasets (RDDs)
How to load data into Spark RDDs
Transformations and actions on RDDs
How to perform operations on multiple RDDs
Now that we’ve covered Spark’s runtime architecture and how to deploy Spark,
it’s time to learn Spark programming in Python, starting with the basics. This
chapter discusses a foundational concept in Spark programming and execution:
Resilient Distributed Datasets (RDDs). You will also learn how to work with the
Spark API, including the fundamental Spark transformations and actions and
their usage. This is an intense chapter, but by the end of it, you will have all the
basic building blocks you need to create any Spark application.
Introduction to RDDs
A Resilient Distributed Dataset (RDD) is the most fundamental data object used
in Spark programming. RDDs are datasets within a Spark application, including
the initial dataset(s) loaded, any intermediate dataset(s), and the final resultant
dataset(s). Most Spark applications load an RDD with external data and then
create new RDDs by performing operations on the existing RDDs; these
operations are transformations. This process is repeated until an output operation
is ultimately required—for instance, to write the results of an application to a
filesystem; these types of operations are actions.
RDDs are essentially distributed collections of objects that represent the data
used in Spark programs. In the case of PySpark, RDDs consist of distributed
Python objects, such as lists, tuples, and dictionaries. Objects within RDDs, such
as elements in a list, can be of any object type, including primitive data types
such as integers, floating point numbers, and strings, as well as complex types
such as tuples, dictionaries, or other lists. If you are using the Scala or Java
APIs, RDDs consist of collections of Scala and Java objects, respectively.
Although there are options for persisting RDDs to disk, RDDs are predominantly
stored in memory, or at least they are intended to be stored in memory. Because
one of the initial uses for Spark was to support machine learning, Spark’s RDDs
provided a restricted form of shared memory that could make efficient reuse of
data for successive and iterative operations.
Moreover, one of the main downsides of Hadoop’s implementation of
MapReduce was its persistence of intermediate data to disk and the copying of
this data between nodes at runtime. Although the MapReduce distributed
processing method of sharing data did provide resiliency and fault tolerance, it
was at the cost of latency. This design limitation was one of the major catalysts
for the Spark project. As data volumes increased along with the necessity for
real-time data processing and insights, Spark’s mainly in-memory processing
framework based on RDDs grew in popularity.
The term Resilient Distributed Dataset is an accurate and succinct descriptor for
the concept. Here’s how it breaks down:
Resilient: RDDs are resilient, meaning that if a node performing an
operation in Spark is lost, the dataset can be reconstructed. This is because
Spark knows the lineage of each RDD, which is the sequence of steps to
create the RDD.
Distributed: RDDs are distributed, meaning the data in RDDs is divided
into one or many partitions and distributed as in-memory collections of
objects across Worker nodes in the cluster. As mentioned earlier in this
chapter, RDDs provide an effective form of shared memory to exchange
data between processes (Executors) on different nodes (Workers).
Dataset: RDDs are datasets that consist of records. Records are uniquely
identifiable data collections within a dataset. A record can be a collection of
fields similar to a row in a table in a relational database, a line of text in a
file, or multiple other formats. RDDs are created such that each partition
contains a unique set of records and can be operated on independently. This
is an example of the shared nothing approach discussed in Chapter 1.
Another key property of RDDs is their immutability, which means that after they
are instantiated and populated with data, they cannot be updated. Instead, new
RDDs are created by performing transformations such as map or filter
functions, discussed later in this chapter, on existing RDDs.
Actions are the other operations performed on RDDs. Actions produce output
that can be in the form of data from an RDD returned to a Driver program, or
they can save the contents of an RDD to a filesystem (local, HDFS, S3, or
other). There are many other actions as well, including returning a count of the
number of records in an RDD.
Listing 4.1 shows a sample Spark program loading data into an RDD, creating a
new RDD using a filter transformation, and then using an action to save the
resultant RDD to disk. We will look at each of these operations in this chapter.
You can find more detail about RDD concepts in the University of California,
Berkeley, paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing,”
https://amplab.cs.berkeley.edu/publication/resilient-distributed-datasets-a-fault-tolerant-abstraction-for-in-memory-cluster-computing/.
Loading Data into RDDs
RDDs are effectively created after they are populated with data. This can be the
result of transformations on an existing RDD being written into a new RDD as
part of a Spark program.
To start any Spark routine, you need to initialize at least one RDD with data
from an external source. This initial RDD is then used to create further
intermediate RDDs or the final RDD through a series of transformations and
actions. The initial RDD can be created in several ways, including the following:
Loading data from a file or files
Loading data from a data source, such as a SQL or NoSQL datastore
Loading data programmatically
Loading data from a stream, as discussed in Chapter 7, “Stream Processing
and Messaging Using Spark”
Files in HDFS, for example, are referenced using the hdfs:// URI scheme, as in hdfs://hdfs_path.
You can use text files and the methods described in the following sections to
create RDDs.
textFile()
Syntax:
sc.textFile(name, minPartitions=None, use_unicode=True)
Listing 4.2 provides examples of the textFile() method loading the data
from the HDFS directory pictured in Figure 4.2.
In each of the examples in Listing 4.2, an RDD named logs is created, with
each line of the file represented as a record.
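As a minimal sketch (the HDFS path and minPartitions value below are placeholders, not the directory from Figure 4.2), loading a directory of text files looks like this:
# hedged sketch: the path and partition count are placeholder assumptions
logs = sc.textFile("hdfs:///path/to/logs/", minPartitions=4)
logs.count()
# returns the total number of lines across all files read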
wholeTextFiles()
Syntax:
sc.wholeTextFiles(path, minPartitions=None, use_unicode=True)
# now let's perform a similar exercise using the same directory
# using the wholeTextFiles() method instead
licensefile_pairs = sc.wholeTextFiles("file:///opt/spark/licenses/")
# inspect the object created
licensefile_pairs
# returns:
# org.apache.spark.api.java.JavaPairRDD@3f714d2d
licensefile_pairs.take(1)
# returns the first key/value pair as a list of tuples, with the key being each file
# and the value being the entire contents of that file:
# [(u'file:/opt/spark/licenses/LICENSE-scopt.txt', u'The MIT License (MIT)\n...)..]
licensefile_pairs.getNumPartitions()
# this method will create a single partition (1) containing key/value pairs for each
# file in the directory
licensefile_pairs.count()
# this action will count the number of files or key/value pairs
# in this case the return value is 36
# download the latest jdbc connector for your target database, include as follows:
$SPARK_HOME/bin/pyspark \
--driver-class-path mysql-connector-java-5.*-bin.jar \
--master local
read.jdbc()
Syntax:
spark.read.jdbc(url, table,
column=None,
lowerBound=None,
upperBound=None,
numPartitions=None,
predicates=None,
properties=None)
The url and table arguments specify the target database and table to read.
The column argument specifies the column, preferably of a long or int datatype, that is used to split the read into the number of partitions specified by numPartitions. The upperBound and lowerBound arguments are used
in conjunction with the column argument to assist Spark in creating the
partitions. These represent the minimum and maximum values for the specified
column in the source table. If any one of these arguments is supplied with the
read.jdbc() function, all must be supplied.
The optional argument predicates allows for including WHERE conditions to
filter unneeded records while loading partitions. You can use the properties
argument to pass parameters to the JDBC API, such as the database user
credentials; if supplied, this argument must be a Python dictionary, a set of
name/value pairs representing the various configuration options.
Listing 4.6 shows the creation of an RDD using the read.jdbc() method.
employeesdf = spark.read.jdbc(url="jdbc:mysql://localhost:3306/employees",
                              table="employees", column="emp_no",
                              numPartitions="2", lowerBound="10001",
                              upperBound="499999",
                              properties={"user":"<user>","password":"<pwd>"})
employeesdf.rdd.getNumPartitions()
# should return 2 as we specified numPartitions=2
sqlContext.registerDataFrameAsTable(employeesdf, "employees")
df2 = spark.sql("SELECT emp_no, first_name, last_name FROM employees LIMIT 2")
df2.show()
# will return a 'pretty printed' result set similar to:
#+------+----------+---------+
#|emp_no|first_name|last_name|
#+------+----------+---------+
#| 10001| Georgi| Facello|
#| 10002| Bezalel| Simmel|
#+------+----------+---------+
read.json()
Syntax:
spark.read.json(path, schema=None)
The path argument specifies the full path to the JSON file you are using as a
data source. You can use the optional schema argument to specify the target
schema for creating the DataFrame.
Consider a JSON file named people.json that contains the names and,
optionally, the ages of people. This file happens to be in the examples
directory of the Spark installation, as shown in Figure 4.4.
Listing 4.8 demonstrates the creation of a DataFrame named people using the
people.json file.
people = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
# inspect the object created
people
# notice that a DataFrame is created which includes the following schema:
# DataFrame[age: bigint, name: string]
# this schema was inferred from the object
people.dtypes
# the dtypes method returns the column names and datatypes; in this case it returns:
# [('age', 'bigint'), ('name', 'string')]
people.show()
# you should see the following output
#+----+-------+
#| age| name|
#+----+-------+
#|null|Michael|
#| 30| Andy|
#| 19| Justin|
#+----+-------+
# as with all DataFrames, you can use them to run SQL queries as follows
sqlContext.registerDataFrameAsTable(people, "people")
df2 = spark.sql("SELECT name, age FROM people WHERE age > 20")
df2.show()
# you should see the resultant output below
#+----+---+
#|name|age|
#+----+---+
#|Andy| 30|
#+----+---+
parallelize()
Syntax:
sc.parallelize(c, numSlices=None)
The parallelize() method assumes that you have a list created already and
that you supply it as the c (for collection) argument. The numSlices argument
specifies the desired number of partitions to create. An example of the
parallelize() method is shown in Listing 4.9.
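As a minimal sketch (the list contents and variable names below are illustrative, not those of Listing 4.9), parallelize() distributes a local Python collection as an RDD:
# hedged sketch of parallelize(); the data is illustrative
parallelrdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8], numSlices=2)
parallelrdd.count()
# returns 9
parallelrdd.getNumPartitions()
# returns 2, as requested by the numSlices argument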
range()
Syntax:
sc.range(start, end=None, step=1, numSlices=None)
The range() method generates a list for you and creates and distributes the
RDD. The start, end, and step arguments define the sequence of values,
and numSlices specifies the desired number of partitions. An example of the
range() method is shown in Listing 4.10.
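As a minimal sketch (the values below are illustrative, not those of Listing 4.10), range() generates and distributes a sequence of integers:
# hedged sketch of range(); generates 0, 2, 4, ..., 998 across two partitions
range_rdd = sc.range(0, 1000, 2, numSlices=2)
range_rdd.count()
# returns 500
range_rdd.take(3)
# returns [0, 2, 4]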
Operations on RDDs
Now that you have learned how to create RDDs from files in various
filesystems, from relational data sources, and programmatically, let’s look at the
types of operations you can perform against RDDs and some of the key RDD
concepts.
Spark uses lazy evaluation, also called lazy execution, in processing Spark
programs. Lazy evaluation defers processing until an action is called (that is,
when output is required). This is easily demonstrated using an interactive shell,
where you can enter one or more transformation methods to RDDs one after the
other without any processing starting. Instead, each statement is parsed for
syntax and object references only. After requesting an action such as count()
or saveAsTextFile(), a DAG is created along with logical and physical
execution plans. The Driver then orchestrates and manages these plans across
Executors.
This lazy evaluation allows Spark to combine operations where possible, thereby
reducing processing stages and minimizing the amount of data transferred
between Spark Executors in a process called shuffling.
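The following is a minimal sketch of this behavior in the pyspark shell, reusing the licenses directory from the earlier examples; no data is read or processed until the final count() action is called:
licenses = sc.textFile('file:///opt/spark/licenses')  # no processing occurs yet
words = licenses.flatMap(lambda x: x.split(' '))      # still no processing
longwords = words.filter(lambda x: len(x) > 12)       # still no processing
longwords.count()
# the count() action triggers evaluation of the entire chain of transformations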
After a request to persist the RDD using the persist() method (note that
there is a similar cache() method as well), the RDD remains in memory on all
the nodes in the cluster where it is computed after the first action called on it.
You can see the persisted RDD in your Spark application UI in the Storage tab,
as shown in Figure 4.5.
RDD Lineage
Spark keeps track of each RDD’s lineage—that is, the sequence of
transformations that resulted in the RDD. As discussed previously, every RDD
operation recomputes the entire lineage by default unless RDD persistence is
requested.
In an RDD’s lineage, each RDD has a parent RDD and/or a child RDD. Spark
creates a directed acyclic graph (DAG) consisting of dependencies between
RDDs. RDDs are processed in stages, which are sets of transformations. RDDs
and stages have dependencies that can be narrow or wide.
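You can inspect an RDD's lineage from the pyspark shell by using the toDebugString() method; the following is a minimal sketch reusing the licenses directory from the earlier examples:
licenses = sc.textFile('file:///opt/spark/licenses')
longwords = licenses.flatMap(lambda x: x.split(' ')) \
                    .filter(lambda x: len(x) > 12)
longwords.toDebugString()
# returns a textual description of the RDD's lineage: the chain of parent
# RDDs (and their dependencies) Spark would use to recompute this RDD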
Narrow dependencies, or narrow operations, are characterized by the following
traits:
Operations can collapse into a single stage; for instance, a map() and
filter() operation against elements in the same dataset can be
processed in a single pass of each element in the dataset.
Only one child RDD depends on the parent RDD; for instance, an RDD is
created from a text file (the parent RDD), with one child RDD to perform
the set of transformations in one stage.
No shuffling of data between nodes is required.
Narrow operations are preferred because they maximize parallel execution and
minimize shuffling, which is quite expensive and can be a bottleneck.
Wide dependencies, or wide operations, in contrast, have the following traits:
Operations define new stages and often require shuffling.
RDDs have multiple dependencies; for instance, a join() operation
(covered shortly) requires an RDD to be dependent upon two or more
parent RDDs.
You can avert long recovery periods for complex processing operations by
checkpointing, or saving the data to a persistent file-based object. (Chapter 5,
“Advanced Programming Using the Spark Core API,” discusses checkpointing.)
Types of RDDs
Aside from the base RDD class that contains members (properties or attributes
and functions) common to all RDDs, there are some specific RDD
implementations that enable additional operators and functions. These additional
RDD types include the following:
PairRDD: An RDD of key/value pairs. You have already seen this type of
RDD as it is automatically created by using the wholeTextFiles()
method.
DoubleRDD: An RDD consisting of a collection of double values only.
Because the values are of the same numeric type, several additional
statistical functions are available, including mean(), sum(), stdev(),
variance(), and histogram(), among others.
DataFrame (formerly known as SchemaRDD): A distributed collection
of data organized into named and typed columns. A DataFrame is
equivalent to a relational table in Spark SQL. DataFrames originated with
the read.jdbc() and read.json() functions discussed earlier.
SequenceFileRDD: An RDD created from a SequenceFile, either
compressed or uncompressed.
HadoopRDD: An RDD that provides core functionality for reading data
stored in HDFS using the v1 MapReduce API.
NewHadoopRDD: An RDD that provides core functionality for reading
data stored in Hadoop—for example, files in HDFS, sources in HBase, or
S3—using the new MapReduce API (org.apache.hadoop.mapreduce).
CoGroupedRDD: An RDD that cogroups its parents. For each key in
parent RDDs, the resulting RDD contains a tuple with the list of values for
that key. (We will discuss the cogroup() function later in this chapter.)
JdbcRDD: An RDD resulting from a SQL query to a JDBC connection. It
is available in the Scala API only.
PartitionPruningRDD: An RDD used to prune RDD partitions or other
partitions to avoid launching tasks on all partitions. For example, if you
know the RDD is partitioned by range, and the execution DAG has a filter
on the key, you can avoid launching tasks on partitions that don’t have the
range covering the key.
ShuffledRDD: The resulting RDD from a shuffle, such as repartitioning of
data.
UnionRDD: An RDD resulting from a union() operation against two or
more RDDs.
map()
Syntax:
RDD.map(<function>, preservesPartitioning=False)
flatMap()
Syntax:
RDD.flatMap(<function>, preservesPartitioning=False)
filter()
Syntax:
RDD.filter(<function>)
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
words.take(5)
# returns [u'The', u'MIT', u'License', u'(MIT)', u'']
lowercase = words.map(lambda x: x.lower())
lowercase.take(5)
# returns [u'the', u'mit', u'license', u'(mit)', u'']
longwords = lowercase.filter(lambda x: len(x) > 12)
longwords.take(2)
# returns [u'documentation', u'merchantability,']
There is a standard axiom in the world of Big Data programming: “Filter early,
filter often.” This refers to the fact that there is no value in carrying records or
fields through a process where they are not needed. Both the filter() and
map() functions can be used to achieve this objective. That said, in many cases
Spark—through its key runtime characteristic of lazy execution—attempts to
optimize routines for you even if you do not explicitly do this yourself.
distinct()
Syntax:
RDD.distinct(numPartitions=None)
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
lowercase = words.map(lambda x: x.lower())
allwords = lowercase.count()
distinctwords = lowercase.distinct().count()
print "Total words: %s, Distinct Words: %s" % (allwords, distinctwords)
# returns "Total words: 11484, Distinct Words: 892"
groupBy()
Syntax:
RDD.groupBy(<function>, numPartitions=None)
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' ')) \
.filter(lambda x: len(x) > 0)
groupedbyfirstletter = words.groupBy(lambda x: x[0].lower())
groupedbyfirstletter.take(1)
# returns:
# [('l', <pyspark.resultiterable.ResultIterable object at 0x7f678e9cca20>)]
sortBy()
Syntax:
RDD.sortBy(<keyfunc>, ascending=True, numPartitions=None)
readme = sc.textFile('file:///opt/spark/README.md')
words = readme.flatMap(lambda x: x.split(' ')) \
.filter(lambda x: len(x) > 0)
sortbyfirstletter = words.sortBy(lambda x: x[0].lower(), ascending=False)
sortbyfirstletter.take(5)
# returns ['You', 'you', 'You', 'you', 'you']
count()
Syntax:
RDD.count()
The count() action takes no arguments and returns a long value, which
represents the count of the elements in the RDD. Listing 4.17 shows a simple
count() example. Note that with actions that take no arguments, you need to
include empty parentheses, (), after the action name.
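The following sketch (not Listing 4.17 itself) reuses the licenses files from the earlier examples:
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
words.count()
# returns the number of elements (words, in this case) in the RDD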
collect()
Syntax:
RDD.collect()
The collect() action returns a list that contains all the elements in an RDD
to the Spark Driver. Because collect() does not restrict the output, which
can be quite large and can potentially cause out-of-memory errors on the Driver,
it is typically useful for only small RDDs or development. Listing 4.18
demonstrates the collect() action.
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
words.collect()
# returns [u'The', u'MIT', u'License', u'(MIT)', u'', u'Copyright', ...]
take()
Syntax:
RDD.take(n)
The take() action returns the first n elements of an RDD. The elements taken
are not in any particular order; in fact, the elements returned from a take()
action are non-deterministic, meaning they can differ if the same action is run
again, particularly in a fully distributed environment. There is a similar Spark
function, takeOrdered(), which takes the first n elements ordered based on
a key supplied by a key function.
For RDDs that span more than one partition, take() scans one partition and
uses the results from that partition to estimate the number of additional partitions
needed to satisfy the number requested.
Listing 4.19 shows an example of the take() action.
licenses = sc.textFile('file:///opt/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
words.take(3)
# returns [u'The', u'MIT', u'License']
top()
Syntax:
RDD.top(n, key=None)
The top() action returns the top n elements from an RDD, but unlike with
take(), with top() the elements are ordered and returned in descending
order. Order is determined by the object type, such as numeric order for integers
or dictionary order for strings.
The key argument specifies a function used to generate the key for ordering the results when returning the top n elements. This is an optional argument; if it is not supplied, the elements are compared directly using their natural ordering.
Listing 4.20 shows the top three distinct words sorted from a text file in
descending lexicographical order.
readme = sc.textFile('file:///opt/spark/README.md')
words = readme.flatMap(lambda x: x.split(' '))
words.distinct().top(3)
# returns [u'your', u'you', u'with']
first()
Syntax:
RDD.first()
The first() action returns the first element in this RDD. Similar to the
take() and collect() actions and unlike the top() action, first()
does not consider the order of elements and is a non-deterministic operation,
especially in fully distributed environments.
As you can see from Listing 4.21, the primary difference between first() and
take(1) is that first() returns an atomic data element, and take() (even
if n = 1) returns a list of data elements. The first() action is useful for
inspecting the output of an RDD as part of development or data exploration.
readme = sc.textFile('file:///opt/spark/README.md')
words = readme.flatMap(lambda x: x.split(' ')) \
.filter(lambda x: len(x) > 0)
words.distinct().first()
# returns a string: u'project.'
words.distinct().take(1)
# returns a list with one string element: [u'project.']
The reduce() and fold() actions are aggregate actions, each of which
executes a commutative and/or an associative operation, such as summing a list
of values, against an RDD. Commutative and associative are the operative terms
here. This makes the operations independent of the order in which they run, and
this is integral to distributed processing because the order isn’t guaranteed. Here
is the general form of the commutative property:
x + y = y + x
The following sections look at the main Spark actions that perform aggregations.
reduce()
Syntax:
RDD.reduce(<function>)
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9])
numbers.reduce(lambda x, y: x + y)
# returns 45
fold()
Syntax:
RDD.fold(zeroValue, <function>)
The fold() action aggregates the elements of each partition of an RDD and
then performs the aggregate operation against the results for all, using a given
function and a zeroValue. Although reduce() and fold() are similar
in function, they differ in that fold() is not commutative, and thus an initial
and final value (zeroValue) is required. A simple example is a fold()
action with zeroValue=0, as shown in Listing 4.23.
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9])
numbers.fold(0, lambda x, y: x + y)
# returns 45
The fold() action in Listing 4.23 looks exactly the same as the reduce()
action in Listing 4.22. However, Listing 4.24 demonstrates a clear functional
difference in the two actions. The fold() action provides a zeroValue that
is added to the beginning and end of the function supplied as input to the
fold() action, generalized here:
result = zeroValue + (1 + 2) + 3 + ... + zeroValue
empty = sc.parallelize([])
empty.reduce(lambda x, y: x + y)
# returns:
# ValueError: Cannot reduce() empty RDD
empty.fold(0, lambda x, y: x + y)
# returns 0
foreach()
Syntax:
RDD.foreach(<function>)
Transformations on PairRDDs
Key/value pair RDDs, or simply PairRDDs, contain records consisting of keys
and values. The keys can be simple objects such as integer or string objects or
complex objects such as tuples. The values can range from scalar values to data
structures such as lists, tuples, dictionaries, or sets. This is a common data
representation in multi-structured data analysis on schema-on-read and NoSQL
systems. PairRDDs and their constituent functions are integral to functional
Spark programming. These functions are broadly classified into four categories:
Dictionary functions
Functional transformations
Grouping, aggregation, and sorting operations
Join functions, which we discuss specifically in the next section
Dictionary functions return a set of keys or values from a key/value pair RDD.
Examples include keys() and values().
Earlier in this chapter we looked at the aggregate operations reduce() and fold(). The reduceByKey() and foldByKey() transformations discussed shortly are conceptually similar in that they aggregate values in an RDD, but they do so based on a key. There is also a fundamental difference: reduce() and fold() are actions, which means they force computation and produce a result, whereas reduceByKey() and foldByKey() are transformations, meaning they are lazily evaluated and return a new RDD.
keys()
Syntax:
RDD.keys()
The keys() function returns an RDD with the keys from a key/value pair RDD
or the first element from each tuple in a key/value pair RDD. Listing 4.26
demonstrates using the keys() function.
kvpairs = sc.parallelize([('city','Hayward')
,('state','CA')
,('zip',94541)
,('country','USA')])
kvpairs.keys().collect()
# returns ['city', 'state', 'zip', 'country']
values()
Syntax:
RDD.values()
The values() function returns an RDD with values from a key/value pair
RDD or the second element from each tuple in a key/value pair RDD. Listing
4.27 demonstrates using the values() function.
kvpairs = sc.parallelize([('city','Hayward')
,('state','CA')
,('zip',94541)
,('country','USA')])
kvpairs.values().collect()
# returns ['Hayward', 'CA', 94541, 'USA']
keyBy()
Syntax:
RDD.keyBy(<function>)
Recall that x[2] in Listing 4.28 refers to the third element in list x, as Python list indexes are zero-based.
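As an illustrative sketch (the location data below is hypothetical, not the data from Listing 4.28), keyBy() creates key/value pairs in which the key is the value returned by the supplied function and the value is the complete original element:
# hedged sketch of keyBy(); the location data is hypothetical
locations = sc.parallelize([('Hayward', 'USA', 1),
                            ('Baumholder', 'Germany', 2),
                            ('Alexandria', 'USA', 3)])
bylocno = locations.keyBy(lambda x: x[2])  # x[2], the third element, becomes the key
bylocno.collect()
# returns the original tuples keyed by location number, for example:
# [(1, ('Hayward', 'USA', 1)), (2, ('Baumholder', 'Germany', 2)),
#  (3, ('Alexandria', 'USA', 3))]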
Functional transformations available for key/value pair RDDs work similarly to
the more general functional transformations you learned about earlier. The
difference is that these functions operate specifically on either the key or value
element within a tuple—the key/value pair, in this case. Functional
transformations include mapValues() and flatMapValues().
mapValues()
Syntax:
RDD.mapValues(<function>)
flatMapValues()
Syntax:
RDD.flatMapValues(<function>)
Listing 4.29 simulates the loading of this data into an RDD and uses
mapValues() to create a list of key/value pair tuples containing the city and a
list of temperatures for the city. It shows the use of flatMapValues() with
the same function against the same RDD to create tuples containing the city and
a number for each temperature recorded for the city.
A simple way to describe this is that mapValues() creates one element per
city containing the city name and a list of five temperatures for the city, whereas
flatMapValues() flattens the lists to create five elements per city with the
city name and a temperature value.
locwtemps = sc.parallelize(['Hayward,71|69|71|71|72',
'Baumholder,46|42|40|37|39',
'Alexandria,50|48|51|53|44',
'Melbourne,88|101|85|77|74'])
kvpairs = locwtemps.map(lambda x: x.split(','))
kvpairs.take(4)
# returns :
# [['Hayward', '71|69|71|71|72'],
# ['Baumholder', '46|42|40|37|39'],
# ['Alexandria', '50|48|51|53|44'],
# ['Melbourne', '88|101|85|77|74']]
locwtemplist = kvpairs.mapValues(lambda x: x.split('|')) \
.mapValues(lambda x: [int(s) for s in x])
locwtemplist.take(4)
# returns :
# [('Hayward', [71, 69, 71, 71, 72]),
# ('Baumholder', [46, 42, 40, 37, 39]),
# ('Alexandria', [50, 48, 51, 53, 44]),
# ('Melbourne', [88, 101, 85, 77, 74])]
locwtemps = kvpairs.flatMapValues(lambda x: x.split('|')) \
.map(lambda x: (x[0], int(x[1])))
locwtemps.take(3)
# returns :
# [('Hayward', 71), ('Hayward', 69), ('Hayward', 71)]
groupByKey()
Syntax:
RDD.groupByKey(numPartitions=None, partitionFunc=<hash_fn>)
reduceByKey()
Syntax:
RDD.reduceByKey(<function>, numPartitions=None, partitionFunc=<hash_fn>)
The reduceByKey() transformation merges the values for the keys by using
an associative function. The reduceByKey() method is called on a dataset of
key/value pairs and returns a dataset of key/value pairs, aggregating values for
each key. This function is expressed as follows:
vn, vn+1 => vresult
Averaging is not an associative operation; you can get around this by creating
tuples containing the sum total of values for each key and the count for each key
—operations that are associative and commutative—and then computing the
average as a final step, as shown in Listing 4.31.
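A minimal sketch of this approach (variable names are illustrative; this is not Listing 4.31 verbatim), reusing the locwtemps RDD of (city, temperature) pairs created earlier, follows:
# hedged sketch of averaging by key using sum and count tuples
temptups = locwtemps.mapValues(lambda x: (x, 1))  # (city, (temp, 1))
sumcounts = temptups.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averages = sumcounts.map(lambda x: (x[0], float(x[1][0]) / x[1][1]))
averages.take(2)
# returns (city, average temperature) pairs, for example:
# [('Hayward', 70.8), ('Baumholder', 40.8)]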
Note that reduceByKey() is efficient because it combines values locally on each Executor before sending each combined list to a remote Executor or Executors running the final reduce stage; this transfer is a shuffle operation.
Because the same associative and commutative function is run on the local Executor or Worker and again on a remote Executor or Executors, taking a sum
function, for example, you can think of this as adding a list of sums as opposed
to summing a bigger list of individual values. Because there is less data sent in
the shuffle phase, reduceByKey() using a sum function generally performs
better than groupByKey() followed by a sum() function.
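For instance, the two statements in the following sketch (using the same locwtemps pairs) produce the same per-city sums, but the reduceByKey() version shuffles only per-partition partial sums, whereas the groupByKey() version shuffles every individual temperature value before summing:
# sum of temperatures per city using reduceByKey(): values are combined
# locally on each Executor before the shuffle
sums_reduce = locwtemps.reduceByKey(lambda x, y: x + y)
# the same result using groupByKey() followed by a sum: every individual
# value is shuffled first, which is generally less efficient
sums_group = locwtemps.groupByKey().mapValues(sum)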
foldByKey()
Syntax:
RDD.foldByKey(zeroValue, <function>, numPartitions=None, partitionFunc=<hash_fn>)
Listing 4.33 shows the use of the sortByKey() transformation. The first
example shows a simple sort based on the key: a string representing the city
name, sorted alphabetically. In the second example, the keys and values are inverted to make the temperature the key, and sortByKey() is then used to list the temperatures in descending numeric order, with the highest temperatures first.
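The following sketch follows that description using the locwtemps pairs from the earlier examples (it is not Listing 4.33 verbatim):
# sort by the key (the city name), ascending alphabetically
sortedbycity = locwtemps.sortByKey()
sortedbycity.take(3)
# returns records for the first city alphabetically, for example:
# [('Alexandria', 50), ('Alexandria', 48), ('Alexandria', 51)]
# invert the key and value so the temperature becomes the key, then sort descending
sortedbytemp = locwtemps.map(lambda x: (x[1], x[0])) \
                        .sortByKey(ascending=False)
sortedbytemp.take(3)
# returns the highest temperatures first, for example:
# [(101, 'Melbourne'), (88, 'Melbourne'), (85, 'Melbourne')]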
Note that if your Python binary is not python (for instance, it may be py or
python3 depending upon your release), you need to direct Spark to the
correct file. This can be done using the following environment variable
settings:
$ export PYSPARK_PYTHON=python3
$ export PYSPARK_DRIVER_PYTHON=python3
6. Filter empty lines from the RDD, split lines by whitespace, and flatten the
lists of words into one list:
Click here to view code image
flattened = doc.filter(lambda line: len(line) > 0) \
               .flatMap(lambda line: re.split(r'\W+', line))
8. Map text to lowercase, remove empty strings, and then convert to key/value
pairs in the form (word, 1):
kvpairs = flattened.filter(lambda word: len(word) > 0) \
.map(lambda word:(word.lower(),1))
9. Inspect the kvpairs RDD. Notice that the RDD created is a PairRDD
representing a collection of key/value pairs:
kvpairs.take(5)
10. Count each word and sort results in reverse alphabetic order:
countsbyword = kvpairs.reduceByKey(lambda v1, v2: v1 + v2) \
.sortByKey(ascending=False)
Note how the map() function is used in step 12 to invert the key and value.
This is a common approach to performing an operation known as a
secondary sort, which is a means to sort values that are not sorted by default.
Now exit your pyspark session by pressing Ctrl+D.
14. Now put it all together and run it as a complete Python program by using
spark-submit. First, minimize the amount of logging by creating and
configuring a log4j.properties file in the conf directory of your
Spark installation. Do this by executing the following command from a
Linux terminal (or an analogous operation if you are using another operating
system):
sed \
"s/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/" \
$SPARK_HOME/conf/log4j.properties.template \
> $SPARK_HOME/conf/log4j.properties
15. Create a new file named wordcounts.py and add the following code to
the file:
import sys, re
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('Word Counts')
sc = SparkContext(conf=conf)
You should see the top five words displayed in the console. Check the output
directory $SPARK_HOME/data/wordcounts; you should see one file in
this directory (part-00000) because you used only one partition for this
exercise. If you used more than one partition, you would see additional files
(part-00001, part-00002, and so on). Open the file and inspect the
contents.
17. Run the command from step 16 again. It should fail because the
wordcounts directory already exists and cannot be overwritten. Simply
remove or rename this directory or change the output directory for the next
operation to a directory that does not exist, such as wordcounts2.
The complete source code for this exercise can be found in the wordcount folder at https://github.com/sparktraining/spark_using_python.
Join Transformations
Join operations are analogous to the JOIN operations you routinely see in SQL
programming. Join functions combine records from two RDDs based on a
common field, a key. Because join functions in Spark require a key to be
defined, they operate on key/value pair RDDs.
The following is a quick refresher on joins—which you may want to skip if you
have a relational database background:
A join operates on two different datasets, where one field in each dataset is
nominated as a key (a join key). The datasets are referred to in the order in
which they are specified. For instance, the first dataset specified is
considered the left entity or dataset, and the second dataset specified is
considered the right entity or dataset.
An inner join, often simply called a join (where the “inner” is inferred),
returns all elements or records from both datasets, where the nominated key
is present in both datasets.
An outer join does not require keys to match in both datasets. Outer joins
are implemented as either a left outer join, a right outer join, or a full outer
join.
A left outer join returns all records from the left (or first) dataset along with
matched records only (by the specified key) from the right (or second)
dataset.
A right outer join returns all records from the right (or second) dataset along
with matched records only (by the specified key) from the left (or first)
dataset.
A full outer join returns all records from both datasets whether there is a key
match or not.
Joins are some of the most commonly required transformations in the Spark API,
so it is imperative that you understand these functions and become comfortable
using them.
To illustrate the use of the different join types in the Spark RDD API, let’s
consider a dataset from a fictitious retailer that includes an entity containing
stores and an entity containing salespeople, loaded into RDDs, as shown in
Listing 4.34.
The following sections look at the available join transformations in Spark, their
usage, and some examples.
join()
Syntax:
RDD.join(<otherRDD>, numPartitions=None)
salespeople.keyBy(lambda x: x[2]) \
.join(stores).collect()
# returns: [(100, ((1, 'Henry', 100), 'Boca Raton')),
# (100, ((2, 'Karen', 100), 'Boca Raton')),
# (102, ((4, 'Jimmy', 102), 'Cambridge')),
# (101, ((3, 'Paul', 101), 'Columbia'))]
This join() operation returns all salespeople assigned to stores keyed by the
store ID (the join key) along with the entire store record and salesperson record.
Notice that the resultant RDD contains duplicate data. You could (and should in
many cases) follow the join() with a map() transformation to prune fields or
project only the fields required for further processing.
Optimizing Joins in Spark
Joins involving RDDs that span more than one partition—and many do—
require a shuffle. Spark generally plans and implements this activity to
achieve the most optimal performance possible; however, a simple axiom
to remember is “join large by small.” This means to reference the large
RDD (the one with the most elements, if this is known) first, followed by
the smaller of the two RDDs. This will seem strange for users coming from
relational database programming backgrounds, but unlike with relational
database systems, joins in Spark are relatively inefficient. And unlike with
most databases, there are no indexes or statistics to optimize the join, so the
optimizations you provide are essential to maximizing performance.
leftOuterJoin()
Syntax:
RDD.leftOuterJoin(<otherRDD>, numPartitions=None)
salespeople.keyBy(lambda x: x[2]) \
           .leftOuterJoin(stores) \
           .filter(lambda x: x[1][1] is None) \
           .map(lambda x: "salesperson " + x[1][0][1] + " has no store") \
           .collect()
# returns ['salesperson Janice has no store']
rightOuterJoin()
Syntax:
RDD.rightOuterJoin(<otherRDD>, numPartitions=None)
salespeople.keyBy(lambda x: x[2]) \
.rightOuterJoin(stores) \
.filter(lambda x: x[1][0] is None) \
.map(lambda x: x[1][1] + " store has no salespeople") \
.collect()
# returns ['Naperville store has no salespeople']
fullOuterJoin()
Syntax:
RDD.fullOuterJoin(<otherRDD>, numPartitions=None)
salespeople.keyBy(lambda x: x[2]) \
.fullOuterJoin(stores) \
.filter(lambda x: x[1][0] is None or x[1][1] is None) \
.collect()
# returns: [(None, ((5, 'Janice', None), None)), (103, (None, 'Naperville'))]
cogroup()
Syntax:
RDD.cogroup(<otherRDD>, numPartitions=None)
The resultant RDD output from a cogroup() operation of two RDDs (A, B)
with a key K could be summarized as:
[K, Iterable(VA, ...), Iterable(VB, ...)]
If an RDD does not have elements for a given key that is present in the other
RDD, the corresponding iterable is empty. Listing 4.39 shows a cogroup()
transformation using the salespeople and stores RDDs from the preceding
examples.
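The following is a hedged sketch of such a transformation (not Listing 4.39 verbatim), using the salespeople and stores RDDs from the join examples; mapValues() is used here to materialize the iterables into lists for display:
salespeople.keyBy(lambda x: x[2]) \
           .cogroup(stores) \
           .mapValues(lambda x: [list(i) for i in x]) \
           .take(1)
# returns one element per key, with the values from each parent RDD gathered
# into iterables (shown here as lists), for example:
# [(100, [[(1, 'Henry', 100), (2, 'Karen', 100)], ['Boca Raton']])]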
cartesian()
Syntax:
RDD.cartesian(<otherRDD>)
salespeople.keyBy(lambda x: x[2]) \
.cartesian(stores).take(1)
# returns:
# [((100, (1, 'Henry', 100)), (100, 'Boca Raton'))]
salespeople.keyBy(lambda x: x[2]) \
.cartesian(stores).count()
# returns 20 as there are 5 x 4 = 20 records
Table 4.2 shows the schema or structure of the files in the stations
directory.
lat: Latitude
long: Longitude
landmark: City
Table 4.3 shows the schema or structure of the files in the status
directory.
4. Split the status data into discrete fields, projecting only the fields
necessary, and decompose the date string so that you can filter records by
date more easily in the next step:
status2 = status.map(lambda x: x.split(',')) \
    .map(lambda x: (x[0], x[1], x[2], x[3].replace('"',''))) \
    .map(lambda x: (x[0], x[1], x[2], x[3].split(' '))) \
    .map(lambda x: (x[0], x[1], x[2], x[3][0].split('-'), x[3][1].split(':'))) \
    .map(lambda x: (int(x[0]), int(x[1]), int(x[3][0]), int(x[3][1]), int(x[3][2]), int(x[4][0])))
The schema for status3 is the same as the schema for status2 because
you have just removed unnecessary records.
6. Filter the stations dataset to include only stations where
landmark='San Jose':
stations2 = stations.map(lambda x: x.split(',')) \
.filter(lambda x: x[5] == 'San Jose') \
.map(lambda x: (int(x[0]), x[1]))
10. Create a key/value pair with the key as a tuple consisting of the station
name and the hour and then compute the averages by each hour for each
station:
avgbyhour = cleaned.keyBy(lambda x: (x[3],x[2])) \
.mapValues(lambda x: (x[1], 1)) \
.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
.mapValues(lambda x: (x[0]/x[1]))
11. Find the top 10 averages by station and hour by using the sortBy()
function:
topavail = avgbyhour.keyBy(lambda x: x[1]) \
.sortByKey(ascending=False) \
.map(lambda x: (x[1][0][0], x[1][0][1], x[0]))
topavail.take(10)
The complete source code for this exercise can be found in the joining-datasets folder at https://github.com/sparktraining/spark_using_python.
Transformations on Sets
Set operations are conceptually similar to mathematical set operations. A set
function operates against two RDDs and results in one RDD. Consider the Venn
diagram shown in Figure 4.9, which shows a set of odd integers and a subset of
Fibonacci numbers. The following sections use these two sets to demonstrate the
various set transformations available in the Spark API.
Figure 4.9 Set Venn diagram.
union()
Syntax:
RDD.union(<otherRDD>)
The union() transformation takes one RDD and appends another RDD to it,
resulting in a combined output RDD. The RDDs are not required to have the
same schema or structure. For instance, the first RDD can have five fields,
whereas the second can have more or fewer than five fields.
The union() transformation does not filter duplicates from the output RDD in
the case that two unioned RDDs have records that are identical to each other. To
filter duplicates, you could follow the union() transformation with the
distinct() function discussed previously.
The RDD that results from a union() operation is not sorted either, but you
could sort it by following union() with a sortBy() function.
Listing 4.41 shows an example using union().
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.union(fibonacci).collect()
# returns [1, 3, 5, 7, 9, 0, 1, 2, 3, 5, 8]
intersection()
Syntax:
RDD.intersection(<otherRDD>)
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.intersection(fibonacci).collect()
# returns [1, 3, 5]
subtract()
Syntax:
RDD.subtract(<otherRDD>, numPartitions=None)
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.subtract(fibonacci).collect()
# returns [7, 9]
subtractByKey()
Syntax:
RDD.subtractByKey(<otherRDD>, numPartitions=None)
cities1 = sc.parallelize([('Hayward',(37.668819,-122.080795)),
('Baumholder',(49.6489,7.3975)),
('Alexandria',(38.820450,-77.050552)),
('Melbourne', (37.663712,144.844788))])
cities2 = sc.parallelize([('Boulder Creek',(64.0708333,-148.2236111)),
('Hayward',(37.668819,-122.080795)),
('Alexandria',(38.820450,-77.050552)),
('Arlington', (38.878337,-77.100703))])
cities1.subtractByKey(cities2).collect()
# returns:
# [('Baumholder', (49.6489, 7.3975)), ('Melbourne', (37.663712,
144.844788))]
cities2.subtractByKey(cities1).collect()
# returns:
# [('Boulder Creek', (64.0708333, -148.2236111)),
# ('Arlington', (38.878337, -77.100703))]
min()
Syntax:
RDD.min(key=None)
The min() function is an action that returns the minimum value for a numeric
RDD. The key argument is a function used to generate a key for comparing.
Listing 4.45 shows the use of the min() function.
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.min()
# returns 0
max()
Syntax:
RDD.max(key=None)
The max() function is an action that returns the maximum value for a numeric
RDD. The key argument is a function used to generate a key for comparing.
Listing 4.46 shows the use of the max() function.
Listing 4.46 The max() Function
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.max()
# returns 34
mean()
Syntax:
RDD.mean()
The mean() function computes the arithmetic mean from a numeric RDD.
Listing 4.47 demonstrates the use of the mean() function.
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.mean()
# returns 8.8
sum()
Syntax:
RDD.sum()
The sum() function returns the sum of a list of numbers from a numeric RDD.
Listing 4.48 shows the use of the sum() function.
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.sum()
# returns 88
stdev()
Syntax:
RDD.stdev()
The stdev() function is an action that computes the standard deviation for a
series of numbers from a numeric RDD. Listing 4.49 shows an example of
stdev().
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.stdev()
# returns 10.467091286503619
variance()
Syntax:
RDD.variance()
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.variance()
# returns 109.55999999999999
stats()
Syntax:
RDD.stats()
numbers = sc.parallelize([0,1,1,2,3,5,8,13,21,34])
numbers.stats()
# returns (count: 10, mean: 8.8, stdev: 10.4670912865, max: 34.0, min: 0.0)
Summary
This chapter covers the fundamentals of Spark programming, starting with a
closer look at Spark RDDs (the most fundamental atomic data object in the
Spark programming model), including looking at how to load data into RDDs,
how RDDs are evaluated and processed, and how RDDs achieve fault tolerance
and resiliency. This chapter also discusses the concepts of transformations and
actions in Spark and provides specific descriptions and examples of the most
important functions in the Spark core (or RDD) API. This chapter is arguably the
most important chapter in this book as it has laid the foundations for all
programming in Spark, including stream processing, machine learning, and
SQL. The remainder of the book regularly refers to the functions and concepts
covered in this chapter.