Learn Apache Spark
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways: one is storage and the other is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, which include interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.
Components of Spark
The following illustration depicts the different components of Spark.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
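A minimal sketch of how this looks in the Spark shell, assuming a JSON file named people.json that contains a name field (in later 1.x releases the SchemaRDD abstraction evolved into the DataFrame):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.read.json("people.json")    // people.json is an assumed example file
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").show()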
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
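A minimal sketch, assuming text lines arrive on a local TCP socket (the host and port are arbitrary examples):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))              // 1-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()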
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
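A minimal sketch of building a property graph in the Spark shell (the vertices and edges are made-up examples; the Pregel API operates on graphs built this way):
import org.apache.spark.graphx._
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)
println(graph.inDegrees.collect().mkString(", "))    // in-degree of each vertex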
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared
file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
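As a minimal sketch in the Spark shell (where the SparkContext is available as sc), the two approaches look like this; the file name input.txt is an assumed example:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)        // 1. parallelize an existing collection
val distFile = sc.textFile("input.txt")    // 2. reference a dataset in external storage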
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.
Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system (for example, HDFS). Although this framework provides numerous abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. As for the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
The following illustration explains how the current framework works while doing interactive queries on MapReduce.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Let us now try to find out how iterative and interactive operations take place in Spark
RDD.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (state of the JOB), then it will store those results on the disk.
Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different queries are run
on the same set of data repeatedly, this particular data can be kept in memory for better
execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
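For instance, a minimal sketch in the Spark shell (the file name and filter condition are assumed examples):
val lines = sc.textFile("data.txt")
val errors = lines.filter(line => line.contains("ERROR"))
errors.persist()                        // or errors.cache(); marks the RDD to be kept in memory
println(errors.count())                 // the first action computes the RDD and caches it
println(errors.filter(_.contains("timeout")).count())   // reuses the cached data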
$java -version
If Java is already installed on your system, this command prints the installed Java version.
In case you do not have Java installed on your system, install Java before proceeding to the next step.
$scala -version
If Scala is already installed on your system, this command prints the installed Scala version.
In case you do not have Scala installed on your system, proceed to the next step for Scala installation.
Extract the downloaded Scala tar file and move the Scala software files to the /usr/local/scala directory as the root user.
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command for setting the PATH for Scala.
export PATH=$PATH:/usr/local/scala/bin
$scala -version
After installation, verify it by running the above command again; it should print the installed Scala version.
After downloading Spark, extract the tar file and move the Spark software files to the /usr/local/spark directory as the root user.
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file to add the Spark software location to the PATH variable, then source the file.
export PATH=$PATH:/usr/local/spark/bin
$ source ~/.bashrc
$spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Spark Shell
Spark provides an interactive shell − a powerful tool to analyze data interactively. It is available in either Scala or Python. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.
$ spark-shell
The Spark RDD API introduces a number of Transformations and Actions to manipulate RDDs.
RDD Transformations
An RDD transformation returns a pointer to a new RDD and allows you to create dependencies between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD.
Spark is lazy, so nothing will be executed unless you call some transformation or action that triggers job creation and execution. Look at the following snippet of the word-count example.
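A minimal sketch of that snippet, assuming a text file named input.txt:
val inputfile = sc.textFile("input.txt")
val counts = inputfile.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
// nothing has been executed yet; an action such as counts.collect() triggers the job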
Therefore, an RDD transformation is not a set of data but a step in a program (possibly the only step) telling Spark how to get data and what to do with it.
The following table gives a list of RDD Transformations.
S.No Transformation & Meaning
1 map(func)
Returns a new distributed dataset, formed by passing each element of the source through a function func.
2 filter(func)
Returns a new dataset formed by selecting those elements of the source on which func returns true.
3 flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
4 mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> ⇒ Iterator<U> when running on an RDD of type T.
5 mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) ⇒ Iterator<U> when running on an RDD of type T.
6 sample(withReplacement, fraction, seed)
Samples a fraction of the data, with or without replacement, using a given random number generator seed.
7 union(otherDataset)
Returns a new dataset that contains the union of the elements in the source dataset and the argument.
8 intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the source dataset and the argument.
9 distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source dataset.
10 groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note − If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
11 reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) ⇒ V. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.
12 aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.
13 sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending argument.
14 join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
15 cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
16 cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
17 pipe(command, [envVars])
Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and lines output to its stdout are returned as an RDD of strings.
18 coalesce(numPartitions)
Decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
19 repartition(numPartitions)
Reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
20 repartitionAndSortWithinPartitions(partitioner)
Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling repartition and then sorting within each partition, because it can push the sorting down into the shuffle machinery.
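To see a few of these transformations in action, here is a minimal sketch for the Spark shell (the sample data is assumed):
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)                       // map
val evens = nums.filter(_ % 2 == 0)                 // filter
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey(_ + _)                 // reduceByKey
// all of the above are lazy; an action such as sums.collect() triggers execution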
Actions
The following table gives a list of Actions, which return values.
S.No Action & Meaning
1 reduce(func)
Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
2 collect()
Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
3 count()
Returns the number of elements in the dataset.
4 first()
Returns the first element of the dataset (similar to take(1)).
5 take(n)
Returns an array with the first n elements of the dataset.
6 takeSample(withReplacement, num, [seed])
Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
7 takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural order or a custom comparator.
8 saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
9 saveAsSequenceFile(path)
Writes the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
10 saveAsObjectFile(path)
Writes the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
11 countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
12 foreach(func)
Runs a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note − Modifying variables other than Accumulators outside of foreach() may result in undefined behavior. See Understanding closures for more details.
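A minimal sketch of some of these actions in the Spark shell (the sample data is assumed):
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
println(nums.reduce(_ + _))             // 15
println(nums.count())                   // 5
println(nums.take(3).mkString(", "))    // 1, 2, 3
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.countByKey())             // counts per key, e.g. Map(a -> 2, b -> 1)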
Example
Consider a word count example − it counts each word appearing in a document. Consider the following text as the input, saved as a file named input.txt in the home directory.
Open Spark-Shell
The following command is used to open the Spark shell. Spark is generally built using Scala; therefore, a Spark program runs in the Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will see the following output. The last line of the output, "Spark context available as sc", means that the Spark shell has automatically created a SparkContext object with the name sc. Before starting the first step of a program, the SparkContext object should be created.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Create an RDD
First, we have to read the input file using the Spark-Scala API and create an RDD.
The following command is used for reading a file from a given location. Here, a new RDD is created with the name inputfile. The string given as an argument to the textFile("") method is the absolute path of the input file. However, if only the file name is given, then the input file is assumed to be in the current location.
scala> val inputfile = sc.textFile("input.txt")
Next, split each line into words using the flatMap function (flatMap(line ⇒ line.split(" "))), read each word as a key with a value '1' (<key, value> = <word, 1>) using the map function (map(word ⇒ (word, 1))), and aggregate the counts with reduceByKey.
The following command is used for executing the word-count logic. After executing this, you will not see any output, because this is not an action; it is a transformation, which points to a new RDD and tells Spark what to do with the given data.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_+_);
Current RDD
While working with an RDD, if you want to know about the current RDD, use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
To mark the RDD to be persisted in memory, use the cache() method on it.
scala> counts.cache()
To apply the action, that is, to store the result of all the transformations into a text file, use the saveAsTextFile() method. The string argument of this method is the path of the output folder.
scala> counts.saveAsTextFile("output")
After the job completes, the output folder contains the following files −
part-00000
part-00001
_SUCCESS
Use the following command to see the output in the part-00000 file.
$ cat output/part-00000
Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)
Use the following command to see the output in the part-00001 file.
$ cat output/part-00001
Output
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
If you want to see the storage space used for this application, use the following URL in your browser.
http://localhost:4040
You will see the following screen, which shows the storage space used by the applications running on the Spark shell.
If you want to un-persist the storage space of a particular RDD, then use the following command.
scala> counts.unpersist()
For verifying the storage space in the browser, use the following URL.
http://localhost:4040/
You will see the following screen. It shows the storage space used by the applications running on the Spark shell.
Example
Let us take the same word count example we used before with shell commands. Here, we consider the same example as a Spark application.
Sample Input
The following text is the input data, and the file is named in.txt.
SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
   def main(args: Array[String]) {
      // Create the SparkContext; the application name is an assumption here,
      // and the master is supplied when the job is submitted via spark-submit.
      val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount"))
      // Read the input file and count the words.
      val input = sc.textFile("in.txt")
      val count = input.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
      count.saveAsTextFile("outfile")
      System.out.println("OK");
   }
}
Save the above program into a file named SparkWordCount.scala and place it in a
user-defined directory named spark-application.
Note − While transforming the input RDD into the count RDD, we use flatMap() for tokenizing the lines (from the text file) into words, the map() method for mapping each word to a count of 1, and the reduceByKey() method for summing the counts of each word.
Use the following steps to submit this application. Execute all steps in the spark-
application directory through the terminal.
If it is executed successfully, then you will find the output given below. The OK in the following output is printed by the program's last statement (System.out.println("OK")) and simply confirms to the user that the program ran to completion.
If you carefully read the following output, you will find different things, such as −
successfully started service 'sparkDriver' on port 42954
MemoryStore cleared
The following commands are used for opening and checking the list of files in the outfile
directory.
$ cd outfile
$ ls
part-00000 part-00001 _SUCCESS
$ cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)
$ cat part-00001
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
Go through the following section to know more about the ‘spark-submit’ command.
Spark-submit Syntax
spark-submit [options] <app jar | python file> [app arguments]
Options
The table given below describes a list of options −
S.No Option Description
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and is deserialized before
running each task. This means that explicitly creating broadcast variables, is only useful
when tasks across multiple stages need the same data or when caching the data in
deserialized form is important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster, so that v is not shipped to the nodes more than once. In
addition, the object v should not be modified after its broadcast, in order to ensure that
all nodes get the same value of the broadcast variable.
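As a minimal example in the Spark shell (the array contents are arbitrary):
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value      // Array(1, 2, 3), readable on the driver or inside tasks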
Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric
types, and programmers can add support for new types. If accumulators are created
with a name, they will be displayed in Spark’s UI. This can be useful for understanding
the progress of running stages (NOTE − this is not yet supported in Python).
An accumulator is created from an initial value v by calling
SparkContext.accumulator(v). Tasks running on the cluster can then add to it using
the add method or the += operator (in Scala and Python). However, they cannot read
its value. Only the driver program can read the accumulator’s value, using its value
method.
The code given below shows an accumulator being used to add up the elements of an
array −
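A minimal version of that code, consistent with the output shown further below (it accumulates the values 1 through 4):
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)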
If you want to see the output of the above code, then use the following command −
scala> accum.value
Output
res2: Int = 10
Numeric RDD Operations
Spark provides several predefined operations on RDDs containing numeric data. The following table lists them.
S.No Method & Meaning
1 count()
Number of elements in the RDD.
2 mean()
Average of the elements in the RDD.
3 sum()
Total value of the elements in the RDD.
4 max()
Maximum value among all elements in the RDD.
5 min()
Minimum value among all elements in the RDD.
6 variance()
Variance of the elements.
7 stdev()
Standard deviation of the elements.
If you want to use only one of these methods, you can call the corresponding method directly on the RDD.
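For example, a minimal sketch in the Spark shell on an assumed RDD of doubles:
val nums = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
println(nums.mean())     // 2.5
println(nums.sum())      // 10.0
println(nums.stats())    // count, mean, stdev, max and min computed in one pass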