
Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is
a fault-tolerant collection of elements that can be operated on in parallel. There are
two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections-Scala
Parallelized collections are created by calling SparkContext’s parallelize method on
an existing collection in your driver program (a Scala Seq). The elements of the
collection are copied to form a distributed dataset that can be operated on in parallel.
For example, here is how to create a parallelized collection holding the numbers 1
to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we might call distData.reduce((a, b) => a + b) to add up the elements of
the array. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
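As a quick sketch of setting the partition count by hand, reusing the data array from the example above (the value 10 is arbitrary; getNumPartitions is part of the RDD API and simply reports the resulting count):
val distData = sc.parallelize(data, 10)   // ask Spark to split the data into 10 partitions
distData.getNumPartitions                 // returns 10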

Parallelized Collections-Java
Parallelized collections are created by
calling JavaSparkContext’s parallelize method on an existing Collection in your
driver program. The elements of the collection are copied to form a distributed
dataset that can be operated on in parallel. For example, here is how to create a
parallelized collection holding the numbers 1 to 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we might call distData.reduce((a, b) -> a + b) to add up the elements of the
list. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically,
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
Parallelized Collections-Python
Parallelized collections are created by calling SparkContext’s parallelize method on
an existing iterable or collection in your driver program. The elements of the
collection are copied to form a distributed dataset that can be operated on in parallel.
For example, here is how to create a parallelized collection holding the numbers 1
to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we can call distData.reduce(lambda a, b: a + b) to add up the elements of
the list. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
External Datasets-Scala
Spark can create distributed datasets from any storage source supported by Hadoop,
including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26

Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(s => s.length).reduce((a, b) => a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file (see the sketch below). By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
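As a brief sketch of that second argument (the value 8 is an arbitrary choice for illustration):
// Ask for at least 8 partitions when reading the file
val distFile = sc.textFile("data.txt", 8)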
Apart from text files, Spark’s Scala API also supports several other data formats.
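For instance, directories of many small text files can be read as (filename, content) pairs, and Hadoop SequenceFiles can be loaded with native key/value types. The snippet below is only a rough sketch, with placeholder paths:
// Read every file in a directory as a (filename, content) pair
val filePairs = sc.wholeTextFiles("hdfs://some/directory")
// Read a SequenceFile of IntWritable keys and Text values as (Int, String) pairs
val seqData = sc.sequenceFile[Int, String]("hdfs://some/sequencefile")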
External Datasets-Java
Spark can create distributed datasets from any storage source supported by Hadoop,
including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark’s Java API also supports several other data formats.
External Datasets-Python
PySpark can create distributed datasets from any storage source supported by
Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3,
etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
>>> distFile = sc.textFile("data.txt")
Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

RDD Operations
RDDs support two types of operations: transformations, which create a new dataset
from an existing one, and actions, which return a value to the driver program after
running a computation on the dataset. For example, map is a transformation that
passes each dataset element through a function and returns a new RDD representing
the results. On the other hand, reduce is an action that aggregates all the elements of
the RDD using some function and returns the final result to the driver program
(although there is also a parallel reduceByKey that returns a distributed dataset).
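As a minimal sketch of that distinction (the key/value pairs here are made up for illustration):
// reduceByKey is a transformation: it returns a new distributed dataset of
// (key, reduced value) pairs rather than sending a single value to the driver.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey((x, y) => x + y)   // RDD[(String, Int)]
sums.collect()   // action: brings Array(("a", 4), ("b", 2)) back to the driver (ordering may vary)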
All transformations in Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base dataset
(e.g. a file). The transformations are only computed when an action requires a result
to be returned to the driver program. This design enables Spark to run more
efficiently. For example, we can realize that a dataset created through map will be
used in a reduce and return only the result of the reduce to the driver, rather than the
larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action
on it. However, you may also persist an RDD in memory using the persist (or cache)
method, in which case Spark will keep the elements around on the cluster for much
faster access the next time you query it. There is also support for persisting RDDs
on disk, or replicated across multiple nodes.
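As a rough sketch of the on-disk and replicated options mentioned above (the storage level names come from the Spark API; the RDD itself is illustrative):
import org.apache.spark.storage.StorageLevel

val fileLines = sc.textFile("data.txt")
// Keep the computed partitions only on disk
fileLines.persist(StorageLevel.DISK_ONLY)
// Alternatively, keep them in memory replicated on two nodes:
// fileLines.persist(StorageLevel.MEMORY_ONLY_2)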
1. To illustrate RDD basics, consider the simple program below (Scala):
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
2. To illustrate RDD basics, consider the simple program below (Java):
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist(StorageLevel.MEMORY_ONLY());
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
3. To illustrate RDD basics, consider the simple program below (Python):
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.

Passing Functions to Spark-Scala


Spark’s API relies heavily on passing functions in the driver program to run on the
cluster. There are two recommended ways to do this:

• Anonymous function syntax, which can be used for short pieces of code.
• Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:

object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance
(as opposed to a singleton object), this requires sending the object that contains that
class along with the method. For example, consider:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass instance and call doStuff on it, the map inside
there references the func1 method of that MyClass instance, so the whole object
needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To
avoid this issue, the simplest way is to copy field into a local variable instead of
accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
Passing Functions to Spark-Python
Spark’s API relies heavily on passing functions in the driver program to run on the
cluster. There are three recommended ways to do this:

• Lambda expressions, for simple functions that can be written as an expression. (Lambdas do not support multi-statement functions or statements that do not return a value.)
• Local defs inside the function calling into Spark, for longer code.
• Top-level functions in a module.

For example, to pass a longer function than can be supported using a lambda,
consider the code below:
"""MyScript.py"""
if __name__ == "__main__":
def myFunc(s):
words = s.split(" ")
return len(words)

sc = SparkContext(...)
sc.textFile("file.txt").map(myFunc)
Note that while it is also possible to pass a reference to a method in a class instance
(as opposed to a singleton object), this requires sending the object that contains that
class along with the method. For example, consider:
class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)
Here, if we create a new MyClass and call doStuff on it, the map inside there
references the func method of that MyClass instance, so the whole object needs to
be sent to the cluster.
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass(object):
    def __init__(self):
        self.field = "Hello"
    def doStuff(self, rdd):
        return rdd.map(lambda s: self.field + s)
To avoid this issue, the simplest way is to copy field into a local variable instead of
accessing it externally:
def doStuff(self, rdd):
    field = self.field
    return rdd.map(lambda s: field + s)
