
Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is
a fault-tolerant collection of elements that can be operated on in parallel. There are
two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections-Scala
Parallelized collections are created by calling SparkContext’s parallelize method on
an existing collection in your driver program (a Scala Seq). The elements of the
collection are copied to form a distributed dataset that can be operated on in parallel.
For example, here is how to create a parallelized collection holding the numbers 1
to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we might call distData.reduce((a, b) => a + b) to add up the elements of
the array. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
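As a quick sketch of setting the partition count by hand, reusing the data array from the example above (the value 10 is arbitrary; getNumPartitions is part of the RDD API and simply reports the resulting count):
val distData = sc.parallelize(data, 10)   // ask Spark to split the data into 10 partitions
distData.getNumPartitions                 // returns 10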

Parallelized Collections-Java
Parallelized collections are created by
calling JavaSparkContext’s parallelize method on an existing Collection in your
driver program. The elements of the collection are copied to form a distributed
dataset that can be operated on in parallel. For example, here is how to create a
parallelized collection holding the numbers 1 to 5:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we might call distData.reduce((a, b) -> a + b) to add up the elements of the
list. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically,
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
Parallelized Collections-Python
Parallelized collections are created by calling SparkContext’s parallelize method on
an existing iterable or collection in your driver program. The elements of the
collection are copied to form a distributed dataset that can be operated on in parallel.
For example, here is how to create a parallelized collection holding the numbers 1
to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For
example, we can call distData.reduce(lambda a, b: a + b) to add up the elements of
the list. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut
the dataset into. Spark will run one task for each partition of the cluster. Typically
you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster. However, you can also
set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the
term slices (a synonym for partitions) to maintain backward compatibility.
External Datasets-Scala
Spark can create distributed datasets from any storage source supported by Hadoop,
including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26

Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(s => s.length).reduce((a, b) => a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file (see the sketch below). By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
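As a brief sketch of that second argument (the value 8 is an arbitrary choice for illustration):
// Ask for at least 8 partitions when reading the file
val distFile = sc.textFile("data.txt", 8)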
Apart from text files, Spark’s Scala API also supports several other data formats.
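For instance, directories of many small text files can be read as (filename, content) pairs, and Hadoop SequenceFiles can be loaded with native key/value types. The snippet below is only a rough sketch, with placeholder paths:
// Read every file in a directory as a (filename, content) pair
val filePairs = sc.wholeTextFiles("hdfs://some/directory")
// Read a SequenceFile of IntWritable keys and Text values as (Int, String) pairs
val seqData = sc.sequenceFile[Int, String]("hdfs://some/sequencefile")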
External Datasets-Java
Spark can create distributed datasets from any storage source supported by Hadoop,
including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
JavaRDD<String> distFile = sc.textFile("data.txt");
Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(s -> s.length()).reduce((a, b) -> a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark’s Java API also supports several other data formats.
External Datasets-Python
PySpark can create distributed datasets from any storage source supported by
Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3,
etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method
takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc.
URI) and reads it as a collection of lines. Here is an example invocation:
>>> distFile = sc.textFile("data.txt")
Once created, distFile can be acted on by dataset operations. For example, we can
add up the sizes of all the lines using the map and reduce operations as
follows: distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b).
Some notes on reading files with Spark:
• If using a path on the local file system, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
• All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
• The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

RDD Operations
RDDs support two types of operations: transformations, which create a new dataset
from an existing one, and actions, which return a value to the driver program after
running a computation on the dataset. For example, map is a transformation that
passes each dataset element through a function and returns a new RDD representing
the results. On the other hand, reduce is an action that aggregates all the elements of
the RDD using some function and returns the final result to the driver program
(although there is also a parallel reduceByKey that returns a distributed dataset).
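As a minimal sketch of that distinction (the key/value pairs here are made up for illustration):
// reduceByKey is a transformation: it returns a new distributed dataset of
// (key, reduced value) pairs rather than sending a single value to the driver.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey((x, y) => x + y)   // RDD[(String, Int)]
sums.collect()   // action: brings Array(("a", 4), ("b", 2)) back to the driver (ordering may vary)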
All transformations in Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base dataset
(e.g. a file). The transformations are only computed when an action requires a result
to be returned to the driver program. This design enables Spark to run more
efficiently. For example, we can realize that a dataset created through map will be
used in a reduce and return only the result of the reduce to the driver, rather than the
larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action
on it. However, you may also persist an RDD in memory using the persist (or cache)
method, in which case Spark will keep the elements around on the cluster for much
faster access the next time you query it. There is also support for persisting RDDs
on disk, or replicated across multiple nodes.
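As a rough sketch of the on-disk and replicated options mentioned above (the storage level names come from the Spark API; the RDD itself is illustrative):
import org.apache.spark.storage.StorageLevel

val fileLines = sc.textFile("data.txt")
// Keep the computed partitions only on disk
fileLines.persist(StorageLevel.DISK_ONLY)
// Alternatively, keep them in memory replicated on two nodes:
// fileLines.persist(StorageLevel.MEMORY_ONLY_2)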
1. To illustrate RDD basics, consider the simple program below (Scala):
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
2. To illustrate RDD basics, consider the simple program below (Java):
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist(StorageLevel.MEMORY_ONLY());
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
3. To illustrate RDD basics, consider the simple program below (Python):
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation.
Again, lineLengths is not immediately computed, due to laziness. Finally, we
run reduce, which is an action. At this point Spark breaks the computation into tasks
to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.

Passing Functions to Spark-Scala


Spark’s API relies heavily on passing functions in the driver program to run on the
cluster. There are two recommended ways to do this:

• Anonymous function syntax, which can be used for short pieces of code.
• Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:

object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance
(as opposed to a singleton object), this requires sending the object that contains that
class along with the method. For example, consider:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass instance and call doStuff on it, the map inside
there references the func1 method of that MyClass instance, so the whole object
needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To
avoid this issue, the simplest way is to copy field into a local variable instead of
accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
Passing Functions to Spark-Python
Spark’s API relies heavily on passing functions in the driver program to run on the
cluster. There are three recommended ways to do this:

• Lambda expressions, for simple functions that can be written as an expression. (Lambdas do not support multi-statement functions or statements that do not return a value.)
• Local defs inside the function calling into Spark, for longer code.
• Top-level functions in a module.

For example, to pass a longer function than can be supported using a lambda,
consider the code below:
"""MyScript.py"""
if __name__ == "__main__":
def myFunc(s):
words = s.split(" ")
return len(words)

sc = SparkContext(...)
sc.textFile("file.txt").map(myFunc)
Note that while it is also possible to pass a reference to a method in a class instance
(as opposed to a singleton object), this requires sending the object that contains that
class along with the method. For example, consider:
class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)
Here, if we create a new MyClass and call doStuff on it, the map inside there
references the func method of that MyClass instance, so the whole object needs to
be sent to the cluster.
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass(object):
    def __init__(self):
        self.field = "Hello"
    def doStuff(self, rdd):
        return rdd.map(lambda s: self.field + s)
To avoid this issue, the simplest way is to copy field into a local variable instead of
accessing it externally:
def doStuff(self, rdd):
    field = self.field
    return rdd.map(lambda s: field + s)
