
UNIT-V

SPARK
Languages supported by Spark

Java
Python
R
Scala
Two important concepts of Hadoop
• 1. Hadoop Distributed File System (HDFS) (Storage)
• 2. MapReduce (Processing)
Processing in Hadoop
Why Spark?
To overcome the limitations of Hadoop
What is Spark?
• Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
• It was built on top of Hadoop MapReduce.
• It extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
• Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
• Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
• Spark can use Hadoop in two ways – one for storage and the other for processing.
• Since Spark has its own cluster management, it typically uses Hadoop for storage purposes only.
Apache Spark
• The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as
1. batch applications,
2. iterative algorithms,
3. interactive queries
4. streaming.
• It also reduces the management burden of maintaining separate
tools.
Evolution of Apache Spark

• Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.
• It was open-sourced in 2010 under a BSD (Berkeley Software Distribution) license.
• It was donated to the Apache Software Foundation in 2013.
• Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk.
• This is possible by reducing the number of read/write operations to disk; intermediate processing data is stored in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python.
• We can write applications in different languages. Spark provides 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways Spark can be deployed with Hadoop components.
• Standalone − In a standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly.
• Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any pre-installation or root access required.
• This helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to a standalone deployment.
• With SIMR, a user can start Spark and use its shell without any administrative access.
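In code, the choice of cluster manager shows up as the "master" setting of the application. A minimal Scala sketch, assuming the standard SparkSession builder API (the host name below is a placeholder):

import org.apache.spark.sql.SparkSession

// "local[*]"          -> run locally on all cores (useful for testing)
// "spark://host:7077" -> Spark standalone cluster manager (placeholder host)
// "yarn"              -> run on Hadoop YARN
val spark = SparkSession.builder()
  .appName("DeploymentModeSketch")
  .master("local[*]")   // swap for "yarn" or "spark://host:7077" on a cluster
  .getOrCreate()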
Components of Spark
• Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon.
It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
• Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
• MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that benefits from Spark's distributed, memory-based architecture.
Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout.
• GraphX
GraphX is a distributed graph-processing framework on top of Spark.
It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API.
It also provides an optimized runtime for this abstraction.
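A minimal sketch of how these components are reached from a single application, assuming the standard Spark APIs (file paths are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().appName("ComponentsSketch").getOrCreate()

// Spark Core: RDDs through the SparkContext
val numbers = spark.sparkContext.parallelize(1 to 10)

// Spark SQL: DataFrames and SQL queries
val people = spark.read.json("people.json")   // placeholder path
people.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()

// Spark Streaming: mini-batch processing on top of Spark Core
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// (MLlib and GraphX are used through their own packages,
//  e.g. org.apache.spark.ml and org.apache.spark.graphx)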
(Diagrams: data processing in Hadoop and in Spark)
Architecture of Spark
Spark follows a master-slave architecture: a driver program (the master) coordinates executors (the slaves) running on the worker nodes of the cluster.
Resilient Distributed Datasets(RDD)
• Resilient Distributed Datasets (RDD) is a fundamental data structure
of Spark.
• It is an immutable distributed collection of objects.
• Each dataset in RDD is divided into logical partitions, which may be
computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
• RDD is a fault-tolerant collection of elements that can be operated on in parallel.
Process of dividing the dataset into partitions
Operations on RDD
1. Transformations
2. Actions
• There are two ways to create RDDs
1. parallelizing an existing collection in your driver program
2. referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
• Spark uses the concept of RDD to achieve faster and more efficient MapReduce operations.
Creating RDD using parallelize()
// Create an RDD from a list of integers
val data = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Create an RDD from a list of strings
val data = List("Hello", "World", "Spark", "is", "awesome")
val rdd = sc.parallelize(data)
Creating RDD using parallelize()
// Create an RDD from a list of tuples
val data = List(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val rdd = sc.parallelize(data)
Creating RDD from a text file(dataset)
// Start the Scala Spark shell (run the spark-shell command in Cloudera)

// Load the sample file into an RDD
val rdd = sc.textFile("hdfs://path/to/your/sample/file.txt")

// Perform operations on the RDD, for example, count the number of lines
val count = rdd.count()

// Print the number of lines in the RDD
println(s"Number of lines in the RDD: $count")
RDD Transformations
• Map: Apply a function to each element of the RDD to create a new
RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Map each element to its square
val squaredRDD = rdd.map(x => x * x)

// Print the result
squaredRDD.collect().foreach(println)

// Output: 1 4 9 16 25
• Filter: Keep only the elements of the RDD that satisfy a condition.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Filter the even numbers
val evenRDD = rdd.filter(x => x % 2 == 0)

// Print the result
evenRDD.collect().foreach(println)

// Output: 2 4
FlatMap: Apply a function to each element of the
RDD and flatten the results.
// Create an RDD of strings
val rdd = sc.parallelize(List("hello world", "how are you"))

// Split each string into words and flatten the result
val wordsRDD = rdd.flatMap(_.split(" "))

// Print the result
wordsRDD.collect().foreach(println)

// Output (one word per line): hello, world, how, are, you
Union: Combine two RDDs into one RDD.
// Create two RDDs of integers
val rdd1 = sc.parallelize(List(1, 2, 3))
val rdd2 = sc.parallelize(List(4, 5, 6))

// Union the two RDDs
val unionRDD = rdd1.union(rdd2)

// Print the result
unionRDD.collect().foreach(println)

// Output: 1 2 3 4 5 6
Distinct: Remove duplicate elements from the RDD.

// Create an RDD of integers with duplicates
val rdd = sc.parallelize(List(1, 2, 3, 4, 2, 3, 5))

// Remove duplicates
val distinctRDD = rdd.distinct()

// Print the result
distinctRDD.collect().foreach(println)

// Output: 1 2 3 4 5 (order may vary)
RDD Actions
• collect(): Return all elements of the RDD as an array to the driver
program.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Collect all elements of the RDD
val result = rdd.collect()

// Print the result
result.foreach(println)

// Output: 1 2 3 4 5
take(n): Return the first n elements of the RDD as
an array.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Take the first 3 elements of the RDD
val result = rdd.take(3)

// Print the result
result.foreach(println)

// Output: 1 2 3
reduce(func): Aggregate the elements of the RDD
using a specified binary function.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Sum all elements of the RDD using reduce
val sum = rdd.reduce((x, y) => x + y)

// Print the result
println(s"Sum: $sum")

// Output: Sum: 15
count(): Return the number of elements in the
RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Count the number of elements in the RDD
val count = rdd.count()

// Print the result
println(s"Count: $count")

// Output: Count: 5
foreach(func): Apply a function to each element of
the RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Print each element of the RDD
rdd.foreach(println)

// Output: 1 2 3 4 5 (order may vary)
What is Lazy Evaluation?
• The transformations on data are not immediately executed, but
rather their execution is delayed until an action is triggered.
• Spark builds up a logical execution plan called the DAG (Directed
Acyclic Graph) that represents the sequence of transformations to be
applied. The DAG is an internal representation of the computation.
• Actions in Spark trigger the execution of the entire DAG, and the computed results are returned to the driver program or written to an external storage system.
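A small illustration of this behavior in the Spark shell (a minimal sketch; the println inside map is there only to show when the work actually runs):

// Transformations only build the DAG; nothing is computed yet
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map { x =>
  println(s"processing $x")   // not printed at this point
  x * 2
}

// The action triggers execution of the whole lineage
val result = doubled.collect()   // "processing ..." messages appear now
println(result.mkString(", "))   // 2, 4, 6, 8, 10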
Advantages of Using Lazy Evaluation
• Optimized Performance:
• Lazy evaluation allows for optimized performance by delaying computations which means that unnecessary
computations and operations can be avoided, resulting in improved runtime efficiency and reduced
processing time.
• Resource Efficiency:
• Lazy evaluation helps in efficient resource utilization. By delaying the execution of computations until they
are needed, unnecessary resource allocation, memory consumption, and CPU usage can be avoided.
• Flexibility and Composition:
• By deferring computations, lazy evaluation allows functions and expressions to be composed, combined, and
reused in various contexts without immediately executing them. This enables modularity, code reuse, and
the creation of more flexible and composable software components.
• Dynamic Control Flow:
• Lazy evaluation facilitates dynamic control flow and conditional execution. It allows programs to evaluate
and execute different parts of the code based on runtime conditions, enabling more dynamic and adaptable
behavior.
• Error Handling and Fault Tolerance:
• Lazy evaluation can improve error handling and fault tolerance in applications. By postponing computations
until necessary, it provides an opportunity to catch and handle errors more effectively.
Data Sharing is Slow in MapReduce
• MapReduce is widely adopted for processing and generating large
datasets with a parallel, distributed algorithm on a cluster.
• It allows users to write parallel computations, using a set of high-level
operators, without having to worry about work distribution and fault
tolerance.
• Both Iterative and Interactive applications require faster data sharing
across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Data Sharing using Spark RDD

• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
• Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory processing. This means it stores the state of memory as an object across jobs, and that object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or through disk.
• Let us now try to find out how iterative and interactive operations take
place in Spark RDD.
Iterative Operations on Spark RDD

If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), Spark stores those results on disk.
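A minimal sketch of an iterative job that keeps its working set cached, assuming the standard persist API with a memory-and-disk storage level (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// Cache the working set in memory, spilling partitions to disk if RAM runs short
val points = sc.textFile("hdfs://path/to/points.txt")
  .map(_.split(",").map(_.toDouble))
  .persist(StorageLevel.MEMORY_AND_DISK)

// Each iteration reuses the cached RDD instead of re-reading from HDFS
var total = 0.0
for (i <- 1 to 10) {
  total += points.map(_.sum).reduce(_ + _)
}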
Interactive Operations on Spark RDD
• If different queries are run on the same set of data repeatedly, this
particular data can be kept in memory for better execution times.
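For example (a minimal sketch with a placeholder log file), caching keeps the data in memory so that repeated queries avoid re-reading it from HDFS:

// Load once and cache; the first action materializes the cache
val logs = sc.textFile("hdfs://path/to/access.log").cache()

// Different queries run against the same cached data
val errorCount   = logs.filter(_.contains("ERROR")).count()
val warningCount = logs.filter(_.contains("WARN")).count()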
To Start Spark
• Open VirtualBox and start the Cloudera VM
• Open the terminal
• Type the command spark-shell and press Enter
scala>
scala> sc.parallelize(1 to 100, 5).collect()
• This parallelizes a collection of integers from 1 to 100 across a specified number of partitions and then collects the results back to the driver program.
• sc: the SparkContext object, the entry point to any Spark functionality.
• parallelize(): This method is used to create a new RDD (Resilient Distributed Dataset) from a collection.
• 1 to 100: This represents the range of integers from 1 to 100, inclusive.
• 5: This parameter specifies the number of partitions into which the data should be divided.
• collect(): an action in Spark that retrieves all elements of an RDD (or DataFrame) from the executors to the driver program.
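To confirm how the data was split, the partition count can be inspected before collecting (a minimal sketch):

val rdd = sc.parallelize(1 to 100, 5)
println(rdd.getNumPartitions)    // 5
println(rdd.collect().length)    // 100 elements back on the driver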
Reading a CSV file locally
val path = "examples/src/main/resources/people.csv"

val df = spark.read.csv(path)
df.show()
// +------------------+
// | _c0|
// +------------------+
// | name;age;job|
// |Jorge;30;Developer|
// | Bob;32;Developer|
// +------------------+
val df2 = spark.read.option("delimiter", ";").csv(path)

df2.show()
// +-----+---+---------+
// | _c0|_c1| _c2|
// +-----+---+---------+
// | name|age| job|
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
val df3 = spark.read.option("delimiter", ";").option("header", "true").csv(path)

df3.show()
// +-----+---+---------+
// | name|age| job|
// +-----+---+---------+
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
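By default every column is read as a string; a minimal sketch of additionally letting Spark infer the column types:

// Infer column types (age becomes an integer column) in addition to the header
val df4 = spark.read
  .option("delimiter", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)

df4.printSchema()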
Aggregation functions in Spark
• Sum
• Average
• Maximum
• Minimum
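The examples below assume the DataFrame implicits and SQL functions are in scope. In spark-shell the implicits are available by default, while the functions need an explicit import (a minimal sketch):

// Needed for Seq(...).toDF(...) outside spark-shell
import spark.implicits._
// Brings sum, avg, max and min into scope
import org.apache.spark.sql.functions._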
Sum()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Sum aggregation
val sumDF = df.groupBy("Name").agg(sum("Amount").alias("TotalAmount"))
sumDF.show()
Avg()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Average aggregation
val avgDF = df.groupBy("Name").agg(avg("Amount").alias("AverageAmount"))
avgDF.show()
Max()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Maximum aggregation
val maxDF = df.groupBy("Name").agg(max("Amount").alias("MaxAmount"))
maxDF.show()
Min()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Minimum aggregation
val minDF = df.groupBy("Name").agg(min("Amount").alias("MinAmount"))
minDF.show()
Suppose we have a dataset containing sales transactions with columns like transaction_id, product_id,
quantity, and amount.
We want to group this data by product and calculate the total quantity and total amount sold for each product.

val data = Seq((1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
  (4, "C", 5, 50.0), (5, "B", 25, 250.0))
val df = data.toDF("transaction_id", "product_id", "quantity", "amount")

// Grouping and aggregation
val groupedDF = df.groupBy("product_id")
  .agg(sum("quantity").alias("total_quantity"),
       sum("amount").alias("total_amount"))
groupedDF.show()
Joining two datasets: sample data
// Sample sales data
val salesData = Seq(
  (1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
  (4, "C", 5, 50.0), (5, "B", 25, 250.0)
).toDF("transaction_id", "product_id", "quantity", "amount")

// Sample product data
val productData = Seq(
  ("A", "Product A", "Category 1"), ("B", "Product B", "Category 2"),
  ("C", "Product C", "Category 1")
).toDF("product_id", "product_name", "category")
// Joining the datasets
val joinedDF = salesData.join(productData, Seq("product_id"), "inner")
joinedDF.show()

+----------+--------------+--------+------+------------+----------+
|product_id|transaction_id|quantity|amount|product_name| category|
+----------+--------------+--------+------+------------+----------+
| A| 1| 10| 100.0| Product A|Category 1|
| A| 3| 15| 150.0| Product A|Category 1|
| B| 2| 20| 200.0| Product B|Category 2|
| B| 5| 25| 250.0| Product B|Category 2|
| C| 4| 5| 50.0| Product C|Category 1|
+----------+--------------+--------+------+------------+----------+
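Other join types follow the same pattern; the third argument of join selects the type (a minimal sketch using the standard Spark join-type names):

// Keep every sale even if no matching product exists
val leftJoined  = salesData.join(productData, Seq("product_id"), "left_outer")

// Keep every product even if it has no sales
val rightJoined = salesData.join(productData, Seq("product_id"), "right_outer")

// Keep all rows from both sides
val fullJoined  = salesData.join(productData, Seq("product_id"), "full_outer")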
