
UNIT-V

SPARK
Languages supported by Spark

Java
Python
R
Scala
Two important concepts of Hadoop
• 1. Hadoop Distributed File System (HDFS) (Storage)
• 2. MapReduce (Processing)
Processing in Hadoop
Why Spark?
To overcome the limitations of Hadoop
What is Spark?
• Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
• It was built on top of Hadoop MapReduce.
• It extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
• Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
• Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
• Spark can use Hadoop in two ways – one for storage and the other for processing.
• Since Spark has its own cluster management, it typically uses Hadoop for storage purposes only.
Apache Spark
• The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as
1. batch applications,
2. iterative algorithms,
3. interactive queries
4. streaming.
• It also reduces the management burden of maintaining separate
tools.
Evolution of Apache Spark

• Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.
• It was open-sourced in 2010 under a BSD (Berkeley Software Distribution) license.
• It was donated to the Apache Software Foundation in 2013.
• Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk.
• This is possible by reducing the number of read/write operations to disk; intermediate processing data is stored in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python.
• We can write applications in different languages. Spark provides 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop
The following diagram shows three ways Spark can be deployed with Hadoop components.
• Standalone − In a standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly.
• Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any pre-installation or root access required.
• This helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to a standalone deployment.
• With SIMR, a user can start Spark and use its shell without any administrative access.
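In code, the choice of cluster manager shows up as the "master" setting of the application. A minimal Scala sketch, assuming the standard SparkSession builder API (the host name below is a placeholder):

import org.apache.spark.sql.SparkSession

// "local[*]"          -> run locally on all cores (useful for testing)
// "spark://host:7077" -> Spark standalone cluster manager (placeholder host)
// "yarn"              -> run on Hadoop YARN
val spark = SparkSession.builder()
  .appName("DeploymentModeSketch")
  .master("local[*]")   // swap for "yarn" or "spark://host:7077" on a cluster
  .getOrCreate()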
Components of Spark
• Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon.
It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
• Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
• MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that benefits from Spark's distributed, memory-based architecture.
Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout.
• GraphX
GraphX is a distributed graph-processing framework on top of Spark.
It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API.
It also provides an optimized runtime for this abstraction.
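A minimal sketch of how these components are reached from a single application, assuming the standard Spark APIs (file paths are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().appName("ComponentsSketch").getOrCreate()

// Spark Core: RDDs through the SparkContext
val numbers = spark.sparkContext.parallelize(1 to 10)

// Spark SQL: DataFrames and SQL queries
val people = spark.read.json("people.json")   // placeholder path
people.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()

// Spark Streaming: mini-batch processing on top of Spark Core
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
// (MLlib and GraphX are used through their own packages,
//  e.g. org.apache.spark.ml and org.apache.spark.graphx)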
(Diagrams: data processing in Hadoop and in Spark)
Architecture of Spark
Spark follows a master-slave architecture: a driver program (the master) coordinates executors (the slaves) running on the worker nodes of the cluster.
Resilient Distributed Datasets(RDD)
• Resilient Distributed Datasets (RDD) is a fundamental data structure
of Spark.
• It is an immutable distributed collection of objects.
• Each dataset in RDD is divided into logical partitions, which may be
computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
• RDD is a fault-tolerant collection of elements that can be operated on in parallel.
Process of dividing the dataset into partitions
Operations on RDD
1. Transformations
2. Actions
• There are two ways to create RDDs
1. parallelizing an existing collection in your driver program
2. referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
• Spark uses the concept of RDD to achieve faster and more efficient MapReduce operations.
Creating RDD using parallelize()
// Create an RDD from a list of integers
val data = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Create an RDD from a list of strings
val data = List("Hello", "World", "Spark", "is", "awesome")
val rdd = sc.parallelize(data)
Creating RDD using parallelize()
// Create an RDD from a list of tuples
val data = List(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val rdd = sc.parallelize(data)
Creating RDD from a text file(dataset)
// Start the Scala Spark shell (run the spark-shell command in Cloudera)

// Load the sample file into an RDD
val rdd = sc.textFile("hdfs://path/to/your/sample/file.txt")

// Perform operations on the RDD, for example, count the number of lines
val count = rdd.count()

// Print the number of lines in the RDD
println(s"Number of lines in the RDD: $count")
RDD Transformations
• Map: Apply a function to each element of the RDD to create a new
RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Map each element to its square
val squaredRDD = rdd.map(x => x * x)

// Print the result
squaredRDD.collect().foreach(println)

// Output: 1 4 9 16 25
• Filter: Keep only the elements of the RDD that satisfy a condition.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Filter the even numbers
val evenRDD = rdd.filter(x => x % 2 == 0)

// Print the result
evenRDD.collect().foreach(println)

// Output: 2 4
FlatMap: Apply a function to each element of the
RDD and flatten the results.
// Create an RDD of strings
val rdd = sc.parallelize(List("hello world", "how are you"))

// Split each string into words and flatten the result
val wordsRDD = rdd.flatMap(_.split(" "))

// Print the result
wordsRDD.collect().foreach(println)

// Output (one word per line): hello, world, how, are, you
Union: Combine two RDDs into one RDD.
// Create two RDDs of integers
val rdd1 = sc.parallelize(List(1, 2, 3))
val rdd2 = sc.parallelize(List(4, 5, 6))

// Union the two RDDs
val unionRDD = rdd1.union(rdd2)

// Print the result
unionRDD.collect().foreach(println)

// Output: 1 2 3 4 5 6
Distinct: Remove duplicate elements from the RDD.

// Create an RDD of integers with duplicates
val rdd = sc.parallelize(List(1, 2, 3, 4, 2, 3, 5))

// Remove duplicates
val distinctRDD = rdd.distinct()

// Print the result
distinctRDD.collect().foreach(println)

// Output: 1 2 3 4 5 (order may vary)
RDD Actions
• collect(): Return all elements of the RDD as an array to the driver
program.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Collect all elements of the RDD
val result = rdd.collect()

// Print the result
result.foreach(println)

// Output: 1 2 3 4 5
take(n): Return the first n elements of the RDD as
an array.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Take the first 3 elements of the RDD
val result = rdd.take(3)

// Print the result
result.foreach(println)

// Output: 1 2 3
reduce(func): Aggregate the elements of the RDD
using a specified binary function.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Sum all elements of the RDD using reduce
val sum = rdd.reduce((x, y) => x + y)

// Print the result
println(s"Sum: $sum")

// Output: Sum: 15
count(): Return the number of elements in the
RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Count the number of elements in the RDD
val count = rdd.count()

// Print the result
println(s"Count: $count")

// Output: Count: 5
foreach(func): Apply a function to each element of
the RDD.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

// Print each element of the RDD
rdd.foreach(println)

// Output: 1 2 3 4 5 (order may vary)
What is Lazy Evaluation?
• The transformations on data are not immediately executed, but
rather their execution is delayed until an action is triggered.
• Spark builds up a logical execution plan called the DAG (Directed
Acyclic Graph) that represents the sequence of transformations to be
applied. The DAG is an internal representation of the computation.
• Actions in Spark trigger the execution of the entire DAG, and the computed results are returned to the driver program or written to an external storage system.
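A small illustration of this behavior in the Spark shell (a minimal sketch; the println inside map is there only to show when the work actually runs):

// Transformations only build the DAG; nothing is computed yet
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map { x =>
  println(s"processing $x")   // not printed at this point
  x * 2
}

// The action triggers execution of the whole lineage
val result = doubled.collect()   // "processing ..." messages appear now
println(result.mkString(", "))   // 2, 4, 6, 8, 10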
Advantages of Using Lazy Evaluation
• Optimized Performance:
• Lazy evaluation allows for optimized performance by delaying computations which means that unnecessary
computations and operations can be avoided, resulting in improved runtime efficiency and reduced
processing time.
• Resource Efficiency:
• Lazy evaluation helps in efficient resource utilization. By delaying the execution of computations until they
are needed, unnecessary resource allocation, memory consumption, and CPU usage can be avoided.
• Flexibility and Composition:
• By deferring computations, lazy evaluation allows functions and expressions to be composed, combined, and
reused in various contexts without immediately executing them. This enables modularity, code reuse, and
the creation of more flexible and composable software components.
• Dynamic Control Flow:
• Lazy evaluation facilitates dynamic control flow and conditional execution. It allows programs to evaluate
and execute different parts of the code based on runtime conditions, enabling more dynamic and adaptable
behavior.
• Error Handling and Fault Tolerance:
• Lazy evaluation can improve error handling and fault tolerance in applications. By postponing computations
until necessary, it provides an opportunity to catch and handle errors more effectively.
Data Sharing is Slow in MapReduce
• MapReduce is widely adopted for processing and generating large
datasets with a parallel, distributed algorithm on a cluster.
• It allows users to write parallel computations, using a set of high-level
operators, without having to worry about work distribution and fault
tolerance.
• Both Iterative and Interactive applications require faster data sharing
across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Data Sharing using Spark RDD

• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
• Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory processing. This means it stores the state of memory as an object across jobs, and that object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or through disk.
• Let us now try to find out how iterative and interactive operations take
place in Spark RDD.
Iterative Operations on Spark RDD

If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), Spark stores those results on disk.
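A minimal sketch of an iterative job that keeps its working set cached, assuming the standard persist API with a memory-and-disk storage level (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// Cache the working set in memory, spilling partitions to disk if RAM runs short
val points = sc.textFile("hdfs://path/to/points.txt")
  .map(_.split(",").map(_.toDouble))
  .persist(StorageLevel.MEMORY_AND_DISK)

// Each iteration reuses the cached RDD instead of re-reading from HDFS
var total = 0.0
for (i <- 1 to 10) {
  total += points.map(_.sum).reduce(_ + _)
}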
Interactive Operations on Spark RDD
• If different queries are run on the same set of data repeatedly, this
particular data can be kept in memory for better execution times.
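For example (a minimal sketch with a placeholder log file), caching keeps the data in memory so that repeated queries avoid re-reading it from HDFS:

// Load once and cache; the first action materializes the cache
val logs = sc.textFile("hdfs://path/to/access.log").cache()

// Different queries run against the same cached data
val errorCount   = logs.filter(_.contains("ERROR")).count()
val warningCount = logs.filter(_.contains("WARN")).count()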
To Start Spark
• Open VirtualBox and start the Cloudera VM
• Open the terminal
• Type the command spark-shell and press Enter
scala>
scala> sc.parallelize(1 to 100, 5).collect()
• This parallelizes a collection of integers from 1 to 100 across a specified number of partitions and then collects the results back to the driver program.
• sc: the SparkContext object, the entry point to any Spark functionality.
• parallelize(): This method is used to create a new RDD (Resilient Distributed Dataset) from a collection.
• 1 to 100: This represents the range of integers from 1 to 100, inclusive.
• 5: This parameter specifies the number of partitions into which the data should be divided.
• collect(): an action in Spark that retrieves all elements of an RDD (or DataFrame) from the executors to the driver program.
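To confirm how the data was split, the partition count can be inspected before collecting (a minimal sketch):

val rdd = sc.parallelize(1 to 100, 5)
println(rdd.getNumPartitions)    // 5
println(rdd.collect().length)    // 100 elements back on the driver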
Reading a CSV file locally
val path = "examples/src/main/resources/people.csv"

val df = spark.read.csv(path)
df.show()
// +------------------+
// | _c0|
// +------------------+
// | name;age;job|
// |Jorge;30;Developer|
// | Bob;32;Developer|
// +------------------+
val df2 = spark.read.option("delimiter", ";").csv(path)

df2.show()
// +-----+---+---------+
// | _c0|_c1| _c2|
// +-----+---+---------+
// | name|age| job|
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
val df3 = spark.read.option("delimiter", ";").option("header", "true").csv(path)

df3.show()
// +-----+---+---------+
// | name|age| job|
// +-----+---+---------+
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
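By default every column is read as a string; a minimal sketch of additionally letting Spark infer the column types:

// Infer column types (age becomes an integer column) in addition to the header
val df4 = spark.read
  .option("delimiter", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)

df4.printSchema()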
Aggregation functions in Spark
• Sum
• Average
• Maximum
• Minimum
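The examples below assume the DataFrame implicits and SQL functions are in scope. In spark-shell the implicits are available by default, while the functions need an explicit import (a minimal sketch):

// Needed for Seq(...).toDF(...) outside spark-shell
import spark.implicits._
// Brings sum, avg, max and min into scope
import org.apache.spark.sql.functions._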
Sum()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Sum aggregation
val sumDF = df.groupBy("Name").agg(sum("Amount").alias("TotalAmount"))
sumDF.show()
Avg()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Average aggregation
val avgDF = df.groupBy("Name").agg(avg("Amount").alias("AverageAmount"))
avgDF.show()
Max()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Maximum aggregation
val maxDF = df.groupBy("Name").agg(max("Amount").alias("MaxAmount"))
maxDF.show()
Min()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")

// Minimum aggregation
val minDF = df.groupBy("Name").agg(min("Amount").alias("MinAmount"))
minDF.show()
Suppose we have a dataset containing sales transactions with columns like transaction_id, product_id,
quantity, and amount.
We want to group this data by product and calculate the total quantity and total amount sold for each product.

val data = Seq((1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
  (4, "C", 5, 50.0), (5, "B", 25, 250.0))
val df = data.toDF("transaction_id", "product_id", "quantity", "amount")

// Grouping and aggregation
val groupedDF = df.groupBy("product_id")
  .agg(sum("quantity").alias("total_quantity"),
       sum("amount").alias("total_amount"))
groupedDF.show()
Joining two datasets: sample data
// Sample sales data
val salesData = Seq(
  (1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
  (4, "C", 5, 50.0), (5, "B", 25, 250.0)
).toDF("transaction_id", "product_id", "quantity", "amount")

// Sample product data
val productData = Seq(
  ("A", "Product A", "Category 1"), ("B", "Product B", "Category 2"),
  ("C", "Product C", "Category 1")
).toDF("product_id", "product_name", "category")
// Joining the datasets
val joinedDF = salesData.join(productData, Seq("product_id"), "inner")
joinedDF.show()

+----------+--------------+--------+------+------------+----------+
|product_id|transaction_id|quantity|amount|product_name| category|
+----------+--------------+--------+------+------------+----------+
| A| 1| 10| 100.0| Product A|Category 1|
| A| 3| 15| 150.0| Product A|Category 1|
| B| 2| 20| 200.0| Product B|Category 2|
| B| 5| 25| 250.0| Product B|Category 2|
| C| 4| 5| 50.0| Product C|Category 1|
+----------+--------------+--------+------+------------+----------+
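Other join types follow the same pattern; the third argument of join selects the type (a minimal sketch using the standard Spark join-type names):

// Keep every sale even if no matching product exists
val leftJoined  = salesData.join(productData, Seq("product_id"), "left_outer")

// Keep every product even if it has no sales
val rightJoined = salesData.join(productData, Seq("product_id"), "right_outer")

// Keep all rows from both sides
val fullJoined  = salesData.join(productData, Seq("product_id"), "full_outer")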
