Unit-V Spark
SPARK
Languages supported by Spark
Java
Python
R
Scala
Two important concepts of Hadoop
• 1. Hadoop Distributed File System (HDFS) – storage
• 2. MapReduce – processing
Processing in Hadoop
Why Spark?
To overcome the limitations of Hadoop MapReduce, which relies on disk-based, batch-oriented processing.
What is Spark?
• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
• It was built on top of Hadoop MapReduce.
• It extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
• Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
• Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management; Hadoop is just one of the ways to deploy Spark.
• Spark can use Hadoop in two ways – one is storage and the other is processing.
• Since Spark has its own cluster management and computation engine, it typically uses Hadoop for storage purposes only, as sketched below.
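A minimal sketch of this storage-only use of Hadoop, assuming a running HDFS cluster (the namenode address and file path below are placeholders, not from the original slides):
// Read a text file stored in HDFS; the URI and path are placeholders
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
// All computation happens inside Spark; HDFS is used only as the storage layer
val lineCount = lines.count()
println(s"Number of lines: $lineCount")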
Apache Spark
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application (see the caching sketch after this list).
• Spark is designed to cover a wide range of workloads such as
1. batch applications,
2. iterative algorithms,
3. interactive queries
4. streaming.
• It also reduces the management burden of maintaining separate
tools.
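As a small illustration of the in-memory computing feature mentioned above, the sketch below caches an RDD so that repeated (iterative or interactive) computations reuse the in-memory copy instead of recomputing it; the data here is made up for the example.
// Create an RDD and mark it to be kept in memory across actions
val numbers = sc.parallelize(1 to 1000000)
val cached = numbers.cache()
// The first action computes the data and caches it
val total = cached.sum()
// Later actions (e.g. in an iterative algorithm) reuse the in-memory copy
val evens = cached.filter(_ % 2 == 0).count()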
Evolution of Apache Spark
• Spark started as a research project at UC Berkeley's AMPLab in 2009.
• It was open-sourced in 2010 and donated to the Apache Software Foundation in 2013.
• Spark became a top-level Apache project in 2014.
RDD operations
// Create an RDD from a text file (the file name here is just an example)
val rdd = sc.textFile("input.txt")
// Perform operations on the RDD, for example, count the number of lines
val count = rdd.count()
FlatMap: Apply a function to each element of the
RDD and flatten the results.
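A minimal sketch of flatMap, reusing the same sample data as the example below:
// Create an RDD of sentences
val rdd = sc.parallelize(List("hello world", "how are you"))
// Split each sentence into words and flatten the results into one RDD of words
val words = rdd.flatMap(line => line.split(" "))
// Print the result: hello, world, how, are, you
words.collect().foreach(println)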
Distinct: Remove duplicate elements from the RDD.
// Create an RDD of strings
val rdd = sc.parallelize(List("hello world", "how are you"))
// Remove duplicates (this sample data has no duplicates, so the output is unchanged)
val distinctRDD = rdd.distinct()
// Print the result
distinctRDD.collect().foreach(println)
RDD Actions
• collect(): Return all elements of the RDD as an array to the driver
program.
// Create an RDD of integers
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
// Return all elements to the driver as an array and print them
rdd.collect().foreach(println)
Reading CSV files into a DataFrame
// Without options, the ';'-separated file is read as a single column
val df = spark.read.csv(path)
df.show()
// +------------------+
// | _c0|
// +------------------+
// | name;age;job|
// |Jorge;30;Developer|
// | Bob;32;Developer|
// +------------------+
val df2 = spark.read.option("delimiter", ";").csv(path)
df2.show()
// +-----+---+---------+
// | _c0|_c1| _c2|
// +-----+---+---------+
// | name|age| job|
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
val df3 = spark.read.option("delimiter", ";").option("header", "true").csv(path)
df3.show()
// +-----+---+---------+
// | name|age| job|
// +-----+---+---------+
// |Jorge| 30|Developer|
// | Bob| 32|Developer|
// +-----+---+---------+
Aggregation functions in Spark
• Sum
• Average
• Maximum
• Minimum
Sum()
// Imports assumed by these aggregation examples (spark is the active SparkSession)
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")
// Sum aggregation
val sumDF = df.groupBy("Name").agg(sum("Amount").alias("TotalAmount"))
sumDF.show()
Avg()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")
// Average aggregation
val avgDF = df.groupBy("Name").agg(avg("Amount").alias("AverageAmount"))
avgDF.show()
Max()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")
// Maximum aggregation
val maxDF = df.groupBy("Name").agg(max("Amount").alias("MaxAmount"))
maxDF.show()
Min()
val data = Seq(("Alice", 100), ("Bob", 150), ("Charlie", 200), ("Alice", 50))
val df = data.toDF("Name", "Amount")
// Minimum aggregation
val minDF = df.groupBy("Name").agg(min("Amount").alias("MinAmount"))
minDF.show()
Suppose we have a dataset containing sales transactions with columns like transaction_id, product_id,
quantity, and amount.
We want to group this data by product and calculate the total quantity and total amount sold for each product.
• val data = Seq( (1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
(4, "C", 5, 50.0), (5, "B", 25, 250.0) )
• val df = data.toDF("transaction_id", "product_id", "quantity", "amount")
// Grouping and aggregation
• val groupedDF = df.groupBy("product_id")
.agg(sum("quantity").alias("total_quantity"),
sum("amount").alias("total_amount")) groupedDF.show()
Dataset1
// Sample sales data
val salesData = Seq(
  (1, "A", 10, 100.0), (2, "B", 20, 200.0), (3, "A", 15, 150.0),
  (4, "C", 5, 50.0), (5, "B", 25, 250.0)
).toDF("transaction_id", "product_id", "quantity", "amount")
+----------+--------------+--------+------+------------+----------+
|product_id|transaction_id|quantity|amount|product_name| category|
+----------+--------------+--------+------+------------+----------+
| A| 1| 10| 100.0| Product A|Category 1|
| A| 3| 15| 150.0| Product A|Category 1|
| B| 2| 20| 200.0| Product B|Category 2|
| B| 5| 25| 250.0| Product B|Category 2|
| C| 4| 5| 50.0| Product C|Category 1|
+----------+--------------+--------+------+------------+----------+