UNIT V
Speed: Spark runs workloads up to 100 times faster than Hadoop MapReduce for large-scale data
processing, largely because it processes data in memory. It can also divide the data into
chunks (partitions) in a controlled way.
Powerful Caching: A simple programming layer provides powerful caching and disk-persistence
capabilities.
Deployment: It can be deployed through Hadoop YARN or with Spark's own standalone cluster
manager.
Real-Time: Because of its in-memory processing, it offers real-time computation and low
latency.
Apache Spark is needed because it solves several problems that arise when dealing with big
data processing and real-time data analytics. Here’s why Spark is valuable:
One Tool, Many Jobs: Spark supports batch processing, streaming data, SQL queries,
machine learning, and graph analytics, all within the same engine.
Reduces Tool Overhead: Instead of using separate tools for each task (like Hive for SQL,
Storm for streaming), Spark provides an all-in-one solution.
Spark Streaming: Allows real-time data processing for live data feeds (e.g., from IoT
devices, social media, or logs).
Low Latency: Processes data in near real-time using micro-batches, useful for
monitoring systems, fraud detection, etc.
4. Easy to Use
Developer-Friendly APIs: Available in Python, Java, Scala, and R with concise syntax and
rich libraries.
Simple Coding Model: You can perform complex operations with concise code using
functions like map(), reduceByKey(), and groupBy() (see the short example after this list).
5. Fault Tolerant
Automatic Recovery: Spark can recover lost data and tasks using RDD lineage, without
needing data replication like Hadoop.
Reliable for Big Data: Great for distributed environments where node failures are
common.
6. Scalability
Compatible with Big Data Tools: Integrates with HDFS, Hive, HBase, Cassandra, Kafka,
and more.
Ecosystem Modules:
o Spark SQL for querying structured data
o MLlib for machine learning
o GraphX for graph processing
o Spark Streaming for real-time analytics
8. Cost-Effective
Better Resource Utilization: Faster processing means less compute time, saving on cloud
or infrastructure costs.
Open Source: Free to use and actively supported by a large community.
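To illustrate the simple coding model mentioned in point 4 above, here is a minimal PySpark word-count sketch (the input lines are made-up sample data):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleCodingModel").getOrCreate()
sc = spark.sparkContext

# A few sample lines of text (illustrative data)
lines = sc.parallelize(["spark is fast", "spark is simple"])

# map() and reduceByKey() express the whole word count in a few lines
word_counts = (lines.flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

print(word_counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('simple', 1)]
spark.stop()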
The worker node is a slave node. Its role is to run the application code in the cluster.
Many worker nodes can be used to process an RDD created in the SparkContext, and
the results can also be cached on them.
The slave nodes function as executors, processing tasks and returning the results to
the Spark context. The master node issues tasks to the Spark context, and the worker
nodes execute them.
They make processing simpler by using multiple worker nodes (1 to n) to handle as many
jobs as possible in parallel, dividing a job into sub-jobs that run on multiple machines.
The cluster manager monitors the worker nodes to ensure that the computation is carried
out; each worker node runs one or more executors that handle the Spark tasks assigned to it.
The Spark executors
An executor is responsible for executing tasks and for storing data in a cache. At the outset,
executors register with the driver program.
Each executor has a number of slots for running the application's tasks concurrently. An
executor runs its tasks once it has loaded the data, and it is removed when it sits idle.
An executor runs as a Java (JVM) process; data is loaded and released during the execution of
the tasks. Executors can be allocated dynamically, being added and removed continually during
the execution of the application.
The driver program monitors the executors during their execution. Users' tasks run inside these
Java processes.
Task:
A task is a unit of work that is sent to one executor. The Spark context receives task
information from the cluster manager and enqueues it on the worker nodes.
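As a rough illustration of how the driver, executors, and cluster manager fit together in practice, a spark-submit invocation might look like the following sketch (the application file and the resource numbers are made-up examples; --num-executors applies when running on YARN):
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  my_app.py
Here the cluster manager (YARN) launches four executors, each with two task slots (cores), and the driver schedules tasks onto them.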
Spark Components:
The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage
systems and memory management.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured data.
o It allows querying the data via SQL (Structured Query Language) as well as the Apache
Hive variant of SQL, called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects
and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is the component that supports scalable, fault-tolerant processing of live
data streams (for example from Kafka or TCP sockets).
o It ingests the data in small micro-batches and processes them with the same API that is used
for batch jobs.
MLlib
o The MLlib is a Machine Learning library that contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It makes it possible to create a directed graph with arbitrary properties attached to each
vertex and edge.
o To manipulate graphs, it supports various fundamental operators such as subgraph,
joinVertices, and aggregateMessages.
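As a small, hedged illustration of the MLlib component described above, the following PySpark sketch computes a Pearson correlation matrix over a tiny made-up set of feature vectors:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Tiny illustrative dataset of feature vectors
data = [(Vectors.dense([1.0, 2.0, 3.0]),),
        (Vectors.dense([4.0, 5.0, 6.0]),),
        (Vectors.dense([7.0, 8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

# Pearson correlation matrix of the feature columns
matrix = Correlation.corr(df, "features").head()[0]
print(matrix)
spark.stop()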
Spark Shell:
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data
interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to
use existing Java libraries) or Python. Start it by running the following in the Spark directory:
For Python :
./bin/pyspark
For Scala:
./bin/spark-shell
- It is built on Scala (default) but also supports PySpark (for Python users).
- Internally connects to a SparkSession (spark) and SparkContext (sc) for executing operations.
- You can check the version using the following command: sc.version
- To stop the Spark session in the Spark Shell, you can use either of these commands:
spark.stop() or sc.stop()
- The :quit command is specific to the Scala REPL(Read-Eval-Print-Loop), which includes the
Spark Shell (since it's built on Scala).
Example Commands in Spark Shell:
val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
val doubled = rdd.map(_ * 2)
doubled.collect() // Output: Array(2, 4, 6, 8, 10)
2. Pre-initialized Context
Automatically provides:
o sc → SparkContext
o spark → SparkSession
No need to manually initialize them.
Leverages Spark’s in-memory data processing, making operations super fast compared
to traditional MapReduce.
Ideal for writing small pieces of code to test logic before integrating into full
applications.
7. Multi-language Support
Available in Scala (./bin/spark-shell) and Python (./bin/pyspark), as shown above.
RDD (Resilient Distributed Dataset):
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute
missing or damaged partitions due to node failures.
Distributed, with data residing on multiple nodes in a cluster.
Dataset, i.e. a collection of partitioned data with primitive values or values of values, e.g.
tuples or other objects (Python, Scala, or Java objects).
Besides the above traits (which are directly embedded in the name of the data abstraction, RDD),
it has the following additional traits:
In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as
possible.
Immutable or Read-Only, i.e. it does not change once created and can only be
transformed using transformations to new RDDs.
Lazy evaluated, i.e. transformations on an RDD are not performed immediately. Spark
internally records metadata to track the operations. Lazy evaluation reduces the number
of passes over the data by grouping the operations.
Cacheable, i.e. you can hold all the data in a persistent “storage” like memory (default
and the most preferred) or disk (the least preferred due to access speed).
Parallel, i.e. process data in parallel.
Typed — RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int,
String)].
Partitioned — records are partitioned (split into logical partitions) and distributed across
nodes in a cluster.
Location-Stickiness — RDD can define placement preferences to compute partitions (as
close to the records as possible).
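A short PySpark sketch illustrating immutability, caching, and parallel processing, assuming the SparkContext sc from the pyspark shell (the numbers are made up):
nums = sc.parallelize(range(1, 6))        # distributed, partitioned dataset
squares = nums.map(lambda x: x * x)       # a transformation returns a NEW RDD; nums itself never changes
squares.cache()                           # cacheable: keep the result in memory for reuse
print(squares.collect())                  # [1, 4, 9, 16, 25]
print(nums.collect())                     # [1, 2, 3, 4, 5] -- the original RDD is untouched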
RDD Operations:
There are two types of operations we can perform on RDDs: transformations and
actions.
A transformation is a function that produces a new RDD from the existing RDDs, whereas an
action is performed when we want to work with the actual dataset. When an action is
triggered, a result is returned to the driver program; unlike a transformation, no new RDD is formed.
RDD Transformation
A Spark transformation is a function that produces a new RDD from the existing RDDs. It
takes an RDD as input and produces one or more RDDs as output. Each time we apply a
transformation, it creates a new RDD; the input RDDs cannot be changed, since
RDDs are immutable in nature.
Applying transformations builds an RDD lineage containing all the parent RDDs of the final
RDD(s). The RDD lineage is also known as the RDD operator graph or RDD dependency graph. It is
a logical execution plan, i.e. a Directed Acyclic Graph (DAG) of all the parent RDDs
of an RDD.
Transformations are lazy in nature, i.e. they execute only when we call an action. They
are not executed immediately.
Narrow transformation – In a narrow transformation, all the elements that are required to
compute the records in a single partition live in a single partition of the parent RDD. Only a
limited subset of the partitions is used to calculate the result. Narrow transformations are the
result of map() and filter().
One parent partition ==> one child partition
Wide transformation – In a wide transformation, the elements that are required to
compute the records in a single partition may live in many partitions of the parent RDD. Wide
transformations are the result of groupByKey() and reduceByKey(). Wide transformations are also
known as shuffle transformations because a data shuffle is required to process the data.
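A minimal PySpark sketch contrasting narrow and wide transformations, assuming the SparkContext sc from the shell (the key-value pairs are made up):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Narrow transformations: each output partition depends on a single parent partition
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))
only_a = doubled.filter(lambda kv: kv[0] == "a")

# Wide (shuffle) transformation: records with the same key may come from many partitions
summed = pairs.reduceByKey(lambda a, b: a + b)

# Nothing has executed yet (lazy evaluation); the action below triggers the DAG
print(summed.collect())  # e.g. [('a', 4), ('b', 6)]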
Parallelized collections (sc.parallelize): the elements of an existing collection are copied to
form a distributed dataset that can be operated on in parallel.
To use this method, the entire dataset must first be present on one machine. Due to this
property, parallelized collections are rarely used outside of testing and prototyping.
Spark RDD’s can be created from any storage source supported by Hadoop, including your
local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files,
SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using the sc.textFile(name, partitions) method. This method takes
a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads
it as a collection of lines.
An RDD of pairs of a file and its content can be created from a directory
using sc.wholeTextFiles(name, partitions). This method lets you read a directory containing
multiple small text files, and returns each of them as (filename, content) pairs. This is in
contrast with textFile, which would return one record per line in each file.
If using a path on the local filesystem, the file must also be accessible at the same path on
worker nodes. Either copy the file to all workers or use a network-mounted shared file
system.
All of Spark’s file-based input methods, including textFile, support running on directories,
compressed files, and wildcards as well. For example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for controlling the number of
partitions of the file. By default, Spark creates one partition for each block of the file (blocks
being 128MB by default in HDFS), but you can also ask for a higher number of partitions by
passing a larger value. Note that you cannot have fewer partitions than blocks.
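A hedged PySpark sketch of these file-based methods; the paths and the partition count are placeholders, not real files:
# One record per line; ask for at least 10 partitions
lines = sc.textFile("hdfs://namenode:9000/data/logs.txt", 10)
print(lines.count())

# One (filename, content) pair per file in the directory
files = sc.wholeTextFiles("/my/directory")
print(files.keys().collect())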
We can create different RDDs from the existing RDDs. This process of creating another
dataset from an existing one is called a transformation. As a result, a transformation always
produces a new RDD. Because RDDs are immutable, no changes take place in an RDD once it is
created; this property maintains consistency over the cluster.
Some of the operations performed on RDDs are map, filter, count, distinct, flatMap, etc.
[Diagram: Spark SQL architecture. SQL, the DataFrame API, and the Dataset API feed into the
Catalyst Optimizer (the Spark SQL engine), which takes a query through the Analysis, Logical
Optimization, Physical Planning, and Code Generation & Execution phases and finally produces
Java bytecode that operates on RDDs.]
1. Input Interfaces:
These are the various APIs and interfaces through which you can interact with Spark SQL:
SQL: You can write queries using standard SQL syntax, which Spark can interpret.
DataFrame API: A higher-level abstraction over RDDs, DataFrames allow you to
manipulate distributed collections of data using domain-specific language (DSL)
functions.
DataSet: Combines the best of RDDs (strong typing, functional programming) and
DataFrames (optimized execution). Mainly used with strongly typed languages like Scala.
All three interfaces eventually send the user query to the Catalyst Optimizer.
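To make these interfaces concrete, the following PySpark sketch expresses the same query through the SQL interface and through the DataFrame API (the table and column names are made up):
df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])

# SQL interface
df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT name FROM people WHERE age > 26")

# DataFrame API: both forms are handed to the Catalyst Optimizer
api_result = df.filter(df.age > 26).select("name")

sql_result.show()
api_result.show()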
All of these paths run through the Catalyst Optimizer: the query is analyzed, its logical plan is
optimized, and the optimized logical plan is translated into one or more physical plans, with
cost-based optimization used to select the best physical plan. The selected plan is finally
compiled into Java bytecode and executed as low-level RDD operations.
The 4 main phases of the Spark SQL Engine represent how an SQL query or DataFrame
operation is processed internally in Spark. These phases transform the query from a high-level
statement into low-level RDD operations executed on a cluster.
The 4-phase model is a simplified way to explain Spark SQL's pipeline.
1. Analysis Phase
This is where the Spark SQL engine starts understanding your query: the statement is parsed
into an unresolved logical plan, and table names, column names, and data types are resolved
against the catalog to produce an analyzed logical plan.
2. Logical Optimization Phase
Sub-Steps:
withCachedData Logical Plan (Optional):
o If parts of the data are cached (df.cache()), Spark uses cached versions here.
Optimized Logical Plan:
o Spark applies Catalyst optimizer rules:
Predicate pushdown (move filters closer to data)
Constant folding (compute constant expressions early)
Projection pruning (select only needed columns)
3. Physical Planning Phase
The optimized logical plan is translated into one or more physical plans, and cost-based
optimization selects the best physical plan.
4. Code Generation & Execution Phase
The selected physical plan is compiled into Java bytecode and executed as RDD operations on
the cluster.
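You can observe these phases from the shell with DataFrame.explain(); a small sketch (the column names are illustrative):
df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "value"])
result = df.filter(df.value > 15).select("id")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
result.explain(True)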
Spark Session:
The Spark context was originally the entry point for Spark. As the RDD was the main API, it was
created and manipulated using the context's APIs, and for every other API we needed to use a
different context.
For streaming we needed a StreamingContext, for SQL an SQLContext, and so on. But as the Dataset
and DataFrame APIs became the new standard APIs, an entry point built for them was needed. So in
Spark 2.0 we got a new entry point built for the Dataset and DataFrame APIs, called SparkSession.
It is a combination of SQLContext, HiveContext and (in future) StreamingContext. All the APIs
available on those contexts are available on SparkSession as well, and SparkSession contains a
Spark context for the actual computation.
SparkSession.builder()
master("local")
Sets the Spark master URL to connect to, such as "local" to run locally or "local[4]" to run
locally with 4 cores.
appName()
Sets a name for the application, which will be shown in the Spark Web UI. If no application
name is set, a randomly generated name will be used.
config()
Sets a config option. Options set using this method are automatically propagated to both
SparkConf and SparkSession's own configuration; its arguments consist of key-value
pairs.
getOrCreate()
Gets an existing SparkSession or, if there is none, creates a new one. The method first checks
whether there is a valid thread-local SparkSession and, if yes, returns that one. It then checks
whether there is a valid global default SparkSession and, if yes, returns that one. If no valid
global default SparkSession exists, the method creates a new SparkSession and assigns the newly
created SparkSession as the global default.
In case an existing SparkSession is returned, the config options specified in this builder will
be applied to the existing SparkSession.
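Putting the builder methods together, a minimal PySpark sketch (the application name and the config key/value are illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("MySparkApp")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

print(spark.version)
spark.stop()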
Creating DataFrames:
In Spark, DataFrames are the distributed collections of data, organized into rows and
columns. Each column in a DataFrame has a name and an associated type. DataFrames
are similar to traditional database tables, which are structured and concise. We can say
that DataFrames are like relational database tables with better optimization techniques.
Spark DataFrames can be created from various sources, such as Hive tables, log tables,
external databases, or the existing RDDs. DataFrames allow the processing of huge
amounts of data.
A DataFrame is a distributed collection of data organized into named columns, similar to
a table in a relational database.
1. Creating a DataFrame from a list of tuples with a schema
You can create a DataFrame using spark.createDataFrame() by providing a list of tuples and
defining the schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
# Define schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Create DataFrame
data = [(1, "John", 25), (2, "Doe", 28), (3, "Alice", 23)]
df = spark.createDataFrame(data, schema)
df.show()
2. Creating a DataFrame from a CSV file
If you have a CSV file, you can directly load it into a DataFrame:
df = spark.read.option("header", True).csv("data.csv")
df.show()
3. Creating a DataFrame from a JSON file
df = spark.read.json("data.json")
df.show()
4. Creating a DataFrame from a pandas DataFrame
To create a DataFrame from a pandas DataFrame, you can use the createDataFrame method
provided by the SparkSession class and pass in the pandas DataFrame as the first argument.
import pandas as pd
from pyspark.sql import SparkSession
# Sample pandas DataFrame (illustrative data)
pandas_df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["John", "Doe", "Alice"]})
df = spark.createDataFrame(pandas_df)
df.show()
5. Creating a DataFrame from a TXT file
To create a DataFrame from a text file, you can use the text() method provided by
the SparkSession class.
df = spark.read.text("path/to/file.txt")
df.show()
DataFrame Operations:
Once a DataFrame is created, you can:
View
Filter
Transform
Analyze
Save/load it
These operations help you work with large-scale datasets efficiently in a distributed
environment.
1. Viewing the Schema
df.printSchema(): Shows the structure of the DataFrame (column names, data types, nullability).
2. Adding and Renaming Columns
Adding a column (requires from pyspark.sql.functions import lit):
df.withColumn("Country", lit("India")).show()
Renaming a column:
df.withColumnRenamed("Name", "FullName").show()
3. Filtering Rows
5. Sorting Data
6. Joining DataFrames
Types of joins: inner, left, right, outer.
df1.join(df2, on="ID", how="inner").show()
Read CSV:
df = spark.read.option("header", True).csv("data.csv")
Write CSV:
df.write.option("header", True).csv("output_folder")
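A consolidated sketch of the operations listed above (filtering, sorting, joining, and saving); the data and the output folder are made up:
from pyspark.sql.functions import col

df1 = spark.createDataFrame([(1, "John", 25), (2, "Doe", 28)], ["ID", "Name", "Age"])
df2 = spark.createDataFrame([(1, "India"), (2, "USA")], ["ID", "Country"])

df1.filter(col("Age") > 26).show()            # 3. Filtering rows
df1.orderBy(col("Age").desc()).show()         # 5. Sorting data
joined = df1.join(df2, on="ID", how="inner")  # 6. Joining DataFrames
joined.show()

# Save the joined result as CSV (output folder is illustrative)
joined.write.option("header", True).mode("overwrite").csv("joined_output")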
Dataset Operations:
What is a Dataset?
A Dataset is a strongly-typed collection of objects that provides better optimization and type
safety compared to DataFrames. It is mainly used in Scala and Java, as Python’s pyspark library
does not support typed Datasets.
// Creating a Dataset from a strongly-typed Scala collection
case class Person(name: String, age: Int)
import spark.implicits._
val data = Seq(Person("Alice", 25), Person("Bob", 28))
val ds = data.toDS()
ds.show()
Generic functions abstract away the complexity of dealing with different data sources by
providing a unified interface for reading and writing data.
A load_data function reads data from different sources based on parameters like file type or
data source type.
Example in Python
import pandas as pd
import json
import sqlite3
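# A minimal sketch of the load_data function described above, mirroring the
# save_data example below (the SQLite table name "my_table" is an assumption):
def load_data(source, source_type="csv"):
    if source_type == "csv":
        return pd.read_csv(source)
    elif source_type == "json":
        with open(source) as f:
            return json.load(f)
    elif source_type == "excel":
        return pd.read_excel(source)
    elif source_type == "sqlite":
        conn = sqlite3.connect(source)
        return pd.read_sql("SELECT * FROM my_table", conn)
    else:
        raise ValueError("Unsupported source type!")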
# Example usage:
# df = load_data("data.csv", source_type="csv")
# data_dict = load_data("data.json", source_type="json")
Similarly, a save_data function writes data to different destinations based on the destination type.
Example in Python
def save_data(data, destination, dest_type="csv"):
if dest_type == "csv":
data.to_csv(destination, index=False)
elif dest_type == "json":
with open(destination, 'w') as f:
json.dump(data, f, indent=4)
elif dest_type == "excel":
data.to_excel(destination, index=False)
elif dest_type == "sqlite":
conn = sqlite3.connect(destination)
data.to_sql("my_table", conn, if_exists="replace", index=False)
else:
raise ValueError("Unsupported destination type!")
# Example usage:
# save_data(df, "output.csv", dest_type="csv")
# save_data(data_dict, "output.json", dest_type="json")
name := "SparkSQLApp"
version := "1.0"
scalaVersion := "2.12.18"
enablePlugins(AssemblyPlugin)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", _ @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
2. project/build.properties:
sbt.version=1.8.0
3. project/plugins.sbt (Enable sbt-assembly Plugin): Create a file inside the project/ directory
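A minimal sketch of what this file contains; the sbt-assembly version shown is illustrative, so use the release that matches your sbt version:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")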
4. SparkSQLApp.scala (Application Code):
import org.apache.spark.sql.SparkSession

object SparkSQLApp {
def main(args: Array[String]): Unit = {
// Initialize SparkSession
val spark = SparkSession.builder()
.appName("Spark SQL Application")
.master("local[*]") // Runs locally
.getOrCreate()
import spark.implicits._
// Sample DataFrame
val data = Seq(
("Alice", 25),
("Bob", 28),
("Charlie", 30)
).toDF("name", "age")
// Register the DataFrame as a temporary view and query it with Spark SQL (illustrative query)
data.createOrReplaceTempView("people")
val result = spark.sql("SELECT name, age FROM people WHERE age > 26")
// Show Results
result.show()
// Stop SparkSession
spark.stop()
}
}
Use Case: Real-Time User Activity Analytics with Kafka and Spark Streaming
Objective: Analyze user actions (like clicks, searches, purchases) in real time to:
Recommend products.
Detect trending items.
Monitor user engagement.
Optimize campaigns instantly.
Data like page views, searches, add-to-cart events, and purchases are continuously
generated by users.
These events are sent to Apache Kafka as messages.
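A hedged PySpark Structured Streaming sketch of this ingestion step; the Kafka broker address and the topic name user_events are assumptions for illustration, and the Spark-Kafka connector package is assumed to be available on the classpath:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ClickstreamIngest").getOrCreate()

# Read the user-activity events from Kafka as a streaming DataFrame
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "user_events")
          .load())

# Kafka delivers key/value as binary; cast the value (the event payload) to a string
payloads = events.select(col("value").cast("string").alias("event_json"))

# Write the running stream to the console for inspection
query = payloads.writeStream.outputMode("append").format("console").start()
query.awaitTermination()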