
UNIT V: Introduction to SPARK

What is Apache Spark?


 Apache Spark is an open-source, framework-based engine whose architecture processes large amounts of unstructured, semi-structured, and structured data for analytics.
 Spark's architecture is regarded as an alternative to the Hadoop and MapReduce architectures for big data processing.
 The RDD and the DAG, Spark's data storage and processing abstractions, are used to store and process data, respectively.
 The Spark architecture consists of four components: the Spark driver, executors, cluster managers, and worker nodes.
 It uses Datasets and DataFrames as the fundamental data abstractions to optimize the Spark process and big data computation.
Apache Spark Features:
Apache Spark is a popular open-source cluster computing framework created to accelerate data processing applications. It enables applications to run faster by utilizing in-memory cluster computing; a cluster is a collection of nodes that communicate with each other and share data. Because of implicit data parallelism and fault tolerance, Spark can be applied to a wide range of sequential and interactive processing demands.

 Speed: Spark performs up to 100 times faster than MapReduce for processing large amounts of data. It can also divide the data into chunks (partitions) in a controlled way.
 Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
 Deployment: It can be deployed through Hadoop YARN, Apache Mesos, or Spark's own standalone cluster manager.
 Real-Time: Because of its in-memory processing, it offers real-time computation and low latency.

Why Apache Spark?

Apache Spark is needed because it solves several problems that arise when dealing with big
data processing and real-time data analytics. Here’s why Spark is valuable:

1. Speed and Performance

 In-Memory Computation: Spark stores intermediate data in memory (RAM), reducing time spent on disk I/O compared to Hadoop MapReduce.
 Faster Execution: Operations like map, reduce, and filter are optimized using Directed Acyclic Graph (DAG) execution, making Spark up to 100x faster in memory.
2. Unified Data Processing Engine

 One Tool, Many Jobs: Spark supports batch processing, streaming data, SQL queries, machine learning, and graph analytics, all within the same engine.
 Reduces Tool Overhead: Instead of using separate tools for each task (like Hive for SQL,
Storm for streaming), Spark provides an all-in-one solution.

3. Real-time Stream Processing

 Spark Streaming: Allows real-time data processing for live data feeds (e.g., from IoT
devices, social media, or logs).
 Low Latency: Processes data in near real-time using micro-batches, useful for
monitoring systems, fraud detection, etc.

4. Easy to Use

 Developer-Friendly APIs: Available in Python, Java, Scala, and R with concise syntax and
rich libraries.
 Simple Coding Model: You can perform complex operations with simple code using functions like map(), reduceByKey(), and groupBy(), as sketched below.
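
For instance, a minimal word count in PySpark (a hedged sketch, assuming a SparkContext sc is available, e.g. inside the pyspark shell):

lines = sc.parallelize(["spark is fast", "spark is easy"])
words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
pairs = words.map(lambda word: (word, 1))             # pair each word with a count of 1
counts = pairs.reduceByKey(lambda a, b: a + b)        # sum the counts per word
print(counts.collect())   # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('easy', 1)]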

5. Fault Tolerant

 Automatic Recovery: Spark can recover lost data and tasks using RDD lineage, without
needing data replication like Hadoop.
 Reliable for Big Data: Great for distributed environments where node failures are
common.

6. Scalability

 Horizontal Scaling: Easily scales from a single machine to thousands of nodes.


 Cluster Support: Works with Hadoop YARN, or its own standalone cluster manager.

7. Rich Ecosystem and Integration

 Compatible with Big Data Tools: Integrates with HDFS, Hive, HBase, Cassandra, Kafka,
and more.
 Ecosystem Modules:
o Spark SQL for querying structured data
o MLlib for machine learning
o GraphX for graph processing
o Spark Streaming for real-time analytics

8. Cost-Effective
 Better Resource Utilization: Faster processing means less compute time, saving on cloud
or infrastructure costs.
 Open Source: Free to use and actively supported by a large community.

Spark vs Hadoop MapReduce


Feature-wise comparison (Apache Spark vs Hadoop MapReduce):

1. Speed: Spark is much faster due to in-memory processing (up to 100x faster than MapReduce); MapReduce is slower because it relies on disk-based processing for each stage.
2. Data Processing: Spark processes data in memory (RAM) for most operations; MapReduce writes intermediate results to disk, causing slower performance.
3. Ease of Use: Spark has simple, high-level APIs in Scala, Java, Python, and R, and supports SQL, ML, streaming, and graphs; MapReduce requires complex code using low-level Java APIs and lacks native support for streaming and ML.
4. Data Flow: Spark uses a DAG (Directed Acyclic Graph) for better optimization and fault recovery; MapReduce uses step-by-step job chaining, which increases overhead.
5. Machine Learning: Spark has built-in MLlib for scalable machine learning; MapReduce has no native ML support and must use separate tools like Mahout.
6. Real-Time Processing: Spark supports real-time data processing using Spark Streaming or Structured Streaming; MapReduce supports only batch processing, with no real-time capability.
7. Fault Tolerance: Spark achieves fault tolerance through RDD lineage (lost data can be recomputed); MapReduce uses data replication in HDFS for fault tolerance.
8. Ecosystem: Spark is a unified platform for batch, streaming, SQL, ML, and graph processing; MapReduce requires separate components (Pig, Hive, Mahout, etc.) for various tasks.
9. Resource Utilization: Spark uses resources more efficiently due to in-memory computing; MapReduce has higher resource consumption due to frequent disk I/O.
10. Cluster Managers: Spark works with YARN, Mesos, and its own standalone cluster manager; MapReduce typically runs on YARN only.

Apache Spark Architecture:


Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.

The Spark architecture depends upon two abstractions:


o Resilient Distributed Dataset (RDD)
o Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)


The Resilient Distributed Datasets are the group of data items that can be stored in-memory on
worker nodes. Here,

o Resilient: Restore the data on failure.


o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.
The RDD is a key data structure for computation. It acts as an interface for immutable data and makes it possible to recompute data in the event of a failure. There are two kinds of operations for working with RDDs: transformations and actions.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that describes a sequence of computations on data. Each node is an RDD partition, and each edge is a transformation on top of the data. The graph captures the sequence of operations, while "directed" and "acyclic" describe how it flows: edges point in one direction and never form a cycle.

The Spark driver:


 When the Driver Program in the Apache Spark architecture executes, it calls the real program of
an application and creates a SparkContext. SparkContext contains all of the basic functions. The
Spark Driver includes several other components, including a DAG Scheduler, Task Scheduler,
Backend Scheduler, and Block Manager, all of which are responsible for translating user-written
code into jobs that are actually executed on the cluster.
 In the diagram, the driver program calls the main method of the application and creates a Spark context (which acts as a gateway). The Spark context connects to the Spark cluster and jointly monitors the jobs running in it; everything is executed through the Spark context.
 Each Spark session has an entry point in the Spark context. Spark drivers include further components to execute jobs in the cluster, and they work together with cluster managers.
 The Spark context acquires worker nodes to execute tasks and store data, since Spark clusters can connect to different types of cluster managers. When a job is executed in the cluster, it is divided into stages, and each stage is further divided into scheduled tasks.
Cluster Manager
 The Cluster Manager manages the execution of various jobs in the cluster. Spark Driver
works in conjunction with the Cluster Manager to control the execution of various other
jobs.
 The cluster Manager does the task of allocating resources for the job. Once the job has
been broken down into smaller jobs, which are then distributed to worker nodes,
SparkDriver will control the execution.
Worker Nodes:

 The worker node is a slave node. Its role is to run the application code in the cluster.
 Many worker nodes can be used to process an RDD created in the SparkContext, and
the results can also be cached.
 The slave nodes function as executors, processing tasks, and returning the results back
to the spark context. The master node issues tasks to the Spark context and the worker
nodes execute them.
 They make processing simpler by scaling the worker nodes (1 to n) to handle as many jobs as possible in parallel, dividing the job into sub-jobs on multiple machines.
 A Spark worker monitors its node to ensure that the computation proceeds smoothly; each worker node handles the Spark tasks assigned to it.
The Spark executors
 An executor is responsible for executing tasks and for caching data. Executors register with the driver program at startup.
 Each executor has a number of task slots for running parts of the application concurrently. An executor runs a task once the required data is loaded, and executors can be removed when they become idle.
 Executors run as Java processes. They are allocated dynamically, and are added and removed over the course of the application's execution.
 The driver program monitors the executors while they run; users' tasks are executed inside these Java processes.
Task:
A task is a unit of work that is sent to one executor. The Spark context, working with the cluster manager, schedules these tasks onto the worker nodes.

Spark Components:
The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.

The below diagram shows each Spark component in detail.

Spark Core

o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage
systems and memory management.

Spark SQL

o The Spark SQL is built on the top of Spark Core. It provides support for structured data.
o It allows to query the data via SQL (Structured Query Language) as well as the Apache
Hive variant of SQL, called the HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects
and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.

Spark Streaming

o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to
analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example of a data
stream.
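
As a minimal sketch of this mini-batch model (assuming a SparkContext sc and a process writing text lines to localhost:9999, both of which are illustrative):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                    # 10-second mini-batches
lines = ssc.socketTextStream("localhost", 9999)   # stream of text lines from a TCP source
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # RDD transformations per mini-batch
counts.pprint()                                   # print each batch's word counts
ssc.start()
ssc.awaitTermination()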

MLlib

o The MLlib is a Machine Learning library that contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
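
For example, a small sketch using MLlib's DataFrame-based API to compute a correlation matrix (it assumes an active SparkSession named spark; the numbers are illustrative):

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

data = [(Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([2.0, 4.1]),),
        (Vectors.dense([3.0, 6.2]),)]
df = spark.createDataFrame(data, ["features"])
pearson = Correlation.corr(df, "features").head()[0]  # Pearson correlation matrix
print(pearson)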

GraphX

o The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.

Spark Shell:
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data
interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to
use existing Java libraries) or Python. Start it by running the following in the Spark directory:
For Python :

./bin/pyspark

For Scala:

./bin/spark-shell

- When you run spark-shell:

1. Spark initializes a local Spark application.


2. It creates a SparkContext (object sc) and a SparkSession (object spark) automatically.
3. You can start typing Spark/Scala commands directly.

- It is built on Scala (default) but also supports PySpark (for Python users).
- Internally connects to a SparkSession (spark) and SparkContext (sc) for executing operations.

- You can check the version using the following command: sc.version
- To stop the Spark session in the Spark Shell, you can use either of these commands: spark.stop() or sc.stop()
- The :quit command is specific to the Scala REPL (Read-Eval-Print Loop), which includes the Spark Shell (since it is built on Scala).
Example Commands in Spark Shell:
val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
val doubled = rdd.map(_ * 2)
doubled.collect() // Output: Array(2, 4, 6, 8, 10)

Key Features of Spark Shell:


1. Interactive Data Processing

 You can write and execute code line-by-line.


 Great for testing Spark code and experimenting quickly.

2. Pre-initialized Context

 Automatically provides:
o sc → SparkContext
o spark → SparkSession
 No need to manually initialize them.

3. Supports In-Memory Computation

 Leverages Spark’s in-memory data processing, making operations super fast compared
to traditional MapReduce.

4. Built-in Libraries Support

 You can use Spark libraries like:


o Spark SQL
o Spark Streaming
o MLlib (Machine Learning)
o GraphX (Graph processing)
5. Great for Prototyping & Debugging

 Ideal for writing small pieces of code to test logic before integrating into full
applications.

6. Supports Data Exploration

 Read, transform, and visualize sample data quickly.


 Helpful in data cleaning and ETL pipeline development.

7. Multi-language Support

 Spark Shell: Scala


 PySpark Shell: Python
 SparkR Shell: R

Benefits of Using Spark Shell:

 Instant feedback for code.


 Great for prototyping and debugging.
 No need to write a full program or compile code.
 Built-in Spark context ready to use.

Spark Core: RDD

 Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
 Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD is
a fault-tolerant collection of elements that can be operated on in parallel.

The features of RDDs (decomposing the name):

 Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute
missing or damaged partitions due to node failures.
 Distributed with data residing on multiple nodes in a cluster.
 Dataset: a collection of partitioned data with primitive or composite values, e.g. tuples or other objects (Python, Scala, or Java).
Besides the above traits (which are directly embedded in the name of the data abstraction, RDD), it has the following additional traits:

 In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as
possible.
 Immutable or Read-Only, i.e. it does not change once created and can only be
transformed using transformations to new RDDs.

 Lazy evaluated, i.e. transformations on an RDD are not performed immediately. Spark internally records metadata to track the operations; lazy evaluation reduces the number of passes over the data by grouping operations together.
 Cacheable, i.e. you can hold all the data in a persistent “storage” like memory (default
and the most preferred) or disk (the least preferred due to access speed).
 Parallel, i.e. process data in parallel.
 Typed — RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int,
String)].
 Partitioned — records are partitioned (split into logical partitions) and distributed across
nodes in a cluster.
 Location-Stickiness — RDD can define placement preferences to compute partitions (as
close to the records as possible).

RDD Operations:

There are two types of operations we can perform on RDDs: transformations and actions.
A transformation is a function that produces a new RDD from existing RDDs. An action is performed when we want to work with the actual dataset; unlike a transformation, an action does not produce a new RDD but returns a result to the driver.
RDD Transformation

 A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Every transformation creates a new RDD; the input RDDs cannot be changed, since RDDs are immutable in nature.
 Applying transformations builds an RDD lineage, containing all of the parent RDDs of the final RDD(s). The RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan, i.e. a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
 Transformations are lazy in nature, i.e. they execute only when we call an action; they are not executed immediately.

There are two kinds of transformations: narrow transformations and wide transformations (a short sketch follows the definitions).

 Narrow transformation – In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD; only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
One Parent ==> One Child

 Wide transformation – In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey() and reduceByKey(). They are also known as shuffle transformations because a data shuffle is required to process the data.

One Parent ==> Multiple Child
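
A small PySpark sketch contrasting the two (assuming a SparkContext sc; the data is illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

# Narrow transformations: each output partition depends on a single parent partition
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))
filtered = doubled.filter(lambda kv: kv[1] > 2)

# Wide (shuffle) transformation: records with the same key may come from many parent partitions
summed = filtered.reduceByKey(lambda a, b: a + b)

print(summed.collect())   # e.g. [('a', 6), ('b', 4), ('c', 8)]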


RDD Action

 An action is an operation that triggers execution of RDD transformations and returns a value (to the Spark driver, i.e. the user program).
 Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. Unlike a transformation, an action does not produce a new RDD.
 Thus, actions are Spark RDD operations that give non-RDD values. The values of an action are returned to the driver or stored in an external storage system; actions set the laziness of RDDs into motion.
 An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks. Some Spark actions are listed below (a PySpark sketch follows this list):
 The action count() returns the number of elements in the RDD. For example, if an RDD has the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() gives the result 8.
 The action countByValue() returns how many times each element occurs in the RDD. For the same RDD, rdd.countByValue() gives the result {(1,1), (2,2), (3,1), (4,1), (5,2), (6,1)}.
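
The same examples in PySpark (assuming a SparkContext sc, e.g. in the pyspark shell):

rdd = sc.parallelize([1, 2, 2, 3, 4, 5, 5, 6])
print(rdd.count())         # 8
print(rdd.countByValue())  # e.g. defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 1, 4: 1, 5: 2, 6: 1})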

Ways to Create an RDD:


1. Using Parallelized Collection: Parallelized collections are created by calling SparkContext's
sc.parallelize(coll, slices)
sc.makeRDD(coll, slices)
sc.range(start, end, step, slices)

 The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

 To use this method, the entire dataset must first exist on one machine; because of this, the approach is rarely used outside of testing and prototyping (a short sketch follows).
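
A minimal PySpark sketch (assuming a SparkContext sc; the values and partition counts are illustrative):

data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data, 2)            # copy the local list into an RDD with 2 partitions
rdd2 = sc.range(0, 10, 1, 4)              # RDD of 0..9 spread across 4 partitions
print(rdd1.getNumPartitions())            # 2
print(rdd1.reduce(lambda a, b: a + b))    # 15, computed in parallel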

2. From external datasets:

 Spark RDDs can be created from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
 Text file RDDs can be created using the sc.textFile(name, partitions) method. It takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.

 An RDD of (file, content) pairs from a directory can be created using sc.wholeTextFiles(name, partitions). This method lets you read a directory containing multiple small text files and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. A short sketch of both follows.
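
A short PySpark sketch of both methods (the paths are placeholders):

lines = sc.textFile("hdfs://namenode:9000/logs/app.log", 8)  # one record per line, at least 8 partitions
print(lines.count())

files = sc.wholeTextFiles("file:///my/directory")            # (filename, content) pairs
print(files.keys().take(3))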

Some notes on reading files with Spark:

 If using a path on the local filesystem, the file must also be accessible at the same path on
worker nodes. Either copy the file to all workers or use a network-mounted shared file
system.

 All of Spark’s file-based input methods, including textFile, support running on directories,
compressed files, and wildcards as well. For example, you can use textFile("/my/directory")

 The textFile method also takes an optional second argument for controlling the number of
partitions of the file. By default, Spark creates one partition for each block of the file (blocks
being 128MB by default in HDFS), but you can also ask for a higher number of partitions by
passing a larger value. Note that you cannot have fewer partitions than blocks.

3. From existing apache spark RDDs.

 We can create new RDDs from existing RDDs. This process of creating another dataset from an existing one is a transformation; as a result, a transformation always produces a new RDD. Because RDDs are immutable, no changes take place in an RDD once it is created, and this property maintains consistency over the cluster.

 Some of the operations performed on RDDs are map, filter, count, distinct, flatMap, etc.

What is Spark SQL?


 Spark SQL brings native support for SQL to Spark and streamlines the process of
querying data stored both in RDDs (Spark’s distributed datasets) and in external sources.
 Spark SQL is a module in Apache Spark that allows you to process structured data using
SQL queries, the DataFrame API, or Dataset API.
 It's like using a database engine (like MySQL), but it runs on top of Apache Spark, which
is designed for big data processing on clusters.

[Diagram] SQL queries, the DataFrame API, and Datasets all feed into the Catalyst Optimizer (the Spark SQL engine), which compiles them down to RDDs and Java bytecode. The Spark SQL engine works in four phases: the Analysis phase, the Logical Optimization phase, the Physical Planning phase, and the Code Generation & Execution phase.

1. Input Interfaces :

These are the various APIs and interfaces through which you can interact with Spark SQL:

 SQL: You can write queries using standard SQL syntax, which Spark can interpret.
 DataFrame API: A higher-level abstraction over RDDs, DataFrames allow you to
manipulate distributed collections of data using domain-specific language (DSL)
functions.
 DataSet: Combines the best of RDDs (strong typing, functional programming) and
DataFrames (optimized execution). Mainly used with strongly typed languages like Scala.

All three interfaces eventually send the user query to the Catalyst Optimizer.

2. Catalyst Optimizer (Logical SQL Engine)

This is the brain of Spark SQL. It goes through 4 major phases:


4 Phases of Spark SQL Engine (inside Catalyst)

i. Analysis Phase

 Parses the query into an unresolved logical plan.


 Resolves references to tables, columns, and functions using the catalog.

ii. Logical Optimization Phase

 Applies rule-based optimizations:


o Predicate pushdown
o Constant folding
o Projection pruning
 Produces an optimized logical plan.

iii. Physical Planning Phase

 Translates the optimized logical plan into one or more physical plans.
 Uses cost-based optimizations to select the best physical plan.

iv. Code Generation & Execution

 Uses the Tungsten Engine to generate Java bytecode.


 Converts physical plan → RDDs + compiled bytecode.
 Executes across Spark cluster.

3. RDD (Resilient Distributed Dataset)

 RDDs are the lowest level abstraction in Spark.


 Spark uses them to execute transformations and actions behind the scenes.
 They are generated from the physical plan.

4. Java Bytecode

 Spark generates Java bytecode from the physical plan.


 Bytecode is highly optimized for execution using whole-stage code generation.
 Used to run tasks efficiently across Spark executors.

The 4 main phases of the Spark SQL Engine represent how an SQL query or DataFrame
operation is processed internally in Spark. These phases transform the query from a high-level
statement into low-level RDD operations executed on a cluster.
The 4-phase model is a simplified way to explain Spark SQL's pipeline.
1. Analysis Phase

This is where the Spark SQL engine starts understanding your query.

Sub-Steps:

 Unresolved Logical Plan:


o Spark builds an initial plan from the SQL or DataFrame code.
o At this stage, column/table references are not yet validated.
o For example: SELECT name FROM employees → Spark doesn’t yet know if
employees exists.
 Analyzed Logical Plan:
o The Analyzer resolves table names, column references using catalog metadata.
o Ensures everything exists and matches schema.
o Now Spark knows what employees.name means.

2. Logical Optimization Phase

Now Spark tries to improve performance by rewriting the plan.

Sub-Steps:
 withCachedData Logical Plan (Optional):
o If parts of the data are cached (df.cache()), Spark uses cached versions here.
 Optimized Logical Plan:
o Spark applies Catalyst optimizer rules:
 Predicate pushdown (move filters closer to data)
 Constant folding (compute constant expressions early)
 Projection pruning (select only needed columns)

3. Physical Planning Phase

Spark now thinks about how to run the query.

Sub-Steps:

 Physical Plan (Multiple Candidates):


o Spark generates several possible physical plans (like choosing join strategies).
o Examples: Shuffle Hash Join vs Broadcast Join.
 Executable Physical Plan:
o Spark chooses the best one using cost-based evaluation (CBO).
o The final plan is ready for execution.

4. Code Generation & Execution Phase

This is where Spark actually runs your query.

Sub-Step:

 RDD (Resilient Distributed Dataset):


o The executable plan is transformed into low-level RDD operations.
o Spark submits tasks to worker nodes across the cluster.
o Results are returned or written to storage.
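
To see these phases on a real query, you can ask Spark to print the plans it produces; a minimal PySpark sketch (the table and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExplainDemo").getOrCreate()
df = spark.createDataFrame([(1, "John", 25), (2, "Doe", 28), (3, "Alice", 23)], ["ID", "Name", "Age"])
df.createOrReplaceTempView("people")

# extended=True prints the Parsed, Analyzed, and Optimized Logical Plans plus the Physical Plan
spark.sql("SELECT Name FROM people WHERE Age > 26").explain(extended=True)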

Spark Session:
 The Spark context was the original entry point for Spark: since the RDD was the main API, RDDs were created and manipulated using the context's APIs. For every other API we needed to use a different context.
 For streaming we needed a StreamingContext. But as the Dataset and DataFrame APIs became the new standard APIs, an entry point built for them was needed. So in Spark 2.0 a new entry point was introduced for the Dataset and DataFrame APIs, called SparkSession.
 It is a combination of SQLContext, HiveContext and, in the future, StreamingContext. All the APIs available on those contexts are available on SparkSession as well, and SparkSession contains a Spark context for the actual computation.

Creating a Spark Session

val spark = SparkSession.builder()
  .master("local")
  .appName("example of SparkSession")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

SparkSession.builder()

 This method is created for constructing a SparkSession.

master(“local”)

Sets the spark master URL to connect to such as :

 “local” to run locally

 “local[4]” to run locally with 4 cores

 “spark://master:7077” to run on a spark standalone cluster

appName( )

 Set a name for the application which will be shown in the spark Web UI. If no application
name is set, a randomly generated name will be used.

Config

 Config options set using this method are automatically propagated to both 'SparkConf' and the SparkSession's own configuration; its arguments consist of key-value pairs.

GetOrCreate

 Checks whether there is a valid thread-local SparkSession and, if so, returns it. It then checks whether there is a valid global default SparkSession and, if so, returns that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.

 If an existing SparkSession is returned, the config options specified in this builder will be applied to that existing SparkSession.
Creating DataFrames:

 In Spark, DataFrames are distributed collections of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are similar to traditional database tables, which are structured and concise; we can think of them as relational tables with better optimization techniques.
 Spark DataFrames can be created from various sources, such as Hive tables, log tables,
external databases, or the existing RDDs. DataFrames allow the processing of huge
amounts of data.
 A DataFrame is a distributed collection of data organized into named columns, similar to
a table in a relational database.

1. Creating a DataFrame from a List (Manually)

You can create a DataFrame using spark.createDataFrame() by providing a list of tuples and
defining the schema.

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark Session

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Define schema

schema = StructType([

StructField("ID", IntegerType(), True),

StructField("Name", StringType(), True),

StructField("Age", IntegerType(), True)

])
# Create DataFrame

data = [(1, "John", 25), (2, "Doe", 28), (3, "Alice", 23)]

df = spark.createDataFrame(data, schema)

df.show()

2. Creating a DataFrame from a CSV File

If you have a CSV file, you can directly load it into a DataFrame:

df = spark.read.option("header", True).csv("data.csv")

df.show()

3. Creating a DataFrame from a JSON File

Similarly, a JSON file can be read into a DataFrame:

df = spark.read.json("data.json")

df.show()

4. Creating a DataFrame from a Pandas DataFrame:

To create a DataFrame from a pandas DataFrame, you can use the createDataFrame method
provided by the SparkSession class and pass in the pandas DataFrame as the first argument.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create DataFrame").getOrCreate()

pandas_df = pd.DataFrame({"name": ["John", "Mary", "Mike"], "age": [25, 30, 35]})

df = spark.createDataFrame(pandas_df)

df.show()
5. Creating a DataFrame from a TXT file

To create a DataFrame from a text file, you can use the text() method provided by
the SparkSession class.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create DataFrame").getOrCreate()

df = spark.read.text("path/to/file.txt")

df.show()

DataFrame Operations:

PySpark DataFrames provide a variety of operations for data manipulation, filtering, aggregation, and transformation. These operations help in processing large-scale data efficiently. In PySpark, DataFrame operations are actions or transformations you perform on data to:

 View
 Filter
 Transform
 Analyze
 Save/load it

These operations help you work with large-scale datasets efficiently in a distributed
environment.

1. Viewing and Inspecting Data

df.show() : Displays the top 20 rows of the DataFrame.

df.printSchema():Shows the structure of the DataFrame (column names, data types, nullability).

df.columns : Returns a list of column names.

2. Selecting and Modifying Columns

Selecting specific columns:


df.select("Name", "Age").show()

Adding a new column:

from pyspark.sql.functions import lit

df.withColumn("Country", lit("India")).show()

Changing values of an existing column:

from pyspark.sql.functions import col

df.withColumn("Age", col("Age") + 1).show()

Renaming a column:

df.withColumnRenamed("Name", "FullName").show()

3. Filtering Rows

Using expressions: df.filter(df.Age > 25).show()

Using SQL-like syntax: df.filter("Age > 25").show()

With multiple conditions: df.filter((df.Age > 25) & (df.ID == 1)).show()

4. Grouping and Aggregation

 Group by and count= df.groupBy("Age").count().show()


 Aggregations (max, min, avg)= df.agg({"Age": "max"}).show()
 Group by with aggregation= df.groupBy("Name").agg({"Age": "avg"}).show()

5. Sorting Data

 Sort ascending= df.sort("Age").show()


 Sort descending= df.sort(df.Age.desc()).show()

6. Joining DataFrames

 Types of joins are inner, left, right, outer= df1.join(df2, on="ID", how="inner").show()

7. Handling Missing Data


 Dropping rows with nulls= df.dropna().show()
 Filling missing values= df.fillna({"Age": 25}).show()

8. Saving & Loading

Read CSV:
df = spark.read.option("header", True).csv("data.csv")

Write CSV:
df.write.option("header", True).csv("output_folder")

9. SQL with DataFrames

Create a temporary SQL table


df.createOrReplaceTempView("people")

Run SQL queries


spark.sql("SELECT * FROM people WHERE Age > 25").show()

Dataset Operations:

What is a Dataset?

A Dataset is a strongly-typed collection of objects that provides better optimization and type
safety compared to DataFrames. It is mainly used in Scala and Java, as Python’s pyspark library
does not support typed Datasets.

Creating a Dataset (in Scala):

case class Person(id: Int, name: String, age: Int)


val data = Seq(Person(1, "John", 25), Person(2, "Doe", 28), Person(3, "Alice", 23))

import spark.implicits._
val ds = data.toDS()

ds.show()

Transformations and Actions in DataFrame & Dataset:
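
As a minimal illustration (a PySpark sketch using the ID/Name/Age DataFrame df created earlier): transformations such as select() and filter() are lazy and only build up the execution plan, while actions such as count() and show() trigger the actual computation.

adults = df.select("Name", "Age").filter(df.Age > 25)  # transformations: lazy, nothing executes yet
print(adults.count())                                  # action: triggers execution, returns a number
adults.show()                                          # action: triggers execution, prints the rows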


Different Data Sources: Generic Load and Save Functions
When working with data in software applications, it’s common to interact with various data
sources. To ensure flexibility and maintainability, developers often implement generic load and
save functions, which allow reading and writing data from multiple sources without modifying
core logic.
1. Types of Data Sources

a) File-Based Data Sources

 Text Files (.txt, .csv, .json, .xml, .yaml)


o CSV (Comma-Separated Values) is commonly used for tabular data.
o JSON (JavaScript Object Notation) is widely used in web applications.
o XML (Extensible Markup Language) is useful for hierarchical data representation.
 Binary Files (.bin, .dat, .pickle)
o Used for efficient storage of structured data, like serialized objects.
 Excel Files (.xls, .xlsx)
o Used for structured data with additional formatting and calculations.

b) Database-Based Data Sources

 Relational Databases (SQL)


o Examples: MySQL, PostgreSQL, SQLite, Oracle, SQL Server.
o Stores structured data in tables with relationships.
 NoSQL Databases
o Examples: MongoDB, Firebase, Cassandra, Redis.
o Used for unstructured, semi-structured, or hierarchical data storage.

c) API-Based Data Sources


 REST APIs
o Data is typically exchanged in JSON or XML formats.
o Used for fetching data from external web services.
 GraphQL APIs
o Allows fetching specific parts of the data efficiently.
 SOAP APIs
o Uses XML-based structured messages.

d) Cloud Storage & Big Data Sources

 Cloud Storage Services


o Examples: AWS S3, Google Cloud Storage, Azure Blob Storage.
o Used for storing large files and datasets.
 Big Data Frameworks
o Examples: Hadoop, Apache Spark, Google BigQuery.
o Handles massive datasets in distributed environments.

2. Generic Load and Save Functions

Generic functions abstract away the complexity of dealing with different data sources by
providing a unified interface for reading and writing data.

a) Generic Load Function

A load_data function reads data from different sources based on parameters like file type or
data source type.

Example in Python
import pandas as pd
import json
import sqlite3

def load_data(source, source_type="csv"):
    if source_type == "csv":
        return pd.read_csv(source)
    elif source_type == "json":
        with open(source, 'r') as f:
            return json.load(f)
    elif source_type == "excel":
        return pd.read_excel(source)
    elif source_type == "sqlite":
        conn = sqlite3.connect(source)
        return pd.read_sql_query("SELECT * FROM my_table", conn)
    else:
        raise ValueError("Unsupported data source type!")

# Example usage:
# df = load_data("data.csv", source_type="csv")
# data_dict = load_data("data.json", source_type="json")

b) Generic Save Function

A save_data function allows storing data in different formats.

Example in Python
def save_data(data, destination, dest_type="csv"):
    if dest_type == "csv":
        data.to_csv(destination, index=False)
    elif dest_type == "json":
        with open(destination, 'w') as f:
            json.dump(data, f, indent=4)
    elif dest_type == "excel":
        data.to_excel(destination, index=False)
    elif dest_type == "sqlite":
        conn = sqlite3.connect(destination)
        data.to_sql("my_table", conn, if_exists="replace", index=False)
    else:
        raise ValueError("Unsupported destination type!")

# Example usage:
# save_data(df, "output.csv", dest_type="csv")
# save_data(data_dict, "output.json", dest_type="json")

3. Benefits of Using Generic Load and Save Functions

 Code Reusability: One function handles multiple data formats.


 Flexibility: Easily switch between different data sources.
 Scalability: Supports integration with databases, cloud storage, and APIs.
 Error Handling & Logging: Centralized error handling improves debugging.

Building Spark SQL Application with SBT (Simple Build Tool):


Building a Spark SQL application with SBT (Simple Build Tool) involves setting up a project with
the necessary dependencies, writing the Spark SQL code, and compiling and running it using
SBT. Below is a step-by-step guide to achieve this.

1. build.sbt (SBT Configuration): Create a file build.sbt in the project root

name := "SparkSQLApp"
version := "1.0"
scalaVersion := "2.12.18"

val sparkVersion = "3.5.0"

libraryDependencies ++= Seq(


"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion
)

enablePlugins(AssemblyPlugin)

assemblyMergeStrategy in assembly := {
case PathList("META-INF", _ @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}

2. project/build.properties: Create a file inside the project/ directory

sbt.version=1.8.0

3. project/plugins.sbt (Enable sbt-assembly Plugin): Create a file inside the project/ directory

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

4. SparkSQLApp.scala (Application Code): Create a Scala file inside src/main/scala/

import org.apache.spark.sql.{SparkSession, DataFrame}

object SparkSQLApp {
def main(args: Array[String]): Unit = {
// Initialize SparkSession
val spark = SparkSession.builder()
.appName("Spark SQL Application")
.master("local[*]") // Runs locally
.getOrCreate()
import spark.implicits._

// Sample DataFrame
val data = Seq(
("Alice", 25),
("Bob", 28),
("Charlie", 30)
).toDF("name", "age")

// Register DataFrame as SQL Table


data.createOrReplaceTempView("people")

// Execute SQL Query


val result: DataFrame = spark.sql("SELECT * FROM people WHERE age > 26")

// Show Results
result.show()

// Stop SparkSession
spark.stop()
}
}

5. Building and Running the Application:


Step 1: Compile the Project - sbt compile
Step 2: Create a Fat JAR - sbt assembly
Step 3: Run the Application
spark-submit --class SparkSQLApp target/scala-2.12/SparkSQLApp-assembly-1.0.jar

Spark Real-Time Use Case:

Data Analytics Project Architecture:

Use Case: Real-Time User Behavior Analytics on an E-Commerce Website

Objective: Analyze user actions (like clicks, searches, purchases) in real time to:

 Recommend products.
 Detect trending items.
 Monitor user engagement.
 Optimize campaigns instantly.

Project Architecture: Real-Time Analytics with Apache Spark

Key Components Explained:

1. Data Source (Clickstream from App/Web)

 Data like page views, searches, add-to-cart events, and purchases are continuously
generated by users.
 These events are sent to Apache Kafka as messages.

2. Apache Kafka (Streaming Source)

 A message broker that receives and holds user events in topics.


 Spark consumes this data in near real-time.

3. Apache Spark Structured Streaming (Core Processing)

 Reads data from Kafka topics.


 Cleans and transforms the data (e.g., parse JSON, filter nulls).
 Applies window-based aggregations (e.g., “top 5 viewed products every 10 minutes”).
 Integrates with ML models to recommend items or score engagement levels.
 Sends output to databases, dashboards, or alert systems (a minimal code sketch follows this list).
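
A minimal Structured Streaming sketch of this pipeline (the broker address, topic name, and event schema are assumptions, and the spark-sql-kafka connector package is assumed to be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ClickstreamAnalytics").getOrCreate()

# Assumed schema of the JSON clickstream events
schema = StructType([
    StructField("userId", StringType(), True),
    StructField("productId", StringType(), True),
    StructField("action", StringType(), True),
    StructField("eventTime", TimestampType(), True)
])

# 1. Read the Kafka topic as a stream (broker and topic are placeholders)
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# 2. Window-based aggregation: product views per 10-minute window
views = (events.filter(col("action") == "view")
         .withWatermark("eventTime", "15 minutes")
         .groupBy(window(col("eventTime"), "10 minutes"), col("productId"))
         .count())

# 3. Write the results (console sink here; a real pipeline would target Cassandra, HDFS, a dashboard, etc.)
query = views.writeStream.outputMode("update").format("console").start()
query.awaitTermination()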

4. Machine Learning Integration

 Pre-trained models (e.g., collaborative filtering, clustering, anomaly detection) are loaded into Spark.
 Spark evaluates user behavior in real-time to:
o Recommend products.
o Predict user churn.
o Detect anomalies.
5. Data Storage (NoSQL or HDFS)

 Processed and enriched data is stored in:


o HDFS for batch analytics.
o Cassandra / MongoDB for fast querying.
o Delta Lake for versioned big data analytics.

6. Alerting & Dashboard

 Fraud alerts or engagement notifications pushed via email, SMS, or Slack.


 Grafana / Kibana dashboards show:
o Live user count.
o Heatmaps of popular items.
o Real-time funnel drop-off.

Real Business Value:

 Real-time recommendations improve conversions.


 Live trend analysis helps marketing teams act quickly.
 Early fraud or bot detection protects business.
 Operational dashboards help decision makers instantly.

Real-Time Use Cases of Apache Spark:

1. Fraud Detection in Banking & Payments

 Detect abnormal spending patterns instantly.


 Flag suspicious transactions in milliseconds.
 Prevent credit card abuse or account takeover in real-time.

2. Live Analytics for E-Commerce / Websites

 Track page views, clicks, and user funnels in real-time.


 Update top-selling products or trending searches.
 Deliver product recommendations on the fly.

3. Security Log Monitoring / Threat Detection

 Ingest logs from firewalls, VPNs, and apps.


 Detect brute force attacks or malware traffic instantly.
 Trigger real-time alerts to security teams.

4. Social Media Sentiment Analysis

 Stream tweets or Facebook comments.


 Analyze real-time sentiment trends for brands or topics.
 Use NLP to tag and categorize in-flight messages.

5. Real-Time Ad Targeting / Campaign Optimization

 Monitor user interactions and optimize ads dynamically.


 Deliver personalized content instantly based on behavior.
 Reduce ad spend by stopping non-performing campaigns early.

6. IoT Sensor Stream Analytics

 Monitor real-time temperature, pressure, or location data.


 Predict machinery failures in factories (predictive maintenance).
 Track moving vehicles for logistics or smart city traffic.

7. Stock Market & Crypto Trading Analytics


 Monitor live stock/crypto prices.
 Detect arbitrage opportunities.
 Automate buy/sell triggers based on real-time insights.

8. Gaming & Streaming Analytics

 Track player behavior live (kills, achievements, session length).


 Recommend in-game purchases or power-ups in real-time.
 Prevent cheating or unusual play patterns.
