Spark Programming Basics
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (Easier to use):
Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
Batch, Streaming, Machine Learning, Graph
• Low Latency (Performance): In-memory processing and
caching
• Fault-tolerance: Faults shouldn't be a special case
Spark Core Concepts
• Job: A piece of code that reads some input from HDFS or local storage, performs some
computation on the data, and writes some output data.
• Stages: Jobs are divided into stages. Stages are classified as Map or Reduce
stages (this is easier to understand if you have worked on Hadoop and want to
correlate). Stages are divided based on computational boundaries; all computations
(operators) cannot be performed in a single stage, so a job is carried out over many stages.
• Tasks: Each stage has some tasks, one task per partition. One task is executed on
one partition of data on one executor (machine).
• DAG: DAG stands for Directed Acyclic Graph, in the present context its a DAG of
operators.
• Executor: The process responsible for executing a task.
• Master: The machine on which the Driver program runs
• Slave: The machine on which the Executor program runs
Spark Components
1. Spark Driver
• separate process to execute user applications
• creates SparkContext to schedule jobs execution and negotiate with cluster manager
2. Executors
• run tasks scheduled by driver
• store computation results in memory, on disk or off-heap
• interact with storage systems
3. Cluster Manager
• Mesos
• YARN
• Spark Standalone
Spark Driver contains more components responsible for the translation of user code
into actual jobs executed on the cluster
Spark Driver Components
• SparkContext
– represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and
broadcast variables on that cluster
• DAGScheduler
– computes a DAG of stages for each job and submits them to the TaskScheduler; determines preferred
locations for tasks (based on cache status or shuffle file locations) and finds a minimum schedule to run
the jobs
• TaskScheduler
– responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
– backend interface for scheduling systems that allows plugging in different implementations (Mesos,
YARN, Standalone, local)
• BlockManager
– provides interfaces for putting and retrieving blocks both locally and remotely into various stores
(memory, disk, and off-heap)
How Spark Works
Spark Basics
There are two ways to manipulate data in Spark
• Spark Shell:
Interactive – for learning or data exploration
Python or Scala
• Spark Applications
For large scale data processing
Python, Scala, or Java
Spark Shell
The Spark Shell provides interactive data exploration
(REPL)
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions
Spark: Fundamentals
• Spark Context
• Resilient Distributed Datasets
(RDDs)
• Transformations
• Actions
Spark Context (1)
• Every Spark application requires a Spark Context: the main entry
point to the Spark API
• Spark Shell provides a preconfigured Spark Context called “sc”
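A minimal check in the PySpark shell using the preconfigured sc (no setup needed; the commented outputs are illustrative):

sc.version                               # the running Spark version
sc.parallelize(range(10)).count()        # 10 — a tiny job submitted through sc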
Spark Context (2)
• In standalone applications, the driver code creates its own Spark Context
• Spark Context represents the connection to a Spark cluster
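A minimal sketch of creating a Spark Context in a standalone PySpark application (the application name and the local master URL are illustrative choices):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")   # illustrative settings
sc = SparkContext(conf=conf)                                   # the connection to the cluster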
Spark Context (3)
• Spark Context works as a client and represents the connection to a Spark
cluster
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: An
Immutable collection of objects (or records, or elements) that can be operated on
“in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information, so it can be recreated from its parent
RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions –
more parallelism)
Dataset -- initial data can come from a file or be created
RDDs
Key Idea: Write applications in terms of transformations
on distributed datasets
• Collections of objects spread across a cluster and stored in a
distributed, fault-tolerant in-memory cache
• Can fall back to disk when dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by,
join, etc)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
RDDs -- Immutability
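A minimal PySpark sketch of immutability: transformations never modify an RDD, they return a new one (the input list and the commented collect() results are illustrative):

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)   # creates a new RDD; nums itself is unchanged
nums.collect()                        # [1, 2, 3, 4]
doubled.collect()                     # [2, 4, 6, 8]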
Resilient Distributed Datasets (RDDs)
Parallelized collection (parallelizing)
When first learning Spark, RDDs are generally created from a parallelized
collection, i.e. by taking an existing collection in the program
and passing it to SparkContext's parallelize() method.
This method is handy in the initial stage of learning Spark since it quickly
creates our own RDDs in the Spark shell so we can perform operations on them, as shown below.
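A minimal parallelize() sketch (the collection and the number of partitions are illustrative):

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 2)   # distribute the local list across 2 partitions
rdd.take(3)                     # [1, 2, 3]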
By Creating a DataFrame
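A minimal sketch of creating a DataFrame directly from local data, assuming a SparkSession named spark is available (the rows and column names are illustrative):

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"])
df.show()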
External Datasets (Referencing a dataset)
In Spark, the distributed dataset can be formed from any data source supported by
Hadoop, including the local file system, HDFS, Cassandra, HBase, etc.
To create a text file RDD, we can use SparkContext's textFile method. It takes the URL
of the file and reads it as a collection of lines, as in the sketch below.
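A minimal textFile() sketch (the HDFS URL is a placeholder):

lines = sc.textFile("hdfs://namenode:9000/path/to/file.txt")
lines.first()    # the first line of the file
lines.count()    # the number of lines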
External Datasets (from a file)
## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

## read a CSV file into a DataFrame
df = spark.read.format('csv') \
    .options(header='true', inferschema='true') \
    .load("Advertising.csv")
df.show(5)
df.printSchema()
External Datasets (from a database)
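A minimal sketch of reading from a relational database through the JDBC data source, assuming a SparkSession named spark and a suitable JDBC driver on the classpath; the URL, table name, and credentials are placeholders:

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
    .option("dbtable", "my_table") \
    .option("user", "user") \
    .option("password", "password") \
    .load()
df.show(5)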
External Datasets (from HDFS)
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)
# read a text file directly from HDFS
tf1 = sc.textFile("hdfs://meghna/data/demo.CSV")
print(tf1.first())
# query a Hive table
hc.sql("use test")
spf = hc.sql("SELECT * FROM spf LIMIT 100")
spf.show(5)
Creating RDD from existing RDD
A transformation derives a new RDD from an existing one (the existing RDD itself is
immutable and unchanged); thus a transformation is the way to create an RDD from an
already existing RDD, as sketched below.
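A minimal sketch (the input words are illustrative):

words = sc.parallelize(["spark", "hadoop", "spark"])
pairs = words.map(lambda w: (w, 1))                 # new RDD derived from words
spark_only = words.filter(lambda w: w == "spark")   # another new RDD; words is untouched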
Spark1 Vs Spark2
• SparkSession is now the new entry point of Spark that replaces the old SQLContext
and HiveContext. Note that the old SQLContext and HiveContext are kept for backward
compatibility. A new catalog interface is accessible from SparkSession; existing APIs on
databases and tables, such as listTables, createExternalTable, dropTempView, and
cacheTable, are moved here (see the sketch after this list).
• Dataset API and DataFrame API are unified. In Scala, DataFrame becomes a type alias
for Dataset[Row], while Java API users must replace DataFrame with Dataset. Both the typed
transformations (e.g., map, filter, and groupByKey) and untyped transformations (e.g., select
and groupBy) are available on the Dataset class. Since compile-time type-safety in Python and
R is not a language feature, the concept of Dataset does not apply to these languages' APIs.
Instead, DataFrame remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.
• Dataset and DataFrame API unionAll has been deprecated and replaced by union
• Dataset and DataFrame API explode has been deprecated; alternatively, use
functions.explode() with select or flatMap
• Dataset and DataFrame API registerTempTable has been deprecated and replaced by
createOrReplaceTempView
• Changes to CREATE TABLE ... LOCATION behavior for Hive tables.
• From Spark 2.0, CREATE TABLE ... LOCATION is equivalent to CREATE EXTERNAL
TABLE ... LOCATION in order to prevent accidentally dropping existing data in the user-
provided locations. That means a Hive table created in Spark SQL with a user-specified
location is always a Hive external table. Dropping external tables will not remove the data.
Users are not allowed to specify the location for Hive managed tables.
• Note that this is different from the Hive behavior.
As a result, DROP TABLE statements on those tables will not remove the data.
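A minimal PySpark sketch of the Spark 2 entry point and the renamed APIs mentioned above (the application name and view name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark2Example").getOrCreate()

df1 = spark.range(3)                       # DataFrame with ids 0, 1, 2
df2 = spark.range(3, 6)                    # DataFrame with ids 3, 4, 5
df1.union(df2).show()                      # union replaces the deprecated unionAll

df1.createOrReplaceTempView("numbers")     # replaces the deprecated registerTempTable
spark.sql("SELECT * FROM numbers").show()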
Resilient Distributed Datasets (RDDs)
There are two types of transformations:
Narrow transformation — In a narrow transformation, all the elements that are required to compute the
records in a single partition live in a single partition of the parent RDD. A limited subset of the partitions is
used to calculate the result. Functions such as map(), mapPartitions(), flatMap(), filter(), and union() are some
examples of narrow transformations.
Wide transformation — In a wide transformation, all the elements that are required to compute the records
in a single partition may live in many partitions of the parent RDD. Functions such as groupByKey(),
aggregateByKey(), aggregate(), join(), and repartition() are some examples of wide transformations.
The sketch below contrasts the two.
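A minimal sketch contrasting the two kinds (the key/value data and partition count are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)
scaled = pairs.mapValues(lambda v: v * 10)   # narrow: each output partition depends on one parent partition
grouped = pairs.groupByKey()                 # wide: values for a key may come from many partitions (shuffle)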
Resilient Distributed Datasets (RDDs)
ii. Actions
An Action in Spark returns the final result of RDD computations. It triggers
execution using the lineage graph to load the data into the original RDD, carry out
all intermediate transformations, and return the final results to the Driver program
or write them out to the file system. The lineage graph is the dependency graph of
all the parent RDDs of an RDD.
Actions are RDD operations that produce non-RDD values. They materialize
a value in a Spark program. An Action is one of the ways to send results
from the executors to the driver. first(), take(), reduce(), collect(), and count() are
some of the Actions in Spark, as sketched below.
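A minimal sketch of lazy transformations followed by actions (the example lines and the commented results are illustrative):

lines = sc.parallelize(["to be", "or not", "to be"])
words = lines.flatMap(lambda l: l.split(" "))   # transformation: nothing executes yet
words.count()                                   # action: runs the whole lineage, returns 6
words.collect()                                 # action: brings all elements back to the driver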
Resilient Distributed Datasets (RDDs)
Transformations
A Spark transformation is a function that produces a new RDD from existing
RDDs.
It takes an RDD as input and produces one or more RDDs as output; a new RDD is
created each time we apply a transformation.
Applying transformations builds an RDD lineage containing all the parent RDDs of
the final RDD(s). The RDD lineage is also known as the RDD operator graph or RDD
dependency graph, as the sketch below shows.
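A minimal sketch of inspecting the lineage built by a chain of transformations (toDebugString() is the standard RDD method; its exact output format varies by version):

rdd = sc.parallelize(range(100))
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(result.toDebugString())   # textual view of the RDD lineage (operator graph)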
RDD Actions
• Apply transformation chains on RDDs, eventually performing some
additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g. HDFS);
others fetch data from the RDD (and its transformation chain) upon
which the action is applied, and convey it to the driver
• Some common actions
count() – return the number of elements
collect() – return all elements to the driver
take(n) – return the first n elements
saveAsTextFile(path) – write the elements to a text file at the given path
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
Basic Transformations
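A minimal sketch of common transformations on an RDD (the input numbers and the commented results are illustrative):

nums = sc.parallelize([1, 2, 3])
nums.map(lambda x: x * x)          # => RDD of 1, 4, 9
nums.filter(lambda x: x != 1)      # => RDD of 2, 3
nums.flatMap(lambda x: range(x))   # => RDD of 0, 0, 1, 0, 1, 2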
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])
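A minimal sketch of common actions on nums (the output path for saveAsTextFile is a placeholder):

nums.collect()                    # => [1, 2, 3]
nums.take(2)                      # => [1, 2]
nums.count()                      # => 3
nums.reduce(lambda x, y: x + y)   # => 6
nums.saveAsTextFile("hdfs://namenode:9000/path/to/output")   # write the RDD as text files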
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
Example: Spark Streaming
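A minimal word-count sketch using the classic DStream API (pyspark.streaming, available through Spark 3.x); the host, port, and 5-second batch interval are placeholders:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                    # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda l: l.split(" ")) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()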
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")
# Apply it to a stream (illustrative pseudocode: twitterStream and reduceByWindow are shown schematically)
sc.twitterStream(...) \
  .map(lambda t: (model.predict(t.location), 1)) \
  .reduceByWindow("5s", lambda a, b: a + b)
Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter
for number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
MapReduce vs Spark (Summary)
• Performance:
While Spark performs better when all the data fits in main memory (especially
on dedicated clusters), MapReduce is designed for data that doesn't fit in
memory
• Ease of Use:
Spark is easier to use compared to Hadoop MapReduce as it comes with user-
friendly APIs for Scala (its native language), Java, Python, and Spark SQL.
• Fault-tolerance:
Batch processing: Spark recovers via RDD lineage, relying on HDFS replication of the input data
Stream processing: Spark replicates the RDDs of received data
Summary
• Spark is a powerful “manager” for big data computing.
• It centers on a job scheduler for Hadoop (MapReduce) that is smart about where to
run each task: co-locating tasks with their data.
• The data objects are “RDDs”: a kind of recipe for generating a dataset from an
underlying data collection. RDD caching allows Spark to run mostly from in-memory
data, for speed.