Spark Programming Basics

Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (Easier to use):
 Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
 Batch, Streaming, Machine Learning, Graph
• Low Latency (Performance): In-memory processing and caching
• Fault-tolerance: Faults shouldn't be a special case

Spark Core Concepts
• Job: A piece of code which reads some input from HDFS or local, performs some
computation on the data and writes some output data.
• Stages: Jobs are divided into stages. Stages are classified as Map or Reduce
stages (it's easier to understand if you have worked on Hadoop and want to
correlate). Stages are divided at computational boundaries: not all computations
(operators) can be executed in a single stage, so a job runs over many stages.
• Tasks: Each stage has some tasks, one task per partition. One task is executed on
one partition of data on one executor (machine).
• DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of
operators.
• Executor: The process responsible for executing a task.
• Master: The machine on which the Driver program runs
• Slave: The machine on which the Executor program runs
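A minimal PySpark sketch of how these concepts fit together (assuming a SparkContext named sc and a hypothetical input file input.txt): the count() action submits one job, the shuffle required by reduceByKey() splits that job into two stages, and each stage runs one task per partition on the executors.

# one job, two stages (shuffle at reduceByKey), one task per partition
counts = sc.textFile("input.txt") \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda w: (w, 1)) \
           .reduceByKey(lambda a, b: a + b)
counts.count()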
Spark Components
1. Spark Driver
• separate process to execute user applications
• creates SparkContext to schedule jobs execution and negotiate with cluster manager
2. Executors
• run tasks scheduled by driver
• store computation results in memory, on disk or off-heap
• interact with storage systems
3. Cluster Manager
• Mesos
• YARN
• Spark Standalone
Spark Driver contains more components responsible for translating user code
into actual jobs executed on the cluster
Spark Driver Components
• SparkContext
-represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and
broadcast variables on that cluster
• DAGScheduler
–computes a DAG of stages for each job and submits them to the TaskScheduler; determines preferred
locations for tasks (based on cache status or shuffle file locations) and finds a minimum schedule to run
the jobs
• TaskScheduler
–responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
–backend interface for scheduling systems that allows plugging in different implementations (Mesos,
YARN, Standalone, local)
• BlockManager
–provides interfaces for putting and retrieving blocks both locally and remotely into various stores
(memory, disk, and off heap).
How Spark Works

Spark Basics
There are two ways to manipulate data in Spark
• Spark Shell:
 Interactive – for learning or data exploration
 Python or Scala
• Spark Applications
 For large scale data processing
 Python, Scala, or Java

Spark Shell
The Spark Shell provides interactive data exploration
(REPL)

REPL: Read/Evaluate/Print Loop

Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Data
• Transformations
• Actions

Spark Context (1)
• Every Spark application requires a Spark Context: the main entry
point to the Spark API
• Spark Shell provides a preconfigured Spark Context called “sc”
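A minimal sketch of creating a Spark Context in a standalone application (in the Spark Shell this step is unnecessary because sc is already provided); the app name and master URL below are illustrative only.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app").setMaster("local[*]")   # illustrative settings
sc = SparkContext(conf=conf)                                    # main entry point to the Spark API
print(sc.parallelize([1, 2, 3]).count())                        # => 3
sc.stop()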

Spark Context (2)
• Standalone applications  Driver code  Spark Context
• Spark Context represents connection to a Spark cluster

Standalone Application (Driver Program)
Spark Context (3)
• Spark context works as a client and represents connection to a Spark
cluster

Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Data
• Transformations
• Actions

Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an
immutable collection of objects (or records, or elements) that can be operated on
"in parallel" (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information  it can be recreated from parent
RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions  (more partitions –
more parallelism)
Dataset -- initial data can come from a file or be created

RDDs
Key Idea: Write applications in terms of transformations
on distributed datasets
• Collections of objects spread across a cluster; a memory-caching
layer stores the data in a distributed, fault-tolerant cache
• Can fall back to disk when a dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by,
join, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
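A minimal sketch of controllable persistence (assuming a SparkContext named sc and a hypothetical file data.txt): the first action computes and caches the filtered RDD, later actions reuse the in-memory copy, and Spark can recompute partitions from lineage if they fall out of the cache.

lines = sc.textFile("data.txt")
errors = lines.filter(lambda l: "ERROR" in l)
errors.cache()                                           # keep this RDD in memory
print(errors.count())                                    # computes and caches
print(errors.filter(lambda l: "timeout" in l).count())   # reuses the cached data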
RDDs -- Immutability

• Immutability  lineage information  can be recreated at any time  Fault-tolerance
• Avoids data inconsistency problems  no simultaneous updates  Correctness
• Can live in memory as easily as on disk  Caching  Safe to share across processes/tasks  Improves performance
• Tradeoff: (Fault-tolerance & Correctness) vs (Disk Memory
& CPU)

Resilient Distributed Datasets (RDDs)

CREATING RDDS

Parallelized collection (parallelizing)
 In the initial stage of learning Spark, RDDs are generally created from a
parallelized collection, i.e. by taking an existing collection in the program
and passing it to SparkContext's parallelize() method.
 This method is handy while learning Spark, since it lets us quickly create
our own RDDs in the Spark shell and perform operations on them; a small example follows.
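A minimal sketch (assuming the preconfigured sc of the Spark shell):

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)                   # distribute the local collection
print(rdd.count())                           # => 5
print(rdd.map(lambda x: x * 2).collect())    # => [2, 4, 6, 8, 10]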

By Creating a DataFrame
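A minimal sketch (assuming an existing SparkSession named spark; the column names and sample rows are illustrative only):

df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],              # local rows
    ["id", "name"])                          # column names
df.show()
rdd = df.rdd                                 # the underlying RDD of Row objects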

External Datasets (Referencing a dataset)
 In Spark, a distributed dataset can be formed from any data source supported by
Hadoop, including the local file system, HDFS, Cassandra, HBase, etc.

 To create a text file RDD, we can use SparkContext's textFile method. It takes the URL
of the file and reads it as a collection of lines.

 DataFrameReader Interface is used to load a Dataset from external storage systems

 Use SparkSession.read to access an instance of DataFrameReader

External Datasets (from a file)
## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

df = spark.read.format('csv') \
    .options(header='true', inferschema='true') \
    .load("Advertising.csv")

df.show(5)
df.printSchema()

External Datasets (from a database)
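A minimal sketch of reading from a relational database over JDBC (assuming a SparkSession named spark; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be available to Spark):

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "public.mytable") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
df.show(5)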

External Datasets (from HDFS)

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)
tf1 = sc.textFile("hdfs://meghna/data/demo.CSV")
print(tf1.first())
hc.sql("use test")
spf = hc.sql("SELECT * FROM spf LIMIT 100")
print(spf.show(5))
Creating RDD from existing RDD
A transformation derives one RDD from another; thus transformation is the
way to create an RDD from an already existing RDD, as the sketch below shows.
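A minimal sketch (assuming a SparkContext named sc): each transformation returns a new RDD and leaves its parent unchanged.

words = sc.parallelize(["spark", "hadoop", "spark"])
upper = words.map(lambda w: w.upper())       # new RDD derived from words
pairs = upper.map(lambda w: (w, 1))          # another RDD derived from upper
print(pairs.collect())                       # => [('SPARK', 1), ('HADOOP', 1), ('SPARK', 1)]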

Spark1 Vs Spark2
• SparkSession is now the new entry point of Spark, replacing the old SQLContext
and HiveContext. Note that the old SQLContext and HiveContext are kept for backward
compatibility. A new catalog interface is accessible from SparkSession - the existing APIs for
database and table access, such as listTables, createExternalTable, dropTempView, and
cacheTable, have moved here.
• The Dataset API and DataFrame API are unified. In Scala, DataFrame becomes a type alias
for Dataset[Row], while Java API users must replace DataFrame with Dataset. Both the typed
transformations (e.g., map, filter, and groupByKey) and untyped transformations (e.g., select
and groupBy) are available on the Dataset class. Since compile-time type-safety is not a
language feature in Python and R, the concept of Dataset does not apply to these languages' APIs.
Instead, DataFrame remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.
• Dataset and DataFrame API unionAll has been deprecated and replaced by union
• Dataset and DataFrame API explode has been deprecated; alternatively, use
functions.explode() with select or flatMap
• Dataset and DataFrame API registerTempTable has been deprecated and replaced by
createOrReplaceTempView (see the sketch after this list)
• Changes to CREATE TABLE ... LOCATION behavior for Hive tables.
• From Spark 2.0, CREATE TABLE ... LOCATION is equivalent to CREATE EXTERNAL
TABLE ... LOCATION, in order to prevent accidentally dropping the existing data in the user-
provided locations. That means a Hive table created in Spark SQL with a user-specified
location is always a Hive external table. Dropping an external table will not remove the data.
Users are not allowed to specify the location for Hive managed tables.
• Note that this is different from the Hive behavior.
As a result, DROP TABLE statements on those tables will not remove the data.
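A minimal Spark 2.x sketch (the app name and view name are illustrative only): SparkSession is the single entry point, and createOrReplaceTempView replaces the deprecated registerTempTable.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark2-example").getOrCreate()
df = spark.range(10)                         # a simple DataFrame
df.createOrReplaceTempView("numbers")        # instead of registerTempTable
spark.sql("SELECT count(*) AS n FROM numbers").show()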

Resilient Distributed Datasets (RDDs)

Spark RDD Operations


RDD in Apache Spark supports two types of operations:
 Transformation
 Actions
Transformations
Spark RDD Transformations are functions that take an RDD as the input and
produce one or more RDDs as the output. They do not change the input RDD,
but always produce one or more new RDDs by applying the computations they
represent, e.g. map(), filter(), reduceByKey(), etc.

There are two types of transformations:
Narrow transformation - In a narrow transformation, all the elements that are required to compute the
records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to
calculate the result. Functions such as map(), mapPartitions(), flatMap(), filter(), union() are some examples
of narrow transformations.
Wide transformation - In a wide transformation, all the elements that are required to compute the records
in a single partition may live in many partitions of the parent RDD. Functions such as groupByKey(),
aggregateByKey(), aggregate(), join(), repartition() are some examples of wider transformations.

**** When compared to narrow transformations, wider transformations are expensive operations due to shuffling.
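A minimal sketch contrasting the two (assuming a SparkContext named sc):

nums = sc.parallelize(range(10), 4)          # 4 partitions

# Narrow: each output partition depends on one parent partition (no shuffle)
doubled = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# Wide: output partitions depend on many parent partitions, so Spark shuffles
grouped = nums.map(lambda x: (x % 3, x)).groupByKey()
print(grouped.mapValues(list).collect())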

Resilient Distributed Datasets (RDDs)

ii. Actions
An Action in Spark returns the final result of the RDD computations. It triggers
execution using the lineage graph to load the data into the original RDD, carry out
all intermediate transformations, and return the final results to the Driver program
or write them out to the file system. The lineage graph is the dependency graph of
all the parent RDDs of an RDD.

Actions are RDD operations that produce non-RDD values. They materialize
a value in a Spark program. An Action is one of the ways to send results
from executors to the driver. first(), take(), reduce(), collect(), and count() are
some of the Actions in Spark.

Resilient Distributed Datasets (RDDs)

Transformations
 A Spark Transformation is a function that produces a new RDD from existing
RDDs.
 It takes an RDD as input and produces one or more RDDs as output. A new RDD is
created each time we apply a transformation.

 Applying transformations builds an RDD lineage, with all the parent RDDs of
the final RDD(s). RDD lineage is also known as the RDD operator graph or RDD
dependency graph.

 It is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the
parent RDDs of an RDD.

RDD Actions
• Apply transformation chains on RDDs, eventually performing some
additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g. HDFS),
others fetch data from the RDD (and its transformation chain) upon
which the action is applied, and convey it to the driver
• Some common actions
count() – return the number of elements

take(n) – return an array of the first n elements

collect()– return an array of all elements

saveAsTextFile(file) – save to text file(s)

Data in RDDs is not processed until an action is performed
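A minimal sketch of this lazy evaluation (assuming a SparkContext named sc):

lines = sc.parallelize(["a b", "b c", "c d"])
words = lines.flatMap(lambda l: l.split(" "))   # no job runs yet
other = words.filter(lambda w: w != "a")        # still no job
print(other.count())                            # action: triggers the whole chain => 5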


RDDs: Data Locality
• Data Locality Principle
 Same as for Hadoop MapReduce
 Avoid network I/O; workers should operate on local data
• Data Locality and Caching
 First run: data not in cache, so use HadoopRDD’s locality
preferences (from HDFS)
 Second run: FilteredRDD is in cache, so use its locations
 If something falls out of cache, go back to HDFS

Spark Programming (1)

Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3


sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)


sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Spark Programming (2)

Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function


squares = nums.map(lambda x: x*x)   # => [1, 4, 9]

# Keep elements passing a predicate


even = squares.filter(lambda x: x % 2 == 0)   # => [4]

Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection


nums.collect() # => [1, 2, 3]

# Return first K elements


nums.take(2) # => [1, 2]

# Count number of elements


nums.count() # => 3

# Merge elements with an associative function


nums.reduce(lambda x, y: x + y) # => 6
Spark Programming (4)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on RDDs of
key-value pairs

Python: pair = (a, b)


pair[0] # => a
pair[1] # => b

Scala: val pair = (a, b)


pair._1 // => a
pair._2 // => b

Java: Tuple2 pair = new Tuple2(a, b);


pair._1 // => a
pair._2 // => b
Spark Programming (5)
Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}

pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)

Example: Spark Streaming

Represents streams as a series of RDDs over time (typically sub-second intervals, but it is configurable)

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()

Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")

# Train a machine learning model


model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
.reduceByWindow("5s", lambda a, b: a + b)

Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter
for number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

MapReduce vs Spark (Summary)
• Performance:
 While Spark performs better when all the data fits in the main memory (especially
on dedicated clusters), MapReduce is designed for data that doesn’t fit in the
memory
• Ease of Use:
 Spark is easier to use than Hadoop MapReduce as it comes with user-friendly
APIs for Scala (its native language), Java, Python, and Spark SQL.
• Fault-tolerance:
 Batch processing: Spark  HDFS replication
 Stream processing: Spark RDDs replicated

Summary
• Spark is a powerful “manager” for big data computing.
• It centers on a job scheduler for Hadoop (MapReduce) that is smart about where to
run each task: co-locate task with data.
• The data objects are “RDDs”: a kind of recipe for generating a file from an
underlying data collection. RDD caching allows Spark to run mostly from memory-
mapped data, for speed.

• Online tutorials: spark.apache.org/docs/latest


