Spark Intro
SPARK:
-->It is a general-purpose, in-memory compute/processing engine.
Why is Spark so fast?
MAP-REDUCE: For every MR job, each mapper reads the data from HDFS and writes the results back to HDFS. That means 2 disk operations per mapper, and in general we have hundreds of mappers for each task.
-->Disk I/O is slow, so the computation takes more time. Spark instead keeps intermediate data in memory, avoiding these disk writes between steps.
2. Action:
-->Actions are in charge of actually performing the computation.
-->They are not lazy; they execute as soon as they are called.
Ex1. Load a 2GB file into Spark and print its 1st line:
-->Loading the file is lazy, so only the DAG is created; the data is not read yet.
-->Only when an action such as "collect" is performed is the file actually loaded.
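A minimal sketch of this example in Scala (the file path and variable names are assumptions; sc is an existing SparkContext, and first() is used instead of collect so only the first line is fetched):
val lines = sc.textFile("/path/to/2gb_file.txt")   // lazy: only the DAG is built, nothing is read yet
val firstLine = lines.first()                      // action: the file is actually read now
println(firstLine)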
-->flatMap() ==> It takes each line as input and transforms it according to our operation. Splitting alone would give an array of arrays (each line stored as an array):
-->Array(Array(I,am,passionate,about,big,data), Array(I,love,big,data), …)
-->i.e. it takes each line as input, splits it on " " and creates an array for each line
-->finally flatMap merges all the arrays into a single array, as seen below.
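A small sketch of the same idea (variable names are assumptions; the input lines are the ones from the example above):
val lines = sc.parallelize(Seq("I am passionate about big data", "I love big data"))
val words = lines.flatMap(line => line.split(" "))  // each line -> array of words, then all arrays are flattened
words.collect()  // Array(I, am, passionate, about, big, data, I, love, big, data)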
SPARK STAGES:
1. ShuffleMapStage: It is an intermediate stage in the DAG that provides data for another stage, i.e. shuffling of data takes place.
2. ResultStage: It produces the output for the "action" that was called, which is the final result given by Spark.
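For example (a hedged sketch; words is the RDD from the flatMap sketch above), a word count splits into two stages because reduceByKey needs a shuffle:
val counts = words.map(w => (w, 1)).reduceByKey((x, y) => x + y)  // map side runs in the ShuffleMapStage
counts.collect()  // the action triggers the ResultStage, which produces the final output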
Broadcast Variables:
Problem: From the above example we can see redundant words (in, the, a, etc.) which are not very helpful.
We collect a list of such words, broadcast it, and filter those words out using the "filter" function.
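A minimal sketch, assuming the words RDD from above and a hypothetical stop-word list:
val stopWords = sc.broadcast(Set("in", "the", "a"))            // shipped once to every executor
val cleaned = words.filter(w => !stopWords.value.contains(w))  // filter reads the local broadcast copy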
-->Job Tracker:
1. Clients submit requests to it, which are then scheduled and distributed to the slave nodes.
2. It schedules jobs, prioritizes them and allocates resources to them.
3. It monitors the performance of jobs.
4. It single-handedly takes care of all these tasks.
-->Task Tracker:
1. The tasks to be performed on each block are tracked by the Task Tracker on that node.
Limitations:
1. Scalability: As the cluster size increases, the load on the Job Tracker grows and its performance dips.
2. Resource under-utilization: MR slots are fixed, so they are often not used to their full potential.
3. Only MR jobs can be handled (no graph processing etc.)
-->Because the Job Tracker has to manage a huge number of tasks and resources, it fails to perform well. So a separate component was introduced just to schedule and manage resources, which is YARN.
-->Components of YARN:
1. Resource Manager: Does only scheduling (so its load is reduced).
2. Application Master: Does the monitoring and negotiates resources.
3. Node Manager: Manages all the tasks assigned to it.
Architecture:
Before shuffle:
P1 = ({vaish,1},{big,1},{love,1})
P2 = ({love,1},{vaish,1})
After shuffle and aggregation:
P1 = ({vaish,2},{big,1})
P2 = ({love,2})
To get from the first state to the second, data has to be moved across the partitions to compute the final result.
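A hedged sketch of inspecting the per-partition contents before and after the shuffle (the pairs are taken from the example above; glom collects each partition into an array):
val pairs = sc.parallelize(Seq(("vaish", 1), ("big", 1), ("love", 1), ("love", 1), ("vaish", 1)), 2)
pairs.glom().collect()                               // contents of P1 and P2 before the shuffle
pairs.reduceByKey((x, y) => x + y).glom().collect()  // contents after data moves across partitions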
reduceByKey() vs reduce():
reduceByKey():
1. It is a transformation
2. It works only on pair RDDs
3. It gives multiple output values
reduce():
1. It is an action
2. It gives a single output value
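Using the pairs RDD from the sketch above, the difference in a nutshell:
val perKey = pairs.reduceByKey((x, y) => x + y)      // transformation: one output value per key
val total = pairs.map(_._2).reduce((x, y) => x + y)  // action: a single output value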
reduceByKey() vs groupByKey():
reduceByKey():
1. Each partition will process the data and give the output to the reducer machine
2. Less processing at the reducer
3. More parallelism
4. Performs local aggregation
groupByKey():
1. Each partition will just group the similar inputs and give them to the reducer machine
2. More processing at the reducer
3. Less parallelism
4. Does not perform local aggregation
5. If the data is huge then the reducer can't single-handedly work on it, due to which we should never use groupByKey()
Practical: reduceByKey()
val mapIn = X.map(x => (x, 1))
val redIn = mapIn.reduceByKey((x, y) => x + y)
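For comparison, a hedged sketch of a groupByKey() equivalent: it shuffles all the values and only then sums them on the reducer side:
val grpIn = mapIn.groupByKey().mapValues(vals => vals.sum)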
Properties in Spark:
1. Repartition vs Coalesce
Repartition:
1. Used to either increase or decrease the no. of partitions of an RDD
val newRdd = inputFile.repartition(10)
val newRdd = inputFile.repartition(1)
2. It is a wide transformation, as data shuffling is involved
3. Use repartition for an INCREASE of partitions
4. Don't use REPARTITION for a DECREASE: repartition tries to equally distribute the data among all partitions, due to which more shuffling is required
Coalesce:
1. Used only to decrease the no. of partitions
2. If we try to increase partitions, no change is made and the old partition count remains
val newRdd = inputFile.coalesce(2)
3. Use coalesce for a DECREASE of partitions
4. It combines partitions on the same machines, due to which shuffling is less and the operation is optimized
Before:
Node1: p1 p2
Node2: p3 p4
Node3: p5 p6
After:
Node1: p1 (p1+p2)
Node2: p3 (p3+p4)
Node3: p5 (p5+p6)
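A small usage sketch (inputFile is assumed to be an existing RDD):
println(inputFile.getNumPartitions)        // current partition count
val increased = inputFile.repartition(10)  // full shuffle, evenly redistributes the data
val decreased = inputFile.coalesce(2)      // merges co-located partitions, minimal shuffling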
**
1. No. of actions = No. of jobs = No. of Application Masters (AM)
2. No. of tasks = No. of partitions
3. 1 block in local mode = 1 RDD partition = 32 MB
4. 1 RDD = n partitions
5. 1 task = 1 partition
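A hedged sketch of checking these relationships (the file path and size are hypothetical):
val rdd = sc.textFile("/path/to/local_file.txt")  // a ~96 MB local file with 32 MB splits gives ~3 partitions
println(rdd.getNumPartitions)                     // no. of partitions = no. of tasks per stage
rdd.count()                                       // each action (count, collect, ...) launches one job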