Spark Intro
SPARK:
-->It is a general-purpose, in-memory compute/processing engine.
Why is Spark so fast?
MAP-REDUCE: For every MR job, each mapper reads the data from HDFS and writes the results back to HDFS. That means 2 disk operations per mapper, and in general we have hundreds of mappers for each task.
-->Disk I/O is slow, so the computation takes more time. Spark instead keeps intermediate data in memory, avoiding these disk writes between steps.
2. Action:
-->Actions are in charge of actually performing the computation.
-->They are not lazy; they execute as soon as they are called.
Ex1. Load a 2GB file into Spark and print its 1st line:
-->Loading the file is lazy, so only the DAG is created; the data is not read yet.
-->Only when an action such as "collect" is performed is the file actually loaded.
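A minimal sketch of this example in Scala (the file path and variable names are assumptions; sc is an existing SparkContext, and first() is used instead of collect so only the first line is fetched):
val lines = sc.textFile("/path/to/2gb_file.txt")   // lazy: only the DAG is built, nothing is read yet
val firstLine = lines.first()                      // action: the file is actually read now
println(firstLine)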
-->flatMap() ==> It takes each line as input and transforms it according to our operation. Splitting alone would give an array of arrays (each line stored as an array):
-->Array(Array(I,am,passionate,about,big,data), Array(I,love,big,data), …)
-->i.e. it takes each line as input, splits it on " " and creates an array for each line
-->finally flatMap merges all the arrays into a single array, as seen below.
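A small sketch of the same idea (variable names are assumptions; the input lines are the ones from the example above):
val lines = sc.parallelize(Seq("I am passionate about big data", "I love big data"))
val words = lines.flatMap(line => line.split(" "))  // each line -> array of words, then all arrays are flattened
words.collect()  // Array(I, am, passionate, about, big, data, I, love, big, data)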
SPARK STAGES:
1. ShuffleMapStage: It is an intermediate stage in the DAG that provides data for another stage, i.e. shuffling of data takes place.
2. ResultStage: It produces the output for the "action" that was called, which is the final result given by Spark.
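For example (a hedged sketch; words is the RDD from the flatMap sketch above), a word count splits into two stages because reduceByKey needs a shuffle:
val counts = words.map(w => (w, 1)).reduceByKey((x, y) => x + y)  // map side runs in the ShuffleMapStage
counts.collect()  // the action triggers the ResultStage, which produces the final output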
Broadcast Variables:
Problem: From the above example we can see redundant words (in, the, a, etc.) which are not very helpful.
We collect a list of such words, broadcast it, and filter those words out using the "filter" function.
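A minimal sketch, assuming the words RDD from above and a hypothetical stop-word list:
val stopWords = sc.broadcast(Set("in", "the", "a"))            // shipped once to every executor
val cleaned = words.filter(w => !stopWords.value.contains(w))  // filter reads the local broadcast copy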
-->Job Tracker:
1. Clients submit requests to it, which are then scheduled and distributed to the slave nodes.
2. It schedules jobs, prioritizes them and allocates resources to them.
3. It monitors the performance of jobs.
4. It single-handedly takes care of all these tasks.
-->Task Tracker:
1. The tasks to be performed on each block are tracked by the Task Tracker on that node.
Limitations:
1. Scalability: As the cluster size increases, the load on the Job Tracker grows and its performance dips.
2. Resource under-utilization: MR slots are fixed, so they are often not used to their full potential.
3. Only MR jobs can be handled (no graph processing etc.)
-->Because the Job Tracker has to manage a huge number of tasks and resources, it fails to perform well. So a separate component was introduced just to schedule and manage resources, which is YARN.
-->Components of YARN:
1. Resource Manager: Does only scheduling (so its load is reduced).
2. Application Master: Does the monitoring and negotiates resources.
3. Node Manager: Manages all the tasks assigned to it.
Architecture:
Before shuffle:
P1 = ({vaish,1},{big,1},{love,1})
P2 = ({love,1},{vaish,1})
After shuffle and aggregation:
P1 = ({vaish,2},{big,1})
P2 = ({love,2})
To get from the first state to the second, data has to be moved across the partitions to compute the final result.
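A hedged sketch of inspecting the per-partition contents before and after the shuffle (the pairs are taken from the example above; glom collects each partition into an array):
val pairs = sc.parallelize(Seq(("vaish", 1), ("big", 1), ("love", 1), ("love", 1), ("vaish", 1)), 2)
pairs.glom().collect()                               // contents of P1 and P2 before the shuffle
pairs.reduceByKey((x, y) => x + y).glom().collect()  // contents after data moves across partitions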
reduceByKey() vs reduce():
reduceByKey():
1. It is a transformation
2. It works only on pair RDDs
3. It gives multiple output values
reduce():
1. It is an action
2. It gives a single output value
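Using the pairs RDD from the sketch above, the difference in a nutshell:
val perKey = pairs.reduceByKey((x, y) => x + y)      // transformation: one output value per key
val total = pairs.map(_._2).reduce((x, y) => x + y)  // action: a single output value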
reduceByKey() vs groupByKey():
reduceByKey():
1. Each partition will process the data and give the output to the reducer machine
2. Less processing at the reducer
3. More parallelism
4. Performs local aggregation
groupByKey():
1. Each partition will just group the similar inputs and give them to the reducer machine
2. More processing at the reducer
3. Less parallelism
4. Does not perform local aggregation
5. If the data is huge then the reducer can't single-handedly work on it, due to which we should never use groupByKey()
Practical: reduceByKey()
val mapIn = X.map(x => (x, 1))
val redIn = mapIn.reduceByKey((x, y) => x + y)
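For comparison, a hedged sketch of a groupByKey() equivalent: it shuffles all the values and only then sums them on the reducer side:
val grpIn = mapIn.groupByKey().mapValues(vals => vals.sum)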
Properties in Spark:
1. Repartition vs Coalesce
Repartition:
1. Used to either increase or decrease the no. of partitions of an RDD
val newRdd = inputFile.repartition(10)
val newRdd = inputFile.repartition(1)
2. It is a wide transformation, as data shuffling is involved
3. Use repartition for an INCREASE of partitions
4. Don't use REPARTITION for a DECREASE: repartition tries to equally distribute the data among all partitions, due to which more shuffling is required
Coalesce:
1. Used only to decrease the no. of partitions
2. If we try to increase partitions, no change is made and the old partition count remains
val newRdd = inputFile.coalesce(2)
3. Use coalesce for a DECREASE of partitions
4. It combines partitions on the same machines, due to which shuffling is less and the operation is optimized
Before:
Node1: p1 p2
Node2: p3 p4
Node3: p5 p6
After:
Node1: p1 (p1+p2)
Node2: p3 (p3+p4)
Node3: p5 (p5+p6)
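A small usage sketch (inputFile is assumed to be an existing RDD):
println(inputFile.getNumPartitions)        // current partition count
val increased = inputFile.repartition(10)  // full shuffle, evenly redistributes the data
val decreased = inputFile.coalesce(2)      // merges co-located partitions, minimal shuffling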
**
1. No. of actions = No. of jobs = No. of Application Masters (AM)
2. No. of tasks = No. of partitions
3. 1 block in local mode = 1 RDD partition = 32 MB
4. 1 RDD = n partitions
5. 1 task = 1 partition
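A hedged sketch of checking these relationships (the file path and size are hypothetical):
val rdd = sc.textFile("/path/to/local_file.txt")  // a ~96 MB local file with 32 MB splits gives ~3 partitions
println(rdd.getNumPartitions)                     // no. of partitions = no. of tasks per stage
rdd.count()                                       // each action (count, collect, ...) launches one job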