Rack Awareness in Hadoop is the concept of choosing a DataNode close to the client that
raised the read/write request (ideally on the same rack), thereby reducing network
traffic.
An edge node is a computer that acts as an end user portal for communication with
other nodes in cluster computing. Edge nodes are also sometimes called gateway
nodes or edge communication nodes. In a Hadoop cluster, three types of nodes exist:
master, worker and edge nodes. Master nodes coordinate the storage of data in HDFS
(the NameNode tracks where the data lives) and oversee key operations, such as running
parallel computations on the data using MapReduce. The worker nodes comprise most of
the machines in a Hadoop cluster and do the actual work of storing the data and running
the computations. Each worker node runs the DataNode and TaskTracker (NodeManager in
YARN) services, which receive instructions from the master nodes.
Speculative execution means that Hadoop does not try to fix slow tasks, since the reason
for the slowness (misconfiguration, hardware issues, etc.) is hard to detect. Instead,
for each task that is running slower than expected it launches a parallel backup copy on
another (faster) node; whichever copy finishes first is used and the other is killed.
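The same idea exists in Spark, where speculative execution can be switched on through
configuration. A minimal sketch (the app name is a placeholder; the threshold values
shown are just the defaults spelled out explicitly):

    import org.apache.spark.sql.SparkSession

    // Ask Spark to re-launch straggler tasks on other executors and keep whichever copy finishes first.
    val spark = SparkSession.builder()
      .appName("speculation-demo")                     // placeholder app name
      .config("spark.speculation", "true")
      .config("spark.speculation.multiplier", "1.5")   // a task counts as slow if it runs 1.5x longer than the median
      .config("spark.speculation.quantile", "0.75")    // start checking once 75% of the tasks in a stage have finished
      .getOrCreate()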
groupByKey receives key-value pairs and groups the records by key. reduceByKey can be
used to get the same end result (e.g. per-key sums), but it works much better on a large
dataset. That's because Spark knows it can combine values that share a key on each
partition (a map-side combine) before shuffling the data.
When calling groupByKey, on the other hand, all the key-value pairs are shuffled around,
which is a lot of unnecessary data being transferred over the network. In short,
reduceByKey uses a combiner to pre-aggregate the data while groupByKey doesn't.
e.g. val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))
reduceByKey(_ + _) o/p -> ("A", 2), ("B", 1), ("C", 1)
groupByKey o/p -> ("A", [1, 1]), ("B", [1]), ("C", [1])  (the values per key are grouped into an Iterable)
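A runnable version of the same comparison, assuming a SparkContext named sc is already
available (as in spark-shell):

    val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))

    // reduceByKey combines values per key on each partition before the shuffle (map-side combine).
    val reduced = testRdd.reduceByKey(_ + _)
    reduced.collect().foreach(println)                 // (A,2), (B,1), (C,1)

    // groupByKey shuffles every key-value pair; the summing can only happen after the shuffle.
    val grouped = testRdd.groupByKey().mapValues(_.sum)
    grouped.collect().foreach(println)                 // same totals, but more data moved over the network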
Resilient because RDDs are immutable (they cannot be modified once created) and fault
tolerant (a lost partition can be recomputed from its lineage), Distributed because the
data is partitioned across the nodes of the cluster, and Dataset because it holds data.
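A small illustration of the immutability point, again assuming a SparkContext sc:

    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    val doubled = nums.map(_ * 2)   // map does not change nums; it returns a new RDD
    // If a partition of `doubled` is lost, Spark can recompute it from `nums` via the recorded lineage.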
DAG (directed acyclic graph) is the representation of the way Spark will execute your
program - each vertex on that graph is a separate operation and the edges represent the
dependencies between operations. Your program (and thus the DAG that represents it) may
operate on multiple entities (RDDs, DataFrames, etc). RDD lineage is just the portion of
the DAG (one or more operations) that leads to the creation of that particular RDD.
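Spark can print the lineage it has recorded for an RDD via toDebugString; a minimal
sketch (the input path is hypothetical):

    val words  = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    println(counts.toDebugString)   // prints the chain of parent RDDs (the lineage) behind counts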
Spark architecture -
On the master node you have the driver program, which drives your application. The code
you write behaves as the driver program, or if you are using the interactive shell, the
shell acts as the driver program. Inside the driver program, the first thing you do is
create a SparkContext. Think of the Spark context as a gateway to all Spark
functionality. It is similar to a database connection: any command you execute against
your database goes through that connection, and likewise anything you do in Spark goes
through the Spark context. The Spark context works with the cluster manager to manage
the various jobs, and the driver program and Spark context together take care of job
execution within the cluster. A job is split into multiple tasks, which are distributed
over the worker nodes. Whenever an RDD is created in the Spark context, it can be
distributed across various nodes and cached there.
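A minimal sketch of a driver program creating that context (the object name, app name
and master URL are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object MyDriverApp {                                   // hypothetical driver program
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("my-driver-app")                     // placeholder app name
          .setMaster("spark://master-host:7077")           // placeholder cluster manager URL
        val sc = new SparkContext(conf)                    // the gateway to all Spark functionality

        val data = sc.parallelize(1 to 100)                // the RDD is partitioned across worker nodes
        println(data.reduce(_ + _))                        // tasks run on the workers; the result comes back to the driver
        sc.stop()
      }
    }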
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks run on the
partitioned RDDs held on the worker nodes, and the results are returned to the Spark
Context.
Spark Context takes the job, breaks it into tasks and distributes them to the worker nodes.
Some ways to tune a slow or failing Spark job:
-> We can identify which DataFrames are taking longer to execute and use repartition accordingly.
-> Tune the number of shuffle partitions (spark.sql.shuffle.partitions) and the parallelism.
-> In case the job is failing at some point due to a stage failure, we can experiment with
adding a checkpoint directory as well (see the sketch below).
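A minimal sketch of these knobs, assuming a SparkSession named spark and a DataFrame
named df already exist (the values, column name and path are just illustrative):

    // Shuffle partitions / parallelism: controls how many partitions shuffle stages produce.
    spark.conf.set("spark.sql.shuffle.partitions", "200")          // illustrative value

    // Repartition a DataFrame that is slow because of too few (or skewed) partitions.
    val balanced = df.repartition(200, df("some_key"))             // hypothetical column name

    // Checkpointing truncates the lineage, so a stage failure doesn't trigger a long recomputation.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path
    val stable = balanced.checkpoint()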