Unit-2 CH 1 Updated
• That is why HDFS needs a feature that keeps copies (replicas) of each file block for backup purposes;
this property is known as fault tolerance.
• Note that keeping several replicas of every file block consumes a lot of extra storage. For large
organizations, however, the data is far more valuable than the disk space, so this overhead is
considered acceptable.
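The storage cost of replication can be estimated with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the default replication factor of 3 is the value used later in these notes, and the 128 MB block size is an assumption, since no block size is stated here.

```python
import math

# Assumption: 128 MB blocks. The notes above do not state a block size.
BLOCK_SIZE_MB = 128


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication_factor=3):
    """Total cluster storage used by a file once every block is replicated."""
    return file_size_mb * replication_factor


def total_block_replicas(file_size_mb, replication_factor=3,
                         block_size_mb=BLOCK_SIZE_MB):
    """Total number of block copies stored across the cluster."""
    return num_blocks(file_size_mb, block_size_mb) * replication_factor


if __name__ == "__main__":
    size = 1000  # a hypothetical 1000 MB file
    print(num_blocks(size), raw_storage_mb(size), total_block_replicas(size))
```

Under these assumptions, a 1000 MB file occupies 3000 MB of raw storage, which is the trade-off the bullet above describes.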
• What is a rack?
• A rack is a collection of around 40-50 DataNodes connected
to the same network switch.
• If that switch goes down, the whole rack becomes unavailable.
• A large Hadoop cluster is deployed in multiple racks.
• What is Rack Awareness in Hadoop HDFS?
• In a large Hadoop cluster, there are multiple racks.
• Each rack consists of DataNodes.
• Communication between DataNodes on the same rack is more efficient
than communication between DataNodes residing on different
racks.
• To reduce network traffic during file reads and writes, the NameNode
chooses the closest DataNode to serve the client's read/write
request.
• To do this, the NameNode maintains the rack id of each DataNode.
• This concept of choosing the closest DataNode based on the rack
information is known as Rack Awareness.
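The idea of "closest DataNode" can be sketched as a small distance function over rack ids. This is a minimal illustration, not the NameNode's actual code: the node names, the rack ids, and the helper functions are hypothetical, and the distance values 0, 2 and 4 follow the commonly described tree-distance convention (same node, same rack, different rack), which is treated here as an assumption.

```python
# Hypothetical rack ids for four DataNodes (illustration only).
RACK_OF = {
    "dn1": "/rack1",
    "dn2": "/rack1",
    "dn3": "/rack2",
    "dn4": "/rack2",
}


def distance(node_a, node_b):
    """Approximate network distance between two nodes, based on rack ids."""
    if node_a == node_b:
        return 0  # same node
    if RACK_OF[node_a] == RACK_OF[node_b]:
        return 2  # different nodes, same rack (same switch)
    return 4      # nodes on different racks


def closest_datanode(client_node, replica_nodes):
    """Pick the replica holder with the smallest distance to the client."""
    return min(replica_nodes, key=lambda dn: distance(client_node, dn))


if __name__ == "__main__":
    # A client on dn1 reads a block stored on dn3 and dn2.
    print(closest_datanode("dn1", ["dn3", "dn2"]))
```

Here the client on dn1 is served from dn2, because both sit on the same rack and the request never has to cross between switches.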
Goals of HDFS
• Fault detection and recovery :
• Since an HDFS cluster includes a large number of commodity hardware components, component
failure is frequent.
• Therefore HDFS should have mechanisms for quick and automatic fault detection and
recovery.
• Huge datasets :
• HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
• Hardware at data :
• A requested task can be done efficiently when the computation takes place near the
data.
• Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.
HDFS Architecture
Terminology
• Payload - Applications implement the Map and the Reduce functions, and form the core of
the job.
• Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode - Node that manages the Hadoop Distributed File System (HDFS).
• DataNode - Node where data is presented in advance before any processing takes place.
• MasterNode - Node where Job Tracker runs and which accepts job requests from clients.
• Slave Node - Node where Map and Reduce program runs.
• Job Tracker - Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker - Tracks the task and reports status to Job Tracker.
• Job - A program is an execution of a Mapper and Reducer across a dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt - A particular instance of an attempt to execute a task on a Slave Node.
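To make the Mapper, Job and Task terms more concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API. The function names and the in-memory grouping step are assumptions made for illustration, and on a real cluster the map and reduce tasks would run on separate Slave Nodes.

```python
from collections import defaultdict


def mapper(line):
    """Map one input record to intermediate (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(word, counts):
    """Combine all intermediate values that share one key."""
    return word, sum(counts)


def run_job(lines):
    """A 'job': run the mapper and the reducer across the whole dataset."""
    grouped = defaultdict(list)
    # Each call to mapper() plays the role of one map task on a slice of data.
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Each call to reducer() plays the role of one reduce task for one key.
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "Big cluster"]))
```

In this sketch the two input lines are the slices of data, and the dictionary returned by `run_job` is the final output of the job.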
• HDFS follows the master-slave architecture and
has the following elements.
• Namenode
• The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. The namenode software itself can run on any commodity
hardware. The system hosting the namenode acts as the master server and performs the
following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files and
directories.
• Datanode
• Replication:
Assuming the replication factor is 3:
• When a client running on a data node (say R1N1) writes a file, Hadoop attempts to save the first replica
on that same data node (R1N1).
• The second replica is written to a node (R2N2) in a different rack (R2).
• The third replica is written to another node (R2N1) in the same rack (R2) where the second replica was
saved.
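The three placement rules above can be sketched as a small function. This is only an illustration under stated assumptions: `place_replicas` and the `topology` dictionary are hypothetical, and nodes are picked in list order so the result is repeatable, whereas a real cluster typically chooses among the eligible nodes at random.

```python
def place_replicas(writer_node, topology):
    """Return [first, second, third] replica locations for replication factor 3.

    topology maps a rack id to the list of data nodes on that rack.
    """
    # Find the rack that holds the writing node.
    writer_rack = next(
        (rack for rack, nodes in topology.items() if writer_node in nodes), None
    )
    if writer_rack is None:
        raise ValueError("writer node is not part of the topology")

    # Rule 1: the first replica stays on the writing node itself.
    first = writer_node

    # Rule 2: the second replica goes to a node on a different rack.
    remote_rack = next(
        (rack for rack, nodes in topology.items()
         if rack != writer_rack and len(nodes) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("no other rack with at least two nodes")
    second = topology[remote_rack][0]

    # Rule 3: the third replica goes to another node on that same remote rack.
    third = next(node for node in topology[remote_rack] if node != second)

    return [first, second, third]


if __name__ == "__main__":
    racks = {"R1": ["R1N1", "R1N2"], "R2": ["R2N1", "R2N2"]}
    print(place_replicas("R1N1", racks))
```

With this layout, the loss of either rack still leaves at least one replica available, while only one of the three copies has to cross between racks during the write.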