
UNIT – 2

• Hadoop Distributed File System
• Processing Data with Hadoop
• Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)
• Introduction to MapReduce Programming:
• Introduction
• Mapper
• Reducer
• Combiner
• Partitioner
• Searching
• Sorting
• Compression
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware. It has many similarities with existing distributed file
systems; however, the differences from other distributed file systems are
significant. HDFS is highly fault-tolerant and is designed to be deployed on
low-cost hardware. It provides high-throughput access to application data and
is suitable for applications with large datasets.
Apart from the two core components, HDFS and MapReduce, the Hadoop framework
also includes the following two modules:
Hadoop Common − the Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN − a framework for job scheduling and cluster resource management.
HDFS
• HDFS (Hadoop Distributed File System) is used as the storage layer in a
Hadoop cluster.
• HDFS provides fault tolerance and high availability to the storage layer and
the other devices present in the Hadoop cluster.
• Data storage nodes in HDFS:
• NameNode (master)
• DataNode (slave)
NameNode
• NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
• NameNode is mainly used for storing the metadata, i.e. the data about the data.
• Metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
• Metadata also includes the name of a file, its size, and the location information (block
number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster
communication.
• NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode
• DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop
cluster; the number of DataNodes can range from 1 to 500 or even more.
• The more DataNodes there are, the more data the Hadoop cluster can store.
• It is advised that DataNodes have a high storage capacity so that they can store a large
number of file blocks.
High Level Architecture Of Hadoop
Processing Data with Hadoop - Managing Resources and Applications with Hadoop YARN
File Block In HDFS:
Data in HDFS is always stored in terms of blocks. A single file is divided into
multiple blocks of 128 MB each, which is the default size and can also be
changed manually.
• Let's understand this concept of breaking a file into blocks with an example.
• Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of
128 MB + 128 MB + 128 MB + 16 MB = 400 MB (see the sketch after this list).
• That means 4 blocks are created, each of 128 MB except the last one.
• Hadoop doesn't know or care what data is stored in these blocks, so it treats
the final file block as a potentially partial record, since it has no idea
about the record boundaries inside.
• In the Linux file system, the size of a file block is about 4 KB, which is
much smaller than the default block size in the Hadoop file system.
• As we all know, Hadoop is mainly configured for storing large datasets, on
the petabyte scale. This is what makes the Hadoop file system different from
other file systems: it can scale. Nowadays file blocks of 128 MB to 256 MB are
used in Hadoop.
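
To make the block arithmetic concrete, here is a minimal sketch in Java; the 400 MB file size and 128 MB block size are the values from the example above.

// Sketch: how a 400 MB file is split into 128 MB HDFS blocks.
public class BlockSplit {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // default HDFS block size (128 MB)
        long fileSize = 400L * 1024 * 1024;  // example file of 400 MB

        long fullBlocks = fileSize / blockSize;                    // 3 full blocks
        long lastBlockMB = (fileSize % blockSize) / (1024 * 1024); // 16 MB remainder

        System.out.println(fullBlocks + " full blocks of 128 MB");
        if (lastBlockMB > 0) {
            System.out.println("1 final block of " + lastBlockMB + " MB");
        }
        // Prints: 3 full blocks of 128 MB, then 1 final block of 16 MB (4 blocks total).
    }
}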
Replication In HDFS
• Replication ensures the availability of the data.
• Replication means making a copy of something, and the number of copies made of a particular
item is expressed as its replication factor.
• HDFS stores the data in the form of blocks, and Hadoop is also configured to make copies of
those file blocks.
• By default, the replication factor for Hadoop is set to 3, and it can be changed manually as
per your requirement. In the example above we made 4 file blocks, so with 3 replicas (copies)
of each file block, a total of 4 × 3 = 12 blocks are stored for backup purposes (see the
sketch after this list).
• This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can
crash at any time.
• That is why HDFS needs a feature that makes copies of the file blocks for backup purposes;
this is known as fault tolerance.
• One thing to note: making so many replicas of the file blocks consumes extra storage, but
for large organizations the data is far more important than the storage, so nobody minds this
extra storage.
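
As an illustration, here is a minimal sketch of inspecting and changing a file's replication factor through Hadoop's standard FileSystem Java API; the path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reading and changing a file's replication factor in HDFS.
// "/data/sample.txt" is a hypothetical path used only for illustration.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication()); // default: 3

        // Raise this one file's replication factor to 4; the NameNode then
        // instructs DataNodes to create the extra copy of each block.
        fs.setReplication(file, (short) 4);
    }
}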
• What is a rack?
• A rack is a collection of around 40-50 DataNodes connected using the same
network switch.
• If the network switch goes down, the whole rack becomes unavailable.
• A large Hadoop cluster is deployed across multiple racks.
• What is Rack Awareness in Hadoop HDFS?
• In a large Hadoop cluster, there are multiple racks, and each rack consists
of DataNodes.
• Communication between DataNodes on the same rack is more efficient than
communication between DataNodes residing on different racks.
• To reduce network traffic during file reads/writes, the NameNode chooses the
closest DataNode to serve the client's read/write request (see the sketch
after this list).
• The NameNode maintains the rack ID of each DataNode to obtain this rack
information.
• This concept of choosing the closest DataNode based on the rack information
is known as Rack Awareness.
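
As a client-side illustration of rack awareness, here is a minimal sketch using Hadoop's standard BlockLocation API; the path is hypothetical, and topology paths such as /rack1/r1n1 appear only when rack awareness is actually configured on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: listing the rack/host placement of each block of a file.
public class RackPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt"); // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each topology path encodes rack + host, e.g. /rack1/r1n1
            System.out.println("Block " + i + ": "
                    + String.join(", ", blocks[i].getTopologyPaths()));
        }
    }
}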
Goals of HDFS
• Fault detection and recovery:
• Since HDFS includes a large number of commodity hardware components, failure
of components is frequent.
• Therefore HDFS should have mechanisms for quick and automatic fault detection
and recovery.
• Huge datasets:
• HDFS should scale to hundreds of nodes per cluster to manage applications
having huge datasets.
• Hardware at data:
• A requested task can be done efficiently when the computation takes place
near the data.
• Especially where huge datasets are involved, this reduces the network traffic
and increases the throughput.
HDFS Architecture
Terminology
• Payload - Applications implement the Map and Reduce functions, and form the core of
the job.
• Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs (see
the word-count sketch after this list).
• NameNode - Node that manages the Hadoop Distributed File System (HDFS).
• DataNode - Node where the data is placed in advance, before any processing takes place.
• MasterNode - Node where the Job Tracker runs and which accepts job requests from clients.
• SlaveNode - Node where the Map and Reduce programs run.
• Job Tracker - Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker - Tracks the tasks and reports status to the Job Tracker.
• Job - An execution of a Mapper and Reducer across a dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
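
To ground the Mapper and Reducer terminology, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: the classic word-count Mapper and Reducer.
// The Mapper turns (byte offset, line) pairs into intermediate (word, 1)
// pairs; the Reducer sums the counts for each word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit intermediate (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final (word, total)
        }
    }
}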
• HDFS follows the master-slave architecture and has the following elements.
• Namenode
• The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is software that can be run on commodity
hardware. The system hosting the namenode acts as the master server and does
the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files
and directories (see the sketch below).
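
Here is a minimal sketch of these namespace operations as seen from a client, using Hadoop's standard FileSystem API; all paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: client-side namespace operations mediated by the namenode.
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Rename: a pure metadata operation handled by the namenode.
        fs.rename(new Path("/data/old.txt"), new Path("/data/new.txt"));

        // Open: the namenode returns block locations; the actual bytes are
        // then read from the datanodes.
        try (FSDataInputStream in = fs.open(new Path("/data/new.txt"))) {
            System.out.println("first byte: " + in.read());
        }

        // Delete (non-recursive): removes the file from the namespace.
        fs.delete(new Path("/data/new.txt"), false);
    }
}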
Datanode

• The datanode is commodity hardware having the GNU/Linux operating system and
datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a
datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client requests.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Blocks
• Generally, the user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments, which are stored in individual
data nodes.
• These file segments are called blocks.
• In other words, a block is the minimum amount of data that HDFS can read or
write.
Namenode (master)
• Task: managing the filesystem tree & namespace. Metadata: the namespace image
and edit log (stored persistently).
• Task: keeping track of all the blocks. Metadata: the block locations (stored
just in memory).

• Single point of failure:
• Back up the persistent metadata files
• Run a secondary namenode (standby)
• Datanodes (workers):
• Store & retrieve blocks
• Report the lists of stored blocks to the namenode
Data Flow
• Anatomy of a file read in Hadoop:
• Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1.
• Each rack has 4 nodes, uniquely identified as R1N1, R1N2, and so on. The replication factor
is set to 3 and the HDFS block size is set to 64 MB (128 MB in Hadoop V2) by default.
• Background:
• The name node stores the HDFS metadata, such as file locations and permissions, in files
called the FSImage and edit logs.
• Files are stored in HDFS as blocks.
• The block locations themselves are not saved in any file.
• Instead, this information is gathered from the data nodes every time the cluster is started,
and it is kept in the name node's memory.

• Replication:
• Assuming the replication factor is 3:
• When a file is written from a data node (say R1N1), Hadoop attempts to save the first replica
on the same data node (R1N1).
• The second replica is written to a node (R2N2) on a different rack (R2).
• The third replica is written to another node (R2N1) on the same rack (R2) where the second
replica was saved (see the sketch below).
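
The placement rule can be summarized in a toy sketch; it follows the R1N1/R2N2 naming from the example and only illustrates the policy described above, not Hadoop's actual placement implementation.

import java.util.List;

// Toy sketch of the default replica placement described above.
// Not Hadoop's real BlockPlacementPolicy; names follow the R1N1 example.
public class ReplicaPlacement {
    // Returns [first, second, third] replica nodes for a block written
    // from `writer`, given two candidate nodes on a different (remote) rack.
    static List<String> placeReplicas(String writer, List<String> remoteRackNodes) {
        String first = writer;                  // replica 1: the writing node (e.g. R1N1)
        String second = remoteRackNodes.get(0); // replica 2: a node on a different rack (e.g. R2N2)
        String third = remoteRackNodes.get(1);  // replica 3: another node on the same remote rack (e.g. R2N1)
        return List.of(first, second, third);
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas("R1N1", List.of("R2N2", "R2N1")));
        // -> [R1N1, R2N2, R2N1]
    }
}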
