Lecture 2
INFORMATION TECHNOLOGY
Lecture Notes 2
© 2024 by Dr. Mohamed Attia
Content
❑ Introduction to Hadoop
❑ Hadoop Architecture
❑ Characteristics
❑ Hadoop Features
❑ HDFS
What is Hadoop ?
➢ Apache Hadoop is an open-source framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
What is Hadoop?
An open-source framework that allows distributed processing of
large data sets across a cluster of commodity hardware.
Distributed Processing
❖ Data is processed in a distributed manner on multiple nodes / servers
❖ Multiple machines process the data independently
[Figure: a cluster with a Resource Manager and Node Manager, and a NameNode and DataNode]
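The idea of several machines processing their own share of the data independently can be sketched on a single machine. The example below is only an illustration, not Hadoop code: worker threads stand in for cluster nodes, and the function names (`split_into_chunks`, `process_chunk`, `distributed_sum`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_workers):
    """Divide the records into roughly equal chunks, one per worker."""
    if not records:
        return []
    chunk_size = -(-len(records) // num_workers)  # ceiling division
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done independently by one worker: sum its own chunk."""
    return sum(chunk)


def distributed_sum(records, num_workers=4):
    """Split the data, process each chunk independently, then combine."""
    chunks = split_into_chunks(records, num_workers)
    # Threads stand in for cluster nodes in this sketch (an assumption).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))
```

Each worker sees only its own chunk, so no worker depends on another's result until the final combining step.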
What is YARN ?
YARN (Yet Another Resource Negotiator) is a core component of the Apache Hadoop
ecosystem, designed to manage computing resources in clusters and use them for scheduling
users' applications.
YARN Components
YARN consists of the following main components:
Basic Hadoop Architecture
[Figure: work divided into four sub-work units]
Open Source
➢ Source code is freely available
➢ Can be redistributed
➢ Can be modified
[Figure: open source benefits: free, affordable, no vendor lock-in, community]
Distributed Processing
Fault Tolerance
Servers
HDFS Architecture
[Figure: HDFS architecture, showing a Client issuing metadata ops to the Namenode (Metadata: Name, replicas, ..; /home/foo/data,6. ..) and block ops and reads to the Datanodes, with replication of blocks between Datanodes]
2/29/2024
Namenode and Datanodes
Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server that
manages the file system and regulates access to files by clients.
A cluster has a number of Datanodes, at least one.
Each Datanode holds at least one block.
The Datanodes manage the storage attached to the nodes that they run
on.
A file is split into one or more blocks, and these blocks are stored
in Datanodes.
Datanodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the Namenode.
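How a file becomes blocks spread over Datanodes can be sketched as follows. This is a simplified illustration: the round-robin placement and the names `split_into_blocks` and `place_blocks` are assumptions made for the example and do not reproduce the placement policy HDFS actually uses.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb):
    """Return the sizes (in MB) of the blocks a file is split into."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * num_blocks
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes[-1] = remainder  # the last block may be only partly filled
    return sizes


def place_blocks(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct Datanodes.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough Datanodes for the replication factor")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(600, 256)
    print(sizes)
    print(place_blocks(len(sizes), ["dn1", "dn2", "dn3", "dn4"], 3))
```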
Namenode operations
▪ The Namenode maintains and manages the Datanodes and assigns tasks to
them.
▪ The Namenode does not contain actual file data.
▪ The Namenode stores metadata about the actual data, such as file name, path,
number of data blocks, block IDs, block locations, number of replicas, and
other slave-related information.
▪ The Namenode manages all client requests (read, write) for the actual data files.
▪ The Namenode executes file system namespace operations such as opening and
closing files and renaming files and directories.
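The kind of bookkeeping listed above can be pictured with a small in-memory model. This is a toy sketch under stated assumptions: `FileMetadata` and `ToyNamenode` are hypothetical names, and the structure is far simpler than what a real Namenode keeps.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    """Metadata kept per file; no file contents are stored here."""
    path: str
    replication: int
    block_ids: list = field(default_factory=list)


class ToyNamenode:
    """Hypothetical, simplified model of Namenode metadata."""

    def __init__(self):
        self.files = {}            # path -> FileMetadata
        self.block_locations = {}  # block ID -> list of Datanode names

    def add_file(self, path, replication, blocks):
        """Register a file; `blocks` maps block ID -> Datanodes holding it."""
        self.files[path] = FileMetadata(path, replication, list(blocks))
        self.block_locations.update(blocks)

    def rename(self, old_path, new_path):
        """Namespace operation: only metadata changes, no block is moved."""
        meta = self.files.pop(old_path)
        meta.path = new_path
        self.files[new_path] = meta

    def locate(self, path):
        """Answer a client request with block locations, not with data."""
        return [(b, self.block_locations[b]) for b in self.files[path].block_ids]


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.add_file("/home/foo/data", 3, {"blk_1": ["dn1", "dn2", "dn3"]})
    nn.rename("/home/foo/data", "/home/foo/data_v2")
    print(nn.locate("/home/foo/data_v2"))
```

The client then contacts the listed Datanodes directly to read the blocks, which is why the Namenode never needs to hold the file data itself.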
Datanode operations
What happens if one of the Datanodes fails in HDFS?
1. The Namenode periodically receives a heartbeat and a block report from each Datanode in the
cluster.
2. Every Datanode sends a heartbeat message to the Namenode every 3 seconds. The heartbeat
simply tells the Namenode whether that Datanode is working properly, in other words,
whether it is alive.
3. A block report contains information about all the blocks that reside on the
corresponding Datanode.
4. When the Namenode does not receive a heartbeat message from a particular Datanode for
10 minutes (by default), it considers that Datanode dead (failed).
5. Because the blocks that were stored on the dead Datanode are now under-replicated, the
system starts copying them from one Datanode to another, using the block information from
the dead Datanode's block report.
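The failure-handling steps above can be sketched as a small simulation. It is an illustration only: `HeartbeatMonitor` and its methods are hypothetical names, timestamps are passed in explicitly as seconds rather than read from a clock, and the 3-second and 10-minute values are the ones quoted in these notes.

```python
HEARTBEAT_INTERVAL_S = 3   # heartbeat period quoted in the notes
DEAD_TIMEOUT_S = 10 * 60   # silence after which a Datanode is considered dead


class HeartbeatMonitor:
    """Hypothetical model of the Namenode's view of Datanode health."""

    def __init__(self, replication):
        self.replication = replication
        self.last_heartbeat = {}  # Datanode name -> time of last heartbeat
        self.block_reports = {}   # Datanode name -> set of block IDs it holds

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, blocks):
        self.block_reports[datanode] = set(blocks)

    def dead_datanodes(self, now):
        """Datanodes whose last heartbeat is older than the timeout."""
        return sorted(
            dn for dn, t in self.last_heartbeat.items() if now - t > DEAD_TIMEOUT_S
        )

    def under_replicated(self, now):
        """Map block ID -> number of extra replicas needed on live Datanodes."""
        dead = set(self.dead_datanodes(now))
        live_counts = {}
        for dn, blocks in self.block_reports.items():
            for block in blocks:
                live_counts.setdefault(block, 0)
                if dn not in dead:
                    live_counts[block] += 1
        return {
            block: self.replication - count
            for block, count in live_counts.items()
            if count < self.replication
        }


if __name__ == "__main__":
    monitor = HeartbeatMonitor(replication=2)
    for dn, blocks in {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_1"], "dn3": ["blk_2"]}.items():
        monitor.heartbeat(dn, now=0)
        monitor.block_report(dn, blocks)
    monitor.heartbeat("dn2", now=700)
    monitor.heartbeat("dn3", now=700)
    print(monitor.dead_datanodes(now=700), monitor.under_replicated(now=700))
```

In this sketch the blocks that need new copies are found from the dead Datanode's last block report, mirroring step 5.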
HDFS Example
Suppose a 10 GB file is to be stored in HDFS. The block size of the cluster is 256 MB and the
replication factor is 3. How much space do the Datanodes and the Namenode require?
Solution:
▪ 10 GB ≈ 10000 MB total file size to store (given)
# of blocks = file size / block size
▪ 10000 MB / 256 MB ≈ 39.06, rounded up to 40 blocks
▪ Replication factor is 3 (given)
Total number of blocks = replication factor * # of blocks
▪ So, the total number of blocks is 40 * 3 = 120 blocks
Size on Datanodes = total number of blocks * block size
▪ 120 blocks * 256 MB = 30720 MB ≈ 30 GB required on the Datanodes
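The arithmetic of this example can be checked with a few lines of Python. The function name `datanode_storage_mb` is hypothetical, and the sketch follows the slide's simplification of counting every block, including the last one, at the full block size.

```python
import math


def datanode_storage_mb(file_size_mb, block_size_mb, replication):
    """Return (blocks, total blocks with replicas, storage in MB).

    Assumes every block is counted at the full block size, as in the
    worked example.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    total_blocks = blocks * replication
    return blocks, total_blocks, total_blocks * block_size_mb


if __name__ == "__main__":
    blocks, total_blocks, storage_mb = datanode_storage_mb(10000, 256, 3)
    print(blocks, total_blocks, storage_mb, storage_mb / 1024)
```

Rounding the block count up matters: 39.06 blocks cannot be stored, so the file occupies 40 blocks, the last of which is only partly used.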
HDFS Example
Given metadata per block = 150 bytes, what is the size of the metadata on the Namenode?
Total metadata for Namenode = number of blocks × metadata per block
▪ 40 blocks × 150 bytes = 6000 bytes ≈ 5.86 KB
Thus, the Datanodes need about 30 GB to store the file, and the Namenode requires a minuscule
amount of space for metadata, approximately 5.86 KB.
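The metadata estimate is a single multiplication, sketched below. The function name `namenode_metadata_bytes` is hypothetical, and the sketch assumes the 40 unique blocks from the worked example, with replicas not counted separately.

```python
def namenode_metadata_bytes(num_blocks, metadata_per_block_bytes=150):
    """Estimate Namenode metadata as blocks times bytes of metadata per block."""
    return num_blocks * metadata_per_block_bytes


if __name__ == "__main__":
    total_bytes = namenode_metadata_bytes(40)
    print(total_bytes, round(total_bytes / 1024, 2))
```

Under these assumptions the metadata comes to a few kilobytes, several orders of magnitude smaller than the data held on the Datanodes.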