
FACULTY OF COMPUTERS AND INFORMATION TECHNOLOGY
Information Systems Department

IS467
Selected Topics in Information Systems-1 (Big Data)
4th Level, Spring 2024

Lecture Notes 2

© 2024 by Dr. Mohamed Attia
Content
❑ Introduction to Hadoop
❑ Hadoop nodes & daemons
❑ Hadoop Architecture
❑ Characteristics
❑ Hadoop Features
❑ HDFS
What is Hadoop?
➢ Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
➢ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
What is Hadoop?
An open-source framework that allows distributed processing of large data-sets across a cluster of commodity hardware.

Distributed Processing
❖ Data is processed in a distributed manner on multiple nodes/servers
❖ Multiple machines process the data independently

Cluster
❖ Multiple machines connected together
❖ Nodes are connected via LAN
Hadoop Components
Hadoop consists of three key parts: HDFS (storage), YARN (resource management), and MapReduce (processing).
Hadoop Nodes
Nodes in a Hadoop cluster are of two types:

▪ Master Node: runs the NameNode (HDFS) and the Resource Manager (YARN)
▪ Slave Node: runs the DataNode (HDFS) and the Node Manager (YARN)
What is YARN?
YARN (Yet Another Resource Negotiator) is a core component of the Apache Hadoop ecosystem, designed to manage computing resources in clusters and use them for scheduling users' applications.
YARN Components
YARN consists of the following main components:

▪ Resource Manager (RM):
The Resource Manager is the master service that controls all the available cluster resources and is responsible for allocating resources to the running applications based on constraints such as capacity, queues, etc.

▪ Node Manager (NM):
The Node Manager is a per-node slave service responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager. A minimal configuration sketch follows.
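To make the RM/NM split concrete, here is a minimal, hypothetical sketch (not from the lecture) using Hadoop's Java Configuration API. The hostname and resource values are placeholders; in a real cluster these properties normally live in yarn-site.xml.

import org.apache.hadoop.conf.Configuration;

public class YarnConfigSketch {
    public static void main(String[] args) {
        // Hadoop reads these from yarn-site.xml in practice; setting them
        // programmatically here just illustrates which daemon owns what.
        Configuration conf = new Configuration();

        // Resource Manager: the cluster-wide master that applications contact.
        conf.set("yarn.resourcemanager.hostname", "rm.example.com"); // placeholder host

        // Node Manager: per-node resource limits it reports to the Resource Manager.
        conf.set("yarn.nodemanager.resource.memory-mb", "8192"); // 8 GB per node (assumed)
        conf.set("yarn.nodemanager.resource.cpu-vcores", "4");   // 4 vcores per node (assumed)

        System.out.println("RM host: " + conf.get("yarn.resourcemanager.hostname"));
    }
}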
Basic Hadoop Architecture
[Diagram: one Work item is split into many Sub Works, which are distributed across the cluster nodes and processed in parallel.]
Hadoop Characteristics

Open Source
➢ Source code is freely available
➢ Can be redistributed
➢ Can be modified
[Diagram: word cloud around "Open Source": free, affordable, no vendor lock-in, community]
Distributed Processing
➢ Data is processed in a distributed manner on the cluster
➢ Multiple nodes in the cluster process data independently
[Diagram: Centralized Processing vs. Distributed Processing]
Fault Tolerance
➢ Failures of nodes are recovered automatically
➢ The framework takes care of failures of hardware as well as of tasks
➢ Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Reliability
➢ Data is reliably stored on the cluster of machines despite machine failures
➢ Failure of nodes doesn't cause data loss
Scalability
▪ Vertical Scalability – new hardware can be added to existing nodes
▪ Horizontal Scalability – new nodes can be added on the fly
Data Locality
▪ Move computation to data instead of data to computation.
▪ Data is processed on the nodes where it is stored.
[Diagram: instead of moving data from storage servers to app servers, the algorithm is shipped to the servers where the data resides.]
HDFS Architecture
[Diagram: the Namenode holds the metadata (name, replicas, e.g. /home/foo/data, 6, ...) and serves clients' metadata ops; clients perform read and write block ops directly against Datanodes, and blocks are replicated across Datanodes in different racks (Rack 1, Rack 2).]
Namenode and Datanodes
▪ Master/slave architecture.
▪ An HDFS cluster consists of a single Namenode, a master server that manages the file system and regulates access to files by clients.
▪ There are a number of DataNodes, at least one per cluster.
▪ Each DataNode holds at least one block.
▪ The DataNodes manage storage attached to the nodes that they run on.
▪ A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
▪ DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode (see the client sketch below).
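As an illustration of this division of labor, here is a minimal sketch (not part of the lecture) using Hadoop's Java FileSystem API: creating and opening files are metadata operations handled by the Namenode, while the actual bytes stream to and from DataNodes. The cluster URI and file path are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder URI
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // create(): the Namenode records the new file and allocates blocks;
        // the written bytes go to DataNodes (replicated per the cluster policy).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // open(): the Namenode returns block locations; the client then
        // reads the block contents directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}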
Namenode operations

▪ The Namenode maintains and manages the DataNodes and assigns tasks to them.
▪ The Namenode does not contain actual file data.
▪ The Namenode stores metadata of the actual data, such as file name, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information.
▪ The Namenode manages all client requests (read, write) for the actual data files.
▪ The Namenode executes file system namespace operations such as opening/closing files and renaming files and directories (sketched below).
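As a hypothetical illustration (the paths are placeholders), each FileSystem call below resolves to a pure namespace operation answered by the Namenode; no file data moves.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOpsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses the configured fs.defaultFS

        // Each call is a metadata/namespace operation handled by the Namenode.
        fs.mkdirs(new Path("/user/demo/reports"));                      // create a directory
        fs.rename(new Path("/user/demo/hello.txt"),
                  new Path("/user/demo/reports/hello.txt"));            // rename/move a file
        System.out.println(fs.exists(new Path("/user/demo/reports")));  // namespace lookup
        fs.delete(new Path("/user/demo/reports"), true);                // recursive delete
    }
}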
Datanode operations

▪ DataNodes are responsible for storing the actual data.
▪ Upon instruction from the Namenode, they perform operations such as creation, replication, and deletion of data blocks.
▪ When a DataNode goes down, it has no effect on the Hadoop cluster, thanks to replication.
▪ All DataNodes in the Hadoop cluster are synchronized so that they can communicate with each other for various operations.
What happens if one of the DataNodes fails in HDFS?
1. The Namenode periodically receives a heartbeat and a block report from each DataNode in the cluster.
2. Every DataNode sends a heartbeat message to the Namenode every 3 seconds. This health report simply tells the Namenode whether a particular DataNode is working properly; in other words, whether that DataNode is alive or not.
3. A block report of a particular DataNode contains information about all the blocks that reside on that DataNode.
4. When the Namenode does not receive any heartbeat message from a particular DataNode for 10 minutes (by default), that DataNode is considered dead or failed by the Namenode.
5. Since its blocks will then be under-replicated, the system starts re-replicating them from one DataNode to another, using the block information from the block report of the corresponding DataNode (see the toy simulation below).
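To make the timeout logic concrete, here is a toy simulation (not Hadoop's actual implementation; the intervals mirror the defaults described above) that marks a DataNode dead once no heartbeat has arrived within the timeout window.

import java.util.HashMap;
import java.util.Map;

// Toy model of the Namenode's liveness check: not Hadoop code, just the idea.
public class HeartbeatMonitorSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;      // DataNodes report every 3 s
    static final long DEAD_TIMEOUT_MS = 10 * 60 * 1_000;  // 10 minutes by default

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a heartbeat arrives from a DataNode.
    void onHeartbeat(String datanodeId, long nowMs) {
        lastHeartbeat.put(datanodeId, nowMs);
    }

    // True once the DataNode has been silent longer than the timeout,
    // at which point the Namenode would schedule re-replication of its blocks.
    boolean isDead(String datanodeId, long nowMs) {
        Long last = lastHeartbeat.get(datanodeId);
        return last == null || nowMs - last > DEAD_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch();
        monitor.onHeartbeat("dn-1", 0);
        System.out.println(monitor.isDead("dn-1", 5_000));       // false: heard recently
        System.out.println(monitor.isDead("dn-1", 11 * 60_000)); // true: silent > 10 min
    }
}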
HDFS Example
Let's assume a 10 GB file is to be stored in HDFS. The block size of the cluster is 256 MB and the replication factor is 3. How much space does this 10 GB of data require on the DataNodes and on the NameNode?
Solution:
▪ Total file size to store: 10 GB = 10,240 MB (given)
# of blocks = file size / block size
▪ 10,240 MB / 256 MB = 40 blocks
▪ Replication factor = 3 (given)
Total number of blocks = replication factor × # of blocks
▪ So the total number of blocks is 40 × 3 = 120 blocks
DataNode storage = total # of blocks × block size
▪ 120 blocks × 256 MB = 30,720 MB = 30 GB required on the DataNodes
HDFS Example
Given 150 bytes of metadata per block, what is the size of the NameNode's metadata?

Total NameNode metadata = number of blocks × metadata per block

Total metadata in bytes: 40 × 150 = 6,000 bytes

Total metadata in GB: 6,000 bytes / (1024 × 1024 × 1024) ≈ 0.0000056 GB

Thus, the DataNodes need 30 GB to store the file, while the NameNode requires only a minuscule amount of space for its metadata, approximately 5.9 KB. A short sketch reproducing both calculations follows.
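The following minimal sketch (the figures are taken straight from the example above) reproduces both the DataNode storage and the NameNode metadata calculations.

// Reproduces the worked example: 10 GB file, 256 MB blocks, replication 3,
// 150 bytes of NameNode metadata per (unique) block.
public class HdfsSizingSketch {
    public static void main(String[] args) {
        long fileSizeMb = 10L * 1024;           // 10 GB expressed in MB
        long blockSizeMb = 256;
        int replicationFactor = 3;
        long metadataBytesPerBlock = 150;

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceil -> 40
        long totalBlocks = blocks * replicationFactor;                // 120
        long datanodeStorageMb = totalBlocks * blockSizeMb;           // 30,720 MB = 30 GB
        long namenodeMetadataBytes = blocks * metadataBytesPerBlock;  // 6,000 bytes

        System.out.println("Blocks: " + blocks);
        System.out.println("Total blocks (with replicas): " + totalBlocks);
        System.out.println("DataNode storage: " + datanodeStorageMb / 1024 + " GB");
        System.out.println("NameNode metadata: " + namenodeMetadataBytes + " bytes (~"
                + String.format("%.1f", namenodeMetadataBytes / 1024.0) + " KB)");
    }
}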