
FACULTY OF COMPUTERS AND INFORMATION TECHNOLOGY
Information Systems Department

IS467
Selected Topics in Information Systems-1 (Big Data)
4th Level, Spring 2024

Lecture Notes 2

© 2024 by Dr. Mohamed Attia
Content
❑ Introduction to Hadoop
❑ Hadoop nodes & daemons
❑ Hadoop Architecture
❑ Characteristics
❑ Hadoop Features
❑ HDFS
What is Hadoop?
➢ Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
➢ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
What is Hadoop?
An open-source framework that allows distributed processing of large data-sets across a cluster of commodity hardware.

Distributed Processing
❖ Data is processed in a distributed manner on multiple nodes/servers
❖ Multiple machines process the data independently

Cluster
❖ Multiple machines connected together
❖ Nodes are connected via LAN
Hadoop Components
Hadoop consists of three key parts: HDFS (storage), YARN (resource management), and MapReduce (processing).
Hadoop Nodes
Nodes in a Hadoop cluster are of two types:

▪ Master Node: runs the NameNode (HDFS) and the Resource Manager (YARN)
▪ Slave Node: runs the DataNode (HDFS) and the Node Manager (YARN)
What is YARN?
YARN (Yet Another Resource Negotiator) is a core component of the Apache Hadoop ecosystem, designed to manage computing resources in clusters and use them for scheduling users' applications.
YARN Components
YARN consists of the following main components:

▪ Resource Manager (RM):
The Resource Manager is the master service that controls all the available cluster resources and is responsible for allocating resources to the running applications based on constraints such as capacity, queues, etc.

▪ Node Manager (NM):
The Node Manager is a per-node slave service responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager. A minimal configuration sketch follows.
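To make the RM/NM split concrete, here is a minimal, hypothetical sketch (not from the lecture) using Hadoop's Java Configuration API. The hostname and resource values are placeholders; in a real cluster these properties normally live in yarn-site.xml.

import org.apache.hadoop.conf.Configuration;

public class YarnConfigSketch {
    public static void main(String[] args) {
        // Hadoop reads these from yarn-site.xml in practice; setting them
        // programmatically here just illustrates which daemon owns what.
        Configuration conf = new Configuration();

        // Resource Manager: the cluster-wide master that applications contact.
        conf.set("yarn.resourcemanager.hostname", "rm.example.com"); // placeholder host

        // Node Manager: per-node resource limits it reports to the Resource Manager.
        conf.set("yarn.nodemanager.resource.memory-mb", "8192"); // 8 GB per node (assumed)
        conf.set("yarn.nodemanager.resource.cpu-vcores", "4");   // 4 vcores per node (assumed)

        System.out.println("RM host: " + conf.get("yarn.resourcemanager.hostname"));
    }
}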
Basic Hadoop Architecture
[Diagram: one Work item is split into many Sub Works, which are distributed across the cluster nodes and processed in parallel.]
Hadoop Characteristics

Open Source
➢ Source code is freely available
➢ Can be redistributed
➢ Can be modified
[Diagram: word cloud around "Open Source": free, affordable, no vendor lock-in, community]
Distributed Processing
➢ Data is processed in a distributed manner on the cluster
➢ Multiple nodes in the cluster process data independently
[Diagram: Centralized Processing vs. Distributed Processing]
Fault Tolerance
➢ Failures of nodes are recovered automatically
➢ The framework takes care of failures of hardware as well as of tasks
➢ Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Reliability
➢ Data is reliably stored on the cluster of machines despite machine failures
➢ Failure of nodes doesn't cause data loss
Scalability
▪ Vertical Scalability – new hardware can be added to existing nodes
▪ Horizontal Scalability – new nodes can be added on the fly
Data Locality
▪ Move computation to data instead of data to computation.
▪ Data is processed on the nodes where it is stored.
[Diagram: instead of moving data from storage servers to app servers, the algorithm is shipped to the servers where the data resides.]
HDFS Architecture
[Diagram: the Namenode holds the metadata (name, replicas, e.g. /home/foo/data, 6, ...) and serves clients' metadata ops; clients perform read and write block ops directly against Datanodes, and blocks are replicated across Datanodes in different racks (Rack 1, Rack 2).]
Namenode and Datanodes
▪ Master/slave architecture.
▪ An HDFS cluster consists of a single Namenode, a master server that manages the file system and regulates access to files by clients.
▪ There are a number of DataNodes, at least one per cluster.
▪ Each DataNode holds at least one block.
▪ The DataNodes manage storage attached to the nodes that they run on.
▪ A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
▪ DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode (see the client sketch below).
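As an illustration of this division of labor, here is a minimal sketch (not part of the lecture) using Hadoop's Java FileSystem API: creating and opening files are metadata operations handled by the Namenode, while the actual bytes stream to and from DataNodes. The cluster URI and file path are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder URI
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // create(): the Namenode records the new file and allocates blocks;
        // the written bytes go to DataNodes (replicated per the cluster policy).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // open(): the Namenode returns block locations; the client then
        // reads the block contents directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}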
Namenode operations

▪ The Namenode maintains and manages the DataNodes and assigns tasks to them.
▪ The Namenode does not contain actual file data.
▪ The Namenode stores metadata of the actual data, such as file name, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information.
▪ The Namenode manages all client requests (read, write) for the actual data files.
▪ The Namenode executes file system namespace operations such as opening/closing files and renaming files and directories (sketched below).
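As a hypothetical illustration (the paths are placeholders), each FileSystem call below resolves to a pure namespace operation answered by the Namenode; no file data moves.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOpsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses the configured fs.defaultFS

        // Each call is a metadata/namespace operation handled by the Namenode.
        fs.mkdirs(new Path("/user/demo/reports"));                      // create a directory
        fs.rename(new Path("/user/demo/hello.txt"),
                  new Path("/user/demo/reports/hello.txt"));            // rename/move a file
        System.out.println(fs.exists(new Path("/user/demo/reports")));  // namespace lookup
        fs.delete(new Path("/user/demo/reports"), true);                // recursive delete
    }
}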
Datanode operations

▪ DataNodes are responsible for storing the actual data.
▪ Upon instruction from the Namenode, they perform operations such as creation, replication, and deletion of data blocks.
▪ When a DataNode goes down, it has no effect on the Hadoop cluster, thanks to replication.
▪ All DataNodes in the Hadoop cluster are synchronized so that they can communicate with each other for various operations.
What happens if one of the DataNodes fails in HDFS?
1. The Namenode periodically receives a heartbeat and a block report from each DataNode in the cluster.
2. Every DataNode sends a heartbeat message to the Namenode every 3 seconds. This health report simply tells the Namenode whether a particular DataNode is working properly; in other words, whether that DataNode is alive or not.
3. A block report of a particular DataNode contains information about all the blocks that reside on that DataNode.
4. When the Namenode does not receive any heartbeat message from a particular DataNode for 10 minutes (by default), that DataNode is considered dead or failed by the Namenode.
5. Since its blocks will then be under-replicated, the system starts re-replicating them from one DataNode to another, using the block information from the block report of the corresponding DataNode (see the toy simulation below).
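To make the timeout logic concrete, here is a toy simulation (not Hadoop's actual implementation; the intervals mirror the defaults described above) that marks a DataNode dead once no heartbeat has arrived within the timeout window.

import java.util.HashMap;
import java.util.Map;

// Toy model of the Namenode's liveness check: not Hadoop code, just the idea.
public class HeartbeatMonitorSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;      // DataNodes report every 3 s
    static final long DEAD_TIMEOUT_MS = 10 * 60 * 1_000;  // 10 minutes by default

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a heartbeat arrives from a DataNode.
    void onHeartbeat(String datanodeId, long nowMs) {
        lastHeartbeat.put(datanodeId, nowMs);
    }

    // True once the DataNode has been silent longer than the timeout,
    // at which point the Namenode would schedule re-replication of its blocks.
    boolean isDead(String datanodeId, long nowMs) {
        Long last = lastHeartbeat.get(datanodeId);
        return last == null || nowMs - last > DEAD_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch();
        monitor.onHeartbeat("dn-1", 0);
        System.out.println(monitor.isDead("dn-1", 5_000));       // false: heard recently
        System.out.println(monitor.isDead("dn-1", 11 * 60_000)); // true: silent > 10 min
    }
}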
HDFS Example
Let's assume a 10 GB file is to be stored in HDFS. The block size of the cluster is 256 MB and the replication factor is 3. How much space does this 10 GB of data require on the DataNodes and on the NameNode?
Solution:
▪ Total file size to store: 10 GB = 10,240 MB (given)
# of blocks = file size / block size
▪ 10,240 MB / 256 MB = 40 blocks
▪ Replication factor = 3 (given)
Total number of blocks = replication factor × # of blocks
▪ So the total number of blocks is 40 × 3 = 120 blocks
DataNode storage = total # of blocks × block size
▪ 120 blocks × 256 MB = 30,720 MB = 30 GB required on the DataNodes
HDFS Example
Given 150 bytes of metadata per block, what is the size of the NameNode's metadata?

Total NameNode metadata = number of blocks × metadata per block

Total metadata in bytes: 40 × 150 = 6,000 bytes

Total metadata in GB: 6,000 bytes / (1024 × 1024 × 1024) ≈ 0.0000056 GB

Thus, the DataNodes need 30 GB to store the file, while the NameNode requires only a minuscule amount of space for its metadata, approximately 5.9 KB. A short sketch reproducing both calculations follows.
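The following minimal sketch (the figures are taken straight from the example above) reproduces both the DataNode storage and the NameNode metadata calculations.

// Reproduces the worked example: 10 GB file, 256 MB blocks, replication 3,
// 150 bytes of NameNode metadata per (unique) block.
public class HdfsSizingSketch {
    public static void main(String[] args) {
        long fileSizeMb = 10L * 1024;           // 10 GB expressed in MB
        long blockSizeMb = 256;
        int replicationFactor = 3;
        long metadataBytesPerBlock = 150;

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceil -> 40
        long totalBlocks = blocks * replicationFactor;                // 120
        long datanodeStorageMb = totalBlocks * blockSizeMb;           // 30,720 MB = 30 GB
        long namenodeMetadataBytes = blocks * metadataBytesPerBlock;  // 6,000 bytes

        System.out.println("Blocks: " + blocks);
        System.out.println("Total blocks (with replicas): " + totalBlocks);
        System.out.println("DataNode storage: " + datanodeStorageMb / 1024 + " GB");
        System.out.println("NameNode metadata: " + namenodeMetadataBytes + " bytes (~"
                + String.format("%.1f", namenodeMetadataBytes / 1024.0) + " KB)");
    }
}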