
UNIT – 2

• Hadoop Distributed File System
• Processing Data with Hadoop
• Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)
• Introduction to MapReduce Programming:
• Introduction
• Mapper
• Reducer
• Combiner
• Partitioner
• Searching
• Sorting
• Compression
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware. It has many similarities with existing distributed file
systems; however, the differences from other distributed file systems are
significant. HDFS is highly fault-tolerant and is designed to be deployed on
low-cost hardware. It provides high-throughput access to application data and
is suitable for applications with large datasets.
Apart from the two core components, HDFS and MapReduce, the Hadoop framework
also includes the following two modules:
Hadoop Common − the Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN − a framework for job scheduling and cluster resource management.
HDFS
• HDFS (Hadoop Distributed File System) is used as the storage layer in a
Hadoop cluster.
• HDFS provides fault tolerance and high availability to the storage layer and
the other devices present in the Hadoop cluster.
• Data storage nodes in HDFS:
• NameNode (master)
• DataNode (slave)
NameNode
• NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
• NameNode is mainly used for storing the metadata, i.e. the data about the data.
• Metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
• Metadata also includes the name of a file, its size, and the location information (block
number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster
communication.
• NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode
• DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop
cluster; the number of DataNodes can range from 1 to 500 or even more.
• The more DataNodes there are, the more data the Hadoop cluster can store.
• It is advised that DataNodes have a high storage capacity so that they can store a large
number of file blocks.
High Level Architecture Of Hadoop
Processing Data with Hadoop - Managing Resources and Applications with Hadoop YARN
File Block In HDFS:
Data in HDFS is always stored in terms of blocks. A single file is divided into
multiple blocks of 128 MB each, which is the default size and can also be
changed manually.
• Let's understand this concept of breaking a file into blocks with an example.
• Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of
128 MB + 128 MB + 128 MB + 16 MB = 400 MB (see the sketch after this list).
• That means 4 blocks are created, each of 128 MB except the last one.
• Hadoop doesn't know or care what data is stored in these blocks, so it treats
the final file block as a potentially partial record, since it has no idea
about the record boundaries inside.
• In the Linux file system, the size of a file block is about 4 KB, which is
much smaller than the default block size in the Hadoop file system.
• As we all know, Hadoop is mainly configured for storing large datasets, on
the petabyte scale. This is what makes the Hadoop file system different from
other file systems: it can scale. Nowadays file blocks of 128 MB to 256 MB are
used in Hadoop.
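
To make the block arithmetic concrete, here is a minimal sketch in Java; the 400 MB file size and 128 MB block size are the values from the example above.

// Sketch: how a 400 MB file is split into 128 MB HDFS blocks.
public class BlockSplit {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // default HDFS block size (128 MB)
        long fileSize = 400L * 1024 * 1024;  // example file of 400 MB

        long fullBlocks = fileSize / blockSize;                    // 3 full blocks
        long lastBlockMB = (fileSize % blockSize) / (1024 * 1024); // 16 MB remainder

        System.out.println(fullBlocks + " full blocks of 128 MB");
        if (lastBlockMB > 0) {
            System.out.println("1 final block of " + lastBlockMB + " MB");
        }
        // Prints: 3 full blocks of 128 MB, then 1 final block of 16 MB (4 blocks total).
    }
}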
Replication In HDFS
• Replication ensures the availability of the data.
• Replication means making a copy of something, and the number of copies made of a particular
item is expressed as its replication factor.
• HDFS stores the data in the form of blocks, and Hadoop is also configured to make copies of
those file blocks.
• By default, the replication factor for Hadoop is set to 3, and it can be changed manually as
per your requirement. In the example above we made 4 file blocks, so with 3 replicas (copies)
of each file block, a total of 4 × 3 = 12 blocks are stored for backup purposes (see the
sketch after this list).
• This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can
crash at any time.
• That is why HDFS needs a feature that makes copies of the file blocks for backup purposes;
this is known as fault tolerance.
• One thing to note: making so many replicas of the file blocks consumes extra storage, but
for large organizations the data is far more important than the storage, so nobody minds this
extra storage.
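
As an illustration, here is a minimal sketch of inspecting and changing a file's replication factor through Hadoop's standard FileSystem Java API; the path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reading and changing a file's replication factor in HDFS.
// "/data/sample.txt" is a hypothetical path used only for illustration.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication()); // default: 3

        // Raise this one file's replication factor to 4; the NameNode then
        // instructs DataNodes to create the extra copy of each block.
        fs.setReplication(file, (short) 4);
    }
}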
• What is a rack?
• A rack is a collection of around 40-50 DataNodes connected using the same
network switch.
• If the network switch goes down, the whole rack becomes unavailable.
• A large Hadoop cluster is deployed across multiple racks.
• What is Rack Awareness in Hadoop HDFS?
• In a large Hadoop cluster, there are multiple racks, and each rack consists
of DataNodes.
• Communication between DataNodes on the same rack is more efficient than
communication between DataNodes residing on different racks.
• To reduce network traffic during file reads/writes, the NameNode chooses the
closest DataNode to serve the client's read/write request (see the sketch
after this list).
• The NameNode maintains the rack ID of each DataNode to obtain this rack
information.
• This concept of choosing the closest DataNode based on the rack information
is known as Rack Awareness.
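
As a client-side illustration of rack awareness, here is a minimal sketch using Hadoop's standard BlockLocation API; the path is hypothetical, and topology paths such as /rack1/r1n1 appear only when rack awareness is actually configured on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: listing the rack/host placement of each block of a file.
public class RackPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt"); // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each topology path encodes rack + host, e.g. /rack1/r1n1
            System.out.println("Block " + i + ": "
                    + String.join(", ", blocks[i].getTopologyPaths()));
        }
    }
}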
Goals of HDFS
• Fault detection and recovery:
• Since HDFS includes a large number of commodity hardware components, failure
of components is frequent.
• Therefore HDFS should have mechanisms for quick and automatic fault detection
and recovery.
• Huge datasets:
• HDFS should scale to hundreds of nodes per cluster to manage applications
having huge datasets.
• Hardware at data:
• A requested task can be done efficiently when the computation takes place
near the data.
• Especially where huge datasets are involved, this reduces the network traffic
and increases the throughput.
HDFS Architecture
Terminology
• Payload - Applications implement the Map and Reduce functions, and form the core of
the job.
• Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs (see
the word-count sketch after this list).
• NameNode - Node that manages the Hadoop Distributed File System (HDFS).
• DataNode - Node where the data is placed in advance, before any processing takes place.
• MasterNode - Node where the Job Tracker runs and which accepts job requests from clients.
• SlaveNode - Node where the Map and Reduce programs run.
• Job Tracker - Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker - Tracks the tasks and reports status to the Job Tracker.
• Job - An execution of a Mapper and Reducer across a dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
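
To ground the Mapper and Reducer terminology, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: the classic word-count Mapper and Reducer.
// The Mapper turns (byte offset, line) pairs into intermediate (word, 1)
// pairs; the Reducer sums the counts for each word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit intermediate (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final (word, total)
        }
    }
}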
• HDFS follows the master-slave architecture and has the following elements.
• Namenode
• The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is software that can be run on commodity
hardware. The system hosting the namenode acts as the master server and does
the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files
and directories (see the sketch below).
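
Here is a minimal sketch of these namespace operations as seen from a client, using Hadoop's standard FileSystem API; all paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: client-side namespace operations mediated by the namenode.
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Rename: a pure metadata operation handled by the namenode.
        fs.rename(new Path("/data/old.txt"), new Path("/data/new.txt"));

        // Open: the namenode returns block locations; the actual bytes are
        // then read from the datanodes.
        try (FSDataInputStream in = fs.open(new Path("/data/new.txt"))) {
            System.out.println("first byte: " + in.read());
        }

        // Delete (non-recursive): removes the file from the namespace.
        fs.delete(new Path("/data/new.txt"), false);
    }
}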
Datanode

• The datanode is commodity hardware having the GNU/Linux operating system and
datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a
datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client requests.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Blocks
• Generally, the user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments, which are stored in individual
data nodes.
• These file segments are called blocks.
• In other words, a block is the minimum amount of data that HDFS can read or
write.
Namenode (master)
• Task: managing the filesystem tree & namespace. Metadata: the namespace image
and edit log (stored persistently).
• Task: keeping track of all the blocks. Metadata: the block locations (stored
just in memory).

• Single point of failure:
• Back up the persistent metadata files
• Run a secondary namenode (standby)
• Datanodes (workers):
• Store & retrieve blocks
• Report the lists of stored blocks to the namenode
Data Flow
• Anatomy of a file read in Hadoop:
• Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1.
• Each rack has 4 nodes, uniquely identified as R1N1, R1N2, and so on. The replication factor
is set to 3 and the HDFS block size is set to 64 MB (128 MB in Hadoop V2) by default.
• Background:
• The name node stores the HDFS metadata, such as file locations and permissions, in files
called the FSImage and edit logs.
• Files are stored in HDFS as blocks.
• The block locations themselves are not saved in any file.
• Instead, this information is gathered from the data nodes every time the cluster is started,
and it is kept in the name node's memory.

• Replication:
• Assuming the replication factor is 3:
• When a file is written from a data node (say R1N1), Hadoop attempts to save the first replica
on the same data node (R1N1).
• The second replica is written to a node (R2N2) on a different rack (R2).
• The third replica is written to another node (R2N1) on the same rack (R2) where the second
replica was saved (see the sketch below).
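
The placement rule can be summarized in a toy sketch; it follows the R1N1/R2N2 naming from the example and only illustrates the policy described above, not Hadoop's actual placement implementation.

import java.util.List;

// Toy sketch of the default replica placement described above.
// Not Hadoop's real BlockPlacementPolicy; names follow the R1N1 example.
public class ReplicaPlacement {
    // Returns [first, second, third] replica nodes for a block written
    // from `writer`, given two candidate nodes on a different (remote) rack.
    static List<String> placeReplicas(String writer, List<String> remoteRackNodes) {
        String first = writer;                  // replica 1: the writing node (e.g. R1N1)
        String second = remoteRackNodes.get(0); // replica 2: a node on a different rack (e.g. R2N2)
        String third = remoteRackNodes.get(1);  // replica 3: another node on the same remote rack (e.g. R2N1)
        return List.of(first, second, third);
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas("R1N1", List.of("R2N2", "R2N1")));
        // -> [R1N1, R2N2, R2N1]
    }
}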
