
UNIT III: Hadoop

Introduction to Hadoop:
History of Hadoop
Hadoop Distributed File System
Components of Hadoop
Analysing the Data with Hadoop
Scaling Out
Design of HDFS
Java Interfaces to HDFS Basics

12/06/2024
IS HADOOP SCALE UP OR SCALE OUT?
Scale up: faster servers, more memory, and more powerful processors.
Scale out: adding nodes for parallel computing.
Hadoop is designed to scale out.
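Scaling out can be illustrated with a toy sketch (plain Python, no Hadoop required; the node count and data here are made-up illustrations): work is partitioned across "nodes" that each process their shard independently, and the partial results are combined.

```python
# Toy illustration of scaling out: partition a dataset across N "nodes",
# let each node process its shard independently, then combine the results.
# This mirrors Hadoop's shared-nothing model; real Hadoop distributes
# shards across physical machines rather than lists in one process.

def partition(data, num_nodes):
    """Round-robin split of the input among num_nodes shards."""
    shards = [[] for _ in range(num_nodes)]
    for i, item in enumerate(data):
        shards[i % num_nodes].append(item)
    return shards

def process_shard(shard):
    """Per-node work: here, just a local sum."""
    return sum(shard)

data = list(range(1, 101))              # 1..100
shards = partition(data, num_nodes=4)
result = sum(process_shard(s) for s in shards)
print(result)                           # same answer as a single node: 5050
```

Adding nodes shrinks each shard, so per-node work drops roughly linearly — the scale-out effect the slide contrasts with buying a faster single server.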

Hadoop has two core components, both Apache projects:
HDFS (the Hadoop Distributed File System)
MapReduce
HDFS BUILDING BLOCKS

1. Name Node
2. Secondary Name Node
3. Data Node
4. Block Size
5. Resource Manager (Job Tracker)
6. Node Manager (Task Tracker)

Master daemons: Name Node, Secondary Name Node, Job Tracker.
Slave daemons: Data Node, Task Tracker.
BUILDING BLOCKS OF HADOOP

[Figure: diagram of the Hadoop building blocks listed above.]
HDFS Architecture

[Figure: HDFS architecture. The Namenode holds metadata (file name, replicas, e.g. /home/foo/data) and serves clients' metadata operations; clients read blocks from and write blocks to Datanodes directly (block ops); blocks are replicated across Datanodes on different racks (Rack 1, Rack 2).]
Fault Tolerance
Failure is the norm rather than the exception.
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Since there is a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
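The "failure is the norm" claim can be made concrete with a little arithmetic (the per-node failure probability used here is an illustrative assumption, not an HDFS figure):

```python
# If each of N independent nodes is down with probability p on a given day,
# the chance that at least one node is down is 1 - (1 - p)^N.
# Even a very reliable node (p = 0.001) almost guarantees that some node
# is down once the cluster is large enough.

def prob_any_failure(p, n):
    return 1 - (1 - p) ** n

for n in (10, 100, 1000, 10000):
    print(n, round(prob_any_failure(0.001, n), 4))
```

At a thousand nodes, some failure on any given day is more likely than not — which is why HDFS treats recovery as a routine operation rather than an emergency.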
Data Characteristics
Streaming data access for applications.
Batch processing rather than interactive user access.
Large data sets and files: gigabytes to terabytes in size.
High aggregate data bandwidth.
Scales to hundreds of nodes in a cluster.
Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency. (Compare the Java slogan: "write once, run anywhere.")
A MapReduce application or a web-crawler application fits this model perfectly.
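The MapReduce fit can be sketched in miniature with the canonical word-count example (plain Python, no Hadoop; the input lines are made up):

```python
# Miniature word count in the MapReduce style:
# map emits (word, 1) pairs, shuffle groups the pairs by key,
# reduce sums the counts for each word.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"], counts["data"])   # 2 2
```

Each input line can be mapped independently, which is what lets real MapReduce run the map phase in parallel across the cluster's Datanodes.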
FSIMAGE and EDITLOGS
FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS and the details of which data blocks make up each file. (Block-to-Datanode locations are rebuilt at startup from Datanode block reports rather than persisted.)
EditLogs is a transaction log that records changes to the HDFS file system, or any action performed on the HDFS cluster, such as addition of a new block, replication, or deletion. It records the changes made since the last FsImage was created.
When the Namenode starts, the latest FsImage file is loaded into memory.
Namenode and Datanodes
Master/slave architecture.
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of Datanodes, usually one per node in the cluster.
The Datanodes manage storage attached to the nodes they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and the set of blocks is stored on Datanodes.
Datanodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
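How a file maps onto blocks can be shown with a quick sketch (the 128 MB block size is the Hadoop 2.x default; the 1 GB file size is an arbitrary example):

```python
# A file is split into fixed-size blocks; all blocks except possibly the
# last are the same size. The last block holds only the remaining bytes.

BLOCK_SIZE = 128 * 1024 * 1024       # 128 MB, the HDFS 2.x default

def split_into_blocks(file_size):
    """Return the list of block sizes for a file of file_size bytes."""
    full, last = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([last] if last else [])

one_gb = 1024 * 1024 * 1024
blocks = split_into_blocks(one_gb + 1)   # 1 GB plus one extra byte
print(len(blocks), blocks[-1])           # 9 blocks; the last holds 1 byte
```

A 1 GB file fits exactly in eight 128 MB blocks; one extra byte forces a ninth, nearly empty block — illustrating why only the last block may be smaller than the rest.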
NAME NODE
The NameNode stores modifications to the file system as a log appended to a native file system file, edits.
When a NameNode starts up, it reads the HDFS state from an image file, fsimage, and then applies the edits from the edits log file.
It then writes the new HDFS state to the fsimage and starts normal operation with an empty edits file.
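The startup sequence above (load fsimage, replay edits, checkpoint the merged state) can be simulated with a toy namespace. The dictionaries and operation names here are stand-ins invented for this sketch; real HDFS uses binary on-disk formats.

```python
# Toy simulation of Namenode startup: load the last checkpoint (fsimage),
# replay the edit log on top of it, and the merged state becomes the new
# checkpoint, after which the edits file can start empty again.

def replay(fsimage, edit_log):
    namespace = dict(fsimage)            # state as of the last checkpoint
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = {"replication": args[0]}
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
    return namespace                     # written out as the new fsimage

fsimage = {"/data/a.txt": {"replication": 3}}
edits = [("create", "/data/b.txt", 2),
         ("rename", "/data/a.txt", "/data/old_a.txt"),
         ("delete", "/data/b.txt")]
print(sorted(replay(fsimage, edits)))    # ['/data/old_a.txt']
```

Keeping a compact checkpoint plus a log of deltas is what makes periodic checkpointing cheap: only the changes since the last fsimage need to be replayed after a crash.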
File System Namespace
The EditLog holds all the information about metadata changes.
The EditLog is stored in the Namenode's local filesystem.
Hierarchical file system with directories and files.
Operations: create, remove, move, rename, etc.
The Namenode maintains the file system namespace.
Any metadata change to the file system is recorded by the Namenode.
An application can specify the number of replicas of a file needed: the replication factor of the file. This information is stored by the Namenode.
Data Replication
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks.
All blocks in a file except the last are the same size.
Blocks are replicated for fault tolerance.
Block size and replica count are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each Datanode in the cluster.
A BlockReport lists all the blocks on a Datanode.
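The Heartbeat/BlockReport bookkeeping can be sketched as a toy model (the node names, block IDs, and the target replication of 3, which matches the HDFS default, are illustrative):

```python
# The Namenode aggregates BlockReports into a map of block -> datanodes
# holding it, ignores nodes whose heartbeats have stopped, and flags
# blocks whose live replica count has fallen below the target replication.
from collections import defaultdict

TARGET_REPLICATION = 3   # illustrative; matches the HDFS default

def find_under_replicated(block_reports, dead_nodes=()):
    """block_reports: {datanode: [block_ids]}; returns under-replicated blocks."""
    replicas = defaultdict(set)
    for node, blocks in block_reports.items():
        if node in dead_nodes:           # a node missing heartbeats
            continue
        for b in blocks:
            replicas[b].add(node)
    return sorted(b for b, nodes in replicas.items()
                  if len(nodes) < TARGET_REPLICATION)

reports = {"dn1": ["blk_1", "blk_2"],
           "dn2": ["blk_1", "blk_2"],
           "dn3": ["blk_1", "blk_2"]}
print(find_under_replicated(reports, dead_nodes={"dn3"}))   # ['blk_1', 'blk_2']
```

When dn3's heartbeats stop, every block it held drops to two live replicas, and the real Namenode would schedule re-replication onto healthy Datanodes.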
Namenode
Keeps an image of the entire file system namespace and file Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories.
When the Namenode starts up, it gets the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done, so that the system can recover to the last checkpointed state in case of a crash.
Datanode
A Datanode stores data in files in its local file system.
The Datanode has no knowledge of the HDFS file structure; it stores each block of HDFS data in a separate file.
A Datanode does not create all files in the same directory; it uses heuristics to determine the optimal number of files per directory and creates directories appropriately. (Research issue?)
When the filesystem starts up, the Datanode generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.
