
Tutorial 7

Hadoop Distributed File System
Ishu Gupta - 20114041
Introduction
• Subproject of Apache Hadoop
• Runs on commodity hardware
• Fault tolerant through block replication
• Provides high-throughput access to large data sets
Architecture
NameNode
• Master server that manages the metadata and namespace of the HDFS cluster.
• Stores information about the file system hierarchy, file metadata (such as permissions, modification
times), and the mapping of data blocks to DataNodes.
• NameNode is a critical component, and its failure can result in the unavailability of the entire HDFS
cluster.
• HDFS mitigates this single point of failure with the Secondary NameNode (which offloads checkpointing) and, in modern setups, NameNode High Availability (HA) with active and standby NameNodes.
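As an illustration, a client reaches an HA pair through a logical nameservice rather than a single host. Below is a minimal, hedged Java sketch of the client-side HA settings; the nameservice name "mycluster", the NameNode IDs, and the hostnames are hypothetical placeholders, not values from this tutorial.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host
        // ("mycluster", "nn1", "nn2", and the hosts are hypothetical).
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");
        // Client-side proxy that fails over to the current active NameNode.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Connected to " + fs.getUri());
        }
    }
}
```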
DataNode
• Worker nodes in the HDFS cluster responsible for storing actual data blocks.
• Store and manage the file data as blocks on their local disks. Each block is typically 128 megabytes
in size, although this can be configured per file.
• Replicate data blocks as per the replication factor specified for the file (usually three replicas by
default).
• Replication ensures fault tolerance; if a DataNode or a block becomes unavailable due to a failure,
HDFS can still retrieve the data from other available replicas on different DataNodes.
• Send periodic heartbeats to the NameNode to confirm that they are operational.
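To make the per-file block size and replication factor above concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The cluster URI and file path are hypothetical; the create() overload shown lets a client override the cluster-wide defaults (dfs.blocksize, dfs.replication) for a single file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Per-file overrides of the cluster defaults at creation time.
            FSDataOutputStream out = fs.create(
                new Path("/data/events.log"),  // hypothetical path
                true,                          // overwrite if it exists
                4096,                          // I/O buffer size in bytes
                (short) 3,                     // replication factor for this file
                256L * 1024 * 1024);           // block size: 256 MB for this file
            out.writeBytes("hello hdfs\n");
            out.close();
        }
    }
}
```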
Client
• HDFS clients interact with the Hadoop Distributed File System for file storage and retrieval.
• Perform operations like creating, reading, updating, and deleting files and directories in the HDFS
namespace.
• Communicate with the NameNode to obtain metadata information about files and directories.
• Query the NameNode to discover the locations of data blocks, allowing for efficient data access.
• Read data by contacting the NameNode for block locations and then fetching the data from the
nearest DataNode.
• During write operations, clients request the NameNode to choose suitable DataNodes to host block
replicas, and data is written in a pipelined manner for optimal throughput.
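As an illustration of these metadata queries, the sketch below lists a directory and asks the NameNode which DataNodes hold each block of a file; the cluster URI and paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical

        try (FileSystem fs = FileSystem.get(conf)) {
            // Namespace query: answered from the NameNode's metadata.
            for (FileStatus st : fs.listStatus(new Path("/data"))) {
                System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
            }

            // Block-location query: which DataNodes hold each block of a file.
            FileStatus file = fs.getFileStatus(new Path("/data/events.log"));
            for (BlockLocation loc : fs.getFileBlockLocations(file, 0, file.getLen())) {
                System.out.println("offset " + loc.getOffset()
                        + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }
}
```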
Interaction of NameNode, DataNode, and Client
Image and Journal
• Image - In-memory representation of the file system namespace and metadata stored in the
NameNode's memory.
• Includes information about files, directories, permissions, modification times, and block locations.
• Journal (edit log) - a write-ahead modification log that records changes made to the image.
• Captures metadata modifications such as file creations, deletions, and updates.
• Enhances durability and provides a way to replay changes to the namespace in case of failures.
CheckpointNode
• CheckpointNode periodically combines the existing checkpoint and journal to create a new
checkpoint and an empty journal
• Runs on a different host from the NameNode
• Downloads the current checkpoint and journal files from the NameNode, merges them locally, and
returns the new checkpoint back to the NameNode.
BackupNode
• Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the active NameNode.
• Accepts the journal stream of namespace transactions from the active NameNode, saves them to its
own storage directories, and applies these transactions to its own namespace image in memory.
• BackupNode can create a checkpoint without downloading checkpoint and journal files from the
active NameNode
• Can be viewed as a read-only NameNode: it holds all file system metadata except block locations.
Upgrades
• HDFS supports rolling upgrades.
• During a rolling upgrade, individual components such as NameNodes and DataNodes can be
upgraded one at a time while the cluster continues to operate with the remaining nodes, ensuring
minimal downtime.
• Upgrading HDFS involves ensuring compatibility between the new software version and the existing
components in the cluster.
Point-in-Time Recovery
• Snapshots are read-only copies of the file system at a specific moment, preserving the file and
directory structure as it existed when the snapshot was taken.
• Snapshots serve as valuable tools for point-in-time recovery and backup strategies. Administrators
can revert the file system to a specific snapshot in case of accidental data deletion or corruption.
• By periodically creating snapshots, organizations can establish a backup strategy that allows them to
recover data to a known and reliable state, enhancing data protection and disaster recovery
capabilities.
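For illustration, a snapshot can also be taken programmatically through the Java FileSystem API, assuming an administrator has first made the directory snapshottable (e.g., with hdfs dfsadmin -allowSnapshot). The path and snapshot name below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/data"); // hypothetical directory
            // Precondition: an administrator has marked the directory
            // snapshottable, e.g. with: hdfs dfsadmin -allowSnapshot /data
            Path snap = fs.createSnapshot(dir, "nightly-backup"); // name is hypothetical
            System.out.println("Snapshot created at " + snap);
            // The snapshot is read-only and reachable under /data/.snapshot/
        }
    }
}
```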
FILE I/O OPERATIONS AND REPLICA MANAGEMENT
File I/O Operations
READ -
• HDFS allows clients to read data from files stored in the system. Clients first contact the NameNode
to obtain the locations of data blocks.
• Data blocks are read from the nearest DataNodes, facilitating efficient and parallelized reading of
large files. HDFS supports streaming reads, enabling sequential access to data.
WRITE -
• When writing data to HDFS, clients request the NameNode to nominate a set of DataNodes to host
block replicas. Clients then write data in a pipeline fashion to these selected DataNodes.
• HDFS provides fault tolerance by replicating data blocks across multiple DataNodes. The client
ensures that multiple copies of the data are stored for reliability.
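A minimal read/write sketch with the Java FileSystem API follows; the cluster URI and path are hypothetical. The client code stays simple because block placement, pipelining, and replica selection happen inside the HDFS client library.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/notes.txt"); // hypothetical

            // WRITE: the client asks the NameNode for target DataNodes and
            // streams the bytes through the replication pipeline.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("first line\n");
            }

            // READ: the client fetches block locations from the NameNode,
            // then streams the data from the nearest DataNode.
            byte[] buf = new byte[4096];
            try (FSDataInputStream in = fs.open(path)) {
                int n = in.read(buf);
                System.out.write(buf, 0, n);
            }
        }
    }
}
```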
File I/O Operations
Append Operations:
• HDFS supports append operations, allowing clients to append data to existing files. Clients can
append data to the end of a file without altering its existing content.
• Appending data is achieved by adding new blocks to the file. Clients interact with the NameNode to
locate suitable DataNodes for appending new data blocks.
File Deletion and Renaming:
• Clients can delete files and directories from HDFS by sending requests to the NameNode. Deletion
involves removing the metadata associated with the file and releasing the corresponding data blocks.
• HDFS also supports file renaming, which involves changing the path of a file or directory. Renaming
operations are atomic and do not involve physically moving data blocks.
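The sketch below exercises append, rename, and delete through the Java FileSystem API; the cluster URI and paths are hypothetical, and append must be enabled on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileOpsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/notes.txt"); // hypothetical

            // APPEND: add bytes to the end of an existing file
            // (append support must be enabled on the cluster).
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("appended line\n");
            }

            // RENAME: a pure metadata operation on the NameNode;
            // no data blocks are moved.
            fs.rename(path, new Path("/data/notes-renamed.txt"));

            // DELETE: removes the metadata; the blocks are reclaimed later.
            fs.delete(new Path("/data/notes-renamed.txt"), false); // non-recursive
        }
    }
}
```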
Replica Management
[Diagram: Racks in HDFS]
• If a DataNode or block becomes unavailable due to hardware failure or other issues, the system can
continue functioning by retrieving the data from the remaining replicas.

Block Placement Policy:
• Determines where replicas are stored, optimizing data availability, reliability, and network bandwidth usage.
• The NameNode selects DataNodes for block placement, considering network proximity, rack awareness, and existing block locations, ensuring balanced distribution across the cluster.
• The default policy places the first replica on the writer's node and the second and third replicas on two different nodes in a different rack, so no DataNode holds more than one replica of a block.
Replica Management
Rebalancing -
• If a DataNode fails or the replication factor of a block falls below the specified threshold, the system initiates re-replication to restore the desired level of fault tolerance.
• Admins can manually rebalance replicas (e.g., with the HDFS Balancer tool) to ensure uniform distribution of data and storage utilization across the cluster.
Dynamic Replica Adjustment -
• HDFS allows administrators to dynamically adjust the replication factor for specific files or
directories based on changing storage requirements and data access patterns.
• Administrators can balance data durability with storage costs, optimizing resource utilization as
needed.
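As a hedged sketch of dynamic replica adjustment, FileSystem.setReplication() changes the target replication factor of an existing file, and the NameNode then adds or removes replicas in the background. The cluster URI and path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/hot-file.dat"); // hypothetical

            // Raise the replication factor of a heavily read file to 5;
            // the NameNode schedules the extra copies in the background.
            boolean ok = fs.setReplication(path, (short) 5);
            System.out.println("replication change accepted: " + ok);

            // Lower it back to the default of 3; excess replicas are removed.
            fs.setReplication(path, (short) 3);
        }
    }
}
```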
Tutorial 7

Thank You!
Ishu Gupta - 20114041
