Hadoop Distributed
File System (HDFS)
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move to where data
resides
• Provides very high aggregate bandwidth
Distributed File System
• Single Namespace for the entire cluster
• Data Coherency
• Write-once-read-many (WORM) access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 64MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find the location of blocks
• Client accesses data directly from DataNode
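As a concrete illustration of the "intelligent client" point, here is a minimal sketch using Hadoop's Java FileSystem API: the client asks the NameNode for the block locations of a file and can then read the data directly from those DataNodes. The path reuses /foodir/myfile.txt from the command examples later; the rest of the program is illustrative boilerplate, not code from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/foodir/myfile.txt");

        // Ask the NameNode which DataNodes hold each block of the file;
        // the data itself is then streamed directly from those DataNodes.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset "  + block.getOffset()
                             + " length " + block.getLength()
                             + " hosts "  + String.join(",", block.getHosts()));
        }
    }
}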
HDFS Architecture
• Master / Slave Architecture
• A single NameNode
• Many DataNodes
• Internally a file is split into
one or more blocks and
these blocks are stored in a
set of DataNodes.
Functions of a NameNode
• Manages the File System Namespace
• Maps a file name to a set of blocks
• Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Metadata in Memory
• The entire metadata is kept in main memory
• No demand paging of metadata
• Types of Metadata
• List of files
• List of blocks for each file
• List of DataNodes for each block
• File attributes, e.g. creation time, replication factor
• A Transaction Log
• Records file creations, file deletions, etc.
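A conceptual sketch (in Java, not the actual NameNode implementation) of the in-memory structures this slide lists: file-to-block and block-to-DataNode maps, file attributes, and a transaction log that records namespace mutations.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FileAttributes {
    long creationTime;        // set when the file is created
    short replicationFactor;  // e.g. 3
}

class NameNodeMetadata {
    // File name -> ordered list of block IDs making up the file.
    final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> DataNodes currently holding a replica.
    // Not logged: rebuilt from DataNode block reports after a restart.
    final Map<Long, List<String>> blockToDataNodes = new HashMap<>();
    // File name -> attributes such as creation time and replication factor.
    final Map<String, FileAttributes> fileAttributes = new HashMap<>();

    // Namespace mutations are also appended to the transaction (edit) log on
    // disk so the namespace survives a NameNode restart.
    void logTransaction(String record) { /* append to the edit log; omitted */ }

    void createFile(String name, short replication) {
        FileAttributes attrs = new FileAttributes();
        attrs.creationTime = System.currentTimeMillis();
        attrs.replicationFactor = replication;
        fileAttributes.put(name, attrs);
        fileToBlocks.put(name, new ArrayList<>());
        logTransaction("CREATE " + name);
    }
}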
DataNode
• A Block Server
• Stores data in the local file system (e.g. ext3)
• Stores metadata of a block (e.g. CRC)
• Serves data and metadata to Clients
• Block Report
• Periodically sends a report of all existing blocks to the
NameNode
• Facilitates Pipelining of Data
• Forwards data to other specified DataNodes
Block Placement
• Current Strategy
• One replica on the local node
• Second replica on a remote rack
• Third replica on the same remote rack
• Additional replicas are randomly placed
• Clients read from the nearest replicas
• Would like to make this policy pluggable
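A simplified sketch of the placement strategy described above. Node, rack handling and the helper names are illustrative stand-ins, not Hadoop's BlockPlacementPolicy classes, and the sketch assumes the cluster has enough nodes and racks to satisfy each rule.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class Node {
    final String name;
    final String rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
}

class SimplePlacementPolicy {
    private final Random random = new Random();

    List<Node> chooseTargets(Node writer, List<Node> cluster, int replication) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                                    // 1st replica: local node
        if (replication >= 2) {
            Node remote = pick(cluster, n -> !n.rack.equals(writer.rack), targets); // 2nd: a remote rack
            targets.add(remote);
            if (replication >= 3) {
                targets.add(pick(cluster, n -> n.rack.equals(remote.rack), targets)); // 3rd: same remote rack
            }
        }
        while (targets.size() < replication) {                                  // additional replicas: random
            targets.add(pick(cluster, n -> true, targets));
        }
        return targets;
    }

    private Node pick(List<Node> cluster, Predicate<Node> ok, List<Node> exclude) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (ok.test(n) && !exclude.contains(n)) {
                candidates.add(n);
            }
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}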
Heartbeats
• DataNodes send a heartbeat to the NameNode
• Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure.
• A network partition can cause a subset of DataNodes to
lose connectivity with NameNode.
• The NameNode marks DataNodes without recent
Heartbeats as dead and does not forward any new IO
requests to them.
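A minimal sketch of this failure-detection idea: record the time of each heartbeat and treat a DataNode as dead once no heartbeat has arrived for an expiry interval. The 3-second interval comes from the slide; the 10-minute expiry is an illustrative assumption, not a value from the slides.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;   // DataNodes send one every 3 seconds
    static final long EXPIRY_MS = 10 * 60 * 1_000;     // illustrative dead-node timeout

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a heartbeat arrives from a DataNode.
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // True if the NameNode should mark the node dead and stop sending it new IO.
    boolean isDead(String dataNodeId) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last == null || System.currentTimeMillis() - last > EXPIRY_MS;
    }
}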
Replication Engine
• NameNode detects DataNode failures
• Chooses new DataNodes for new replicas
• Balances disk usage
• Balances communication traffic to DataNodes
Data Correctness
• Use Checksums to validate data
• Use CRC32
• File Creation
• The client computes checksum per 512 bytes
• DataNode stores the checksum
• File Access
• The client retrieves the data and checksum from
DataNode
• If Validation fails, the Client tries other replicas
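A small sketch of the checksum scheme described above: one CRC32 value per 512-byte chunk, computed on write and verified on read. It uses java.util.zip.CRC32 and is not the actual HDFS checksum code.

import java.util.zip.CRC32;

class ChecksumExample {
    static final int BYTES_PER_CHECKSUM = 512;

    // Compute one CRC32 value per 512-byte chunk of the data.
    static long[] computeChecksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means this replica is bad
    // and the client should try another replica.
    static boolean verify(byte[] data, long[] expected) {
        long[] actual = computeChecksums(data);
        if (actual.length != expected.length) return false;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) return false;
        }
        return true;
    }
}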
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
• A directory on the local file system
• A directory on a remote file system (NFS/CIFS)
• Need to develop a real High Availability (HA) solution
Data Pipelining
• Client retrieves a list of DataNodes on which to place
replicas of a block.
• Client writes a block to the first DataNode.
• The first DataNode forwards the data to the next node
in the Pipeline.
• When all replicas are written, the Client moves on to
write the next block in the file.
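A conceptual, in-memory illustration of the pipeline: the client writes the block to the first DataNode only, and each DataNode stores the data and forwards it to the next node in the list. Class and method names are invented for the sketch.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DataNodeSim {
    final String name;
    final Map<Long, byte[]> localBlocks = new HashMap<>(); // stands in for local disk
    DataNodeSim(String name) { this.name = name; }

    // Store the block locally, then forward it down the rest of the pipeline.
    void writeBlock(long blockId, byte[] data, List<DataNodeSim> downstream) {
        localBlocks.put(blockId, data);
        System.out.println(name + " stored block " + blockId);
        if (!downstream.isEmpty()) {
            downstream.get(0).writeBlock(blockId, data,
                    downstream.subList(1, downstream.size()));
        }
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        DataNodeSim d1 = new DataNodeSim("dn1");
        DataNodeSim d2 = new DataNodeSim("dn2");
        DataNodeSim d3 = new DataNodeSim("dn3");
        // The client only talks to dn1; dn1 forwards to dn2, dn2 forwards to dn3.
        d1.writeBlock(42L, "block contents".getBytes(), List.of(d2, d3));
    }
}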
Rebalancer
• Goal: % disk full on DataNodes should be similar
• Usually run when new DataNodes are added
• Cluster is online when Rebalancer is active
• Rebalancer is throttled to avoid network congestion
• Command line tool
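• Typical invocation (not shown in the original slides, and assumed here for Hadoop 1.x): hadoop balancer [-threshold <percent>], where the threshold bounds how far a DataNode's disk utilization may deviate from the cluster average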
Secondary NameNode
• Copies FsImage and Transaction Log from NameNode to a
temporary directory
• Merges FsImage and Transaction Log into a new FsImage in a
temporary directory
• Uploads new FsImage to the NameNode
• Transaction Log on NameNode is purged
• FsImage is a file stored on the NameNode's local filesystem that contains the
complete directory structure (namespace) of HDFS and the mapping from each
file to its blocks. Block-to-DataNode locations are not persisted in the
FsImage; the NameNode rebuilds them from DataNode block reports.
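A conceptual sketch of that checkpoint cycle; the file names and helper methods are illustrative, not the actual SecondaryNameNode code.

import java.nio.file.Files;
import java.nio.file.Path;

class CheckpointSketch {
    // 1. Copy the current FsImage and transaction (edit) log from the NameNode.
    // 2. Replay the logged mutations over the image to build a new FsImage.
    // 3. Upload the new FsImage to the NameNode, which can then purge its log.
    void checkpoint(Path fsImage, Path editLog, Path tmpDir) throws Exception {
        Path imageCopy = Files.copy(fsImage, tmpDir.resolve("fsimage.ckpt"));
        Path logCopy   = Files.copy(editLog, tmpDir.resolve("edits.ckpt"));
        Path newImage  = merge(imageCopy, logCopy, tmpDir.resolve("fsimage.new"));
        uploadToNameNode(newImage);
    }

    private Path merge(Path image, Path edits, Path out) {
        // Replaying each logged mutation (create, delete, ...) over the loaded
        // namespace would happen here; omitted in this sketch.
        return out;
    }

    private void uploadToNameNode(Path newImage) {
        // In Hadoop this upload happens over HTTP; omitted in this sketch.
    }
}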
User Interface
• Commands for HDFS Users:
• hadoop dfs -mkdir /foodir
• hadoop dfs -cat /foodir/myfile.txt
• hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator
• hadoop dfsadmin -report
• hadoop dfsadmin -decommission datanodename
• Web Interface
• http://host:port/dfshealth.jsp
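The user commands above also have programmatic equivalents in the Java FileSystem API; a minimal sketch using the same /foodir paths (the API calls are real, the surrounding program is illustrative).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/foodir"));                   // hadoop dfs -mkdir /foodir

        // hadoop dfs -cat /foodir/myfile.txt
        try (FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.delete(new Path("/foodir/myfile.txt"), false); // hadoop dfs -rm /foodir/myfile.txt
    }
}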
Google File System
(GFS)
Motivation
• More than 15,000 commodity-class PCs.
• Multiple clusters distributed worldwide.
• Thousands of queries served per second.
• One query reads hundreds of MB of data.
• One query consumes tens of billions of CPU cycles.
• Google stores dozens of copies of the entire Web.
Conclusion: Need a large, distributed, highly fault-tolerant file system
Introduction
• Google File System (GFS) is a scalable distributed file system for large, distributed, data-intensive
applications.
• It is designed and implemented to meet the rapidly growing demands of Google’s data processing
needs.
• GFS shares many of the same goals as previous distributed file systems such as performance,
scalability, reliability, and availability.
• It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high
aggregate performance to a large number of clients.
Operations Support
• GFS supports the usual operations to create, delete, open, close, read, and write files.
• Moreover, GFS has snapshot and record append operations.
• Snapshot creates a copy of a file or a directory tree at low cost.
• Record append allows multiple clients to append data to the same file concurrently while guaranteeing
the atomicity of each individual client’s append.
• Record append is useful for implementing multi-way merge results and producer-consumer queues that
many clients can append to simultaneously without additional locking.
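GFS itself is not publicly available, so the client interface below is purely hypothetical (including the /logs/merged path); it exists only to illustrate the record-append semantics described above, where GFS rather than the client chooses the offset, so concurrent producers need no locking of their own.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

interface HypotheticalGfsClient {
    // Appends one record atomically; GFS chooses the offset and returns it.
    long recordAppend(String file, byte[] record);
}

class ProducerConsumerSketch {
    // Many producers append to the same (hypothetical) file concurrently
    // without any locking; atomicity of each append is guaranteed by GFS.
    static void runProducers(HypotheticalGfsClient gfs, List<byte[]> records) {
        List<CompletableFuture<Long>> appends = records.stream()
                .map(rec -> CompletableFuture.supplyAsync(
                        () -> gfs.recordAppend("/logs/merged", rec)))
                .collect(Collectors.toList());
        appends.forEach(f -> System.out.println("record written at offset " + f.join()));
    }
}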
Components of GFS
• Master (NameNode)
• Manages metadata (namespace)
• Not involved in data transfer
• Controls allocation, placement, replication
• Chunkserver (DataNode)
• Stores chunks of data
• No knowledge of GFS file system structure
• Built on the local Linux file system
GFS Architecture
• A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1.
Figure 1: GFS Architecture
GFS Architecture (Contd.)
• GFS uses a large chunk size: 64MB (so 1GB = 1024MB = 16 chunks).
• Each chunk is stored as a plain Linux file, which is lazily extended up to 64MB.
• The master maintains less than 64 bytes of metadata for each 64MB chunk.
• A large chunk makes it likely that clients perform many reads and writes on a given chunk.
• This reduces network overhead by keeping a persistent connection to the chunkserver.
• See also MapReduce, Bigtable.
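To put the 64-byte figure in perspective (illustrative arithmetic, not from the slides): 1PB of data split into 64MB chunks is 2^24, roughly 16.8 million, chunks; at under 64 bytes of metadata per chunk, the master needs on the order of 1GB of memory for all chunk metadata.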
Chunkserver:
Files are divided into fixed-size chunks (64MB).
Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by
the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by
a chunk handle and byte range.
For reliability, each chunk is replicated on multiple chunkservers. (Default 3 replicas).
GFS Client:
GFS client code linked into each application implements the file system API and communicates
with the master and chunkservers to read or write data on behalf of the application.
GFS Architecture (Contd.)
System Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping
from files to chunks, and the locations of each chunk’s replicas.
All metadata is kept in the master’s memory.
The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging
mutations to an operation log stored on the master’s local disk and replicated on remote machines.
Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in
the event of a master crash. The master does not store chunk location information persistently.
Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver
joins the cluster.
Write Control and Data Flow
Steps shown in the figure: 1) Who has the lease? 2) Lease info 3) Data push 4) Commit 5) Serialized commit 6) Commit ACK 7) Success
Note: A lease is a temporary grant from the master that makes one replica the primary for a chunk; it expires and must be renewed.
Step 1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other
replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
Step 2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client
caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable
or replies that it no longer holds a lease.
Step 3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data
in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we
can improve performance by scheduling the expensive data flow based on the network topology regardless of which
chunkserver is the primary.
Step 4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The
request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the
mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to
its own local state in serial number order.
Step 5. The primary forwards the write request to all secondary replicas. Each secondary replica applies
mutations in the same serial number order assigned by the primary.
Step 6. The secondaries all reply to the primary indicating that they have completed the operation.
Step 7. The primary replies to the client. Any errors encountered at any of the replicas are reported to
the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the
secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and
forwarded.) The client request is considered to have failed, and the modified region is left in an
inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few
attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
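A conceptual sketch of the client side of steps 1 through 7; the Master, Replica, and Primary interfaces are hypothetical (GFS's RPC layer is not public) and exist only to mirror the control flow above. For simplicity the client pushes data to each replica directly rather than along a chain.

import java.util.List;

interface Master {
    LeaseInfo getLease(long chunkHandle);                 // steps 1-2: lease holder and replica locations
}

interface Replica {
    void pushData(long chunkHandle, byte[] data);         // step 3: data flow, buffered at the chunkserver
}

interface Primary extends Replica {
    boolean commitWrite(long chunkHandle, byte[] data);   // steps 4-7: control flow through the primary
}

class LeaseInfo {
    Primary primary;
    List<Replica> secondaries;
}

class GfsWriteSketch {
    static boolean write(Master master, long chunkHandle, byte[] data) {
        LeaseInfo lease = master.getLease(chunkHandle);   // steps 1-2 (the client caches this)
        lease.primary.pushData(chunkHandle, data);        // step 3: push data to every replica
        for (Replica r : lease.secondaries) {
            r.pushData(chunkHandle, data);
        }
        // Step 4: ask the primary to commit. The primary assigns serial numbers,
        // forwards the write to the secondaries (step 5), collects their ACKs
        // (step 6), and replies to the client (step 7). On error the client
        // retries steps 3-7 before restarting the whole write.
        return lease.primary.commitWrite(chunkHandle, data);
    }
}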
Master Operation
• The master executes all namespace operations.
• In addition, it manages chunk replicas throughout the system: It makes placement decisions, creates
new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully
replicated, to balance load across all the chunkservers, and to reclaim unused storage.
• Namespace Management and Locking
• Replica Placement
• Creation, Re-replication, Rebalancing
• Garbage Collection
• Stale Replica Detection
Conclusion
• The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads
on commodity hardware.
• GFS provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery.
• Chunk replication allows us to tolerate chunkserver failures.
• GFS has successfully met our storage needs and is widely used within Google as the storage platform for research
and development as well as production data processing.
GFS vs HDFS