An introduction to the Hadoop Distributed File System
Explore HDFS framework and subsystems
01 Feb 2011
HDFS is fault tolerant and provides high-throughput access to large data sets. This
article explores the primary features of HDFS and provides a high-level view of the
HDFS architecture.
Overview of HDFS
HDFS has many similarities with other distributed file systems, but is different in
several respects. One noticeable difference is HDFS's write-once-read-many model,
which relaxes concurrency control requirements, simplifies data coherency, and
enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is usually better to locate
processing logic near the data rather than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a time. Bytes are always
appended to the end of a stream, and byte streams are guaranteed to be stored in
the order written.
HDFS has many goals. Here are some of the most notable:
• Fault tolerance through automatic detection of failures and recovery
• High-throughput access to large, streaming data sets
• Portability across heterogeneous commodity hardware and operating systems
• Locating processing logic near the data rather than moving data to the application space
The applications described in Table 1 are also available to interface with HDFS.

Table 1. Applications that interface with HDFS

FileSystem (FS) shell: A command-line interface similar to common Linux® and
UNIX® shells (bash, csh, etc.) that allows interaction with HDFS data.

DFSAdmin: A command set that you can use to administer an HDFS cluster.

fsck: A subcommand of the Hadoop command/application. You can use the fsck
command to check for inconsistencies with files, such as missing blocks, but you
cannot use it to correct those inconsistencies.

Name nodes and data nodes: These have built-in web servers that let
administrators check the current status of a cluster.
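A few illustrative invocations of these interfaces (the paths are hypothetical, and
the exact option set varies by Hadoop version):

hadoop fs -ls /user/sample                # FS shell: list a directory
hadoop fs -put local.txt /user/sample/    # FS shell: copy a local file into HDFS
hadoop dfsadmin -report                   # DFSAdmin: report cluster and data node status
hadoop fsck /user/sample -files -blocks   # fsck: check files and report block problems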
HDFS delivers this extensive feature set through a simple, yet powerful,
architecture.
HDFS architecture
HDFS is composed of interconnected clusters of nodes where files and directories
reside. An HDFS cluster contains a single name node (NameNode), which manages
the file system namespace and regulates client access to files, and a set of data
nodes (DataNodes) that store file data as blocks.
Within HDFS, a given name node manages file system namespace operations like
opening, closing, and renaming files and directories. A name node also maps data
blocks to data nodes, which handle read and write requests from HDFS clients. Data
nodes also create, delete, and replicate data blocks according to instructions from
the governing name node.
As Figure 1 illustrates, each cluster contains one name node. This design facilitates
a simplified model for managing each namespace and arbitrating data distribution.
Name nodes and data nodes are software components designed to run in a
decoupled manner on commodity machines across heterogeneous operating
systems. HDFS is built using the Java programming language; therefore, any
machine that supports the Java programming language can run HDFS. A typical
installation cluster has a dedicated machine that runs a name node and possibly one
data node. Each of the other machines in the cluster runs one data node.
Communications protocols
All HDFS communication protocols build on the TCP/IP protocol.
HDFS clients connect to a Transmission Control Protocol (TCP) port
opened on the name node, and then communicate with the name
node using a proprietary Remote Procedure Call (RPC)-based
protocol. Data nodes talk to the name node using a proprietary
block-based protocol.
Data nodes continuously loop, asking the name node for instructions. A name node
can't connect directly to a data node; it simply returns values from functions invoked
by a data node. Each data node maintains an open server socket so that client code
or other data nodes can read or write data. The host or port for this server socket is
known by the name node, which provides the information to interested clients or
other data nodes. See the Communications protocols sidebar for more about
communication between data nodes, name nodes, and clients.
The name node maintains and administers changes to the file system namespace.
HDFS also supports third-party file systems such as CloudStore and Amazon Simple
Storage Service (S3) (see Resources).
Data replication
HDFS replicates file blocks for fault tolerance. An application can specify the number
of replicas of a file at the time it is created, and this number can be changed any
time after that. The name node makes all decisions concerning block replication.
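As an illustrative sketch (the file path is a hypothetical placeholder), the
replication factor of an existing file can be changed through the
org.apache.hadoop.fs.FileSystem API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws IOException {
        // Loads cluster settings from the classpath configuration files
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path path = new Path("/user/sample/data.txt"); // hypothetical existing file

        // Ask the name node to keep three replicas of each block of this file;
        // the name node carries out the actual re-replication asynchronously.
        boolean accepted = hdfs.setReplication(path, (short) 3);
        System.out.println("Replication change accepted: " + accepted);
    }
}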
Rack awareness
Typically, large HDFS clusters are arranged across multiple
installations (racks). Network traffic between different nodes within
the same installation is more efficient than network traffic across
installations. A name node tries to place replicas of a block on
multiple installations for improved fault tolerance. However, HDFS
allows administrators to decide on which installation a node
belongs. Therefore, each node knows its rack ID, making it rack
aware.
HDFS uses an intelligent replica placement model for reliability and performance.
Optimizing replica placement makes HDFS unique from most other distributed file
systems, and is facilitated by a rack-aware replica placement policy that uses
network bandwidth efficiently.
Data organization
One of the main goals of HDFS is to support large files. The size of a typical HDFS
block is 64MB. Therefore, each HDFS file consists of one or more 64MB blocks.
HDFS tries to place each block on separate data nodes.
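Client code can observe this layout through the FileSystem API. The following is a
minimal sketch (the file path is a hypothetical placeholder) that prints each
block's offset and the data nodes holding its replicas:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws IOException {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FileStatus status = hdfs.getFileStatus(new Path("/user/sample/large.dat")); // hypothetical

        // One BlockLocation per block; each lists the data nodes holding a replica.
        BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
    }
}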
Manipulating files on HDFS is similar to the processes used with other file systems.
However, because HDFS is a multi-machine system that appears as a single disk,
all code that manipulates files on HDFS uses a subclass of the
org.apache.hadoop.fs.FileSystem object (see Resources).
The code shown in Listing 1 illustrates a typical file creation process on HDFS.
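A minimal sketch of that process follows, assuming Hadoop's configuration files are
available on the classpath; the destination path and file contents are hypothetical
placeholders.

Listing 1. Creating a file on HDFS

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateHdfsFile {
    public static void main(String[] args) throws IOException {
        // Reads cluster settings (for example, fs.default.name) from the classpath
        Configuration config = new Configuration();

        // Returns the HDFS FileSystem subclass when the configuration points at a name node
        FileSystem hdfs = FileSystem.get(config);

        // Hypothetical destination path and contents
        Path path = new Path("/user/sample/hello.txt");
        byte[] fileData = "Hello, HDFS".getBytes();

        // HDFS permits a single writer; bytes are appended in the order written
        FSDataOutputStream outputStream = hdfs.create(path);
        try {
            outputStream.write(fileData, 0, fileData.length);
        } finally {
            outputStream.close(); // flushes remaining data and completes the file
        }
    }
}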
Staging to commit
When a client creates a file in HDFS, it first caches the data in a temporary local
file and redirects subsequent writes to that file. When the temporary file
accumulates enough data to fill an HDFS block, the client reports this to the name
node, which inserts the file name into the file system hierarchy and allocates a
permanent block on a data node. The client then flushes the block of data from the
temporary file to that data node. When the file is closed, the client transfers any
remaining data, and the name node commits the file creation operation to its
persistent store.
Replication pipelining
When a client accumulates a full block of user data, it retrieves from the name node
a list of the data nodes that will hold a replica of that block. The client then
flushes the full data block to the first data node in the list. As that node receives
chunks of data, it writes them to disk and transfers copies to the next data node in
the list. The next data node does the same. This pipelining process is repeated until
the replication factor is satisfied.
Detection is the first step HDFS takes to overcome failures. HDFS uses heartbeat
messages to detect connectivity between name and data nodes.
HDFS heartbeats
Several things can cause loss of connectivity between name and data nodes.
Therefore, each data node sends periodic heartbeat messages to its name node, so
the name node can detect loss of connectivity if it stops receiving them. The name
node marks as dead any data node that stops responding to heartbeats and refrains
from sending further requests to it. Data stored on a dead node is no longer
available from that node, so the node is effectively removed from the system. If the
death of a node causes the replication factor of its data blocks to drop below the
minimum value, the name node initiates additional replication to restore the
required replication factor.
HDFS data blocks might not always be placed uniformly across data nodes,
meaning that the space on one or more data nodes can be underutilized.
Therefore, HDFS supports rebalancing data blocks using various models. One
model might move data blocks from one data node to another automatically if the
free space on a data node falls too low. Another model might dynamically create
additional replicas and rebalance other data blocks in a cluster if a sudden increase
in demand for a given file occurs. HDFS also provides the hadoop balancer
command for manual rebalancing tasks.
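For example, an administrator might invoke the balancer with a utilization
threshold (the value here is illustrative; it is the allowed deviation, in percent,
from the average cluster utilization):

hadoop balancer -threshold 5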
One common reason to rebalance is the addition of new data nodes to a cluster.
When placing new blocks, name nodes consider various parameters before
choosing the data nodes to receive them. Some of the considerations are:
• Keeping replicas of a block on separate racks, so that the failure of one rack does not make all copies unavailable
• Using network bandwidth efficiently, because traffic between nodes within a rack is cheaper than traffic across racks
• Spreading data blocks uniformly across the data nodes in the cluster
Data integrity
HDFS goes to great lengths to ensure the integrity of data across clusters. It uses
checksum validation on the contents of HDFS files by storing computed checksums
in separate, hidden files in the same namespace as the actual data. When a client
retrieves file data, it can verify that the data received matches the checksum stored
in the associated file.
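This verification happens transparently inside the client read path. A minimal read
sketch (with a hypothetical file path) looks like the following; a corrupted block
surfaces as a ChecksumException (an IOException subclass) if no intact replica can
be found:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumRead {
    public static void main(String[] args) throws IOException {
        FileSystem hdfs = FileSystem.get(new Configuration());

        // Checksum verification is on by default; shown here for emphasis.
        hdfs.setVerifyChecksum(true);

        FSDataInputStream in = hdfs.open(new Path("/user/sample/data.txt")); // hypothetical path
        try {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                // Process buffer[0..bytesRead); each chunk has already been
                // validated against the stored checksums by the client library.
            }
        } finally {
            in.close();
        }
    }
}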
The HDFS namespace is stored using a transaction log kept by each name node.
The file system namespace, along with file block mappings and file system
properties, is stored in a file called FsImage. When a name node is initialized, it
reads the FsImage file along with other files, and applies the transactions and state
information found in these files.
A name node uses a log file known as the EditLog to persistently record every
transaction that occurs to HDFS file system metadata. If the EditLog or FsImage files
become corrupted, the HDFS instance to which they belong ceases to function.
Therefore, a name node supports multiple copies of the FsImage and EditLog files.
With multiple copies of these files in place, any change to either file propagates
synchronously to all of the copies. When a name node restarts, it uses the latest
consistent version of FsImage and EditLog to initialize itself.
HDFS implements a permissions model for files and directories that has a lot in
common with the Portable Operating System Interface (POSIX) model; for example,
every file and directory is associated with an owner and a group. The HDFS
permissions model supports read (r), write (w), and execute (x). Because there is no
concept of file execution within HDFS, the x permission takes on a different
meaning. Simply put, the x attribute indicates permission for accessing a child
directory of a given parent directory. The owner of a file or directory is the identity of
the client process that created it. The group is the group of the parent directory.
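A brief sketch of working with these permissions through the FileSystem API (the
path, user, and group names are hypothetical; changing ownership typically requires
superuser privileges):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissions {
    public static void main(String[] args) throws IOException {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path path = new Path("/user/sample/report.txt"); // hypothetical path

        // rw-r----- : owner may read and write, group may read, others have no access
        hdfs.setPermission(path, new FsPermission((short) 0640));

        // Reassign owner and group (hypothetical names)
        hdfs.setOwner(path, "alice", "analysts");

        // Read the permissions back from the name node
        System.out.println(hdfs.getFileStatus(path).getPermission());
    }
}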
Snapshots
HDFS was originally planned to support snapshots that can be used to roll back a
corrupted HDFS instance to a previous state. However, HDFS support for snapshots
has been tabled for the time being.
Summary
Hadoop is an Apache Software Foundation distributed file system and data
management project whose goal is to store and manage large amounts of data.
Hadoop uses a storage system called HDFS to connect commodity personal
computers, known as nodes, contained within clusters over which data blocks are
distributed. You can access and store the data blocks as one seamless file system
using the MapReduce processing model.
HDFS shares many common features with other distributed file systems while
supporting some important differences. One significant difference is HDFS's
write-once-read-many model that relaxes concurrency control requirements,
simplifies data coherency, and enables high-throughput access.
Resources
Learn
• The Hadoop wiki provides community input related to Hadoop and HDFS.
• The Hadoop API site documents the Java classes and interfaces that are used
to program to Hadoop and HDFS.
• Wikipedia's MapReduce page is a great place to begin your research into the
MapReduce framework.
• Visit Amazon S3 to learn about Amazon's S3 infrastructure.
Get products and technologies
• The Hadoop project site contains valuable resources pertaining to the Hadoop
architecture and the MapReduce framework.
• The Hadoop Distributed File System project site offers downloads and
documentation about HDFS.
• Venture to the CloudStore site for downloads and documentation about the
integration between CloudStore, Hadoop, and HDFS.