HDFS
HDFS is a distributed file system that provides access to data across Hadoop clusters. A
cluster is a group of computers that work together. Like other Hadoop-related technologies,
HDFS is a key tool for managing and supporting the analysis of very large volumes of data,
on the order of petabytes.
Why HDFS?
Before 2011, storing and retrieving petabytes of data posed three
major challenges: cost, speed, and reliability. A traditional file system cost approximately
$10,000 to $14,000 per terabyte. Searching and analyzing the data was time-consuming and
expensive. Also, if pieces of the data were stored on different servers, fetching them was
difficult. Here's how HDFS resolves all three major issues of traditional file systems:
Cost
HDFS is open-source software, so it can be used with zero licensing and support costs. It
is designed to run on regular commodity computers.
Speed
Large Hadoop clusters can read or write more than a terabyte of data per second. A cluster
comprises multiple systems logically interconnected in the same network.
HDFS can easily deliver more than two gigabytes of data per second, per computer to
MapReduce, which is a data processing framework of Hadoop.
Reliability
HDFS copies the data multiple times and distributes the copies to individual nodes. A node is
a commodity server interconnected through a network device.
HDFS places at least one copy of the data on a different server, so if the data is deleted
from any node, it can still be found elsewhere within the cluster.
A regular file system, like a Linux file system, differs from HDFS in the size of its data
blocks. In a regular file system, each block of data is small, typically only a few kilobytes.
In HDFS, each block is 128 megabytes by default.
A regular file system provides access to large data but may suffer from disk input/output
problems mainly due to multiple seek operations.
On the other hand, HDFS can read large quantities of data sequentially after a single seek
operation. This makes HDFS unique since all of these operations are performed in a
distributed mode.
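The block arithmetic above can be sketched in a few lines. This is an illustrative calculation only; the constant and function names are made up, and the 128 MB figure is the default mentioned above.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 1 GB file occupies only 8 large blocks, each readable sequentially
# after a single seek, instead of thousands of small-block seeks.
one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))  # → 8
```

Note that a file smaller than one block still occupies one block entry in the metadata, which matters for the NameNode memory discussion later.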
Characteristics of HDFS
HDFS may consist of thousands of server machines, each storing a part of the file
system's data. HDFS detects faults that can occur on any of the machines and recovers
from them quickly and automatically.
HDFS is designed to store and scan millions of rows of data and to count or add
subsets of the data. The time required for this process depends on the
complexities involved.
It has been designed to support large datasets in batch-style jobs, with an
emphasis on high throughput of data access rather than low latency.
HDFS is economical
HDFS is designed so that it can be built on commodity hardware and
heterogeneous platforms, which are low-priced and easily available.
Similar to the example explained in the next section, HDFS stores files in a number of
blocks. Each block is replicated to a few separate computers, and the replication count can be
modified by the administrator. Data is divided into blocks of 128 megabytes each and replicated
across the local disks of cluster nodes. Metadata records the physical location of each block
and its replicas within the cluster and is stored in the NameNode. HDFS is the storage system
for both the input and output of MapReduce jobs. Let's understand how HDFS stores files with
an example.
How Does HDFS Work?
Here’s how HDFS stores files. Example - A patron gifted a collection of popular books to a
college library. The librarian decided to arrange the books on a small rack and then distribute
multiple copies of each book on other racks. This way the students could easily pick up a
book from any of the racks.
Similarly, HDFS creates multiple copies of a data block and keeps them in separate systems
for easy access.
Let’s discuss the HDFS Architecture and Components in the next section.
Broadly, the HDFS architecture is known as a master-slave architecture, which is shown
below.
The master node, that is the NameNode, is responsible for accepting jobs from clients. Its
task is to ensure that the data required for the operation is loaded and segregated into chunks
of data blocks.
HDFS exposes a file system namespace and allows user data to be stored in files. A file is
split into one or more blocks, stored, and replicated in the slave nodes known as the
DataNodes as shown in the section below.
The data blocks are then distributed to the DataNode systems within the cluster, which ensures
that replicas of the data are maintained. DataNodes serve read and write requests; they also
create, delete, and replicate blocks on instructions from the NameNode. As discussed
in the previous topic, it is the metadata that stores the block locations and replication details,
as explained in the diagram below.
There is also a Secondary NameNode, which performs housekeeping tasks for the NameNode and is
considered a master node as well. Prior to Hadoop 2.0.0, the NameNode was a Single Point of
Failure, or SPOF, in an HDFS cluster.
Each cluster had a single NameNode. In case of an unplanned event, such as a system failure,
the cluster would be unavailable until an operator restarted the NameNode.
Also, planned maintenance events, such as software or hardware upgrades on the NameNode
system, would result in cluster downtime.
The HDFS High Availability, or HA, feature addresses these problems by providing the
option of running two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby.
This allows a fast failover to a new NameNode in case a system crashes or an administrator
initiates a failover for the purpose of a planned maintenance.
In an HA cluster, two separate systems are configured as NameNodes. At any instance, one
of the NameNodes is in an Active state, and the other is in a Standby state.
The Active NameNode is responsible for all client operations in the cluster, while the
Standby simply acts as a slave, maintaining enough state to provide a fast failover if
necessary.
Shared storage using Network File System: In the shared-storage (NFS)
implementation, the Standby node keeps its state synchronized with the Active
node through access to a directory on a shared storage device.
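The Active/Standby arrangement described above can be pictured as a simple state machine. The sketch below is purely illustrative; the class, method, and node names are hypothetical and are not Hadoop APIs.

```python
# Minimal model of an HA NameNode pair: one Active, one hot Standby.
class HACluster:
    def __init__(self):
        # "nn1" and "nn2" are made-up names for the two NameNode hosts.
        self.active, self.standby = "nn1", "nn2"

    def failover(self):
        """Promote the hot standby when the active NameNode fails,
        or when an administrator initiates a planned failover."""
        self.active, self.standby = self.standby, self.active
        return self.active

cluster = HACluster()
print(cluster.active)      # nn1 serves all client operations
print(cluster.failover())  # nn2 takes over after the failover
```

The key point the model captures is that the standby already holds enough state that promotion is just a role swap, not a cold restart.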
HDFS Components
Namenode
Secondary Namenode
File system
Metadata
Datanode
Namenode
The NameNode server is the core component of an HDFS cluster. There can be only one
NameNode server in an entire cluster. The NameNode maintains and executes file system
namespace operations, such as opening, closing, and renaming files and directories
present in HDFS.
The namespace image and the edit log store information about the data and the metadata.
The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode
is a single point of failure. The DataNode, by contrast, is a multiple-instance server: there
can be any number of DataNode servers, depending on the type of network and the storage
system.
The DataNode servers store and maintain the data blocks. The NameNode server
provisions the data blocks on the basis of the type of job submitted by the client.
A DataNode also stores and retrieves blocks when asked by clients or the NameNode.
Furthermore, it serves read/write requests and performs block creation, deletion, and
replication on instruction from the NameNode. There can be only one Secondary NameNode server
in a cluster. Note that you cannot treat the Secondary NameNode server as a disaster recovery
server, although it can partially restore the NameNode server in case of a failure.
Secondary Namenode
The Secondary NameNode server maintains the edit log and namespace image information in
sync with the NameNode server. At times, the namespace images from the NameNode server
are not updated; therefore, you cannot totally rely on the Secondary NameNode server for the
recovery process.
File System
HDFS exposes a file system namespace and allows user data to be stored in files. HDFS has a
hierarchical file system with directories and files. The NameNode manages the file system
namespace, allowing clients to work with files and directories.
A file system supports operations like create, remove, move, and rename. The NameNode,
apart from maintaining the file system namespace, records any change to metadata
information.
Now that we have learned about HDFS components, let us see how NameNode works along
with other components.
Namenode: Operation
The NameNode maintains two persistent files: a transaction log called the EditLog and
a namespace image called the FsImage. The EditLog records every change that occurs to
the file system metadata, such as the creation of a new file.
The NameNode stores the EditLog in its local file system. The entire file system
namespace, including the mapping of blocks to files and the file system properties, is stored
in the FsImage, which is also kept in the NameNode's local file system.
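The relationship between the FsImage and the EditLog can be sketched as a replay: the FsImage is a checkpoint of the namespace, and applying the EditLog on top of it reconstructs the current state. The function, operation names, and paths below are illustrative only, not Hadoop internals.

```python
# Toy model: fsimage maps file path -> block list; the edit log is an
# ordered list of metadata changes recorded since the last checkpoint.
def apply_edit_log(fsimage: dict, edit_log: list) -> dict:
    namespace = dict(fsimage)  # start from the checkpoint
    for op, path in edit_log:
        if op == "create":
            namespace[path] = []      # new file, no blocks yet
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

checkpoint = {"/data/old.log": []}
edits = [("create", "/data/new.log"), ("delete", "/data/old.log")]
print(apply_edit_log(checkpoint, edits))  # {'/data/new.log': []}
```

After the replay, the merged result can itself be written out as the new checkpoint, which is exactly what the startup procedure described in the next section does.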
Metadata
When new DataNodes join a cluster, the NameNode loads the blocks that reside on each
DataNode into its memory at startup. It then periodically refreshes this information at user-
defined or default intervals.
When the NameNode starts up, it retrieves the EditLog and FsImage from its local file
system, applies the EditLog information to the FsImage, and stores a copy of the
FsImage on the file system as a checkpoint.
The metadata size is limited to the RAM available on the NameNode. A large number of
small files would require more metadata than a small number of large files. Hence, the in-
memory metadata management issue explains why HDFS favors a small number of large
files.
If a NameNode runs out of RAM, it will crash, and the applications will not be able to use
HDFS until the NameNode is operational again.
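The "many small files versus few large files" point can be made concrete with a rough back-of-the-envelope estimate. The 150-bytes-per-object figure below is a commonly cited rule of thumb, not an exact number, and the function is an illustration rather than how the NameNode actually accounts for memory.

```python
# Rough NameNode heap estimate: each file entry and each block costs
# on the order of 150 bytes of in-memory metadata (rule of thumb).
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024

def metadata_bytes(num_files: int, avg_file_size: int) -> int:
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceil div
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# One million 1 MB files vs. one thousand 1 GB files: roughly the same
# total data, but wildly different metadata footprints.
small_files = metadata_bytes(1_000_000, 1024 * 1024)
large_files = metadata_bytes(1_000, 1024 * 1024 * 1024)
print(small_files > large_files)  # True
```

Even with this crude model, the small-file layout needs hundreds of times more NameNode memory, which is why HDFS favors a small number of large files.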
Data block split is an important process of HDFS architecture. As discussed earlier, each file
is split into one or more blocks stored and replicated in DataNodes.
DataNode
DataNodes store and manage the file blocks. By default, each file block is 128
megabytes. A large block size potentially reduces the amount of parallelism that can be
achieved, because the number of blocks per file decreases.
Each map task operates on one block, so if there are fewer tasks than nodes in the cluster,
jobs will run slowly. However, this issue is less significant when the average MapReduce job
involves more files or larger individual files.
Block replication provides simplified replication, fault tolerance, and reliability.
Block replication refers to creating copies of a block on multiple DataNodes. Usually, the
data is split into parts, such as part 0 and part 1.
HDFS performs block replication on multiple DataNodes so that if an error occurs on one of
the DataNode servers, the JobTracker service can resubmit the job to another DataNode server.
The JobTracker service is present in the NameNode server.
Replication Method
In the replication method, each file is split into a sequence of blocks. All blocks except the
last one in the file are of the same size. Blocks are replicated for fault tolerance.
The block replication factor is usually configured at the cluster level but it can also be
configured at the file level.
The name node receives a heartbeat and a block report from each data node in the cluster.
The heartbeat denotes that the data node is functioning properly. A block report lists the
blocks on a data node.
The topology of the replicas is critical to ensuring the reliability of HDFS. Usually, each
block is replicated three times, and the suggested replication topology is as follows:
place the first replica on the same node as the client; place the second replica on a
different rack from that of the first replica; place the third replica on the same rack as
the second one but on a different node. Let's understand data replication through a simple
example.
The diagram illustrates a Hadoop cluster with three racks. A diagram for Replication and
Rack Awareness in Hadoop is given below.
Each rack consists of multiple nodes. R1N1 represents node 1 on rack 1. Suppose each rack
has eight nodes. The name node decides which data node belongs to which rack. Block 1
which is B1 is first written to node 4 on rack 1.
A copy is then written to a node on a different rack, which is node 5 on rack 2. The
third and final copy of the block is written to the same rack as the second copy but to a
different node, which is node 1 on rack 2.
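The placement policy walked through above can be sketched as code. This is a deliberately simplified model, not Hadoop's actual placement implementation; the topology data, rack names, and function name are all made up, and tie-breaking simply takes the first eligible node.

```python
# Sketch of the default rack-aware policy: replica 1 on the writer's
# node, replica 2 on a different rack, replica 3 on that second rack
# but a different node.
def place_replicas(topology: dict, client_node: str) -> list:
    """topology maps rack name -> list of node names on that rack."""
    client_rack = next(r for r, nodes in topology.items()
                       if client_node in nodes)
    replicas = [client_node]                        # replica 1
    remote_rack = next(r for r in topology if r != client_rack)
    replicas.append(topology[remote_rack][0])       # replica 2
    third = next(n for n in topology[remote_rack]   # replica 3
                 if n != replicas[1])
    replicas.append(third)
    return replicas

racks = {"rack1": ["r1n1", "r1n2"], "rack2": ["r2n1", "r2n2"]}
print(place_replicas(racks, "r1n2"))  # ['r1n2', 'r2n1', 'r2n2']
```

This layout survives the loss of any single node or any single rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.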
And now that you have learned about data replication topology or placement, let's discuss
how a file is stored in HDFS.
Let's say we have a large data file that is divided into four blocks. Each block is replicated
three times as shown in the diagram.
You might recall that the default size of each block is 128 megabytes.
The name node then carries metadata information of all blocks and its distribution. Let's work
on an example of HDFS which gives a deep understanding of all the points discussed so far.
Suppose you have 2 log files that you want to save from a local file system to the HDFS
cluster.
The cluster has 5 data nodes: node A, node B, node C, node D, and node E.
Now the first log is divided into three blocks, b1, b2, and b3, and the other log is divided
into two blocks, b4 and b5.
The blocks b1, b2, b3, b4, and b5 are distributed to node A, node B, node C, node D, and
node E, respectively, as shown in the diagram.
Each block will also be replicated three times across the five data nodes. All information
about the list of blocks and their replication, known as the metadata of the five
blocks, will be stored in the NameNode.
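The five-block, five-node example can be reproduced as a small table of the kind the NameNode keeps. The round-robin replica choice below is a simplification for illustration; real HDFS uses the rack-aware policy described earlier, and the node and block names match the example only by construction.

```python
# Metadata as the NameNode would track it: block -> replica locations.
REPLICATION = 3
nodes = ["A", "B", "C", "D", "E"]
blocks = ["b1", "b2", "b3", "b4", "b5"]

metadata = {
    block: [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
    for i, block in enumerate(blocks)
}
print(metadata["b1"])  # ['A', 'B', 'C']
print(metadata["b5"])  # ['E', 'A', 'B']
```

A client asking for a log file consults exactly this mapping: the NameNode answers with the replica locations, and the client then reads the blocks directly from those DataNodes.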
Now suppose a client asks for one of the logs you have stored. The inquiry goes to the
NameNode, and the client gets the information about the log file as shown in the diagram.
Based on the information from the NameNode, the client then receives the file from the
respective DataNodes. HDFS provides various access mechanisms: a Java API can be used
by applications, and there are also Python and C language wrappers for non-Java applications.
A web GUI can also be utilized through an HTTP browser, and an FS shell is available for
executing commands on HDFS.