Why is a Block in HDFS so Large?
Last Updated: 10 Sep, 2020
A disk has a block size, which is the minimum amount of data it can read or write in one operation. Filesystem blocks are built on top of disk blocks and are generally larger: filesystem blocks are normally a few kilobytes in size, while disk blocks are typically 512 bytes. This is usually transparent to the filesystem user, who simply reads or writes a file of whatever length, but tools such as df and fsck operate at the filesystem block level and are used for filesystem maintenance.
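As a quick illustration, on a Linux machine you can compare the two sizes yourself. This is only a sketch: the device name /dev/sda is an example and may differ on your system.
stat -f -c "filesystem block size: %S bytes" /    # filesystem block size, typically a few KB
sudo blockdev --getpbsz /dev/sda                  # physical sector size of the disk, typically 512 bytes or 4 KB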
Like other filesystems, HDFS (Hadoop Distributed File System) also stores and manages data in blocks, but in HDFS the default block size is much larger than in ordinary filesystems. Files in HDFS are broken into block-sized chunks that are stored as independent units. The default block size in HDFS is 64 MB, and it can be configured manually; blocks of 128 MB are commonly used in the industry. A notable advantage of storing data in HDFS is that a file smaller than a single block does not occupy a full block's worth of underlying storage.
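As a rough sketch, assuming a running HDFS installation, you can check the configured block size and override it per file from the command line; localfile.txt and /user/data/ below are placeholder paths for this example.
hdfs getconf -confKey dfs.blocksize                                   # prints the configured default block size in bytes
hdfs dfs -D dfs.blocksize=134217728 -put localfile.txt /user/data/    # writes this one file with 128 MB blocks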
Why is a Block in HDFS So Large?
HDFS blocks are much larger than disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time to transfer the data from the disk is significantly longer than the time to seek to the start of the block. Thus, transferring a large file made up of multiple blocks operates at close to the disk's sustained transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time the block size needs to be around 100 MB: the transfer itself must take about 1 second, and 100 MB/s × 1 s = 100 MB. The default size of HDFS blocks is 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as the transfer speed of disk drives grows.
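The arithmetic behind that estimate can be written as a small shell calculation (assuming bc is available); the numbers are only the illustrative figures from above.
# seek time ~10 ms, transfer rate ~100 MB/s, target: seek time = 1% of transfer time
echo "scale=2; 100 * (10 / 1000) / 0.01" | bc    # prints 100.00, i.e. a block of about 100 MB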
There are several other benefits that come from the block abstraction; let's discuss each of them.
- A file stored on HDFS can be larger than any single disk in the cluster. With block abstraction, there is no requirement that all the blocks of a file be stored on the same disk, so HDFS can use any of the disks available in the cluster.
- Making the unit of abstraction a block rather than a whole file simplifies the storage subsystem. Because blocks are of a fixed size, it is easy to calculate how many can be stored on a single disk. It also takes metadata out of the storage layer: blocks are just chunks of data stored on the DataNodes of the cluster, while file metadata (such as permissions) is managed separately by another component, the NameNode.
- Another benefit of block abstraction is that replicating data blocks is straightforward, so fault tolerance and high availability can be achieved in a Hadoop cluster. By default, each block is replicated on three physical machines in the cluster. If a block becomes unavailable, a copy can be read from another node that holds a replica. A block that is lost or corrupted is re-replicated from its remaining copies to another live machine to bring the replication factor back to its normal level. Some important applications may set a higher replication factor for extra availability and to spread the read load across the cluster, as shown in the sketch after this list.
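As a small sketch of how replication is controlled, assuming a file already exists in HDFS at a placeholder path, the current replication factor can be read and changed from the command line:
hdfs dfs -stat "replication: %r" /user/data/localfile.txt    # show the current replication factor of the file
hdfs dfs -setrep -w 4 /user/data/localfile.txt               # raise it to 4 and wait for the extra copies to be created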
The HDFS fsck command understands blocks. Running the fsck command below will list the blocks that make up each file in the filesystem.
hdfs fsck / -files -blocks
Let's do a quick hands-on to understand this:
Step 1: Start all your Hadoop daemons with the commands below.
start-dfs.sh    # starts the NameNode, DataNodes and Secondary NameNode
start-yarn.sh   # starts the ResourceManager and NodeManagers

Step 2: Check the status of the running daemons with the command below.
jps

Step 3: Run the HDFS fsck command.
hdfs fsck / -files -blocks


From the output of this command, we can observe all the details of the blocks that make up each file in our filesystem.
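If you also want to see which DataNodes hold the replicas of each block, fsck accepts additional flags:
hdfs fsck / -files -blocks -locations -racks    # also reports the DataNode (and rack) location of every block replica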