Why is a Block in HDFS so Large?
Last Updated: 10 Sep, 2020
A disk has a block size, which is the minimum amount of data it can read or write in one operation. Filesystem blocks are built on top of disk blocks and are generally larger: filesystem blocks are normally a few kilobytes in size, while disk blocks are typically 512 bytes. This is usually transparent to the filesystem user, who simply reads or writes a file of whatever length, but tools such as df and fsck operate at the filesystem block level and are used for filesystem maintenance.
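As a quick illustration, on a Linux machine you can compare the two sizes yourself. This is only a sketch: the device name /dev/sda is an example and may differ on your system.
stat -f -c "filesystem block size: %S bytes" /    # filesystem block size, typically a few KB
sudo blockdev --getpbsz /dev/sda                  # physical sector size of the disk, typically 512 bytes or 4 KB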
Like other filesystems, HDFS (Hadoop Distributed File System) also stores and manages data in blocks, but in HDFS the default block size is much larger than in ordinary filesystems. Files in HDFS are broken into block-sized chunks that are stored as independent units. The default block size in HDFS is 64 MB, and it can be configured manually; blocks of 128 MB are commonly used in the industry. A notable advantage of storing data in HDFS is that a file smaller than a single block does not occupy a full block's worth of underlying storage.
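As a rough sketch, assuming a running HDFS installation, you can check the configured block size and override it per file from the command line; localfile.txt and /user/data/ below are placeholder paths for this example.
hdfs getconf -confKey dfs.blocksize                                   # prints the configured default block size in bytes
hdfs dfs -D dfs.blocksize=134217728 -put localfile.txt /user/data/    # writes this one file with 128 MB blocks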
Why is a Block in HDFS So Large?
HDFS blocks are much larger than disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time to transfer the data from the disk is significantly longer than the time to seek to the start of the block. Thus, transferring a large file made up of multiple blocks operates at close to the disk's sustained transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time the block size needs to be around 100 MB: the transfer itself must take about 1 second, and 100 MB/s × 1 s = 100 MB. The default size of HDFS blocks is 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as the transfer speed of disk drives grows.
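The arithmetic behind that estimate can be written as a small shell calculation (assuming bc is available); the numbers are only the illustrative figures from above.
# seek time ~10 ms, transfer rate ~100 MB/s, target: seek time = 1% of transfer time
echo "scale=2; 100 * (10 / 1000) / 0.01" | bc    # prints 100.00, i.e. a block of about 100 MB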
There are several other benefits that come from the block abstraction; let's discuss each of them.
- A file stored on HDFS can be larger than any single disk in the cluster. With block abstraction, there is no requirement that all the blocks of a file be stored on the same disk, so HDFS can use any of the disks available in the cluster.
- Making the unit of abstraction a block rather than a whole file simplifies the storage subsystem. Because blocks are of a fixed size, it is easy to calculate how many can be stored on a single disk. It also takes metadata out of the storage layer: blocks are just chunks of data stored on the DataNodes of the cluster, while file metadata (such as permissions) is managed separately by another component, the NameNode.
- Another benefit of block abstraction is that replicating data blocks is straightforward, so fault tolerance and high availability can be achieved in a Hadoop cluster. By default, each block is replicated on three physical machines in the cluster. If a block becomes unavailable, a copy can be read from another node that holds a replica. A block that is lost or corrupted is re-replicated from its remaining copies to another live machine to bring the replication factor back to its normal level. Some important applications may set a higher replication factor for extra availability and to spread the read load across the cluster, as shown in the sketch after this list.
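As a small sketch of how replication is controlled, assuming a file already exists in HDFS at a placeholder path, the current replication factor can be read and changed from the command line:
hdfs dfs -stat "replication: %r" /user/data/localfile.txt    # show the current replication factor of the file
hdfs dfs -setrep -w 4 /user/data/localfile.txt               # raise it to 4 and wait for the extra copies to be created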
The HDFS fsck command understands blocks. Running the fsck command below will list the blocks that make up each file in the filesystem.
hdfs fsck / -files -blocks
Let's do a quick hands-on to understand this:
Step 1: Start all your Hadoop daemons with the commands below.
start-dfs.sh    # starts the NameNode, DataNodes and Secondary NameNode
start-yarn.sh   # starts the ResourceManager and NodeManagers

Step 2: Check the status of the running daemons with the command below.
jps

Step 3: Run the HDFS fsck command.
hdfs fsck / -files -blocks


From the output of this command, we can observe all the details of the blocks that make up each file in our filesystem.
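If you also want to see which DataNodes hold the replicas of each block, fsck accepts additional flags:
hdfs fsck / -files -blocks -locations -racks    # also reports the DataNode (and rack) location of every block replica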