Hadoop is a framework written in Java that uses a large cluster of commodity hardware to store and process very large volumes of data. Hadoop is built on the MapReduce programming model, which was introduced by Google. Today many big-brand companies, e.g. Facebook, Yahoo, Netflix and eBay, use Hadoop to deal with big data. The Hadoop architecture mainly consists of 4 components:
- MapReduce
- HDFS(Hadoop Distributed File System)
- YARN(Yet Another Resource Negotiator)
- Common Utilities or Hadoop Common
Let's understand the role of each of these components in detail.
1. MapReduce
MapReduce is a data processing model in Hadoop built on top of the YARN framework. Its main feature is distributed and parallel processing across a Hadoop cluster, which significantly boosts performance when handling Big Data. Since serial processing is inefficient at large scale, MapReduce enables fast and efficient data processing by splitting the workload into two phases: Map and Reduce.

MapReduce Workflow: The workflow begins when input data is passed to the Map() function. This data, often stored in blocks, is broken down into tuples (key-value pairs). These intermediate pairs are then shuffled, sorted by key and passed to the Reduce() function, which aggregates them and performs the required operations, such as counting or summing, to generate the final output. The result is then written to HDFS.
Note: The core processing typically happens in the Reducer, tailored to the specific business or analytical requirement.
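To make the two phases concrete, below is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. The class names and the whitespace split are illustrative assumptions, not part of Hadoop itself.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every whitespace-separated token in a line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: sum the counts for each word after shuffle and sort.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // final output record
    }
}
```

(In a real project each class would normally sit in its own file.)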
Map Task Components
- RecordReader: Reads input data and converts it into key-value pairs. The key usually represents location metadata, and the value contains the actual data.
- Mapper: A user-defined function that processes each key-value pair. It may output zero or more new key-value pairs.
- Combiner (Optional): Acts as a local mini-reducer to summarize or group Mapper outputs before sending them to the Reducer. It's used to reduce data transfer overhead.
- Partitioner: Determines which Reducer will receive which key-value pair. It uses the formula: key.hashCode() % numberOfReducers to assign partitions.
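As a sketch of that formula, the class below mirrors what Hadoop's default HashPartitioner does, specialised to the Text/IntWritable pairs from the word-count sketch above; the class name is illustrative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key, mask off the sign bit, and take the remainder modulo the
// number of reducers, so every occurrence of a key lands on the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```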
Reduce Task Components
- Shuffle and Sort: Intermediate key-value pairs from the Mapper are transferred (shuffled) to the Reducer and sorted by key. Shuffling starts as soon as some Mappers finish, speeding up the process.
- Reducer: Receives grouped key-value pairs and performs operations such as aggregation or filtering, based on business logic.
- OutputFormat: The final output is written to HDFS using a RecordWriter, where each line represents a record with the key and value separated by a space.
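A driver class ties these pieces together into a job. The sketch below reuses the WordCountMapper, WordCountReducer and WordPartitioner classes from the earlier sketches; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local mini-reducer
        job.setPartitionerClass(WordPartitioner.class); // decides which reducer gets each key
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path("/input/books"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```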
2. HDFS
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is designed to run on commodity hardware (inexpensive devices) using a distributed file system design, and it favors storing data in a small number of large blocks rather than many small ones.
HDFS provides fault tolerance and high availability to the storage layer and to the other nodes in the Hadoop cluster.
HDFS Architecture Components
1. NameNode (Master Node):
- Manages the HDFS cluster and stores metadata (file names, sizes, permissions, block locations); the client sketch after these lists shows how this metadata is queried.
- Does not store actual data, only metadata (via edit logs and fsimage).
- Controls file operations like create, delete, and replicate, directing DataNodes accordingly.
- Helps clients locate the nearest DataNode for fast access.
2. DataNode (Slave Node):
- Stores the actual data blocks on local disks.
- Handles read/write requests from clients.
- Sends heartbeats and block reports to the NameNode regularly.
- Data is replicated (default 3 copies) for reliability.
- More DataNodes = More storage & better performance; each should have high capacity and stable connectivity.
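To see how the NameNode's metadata and the DataNodes' blocks fit together from a client's point of view, here is a small sketch using the standard FileSystem API; the HDFS path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode
        Path file = new Path("/data/sample.txt");   // hypothetical HDFS path

        // The NameNode answers this from its metadata; no block data is read.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each entry lists the DataNodes holding a replica of that block.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```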
File Block In HDFS: In HDFS data is always stored in the form of blocks. By default, each block is 128MB in size, although this value can be manually configured depending on the use case (commonly increased to 256MB or more in modern systems).

Suppose you upload a file of 400MB to HDFS. Hadoop will divide this file into blocks as follows:
128MB + 128MB + 128MB + 16MB = 400MB
This results in four blocks: three of 128MB each and one final block of 16MB. Hadoop does not interpret the data content; it simply splits files based on size. As a result, a single record may span two blocks.
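The same arithmetic in a short sketch, using the default block size of 128MB (the dfs.blocksize setting):

```java
public class BlockSplitExample {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128MB, the HDFS default (dfs.blocksize)
        long fileSize = 400L * 1024 * 1024;    // the 400MB file from the example above

        long fullBlocks = fileSize / blockSize;   // 3 full blocks of 128MB
        long lastBlock = fileSize % blockSize;    // 16MB left over for the final block

        System.out.println(fullBlocks + " full blocks + 1 block of "
                + (lastBlock / (1024 * 1024)) + "MB");
        // Prints: 3 full blocks + 1 block of 16MB
    }
}
```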
Comparison with Traditional File Systems
- In traditional file systems, such as those used by Linux, the default block size is typically 4KB, which is significantly smaller than Hadoop's default block size of 128MB or more.
- HDFS uses larger block sizes to reduce metadata overhead and I/O operations, and to improve storage scalability and data-processing efficiency for large datasets.
Replication In HDFS: Replication in HDFS ensures data availability and fault tolerance by creating multiple copies of each data block.
- Default Replication Factor: 3 (configurable in hdfs-site.xml)
- If a file is split into 4 blocks, with a replication factor of 3: 4 blocks × 3 replicas = 12 total blocks
Since Hadoop runs on commodity hardware, failures are common. Replication prevents data loss if a node crashes. While it increases storage usage, organizations prioritize data reliability over saving space.
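Replication can be set cluster-wide through dfs.replication in hdfs-site.xml, or adjusted per file from client code. A minimal sketch with a hypothetical path and an illustrative replication factor of 5:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default replication for files this client creates

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing (hypothetical) file to 5,
        // e.g. for a hot dataset that many jobs read.
        fs.setReplication(new Path("/data/hot/lookup.csv"), (short) 5);
    }
}
```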
Rack Awareness: A rack is a group of machines (typically 30–40) in a Hadoop cluster. Large clusters have many racks. Rack Awareness helps the NameNode to:
- Choose the nearest DataNode for faster read/write operations.
- Reduce network traffic by minimizing inter-rack data transfer.
This improves overall performance and efficiency in data access.
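Rack awareness is typically enabled by pointing the cluster at a topology script through the net.topology.script.file.name property. The sketch below only illustrates the key involved; the script path is hypothetical, and in practice the property is set in core-site.xml on the NameNode rather than in client code.

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The script receives DataNode hostnames/IPs and prints a rack path
        // such as /rack1 for each one; the NameNode uses these paths when
        // placing replicas and when choosing the closest replica for reads.
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");

        System.out.println(conf.get("net.topology.script.file.name"));
    }
}
```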

3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer in the Hadoop ecosystem. It allows multiple data processing engines like MapReduce, Spark, and others to run and share cluster resources efficiently. It handles two core responsibilities:
- Job Scheduling: Divides a large job into smaller tasks and assigns them to various nodes in the Hadoop cluster. It also tracks job priority, dependencies, and execution time.
- Resource Management: Allocates and manages the cluster resources (CPU, memory, etc.) required for running these jobs.
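As a sketch of YARN's resource-management role from the client side, the snippet below uses the YarnClient API to ask the ResourceManager for its running nodes and applications; it assumes a reachable cluster whose configuration is on the classpath.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up yarn-site.xml from the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager which NodeManagers it currently manages.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
        }

        // List the applications (MapReduce, Spark, etc.) known to YARN.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " [" + app.getYarnApplicationState() + "]");
        }

        yarnClient.stop();
    }
}
```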
Key Features of YARN
- Multi-Tenancy: Supports multiple users and applications.
- Scalability: Efficiently scales to handle thousands of nodes and jobs.
- Better Cluster Utilization: Maximizes resource usage across the cluster.
- Compatibility: Works with existing MapReduce applications and supports other processing models.
4. Hadoop Common (Common Utilities)
Hadoop Common, also known as Common Utilities, includes the core Java libraries and scripts required by all the components in a Hadoop ecosystem such as HDFS, YARN and MapReduce. These libraries provide essential functionalities like:
- File system and I/O operations
- Configuration and logging
- Security and authentication
- Network communication
Hadoop Common ensures that the entire cluster works cohesively and that hardware failures, which are common in Hadoop's commodity hardware setup, are handled automatically by the software.
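Most of these utilities surface through classes in the org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io packages. Here is a small sketch of the shared file-system and I/O layer; the path is hypothetical, and the same code runs against HDFS or the local file system depending on fs.defaultFS.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CommonIoExample {
    public static void main(String[] args) throws Exception {
        // Configuration loading (core-site.xml, hdfs-site.xml) comes from Hadoop Common.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // HDFS or local FS, per fs.defaultFS

        Path path = new Path("/tmp/hadoop-common-demo.txt");   // hypothetical path

        // Write a small file through the shared FileSystem abstraction.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello from Hadoop Common\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy to stdout using Hadoop Common's IOUtils.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```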