
Hadoop - Architecture

Last Updated : 01 Jul, 2025

Hadoop is a framework written in Java that uses large clusters of commodity hardware to store and process big data. It is built on the MapReduce programming model introduced by Google. Many large companies, such as Facebook, Yahoo, Netflix and eBay, use Hadoop to handle big data. The Hadoop architecture mainly consists of four components:

  • MapReduce
  • HDFS (Hadoop Distributed File System)
  • YARN (Yet Another Resource Negotiator)
  • Hadoop Common (Common Utilities)
Hadoop architecture

Let's understand the role of each of these components in detail.

1. MapReduce

MapReduce is a data processing model in Hadoop built on top of the YARN framework. Its main feature is distributed and parallel processing across a Hadoop cluster, which significantly boosts performance when handling Big Data. Since serial processing is inefficient at large scale, MapReduce enables fast and efficient data processing by splitting the workload into two phases: Map and Reduce.

MapReduce workflow

MapReduce Workflow: The workflow begins when input data is passed to the Map() function. This data, often stored in blocks, is broken down into tuples (key-value pairs). These pairs are then passed to the Reduce() function, which aggregates them based on the keys and performs required operations such as sorting, counting or summing to generate the final output. The result is then written to HDFS.

Note: The core processing typically happens in the Reducer, tailored to the specific business or analytical requirement.
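To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. The class and variable names are illustrative, not part of any fixed Hadoop interface.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input line arrives as (byte offset, line text);
// the Mapper emits one (word, 1) pair per token.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce phase: all counts for the same word arrive grouped together
// and are summed to produce the final (word, total) record.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```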

Map Task Components

  • RecordReader: Reads input data and converts it into key-value pairs. The key usually represents location metadata, and the value contains the actual data.
  • Mapper: A user-defined function that processes each key-value pair. It may output zero or more new key-value pairs.
  • Combiner (Optional): Acts as a local mini-reducer to summarize or group Mapper outputs before sending them to the Reducer. It's used to reduce data transfer overhead.
  • Partitioner: Determines which Reducer receives each key-value pair, using key.hashCode() % numberOfReducers to assign partitions (a sketch of an equivalent custom partitioner follows this list).
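A custom Partitioner is not required, since Hadoop's default HashPartitioner already applies the formula above, but a hand-written version makes the routing rule explicit. A minimal sketch; the class name is illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (word, count) pair to a Reducer using
// the same rule as Hadoop's default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```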

Reduce Task Components

  • Shuffle and Sort: Intermediate key-value pairs from the Mapper are transferred (shuffled) to the Reducer and sorted by key. Shuffling starts as soon as some Mappers finish, speeding up the process.
  • Reducer: Receives grouped key-value pairs and performs operations such as aggregation or filtering, based on business logic.
  • OutputFormat: The final output is written to HDFS using a RecordWriter; with the default TextOutputFormat, each line represents one record with the key and value separated by a tab. A driver sketch that wires these components together follows this list.
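A driver class ties the pieces together: mapper, optional combiner, partitioner, reducer and the output format that the RecordWriter uses. A hedged sketch, assuming the WordCount classes and the partitioner sketched above, with input and output paths passed on the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);    // optional local mini-reducer
        job.setPartitionerClass(WordPartitioner.class);  // optional; HashPartitioner by default
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // RecordWriter emits tab-separated lines

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```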

2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is designed to run on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS favours storing data in large blocks rather than in many small ones.

HDFS provides fault tolerance and high availability to the storage layer and to the other nodes in the Hadoop cluster.

HDFS Architecture Components

1. NameNode (Master Node):

  • Manages the HDFS cluster and stores metadata (file names, sizes, permissions, block locations).
  • Does not store actual data, only metadata (via edit logs and fsimage).
  • Controls file operations like create, delete, and replicate, directing DataNodes accordingly.
  • Helps clients locate the nearest DataNode for fast access.

2. DataNode (Slave Node):

  • Stores the actual data blocks on local disks.
  • Handles read/write requests from clients.
  • Sends heartbeats and block reports to the NameNode regularly.
  • Data is replicated (default 3 copies) for reliability.
  • More DataNodes mean more storage and better performance; each node should have high capacity and stable connectivity (a client-side sketch follows this list).
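This split of responsibilities is visible from the client side: the NameNode answers metadata queries such as "where are the blocks of this file?", while the DataNodes later serve the actual bytes. A minimal sketch using the HDFS Java client; the path below is only a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode returns block metadata; no file data is read here.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```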

High Level Architecture Of Hadoop 

File Block In HDFS: In HDFS, data is always stored in the form of blocks. By default, each block is 128 MB in size, although this value can be manually configured depending on the use case (commonly increased to 256 MB or more in modern systems).

file blocks in HDFS

Suppose you upload a file of 400MB to HDFS. Hadoop will divide this file into blocks as follows:

128MB + 128MB + 128MB + 16MB = 400MB

This results in four blocks: three of 128 MB each and one final block of 16 MB. Hadoop does not interpret the data content; it simply splits files based on size. As a result, a single record may be split across two blocks.

Comparison with Traditional File Systems

  • In traditional file systems like Linux, the default block size is 4KB, which is significantly smaller than Hadoop’s default block size of 128MB or more.
  • HDFS uses larger block sizes to reduce metadata overhead and I/O operations, and to improve storage scalability and data processing efficiency for large datasets.

Replication In HDFS: Replication in HDFS ensures data availability and fault tolerance by creating multiple copies of each data block.

  • Default Replication Factor: 3 (configurable in hdfs-site.xml)
  • If a file is split into 4 blocks, with a replication factor of 3: 4 blocks × 3 replicas = 12 total blocks

Since Hadoop runs on commodity hardware, failures are common. Replication prevents data loss if a node crashes. While it increases storage usage, organizations prioritize data reliability over saving space.
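Both the replication factor and the block size can also be overridden per file when writing through the Java client, instead of relying only on the cluster-wide defaults in hdfs-site.xml. A hedged sketch; the path and values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");        // placeholder path

        // create(path, overwrite, bufferSize, replication, blockSize)
        short replication = 3;                           // default replication factor
        long blockSize = 256L * 1024 * 1024;             // 256 MB instead of the 128 MB default
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }

        // The replication factor can also be changed for an existing file.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```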

Rack Awareness: A rack is a group of machines (typically 30–40) in a Hadoop cluster. Large clusters have many racks. Rack Awareness helps the NameNode to:

  • Choose the nearest DataNode for faster read/write operations.
  • Reduce network traffic by minimizing inter-rack data transfer.

This improves overall performance and efficiency in data access.

HDFS Architecture

3. YARN (Yet Another Resource Negotiator)

YARN is the resource management layer in the Hadoop ecosystem. It allows multiple data processing engines like MapReduce, Spark, and others to run and share cluster resources efficiently. It handles two core responsibilities:

  • Job Scheduling: Breaks a large task into smaller jobs and assigns them to various nodes in the Hadoop cluster. It also tracks job priority, dependencies, and execution time.
  • Resource Management: Allocates and manages the cluster resources (CPU, memory, etc.) required for running these jobs (a small resource-request sketch follows this list).
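For a MapReduce job, the resources that YARN should allocate to each task container can be requested through standard configuration properties. A minimal sketch; the memory and vcore values below are only examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceRequestExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-container resources that YARN will try to allocate (example values).
        conf.setInt("mapreduce.map.memory.mb", 2048);     // 2 GB per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);  // 4 GB per reduce container
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);

        Job job = Job.getInstance(conf, "yarn resource demo");
        // ... set mapper/reducer/input/output as usual, then submit:
        // job.waitForCompletion(true);
    }
}
```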

Key Features of YARN

  • Multi-Tenancy: Supports multiple users and applications.
  • Scalability: Efficiently scales to handle thousands of nodes and jobs.
  • Better Cluster Utilization: Maximizes resource usage across the cluster.
  • Compatibility: Works with existing MapReduce applications and supports other processing models.

4. Hadoop Common (Common Utilities)

Hadoop Common, also known as Common Utilities, includes the core Java libraries and scripts required by all the components in a Hadoop ecosystem such as HDFS, YARN and MapReduce. These libraries provide essential functionalities like:

  • File system and I/O operations
  • Configuration and logging
  • Security and authentication
  • Network communication

Hadoop Common ensures that the entire cluster works cohesively and that hardware failures, which are common in Hadoop's commodity hardware setup, are handled automatically by the software.
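Most of these shared facilities surface through the org.apache.hadoop.conf.Configuration class, which every component uses to load core-site.xml and related files from the classpath. A brief sketch; the fallback values shown are just the usual defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Read settings shared by HDFS, YARN and MapReduce,
        // falling back to defaults if the site files do not set them.
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        int replication = conf.getInt("dfs.replication", 3);

        System.out.println("Default file system: " + defaultFs);
        System.out.println("Replication factor : " + replication);
    }
}
```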
 

