Hadoop - Reducer in Map-Reduce
Map-Reduce is a programming model that is mainly divided into two phases, i.e., the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class, along with a driver class. The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as the input for the Reducer.
But before these intermediate key-value pairs are sent to the Reducer, they are shuffled and sorted according to their keys, which means the key is the main deciding factor for the sort order. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtering, and aggregation. By default, the number of Reducers used to process the output of the Mappers is 1, which is configurable and can be changed by the user according to the requirement.
Let's understand the Reducer in Map-Reduce:

In the image above, we can observe multiple Mappers generating key-value pairs as output. The output of each Mapper is sent to the sorter, which sorts the key-value pairs by key; shuffling also takes place during the sorting process. The sorted output is then sent to the Reducer, which produces the final output.
Let's take an example to understand the working of the Reducer. Suppose we have the faculty data of all departments of a college stored in a CSV file. If we want to find the sum of the salaries of the faculty in each department, we can make the department title the key and the salaries the values. The Reducer will perform the summation operation on this dataset and produce the desired output.
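The following is a minimal sketch of such a Reducer, assuming the Mapper emits the department name as a Text key and the salary as an IntWritable value (the class and variable names here are illustrative, not from the original example):

// A minimal sketch: sums the salaries of each department.
// Assumes the Mapper emits (department, salary) as (Text, IntWritable) pairs.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalarySumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text department, Iterable<IntWritable> salaries, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // After shuffle and sort, all salaries of one department arrive together.
        for (IntWritable salary : salaries) {
            sum += salary.get();
        }
        total.set(sum);
        context.write(department, total);
    }
}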
Increasing the number of Reducers in a Map-Reduce job also affects the following:
- Framework overhead increases.
- The cost of failure reduces.
- Load balancing improves.
One thing we also need to remember is that each key is always processed by exactly one Reducer, although a single Reducer may handle many keys. Once the whole Reduce process is done, the output is stored in part files (the default naming) on HDFS (Hadoop Distributed File System). In the output directory on HDFS, Map-Reduce always creates a _SUCCESS file along with the part files. The number of part files depends on the number of Reducers: in case we have 5 Reducers, the part files will range from part-r-00000 to part-r-00004. By default, these files follow the part-r-nnnnn naming pattern. The base name can be changed manually; all we need to do is set the property below in the driver code of our Map-Reduce job.
// Here we are changing the output file base name from part-r-00000 to GeeksForGeeks-r-00000
job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");
The Reducer of Map-Reduce consists mainly of 3 processes/phases:
- Shuffle: Shuffling carries data from the Mappers to the appropriate Reducers. Over HTTP, the framework fetches the relevant partition of the output of every Mapper.
- Sort: In this phase, the output of the Mappers, i.e., the key-value pairs, is sorted on the basis of its keys.
- Reduce: Once shuffling and sorting are done, the Reducer combines the grouped results and performs the required computation on them. The OutputCollector.collect() method (Context.write() in the new API) is used for writing the output to HDFS. Keep in mind that the output of the Reducer is not re-sorted.
Note: Shuffling and Sorting both execute in parallel.
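To make these phases concrete, here is an illustrative trace for the salary example above (the numbers are made up):

// Mapper output (intermediate key-value pairs):
(CSE, 30000), (IT, 25000), (CSE, 40000), (IT, 35000)
// After shuffle and sort (values grouped by key):
(CSE, [30000, 40000]), (IT, [25000, 35000])
// Reducer output (sum per department):
(CSE, 70000), (IT, 60000)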
Setting the Number of Reducers in Map-Reduce:
- With the command line: While executing our Map-Reduce program, we can set the number of Reducers with the generic option -D mapreduce.job.reduces=<n> (the property was named mapred.reduce.tasks in the old API); an example command appears after this list.
- With the Job instance: In our driver class, we can specify the number of Reducers by calling job.setNumReduceTasks(int) on the Job instance.
For example, job.setNumReduceTasks(2) configures 2 Reducers. We can also set the number of Reducers to 0 in case we need only a Map job.
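For example, setting 2 Reducers from the command line might look as follows (the jar name and input/output paths are illustrative, and the -D option is only picked up if the driver parses generic options via ToolRunner):

hadoop jar salarysum.jar SalarySumDriver -D mapreduce.job.reduces=2 /input /output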
// Ideally, the number of Reducers in a Map-Reduce job should be set to:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
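Putting these pieces together, here is a minimal driver sketch (the class names and paths are illustrative; SalaryMapper is a hypothetical Mapper emitting (department, salary) pairs):

// A minimal driver sketch. ToolRunner lets generic options such as
// -D mapreduce.job.reduces=2 be picked up from the command line.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SalarySumDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "salary sum");
        job.setJarByClass(SalarySumDriver.class);
        // job.setMapperClass(SalaryMapper.class); // hypothetical Mapper emitting (dept, salary)
        job.setReducerClass(SalarySumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Two Reducers -> two part files: GeeksForGeeks-r-00000 and GeeksForGeeks-r-00001
        job.setNumReduceTasks(2);
        job.getConfiguration().set("mapreduce.output.basename", "GeeksForGeeks");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SalarySumDriver(), args));
    }
}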