Hadoop MapReduce - Data Flow
Last Updated: 30 Jul, 2020
Map-Reduce is a processing framework used to process data across a large number of machines. Hadoop uses Map-Reduce to process the data distributed across a Hadoop cluster. Map-Reduce differs from conventional frameworks and platforms such as Hibernate, the JDK, and .NET, which are designed for traditional systems where the data is stored in a single location, such as a Network File System or an Oracle database. With big data, however, the data is spread across multiple commodity machines with the help of HDFS.
When the data is stored on multiple nodes, we need a processing framework that can copy the program to the machines where the data resides, rather than moving the data to the program. This is where Map-Reduce comes into the picture for processing data on Hadoop over a distributed system. Moving massive volumes of data between nodes would generate heavy cross-switch network traffic, which is a major drawback in a Hadoop cluster. Map-Reduce addresses this with a feature called Data-Locality: the ability to move the computation close to the machines where the data is actually stored.
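The idea of data locality can be illustrated with a minimal sketch. The block names, node names, and the `pick_node` helper below are hypothetical and only simulate the scheduling preference; a real Hadoop scheduler is more involved (for example, it also considers rack locality).

```python
# Hypothetical block placement: block id -> nodes that hold a replica of it.
block_locations = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-c", "node-a"],
}


def pick_node(replica_nodes, free_nodes):
    """Return (node, is_local), preferring a free node that already stores the block."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, True  # data-local: the program moves to the data
    # No replica holder is free: fall back to any free node,
    # which means the block must be read over the network.
    return sorted(free_nodes)[0], False


free = {"node-a", "node-c"}  # assumption: node-b is busy
assignments = {
    block: pick_node(nodes, free) for block, nodes in block_locations.items()
}

for block, (node, is_local) in assignments.items():
    print(block, "->", node, "(local)" if is_local else "(remote)")
```

In this toy setup every block has a replica on a free node, so all three tasks run data-local and no block has to cross the network.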
Since Hadoop is designed to work on commodity hardware, it uses Map-Reduce, which is widely accepted and provides an easy way to process data over multiple nodes. Map-Reduce is not the only framework for parallel processing. Spark is now also a popular framework for distributed computing, and HAMA and MPI are other distributed processing frameworks.

Let's Understand Data-Flow in Map-Reduce
Map-Reduce consists of a Map phase and a Reduce phase. The map is used for transformation, while the reducer is used for aggregation-style operations. The terms Map and Reduce come from the map and reduce operations found in functional programming languages such as Lisp and Scala. A Map-Reduce program has three main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
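To make the functional origin of the two terms concrete, here is a small sketch using Python's built-in `map` and `functools.reduce`. The word list is made up for illustration, and this shows only the single-machine idea, not Hadoop's API.

```python
from functools import reduce

words = ["hadoop", "hdfs", "mapreduce"]  # illustrative input

# Map: transform every element independently (here, word -> its length).
lengths = list(map(len, words))

# Reduce: aggregate the transformed values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)  # [6, 4, 9]
print(total)    # 19
```

Because each element is transformed independently in the map step, that step can be spread over many machines, which is the property Map-Reduce builds on.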
Let's take an example where you have a 10TB file to process on Hadoop. The 10TB of data is first distributed across multiple nodes on Hadoop with HDFS. To process it, we use the Map-Reduce framework, and the entry point is the Driver code, which defines the Job. If we are using the Java programming language to process the data on HDFS, we need to set up this Driver class with the Job object. If the framework is a car, then the Driver code is the start button that starts the car. We need to run the Driver code to take advantage of the Map-Reduce framework.
The framework also provides predefined Mapper and Reducer classes, which developers extend and override according to the organization's requirements.
Brief Working of Mapper
The Mapper is the first piece of code that interacts with the input dataset. Suppose the dataset we are analyzing has 100 data blocks. In that case, there will be 100 Mapper programs or processes running in parallel on the machines (nodes), with one Mapper per input split, which by default typically corresponds to one data block. Each Mapper produces its own output, known as intermediate output, which is stored on the local disk, not on HDFS. The output of the Mappers is sorted and then acts as the input for the Reducer, which performs aggregation on the data and produces the final output.
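The following sketch imitates this behaviour on one machine, assuming a word-count job. The three strings stand in for input splits and a thread pool stands in for the cluster nodes; the `mapper` function and the sample data are hypothetical, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor

# Each string stands in for one input split (one data block by default).
splits = ["deer bear river", "car car river", "deer car bear"]


def mapper(split):
    """Emit one (word, 1) key-value pair per word in the split."""
    return [(word, 1) for word in split.split()]


# One mapper call per split, run in parallel; pool.map keeps the split order.
with ThreadPoolExecutor() as pool:
    intermediate = list(pool.map(mapper, splits))

for i, pairs in enumerate(intermediate):
    print(f"mapper {i} intermediate output:", pairs)
```

Each inner list plays the role of one Mapper's intermediate output, which in Hadoop would sit on that node's local disk until the shuffle.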
Brief Working of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces its output in the form of key-value pairs, which serve as the input for the Reducer. Before these intermediate key-value pairs are sent to the Reducer, they are shuffled and sorted according to their keys. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computations such as addition, filtration, and aggregation.
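A minimal sketch of the shuffle-and-sort step followed by a summing Reducer, again assuming a word-count job. The function names and the sample pairs are illustrative assumptions rather than Hadoop's actual classes.

```python
from collections import defaultdict

# Intermediate key-value pairs as they might come out of two mappers.
mapper_outputs = [
    [("deer", 1), ("bear", 1), ("river", 1)],
    [("car", 1), ("car", 1), ("river", 1)],
]


def shuffle_and_sort(outputs):
    """Group all values by key, then order the groups by key."""
    groups = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Aggregate all values that share a key (here, by summing them)."""
    return key, sum(values)


grouped = shuffle_and_sort(mapper_outputs)
results = [reducer(key, values) for key, values in grouped]

print(grouped)  # [('bear', [1]), ('car', [1, 1]), ('deer', [1]), ('river', [1, 1])]
print(results)  # [('bear', 1), ('car', 2), ('deer', 1), ('river', 2)]
```

The Reducer never sees individual pairs in arbitrary order: it receives each key once, together with every value emitted for that key.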

Steps of Data-Flow:
- Each Mapper processes a single input split at a time. The developer overrides the Mapper according to the business logic, and the Mappers run in parallel on the machines in the cluster.
- The intermediate output generated by each Mapper is stored on the local disk and then shuffled to the Reducers.
- Once the Mappers finish their tasks, the output is sorted, merged, and provided to the Reducer.
- The Reducer performs reducing tasks such as aggregation and other computations, and the final output is stored on HDFS in a file named part-r-00000 by default (with more than one Reducer, each writes its own part-r-NNNNN file).
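The steps above can be sketched end to end as a small single-process simulation. Everything here is an assumption made for illustration: the word-count logic, the `run_job` helper, and the CRC32-based `partition` function, which merely stands in for a hash-based partitioner. Only the part-r-NNNNN naming follows the text above.

```python
import zlib
from collections import defaultdict


def mapper(split):
    """Map step: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def partition(key, num_reducers):
    """Pick the reducer for a key (deterministic stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # Map + shuffle: route every intermediate pair to its reducer's partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in mapper(split):
            partitions[partition(key, num_reducers)][key].append(value)

    # Sort + reduce: each reducer sums its keys and "writes" one output file.
    output = {}
    for i, groups in enumerate(partitions):
        output[f"part-r-{i:05d}"] = [
            (key, sum(values)) for key, values in sorted(groups.items())
        ]
    return output


output = run_job(["deer bear river", "car car river", "deer car bear"])
for name, rows in output.items():
    print(name, rows)
```

With a single reducer, all keys end up in part-r-00000; with several, the keys are spread across the part files, but each key appears in exactly one of them.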