MapReduce Programming Model and its role in Hadoop.
Last Updated: 28 May, 2024
In the Hadoop framework, MapReduce is the programming model. It uses a map-and-reduce strategy to analyze data. A huge amount of data is available today, and processing it is a critical task. The MapReduce programming model addresses this by processing large datasets in parallel, which keeps the work both fast and efficient. Understanding this programming model, its components, and its execution workflow helps in using the Hadoop framework effectively.
What is MapReduce?
MapReduce is a parallel, distributed programming model in the Hadoop framework that is used to process the large volumes of data stored in the Hadoop Distributed File System (HDFS). Hadoop can run MapReduce programs written in various languages, such as Java, Ruby, and Python. A key benefit is that MapReduce programs are inherently parallel, which makes very large-scale data analysis practical.
Running MapReduce programs in parallel speeds up processing. Parallelizing the work involves the two concerns explained below.
- Dividing the input into fixed-size chunks: The work first has to be divided into pieces of equal size. When file sizes vary, assigning one file to each process is not a good way to do this, because some processes finish much earlier than others while a few take a very long time to complete. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.
- Combining the results: Combining the results from independent processes is a crucial step, because it often requires additional processing such as aggregating and finalizing the results.
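The two concerns above can be sketched in plain Python. This is only an illustration and does not use Hadoop: the function names (`split_into_chunks`, `process_chunk`, `combine`, `parallel_mean`), the chunk size, and the sample readings are all assumptions made for the example. Each chunk is processed independently, and the partial results need an extra aggregation and finalization step before they become the answer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split the input into consecutive chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Independent work per chunk: return a partial (sum, count) pair."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Aggregate the partial results, then finalize them into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0


def parallel_mean(records, chunk_size=3, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Each chunk is handed to a worker; pool.map keeps the chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return combine(partials)


readings = [12, 7, 3, 9, 15, 6, 10, 2]  # hypothetical sample data
print(split_into_chunks(readings, 3))   # [[12, 7, 3], [9, 15, 6], [10, 2]]
print(parallel_mean(readings))          # 8.0
```

Note that a mean cannot be computed by averaging the per-chunk means, because the last chunk is smaller. This is why each worker returns a sum and a count and the final division happens only in the combining step.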
Key components of MapReduce
MapReduce consists of two primary phases: the map phase and the reduce phase. Each phase takes key-value pairs as its input and produces key-value pairs as its output. The map phase is driven by a map function and the reduce phase by a reduce function.
- Mapper: The mapper is the first phase of MapReduce. It processes each input record, which the InputSplit and RecordReader present to it as a key-value pair, and generates new key-value pairs. These output pairs can be completely different from the input pair. The mapper output is the collection of all these key-value pairs.
- Reducer: The reducer is the second phase of MapReduce. It processes the output of the mapper and generates a new set of output, which can be stored in HDFS as the final output data.
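As a minimal sketch of the two phases, the following plain-Python example finds the maximum temperature per year. It is not Hadoop code: `map_fn`, `reduce_fn`, `run_job`, and the `year,temperature` record layout are hypothetical names and data chosen for illustration. The point is that both functions consume and produce key-value pairs, and that the mapper's output pair (year, temperature) differs from its input pair (record number, line).

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: turn one 'year,temperature' record into a (year, temperature) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_fn(year, temps):
    """Reduce: collapse all temperatures seen for one year into their maximum."""
    yield year, max(temps)


def run_job(lines):
    # Map phase: apply map_fn to every record and group the values by key.
    groups = defaultdict(list)
    for record_number, line in enumerate(lines):
        for key, value in map_fn(record_number, line):
            groups[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


records = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]
print(run_job(records))  # [('1949', 111), ('1950', 22)]
```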
Execution workflow of MapReduce
Now let’s look at how MapReduce job execution works and which components it involves. MapReduce processes data in several phases, each handled by a different component. The figure below illustrates the steps of the job execution workflow of MapReduce in Hadoop.
Figure: MapReduce job execution workflow
- Input Files: The data for a MapReduce task is stored in input files, which reside in HDFS. The format of these files is arbitrary: line-based log files and binary formats can both be used.
- InputFormat: The InputFormat defines how the input files are split and read. It selects the files or other objects used as input and creates the InputSplits.
- Record Reader: The RecordReader communicates with the InputSplit and converts its data into key-value pairs that the mapper can read. By default, Hadoop uses TextInputFormat, whose RecordReader treats each line as one record: the key is the line's byte offset within the file (a unique number) and the value is the content of the line. The RecordReader keeps reading from the InputSplit until the whole split has been read, and the key-value pairs are sent to the mapper for further processing.
- Mapper: The mapper receives input records from the RecordReader, processes them, and generates new key-value pairs. A pair generated by the mapper can be completely different from the input pair. The output of the mapper is intermediate output; because it is temporary data, it is stored on the local disk rather than in HDFS.
- Combiner: The combiner is also known as a mini-reducer. It is an optional step that performs local aggregation on the mapper’s output, which minimizes the data transferred between the mapper and the reducer. The output of the combiner is passed to the partitioner for further work.
- Partitioner: The partitioner comes into play when a job uses more than one reducer. It takes the output of the combiner and partitions it by key: a hash function applied to the key (or a subset of the key) derives the partition. Every record with the same key therefore lands in the same partition, and each partition is sent to one reducer. Within a partition, the records are sorted by key. Partitioning in this way distributes the map output evenly across the reducers.
- Shuffling and Sorting: Shuffling transfers the mappers’ output to the reducer nodes before the reduce phase begins. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted. The sorted output is passed as input to the reduce phase.
- Reducer: The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs the reduce function on each key together with its values to generate the output. The output of the reduce phase is the final output, and it is stored in HDFS.
- Record Writer: The RecordWriter writes the output key-value pairs from the reduce phase to the output files.
- Output format: The OutputFormat determines how the RecordWriter writes these output key-value pairs to the output files. The OutputFormat instances provided by Hadoop write files either to HDFS or to the local disk. Thus, the final output of the reducer is written to HDFS by an OutputFormat instance.
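The workflow above can be imitated end to end in a few lines of plain Python. This is a toy, single-process sketch under stated assumptions, not Hadoop's implementation: every function name is hypothetical, the input splits are just strings, `zlib.crc32` stands in for whatever hash function a real partitioner uses, and the stages run in the order the list describes them. The job is a word count.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def record_reader(split_text):
    """Yield (byte_offset, line) pairs for one input split, line by line."""
    offset = 0
    for raw in split_text.splitlines(keepends=True):
        yield offset, raw.rstrip("\r\n")
        offset += len(raw.encode("utf-8"))


def mapper(_offset, line):
    """Emit (word, 1) for every word; the byte-offset key is ignored."""
    for word in line.split():
        yield word.lower(), 1


def combiner(pairs):
    """Mini-reducer: sum the counts per word within a single map task."""
    local = {}
    for word, count in pairs:
        local[word] = local.get(word, 0) + count
    return list(local.items())


def partitioner(key, num_reducers):
    """Hash the key so that equal keys always map to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum all partial counts for one word."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for split_text in splits:  # one simulated map task per input split
        mapped = [kv for off, line in record_reader(split_text)
                  for kv in mapper(off, line)]
        for key, value in combiner(mapped):
            partitions[partitioner(key, num_reducers)].append((key, value))
    results = []
    for part in partitions:  # one simulated reduce task per partition
        part.sort(key=itemgetter(0))  # merge and sort by key
        results.append([reducer(key, [v for _, v in group])
                        for key, group in groupby(part, key=itemgetter(0))])
    return results


splits = ["Deer Bear River\nCar Car River\n", "Deer Car Bear\n"]
results = run_job(splits)
totals = sorted(kv for part in results for kv in part)
print(totals)  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Each inner list of `results` plays the role of one reducer's output file. Which words end up in which partition depends on the hash values, so the example prints only the merged totals.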
Conclusion
We have seen the concept of MapReduce and its components in the Hadoop framework. Understanding these key components and the MapReduce workflow shows how Hadoop uses MapReduce to solve complex data processing challenges while maintaining both speed and scalability as data volumes keep growing.