
Hadoop - MapReduce

Key Concepts of MapReduce

1. Map Phase:
   ○ The input data is divided into smaller chunks (splits).
   ○ Each chunk is processed by a mapper function in parallel.
   ○ The mapper function transforms the input key-value pairs into intermediate key-value pairs.
2. Shuffle and Sort:
   ○ Intermediate key-value pairs produced by the mappers are shuffled and grouped based on their keys.
   ○ Sorting ensures that data with the same key is brought together.
3. Reduce Phase:
   ○ The grouped data is processed by the reducer function.
   ○ The reducer combines and summarizes the data to produce the final output (a mapper/reducer sketch illustrating these phases follows this list).
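These phases correspond directly to Hadoop's Java API: the Map Phase is written by subclassing org.apache.hadoop.mapreduce.Mapper, Shuffle and Sort are performed by the framework itself, and the Reduce Phase is written by subclassing Reducer. Below is a minimal word-count sketch using those standard base classes and method signatures; the class names WordCountMapper and WordCountReducer are chosen here for illustration.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map Phase: called once per input record; turns (byte offset, line of text)
// into intermediate (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit an intermediate key-value pair
        }
    }
}

// Reduce Phase: called once per key after the framework has shuffled and sorted
// the intermediate pairs, with all values for that key grouped together.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                // combine and summarize the grouped values
        }
        context.write(key, new IntWritable(sum));
    }
}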

Key Features

● Scalability: Handles large-scale datasets distributed across multiple nodes.
● Fault Tolerance: Automatic handling of node failures by reassigning tasks (see the configuration sketch after this list).
● Parallel Processing: Executes tasks in parallel to improve efficiency.
● Data Locality: Moves computation to where the data resides, minimizing data transfer overhead.
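As a rough illustration of how fault tolerance and speculative (parallel) execution are tuned in practice, the sketch below sets a few standard Hadoop 2.x+ job properties through the Configuration API; the values shown are arbitrary examples rather than recommendations, and the class name is illustrative.

import org.apache.hadoop.conf.Configuration;

// Illustrative only: property names are standard Hadoop 2.x+ keys; values are example settings.
public class FaultToleranceSettings {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);            // re-run a failed map task up to 4 times
        conf.setInt("mapreduce.reduce.maxattempts", 4);         // re-run a failed reduce task up to 4 times
        conf.setBoolean("mapreduce.map.speculative", true);     // launch backup copies of straggling map tasks
        conf.setBoolean("mapreduce.reduce.speculative", true);  // launch backup copies of straggling reduce tasks
        return conf;
    }
}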

Workflow of MapReduce

1. Input:
   ○ A dataset, such as a log file or a large document, is broken into splits.
   ○ Each split is fed into a mapper.
2. Mapping:
   ○ The mapper processes the split and emits intermediate key-value pairs.
   ○ For example, in a word count task, the mapper reads lines and emits pairs like (word, 1).
3. Shuffling and Sorting:
   ○ Intermediate pairs are grouped by key.
   ○ Sorting ensures all occurrences of a key are processed together.
4. Reducing:
   ○ Reducers aggregate or process the grouped data.
   ○ For word count, the reducer sums up the occurrences of each word.
5. Output:
   ○ The final results are written to a distributed file system, like HDFS (Hadoop Distributed File System). The driver sketch below shows how these steps are wired into a job.
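A hedged sketch of the job driver that wires these five steps into a runnable job follows. Job, FileInputFormat, and FileOutputFormat are Hadoop's standard classes; the WordCountDriver name, the command-line input/output paths, and the WordCountMapper/WordCountReducer classes from the earlier sketch are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // step 2: mapping
        job.setReducerClass(WordCountReducer.class);  // step 4: reducing
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Step 1: the input directory is divided into splits, one mapper per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Step 5: the final results are written to HDFS under the output directory.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Step 3 (shuffle and sort) is handled by the framework between map and reduce.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with a command along the lines of "hadoop jar wordcount.jar WordCountDriver /input /output" (paths illustrative), the results land in the output directory on HDFS as part-r-* files.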

Example: Word Count Program

Input Data

file1: "hadoop mapreduce tutorial"
file2: "hadoop tutorial example"

Map Phase

Mapper 1 Output (file1): (hadoop, 1), (mapreduce, 1), (tutorial, 1)
Mapper 2 Output (file2): (hadoop, 1), (tutorial, 1), (example, 1)

Shuffle and Sort

(example, [1])
(hadoop, [1, 1])
(mapreduce, [1])
(tutorial, [1, 1])

Reduce Phase
(example, 1)
(hadoop, 2)
(mapreduce, 1)
(tutorial, 2)

Output

example 1
hadoop 2
mapreduce 1
tutorial 2
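The same data flow can be traced without a cluster. The following self-contained sketch uses only the plain JDK (no Hadoop dependencies; the class name is illustrative) to reproduce the map, shuffle-and-sort, and reduce steps on the two example files and print the output shown above.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, framework-free simulation of the word-count data flow above.
public class WordCountSimulation {
    public static void main(String[] args) {
        String[] files = { "hadoop mapreduce tutorial",   // file1
                           "hadoop tutorial example" };   // file2

        // Map phase: each "mapper" emits (word, 1) pairs for its file.
        List<SimpleEntry<String, Integer>> intermediate = new ArrayList<>();
        for (String file : files) {
            for (String word : file.split(" ")) {
                intermediate.add(new SimpleEntry<>(word, 1));
            }
        }

        // Shuffle and sort: group the values by key (TreeMap keeps the keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (SimpleEntry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: sum the grouped values and print the final output.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum);   // e.g. "hadoop 2"
        }
    }
}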

Advantages

● Efficient: Processes large-scale data by distributing tasks across nodes.
● Flexible: Supports a wide range of applications, including data mining, log analysis, and web indexing.
● Cost-Effective: Open-source and runs on commodity hardware.

Limitations

● High Latency: Not suitable for real-time processing.
● Complexity: Writing MapReduce programs can be complex for some applications.
● Iterative Algorithms: Inefficient for tasks requiring multiple iterations.
