
MapReduce - Introduction

MapReduce is a programming model for writing applications that can process Big Data in parallel
on multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of
complex data.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube needs to collect and
manage on a daily basis can fall under the category of Big Data. However, Big Data is not only
about scale and volume; it also involves one or more of the following aspects − Velocity, Variety,
Volume, and Complexity.

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. The traditional
model is certainly not suitable for processing huge volumes of scalable data, and such volumes
cannot be accommodated by standard database servers. Moreover, the centralized system creates
too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.

Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.

Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.

Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it may require a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.

Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.

Let us try to understand the two tasks Map and Reduce with the help of a small diagram −

MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around
500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows
how Twitter manages its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
MapReduce - Algorithm

The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
The Mapper class takes the input, tokenizes it, and maps and sorts it. The output of the Mapper class
is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the
Map and Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.

Sorting methods are implemented in the mapper class itself.


In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context
class collects the matching-valued keys as a collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the
RawComparator class to sort the key-value pairs.
The set of intermediate key-value pairs for a given Reducer is automatically sorted by
Hadoop to form key-value lists (K2, {V2, V2, …}) before they are presented to the Reducer.
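As a hedged illustration of how this sorting can be customized (the class name below is invented for
the example and the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports are
assumed), a job may register a WritableComparator so that Shuffle and Sort delivers the keys in a
different order:

public static class DescendingTextComparator extends WritableComparator {

    public DescendingTextComparator() {
        super(Text.class, true);
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return b.compareTo(a);   // reverse of the natural (ascending) Text order
    }
}

// registered on the job; without this call Hadoop simply uses the key's own compareTo()
job.setSortComparatorClass(DescendingTextComparator.class);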

Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how searching works with the help of
an example.

Example

The following example shows how MapReduce employs the searching algorithm to find the details
of the employee who draws the highest salary in a given employee dataset.

Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because the employee data
was imported repeatedly from all the database tables. See the following illustration.

The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.

The combiner phase (searching technique) will accept the input from the Map phase as a
key-value pair with the employee name and salary. Using the searching technique, the combiner
will check all the employee salaries to find the highest-salaried employee in each file. See
the following snippet.

<k: employee name, v: salary>


Max = the salary of the first employee   // treated as the max salary so far

if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
}
else {
    continue checking;
}

The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

Reducer phase − From each file, you will find the highest-salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is used between the four <k, v> pairs, which are coming from the four input files.
The final output should be as follows −

<gopal, 50000>
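The same search can be sketched in Java. The class below is not part of the original example: it
assumes the mapper or combiner emits each record as a "name<TAB>salary" Text value under a
single constant key, and it simply keeps the record with the highest salary, mirroring the snippet
above (the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports are assumed).

public static class MaxSalaryReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        String maxName = null;
        int maxSalary = Integer.MIN_VALUE;

        for (Text value : values) {
            // each value is assumed to look like "gopal<TAB>50000"
            String[] parts = value.toString().split("\t");
            int salary = Integer.parseInt(parts[1]);
            if (salary > maxSalary) {        // same comparison as in the snippet above
                maxSalary = salary;
                maxName = parts[0];
            }
        }
        context.write(new Text(maxName), new IntWritable(maxSalary));
    }
}

A combiner with the same comparison logic (but emitting Text values so the types match the map
output) could compute the per-file maximum locally, as the combiner phase above describes.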

Indexing

Normally, indexing is used to point to particular data and its address. MapReduce performs batch
indexing on the input files for a particular Mapper.

The indexing technique that is normally used in MapReduce is known as inverted index. Search
engines like Google and Bing use the inverted indexing technique. Let us try to understand how
indexing works with the help of a simple example.

Example

The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and
their contents are in double quotes.

T[0] = "it is what it is"


T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is"
appears in the files T[0], T[1], and T[2].
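A hedged sketch of how this inverted index could be written as a MapReduce job follows. The class
names are illustrative, not from the original text, and the usual Hadoop imports (java.util,
org.apache.hadoop.io, org.apache.hadoop.mapreduce, and
org.apache.hadoop.mapreduce.lib.input.FileSplit) are assumed: the mapper emits (term, document
name) pairs and the reducer collects the distinct documents per term.

public static class IndexMapper extends Mapper<Object, Text, Text, Text> {

    private final Text term = new Text();
    private final Text docId = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the input file name (e.g. T[0]) as the document identifier
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            term.set(token);
            context.write(term, docId);
        }
    }
}

public static class IndexReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // keep each document only once, e.g. "is" -> {T[0], T[1], T[2]}
        Set<String> documents = new TreeSet<String>();
        for (Text docId : values) {
            documents.add(docId.toString());
        }
        context.write(key, new Text(documents.toString()));
    }
}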

TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.

Term Frequency (TF)

It measures how frequently a particular term occurs in a document. It is calculated by the number of
times a word appears in a document divided by the total number of words in that document.

TF('the') = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.

While computing TF, all the terms are considered equally important. That means, TF counts the
term frequency for normal words like “is”, “a”, “what”, etc. Thus we need to weigh down the frequent
terms while scaling up the rare ones, by computing the following −

IDF('the') = log(Total number of documents / Number of documents with the term 'the' in it)

The algorithm is explained below with the help of a small example.

Example

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for
hive is then (50 / 1000) = 0.05.

Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
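The same arithmetic, written out as a small Java snippet (the numbers are the ones from the
example above; note that the worked example uses a base-10 logarithm):

double tf    = 50 / 1000.0;                    // term frequency of "hive"    = 0.05
double idf   = Math.log10(10000000 / 1000.0);  // inverse document frequency  = 4.0
double tfIdf = tf * idf;                       // TF-IDF weight of "hive"     = 0.2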
MapReduce - API

In this chapter, we will take a close look at the classes and their methods that are involved in the
operations of MapReduce programming. We will primarily keep our focus on the following −
JobContext Interface
Job Class
Mapper Class
Reducer Class

JobContext Interface

The JobContext interface is the super-interface for all the classes that define different jobs in
MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are
running.
The following are the sub-interfaces of JobContext interface.

S.No. Subinterface and Description

1. MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Defines the context that is given to the Mapper.

2. ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Defines the context that is passed to the Reducer.

The Job class is the main class that implements the JobContext interface.

Job Class
The Job class is the most important class in the MapReduce API. It allows the user to configure the
job, submit it, control its execution, and query its state. The set methods work only until the job is
submitted; afterwards, they throw an IllegalStateException.
Normally, the user creates the application, describes the various facets of the job, and then submits
the job and monitors its progress.
Here is an example of how to submit a job −
// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters


job.setJobName("myjob");
job.setInputPath(new Path("in"));
job.setOutputPath(new Path("out"));

job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);
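As a side note, the Job constructors used above are deprecated in Hadoop 2.x and later; a hedged
modern equivalent of the first two lines uses the static factory method instead:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "myjob");   // replaces the deprecated new Job(conf) constructors
job.setJarByClass(MyJob.class);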

Constructors

Following is the constructor summary of the Job class.

S.No Constructor Summary

1 Job()

2 Job(Configuration conf)

3 Job(Configuration conf, String jobName)

Methods

Some of the important methods of Job class are as follows −


S.No Method Description

1 getJobName()
Returns the user-specified job name.

2 getJobState()
Returns the current state of the Job.

3 isComplete()
Checks if the job is finished or not.

4 setInputFormatClass()
Sets the InputFormat for the job.

5 setJobName(String name)
Sets the user-specified job name.

6 setOutputFormatClass()
Sets the OutputFormat for the job.

7 setMapperClass(Class)
Sets the Mapper for the job.

8 setReducerClass(Class)
Sets the Reducer for the job.

9 setPartitionerClass(Class)
Sets the Partitioner for the job.

10 setCombinerClass(Class)
Sets the Combiner for the job.

Mapper Class

The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-
value pairs. Maps are the individual tasks that transform the input records into intermediate
records. The transformed intermediate records need not be of the same type as the input records.
A given input pair may map to zero or many output pairs.
Method

map is the most prominent method of the Mapper class. The syntax is defined below −

map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

This method is called once for each key-value pair in the input split.

Reducer Class

The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values
that share a key to a smaller set of values. Reducer implementations can access the Configuration
for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases −
Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the
network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers
may have output the same key). The shuffle and sort phases occur simultaneously, i.e.,
while outputs are being fetched, they are merged.
Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each
<key, (collection of values)> in the sorted inputs.

Method

reduce is the most prominent method of the Reducer class. The syntax is defined below −

reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)

This method is called once for each key on the collection of key-value pairs.
MapReduce by examples
MapReduce inspiration
The name MapReduce comes from functional programming:
- map is the name of a higher-order function that applies a given function
to each element of a list. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.map(x => x * x) == List(1,4,9,16,25)

- reduce is the name of a higher-order function that analyzes a recursive
data structure and recombines, through use of a given combining
operation, the results of recursively processing its constituent parts,
building up a return value. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.reduce(_ + _) == 15

MapReduce takes an input, splits it into smaller parts, executes the code of
the mapper on every part, and then gives all the results to one or more reducers
that merge all the results into one.

src: http://en.wikipedia.org/wiki/Map_(higher-order_function)
http://en.wikipedia.org/wiki/Fold_(higher-order_function)
MapReduce by examples
Overall view
MapReduce by examples
How does Hadoop work?
Init
- Hadoop divides the input file stored on HDFS into splits (typically of the size
of an HDFS block) and assigns every split to a different mapper, trying to
assign every split to a mapper running on the node where the split physically resides
Mapper
- locally, Hadoop reads the split of the mapper line by line
- locally, Hadoop calls the map() method of the mapper for every line, passing
it as the key/value parameters
- the mapper computes its application logic and emits other key/value pairs
Shuffle and sort
- locally, Hadoop's partitioner divides the emitted output of the mapper into
partitions, each of which is sent to a different reducer (a partitioner sketch
follows this list)
- locally, Hadoop collects all the different partitions received from the
mappers and sorts them by key
Reducer
- locally, Hadoop reads the aggregated partitions line by line
- locally, Hadoop calls the reduce() method on the reducer for every line of
the input
- the reducer computes its application logic and emits other key/value pairs
- locally, Hadoop writes the reducer output (the emitted pairs) to HDFS
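To make the partitioning step above concrete, here is a hedged sketch of a custom Partitioner. The
class name and the routing rule are invented for the example; by default Hadoop uses
HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.

public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                 // no reducers: everything goes to partition 0
        }
        // hypothetical rule: route keys to reducers by their first character
        String k = key.toString();
        int first = k.isEmpty() ? 0 : Character.toLowerCase(k.charAt(0));
        return first % numReduceTasks;
    }
}

// registered on the job (omit this call to keep the default HashPartitioner)
job.setPartitionerClass(FirstLetterPartitioner.class);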
MapReduce by examples

Simplified flow (for developers)


MapReduce by examples

Serializable vs Writable

- Serializable stores the class name and the object representation to the
stream; other instances of the class are referred to by a handle to the
class name: this approach is not usable with random access

- For the same reason, the sorting needed for the shuffle and sort phase
cannot be done with Serializable

- The deserialization process creates a new instance of the object, while
Hadoop needs to reuse objects to minimize computation

- Hadoop introduces the two interfaces Writable and WritableComparable
that solve these problems
MapReduce by examples

Writable wrappers
Java primitive   Writable implementation
boolean          BooleanWritable
byte             ByteWritable
short            ShortWritable
int              IntWritable, VIntWritable
float            FloatWritable
long             LongWritable, VLongWritable
double           DoubleWritable

Java class       Writable implementation
String           Text
byte[]           BytesWritable
Object           ObjectWritable
null             NullWritable
enum             EnumSetWritable

Java collection  Writable implementation
array            ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable
Map              MapWritable
SortedMap        SortedMapWritable
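A small hedged example of how these wrappers are used in practice − values are boxed through the
constructor or set(), and unboxed with get() or toString():

IntWritable count = new IntWritable(42);
int n = count.get();               // 42

Text word = new Text();
word.set("hadoop");
String s = word.toString();        // "hadoop"

DoubleWritable ratio = new DoubleWritable(0.05);
double d = ratio.get();            // 0.05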
MapReduce by examples
Implementing Writable: the SumCount class
public class SumCount implements WritableComparable<SumCount> {

DoubleWritable sum;
IntWritable count;

public SumCount() {
set(new DoubleWritable(0), new IntWritable(0));
}

public SumCount(Double sum, Integer count) {
set(new DoubleWritable(sum), new IntWritable(count));
}

@Override
public void write(DataOutput dataOutput) throws IOException {

sum.write(dataOutput);
count.write(dataOutput);
}

@Override
public void readFields(DataInput dataInput) throws IOException {

sum.readFields(dataInput);
count.readFields(dataInput);
}
// getters, setters and Comparable overridden methods are omitted
}
MapReduce by examples

Glossary
Term        Meaning
Job         The whole process to execute: the input data, the mapper and reducer
            execution, and the output data
Task        Every job is divided among the several mappers and reducers; a task is
            the portion of the job that goes to a single mapper or reducer
Split       The input file is split into several splits (the suggested size is the
            HDFS block size, 64 MB)
Record      The split is read by the mapper, by default one line at a time: each
            line is a record. Using a class extending FileInputFormat, a record
            can be composed of more than one line
Partition   The set of all the key-value pairs that will be sent to a single
            reducer. The default partitioner uses a hash function on the key to
            determine to which reducer to send the data
MapReduce by examples

Let's start coding!


MapReduce by examples

WordCount
(the Hello World! for MapReduce, available in Hadoop sources)

We want to count the occurrences of every word of a text file

Input Data:

The text of the book ”Flatland” by Edwin Abbott.
Source:
http://www.gutenberg.org/cache/epub/201/pg201.txt
MapReduce by examples

WordCount mapper

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();

@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());


while (itr.hasMoreTokens()) {
word.set(itr.nextToken().trim());
context.write(word, one);
}
}
}
MapReduce by examples

WordCount reducer

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>{

private IntWritable result = new IntWritable();

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
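The slides do not show the driver that wires these two classes together. The following main() method
is a hedged sketch modelled on the standard Hadoop WordCount example; the enclosing class name
WordCount and the command-line paths are assumptions, and the usual org.apache.hadoop.conf,
org.apache.hadoop.fs, org.apache.hadoop.io, org.apache.hadoop.mapreduce and
org.apache.hadoop.mapreduce.lib input/output imports are assumed.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);           // assumes the enclosing class is WordCount

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);    // the reducer also works as a local combiner here
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));      // e.g. the Flatland text on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory (must not already exist)

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}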
MapReduce by examples

WordCount
Results:

a 936
ab 6
abbot 3
abbott 2
abbreviated 1
abide 1
ability 1
able 9
ablest 2
abolished 1
abolition 1
about 40
above 22
abroad 1
abrogation 1
abrupt 1
abruptly 1
absence 4
absent 1
absolute 2
...
