MapReduce
MapReduce is a programming model for writing applications that can process Big Data in parallel
on multiple nodes. It provides the analytical capabilities needed to work with huge volumes of
complex data.
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube collects and manages on a
daily basis can fall under the category of Big Data. However, Big Data is not only about scale and
volume; it also involves one or more of the following aspects − Velocity, Variety, Volume, and
Complexity.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. This traditional
model is not suitable for processing huge volumes of scalable data, which cannot be accommodated
by standard database servers. Moreover, the centralized system creates a bottleneck while
processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, which may require a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.
Let us try to understand the two tasks Map & Reduce with the help of a small diagram −
MapReduce - Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around
500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows
how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
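The following is a minimal sketch of how the Tokenize and Filter steps above could be written as a Hadoop mapper; the class name, the stop-word list, and the lower-casing are illustrative assumptions, not Twitter's actual code. The Count and Aggregate Counters steps are then handled by a summing reducer such as the WordCount reducer shown later.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: tokenizes tweet text, filters unwanted words, and
// emits (token, 1) pairs; the counting happens in the reducer.
public class TweetTokenMapper extends Mapper<Object, Text, Text, IntWritable> {

   private static final Set<String> STOP_WORDS =
         new HashSet<String>(Arrays.asList("a", "an", "the", "is", "it", "rt"));

   private final static IntWritable ONE = new IntWritable(1);
   private final Text token = new Text();

   @Override
   public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
      // Tokenize: split the tweet into words
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
         String word = itr.nextToken().toLowerCase();
         // Filter: drop unwanted words
         if (!STOP_WORDS.contains(word)) {
            token.set(word);
            context.write(token, ONE);
         }
      }
   }
}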
MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as
input by Reducer class, which in turn searches matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the
Map and Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce
implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
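By default, the keys are sorted in their natural ascending WritableComparable order. If a different order is needed, a custom comparator can be registered with job.setSortComparatorClass(). The class below is a small illustrative sketch (not part of the tutorial) that reverses the order for IntWritable keys −

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative comparator that inverts the default (ascending) sort order
// of the intermediate keys, assumed here to be IntWritable.
public class DescendingIntComparator extends WritableComparator {

   protected DescendingIntComparator() {
      super(IntWritable.class, true);   // true = create key instances for comparison
   }

   @Override
   @SuppressWarnings("rawtypes")
   public int compare(WritableComparable a, WritableComparable b) {
      return -1 * a.compareTo(b);       // invert the natural order
   }
}

// Registered on the job with: job.setSortComparatorClass(DescendingIntComparator.class);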
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how searching works with the help of
an example.
Example
The following example shows how MapReduce employs the searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because the employee data
was imported repeatedly from the database tables. See the following illustration.
The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the Map phase as
key-value pairs of employee name and salary. Using the searching technique, the combiner
will check each employee's salary to find the highest-salaried employee in each file. See
the following snippet.
if(v(salary) > Max){
   Max = v(salary);
}
else{
   Continue checking;
}
The expected combiner output (the highest-salaried employee found in each file) is as follows −
<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>
Reducer phase − From each file, you will find the highest-salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is used among the four <k, v> pairs coming from the four input files.
The final output should be as follows −
<gopal, 50000>
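One possible way to code this search in Hadoop is sketched below; it is an illustrative assumption, not the tutorial's exact implementation. The mapper is assumed to emit every record under a single constant key with the value formatted as "name,salary", so the same class can serve as both the combiner (per-mapper maximum) and the reducer (global maximum) −

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner/reducer: scans all "name,salary" candidates for a key
// and keeps only the highest-paid employee.
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {

   @Override
   public void reduce(Text key, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      String bestName = null;
      int bestSalary = Integer.MIN_VALUE;
      for (Text value : values) {
         String[] parts = value.toString().split(",");   // "name,salary"
         int salary = Integer.parseInt(parts[1].trim());
         if (salary > bestSalary) {                       // the "searching" step
            bestSalary = salary;
            bestName = parts[0];
         }
      }
      // Re-emit in the same "name,salary" format under the same key, so this
      // class can be registered as both the combiner and the reducer.
      context.write(key, new Text(bestName + "," + bestSalary));
   }
}

Registered with job.setCombinerClass(MaxSalaryReducer.class) and job.setReducerClass(MaxSalaryReducer.class), each combiner forwards one candidate per mapper and the reducer keeps <gopal, 50000> among them.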
Indexing
Indexing is normally used to point to particular data and its address. MapReduce performs batch
indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search
engines like Google and Bing use the inverted indexing technique. Let us try to understand how
indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and
their contents are given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is"
appears in the files T[0], T[1], and T[2].
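A minimal sketch of an inverted-index job is shown below; the class names are illustrative, and the document name is assumed to be obtainable from the input split (which holds for FileInputFormat-based jobs) −

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emits (term, documentName) for every term in the line.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
   @Override
   public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
      String docName = ((FileSplit) context.getInputSplit()).getPath().getName();
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
         context.write(new Text(itr.nextToken().toLowerCase()), new Text(docName));
      }
   }
}

// Reducer: collects the distinct documents in which each term appears.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
   @Override
   public void reduce(Text key, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      Set<String> docs = new HashSet<String>();
      for (Text value : values) {
         docs.add(value.toString());
      }
      context.write(key, new Text(docs.toString()));   // e.g. "is" -> the set of files containing it
   }
}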
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
Term Frequency (TF)
TF measures how frequently a particular term occurs in a document. It is calculated as the number
of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
IDF measures the importance of a term. It is calculated as the number of documents in the text
database divided by the number of documents in which a specific term appears.
While computing TF, all terms are considered equally important; that is, TF counts the frequency
even of common words like "is", "a", and "what". Thus we need to weigh down the frequent terms
while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it)
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for
hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
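The same arithmetic can be checked with a few lines of Java; the base-10 logarithm is an assumption that matches the IDF value of 4 used above −

// Quick check of the worked example: TF = 50/1000, IDF = log10(10M/1000).
public class TfIdfExample {
   public static void main(String[] args) {
      double tf  = 50.0 / 1000.0;                       // 'hive' appears 50 times in 1000 words
      double idf = Math.log10(10_000_000.0 / 1_000.0);  // 10 million documents, term in 1000 of them
      System.out.println(tf * idf);                     // prints 0.2
   }
}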
MapReduce - API
In this chapter, we will take a close look at the classes and their methods that are involved in the
operations of MapReduce programming. We will primarily keep our focus on the following −
JobContext Interface
Job Class
Mapper Class
Reducer Class
JobContext Interface
The JobContext interface is the super-interface for all the classes that define different jobs in
MapReduce. It gives a read-only view of the job to the tasks while they are running.
The main sub-interfaces of the JobContext interface are MapContext and ReduceContext, which
define the context given to the Mapper and to the Reducer respectively.
The Job class is the main class that implements the JobContext interface.
Job Class
The Job class is the most important class in the MapReduce API. It allows the user to configure the
job, submit it, control its execution, and query the state. The set methods only work until the job is
submitted, afterwards they will throw an IllegalStateException.
Normally, the user creates the application, describes the various facets of the job, and then submits
the job and monitors its progress.
Here is an example of how to submit a job −
// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);
Constructors
1 Job()
2 Job(Configuration conf)
Methods
1 getJobName()
User-specified job name.
2 getJobState()
Returns the current state of the Job.
3 isComplete()
Checks if the job is finished or not.
4 setInputFormatClass()
Sets the InputFormat for the job.
5 setJobName(String name)
Sets the user-specified job name.
6 setOutputFormatClass()
Sets the Output Format for the job.
7 setMapperClass(Class)
Sets the Mapper for the job.
8 setReducerClass(Class)
Sets the Reducer for the job.
9 setPartitionerClass(Class)
Sets the Partitioner for the job.
10 setCombinerClass(Class)
Sets the Combiner for the job.
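To see several of these methods used together, here is a minimal driver sketch; the mapper and reducer class names (WordCountMapper, WordCountReducer) and the command-line paths are illustrative assumptions −

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver wiring a WordCount-style job with the Job setters above.
public class WordCountDriver {
   public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration());

      job.setJarByClass(WordCountDriver.class);
      job.setJobName("word count");                    // setJobName(String name)

      job.setMapperClass(WordCountMapper.class);       // setMapperClass(Class)
      job.setCombinerClass(WordCountReducer.class);    // setCombinerClass(Class)
      job.setReducerClass(WordCountReducer.class);     // setReducerClass(Class)

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}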
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-
value pairs. Maps are the individual tasks that transform the input records into intermediate
records. The transformed intermediate records need not be of the same type as the input records.
A given input pair may map to zero or many output pairs.
Method
map is the most prominent method of the Mapper class. The syntax is defined below −
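protected void map(KEYIN key, VALUEIN value, Context context)
   throws IOException, InterruptedException
Here KEYIN and VALUEIN are the input key and value types declared on Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.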
This method is called once for each key-value pair in the input split.
Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values
that share a key to a smaller set of values. Reducer implementations can access the Configuration
for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases −
Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the
network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers
may have output the same key). The shuffle and sort phases occur simultaneously, i.e.,
while outputs are being fetched, they are merged.
Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each
<key, (collection of values)> pair in the sorted inputs.
Method
reduce is the most prominent method of the Reducer class. The syntax is defined below −
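protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
   throws IOException, InterruptedException
Here KEYIN and VALUEIN are the intermediate key and value types declared on Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.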
This method is called once for each key, with the corresponding collection of values for that key.
MapReduce by examples
MapReduce inspiration
The name MapReduce comes from functional programming:
- map is the name of a higher-order function that applies a given function
to each element of a list. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.map(x => x * x) == List(1,4,9,16,25)
MapReduce takes an input, splits it into smaller parts, executes the code of
the mapper on every part, and then gives all the results to one or more reducers
that merge them into a single result.
src: http://en.wikipedia.org/wiki/Map_(higher-order_function)
http://en.wikipedia.org/wiki/Fold_(higher-order_function)
How does Hadoop work?
Init
- Hadoop divides the input file stored on HDFS into splits (typically of the size
of an HDFS block) and assigns every split to a different mapper, trying to run
each mapper on the node where its split physically resides
Mapper
- locally, Hadoop reads the split of the mapper line by line
- locally, Hadoop calls the map() method of the mapper for every line, passing the
line as the value (and, with the default TextInputFormat, its byte offset as the key)
- the mapper computes its application logic and emits other key/value pairs
Shuffle and sort
- locally, Hadoop's partitioner divides the emitted output of the mapper into
partitions, each of which is sent to a different reducer
- locally, Hadoop collects all the different partitions received from the
mappers and sorts them by key
Reducer
- locally, Hadoop reads the aggregated partitions line by line
- locally, Hadoop calls the reduce() method of the reducer once for every key,
passing the key and its list of values
- the reducer computes its application logic and emits other key/value pairs
- locally, Hadoop writes the emitted key/value pairs to HDFS
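As an illustration of the partitioning step above (an assumption for this guide, not code from the slides), a custom partitioner extends org.apache.hadoop.mapreduce.Partitioner; the default HashPartitioner simply hashes the key −

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys starting with 'a'..'m' go to one partition,
// the rest to another. The default HashPartitioner instead computes
// (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
   @Override
   public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      String k = key.toString();
      char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
      return (first <= 'm' ? 0 : 1) % numReduceTasks;
   }
}

// Registered on the job with: job.setPartitionerClass(AlphabetPartitioner.class);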
Serializable vs Writable
- Serializable stores the class name and the object representation to the
stream; other instances of the class are referred to by a handle to the
class name: this approach is not usable with random access
- For the same reason, the sorting needed for the shuffle and sort phase
cannot be done with Serializable
Writable wrappers
Java type     Writable implementation
boolean       BooleanWritable
int           IntWritable
double        DoubleWritable
String        Text
(among others)
For example, a composite SumCount Writable that wraps a sum and a count can be written as follows:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class SumCount implements WritableComparable<SumCount> {

   DoubleWritable sum;
   IntWritable count;

   public SumCount() {
      set(new DoubleWritable(0), new IntWritable(0));
   }
@Override
public void write(DataOutput dataOutput) throws IOException {
sum.write(dataOutput);
count.write(dataOutput);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
sum.readFields(dataInput);
count.readFields(dataInput);
}
// getters, setters and Comparable overridden methods are omitted
}
Glossary
Job − The whole process to execute: the input data, the mapper and reducer execution, and the
output data.
Task − Every job is divided among the several mappers and reducers; a task is the portion of the
job that goes to a single mapper or reducer.
Split − The input file is divided into several splits (the suggested split size is the HDFS block
size, 64 MB).
WordCount
(the Hello World! for MapReduce, available in Hadoop sources)
Input Data: a plain-text file
WordCount mapper
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
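   // Body as in the standard Hadoop WordCount example; 'word' (a reusable Text)
   // and 'one' (an IntWritable holding 1) are fields of the mapper class.
   StringTokenizer itr = new StringTokenizer(value.toString());
   while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit (word, 1)
   }
}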
WordCount reducer
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);   // 'result' is an IntWritable field of the reducer class
context.write(key, result);
}
}
WordCount
Results:
a 936
ab 6
abbot 3
abbott 2
abbreviated 1
abide 1
ability 1
able 9
ablest 2
abolished 1
abolition 1
about 40
above 22
abroad 1
abrogation 1
abrupt 1
abruptly 1
absence 4
absent 1
absolute 2
...