MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. Presented by Jon Logan
I will present the concepts of MapReduce using the canonical example:
Word Count
The input of this program is a volume of raw text of unspecified size (could
be KB, MB, TB; it doesn't matter!)
The output is a list of words and their occurrence counts. Assume that words
are split correctly, ignoring capitalization and punctuation.
Example
The doctor went to the store. =>
The, 2
Doctor, 1
Went, 1
To, 1
Store, 1
Map? Reduce?
Mappers read data from the filesystem and output (typically) modified data
Reducers collect all of the mappers' output, grouped by key, and output (typically)
reduced data
The reducers' output is written to disk
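For the earlier sentence, "The doctor went to the store.", the map phase emits (the, 1), (doctor, 1), (went, 1), (to, 1), (the, 1), (store, 1); the shuffle groups these by key, e.g. (the, [1, 1]); and each reduce call sums its list, producing (the, 2), (doctor, 1), and so on.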
User Components:
Mapper
Reducer
Combiner (Optional)
Partitioner (Optional) (Shuffle)
Writable(s) (Optional)
System Components:
Master
Input Splitter*
Output Committer*
* You can use your own if you really want!
[Diagram: the input is divided into splits 0-2; each split is read by a mapper (Mapper 0, 1, 2); the mappers' output is shuffled to two reducers (Reducer 0 and 1), which each write an output (Out 0 and Out 1).]
Input Splitter
Ex. For our Word Count example, with the following input: The teacher went
to the store. The store was closed; the store opens in the morning. The store
opens at 9am.
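One plausible outcome, for illustration only: the splitter divides this text into three roughly equal splits, for example one per sentence, and each split is handed to one of the three mappers in the diagram above.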
Binning of results
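This binning is what the (optional) Partitioner controls: it decides which reducer each intermediate key is sent to; by default Hadoop hashes the key. A minimal custom sketch, purely illustrative and not from the slides, that bins words by their first letter:

// Illustrative custom Partitioner: bin words by their first letter.
// Hadoop's default is to hash the key (HashPartitioner); this class is not from the slides.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text word, LongWritable count, int numReducers) {
        String s = word.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        // char promotes to a non-negative int, so the modulo is safe
        return first % numReducers;
    }
}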
Master
If a task fails to report progress (reading input, writing output, etc.), crashes, or its
machine goes down, it is assumed to be stuck; it is killed and the step is re-launched
(with the same input)
If a task is among the last remaining in a job and is completing slowly, the master can
launch a second (backup) copy of that task
Slowness can be due to hardware issues, network issues, etc.
The first copy to complete wins; any other running copies are then killed
Writables
The mapper's output is a set of <Text, LongWritable> pairs. The key is the token, and the value
is the count, which at this point is always 1.
For the purpose of this demonstration, just assume Text is a fancy String and
LongWritable is a fancy Long. In reality, they're just the Writable equivalents.
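A minimal mapper sketch that produces exactly those <Text, LongWritable> pairs, assuming the standard org.apache.hadoop.mapreduce API (the class name is illustrative, not the slides' exact code):

// Word-count mapper: emit (token, 1) for every token in the input line.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Lower-case the line, split on non-word characters, and emit (token, 1)
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}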
Reducer Code
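A matching reducer sketch that sums the 1s emitted for each token (again assuming the standard Hadoop API; the class name is illustrative):

// Word-count reducer: sum the 1s emitted for each token.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text word, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable count : counts) {
            total += count.get();
        }
        context.write(word, new LongWritable(total));
    }
}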
Do we need a combiner?
No, but it reduces network bandwidth: partial counts for the same word can be summed on
the mapper's machine before being sent to the reducers.
All that is needed to run the above code is an extremely simple runner class.
It simply specifies which components to use and your input/output directories.
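A plausible runner along these lines, assuming the mapper and reducer sketches above; the class names, paths, and the choice to reuse the reducer as the combiner are illustrative:

// Minimal runner: wire up mapper, combiner, reducer, and I/O paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // summing is associative, so the reducer doubles as combiner
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}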
Workflows
[Diagram: a workflow in which the input is fetched, processed in parallel as Data A and Data B, and the results are merged into the output data.]
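One way such a workflow could be expressed, as a sketch only: chain Hadoop jobs so the two processing jobs read the fetched input and a final job merges both output directories (all names and paths are illustrative, not from the slides):

// Illustrative workflow driver: two processing jobs read the fetched input,
// then a third job merges both output directories. Jobs run sequentially here
// for simplicity; each job's mapper/reducer classes would be set where noted.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WorkflowDriver {
    public static void main(String[] args) throws Exception {
        Job processA = Job.getInstance();                 // "Process Data A"
        // ... set mapper/reducer/output classes for A here ...
        FileInputFormat.addInputPath(processA, new Path("input"));
        FileOutputFormat.setOutputPath(processA, new Path("dataA"));

        Job processB = Job.getInstance();                 // "Process Data B"
        // ... set mapper/reducer/output classes for B here ...
        FileInputFormat.addInputPath(processB, new Path("input"));
        FileOutputFormat.setOutputPath(processB, new Path("dataB"));

        if (!processA.waitForCompletion(true) || !processB.waitForCompletion(true)) {
            System.exit(1);
        }

        Job merge = Job.getInstance();                    // "Merge" step reads both outputs
        // ... set mapper/reducer/output classes for the merge here ...
        FileInputFormat.addInputPath(merge, new Path("dataA"));
        FileInputFormat.addInputPath(merge, new Path("dataB"));
        FileOutputFormat.setOutputPath(merge, new Path("output"));
        System.exit(merge.waitForCompletion(true) ? 0 : 1);
    }
}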
Conclusion