Introduction To MapReduce
What is MapReduce?
A programming model (and its associated implementation) for processing large data sets
Exploits a large set of commodity computers
Executes the processing in a distributed manner
Offers a high degree of transparency
In other words: simple, and maybe suitable for your tasks!
Distributed Grep
[Figure: the data is split, each split is searched for matches, and cat combines the per-split matches into all matches.]
[Figure: the data is split, each split is counted, and merge combines the per-split counts into a merged count.]
Map Reduce
Very big data -> Map -> Partitioning function -> Reduce -> Result
Map:
Accepts an input key/value pair
Emits intermediate key/value pairs
Reduce:
Accepts an intermediate key together with all of its values (key/value*)
Emits output key/value pairs
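The Map and Reduce contract above can be sketched in a few lines of Python. This is a single-process illustration only: nothing is distributed, and the names run_mapreduce, word_map and word_reduce are hypothetical, not part of any MapReduce library.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run a map/reduce job in one process (no distribution, no fault tolerance)."""
    # Map phase: each input (key, value) pair may emit any number of
    # intermediate (key, value) pairs, which are grouped by key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: each intermediate key is reduced with all of its values.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Hypothetical job: count how often each word occurs.
def word_map(_doc_id, text):
    for word in text.split():
        yield word, 1

def word_reduce(word, counts):
    yield word, sum(counts)

word_counts = run_mapreduce(
    [("doc1", "a b a"), ("doc2", "b c")], word_map, word_reduce
)
```

Here word_counts holds one (word, total) pair per distinct word. A real implementation would run the map and reduce calls on many machines; the grouping step in the middle is what the framework does between the two phases.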
Partitioning Function
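The partitioning function decides which reduce task receives each intermediate key. The default described in the original MapReduce paper is hash(key) mod R, where R is the number of reduce tasks. A minimal sketch, assuming string keys and using CRC32 as a stand-in hash (the function name partition is hypothetical):

```python
import zlib

def partition(key, num_reducers):
    """Map an intermediate key to a reduce task index in [0, num_reducers)."""
    # crc32 is used instead of the built-in hash() because its result is
    # stable across processes, so every mapper agrees on the assignment.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
assignments = [partition(k, 4) for k in keys]
```

The important property is that equal keys always land on the same reduce task, so one reducer sees all values for a given key.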
Distributed Sort
Map:
emit(key,value)
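A rough sketch of how a sort can be built on this Map step. Two assumptions that are not on the slide: the reduce step simply passes pairs through, and a range-based partitioning function (with hypothetical split points in boundaries) sends ordered key ranges to successive reduce tasks.

```python
import bisect

def sort_map(key, value):
    # Map: emit(key, value) unchanged.
    yield key, value

def range_partition(key, boundaries):
    """Return the index of the key range that contains key."""
    return bisect.bisect_right(boundaries, key)

def distributed_sort(records, boundaries):
    # One bucket per reduce task; bucket i only holds keys that sort
    # before every key in bucket i + 1.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        for k, v in sort_map(key, value):
            partitions[range_partition(k, boundaries)].append((k, v))
    result = []
    for part in partitions:
        # Each reduce task handles its keys in sorted order and
        # emits them unchanged (assumed identity reduce).
        result.extend(sorted(part, key=lambda kv: kv[0]))
    return result

sorted_records = distributed_sort(
    [("pear", 3), ("apple", 1), ("zebra", 9), ("mango", 5)],
    boundaries=["h", "q"],
)
```

Concatenating the outputs of the reduce tasks in partition order then gives a globally sorted result.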
MapReduce
Distributed Grep
Map:
if match(value,pattern) emit(value,1)
Reduce:
emit(key,sum(value*))
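The grep pseudocode above could look like this in Python. It is a single-process sketch: the splits are plain lists of lines, the pattern is treated as a regular expression, and the names grep_map, grep_reduce and distributed_grep are hypothetical.

```python
import re
from collections import defaultdict

def grep_map(_key, line, pattern):
    # Map: if match(value, pattern) emit(value, 1)
    if re.search(pattern, line):
        yield line, 1

def grep_reduce(line, counts):
    # Reduce: emit(key, sum(value*))
    yield line, sum(counts)

def distributed_grep(splits, pattern):
    groups = defaultdict(list)
    for split in splits:  # each split would go to a separate map task
        for offset, line in enumerate(split):
            for k, v in grep_map(offset, line, pattern):
                groups[k].append(v)
    # Reduce every matching line with all of its emitted 1s.
    return dict(kv for line in groups for kv in grep_reduce(line, groups[line]))

matches = distributed_grep(
    [["error: disk full", "ok"],
     ["ok", "error: disk full", "error: timeout"]],
    r"error",
)
```

The result maps each matching line to the number of times it occurs across all splits.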
MapReduce Transparencies
Plus Google Distributed File System:
Parallelization
Fault tolerance
Locality optimization
Load balancing
Summary
A simple programming model for processing large data sets on large clusters of computers
Fun to use: focus on the problem and let the library deal with the messy details