By Christian Mechem and Geoff Crowley

MapReduce is a programming model developed by Google to process large datasets in parallel across clusters of computers. It works by splitting data into pieces, mapping those pieces to nodes where processing occurs to generate key-value pairs, shuffling the data between nodes to group by key, and reducing the values for each key. This allows for fault tolerance and scalability. Implementations like Apache Hadoop allow other developers to use MapReduce on their own data.



MapReduce: A Programming Model


• A technique for processing large datasets in parallel rather than serially

• First the data is split into pieces

• Typically 16–64 MB per piece


Mapping

• The pieces are then distributed among multiple mappers, which process the data.

• The data is processed into key-value pairs and collected into a list.

• For our example, the key is a word in a line and the value is its occurrence, which is always 1.
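The mapping step of the word-count example can be sketched in a few lines of Python (the function name `map_words` is illustrative, not from the slides):

```python
# Mapper for the word-count example: emit a (word, 1) pair per word.
def map_words(line):
    return [(word, 1) for word in line.split()]

pairs = map_words("the quick brown fox jumps over the lazy dog")
# "the" appears twice, so ("the", 1) is emitted twice in the list.
```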
Shuffle

• Now the data needs to be organized.

• In the Shuffle phase the data is grouped by the keys created in the mapping phase.

• This allows the reducing phase to work properly.
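A minimal sketch of the shuffle phase, grouping the mapped pairs by key (names are illustrative):

```python
from collections import defaultdict

# Shuffle: group the mapped (key, value) pairs by key.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(grouped)

groups = shuffle([("the", 1), ("quick", 1), ("the", 1)])
# groups == {"the": [1, 1], "quick": [1]}
```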


Reduce

• Now we need to condense the data we received into the result we want.

• The reducers use a defined function to reduce each key's values to one result.

• Our function computes the sum of the values associated with a key.

• The output is reconsolidated into a list of new key-value pairs, each pairing a key with its result.
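The reduce step then collapses each key's value list with the summing function the slides describe (a sketch; `reduce_counts` is an illustrative name):

```python
# Reducer: sum the values associated with one key.
def reduce_counts(key, values):
    return (key, sum(values))

shuffled = {"the": [1, 1], "quick": [1]}
result = [reduce_counts(key, values) for key, values in shuffled.items()]
# result == [("the", 2), ("quick", 1)]
```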
MapReduce was developed by Google.

MapReduce was invented by engineers at Google to respond to the massive amount of data
they were collecting from the web.

Distributing this data to numerous computers and parallelizing computations on that data
presented significant work for programmers.

To solve the problem, they created a new programming model based on the functional
programming paradigm.

In functional programming, computations never modify data and the order of operations does not
matter. These concepts were applied in MapReduce to achieve fault tolerance and parallelism.
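One concrete consequence for the word-count example: because the reducing function is a commutative, associative sum, the order in which values arrive cannot change the result.

```python
from functools import reduce
from operator import add

values = [1, 1, 1, 1, 1]
forward = reduce(add, values)         # sum the values left to right
backward = reduce(add, values[::-1])  # sum the values in reverse order
# Both orders give 5, so nodes may deliver values in any order.
```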
Immutable and Redundant Data

• MapReduce was originally implemented on a large cluster with thousands of nodes (computers) on a network. With thousands of nodes, Murphy’s Law dictates that some portion of them will fail.

• To provide fault tolerance during MapReduce computations, at least three copies of the input data, called “replicas”, are made.

• The replicas are never changed. They are only used to produce the reduced outputs (functional paradigm).

• If a node performing a MapReduce process fails, the MasterServer schedules the computation on another working node, which operates on the replica that was not fully processed.
MapReduce and Parallel Processes

• The MapReduce model works with data and computations that are
independent of each other.
• This is the simplest implementation of parallel processes because MapReduce
computations can be performed locally on each node with no communication
required between nodes (MasterServer excepted).
• Massive amounts of data are processed in parallel on nodes in a cluster, each using the same MapReduce function.

• Final key-value pairs are written to GFS, the Google File System, where the aggregate final results are stored.
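That independence can be sketched with Python's `multiprocessing.Pool` standing in for cluster nodes (the pool size, block contents, and function names are assumptions for illustration, not part of Google's implementation):

```python
from collections import Counter
from multiprocessing import Pool

# Map + local reduce for one data block; each worker runs independently,
# with no communication between workers, mirroring a node processing
# its own split.
def count_block(block_lines):
    counts = Counter()
    for line in block_lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    blocks = [["the quick brown fox"], ["the lazy dog"]]
    with Pool(processes=2) as pool:
        partials = pool.map(count_block, blocks)
    # Merging the partial counts plays the role of the final aggregation.
    total = sum(partials, Counter())
    # total["the"] == 2
```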
MapReduce uses the Principle of Locality

• Since MapReduce works on massive amounts of data, it is essential to reduce execution time.
• MapReduce implementations on a cluster place replicas on a node’s local disks before
task execution.
• Data blocks are processed on the same machine they reside on.
• The 16-64MB data block size is the size at which this locality is optimal.
• If a node fails, the MasterServer attempts to reschedule the task on a node that is on
the same network switch before trying a node on another switch in order to maintain
the benefits of locality.
Other Implementations

• MapReduce is no longer exclusive to Google.

• Several MapReduce frameworks exist for developers who want to process large amounts of data.

• The most well-known framework is Apache Hadoop, which employs HDFS, the Hadoop Distributed File System, in place of the GFS used by Google.

• The Hadoop framework can also mimic the cluster’s MapReduce processes on a single computer.

• Most object-oriented languages, including Java, can be used to write MapReduce functions and process data using the Hadoop file system and a distributed cluster.
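Hadoop Streaming also lets scripts in languages such as Python act as mappers and reducers by exchanging tab-separated key-value records over standard pipes. A minimal sketch of such a mapper's core logic (the layout here is illustrative, not taken from the slides):

```python
# Core of a word-count mapper in the Hadoop Streaming style: a streaming
# mapper script would print one "word<TAB>1" record per word to stdout,
# and the framework would shuffle the records by key before reducing.
def map_line(line):
    return [f"{word}\t1" for word in line.split()]

records = map_line("the quick brown fox")
# records == ["the\t1", "quick\t1", "brown\t1", "fox\t1"]
```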
Citations

• Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Communications of the ACM, vol. 51, no. 1, 2008, pp. 107–113, doi:10.1145/1327452.1327492.
• Fedak, Gilles, et al. “Future of MapReduce for Scientific Computing.” Proceedings of the Second International Workshop on MapReduce and Its Applications (MapReduce ’11), 2011, doi:10.1145/1996092.1996108.
• Guo, Zhenhua, et al. “Investigation of Data Locality and Fairness in MapReduce.” Proceedings of the Third International Workshop on MapReduce and Its Applications (MapReduce ’12), 2012, doi:10.1145/2287016.2287022.
• Pearlman, Shana. “MapReduce 101: What It Is & How to Get Started.” Talend Real-Time Open Source Data Integration Software, www.talend.com/resources/what-is-mapreduce/.
• Roebuck, Kevin. MapReduce: High-Impact Strategies. Tebbo, 2011.
• Tan, Yu Shyang. “MapReduce and Its Applications in Heterogeneous Environment.” doi:10.32657/10356/46718.
• Dharanipragada, Janakiram, et al. “Generate-Map-Reduce: An Extension to Map-Reduce to Support Shared Data and Recursive Computations.” Concurrency and Computation: Practice and Experience, vol. 26, no. 2, 2013, pp. 561–585, doi:10.1002/cpe.3018.
