Hadoop - MapReduce

MapReduce is a framework for writing applications that process huge amounts of data, in
parallel, on large clusters of commodity hardware in a reliable manner. Hadoop MapReduce is
the processing layer of Hadoop and is often called the heart of Hadoop. It is one of the most
widely used data-processing frameworks: several players in the e-commerce sector, such as
Amazon and Yahoo, use MapReduce for high-volume data processing.

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The reduce task takes the output
from a map as its input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map task.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.
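
To make the mapper and reducer roles concrete, here is a minimal word-count sketch written
against Hadoop's org.apache.hadoop.mapreduce API. The class names (WordCountMapper,
WordCountReducer) and the word-count logic are illustrative choices for this sketch, not
something prescribed by the framework.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: reads one line at a time and emits a <word, 1> pair for every token.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);           // intermediate pair <word, 1>
            }
        }
    }

    // Reducer: receives all counts for one word and sums them.
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));  // final pair <word, total>
        }
    }

The mapper emits one intermediate pair per word; during the shuffle the framework groups all
values for the same key, and the reducer adds them up.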

The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
• A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally
the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which is stored
in HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster (a minimal job driver that submits such a job is sketched after this list).
• The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
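
As a sketch of how such a job is configured and handed to the framework, the driver below uses
the standard Job API. It assumes the WordCountMapper and WordCountReducer classes from the
earlier sketch; the job name and the use of the reducer as a combiner are illustrative choices.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);      // map stage
            job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
            job.setReducerClass(WordCountReducer.class);    // reduce stage

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Assuming the classes are packaged into a jar (the name wordcount.jar below is a placeholder),
the job could be submitted with a command like:
hadoop jar wordcount.jar WordCountDriver /input/path /output/path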
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs; that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework, and hence need to
implement the Writable interface; additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. The input and output
types of a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
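
As an illustration of this serialization requirement, the hypothetical composite key below
implements WritableComparable; the class name and fields are assumptions made for this sketch,
while write, readFields, and compareTo are the methods the interface actually requires.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // A hypothetical composite key. Keys must be serializable by the framework and
    // sortable during the shuffle stage, hence WritableComparable.
    public class YearMonthKey implements WritableComparable<YearMonthKey> {

        private int year;
        private int month;

        public YearMonthKey() { }            // no-arg constructor used by the framework

        public YearMonthKey(int year, int month) {
            this.year = year;
            this.month = month;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);              // serialize fields in a fixed order
            out.writeInt(month);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            year = in.readInt();             // deserialize in the same order
            month = in.readInt();
        }

        @Override
        public int compareTo(YearMonthKey other) {   // defines the sort order of keys
            int byYear = Integer.compare(year, other.year);
            return byYear != 0 ? byYear : Integer.compare(month, other.month);
        }
    }

In the word-count sketch above, <k1, v1> is <LongWritable, Text> (byte offset and line of text),
and both <k2, v2> and <k3, v3> are <Text, IntWritable>; built-in types such as Text and
IntWritable already implement these interfaces.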

Terminology
• PayLoad − Applications implement the Map and Reduce functions, and form the
core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value
pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where the data resides before any processing takes
place.
• MasterNode − Node where the JobTracker runs and which accepts job requests from
clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the assigned jobs with the TaskTracker.
• TaskTracker − Tracks the task and reports status to the JobTracker.
Features of MapReduce

1. Scalability

Hadoop is a highly scalable framework because of its ability to store and distribute
huge amounts of data across many servers. These servers are inexpensive and can operate in
parallel, and we can easily scale the storage and computation power by adding servers to the
cluster.

Hadoop MapReduce programming enables organizations to run applications across large sets
of nodes, which can involve thousands of terabytes of data.

2. Flexibility

MapReduce programming enables companies to access new sources of data and to operate on
different types of data. It allows enterprises to access structured as well as unstructured
data and to derive significant value by gaining insights from multiple sources of data.

Additionally, the MapReduce framework provides support for multiple languages and for data
from sources ranging from email and social media to clickstream logs.

MapReduce processes data as simple key-value pairs and thus supports many data types,
including metadata, images, and large files. This makes MapReduce more flexible in dealing
with such data than a traditional DBMS.

3. Security and Authentication

The MapReduce programming model uses the HBase and HDFS security mechanisms, which allow
only authenticated users to operate on the data. Thus, it protects against unauthorized
access to system data and enhances system security.

4. Cost-effective solution

Hadoop’s scalable architecture with the MapReduce programming framework allows the
storage and processing of large data sets in a very affordable manner.

5. Fast

Hadoop uses a distributed storage method called the Hadoop Distributed File System (HDFS),
which basically implements a mapping system for locating data in a cluster.

The tools used for data processing, such as MapReduce programs, are generally located on the
very same servers as the data, which allows for faster processing.

So, even when dealing with large volumes of unstructured data, Hadoop MapReduce can process
terabytes of data in minutes and petabytes in hours.
6. Simple model of programming

Amongst the various features of Hadoop MapReduce, one of the most important is that it is
based on a simple programming model. This allows programmers to develop MapReduce programs
that handle tasks easily and efficiently.

MapReduce programs can be written in Java, which is not very hard to pick up and is also
widely used. So, anyone can learn to write MapReduce programs and meet their data
processing needs.

7. Parallel Programming

One of the major aspects of MapReduce programming is its parallel processing. It divides
tasks in a manner that allows their execution in parallel.
Parallel processing lets multiple processors execute these divided tasks, so the
entire program runs in less time.

8. Availability and resilient nature

Whenever data is sent to an individual node, the same set of data is forwarded to some
other nodes in the cluster. So, if any particular node fails, there are always copies
present on other nodes that can still be accessed whenever needed. This assures
high availability of data.

One of the major features offered by Hadoop is its fault tolerance. The Hadoop MapReduce
framework has the ability to quickly recognize faults that occur.

It then applies a quick, automatic recovery solution. This feature makes it a game-changer
in the world of big data processing.

Let’s look at some examples of MapReduce usage in industry:

• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection

• At Amazon:
– Product clustering
– Statistical machine translation
