Unit 4 Map Reduce
Parallel computing
Parallel computing refers to the process of breaking down larger problems into smaller,
independent, often similar parts that can be executed simultaneously by multiple processors
communicating via shared memory, the results of which are combined upon completion as part
of an overall algorithm. The primary goal of parallel computing is to increase available
computation power for faster application processing and problem solving.
Parallel computing infrastructure is typically housed within a single datacenter where several
processors are installed in a server rack; computation requests are distributed in small chunks
by the application server that are then executed simultaneously on each server.
There are generally four types of parallel computing, available from both proprietary and
open-source parallel computing vendors: bit-level parallelism, instruction-level parallelism,
task parallelism, and superword-level parallelism:
Bit-level parallelism: increases processor word size, which reduces the quantity of
instructions the processor must execute in order to perform an operation on variables
greater than the length of the word.
Instruction-level parallelism: the hardware approach works upon dynamic parallelism,
in which the processor decides at run-time which instructions to execute in parallel; the
software approach works upon static parallelism, in which the compiler decides which
instructions to execute in parallel
Task parallelism: a form of parallelization of computer code across multiple processors
that runs several different tasks at the same time on the same data
Superword-level parallelism: a vectorization technique that can exploit parallelism of
inline code
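Task parallelism, for instance, can be sketched in ordinary Python (an illustrative example not drawn from the text above; the tasks and data are assumptions for the sketch):

```python
# Sketch of task parallelism: several different tasks run at the same
# time on the same data, each producing a different result.
from concurrent.futures import ThreadPoolExecutor

data = [3, 1, 4, 1, 5, 9, 2, 6]

# Three distinct tasks applied to the same data in parallel.
tasks = [sum, min, max]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(task, data) for task in tasks]
    results = [f.result() for f in futures]

print(results)  # [31, 1, 9]
```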
2. Symmetric multiprocessing: multiple identical processors are connected
to a single, shared main memory with full access to all common resources and devices.
Each processor has a private cache memory, may be connected using on-chip mesh
networks, and can work on any task no matter where the data for that task is located in
memory.
3. Distributed computing: Distributed system components are located on different
networked computers that coordinate their actions by communicating via pure HTTP,
RPC-like connectors, and message queues. Significant characteristics of distributed
systems include independent failure of components and concurrency of components.
Distributed programming is typically categorized as client–server, three-tier, n-tier, or
peer-to-peer architectures. There is much overlap in distributed and parallel computing
and the terms are sometimes used interchangeably.
4. Massively parallel computing: refers to the use of numerous computers or computer
processors to simultaneously execute a set of computations in parallel. One approach
involves the grouping of several processors in a tightly structured, centralized computer
cluster. Another approach is grid computing, in which many widely distributed
computers work together and communicate via the Internet to solve a particular
problem.
MapReduce
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The Reduce task takes the output from
a map as input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally,
the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored
in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
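The three stages above can be simulated in a few lines of plain Python (a toy sketch, not Hadoop code; the input records and the max-per-key reduce function are assumptions for the example):

```python
from collections import defaultdict

# Toy input: one record per line, "city temperature".
lines = ["delhi 30", "pune 25", "delhi 35", "pune 28"]

# Map stage: each input line becomes a (key, value) pair.
mapped = []
for line in lines:
    city, temp = line.split()
    mapped.append((city, int(temp)))

# Shuffle stage: group all values by their key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce stage: combine each key's values into a single output.
reduced = {city: max(temps) for city, temps in groups.items()}
print(reduced)  # {'delhi': 35, 'pune': 28}
```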
Parallel Efficiency of Map-Reduce
It is worth trying to work out the parallel efficiency of MapReduce. Assume that the data
produced after the map phase is σ times the original data size D, i.e., σD. Further, assume
there are P processors, each of which performs map and reduce operations depending on which
phase it is used in, so no processors are wasted. The algorithm itself is assumed to do wD
useful work; w is not necessarily a constant (it could grow with D), but the point is that some
amount of useful work, wD, is done even on a single processor.

Now consider the overheads of doing the computation wD using MapReduce. After the map
operation, instead of D data items, we have σD data items in P pieces, so each mapper writes
σD/P data to its local disk, and there is some overhead associated with writing this data.
Next, this data has to be read by each reducer before it can begin the reduce operation. Each
reducer reads σD/P², that is, one Pth of the data, from a particular mapper; since one Pth of
each mapper's output goes to each of the P reducers, a reducer must read P such pieces from P
different mappers, so the communication time a reducer spends gathering the data it requires
from the different mappers is once again σD/P.

Overhead per processor: σD/P (local writes) + σD/P (remote reads) = 2σD/P

So the total overhead, that is, work that would not have to be done if we did not have
parallelism (writing data to disk and reading data from remote mappers), is 2σD/P per
processor. Let c be a constant that measures how much time it takes to write one data item to
disk or to read one data item remotely from a different mapper. Then wD is the time the useful
work takes on one processor; with P processors we would hope for wD/P, but we also pay the
extra overhead 2cσD/P, giving a parallel efficiency of

    efficiency = wD / [P (wD/P + 2cσD/P)] = 1 / (1 + 2cσ/w)
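The efficiency expression can be checked numerically with a minimal Python sketch (the parameter values below are illustrative assumptions, not measurements):

```python
# Parallel efficiency of MapReduce: useful work wD on one processor,
# per-processor overhead 2*c*sigma*D/P on P processors, so
# efficiency = wD / (P * (wD/P + 2*c*sigma*D/P)) = 1 / (1 + 2*c*sigma/w).
# Note that D and P cancel: efficiency depends only on w, sigma, and c.
def mapreduce_efficiency(w, sigma, c):
    return 1.0 / (1.0 + 2.0 * c * sigma / w)

# Illustrative values: map output the same size as the input (sigma = 1),
# I/O cost per item one quarter of the useful work per item (c = 0.25).
print(mapreduce_efficiency(w=1.0, sigma=1.0, c=0.25))  # 0.666...
```

As the sketch makes explicit, efficiency improves when the useful work per item w grows relative to the I/O cost cσ, and no amount of extra processors changes it.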
Google Mapreduce Infrastructure
The user submits MapReduce tasks for execution by utilizing client libraries, which are
responsible for sending input data files, registering the map and reduce functions and returning
control to the user once the task is done. MapReduce applications may be executed on a general
distributed infrastructure with job-scheduling and distributed storage capabilities. On the
distributed infrastructure, two types of processes are run: master process and worker processes.
The master process is responsible for directing the execution of map and reduce tasks, as well
as partitioning and rearranging the map task’s intermediate output to feed the reduce tasks. The
worker processes are used to host the execution of map and reduce operations, as well as to
offer fundamental I/O facilities for interacting with input and output files. In a MapReduce
computation, input files are divided into splits of usually 16 to 64 MB and stored in a
distributed file system (e.g., HDFS). The master process creates the map tasks and, balancing
the load, allocates an input split to each of them.
Input and Output buffers are utilized by worker processes to optimize the efficiency of the map
and reduce operations. Output buffers for map operations are dumped to the disk regularly to
produce intermediate files. To equally separate the output of map operations, intermediate files
are partitioned using a user-defined function. The positions of these pairings are then sent to
the master process, which passes this information to the reduced tasks, which may gather the
needed input through a remote procedure call to read from the map tasks local storage. The key
range is then sorted, and any keys that have the same value are grouped. Finally, the reduction
job is run to generate the final result, which is saved in the global file system. This procedure
is fully automated; users may control it by providing (in addition to the map and reduce
functions) the number of map jobs, the number of partitions into which the final output is
divided, and the partition function for the intermediate key range.
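The partitioning of intermediate pairs can be sketched as follows (a hypothetical Python illustration: Hadoop's default partitioner uses the key's hashCode modulo the number of reducers, while CRC32 is used here only so the sketch is deterministic; the pair data and the value of R are assumptions):

```python
import zlib

R = 3  # assumed number of reduce tasks (partitions)

def partition(key, num_reducers=R):
    # A deterministic stand-in for a hash partitioner: intermediate
    # (key, value) pairs are split into R partitions, one per reducer.
    return zlib.crc32(key.encode()) % num_reducers

pairs = [("bear", 1), ("car", 1), ("river", 1), ("bear", 1)]
buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition(key)].append((key, value))

# Every occurrence of the same key lands in the same bucket, so a
# single reducer sees all the values for that key.
```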
Fig: Google MapReduce Infrastructure
Several modifications and variants of the original MapReduce architecture have been proposed
to expand the MapReduce application area and give developers a more user-friendly interface
for building distributed algorithms; Hadoop, Pig, Hive, and Map-Reduce-Merge are some of
these variants.
1. Apache Hadoop is a series of software projects that enable scalable and reliable distributed
computing. Hadoop as a whole is an open-source implementation of the MapReduce
architecture. It mainly consists of two projects: Hadoop Distributed File System (HDFS)
and Hadoop MapReduce. HDFS is an implementation of the Google file system and
Hadoop MapReduce offers the same functionality and abstraction as Google MapReduce.
Originally developed and supported by Yahoo, Hadoop is currently the most mature and
comprehensive data cloud application, with a very active community of developers and
users. Yahoo operates the world's largest Hadoop cluster, consisting of
40,000 machines and more than 300,000 cores, which can be used by academic institutions
around the world.
2. Apache Pig is a platform for analysing large data sets that consists of a high-level language
for expressing data analysis programs and an infrastructure for evaluating these programs. The
outstanding property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn allows them to process very large data sets. Pig's
infrastructure layer currently consists of a compiler that generates sequences of map-reduce
programs, for which large-scale parallel implementations are already in place. Pig's
language layer currently consists of a text language called Pig Latin, whose key features
are ease of programming, optimization opportunities, and extensibility.
3. The Apache Hive data warehouse software simplifies the reading, writing, and management of
large data sets on distributed storage using SQL. It includes tools for quick data
summarization, ad hoc queries, and analysis of big datasets stored in Hadoop. The Hive
architecture has features identical to a traditional data warehouse, but it does not perform
well in terms of query latency, so it is not a viable option for online transaction processing.
Built on top of Apache Hadoop, Hive provides tools that enable easy access to data via
SQL, a mechanism to impose structure on a variety of data formats, access to files stored
either directly in Apache HDFS or in other data storage systems such as Apache HBase, and
query execution via Apache Tez, Apache Spark, or MapReduce. Hive's main advantages are
its capacity to scale out, as it is built on the Hadoop architecture, and its ability to provide a
data warehouse infrastructure in situations where a Hadoop system is already running.
4. Map-Reduce-Merge is a modification of the MapReduce paradigm that introduces a third
phase, 'Merge', into the traditional MapReduce pipeline, which allows data that has previously
been partitioned and sorted by the map and reduce modules to be combined efficiently.
The Map-Reduce-Merge framework facilitates the handling of heterogeneous linked
datasets by providing an abstraction that can express standard relational algebra operations
as well as numerous join methods.
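A toy sketch of what the extra Merge phase provides (illustrative Python, not the framework's actual API; the employee/salary datasets are assumptions for the example):

```python
# Two datasets are first reduced separately, then merged on a shared
# key: an equi-join, one of the relational operations Merge expresses.
employees = {1: "alice", 2: "bob"}        # reduced output A: id -> name
salaries = {1: 5000, 2: 6000, 3: 7000}    # reduced output B: id -> salary

# Merge phase: combine the two partitioned, sorted reduce outputs
# on the keys they have in common.
merged = {k: (employees[k], salaries[k])
          for k in sorted(employees.keys() & salaries.keys())}
print(merged)  # {1: ('alice', 5000), 2: ('bob', 6000)}
```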
A Word Count Example
Suppose a file sample.txt contains the following text:
Deer, Bear, River, Car, Car, River, Deer, Car and Bear
Now suppose we have to perform a word count on sample.txt using MapReduce. So, we
will be finding the unique words and the number of occurrences of those unique words.
First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each
of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
Now, a list of key-value pairs will be created where the key is the individual
word and the value is one. So, for the first line (Deer Bear River) we have 3 key-value pairs:
Deer, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
Now, each Reducer counts the values which are present in that list of values. As shown in
the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the
number of ones in the very list and gives the final output as — Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
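The whole word-count flow described above can be simulated in a few lines of Python (a toy sketch of the map, shuffle/sort, and reduce steps, not actual Hadoop code):

```python
from collections import defaultdict

text = "Deer Bear River Car Car River Deer Car Bear"

# Map: emit a (word, 1) pair for every token.
mapped = [(word, 1) for word in text.split()]

# Shuffle/sort: group the 1s by word, so each key reaches one reducer.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the list of 1s for each word.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```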
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part
of the job simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which
helps us process the data using different machines. As the data is processed by multiple
machines in parallel instead of by a single machine, the time taken to process the data is
reduced.
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data
in the MapReduce Framework. In the traditional system, we used to bring data to the
processing unit and process it. But, as the data grew and became very huge, bringing this
huge amount of data to the processing unit posed the following issues:
Moving huge data to processing is costly and deteriorates the network performance.
Processing takes time as the data is processed by a single unit which becomes the
bottleneck.
Master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to
the data. As you can see in the above image, the data is distributed among multiple
nodes, where each node processes the part of the data residing on it. This gives us the
following advantages:
It is very cost effective to move the processing unit to the data.
The processing time is reduced as all the nodes are working with their part of the data in
parallel.
Every node gets a part of the data to process and therefore, there is no chance of a node
getting overburdened.