Unit 2 Topic 5: Developing a MapReduce Application
• Writing a program in MapReduce follows a certain pattern.
• You start by writing your map and reduce functions, ideally with
unit tests to make sure they do what you expect.
• Then you write a driver program to run the job, which you can run
from your IDE on a small subset of the data to check that it is
working.
• If it fails, you can use your IDE’s debugger to find the source of the
problem.
• If the combine function is used, it has the same form as the
reduce function and the output is fed to the reduce function.
• This may be illustrated as follows:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
• Note that the combine and reduce functions have the same form,
except that the reduce output types are named K3 and V3; here K3 is
the same type as K2 and V3 the same as V2.
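These three signatures can be illustrated with a minimal word-count sketch in plain Python (the function names `map_fn`, `combine_fn`, and `reduce_fn` are illustrative, not Hadoop API names):

```python
from collections import defaultdict

# map: (K1, V1) -> list(K2, V2); here K1 is a line offset, V1 the line text
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# combine: (K2, list(V2)) -> list(K2, V2); pre-aggregates map output locally
def combine_fn(word, counts):
    return [(word, sum(counts))]

# reduce: (K2, list(V2)) -> list(K3, V3); here K3 is K2 and V3 is V2
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def group(pairs):
    """Group a list of (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

lines = ["the quick fox", "the lazy dog"]
mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
combined = [kv for k, vs in group(mapped).items() for kv in combine_fn(k, vs)]
result = dict(kv for k, vs in group(combined).items() for kv in reduce_fn(k, vs))
```

Because combine's output type matches its input type, the framework is free to apply it zero or more times without changing the final result.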
• The partition function operates on the intermediate key-value
types.
• It controls the partitioning of the keys of the intermediate map
outputs.
• The partition is typically derived from the key by a hash function.
• The total number of partitions is the same as the number of
reduce tasks for the job.
• The partition is determined by the key alone; the value is ignored.
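The default behavior can be sketched as follows; `hash_partition` is an illustrative stand-in for Hadoop's HashPartitioner, with Python's built-in `hash()` playing the role of Java's `hashCode()`:

```python
def hash_partition(key, value, num_reduce_tasks):
    # The partition is derived from the key alone; the value is ignored.
    return hash(key) % num_reduce_tasks

# Every record with the same key lands in the same partition,
# so a single reduce task sees all values for that key.
assert hash_partition("France", 197194.15, 4) == hash_partition("France", 0, 4)
```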
Input Formats
• Hadoop has to accept and process a variety of formats, from text
files to databases.
• A chunk of input, called an input split, is processed by a single
map task.
• Each split is further divided into logical records that are passed
to the map function as key-value pairs.
• In the context of a database, a split means reading a range of
tuples from an SQL table, as done by DBInputFormat, which
produces LongWritables containing record numbers as keys and
DBWritables as values.
• In the Java API, a split is represented by the abstract InputSplit
class, which exposes the split's length in bytes (getLength()) and
the nodes where its data is stored (getLocations()).
Flexibility
• MapReduce programming enables companies to access new
sources of data and to operate on many different types of data.
Security and Authentication
• The MapReduce programming model uses the HBase and HDFS
security platforms, which allow only authenticated users to
operate on the data.
Cost-effective solution
• Hadoop’s scalable architecture with the MapReduce programming
framework allows the storage and processing of large data sets in
a very affordable manner.
Fast
• Even when dealing with large volumes of unstructured data,
Hadoop MapReduce takes just minutes to process terabytes of
data, and can process petabytes of data in just an hour.
Availability
• If any particular node suffers a failure, there are always copies of the
data on other nodes that can still be accessed whenever needed.
Resilient nature
• One of the major features offered by Apache Hadoop is its fault tolerance:
the Hadoop MapReduce framework has the ability to quickly recognize
faults that occur.
Real-world Map Reduce
• A real-world MapReduce example on e-commerce transaction data
is described here using Python streaming.
• A real-world e-commerce transactions dataset from a UK-based
retailer is used.
• https://idevji.com/blog/2018/08/08/mapreduce-real-world-
example/
Outline
• The dataset consists of real-world e-commerce data from a
UK-based retailer
• The dataset is provided by Kaggle
• Our goal is to find the country-wise total sales
• Mapper multiplies quantity and unit price
• Mapper emits the key-value pair (country, sales)
• Reducer sums up all pairs for the same country
• Final output is (country, sales) for all countries
Data
• Download: Link to Kaggle Dataset
• Source: The dataset has real-life transaction data from a UK retailer.
• Format: CSV
• Size: … MB
• Columns:
• InvoiceNo
• StockCode
• Description
• Quantity
• InvoiceDate
• UnitPrice
• CustomerID
• Country
Problem
• In this MapReduce real-world example, we calculate the total sales
for each country from the given dataset.
Approach
• Firstly, our data doesn't have a Total column, so it has to be
computed from the Quantity and UnitPrice columns as
Total = Quantity × UnitPrice.
What Mapper Does
• Read the data
• Convert data into proper format
• Calculate total
• Print output as key-value pair CountryName:Total
What Reducer Does
• Read input from mapper
• Check for an existing country key in the dictionary
• Add total to existing total value
• Print all key-value pairs
• Python Code for Mapper (MapReduce Real World Example)
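A minimal Hadoop Streaming mapper along the lines described above might look like this. It is a sketch, not the blog post's exact code; the column positions assume the dataset layout listed earlier, and the tab-separated output is the Streaming convention:

```python
#!/usr/bin/env python
# Streaming mapper: emit "Country<TAB>Quantity*UnitPrice" per transaction.
import csv
import sys

def mapper(stream):
    reader = csv.reader(stream)
    for row in reader:
        # Columns: InvoiceNo, StockCode, Description, Quantity,
        #          InvoiceDate, UnitPrice, CustomerID, Country
        if len(row) < 8 or row[0] == "InvoiceNo":
            continue  # skip short rows and the CSV header line
        try:
            total = float(row[3]) * float(row[5])  # Quantity * UnitPrice
        except ValueError:
            continue  # skip rows with malformed numeric fields
        print(f"{row[7]}\t{total}")

if __name__ == "__main__":
    mapper(sys.stdin)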
• Python Code for Reducer (MapReduce Real World Example)
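A matching Streaming reducer might be sketched as follows; it accumulates totals in a dictionary, as described in the "What Reducer Does" steps (again a sketch, not the post's exact code):

```python
#!/usr/bin/env python
# Streaming reducer: sum the per-record totals for each country.
import sys
from collections import defaultdict

def reducer(stream):
    sales = defaultdict(float)  # country -> running total
    for line in stream:
        try:
            country, total = line.rstrip("\n").split("\t")
            sales[country] += float(total)
        except ValueError:
            continue  # skip malformed lines
    for country, total in sales.items():
        print(f"{country}\t{round(total, 2)}")

if __name__ == "__main__":
    reducer(sys.stdin)
```

A dictionary works here because one reducer sees all the pairs; the more idiomatic Streaming pattern exploits the fact that input arrives sorted by key and keeps only a running sum per key.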
• Output:

Country                  Sales
Canada                   3599.68
Brazil                   1143.6
Italy                    16506.03
Czech Republic           707.72
USA                      1730.92
Lithuania                1661.06
Unspecified              4746.65
France                   197194.15
Norway                   34908.13
Bahrain                  548.4
Israel                   7867.42
Australia                135330.19
Singapore                9054.69
Iceland                  4299.8
Channel Islands          19950.54
Germany                  220791.78
Belgium                  40752.83
European Community       1291.75
Hong Kong                10037.84
Spain                    54632.86
EIRE                     262112.48
Netherlands              283440.66
Denmark                  18665.18
Poland                   7193.34
Finland                  22226.69
Saudi Arabia             131.17
Sweden                   36374.15
Malta                    2503.19
Switzerland              56199.23
Portugal                 29272.34
United Arab Emirates     1877.08
Lebanon                  1693.88
RSA                      1002.31
United Kingdom           8148025.164
Austria                  10149.28
Greece                   4644.82
Japan                    34616.06
Cyprus                   12791.31
Conclusions
• Mapper picks up a record and emits the country and total for that
record
• Mapper repeats this process for all 5.42k records
• Now we have 5.42k key-value pairs
• Reducer's role is to combine these pairs, summing the values,
until all keys are unique
THANK
YOU