MapReduce
MapReduce is a programming model for writing applications that can process Big Data in parallel
on multiple nodes. It provides the analytical capabilities needed to work with huge volumes of
complex data.
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube collects and manages on a
daily basis can fall under the category of Big Data. However, Big Data is not only about scale and
volume; it also involves one or more of the following aspects − Velocity, Variety, Volume, and
Complexity.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. This traditional
model is not suitable for processing huge volumes of scalable data, which cannot be accommodated
by standard database servers. Moreover, the centralized system creates a bottleneck while
processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, which may require a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.
Let us try to understand the two tasks Map & Reduce with the help of a small diagram −
MapReduce - Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around
500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows
how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
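The following is a minimal sketch of how the Tokenize and Filter steps above could be written as a Hadoop mapper; the class name, the stop-word list, and the lower-casing are illustrative assumptions, not Twitter's actual code. The Count and Aggregate Counters steps are then handled by a summing reducer such as the WordCount reducer shown later.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: tokenizes tweet text, filters unwanted words, and
// emits (token, 1) pairs; the counting happens in the reducer.
public class TweetTokenMapper extends Mapper<Object, Text, Text, IntWritable> {

   private static final Set<String> STOP_WORDS =
         new HashSet<String>(Arrays.asList("a", "an", "the", "is", "it", "rt"));

   private final static IntWritable ONE = new IntWritable(1);
   private final Text token = new Text();

   @Override
   public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
      // Tokenize: split the tweet into words
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
         String word = itr.nextToken().toLowerCase();
         // Filter: drop unwanted words
         if (!STOP_WORDS.contains(word)) {
            token.set(word);
            context.write(token, ONE);
         }
      }
   }
}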
MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as
input by Reducer class, which in turn searches matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the
Map and Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce
implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
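By default, the keys are sorted in their natural ascending WritableComparable order. If a different order is needed, a custom comparator can be registered with job.setSortComparatorClass(). The class below is a small illustrative sketch (not part of the tutorial) that reverses the order for IntWritable keys −

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative comparator that inverts the default (ascending) sort order
// of the intermediate keys, assumed here to be IntWritable.
public class DescendingIntComparator extends WritableComparator {

   protected DescendingIntComparator() {
      super(IntWritable.class, true);   // true = create key instances for comparison
   }

   @Override
   @SuppressWarnings("rawtypes")
   public int compare(WritableComparable a, WritableComparable b) {
      return -1 * a.compareTo(b);       // invert the natural order
   }
}

// Registered on the job with: job.setSortComparatorClass(DescendingIntComparator.class);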
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how searching works with the help of
an example.
Example
The following example shows how MapReduce employs the searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because the employee data
was imported repeatedly from the database tables. See the following illustration.
The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the Map phase as
key-value pairs of employee name and salary. Using the searching technique, the combiner
will check each employee's salary to find the highest-salaried employee in each file. See
the following snippet.
if(v(salary) > Max){
   Max = v(salary);
}
else{
   Continue checking;
}
The expected combiner output (the highest-salaried employee found in each file) is as follows −
<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>
Reducer phase − From each file, you will find the highest-salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is used among the four <k, v> pairs coming from the four input files.
The final output should be as follows −
<gopal, 50000>
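One possible way to code this search in Hadoop is sketched below; it is an illustrative assumption, not the tutorial's exact implementation. The mapper is assumed to emit every record under a single constant key with the value formatted as "name,salary", so the same class can serve as both the combiner (per-mapper maximum) and the reducer (global maximum) −

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner/reducer: scans all "name,salary" candidates for a key
// and keeps only the highest-paid employee.
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {

   @Override
   public void reduce(Text key, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      String bestName = null;
      int bestSalary = Integer.MIN_VALUE;
      for (Text value : values) {
         String[] parts = value.toString().split(",");   // "name,salary"
         int salary = Integer.parseInt(parts[1].trim());
         if (salary > bestSalary) {                       // the "searching" step
            bestSalary = salary;
            bestName = parts[0];
         }
      }
      // Re-emit in the same "name,salary" format under the same key, so this
      // class can be registered as both the combiner and the reducer.
      context.write(key, new Text(bestName + "," + bestSalary));
   }
}

Registered with job.setCombinerClass(MaxSalaryReducer.class) and job.setReducerClass(MaxSalaryReducer.class), each combiner forwards one candidate per mapper and the reducer keeps <gopal, 50000> among them.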
Indexing
Indexing is normally used to point to particular data and its address. MapReduce performs batch
indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search
engines like Google and Bing use the inverted indexing technique. Let us try to understand how
indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and
their contents are given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is"
appears in the files T[0], T[1], and T[2].
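A minimal sketch of an inverted-index job is shown below; the class names are illustrative, and the document name is assumed to be obtainable from the input split (which holds for FileInputFormat-based jobs) −

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emits (term, documentName) for every term in the line.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
   @Override
   public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
      String docName = ((FileSplit) context.getInputSplit()).getPath().getName();
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
         context.write(new Text(itr.nextToken().toLowerCase()), new Text(docName));
      }
   }
}

// Reducer: collects the distinct documents in which each term appears.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
   @Override
   public void reduce(Text key, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      Set<String> docs = new HashSet<String>();
      for (Text value : values) {
         docs.add(value.toString());
      }
      context.write(key, new Text(docs.toString()));   // e.g. "is" -> the set of files containing it
   }
}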
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
Term Frequency (TF)
TF measures how frequently a particular term occurs in a document. It is calculated as the number
of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
IDF measures the importance of a term. It is calculated as the number of documents in the text
database divided by the number of documents in which a specific term appears.
While computing TF, all terms are considered equally important; that is, TF counts the frequency
even of common words like "is", "a", and "what". Thus we need to weigh down the frequent terms
while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it)
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for
hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
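The same arithmetic can be checked with a few lines of Java; the base-10 logarithm is an assumption that matches the IDF value of 4 used above −

// Quick check of the worked example: TF = 50/1000, IDF = log10(10M/1000).
public class TfIdfExample {
   public static void main(String[] args) {
      double tf  = 50.0 / 1000.0;                       // 'hive' appears 50 times in 1000 words
      double idf = Math.log10(10_000_000.0 / 1_000.0);  // 10 million documents, term in 1000 of them
      System.out.println(tf * idf);                     // prints 0.2
   }
}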
MapReduce - API
In this chapter, we will take a close look at the classes and their methods that are involved in the
operations of MapReduce programming. We will primarily keep our focus on the following −
JobContext Interface
Job Class
Mapper Class
Reducer Class
JobContext Interface
The JobContext interface is the super-interface for all the classes that define different jobs in
MapReduce. It gives a read-only view of the job to the tasks while they are running.
The main sub-interfaces of the JobContext interface are MapContext and ReduceContext, which
define the context given to the Mapper and to the Reducer respectively.
The Job class is the main class that implements the JobContext interface.
Job Class
The Job class is the most important class in the MapReduce API. It allows the user to configure the
job, submit it, control its execution, and query the state. The set methods only work until the job is
submitted, afterwards they will throw an IllegalStateException.
Normally, the user creates the application, describes the various facets of the job, and then submits
the job and monitors its progress.
Here is an example of how to submit a job −
// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);
Constructors
1 Job()
2 Job(Configuration conf)
Methods
1 getJobName()
User-specified job name.
2 getJobState()
Returns the current state of the Job.
3 isComplete()
Checks if the job is finished or not.
4 setInputFormatClass()
Sets the InputFormat for the job.
5 setJobName(String name)
Sets the user-specified job name.
6 setOutputFormatClass()
Sets the Output Format for the job.
7 setMapperClass(Class)
Sets the Mapper for the job.
8 setReducerClass(Class)
Sets the Reducer for the job.
9 setPartitionerClass(Class)
Sets the Partitioner for the job.
10 setCombinerClass(Class)
Sets the Combiner for the job.
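To see several of these methods used together, here is a minimal driver sketch; the mapper and reducer class names (WordCountMapper, WordCountReducer) and the command-line paths are illustrative assumptions −

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver wiring a WordCount-style job with the Job setters above.
public class WordCountDriver {
   public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration());

      job.setJarByClass(WordCountDriver.class);
      job.setJobName("word count");                    // setJobName(String name)

      job.setMapperClass(WordCountMapper.class);       // setMapperClass(Class)
      job.setCombinerClass(WordCountReducer.class);    // setCombinerClass(Class)
      job.setReducerClass(WordCountReducer.class);     // setReducerClass(Class)

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}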
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-
value pairs. Maps are the individual tasks that transform the input records into intermediate
records. The transformed intermediate records need not be of the same type as the input records.
A given input pair may map to zero or many output pairs.
Method
map is the most prominent method of the Mapper class. The syntax is defined below −
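protected void map(KEYIN key, VALUEIN value, Context context)
   throws IOException, InterruptedException
Here KEYIN and VALUEIN are the input key and value types declared on Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.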
This method is called once for each key-value pair in the input split.
Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values
that share a key to a smaller set of values. Reducer implementations can access the Configuration
for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases −
Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the
network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers
may have output the same key). The shuffle and sort phases occur simultaneously, i.e.,
while outputs are being fetched, they are merged.
Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each
<key, (collection of values)> pair in the sorted inputs.
Method
reduce is the most prominent method of the Reducer class. The syntax is defined below −
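protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
   throws IOException, InterruptedException
Here KEYIN and VALUEIN are the intermediate key and value types declared on Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.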
This method is called once for each key, with the corresponding collection of values for that key.
MapReduce by examples
MapReduce inspiration
The name MapReduce comes from functional programming:
- map is the name of a higher-order function that applies a given function
to each element of a list. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.map(x => x * x) == List(1,4,9,16,25)
MapReduce takes an input, splits it into smaller parts, executes the code of
the mapper on every part, and then gives all the results to one or more reducers
that merge them into a single result.
src: http://en.wikipedia.org/wiki/Map_(higher-order_function)
http://en.wikipedia.org/wiki/Fold_(higher-order_function)
How does Hadoop work?
Init
- Hadoop divides the input file stored on HDFS into splits (typically of the size
of an HDFS block) and assigns every split to a different mapper, trying to run
each mapper on the node where its split physically resides
Mapper
- locally, Hadoop reads the split of the mapper line by line
- locally, Hadoop calls the map() method of the mapper for every line, passing the
line as the value (and, with the default TextInputFormat, its byte offset as the key)
- the mapper computes its application logic and emits other key/value pairs
Shuffle and sort
- locally, Hadoop's partitioner divides the emitted output of the mapper into
partitions, each of which is sent to a different reducer
- locally, Hadoop collects all the different partitions received from the
mappers and sorts them by key
Reducer
- locally, Hadoop reads the aggregated partitions line by line
- locally, Hadoop calls the reduce() method of the reducer once for every key,
passing the key and its list of values
- the reducer computes its application logic and emits other key/value pairs
- locally, Hadoop writes the emitted key/value pairs to HDFS
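As an illustration of the partitioning step above (an assumption for this guide, not code from the slides), a custom partitioner extends org.apache.hadoop.mapreduce.Partitioner; the default HashPartitioner simply hashes the key −

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys starting with 'a'..'m' go to one partition,
// the rest to another. The default HashPartitioner instead computes
// (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
   @Override
   public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      String k = key.toString();
      char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
      return (first <= 'm' ? 0 : 1) % numReduceTasks;
   }
}

// Registered on the job with: job.setPartitionerClass(AlphabetPartitioner.class);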
Serializable vs Writable
- Serializable stores the class name and the object representation to the
stream; other instances of the class are referred to by a handle to the
class name: this approach is not usable with random access
- For the same reason, the sorting needed for the shuffle and sort phase
cannot be done with Serializable
Writable wrappers
Java type     Writable implementation
boolean       BooleanWritable
int           IntWritable
double        DoubleWritable
String        Text
(among others)
For example, a composite SumCount Writable that wraps a sum and a count can be written as follows:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class SumCount implements WritableComparable<SumCount> {

   DoubleWritable sum;
   IntWritable count;

   public SumCount() {
      set(new DoubleWritable(0), new IntWritable(0));
   }
@Override
public void write(DataOutput dataOutput) throws IOException {
sum.write(dataOutput);
count.write(dataOutput);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
sum.readFields(dataInput);
count.readFields(dataInput);
}
// getters, setters and Comparable overridden methods are omitted
}
Glossary
Job − The whole process to execute: the input data, the mapper and reducer execution, and the
output data.
Task − Every job is divided among the several mappers and reducers; a task is the portion of the
job that goes to a single mapper or reducer.
Split − The input file is divided into several splits (the suggested split size is the HDFS block
size, 64 MB).
WordCount
(the Hello World! for MapReduce, available in Hadoop sources)
Input Data: a plain-text file
WordCount mapper
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
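   // Body as in the standard Hadoop WordCount example; 'word' (a reusable Text)
   // and 'one' (an IntWritable holding 1) are fields of the mapper class.
   StringTokenizer itr = new StringTokenizer(value.toString());
   while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit (word, 1)
   }
}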
WordCount reducer
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);   // 'result' is an IntWritable field of the reducer class
context.write(key, result);
}
}
WordCount
Results:
a 936
ab 6
abbot 3
abbott 2
abbreviated 1
abide 1
ability 1
able 9
ablest 2
abolished 1
abolition 1
about 40
above 22
abroad 1
abrogation 1
abrupt 1
abruptly 1
absence 4
absent 1
absolute 2
...