Understanding Map-Reduce Framework

MapReduce is a distributed data processing model that enables parallel processing of large datasets across clusters of machines, primarily utilizing the Map and Reduce functions to handle data efficiently. The document illustrates its application through a weather dataset example, detailing the phases of input formatting, mapping, reducing, and outputting results. Additionally, it discusses key-value databases, their suitability for various applications, and compares different key-value store systems like Riak and Apache Cassandra.


1

Map-Reduce
• Design pattern to take advantage of clustered machines to do processing in parallel
• While keeping as much work and data as possible local to a single machine
• A widely used open-source implementation is part of the Hadoop project.
• Map function: reads the records from the database and emits key-value pairs.
• Reduce function: takes several key-value pairs with the same key and aggregates them into one.
2

MapReduce
What is it?
• A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Can be used with Java, Ruby, Python, C++ and more.
• Inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.
• MapReduce process flow (figure)
3

MapReduce
Problem example: Weather Dataset
Create a program that mines weather data.
• Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data. Source: NCDC.
• The data is stored using a line-oriented ASCII format, in which each line is a record.
• Mission: calculate the max temperature each year around the world.
4

Format of a National Climate Data Center record


0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
CN
010000 # visibility distance (meters)
1 # quality code
N9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
5

Input formatting phase

• In our example, the input is an NCDC log file:
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
• Output (to MR framework), as <offset, line> pairs:
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
6

Map phase

• The input to our map phase is the <offset, line_text> pairs shown above.
• The map function pulls out the year and the air temperature, since these are the only fields we are interested in.
• The map function also drops bad records: it filters out temperatures that are missing, suspect, or erroneous.
• Map output (<year, temp> pairs, temperatures in tenths of a degree Celsius):
(1950, 22), (1950, -11), (1949, 111), (1949, 78)
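As a concrete illustration, here is a sketch of such a mapper in Java, assuming the newer Hadoop MapReduce API; the character offsets for the year and temperature fields are an assumption about the exact NCDC record layout shown earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);          // observation year
    int airTemperature;
    if (line.charAt(87) == '+') {                  // parseInt rejects a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Drop bad records: missing, suspect, or erroneous temperatures.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}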
7

MR framework processing phase

• The output from the map function is processed by the MR framework before being sent to the reduce function.
• This processing sorts and groups the key-value pairs by key.
• MR framework processing output (<year, temperatures> pairs):
(1949, [111, 78]), (1950, [22, -11])
8

Reduce phase

• The input to our reduce phase is the <year, temperatures> pairs.
• All the reduce function has to do now is iterate through the list and pick out the maximum reading.
• Reduce output:
(1949, 111), (1950, 22)
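A matching reducer sketch in Java, again assuming the newer Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Iterate through all temperatures for this year, keeping the maximum.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}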
9

Data output phase

• The input to the data output class is the <year, max temperature> pairs from the reduce function.
• When using the default Hadoop output formatter, the output is written to a pre-defined directory, which contains one output file per reducer.
• Output formatter file output (tab-separated, one line per key):
1949	111
1950	22
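Tying the phases together, a minimal driver sketch; the class and job names are illustrative, and the input and output paths are taken from the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // NCDC input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // one file per reducer

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}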
10

Map-Reduce logical data flow (figure)
11

Map-Reduce Example
• Let's assume we have chosen orders as our aggregate, with each order having line items.
• Each line item has a product ID, quantity, and the price charged. This aggregate makes a lot of sense, as usually people want to see the whole order in one access.
• Sales analysis people want to see a product and its total revenue for the last seven days. This report doesn't fit the aggregate structure.
• Such a situation calls for map-reduce.
• A map is a function whose input is a single aggregate and whose output is a bunch of key-value pairs.
• The reduce function takes multiple map outputs with the same key and combines their values. It reduces them down to one pair, with the totals for the quantity and revenue. (A sketch of both functions follows.)
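A minimal in-memory sketch of this example in plain Java; Order, LineItem, and Totals are hypothetical types standing in for the stored aggregate:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProductRevenue {

  record LineItem(String productId, int quantity, double price) {}
  record Order(List<LineItem> items) {}
  record Totals(int quantity, double revenue) {}

  // Map: input is a single order aggregate; output is one
  // (productId, quantity/revenue) pair per line item.
  static List<Map.Entry<String, Totals>> map(Order order) {
    List<Map.Entry<String, Totals>> out = new ArrayList<>();
    for (LineItem item : order.items()) {
      out.add(Map.entry(item.productId(),
          new Totals(item.quantity(), item.quantity() * item.price())));
    }
    return out;
  }

  // Reduce: takes all values emitted for one key and combines them
  // into a single pair with the totals for quantity and revenue.
  static Totals reduce(String productId, List<Totals> values) {
    int quantity = 0;
    double revenue = 0;
    for (Totals t : values) {
      quantity += t.quantity();
      revenue += t.revenue();
    }
    return new Totals(quantity, revenue);
  }
}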
12

Map function
• Each instance of the map function is independent of all the others; it takes a single aggregate record as input.
• Hence a map-reduce framework can create efficient map tasks on each node and freely allocate each order to a map task.
• A map function might yield 1000 line items from orders for "Database Refactoring".
• The map function is limited to working only on data from a single aggregate.
• It outputs a set of relevant key-value pairs.
13

Map function
• Each application of the map function is independent of all the others.
• Takes a single aggregate record as input.
• Outputs a set of relevant key-value pairs.
14

Reduce function
• Takes multiple map outputs with the same key as input.
• Summarizes (or reduces) their values into a single output, using all values emitted for a single key.
15

Map-reduce framework
• The map-reduce framework arranges for map tasks to be run on the correct nodes to process all the documents, and for data to be moved to the reduce function.
• Arranges for the map function to be applied to pertinent documents on all nodes.
• Moves data to the location of the reduce function.
• Collects all values for a single key and calls the reduce function on the key and value collection.
• To run a map-reduce job, programmers only need to supply the map and reduce functions.
16

Partitioning and Combining

• The results of the mappers are divided up based on key on each processing node.
• Multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer (see the partitioner sketch below).
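A sketch of the partitioning step in Java, mirroring the behavior of Hadoop's default hash partitioner; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys are assigned to reducers by hashing, so all values for one key
// end up in the same partition and therefore at the same reducer.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}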
17

Partitioning (figure)
18

Partitioning and Combining

• The reduce function operates on the results of a single key; hence multiple reducers can be run in parallel.
• Multiple reducers can then operate on partitions in parallel, with the final results merged together.
• A combiner function cuts down repetitive data by combining all the data for the same key into a single value.
• The reduce function needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
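Taking the maximum temperature is combinable, so in the driver sketched earlier the same reducer class could plausibly also be registered as the combiner (a fragment of that driver, not a standalone program):

// Max is combinable: the reducer's output (year, temp) has the same
// shape as its input, so the same class can run on each map node.
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);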
19

Combining (figure)
20

• Not all reduce functions are combinable.
• Consider a function that counts the number of unique customers for a particular product. The map function for such an operation would need to emit the product and the customer.
• The reducer can then combine them and count how many times each customer appears for a particular product, emitting the product and the count.
• But this reducer's output is different from its input, so it can't be used as a combiner.
21

Composing Map-Reduce Calculations
22

When making a count, each map emits 1, which can be summed to get a total. (figure)
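A sketch of this counting pattern as a Hadoop reducer; since its output matches its input, it could also serve as a combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable value : values) {
      total += value.get();  // each map call emitted a 1 for this key
    }
    context.write(key, new IntWritable(total));
  }
}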
23

Calculation broken down into two map-reduce steps, which will be expanded in the next three figures. (figure)
24

Creating records for monthly sales of a product (figure)
25

The second-stage mapper creates base records for year-on-year comparisons. (figure)
26

The reduction step is a merge of incomplete records. (figure)
27

Key-Value databases
• All data is stored with a key and an associated value blob.
• The least complex of the NoSQL databases.
• Key-value stores are represented as a hash map, so they're powerful for basic Create-Read-Update-Delete (CRUD) operations.
• Key-value databases typically scale quite well and shard easily across any number of nodes. Each shard contains a range of keys and their associated values.
• Typical uses: storing and retrieving session information, storing in-app user profiles, and storing shopping-cart data in online stores. (A toy illustration of the model follows.)
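A toy Java sketch of the key-value model itself, not of any particular product: opaque keys mapped to value blobs, supporting only CRUD by key:

import java.util.concurrent.ConcurrentHashMap;

public class ToyKeyValueStore {
  // The whole store is conceptually one big hash map.
  private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

  public void put(String key, byte[] value) { store.put(key, value); }  // create/update
  public byte[] get(String key)             { return store.get(key); }  // read
  public void delete(String key)            { store.remove(key); }      // delete
}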
28

• Key-value is suitable when:
• Storing and retrieving session information for a Web application. Each user session receives some sort of unique key, and all its data is stored together.

• Key-value is not suitable when:
• The data is interconnected with a number of many-to-many relationships, such as social-networking or recommendation-engine scenarios.
• The use case requires a high level of consistency for multi-operation transactions involving multiple keys.
29

• Aerospike, Redis: open-source stores; flash-optimized and in-memory, respectively
• Apache Cassandra: distributed, free and open-source management system
• Oracle Berkeley DB: basic, high-performance, embedded, open-source
• AWS DynamoDB: fully managed service
• Couchbase Server: suitable for business-critical applications
• Riak: fast, flexible and scalable
30

Riak key-value database

• Riak KV is built to handle a variety of challenges facing Big Data applications, including tracking user or session information, storing connected-device data, and replicating data across the globe.
• Buckets are used to define a virtual keyspace for storing Riak objects.
• Buckets are essentially a flat namespace in Riak. They allow the same key name to exist in multiple buckets and enable you to apply configurations across keys.
31

Riak vs. relational database (comparison figure)
32

• Storing user session data in Riak:
• An object is both the largest and smallest element of data.
• Keys in Riak are simply binary values that are used to identify objects.
• A single object stores all the session data and is put into a single bucket.
33

• The downside of storing all the different objects in a single bucket is that the bucket stores different types of aggregates, increasing the chance of key conflicts.
• Alternatively, we can append the name of the object to the key, so that we can get at individual objects as needed.
34

• Using different buckets for different objects (such as User Profile and Shopping Cart) segments the data across buckets, allowing you to read only the object you need without having to change the key design.
• Buckets which store a specific kind of data are known as domain buckets.

// Fetch the bucket, then wrap it as a domain bucket for UserProfile objects.
Bucket bucket = client.fetchBucket(bucketName).execute();
DomainBucket<UserProfile> profileBucket =
    DomainBucket.builder(bucket, UserProfile.class).build();
35

Apache Cassandra
• Cassandra is a distributed, highly scalable database designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
• Data in Cassandra is stored as a set of rows that are organized into tables.
• Each row is identified by a primary key value.
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to the other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
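For illustration, a sketch of a read using the DataStax Java driver; the keyspace, table, and column names here are hypothetical, and whichever node the session contacts coordinates the query regardless of where the row is stored:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraReadExample {
  public static void main(String[] args) {
    // Connects to a contact point from the driver configuration; the
    // contacted node coordinates the read wherever the row lives.
    try (CqlSession session = CqlSession.builder().build()) {
      Row row = session
          .execute("SELECT name FROM shop.user_profiles WHERE user_id = ?",
                   "u-1001")
          .one();
      if (row != null) {
        System.out.println(row.getString("name"));
      }
    }
  }
}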
36

Key-Value Store Features


• Consistency
• Transactions
• Query Features
• Scaling
