Understanding Map-Reduce Framework

MapReduce is a distributed data processing model that enables parallel processing of large datasets across clusters of machines, primarily utilizing the Map and Reduce functions to handle data efficiently. The document illustrates its application through a weather dataset example, detailing the phases of input formatting, mapping, reducing, and outputting results. Additionally, it discusses key-value databases, their suitability for various applications, and compares different key-value store systems like Riak and Apache Cassandra.


1

Map-Reduce
• Design pattern to take advantage of clustered machines to do processing in parallel
• While keeping as much work and data as possible local to a single machine
• A widely used open-source implementation is part of the Hadoop project.
• Map function: reads the records from the database and emits key-value pairs.
• Reduce function: takes several key-value pairs with the same key and aggregates them into one.
2

MapReduce
What is it?
• A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Can be used with Java, Ruby, Python, C++ and more.
• Inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.
• MapReduce process flow (figure)
3

MapReduce
Problem example: Weather Dataset
Create a program that mines weather data.
• Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data. Source: NCDC.
• The data is stored using a line-oriented ASCII format, in which each line is a record.
• Mission: calculate the max temperature each year around the world.
4

Format of a National Climate Data Center record


0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
CN
010000 # visibility distance (meters)
1 # quality code
N9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
5

Input formatting phase

• In our example, the input is an NCDC log file:
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
• Output (to MR framework), as <offset, line> pairs:
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
6

Map phase

• The input to our map phase is the <offset, line_text> pairs shown above.
• The map function pulls out the year and the air temperature, since these are the only fields we are interested in.
• The map function also drops bad records: it filters out temperatures that are missing, suspect, or erroneous.
• Map output (<year, temp> pairs, temperatures in tenths of a degree Celsius):
(1950, 22), (1950, -11), (1949, 111), (1949, 78)
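As a concrete illustration, here is a sketch of such a mapper in Java, assuming the newer Hadoop MapReduce API; the character offsets for the year and temperature fields are an assumption about the exact NCDC record layout shown earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);          // observation year
    int airTemperature;
    if (line.charAt(87) == '+') {                  // parseInt rejects a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Drop bad records: missing, suspect, or erroneous temperatures.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}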
7

MR framework processing phase

• The output from the map function is processed by the MR framework before being sent to the reduce function.
• This processing sorts and groups the key-value pairs by key.
• MR framework processing output (<year, temperatures> pairs):
(1949, [111, 78]), (1950, [22, -11])
8

Reduce phase

• The input to our reduce phase is the <year, temperatures> pairs.
• All the reduce function has to do now is iterate through the list and pick out the maximum reading.
• Reduce output:
(1949, 111), (1950, 22)
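A matching reducer sketch in Java, again assuming the newer Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Iterate through all temperatures for this year, keeping the maximum.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}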
9

Data output phase

• The input to the data output class is the <year, max temperature> pairs from the reduce function.
• When using the default Hadoop output formatter, the output is written to a pre-defined directory, which contains one output file per reducer.
• Output formatter file output (tab-separated, one line per key):
1949	111
1950	22
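Tying the phases together, a minimal driver sketch; the class and job names are illustrative, and the input and output paths are taken from the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // NCDC input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // one file per reducer

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}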
10

Map-Reduce logical data flow (figure)
11

Map-Reduce Example
• Let's assume we have chosen orders as our aggregate, with each order having line items.
• Each line item has a product ID, quantity, and the price charged. This aggregate makes a lot of sense, as usually people want to see the whole order in one access.
• Sales analysis people want to see a product and its total revenue for the last seven days. This report doesn't fit the aggregate structure.
• Such a situation calls for map-reduce.
• A map is a function whose input is a single aggregate and whose output is a bunch of key-value pairs.
• The reduce function takes multiple map outputs with the same key and combines their values. It reduces them down to one pair, with the totals for the quantity and revenue. (A sketch of both functions follows.)
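A minimal in-memory sketch of this example in plain Java; Order, LineItem, and Totals are hypothetical types standing in for the stored aggregate:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProductRevenue {

  record LineItem(String productId, int quantity, double price) {}
  record Order(List<LineItem> items) {}
  record Totals(int quantity, double revenue) {}

  // Map: input is a single order aggregate; output is one
  // (productId, quantity/revenue) pair per line item.
  static List<Map.Entry<String, Totals>> map(Order order) {
    List<Map.Entry<String, Totals>> out = new ArrayList<>();
    for (LineItem item : order.items()) {
      out.add(Map.entry(item.productId(),
          new Totals(item.quantity(), item.quantity() * item.price())));
    }
    return out;
  }

  // Reduce: takes all values emitted for one key and combines them
  // into a single pair with the totals for quantity and revenue.
  static Totals reduce(String productId, List<Totals> values) {
    int quantity = 0;
    double revenue = 0;
    for (Totals t : values) {
      quantity += t.quantity();
      revenue += t.revenue();
    }
    return new Totals(quantity, revenue);
  }
}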
12

Map function
• Each instance of the map function is independent of all the others; it takes a single aggregate record as input.
• Hence a map-reduce framework can create efficient map tasks on each node and freely allocate each order to a map task.
• A map function might yield 1000 line items from orders for "Database Refactoring".
• The map function is limited to working only on data from a single aggregate.
• It outputs a set of relevant key-value pairs.
13

Map function
• Each application of the map function is independent of all the others.
• Takes a single aggregate record as input.
• Outputs a set of relevant key-value pairs.
14

Reduce function
• Takes multiple map outputs with the same key as input.
• Summarizes (or reduces) their values into a single output, using all values emitted for a single key.
15

Map-reduce framework
• The map-reduce framework arranges for map tasks to be run on the correct nodes to process all the documents, and for data to be moved to the reduce function.
• Arranges for the map function to be applied to pertinent documents on all nodes.
• Moves data to the location of the reduce function.
• Collects all values for a single key and calls the reduce function on the key and value collection.
• To run a map-reduce job, programmers only need to supply the map and reduce functions.
16

Partitioning and Combining

• The results of the mappers are divided up based on key on each processing node.
• Multiple keys are grouped together into partitions. The framework then takes the data from all the nodes for one partition, combines it into a single group for that partition, and sends it off to a reducer (see the partitioner sketch below).
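A sketch of the partitioning step in Java, mirroring the behavior of Hadoop's default hash partitioner; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys are assigned to reducers by hashing, so all values for one key
// end up in the same partition and therefore at the same reducer.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}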
17

Partitioning (figure)
18

Partitioning and Combining

• The reduce function operates on the results of a single key; hence multiple reducers can be run in parallel.
• Multiple reducers can then operate on partitions in parallel, with the final results merged together.
• A combiner function cuts down repetitive data by combining all the data for the same key into a single value.
• The reduce function needs a special shape for this to work: its output must match its input. We call such a function a combinable reducer.
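Taking the maximum temperature is combinable, so in the driver sketched earlier the same reducer class could plausibly also be registered as the combiner (a fragment of that driver, not a standalone program):

// Max is combinable: the reducer's output (year, temp) has the same
// shape as its input, so the same class can run on each map node.
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);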
19

Combining (figure)
20

• Not all reduce functions are combinable.
• Consider a function that counts the number of unique customers for a particular product. The map function for such an operation would need to emit the product and the customer.
• The reducer can then combine them and count how many times each customer appears for a particular product, emitting the product and the count.
• But this reducer's output is different from its input, so it can't be used as a combiner.
21

Composing Map-Reduce Calculations
22

When making a count, each map emits 1, which can be summed to get a total. (figure)
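A sketch of this counting pattern as a Hadoop reducer; since its output matches its input, it could also serve as a combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable value : values) {
      total += value.get();  // each map call emitted a 1 for this key
    }
    context.write(key, new IntWritable(total));
  }
}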
23

Calculation broken down into two map-reduce steps, which will be expanded in the next three figures. (figure)
24

Creating records for monthly sales of a product (figure)
25

The second-stage mapper creates base records for year-on-year comparisons. (figure)
26

The reduction step is a merge of incomplete records. (figure)
27

Key-Value databases
• All data is stored with a key and an associated value blob.
• The least complex of the NoSQL databases.
• Key-value stores are represented as a hash map, so they're powerful for basic Create-Read-Update-Delete (CRUD) operations.
• Key-value databases typically scale quite well and shard easily across any number of nodes. Each shard contains a range of keys and their associated values.
• Typical uses: storing and retrieving session information, storing in-app user profiles, and storing shopping-cart data in online stores. (A toy illustration of the model follows.)
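A toy Java sketch of the key-value model itself, not of any particular product: opaque keys mapped to value blobs, supporting only CRUD by key:

import java.util.concurrent.ConcurrentHashMap;

public class ToyKeyValueStore {
  // The whole store is conceptually one big hash map.
  private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

  public void put(String key, byte[] value) { store.put(key, value); }  // create/update
  public byte[] get(String key)             { return store.get(key); }  // read
  public void delete(String key)            { store.remove(key); }      // delete
}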
28

• Key-value is suitable when:
• Storing and retrieving session information for a Web application. Each user session receives some sort of unique key, and all its data is stored together.

• Key-value is not suitable when:
• The data is interconnected with a number of many-to-many relationships, such as social-networking or recommendation-engine scenarios.
• The use case requires a high level of consistency for multi-operation transactions involving multiple keys.
29

• Aerospike, Redis: open-source stores; flash-optimized and in-memory, respectively
• Apache Cassandra: distributed, free and open-source management system
• Oracle Berkeley DB: basic, high-performance, embedded, open-source
• AWS DynamoDB: fully managed service
• Couchbase Server: suitable for business-critical applications
• Riak: fast, flexible and scalable
30

Riak key-value database

• Riak KV is built to handle a variety of challenges facing Big Data applications, including tracking user or session information, storing connected-device data, and replicating data across the globe.
• Buckets are used to define a virtual keyspace for storing Riak objects.
• Buckets are essentially a flat namespace in Riak. They allow the same key name to exist in multiple buckets and enable you to apply configurations across keys.
31

Riak vs. relational database (comparison figure)
32

• Storing user session data in Riak:
• An object is both the largest and smallest element of data.
• Keys in Riak are simply binary values that are used to identify objects.
• A single object stores all the session data and is put into a single bucket.
33

• The downside of storing all the different objects in a single bucket is that the bucket stores different types of aggregates, increasing the chance of key conflicts.
• Alternatively, we can append the name of the object to the key, so that we can get at individual objects as needed.
34

• Using different buckets for different objects (such as User Profile and Shopping Cart) segments the data across buckets, allowing you to read only the object you need without having to change the key design.
• Buckets which store a specific kind of data are known as domain buckets.

// Fetch the bucket, then wrap it as a domain bucket for UserProfile objects.
Bucket bucket = client.fetchBucket(bucketName).execute();
DomainBucket<UserProfile> profileBucket =
    DomainBucket.builder(bucket, UserProfile.class).build();
35

Apache Cassandra
• Cassandra is a distributed, highly scalable database designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
• Data in Cassandra is stored as a set of rows that are organized into tables.
• Each row is identified by a primary key value.
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to the other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
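For illustration, a sketch of a read using the DataStax Java driver; the keyspace, table, and column names here are hypothetical, and whichever node the session contacts coordinates the query regardless of where the row is stored:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraReadExample {
  public static void main(String[] args) {
    // Connects to a contact point from the driver configuration; the
    // contacted node coordinates the read wherever the row lives.
    try (CqlSession session = CqlSession.builder().build()) {
      Row row = session
          .execute("SELECT name FROM shop.user_profiles WHERE user_id = ?",
                   "u-1001")
          .one();
      if (row != null) {
        System.out.println(row.getString("name"));
      }
    }
  }
}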
36

Key-Value Store Features


• Consistency
• Transactions
• Query Features
• Scaling
