MapReduce introduction with example
What is Hadoop?
Hadoop is an Apache open-source software framework
It is a Java framework for distributed processing of large datasets across large clusters of nodes
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of computers (nodes)
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Runs on large clusters of commodity machines
Hadoop provides both distributed storage and processing of large datasets
Supports processing of streaming data
What is Hadoop?
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
Why: Goals / Requirements?
Facilitate storage and processing of large and rapidly growing datasets
Structured and unstructured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault tolerance
Move computation rather than data
Hadoop Design Principles
Need to process big data
Need to parallelize computation across thousands of nodes
Use commodity (affordable) hardware with little redundancy
Contrast to parallel DBs: small number of high-end, expensive machines
Commodity hardware: large number of low-end, cheap machines working in parallel to solve a computing problem
Automatic parallelization & distribution: hidden from the end user
Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction: users only provide two functions, "map" and "reduce"
Hadoop Architecture?
(Diagram: Hadoop architecture)
Hadoop Architecture Overview?
Hadoop architecture is a master/slave architecture
The master is the namenode and the slaves are datanodes
The namenode controls access to the data by clients
Datanodes manage the storage of data on the nodes that they are running on
Hadoop splits a file into one or more blocks, and these blocks are stored in the datanodes
Each data block is replicated to 3 different datanodes to provide high availability of the Hadoop system
The block replication factor is configurable
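Since the replication factor is configurable, here is a minimal sketch, assuming Hadoop's Java FileSystem API and an illustrative path, of changing the replication factor of a single file (the cluster-wide default comes from the dfs.replication property in hdfs-site.xml):

// Minimal sketch: change the replication factor of one HDFS file.
// The path is illustrative; the cluster default is the dfs.replication property.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Ask HDFS to keep 3 copies of this file's blocks.
        boolean accepted = fs.setReplication(new Path("/user/hadoop/hadoopdemo/sales"), (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}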
Who Uses MapReduce/Hadoop?
Google: Inventors of MapReduce computing paradigm
Yahoo: Developed Hadoop, the open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others + universities and research labs
Hadoop Components?
Hadoop Distributed File System (HDFS)
MapReduce Engine
Types of Nodes:
Namenode
Secondary Namenode
Datanode
JobTracker
TaskTracker
YARN (Yet Another Resource Negotiator)
Hadoop Components?
Hadoop Distributed File System:
HDFS is designed to run on commodity machines with low-cost hardware
Distributed data is stored in the HDFS file system
HDFS is highly fault tolerant
HDFS provides high-throughput access to applications that require big data
Java-based, scalable system that stores data across multiple machines without prior organization
Hadoop Components?
Namenode:
The namenode is the heart of the Hadoop system
The namenode manages the file system namespace
It stores the metadata information of the data blocks
This metadata is stored permanently on the local disk in the form of a namespace image and an edit log file
The namenode also knows the location of the data blocks on the datanodes
However, the namenode does not store this information persistently
The namenode rebuilds the block-to-datanode mapping when it is restarted, from the block reports sent by the datanodes
If the namenode crashes, the entire Hadoop system goes down
Hadoop Components
Secondary Namenode:
Periodically merges the namenode's namespace image with the edit log so the edit log does not grow without bound
It is a checkpointing helper, not a hot standby for the namenode
Hadoop Components
DataNode:
Stores the blocks of data and retrieves them
Datanodes also report their block information to the namenode periodically
Stores the actual data in HDFS
Notifies the namenode of which blocks it holds
Can run on any underlying filesystem (ext3/4, NTFS, etc.)
The namenode replicates blocks 2x in the local rack, 1x elsewhere
Hadoop Components
JobTracker
The JobTracker's responsibility is to schedule clients' jobs
The JobTracker creates map and reduce tasks and schedules them to run on the datanodes (tasktrackers)
The JobTracker also checks for failed tasks and reschedules them on another datanode
The JobTracker can run on the namenode or on a separate node
TaskTracker
The TaskTracker runs on the datanodes
The TaskTracker's responsibility is to run the map or reduce tasks assigned by the JobTracker and report the status of the tasks back to the JobTracker
How does it work: Hadoop Architecture
Hadoop distributed file system (HDFS)
MapReduce execution engine
Hadoop distributed file system (HDFS)
Centralized namenode
- Maintains metadata info about files
(Diagram: a file F is split into blocks of 64/128 MB each, e.g. blocks 1 to 5, distributed across datanodes)
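A minimal sketch, assuming Hadoop's Java FileSystem API and an illustrative path, of listing a file's blocks and the datanodes holding each replica (this is the metadata the namenode maintains):

// Minimal sketch: list the blocks of one HDFS file and where their replicas live.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocksExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/dir/products/products.dat"));
        // One BlockLocation per block; each lists the datanodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}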
HDFS Properties
Large
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication
Each data block is replicated many times (default is 3)
Failure
Failure is the norm rather than exception
Fault Tolerance
Detection of faults and quick, automatic recovery from them is a
core architectural goal of HDFS
MapReduce
Example: finding the highest temperature in a large weather dataset. Just like in the traditional way, I will split the data into smaller parts or blocks and store them in different machines. Then, I will find the highest temperature in each part stored in the corresponding machine. At last, I will combine the results received from each of the machines to have the final output.
Challenges associated with this traditional approach:
Critical path problem: the amount of time taken to finish the job without delaying the next milestone or actual completion date. So, if any of the machines delays the job, the whole work gets delayed.
Single split may fail: if any of the machines fails to provide the output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance capability of the system.
(Diagram: map tasks (parse/hash) feeding reduce tasks)
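A minimal sketch of the highest-temperature idea written as the two MapReduce functions with Hadoop's Java API; the input format, one "year<TAB>temperature" reading per line, is an assumption made for illustration:

// Minimal sketch of the "highest temperature" job.
// Assumed (hypothetical) input format: one reading per line, "YYYY<TAB>temperature".
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Map: each mapper works on one block/split and emits (year, temperature).
    public static class TempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    // Reduce: receives (year, all temperatures for that year) and keeps the maximum,
    // i.e. the "combine the results received from each machine" step of the slide.
    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(year, new IntWritable(max));
        }
    }
}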
Map-Reduce Engine
MapReduce Engine
JobTracker
TaskTracker
The JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker process on each node
The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce") or requests new jobs
MapReduce Engine Properties
Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (concept of locality)
Example: this file has 5 blocks, so run 5 map tasks
Where to run the task reading block 1? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
MapReduce Engine Properties
Task Tracker is the slave node (runs on each datanode)
– Receives the task from Job Tracker
– Runs the task until completion (either map or reduce task)
– Always in communication with the Job Tracker reporting progress
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
(Diagram: 4 map tasks (parse/hash) feeding 3 reduce tasks)
Key – Value Pair
Mappers and Reducers are users’ code (provided functions)
Just need to obey the Key-Value pairs interface
Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
Shuffling and Sorting:
– Hidden phase between mappers and reducers
– Groups all values with the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
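A minimal sketch of this key-value contract using the classic word-count job (the same job as Example 1 below), written against Hadoop's Java MapReduce API:

// Minimal sketch of the two user-provided functions for the word-count job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: consumes <byte offset, line of text> pairs, produces <word, 1> pairs.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: consumes <word, list of counts> built by shuffling and sorting,
    // produces one <word, total count> pair per word.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}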
MapReduce Phases
Deciding on what will be the key and what will be the value is the developer's responsibility
Refinement: Combiners
A combiner combines the values of each key output by a single mapper (single machine)
Much less data needs to be copied and shuffled!
Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram:
Example 1: Word Count
Job: Count the occurrences of each word in a dataset
(Diagram: map tasks (parse/hash) feeding reduce tasks that write output files Part0001 to Part0003)
That's the output file; it has 3 parts, probably on 3 different machines
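A minimal sketch of the corresponding driver: it wires the word-count mapper and reducer from the key-value slide together, registers the reducer as a combiner (the refinement discussed above, safe here because summing counts is commutative and associative), and submits the job; the input and output paths come from the command line:

// Minimal sketch of the word-count driver.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.WordCountMapper.class);
        // Combiner refinement: partial sums computed per mapper before the shuffle.
        job.setCombinerClass(WordCount.WordCountReducer.class);
        job.setReducerClass(WordCount.WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Each reduce task writes one part file (part-r-00000, part-r-00001, ...),
        // which is why the slide's output appears in several parts.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}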
Example 3: Color Filter
Job: Select only the blue and the green colors
(Diagram: input blocks on HDFS; each map task produces (k, v) pairs and writes its output directly to HDFS as Part0001 to Part0004)
That's the output file; it has 4 parts, probably on 4 different machines
• Each map task will select only the blue or green colors
• No need for a reduce phase
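A minimal sketch of such a map-only filter, assuming each input line is a record containing the plain word "blue" or "green"; with zero reduce tasks, each map task writes its own part file straight to HDFS:

// Minimal sketch of a map-only filter job in the spirit of Example 3.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColorFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        // Keep only the blue and green records; everything else is dropped.
        if (record.contains("blue") || record.contains("green")) {
            context.write(line, NullWritable.get());
        }
    }

    // In the driver: no reduce phase is needed, so set
    //   job.setNumReduceTasks(0);
    // and each map task writes its own part file (part-m-00000, ...) directly to HDFS.
}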
Other Hadoop Software Components
Hadoop: Apache's open-source software framework for storing, processing and analyzing big data
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
Hive: A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
Ambari: A web interface for managing, configuring and testing Hadoop services and components.
Cassandra: A distributed database system.
Flume: Software that collects, aggregates and moves large amounts of streaming data into HDFS.
HBase: A nonrelational, distributed database that runs on top of Hadoop.
Other Software Components
HCatalog: A table and storage management layer that helps users share and access data.
Oozie: A Hadoop job scheduler.
Pig: A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extraction, transformation and loading, and basic analysis without having to write MapReduce programs.
Solr: A scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
Spark: An open-source cluster computing framework with in-memory analytics.
Sqoop: A connection and transfer mechanism that moves data between relational databases and Hadoop.
Hadoop Ecosystem
It is Apache’s open source software framework for storing, processing and analyzing big data.
Hadoop Ecosystem
HBase
Hive
Pig
Flume
Sqoop
Oozie
Hue
Mahout
ZooKeeper
Hadoop Ecosystem
Hive
It is an SQL-like interface to Hadoop.
It is an abstraction on top of MapReduce; it allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
Uses the HiveQL language, very similar to SQL.
Pig
Pig is an alternative abstraction on top of MapReduce.
It uses a dataflow scripting language called Pig Latin.
The Pig interpreter runs on the client machine.
It takes the Pig Latin script, turns it into a series of MapReduce jobs and submits those jobs to the cluster.
Hadoop Ecosystem
HBase
HBase is the "Hadoop database".
It is a 'NoSQL' data store.
It can store massive amounts of data, e.g. gigabytes, terabytes or even petabytes of data in a table.
MapReduce is not designed for iterative processes; HBase instead provides random, real-time read/write access to the data.
Flume
It is a distributed, real-time data collection service.
It efficiently collects, aggregates and moves large amounts of data.
Hadoop Ecosystem
Sqoop
It provides a method to import data from tables in a relational database into HDFS.
It supports easy parallel database import/export.
Users can import data from an RDBMS into HDFS and export data from HDFS back into an RDBMS.
Oozie
Oozie is a workflow management project.
Oozie allows developers to create a workflow of MapReduce jobs, including dependencies between jobs.
The Oozie server submits the jobs to the cluster in the correct sequence.
Hadoop Ecosystem
Hue
An open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license
Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience
Mahout
Machine learning tool
Supports distributed & scalable machine learning algorithms on the Hadoop platform
Helps in building intelligent applications easier and faster
It can provide distributed data mining functions combined with Hadoop
Hadoop Ecosystem
ZooKeeper
It is a centralized service for maintaining configuration information.
It provides distributed synchronization.
It contains a set of tools to build distributed applications that can safely handle partial failures.
ZooKeeper was designed to store coordination data, status information, configuration, and location information about distributed applications.
Hadoop Limitations
Security Concerns
Vulnerable by Nature
Not fit for small data
Potential Stability Issues
General limitations
Hadoop fs Shell Commands Examples
hadoop fs <args>                                                  # general form of the file system shell
hadoop fs -ls /user/hadoop/students                               # list a directory
hadoop fs -mkdir /user/hadoop/hadoopdemo                          # create a directory
hadoop fs -cat /user/hadoop/dir/products/products.dat             # print a file's contents
hadoop fs -copyToLocal /user/hadoop/hadoopdemo/sales salesdemo    # copy from HDFS to the local filesystem
hadoop fs -put localfile /user/hadoop/hadoopdemo                  # copy from the local filesystem to HDFS
hadoop fs -rm /user/hadoop/file                                   # delete a file
hadoop fs -rmr /user/hadoop/dir                                   # delete a directory recursively (deprecated; use -rm -r)
Hadoop Applications
The New York Times: converted 4 TB of TIFFs into 11 million PDF articles in 24 hrs
Hadoop Applications
Facebook's Hadoop cluster hosted 100+ PB of data (July 2012)