MapReduce Introduction with Example

Hadoop is an open-source software framework designed for distributed processing of large datasets across clusters of commodity hardware, utilizing a programming model called MapReduce. The architecture consists of a master/slave structure with a Namenode managing metadata and Datanodes storing actual data, while the MapReduce engine facilitates parallel processing through map and reduce tasks. Hadoop supports fault tolerance and scalability, making it suitable for processing both structured and unstructured data.


Introduction to MapReduce

What is Hadoop?
Hadoop is an Apache open-source software framework
It is a Java framework for distributed processing of large datasets across large clusters of nodes
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of computers (nodes)
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Runs on large clusters of commodity machines
Hadoop provides both distributed storage and processing of large datasets
Supports processing of streaming data
What is Hadoop?
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
Why: Goals / Requirements?
Facilitate storage & processing of large and rapidly growing datasets
Handle both structured and unstructured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault tolerance
Move computation rather than data
Hadoop Design Principles
Need to process big data
Need to parallelize computation across thousands of nodes
Use commodity (affordable) hardware with little redundancy
Contrast to parallel DBs: small number of high-end, expensive machines
Commodity hardware: large number of low-end, cheap machines working in parallel to solve a computing problem
Automatic parallelization & distribution: hidden from the end user
Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction: users only provide two functions, "map" and "reduce"
Hadoop Architecture?
[Figure: Hadoop architecture]
Hadoop Architecture Overview?
Hadoop has a master/slave architecture
The master is the namenode and the slaves are the datanodes
The namenode controls access to the data by clients
Datanodes manage the storage of data on the nodes they are running on
Hadoop splits each file into one or more blocks, and these blocks are stored on the datanodes
Each data block is replicated to 3 different datanodes to provide high availability of the Hadoop system
The block replication factor is configurable (see the sketch below)
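To make "the block replication factor is configurable" concrete: it can be set cluster-wide in hdfs-site.xml (the dfs.replication property) or adjusted per file through the HDFS Java API. A minimal sketch, assuming a reachable HDFS cluster with its configuration on the classpath; the file path is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        // Default replication factor used for files newly created by this client
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path)
        fs.setReplication(new Path("/user/hadoop/demo.txt"), (short) 2);
        fs.close();
    }
}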
Who Uses MapReduce/Hadoop?
Google: inventors of the MapReduce computing paradigm
Yahoo: major developer of Hadoop, the open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others, plus universities and research labs
Hadoop Components?
Hadoop Distributed File System (HDFS)
MapReduce Engine
Types of Nodes:
Namenode
Secondary Namenode
Datanode
JobTracker
TaskTracker
YARN (Yet Another Resource Negotiator)
Hadoop Components?
Hadoop Distributed File System:
HDFS is designed to run on commodity machines built from low-cost hardware
Distributed data is stored in the HDFS file system
HDFS is highly fault tolerant
HDFS provides high-throughput access to applications that require big data
Java-based, scalable system that stores data across multiple machines without prior organization
Hadoop Components?
Namenode:
The namenode is the heart of the Hadoop system
The namenode manages the file system namespace
It stores the metadata information of the data blocks
This metadata is stored permanently on the local disk in the form of a namespace image and an edit log file
The namenode also knows the location of the data blocks on the datanodes
However, the namenode does not store this information persistently
The namenode rebuilds the block-to-datanode mapping when it is restarted (from the block reports sent by the datanodes)
If the namenode crashes, the entire Hadoop system goes down
Hadoop Components
Secondary Namenode:
Its responsibility is to periodically copy and merge the namespace image and the edit log
If the namenode crashes, the namespace image stored on the secondary namenode can be used to restart the namenode
Hadoop Components
DataNode:
Stores the blocks of data and retrieves them
Datanodes also report block information to the namenode periodically
Stores the actual data in HDFS
Notifies the NameNode of which blocks it holds
Can run on any underlying filesystem (ext3/4, NTFS, etc.)
The NameNode replicates blocks 2x in the local rack, 1x elsewhere
Hadoop Components
JobTracker
The JobTracker's responsibility is to schedule the clients' jobs
The JobTracker creates map and reduce tasks and schedules them to run on the datanodes (TaskTrackers)
The JobTracker also checks for failed tasks and reschedules them on another datanode
The JobTracker can run on the namenode or on a separate node

TaskTracker
The TaskTracker runs on the datanodes
The TaskTracker's responsibility is to run the map or reduce tasks assigned by the JobTracker and to report the status of those tasks back to the JobTracker
How does it work: Hadoop Architecture
Hadoop distributed file system (HDFS)
MapReduce execution engine
Master node (single node)
Many slave nodes
Hadoop distributed file system (HDFS)
Centralized namenode
- Maintains metadata info about files
[Figure: file F split into blocks 1-5 of 64/128 MB each]
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default N = 3)
HDFS Properties
Large
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data

Replication
Each data block is replicated many times (default is 3)

Failure
Failure is the norm rather than the exception

Fault Tolerance
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
MapReduce

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in.

Hadoop can run MapReduce programs written in various languages such as Java, Ruby, Python, and C++.

Most important, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job.

Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
Traditional Way
Let us understand how parallel and distributed processing used to happen in the traditional way, before the MapReduce framework existed. Take an example where I have a weather log containing the daily average temperature for the years 2000 to 2015, and I want to find the day with the highest temperature in each year.

In the traditional way, I would split the data into smaller parts or blocks and store them on different machines. Then I would find the highest temperature in each part on the corresponding machine. Finally, I would combine the results received from each machine to get the final output.
Challenges associated with this traditional approach:
Critical path problem: the amount of time taken to finish the job without delaying the next milestone or the actual completion date. If any one machine delays its part of the job, the whole work gets delayed.

Reliability problem: what if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
Challenges associated ctd.
Equal split issue: how do I divide the data into smaller chunks so that each machine gets an even share of data to work with? In other words, how do I divide the data equally so that no individual machine is overloaded or underutilized?

Single split may fail: if any machine fails to provide its output, I will not be able to compute the result. There should be a mechanism to ensure the fault-tolerance capability of the system.

Aggregation of results: there should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
Challenges associated ctd.
These are the issues I would have to take care of individually when performing parallel processing of huge datasets with traditional approaches.
To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance. MapReduce therefore gives you the flexibility to write your code logic without caring about the design issues of the system.
What is MapReduce?
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large datasets.
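To connect this definition back to the weather-log example above, here is a minimal sketch of the two user-supplied functions in Java with the Hadoop MapReduce API. It finds the highest temperature per year (a simplification of finding the exact day); the input line format, field positions, and class names are assumptions made for illustration, not taken from the slides:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: one (year, temperature) pair per input line.
// Assumes each line looks like "2007-06-21,34" (date,temperature) - an illustrative format.
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        String year = fields[0].substring(0, 4);            // e.g. "2007"
        int temperature = Integer.parseInt(fields[1].trim());
        context.write(new Text(year), new IntWritable(temperature));
    }
}

// Reduce phase: receives (year, [t1, t2, ...]) after shuffle & sort and keeps the maximum.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}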
Map-Reduce Execution Engine: Color Count Example

[Figure: color-count dataflow. Input blocks on HDFS feed Map tasks (parse/hash), each producing (color, 1) pairs; shuffle & sorting groups the pairs by key into (color, [1,1,1,...]); Reduce tasks consume these lists and produce (color, count) pairs, e.g. (green, 100).]

Users only provide the "Map" and "Reduce" functions
Map-Reduce Engine
The MapReduce engine consists of:
JobTracker
TaskTracker
The JobTracker splits up work into smaller tasks ("Map") and sends them to the TaskTracker process on each node
The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or requests new jobs
MapReduce Engine Properties
Job Tracker is the master node (runs with the namenode)
– Receives the user's job
– Decides how many tasks will run (number of mappers)
– Decides where to run each mapper (concept of locality)
Example: this file has 5 blocks → run 5 map tasks
Where should the task reading block "1" run?
Try to run it on Node 1 or Node 3 (nodes holding a replica of that block)
MapReduce Engine Properties
Task Tracker is the slave node (runs on each datanode)
– Receives the task from the Job Tracker
– Runs the task until completion (either a map or a reduce task)
– Always in communication with the Job Tracker, reporting progress
In this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks
[Figure: 4 Map (parse/hash) tasks feeding 3 Reduce tasks]
Key – Value Pair
Mappers and Reducers are user code (provided functions)
They just need to obey the key-value pair interface
Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
Shuffling and Sorting:
– Hidden phase between mappers and reducers
– Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>> (see the sketch below)
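As a conceptual illustration of the hidden shuffle-and-sort phase, the plain-Java sketch below groups word-count-style (word, 1) pairs by key into sorted <key, <list of values>> groups. This is only a toy model of what the framework does between mappers and reducers, not Hadoop's actual shuffle implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Combined map output from all mappers: (key, value) pairs
        List<String[]> mapOutput = Arrays.asList(
                new String[]{"the", "1"}, new String[]{"cat", "1"},
                new String[]{"the", "1"}, new String[]{"sat", "1"});

        // Shuffle & sort: group values by key, with keys kept in sorted order
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }

        // Each entry is what one reduce() call would consume: <key, <list of values>>
        grouped.forEach((word, ones) -> System.out.println(word + " -> " + ones));
        // Prints: cat -> [1], sat -> [1], the -> [1, 1]
    }
}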
MapReduce Phases
Deciding what will be the key and what will be the value is the developer's responsibility
Refinement: Combiners
A combiner combines the values of all keys of a single mapper (single machine) before the shuffle
Much less data needs to be copied and shuffled!
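In the Java API, the combiner is set on the job alongside the mapper and reducer; when the reduce function is associative and commutative (like summing word counts), the same reducer class can serve as the combiner. A minimal word-count driver sketch using Hadoop's library classes TokenCounterMapper and IntSumReducer; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class);  // emits (word, 1) per token
        job.setCombinerClass(IntSumReducer.class);     // combiner: local per-mapper aggregation
        job.setReducerClass(IntSumReducer.class);      // final aggregation across all mappers

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}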
Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram:
Example 1: Word Count
Job: count the occurrences of each word in a data set
[Figure: map tasks feeding reduce tasks]
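For reference, here is a hand-written mapper and reducer equivalent to the library classes used in the driver sketch above; a minimal sketch, with class names chosen only for illustration:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in the input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum the 1s for each word, giving its total count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}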
Example 2: Color Count
Job: count the number of each color in a data set
[Figure: same dataflow as the color-count slide above, but with each Reduce task writing its own output part. Map tasks (parse/hash) produce (color, 1) pairs from input blocks on HDFS; shuffle & sorting groups them by key into (color, [1,1,1,...]); each Reduce task produces (color, count) pairs, e.g. (green, 100), and writes one part file: Part0001, Part0002, Part0003.]
That's the output file; it has 3 parts, probably on 3 different machines
Example 3: Color Filter
Job: select only the blue and the green colors
[Figure: 4 map tasks read input blocks on HDFS, each producing (color, 1) pairs for the selected colors and writing its output directly to HDFS as Part0001-Part0004.]
That's the output file; it has 4 parts, probably on 4 different machines
• Each map task selects only the blue or green colors
• No need for a reduce phase (a map-only job; see the sketch below)
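A filter like this is naturally a map-only job: each map task emits only the records it wants to keep, and setting the number of reduce tasks to zero makes the map output be written directly to HDFS. A minimal sketch under the assumption that the color is the first comma-separated field of each record (an invented layout):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only filter: keep a record only if its color field is blue or green.
class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String color = line.toString().split(",")[0];    // assumed record layout
        if (color.equals("blue") || color.equals("green")) {
            context.write(line, NullWritable.get());     // pass the record through unchanged
        }
    }
}

public class ColorFilterDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color filter");
        job.setJarByClass(ColorFilterDriver.class);
        job.setMapperClass(ColorFilterMapper.class);
        job.setNumReduceTasks(0);                        // no reduce phase: map output goes straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}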
Other Hadoop Software Components
Hadoop: Apache's open-source software framework for storing, processing and analyzing big data.
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
Hive: A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
Ambari: A web interface for managing, configuring and testing Hadoop services and components.
Cassandra: A distributed database system.
Flume: Software that collects, aggregates and moves large amounts of streaming data into HDFS.
HBase: A nonrelational, distributed database that runs on top of Hadoop.
Other Software Components
HCatalog: A table and storage management layer that helps users share and access data.
Oozie: A Hadoop job scheduler.
Pig: A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extraction, transformation and loading, and basic analysis without having to write MapReduce programs.
Solr: A scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
Spark: An open-source cluster computing framework with in-memory analytics.
Sqoop: A connection and transfer mechanism that moves data between relational databases and Hadoop.
Hadoop Ecosystem
It is Apache's open-source software framework for storing, processing and analyzing big data.
Hadoop Ecosystem
HBase
Hive
Pig
Flume
Sqoop
Oozie
Hue
Mahout
ZooKeeper
Hadoop Ecosystem
 Hive
 It is an SQL-like interface to Hadoop.
 It is an abstraction on top of MapReduce that allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
 Uses the HiveQL language, which is very similar to SQL.
 Pig
 Pig is an alternative abstraction on top of MapReduce.
 It uses a dataflow scripting language called Pig Latin.
 The Pig interpreter runs on the client machine.
 It takes the Pig Latin script, turns it into a series of MapReduce jobs, and submits those jobs to the cluster.
Hadoop Ecosystem
 HBase
 HBase is the "Hadoop database".
 It is a 'NoSQL' data store.
 It can store massive amounts of data, e.g. gigabytes, terabytes or even petabytes of data in a table.
 MapReduce is not designed for iterative processes.
 Flume
 It is a distributed, real-time data collection service.
 It efficiently collects, aggregates and moves large amounts of data.
Hadoop Ecosystem
 Sqoop
 It provides a method to import data from tables in a relational database into HDFS.
 It supports easy parallel database import/export.
 Users can import data from an RDBMS into HDFS, and export data from HDFS back into an RDBMS.
 Oozie
 Oozie is a workflow management project.
 Oozie allows developers to create a workflow of MapReduce jobs, including dependencies between jobs.
 The Oozie server submits the jobs to the cluster in the correct sequence.
Hadoop Ecosystem
 Hue
 An open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license.
 Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience.
 Mahout
 A machine learning tool.
 Supports distributed & scalable machine learning algorithms on the Hadoop platform.
 Helps in building intelligent applications more easily and quickly.
 It can provide distributed data mining functions combined with Hadoop.
Hadoop Ecosystem
 ZooKeeper
 It is a centralized service for maintaining configuration information.
 It provides distributed synchronization.
 It contains a set of tools to build distributed applications that can safely handle partial failures.
 ZooKeeper was designed to store coordination data: status information, configuration and location information about distributed applications.
Hadoop Limitations
Security concerns
Vulnerable by nature
Not fit for small data
Potential stability issues
General limitations
Hadoop fs Shell Commands Examples
hadoop fs <args>                                                 # general form of an HDFS shell command
hadoop fs -ls /user/hadoop/students                              # list the contents of an HDFS directory
hadoop fs -mkdir /user/hadoop/hadoopdemo                         # create a directory in HDFS
hadoop fs -cat /user/hadoop/dir/products/products.dat            # print a file's contents to stdout
hadoop fs -copyToLocal /user/hadoop/hadoopdemo/sales salesdemo   # copy a file from HDFS to the local filesystem
hadoop fs -put localfile /user/hadoop/hadoopdemo                 # copy a local file into HDFS
hadoop fs -rm /user/hadoop/file                                  # delete a file
hadoop fs -rmr /user/hadoop/dir                                  # delete a directory recursively (deprecated; use -rm -r)
Hadoop Applications
Advertisement (mining user behavior to generate recommendations)
Searches (group related documents)
Security (search for uncommon patterns)
Non-realtime large dataset computing:
The NY Times was dynamically generating PDFs of articles from 1851-1922
Wanted to pre-generate & statically serve articles to improve performance
Using Hadoop + MapReduce running on EC2 / S3, converted 4 TB of TIFFs into 11 million PDF articles in 24 hrs
Hadoop Applications
Hadoop is in use at most organizations that handle big data:
Yahoo!
Facebook
Amazon
Netflix
Etc.
Some examples of scale:
Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search
FB's Hadoop cluster hosts 100+ PB of data (July 2012)
