Lecture 5 - Hadoop and MapReduce

Hadoop is an open source framework that allows distributed processing of large datasets across clusters of commodity hardware. It uses a master-slave architecture with a master node to manage metadata and slave data nodes that store data blocks. Hadoop consists of HDFS for storage and MapReduce as an execution engine that distributes computation across nodes. Mappers process data in parallel and reducers aggregate the outputs by key.


What is Hadoop?

• An open source framework that allows distributed processing of large datasets across a cluster of machines
• Open source framework written in Java
• Inspired by Google's MapReduce programming model as well as its file system (GFS)

Commodity Hardware
• Economic / affordable
• Typically low-performance hardware
Hadoop History

• 2002: Doug Cutting started working on Nutch
• 2003–2004: Google published the GFS and MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce to Nutch
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: Hadoop converted 4 TB of image archives over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project; Hadoop defeated supercomputers as the fastest system to sort a terabyte of data; Hive launched, adding SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes

Uses for Hadoop
• Data-intensive text processing
• Graph mining
• Machine learning and data mining
• Large-scale social network analysis
What is Hadoop (Cont’d)
• The Hadoop framework consists of two main layers
  • Hadoop Distributed File System (HDFS)
  • Execution engine (MapReduce)
Hadoop Master/Slave Architecture

• Hadoop is designed as a master-slave architecture

Master node (single node)

Many slave nodes

Design Principles of Hadoop
• Need to process big data

• Need to parallelize computation across thousands of nodes

• Commodity hardware
• Large number of low-end cheap machines working in parallel to
solve a computing problem

Properties of HDFS
• Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: Each data block is replicated many times (default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
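As a rough illustration of the replication property (not part of the original slides), the replication factor can be set on the client configuration and changed for an existing file through the Hadoop FileSystem API; the path /data/File1 below is just a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // default block replication factor is 3
        FileSystem fs = FileSystem.get(conf);
        // Raise or lower the replication factor of an existing (placeholder) file
        fs.setReplication(new Path("/data/File1"), (short) 3);
    }
}
```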
Hadoop: How it Works

Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)

Master node (single node)

Many slave nodes

Hadoop Distributed File System (HDFS)

• Centralized NameNode
  - Maintains metadata info about files
• Many DataNodes (1000s)
  - Store the actual data
  - Files are divided into blocks (64 MB each); e.g. a file F is split into blocks 1–5
  - Each block is replicated N times (default N = 3)
Hadoop Distributed File System
• NameNode:
  • Stores metadata (file names, block locations, etc.)
• DataNode:
  • Stores the actual HDFS data blocks

(Diagram: blocks 1–4 of File1 replicated across several DataNodes)
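A minimal sketch (not from the slides) of how a client can ask the NameNode for this metadata through the FileSystem API; the file /data/File1 is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/File1"));
        // The NameNode reports each block's offset, length, and the DataNodes holding its replicas
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```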
Data Retrieval

• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
• The client then communicates directly with the DataNodes to read the data
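A sketch of that client-side read path (added for illustration, with a placeholder path): opening the file consults the NameNode for block locations, while the bytes themselves stream directly from the DataNodes:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode which blocks make up the file; reads then go straight to DataNodes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```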
MapReduce
Distributing computation across nodes

MapReduce Overview

• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
  • Map
  • Reduce
The Mapper

• Reads data as key/value pairs
  • The key is often discarded
• Outputs zero or more key/value pairs

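As a concrete sketch (not part of the slides), a word-count mapper in Hadoop's Java MapReduce API ignores the input key (the line's byte offset) and emits zero or more (word, 1) pairs per input line:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line (usually ignored), input value = the line itself
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word in the line: zero or more output pairs per input pair
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```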
Shuffle and Sort

• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine
The Reducer

• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
  • Usually just one output per input key
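A matching reducer sketch (again an illustration, not the slides' code): the shuffle has already grouped the mapper output, e.g. ("cat",1), ("cat",1), ("dog",1) arrive as ("cat",[1,1]) and ("dog",[1]), and the reducer emits one total per word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per unique key; values holds every 1 the mappers emitted for that word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Usually just one output pair per input key
        context.write(key, new IntWritable(sum));
    }
}
```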
JobTracker and TaskTracker

• JobTracker
• Determines the execution
plan for the job
• Assigns individual tasks

• TaskTracker
• Keeps track of the
performance of an
individual mapper or
reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the NameNode)
  • Receives the user’s job
  • Decides how many tasks will run (number of mappers)
  • Decides where to run each mapper (concept of locality)

• Example: a file stored across Node 1, Node 2, and Node 3 has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”?
  • Try to run it on Node 1 or Node 3 (the nodes holding a replica of that block)
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each DataNode)
  • Receives the task from the Job Tracker
  • Runs the task until completion (either a map or a reduce task)
  • Always in communication with the Job Tracker, reporting progress

(In the figure’s example, one MapReduce job consists of 4 map tasks and 3 reduce tasks)
MapReduce Phases

• Deciding what will be the key and what will be the value is the developer’s responsibility
Map-Reduce Execution Engine (Example: Color Count)

• Input blocks on HDFS → each map task produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key k
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, …]), and produces (k’, v’) pairs, e.g. (color, 100)
• Users only provide the “Map” and “Reduce” functions
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
  • Consume <key, value> pairs
  • Produce <key, value> pairs
• Reducers:
  • Consume <key, <list of values>>
  • Produce <key, value>
• Shuffling and Sorting:
  • Hidden phase between mappers and reducers
  • Groups all similar keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set

(Diagram: input blocks feed Map Tasks; their output is shuffled to Reduce Tasks)
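A minimal driver sketch that would wire the hypothetical WordCountMapper and WordCountReducer from the earlier sketches into one job; the input and output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // hypothetical mapper from the sketch above
        job.setReducerClass(WordCountReducer.class);     // hypothetical reducer from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```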
Example 2: Color Count
• Job: Count the number of each color in a data set
• Input blocks on HDFS → map tasks produce (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key → reduce tasks consume (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, …]), and produce (k’, v’) pairs, e.g. (color, 100)
• The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines
Example 3: Color Filter
• Job: Select only the blue and the green colors
• Each map task selects only the blue or green colors and writes its output directly to HDFS
• No need for a reduce phase
• The output file has 4 parts (Part0001–Part0004), probably on 4 different machines
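A map-only sketch of this filter (an illustration, assuming each input record is a single color name per line); setting the number of reduce tasks to zero in the driver makes each map task write its output straight to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only records whose color field is "blue" or "green"
public class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String color = value.toString().trim();           // assume one color name per line
        if (color.equals("blue") || color.equals("green")) {
            context.write(value, NullWritable.get());     // pass the matching record straight through
        }
    }
}

// In the driver: job.setNumReduceTasks(0);  → map output is written directly to HDFS
```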
Other Tools

• Hive
• Hadoop processing with SQL

• Pig
• Hadoop processing with scripting

• HBase
• Database model built on top of Hadoop

Who Uses Hadoop?
