Lecture 5 - Hadoop and MapReduce
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
Hadoop Master/Slave Architecture
Design Principles of Hadoop
• Need to process big data
• Commodity hardware
• A large number of low-end, cheap machines working in parallel to solve a computing problem
Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
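The block and replication defaults above can be made concrete with a short sketch. This is plain Python for illustration, assuming HDFS's default 128 MB block size and replication factor 3; the 1 TB file is a made-up example, not from the slides.

```python
# Sketch: how HDFS-style block splitting and replication multiply raw storage.
# 128 MB blocks and replication factor 3 are HDFS defaults; the file size is
# an illustrative assumption.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes per block
REPLICATION = 3                  # default replication factor

def blocks_needed(file_size: int) -> int:
    # Ceiling division: a partially filled last block still occupies a block.
    return -(-file_size // BLOCK_SIZE)

def raw_storage(file_size: int) -> int:
    # Every block is stored REPLICATION times, on different DataNodes.
    return blocks_needed(file_size) * BLOCK_SIZE * REPLICATION

one_tb = 1024 ** 4
print(blocks_needed(one_tb))                  # 8192 blocks for a 1 TB file
print(raw_storage(one_tb) // BLOCK_SIZE)      # 24576 block replicas cluster-wide
```

The 3x overhead is the price paid for treating failure as the norm: losing a node loses only one replica of each of its blocks.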
Hadoop: How it Works
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
• Centralized namenode
- Maintains metadata info about files
(Figure: a file F is split into numbered blocks 1 to 5 and distributed across the cluster)
Hadoop Distributed File System
• NameNode:
• Stores metadata (file names, block locations, etc.)
• DataNode:
• Stores the actual HDFS data blocks
(Figure: the NameNode maps File1 to blocks 1 to 4; each block is replicated on several DataNodes)
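The NameNode/DataNode split can be sketched as two small lookup tables. This is a toy model in plain Python, not the real HDFS API; the file name File1 and blocks 1 to 4 are taken from the slide, while the DataNode names are illustrative assumptions.

```python
# Toy model of NameNode metadata (illustration only, not the real HDFS API):
# the NameNode maps each file to its block IDs, and each block to the
# DataNodes holding a replica.
file_to_blocks = {
    "File1": [1, 2, 3, 4],
}
block_to_datanodes = {             # hypothetical 3-way replica placement
    1: ["dn1", "dn2", "dn3"],
    2: ["dn1", "dn2", "dn4"],
    3: ["dn2", "dn3", "dn4"],
    4: ["dn1", "dn3", "dn4"],
}

def locate(filename):
    """Return, per block, the DataNodes a client can read that block from."""
    return {b: block_to_datanodes[b] for b in file_to_blocks[filename]}

print(locate("File1")[1])  # ['dn1', 'dn2', 'dn3']
```

A client asks the NameNode only for these locations, then reads the blocks directly from the DataNodes, which keeps the NameNode off the data path.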
Data Retrieval
• JobTracker
• Determines the execution plan for the job
• Assigns individual tasks
• TaskTracker
• Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)
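The locality decision in the last bullet can be sketched in a few lines. This is a hypothetical helper in plain Python, not the real JobTracker code: it prefers a free node that already stores the input block, and only falls back to a remote node when none is free.

```python
# Sketch of locality-aware scheduling (illustrative, not the real JobTracker):
# prefer a node that already holds a replica of the task's input block.
def assign_mapper(block_replicas, free_nodes):
    for node in block_replicas:        # data-local placement first
        if node in free_nodes:
            return node
    return next(iter(free_nodes))      # otherwise any free node (remote read)

free = {"node1", "node3"}
print(assign_mapper(["node2", "node3"], free))  # node3: holds a replica and is free
```

Running the mapper where its block already lives avoids shipping the data over the network; the code is moved to the data, not the other way around.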
Properties of MapReduce Engine (Cont'd)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker, reporting progress
MapReduce Phases
Deciding what will be the key and what will be the value is the developer's responsibility.
Map-Reduce Execution Engine (Example: Color Count)
(Figure: map tasks read input blocks from HDFS and produce (k, v) pairs, e.g. (color, 1); shuffle & sorting groups pairs by k into (k, [1,1,1,1,1,1, ...]); reduce tasks consume (k, [v]) and produce (k', v'), e.g. (color, 100))
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
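The three phases of the Color Count job can be sketched as an in-memory pipeline. This is plain Python for illustration, not the Hadoop API: map emits (color, 1), shuffle groups values by key, and reduce sums each group. The sample colors are made up.

```python
from collections import defaultdict

# Minimal in-memory sketch of the Color Count job (not the Hadoop API).
def map_phase(records):
    for color in records:
        yield (color, 1)                     # produces (k, v)

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)                  # groups into (k, [1, 1, ...])
    return groups

def reduce_phase(groups):
    return {k: sum(vs) for k, vs in groups.items()}  # produces (k', v')

data = ["red", "blue", "red", "green", "red", "blue"]
print(reduce_phase(shuffle(map_phase(data))))
# {'red': 3, 'blue': 2, 'green': 1}
```

In real Hadoop the map and reduce functions run as distributed tasks and the shuffle happens over the network, but the data flow is exactly this.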
Example 2: Color Count
Job: Count the number of each color in a data set
Produces(k’, v’)
( , 100)
Part0001
Part0002
Part0003
Write to HDFS
Part0002
That’s the output file, it has
4 parts on probably 4
Write to HDFS
Part0003 different machines
Write to HDFS
Part0004
28
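The reason the output has one part per reducer can be sketched as follows. The Part0001 to Part0004 names come from the slide; the hash-partitioning scheme below is an illustrative assumption in plain Python, not Hadoop's exact partitioner.

```python
# Sketch: keys are hash-partitioned across reducers, and each reducer writes
# its own output partition (part file). Illustrative, not Hadoop's partitioner.
NUM_REDUCERS = 4

def partition(key):
    return hash(key) % NUM_REDUCERS

def output_files(results):
    # One dict per part file; real reducers would write these to HDFS.
    parts = {f"Part{i + 1:04d}": {} for i in range(NUM_REDUCERS)}
    for color, count in results.items():
        parts[f"Part{partition(color) + 1:04d}"][color] = count
    return parts

parts = output_files({"red": 100, "blue": 75, "green": 40})
print(sorted(parts))  # ['Part0001', 'Part0002', 'Part0003', 'Part0004']
```

Because each key hashes to exactly one reducer, every color's total ends up in exactly one part file, and the parts together form the complete output.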
Other Tools
• Hive
• Hadoop processing with SQL
• Pig
• Hadoop processing with scripting
• HBase
• Database model built on top of Hadoop
Who Uses Hadoop?