MapReduce introduction with example
What is Hadoop?
Hadoop is an Apache open-source software framework
It is a Java framework for distributed processing of large datasets across large clusters of nodes
Large datasets: terabytes or petabytes of data
Large clusters: hundreds or thousands of computers (nodes)
Hadoop is an open-source implementation of Google's MapReduce
Hadoop is based on a simple programming model called MapReduce
Runs on large clusters of commodity machines
Hadoop provides both distributed storage and processing of large datasets
Supports processing of streaming data
What is Hadoop?
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
Why: Goals / Requirements?
Facilitate storage and processing of large and rapidly growing datasets
Structured and unstructured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault tolerance
Move computation rather than data
Hadoop Design Principles
Need to process big data
Need to parallelize computation across thousands of nodes
Use commodity (affordable) hardware with little redundancy
Contrast to parallel DBs: small number of high-end, expensive machines
Commodity hardware: large number of low-end, cheap machines working in parallel to solve a computing problem
Automatic parallelization & distribution: hidden from the end user
Fault tolerance and automatic recovery: nodes/tasks will fail and will recover automatically
Clean and simple programming abstraction: users only provide two functions, "map" and "reduce"
Hadoop Architecture?
(Diagram: Hadoop architecture)
Hadoop Architecture Overview?
Hadoop architecture is a master/slave architecture
The master is the namenode and the slaves are datanodes
The namenode controls access to the data by clients
Datanodes manage the storage of data on the nodes that they are running on
Hadoop splits a file into one or more blocks, and these blocks are stored in the datanodes
Each data block is replicated to 3 different datanodes to provide high availability of the Hadoop system
The block replication factor is configurable
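Since the replication factor is configurable, here is a minimal sketch, assuming Hadoop's Java FileSystem API and an illustrative path, of changing the replication factor of a single file (the cluster-wide default comes from the dfs.replication property in hdfs-site.xml):

// Minimal sketch: change the replication factor of one HDFS file.
// The path is illustrative; the cluster default is the dfs.replication property.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Ask HDFS to keep 3 copies of this file's blocks.
        boolean accepted = fs.setReplication(new Path("/user/hadoop/hadoopdemo/sales"), (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}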
Who Uses MapReduce/Hadoop?
Google: Inventors of MapReduce computing paradigm
Yahoo: Developed Hadoop, the open-source implementation of MapReduce
IBM, Microsoft, Oracle
Facebook, Amazon, AOL, Netflix
Many others + universities and research labs
Hadoop Components?
Hadoop Distributed File System (HDFS)
MapReduce Engine
Types of Nodes:
Namenode
Secondary Namenode
Datanode
JobTracker
TaskTracker
YARN (Yet Another Resource Negotiator)
Hadoop Components?
Hadoop Distributed File System:
HDFS is designed to run on commodity machines with low-cost hardware
Distributed data is stored in the HDFS file system
HDFS is highly fault tolerant
HDFS provides high-throughput access to applications that require big data
Java-based, scalable system that stores data across multiple machines without prior organization
Hadoop Components?
Namenode:
The namenode is the heart of the Hadoop system
The namenode manages the file system namespace
It stores the metadata information of the data blocks
This metadata is stored permanently on the local disk in the form of a namespace image and an edit log file
The namenode also knows the location of the data blocks on the datanodes
However, the namenode does not store this information persistently
The namenode rebuilds the block-to-datanode mapping when it is restarted, from the block reports sent by the datanodes
If the namenode crashes, the entire Hadoop system goes down
Hadoop Components
Secondary Namenode:
Periodically merges the namenode's namespace image with the edit log so the edit log does not grow without bound
It is a checkpointing helper, not a hot standby for the namenode
Hadoop Components
DataNode:
Stores the blocks of data and retrieves them
Datanodes also report their block information to the namenode periodically
Stores the actual data in HDFS
Notifies the namenode of which blocks it holds
Can run on any underlying filesystem (ext3/4, NTFS, etc.)
The namenode replicates blocks 2x in the local rack, 1x elsewhere
Hadoop Components
JobTracker
The JobTracker's responsibility is to schedule clients' jobs
The JobTracker creates map and reduce tasks and schedules them to run on the datanodes (tasktrackers)
The JobTracker also checks for failed tasks and reschedules them on another datanode
The JobTracker can run on the namenode or on a separate node
TaskTracker
The TaskTracker runs on the datanodes
The TaskTracker's responsibility is to run the map or reduce tasks assigned by the JobTracker and report the status of the tasks back to the JobTracker
How does it work: Hadoop Architecture
Hadoop distributed file system (HDFS)
MapReduce execution engine
Hadoop distributed file system (HDFS)
Centralized namenode
- Maintains metadata info about files
(Diagram: a file F is split into blocks of 64/128 MB each, e.g. blocks 1 to 5, distributed across datanodes)
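A minimal sketch, assuming Hadoop's Java FileSystem API and an illustrative path, of listing a file's blocks and the datanodes holding each replica (this is the metadata the namenode maintains):

// Minimal sketch: list the blocks of one HDFS file and where their replicas live.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocksExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/dir/products/products.dat"));
        // One BlockLocation per block; each lists the datanodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}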
HDFS Properties
Large
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication
Each data block is replicated many times (default is 3)
Failure
Failure is the norm rather than exception
Fault Tolerance
Detection of faults and quick, automatic recovery from them is a
core architectural goal of HDFS
MapReduce
Example: finding the highest temperature in a large weather dataset. Just like in the traditional way, I will split the data into smaller parts or blocks and store them in different machines. Then, I will find the highest temperature in each part stored in the corresponding machine. At last, I will combine the results received from each of the machines to have the final output.
Challenges associated with this traditional approach:
Critical path problem: the amount of time taken to finish the job without delaying the next milestone or actual completion date. So, if any of the machines delays the job, the whole work gets delayed.
Single split may fail: if any of the machines fails to provide the output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance capability of the system.
(Diagram: map tasks (parse/hash) feeding reduce tasks)
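A minimal sketch of the highest-temperature idea written as the two MapReduce functions with Hadoop's Java API; the input format, one "year<TAB>temperature" reading per line, is an assumption made for illustration:

// Minimal sketch of the "highest temperature" job.
// Assumed (hypothetical) input format: one reading per line, "YYYY<TAB>temperature".
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Map: each mapper works on one block/split and emits (year, temperature).
    public static class TempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    // Reduce: receives (year, all temperatures for that year) and keeps the maximum,
    // i.e. the "combine the results received from each machine" step of the slide.
    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(year, new IntWritable(max));
        }
    }
}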
Map-Reduce Engine
MapReduce Engine
JobTracker
TaskTracker
The JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker process on each node
The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce") or requests new jobs
MapReduce Engine Properties
Job Tracker is the master node (runs with the namenode)
– Receives the user’s job
– Decides on how many tasks will run (number of mappers)
– Decides on where to run each mapper (concept of locality)
Example: this file has 5 blocks, so run 5 map tasks
Where to run the task reading block 1? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
MapReduce Engine Properties
Task Tracker is the slave node (runs on each datanode)
– Receives the task from Job Tracker
– Runs the task until completion (either map or reduce task)
– Always in communication with the Job Tracker reporting progress
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks
(Diagram: 4 map tasks (parse/hash) feeding 3 reduce tasks)
Key – Value Pair
Mappers and Reducers are users’ code (provided functions)
Just need to obey the Key-Value pairs interface
Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs
Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
Shuffling and Sorting:
– Hidden phase between mappers and reducers
– Groups all values with the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
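A minimal sketch of this key-value contract using the classic word-count job (the same job as Example 1 below), written against Hadoop's Java MapReduce API:

// Minimal sketch of the two user-provided functions for the word-count job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: consumes <byte offset, line of text> pairs, produces <word, 1> pairs.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: consumes <word, list of counts> built by shuffling and sorting,
    // produces one <word, total count> pair per word.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}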
MapReduce Phases
Deciding on what will be the key and what will be the value is the developer's responsibility
Refinement: Combiners
A combiner combines the values of each key output by a single mapper (single machine)
Much less data needs to be copied and shuffled!
Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram:
Example 1: Word Count
Job: Count the occurrences of each word in a dataset
(Diagram: map tasks (parse/hash) feeding reduce tasks that write output files Part0001 to Part0003)
That's the output file; it has 3 parts, probably on 3 different machines
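A minimal sketch of the corresponding driver: it wires the word-count mapper and reducer from the key-value slide together, registers the reducer as a combiner (the refinement discussed above, safe here because summing counts is commutative and associative), and submits the job; the input and output paths come from the command line:

// Minimal sketch of the word-count driver.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.WordCountMapper.class);
        // Combiner refinement: partial sums computed per mapper before the shuffle.
        job.setCombinerClass(WordCount.WordCountReducer.class);
        job.setReducerClass(WordCount.WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Each reduce task writes one part file (part-r-00000, part-r-00001, ...),
        // which is why the slide's output appears in several parts.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}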
Example 3: Color Filter
Job: Select only the blue and the green colors
(Diagram: input blocks on HDFS; each map task produces (k, v) pairs and writes its output directly to HDFS as Part0001 to Part0004)
That's the output file; it has 4 parts, probably on 4 different machines
• Each map task will select only the blue or green colors
• No need for a reduce phase
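A minimal sketch of such a map-only filter, assuming each input line is a record containing the plain word "blue" or "green"; with zero reduce tasks, each map task writes its own part file straight to HDFS:

// Minimal sketch of a map-only filter job in the spirit of Example 3.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColorFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        // Keep only the blue and green records; everything else is dropped.
        if (record.contains("blue") || record.contains("green")) {
            context.write(line, NullWritable.get());
        }
    }

    // In the driver: no reduce phase is needed, so set
    //   job.setNumReduceTasks(0);
    // and each map task writes its own part file (part-m-00000, ...) directly to HDFS.
}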
Other Hadoop Software Components
Hadoop: Apache's open-source software framework for storing, processing and analyzing big data
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
Hive: A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
Ambari: A web interface for managing, configuring and testing Hadoop services and components.
Cassandra: A distributed database system.
Flume: Software that collects, aggregates and moves large amounts of streaming data into HDFS.
HBase: A nonrelational, distributed database that runs on top of Hadoop.
Other Software Components
HCatalog: A table and storage management layer that helps users share and access data.
Oozie: A Hadoop job scheduler.
Pig: A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extraction, transformation and loading, and basic analysis without having to write MapReduce programs.
Solr: A scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
Spark: An open-source cluster computing framework with in-memory analytics.
Sqoop: A connection and transfer mechanism that moves data between relational databases and Hadoop.
Hadoop Ecosystem
It is Apache’s open source software framework for storing, processing and analyzing big data.
Hadoop Ecosystem
HBase
Hive
Pig
Flume
Sqoop
Oozie
Hue
Mahout
ZooKeeper
Hadoop Ecosystem
Hive
It is an SQL-like interface to Hadoop.
It is an abstraction on top of MapReduce; it allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
Uses the HiveQL language, very similar to SQL.
Pig
Pig is an alternative abstraction on top of MapReduce.
It uses a dataflow scripting language called Pig Latin.
The Pig interpreter runs on the client machine.
It takes the Pig Latin script, turns it into a series of MapReduce jobs and submits those jobs to the cluster.
Hadoop Ecosystem
HBase
HBase is the "Hadoop database".
It is a 'NoSQL' data store.
It can store massive amounts of data, e.g. gigabytes, terabytes or even petabytes of data in a table.
MapReduce is not designed for iterative processes; HBase instead provides random, real-time read/write access to the data.
Flume
It is a distributed, real-time data collection service.
It efficiently collects, aggregates and moves large amounts of data.
Hadoop Ecosystem
Sqoop
It provides a method to import data from tables in a relational database into HDFS.
It supports easy parallel database import/export.
Users can import data from an RDBMS into HDFS and export data from HDFS back into an RDBMS.
Oozie
Oozie is a workflow management project.
Oozie allows developers to create a workflow of MapReduce jobs, including dependencies between jobs.
The Oozie server submits the jobs to the cluster in the correct sequence.
Hadoop Ecosystem
Hue
An open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license
Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience
Mahout
Machine learning tool
Supports distributed & scalable machine learning algorithms on the Hadoop platform
Helps in building intelligent applications easier and faster
It can provide distributed data mining functions combined with Hadoop
Hadoop Ecosystem
ZooKeeper
It is a centralized service for maintaining configuration information.
It provides distributed synchronization.
It contains a set of tools to build distributed applications that can safely handle partial failures.
ZooKeeper was designed to store coordination data, status information, configuration, and location information about distributed applications.
Hadoop Limitations
Security Concerns
Vulnerable by Nature
Not fit for small data
Potential Stability Issues
General limitations
Hadoop fs Shell Commands Examples
hadoop fs <args>                                                  # general form of the file system shell
hadoop fs -ls /user/hadoop/students                               # list a directory
hadoop fs -mkdir /user/hadoop/hadoopdemo                          # create a directory
hadoop fs -cat /user/hadoop/dir/products/products.dat             # print a file's contents
hadoop fs -copyToLocal /user/hadoop/hadoopdemo/sales salesdemo    # copy from HDFS to the local filesystem
hadoop fs -put localfile /user/hadoop/hadoopdemo                  # copy from the local filesystem to HDFS
hadoop fs -rm /user/hadoop/file                                   # delete a file
hadoop fs -rmr /user/hadoop/dir                                   # delete a directory recursively (deprecated; use -rm -r)
Hadoop Applications
The New York Times: converted 4 TB of TIFFs into 11 million PDF articles in 24 hrs
Hadoop Applications
Facebook's Hadoop cluster hosted 100+ PB of data (July 2012)