Hadoop Overview
Hadoop Key Characteristics
Reliable
Scalable
Hadoop Core Components
Hadoop is a system for large-scale data processing.
It has two main components:
HDFS (Storage)
Self-healing, high-bandwidth clustered storage
MapReduce (Processing)
Splits a task across processors "near" the data and assembles the results
JobTracker manages the TaskTrackers
HDFS
Hadoop's Distributed File System (HDFS) is designed to reliably store very
large files across machines in a large cluster. It is inspired by the
Google File System. HDFS stores each file as a sequence of blocks;
all blocks in a file except the last block are the same size.
Blocks belonging to a file are replicated for fault tolerance. The
block size and replication factor are configurable per file. Files in
HDFS are "write once" and have strictly one writer at any time.
DataNodes:
slave daemons deployed on each machine in the cluster that provide the
actual storage
responsible for serving read and write requests from clients
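Because the block size and replication factor are configurable per file, a client can choose them when it creates a file. Below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the path and the chosen values are purely illustrative, and a reachable HDFS cluster is assumed.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the configured default file system

        Path file = new Path("/tmp/example.txt");   // illustrative path
        int bufferSize = 4096;
        short replication = 2;                      // replicate each block on 2 DataNodes
        long blockSize = 64L * 1024 * 1024;         // 64 MB blocks for this particular file

        // create() lets the client override the cluster-wide defaults
        // (dfs.replication, dfs.blocksize) on a per-file basis.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}
```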
Map Reduce
The Hadoop Map/Reduce framework harnesses a cluster of machines and
executes user-defined Map/Reduce jobs across the nodes in the cluster. A
Map/Reduce computation has two phases: a map phase and a reduce phase. The
input to the computation is a data set of key/value pairs.
Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the
middle of a computation, the tasks assigned to them are redistributed among
the remaining nodes. Having many map and reduce tasks enables good load
balancing and allows failed tasks to be re-run with small runtime overhead.
Ref: http://labs.google.com/papers/mapreduce.html
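As a concrete illustration of the two phases, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (the class names are illustrative): the map tasks emit a (word, 1) pair for every token of their input split, and the reduce tasks sum the counts for each word.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every input line, emit (word, 1) for each token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            result.set(sum);
            context.write(word, result);
        }
    }
}
```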
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave
architecture. An HDFS installation consists of a single NameNode, a
master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number
of DataNodes, one per node in the cluster, which manage the storage
attached to the nodes that they run on. The NameNode exposes
file system namespace operations such as opening, closing, and renaming
files and directories via an RPC interface. It also
determines the mapping of blocks to DataNodes. The DataNodes are
responsible for serving read and write requests from file system
clients; they also perform block creation, deletion, and replication
upon instruction from the NameNode.
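The block-to-DataNode mapping maintained by the NameNode can be inspected from a client. A minimal sketch using the standard FileSystem.getFileBlockLocations call follows; the path is illustrative, and a reachable cluster with the file already present is assumed.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // illustrative path

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```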
Simple Hadoop Architecture (Small cluster)
Map-Reduce Architecture
Map-Reduce Process
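To run the map and reduce phases, a driver configures and submits a job; the framework then schedules the map and reduce tasks across the cluster (in classic Hadoop, the JobTracker hands work to the TaskTrackers). A minimal sketch reusing the WordCount classes sketched earlier, with illustrative input/output paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);   // map phase
        job.setCombinerClass(WordCount.IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);    // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/tmp/words-in"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/tmp/words-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);      // submit and wait for the job
    }
}
```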
Hadoop Ecosystem
Hadoop Demo
Thank You