Prepared By: Manoj Kumar Joshi & Vikas Sawhney
General Agenda
Clients should view a DFS the same way they would a centralized
FS; the distribution is hidden at a lower level.
DATA NODE
Edit logs - The sequence of changes made to the filesystem after the NameNode started
Data Node
DataNode stores data in the Hadoop Filesystem
Replication placement
Creating a replica on every machine has a high initialization cost
An approximate solution: only 3 replicas
One replica resides on the current node
One replica resides on another node in the current rack
One replica resides on a node in another rack
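The three-replica rule above can be sketched in a few lines of Python. This is only an illustration of the policy as stated on this slide, not HDFS's actual placement code; the `place_replicas` function, the `topology` dictionary and the node and rack names are all hypothetical.

```python
import random

def place_replicas(writer, topology, rng=random):
    """Pick three nodes for a block, following the slide's policy.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical layout used only for this sketch).
    """
    # Find the rack that contains the writing node.
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)

    # Replica 1: the node that is writing the block.
    first = writer

    # Replica 2: a different node in the same rack.
    same_rack_peers = [n for n in topology[local_rack] if n != writer]
    second = rng.choice(same_rack_peers)

    # Replica 3: any node in a different rack.
    remote_nodes = [
        n for rack, nodes in topology.items() if rack != local_rack for n in nodes
    ]
    third = rng.choice(remote_nodes)

    return [first, second, third]

# Hypothetical two-rack cluster.
topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5"],
}
print(place_replicas("node1", topology))
```

Keeping two replicas in the local rack makes the write cheap, while the third replica in another rack lets the block survive the loss of a whole rack.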
Replication Pipeline
Rack Awareness
Data Node Failure
Data Node Failure Condition
If a DataNode fails, the NameNode knows which blocks it held,
creates new replicas of those blocks on other live nodes, and
unregisters the dead node
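A minimal sketch of that recovery step follows, assuming the NameNode keeps a map from block id to the set of nodes holding a replica. The `handle_datanode_failure` function, the `block_map` structure and the block and node names are hypothetical and only illustrate the bookkeeping. A real NameNode also has to schedule the copies and honor rack placement.

```python
def handle_datanode_failure(dead, block_map, live_nodes):
    """Unregister a dead node and plan re-replication of its blocks.

    block_map: block id -> set of node names holding a replica (hypothetical).
    Returns the updated set of live nodes and a list of
    (block, source_node, target_node) copy instructions.
    """
    # Unregister the dead node.
    live = set(live_nodes) - {dead}
    copies = []
    for block, holders in block_map.items():
        if dead not in holders:
            continue  # this block lost no replica
        holders.discard(dead)
        # Candidate targets: live nodes that do not already hold the block.
        candidates = sorted(live - holders)
        if holders and candidates:
            source = sorted(holders)[0]  # any surviving replica can be the source
            target = candidates[0]
            holders.add(target)
            copies.append((block, source, target))
    return live, copies

# Hypothetical cluster state before node2 fails.
block_map = {
    "blk_1": {"node1", "node2", "node4"},
    "blk_2": {"node2", "node3", "node5"},
}
live, copies = handle_datanode_failure(
    "node2", block_map, ["node1", "node2", "node3", "node4", "node5"]
)
for block, source, target in copies:
    print(f"copy {block} from {source} to {target}")
```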
Data Integrity
Corruption may occur during network transfer, through hardware failure, etc.
Apply checksum checking to the contents of files on HDFS, and
store the checksums in the HDFS namespace
If a checksum does not match after fetching, drop the block and fetch
another replica from a different machine.
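The verify-then-fall-back behavior can be sketched as below. CRC32 from Python's `zlib` module is used purely for illustration; the `read_block` function and the in-memory `replicas` list are hypothetical stand-ins for fetching a block from DataNodes.

```python
import zlib

def read_block(replicas, expected_crc):
    """Return the first replica whose checksum matches.

    replicas: list of (node_name, data_bytes) pairs, in the order tried.
    expected_crc: checksum recorded when the block was written.
    """
    for node, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return node, data
        # Checksum mismatch: drop this copy and try the next replica.
    raise IOError("all replicas failed checksum verification")

original = b"hello hdfs block"
stored_crc = zlib.crc32(original)  # computed once, at write time

# The first replica is corrupted (one byte changed), the second is intact.
replicas = [("node1", b"hellx hdfs block"), ("node4", original)]
node, data = read_block(replicas, stored_crc)
print(node)
```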
Heartbeats and Re-Replication
Secondary Name Node
Not a failover NameNode
Trick: bring a new NameNode up, but use DNS to make the cluster
believe it's the original one
Apache MapReduce
A software framework for distributed processing of large data sets
It splits the input data set into independent chunks that are
processed in a completely parallel manner.
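To make the split / map / reduce flow concrete, here is a small single-machine word count in plain Python. It does not use the Hadoop API; the function names (`map_chunk`, `reduce_counts`, `run_job`) and the thread pool standing in for parallel map tasks are assumptions made for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(chunks):
    # Each chunk is independent, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))

    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    # Reduce each key's values to a single result.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(chunks)
print(result)
```

Because no chunk depends on another, adding more workers speeds up the map phase without changing the result.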
TASK TRACKER
JobTracker
JobTracker is the daemon service for submitting and tracking
MapReduce jobs in Hadoop
Hadoop FS Shell
Formatting the HDFS filesystem
bin/hadoop namenode -format
To add a directory
bin/hadoop dfs -mkdir abc
To list a directory
bin/hadoop dfs -ls /