Hadoop
Introduction
•Distributed programming framework.
•Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
•Key features of Hadoop are:
•Accessible
–Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
•Robust
–Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
•Scalable
–Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
•Simple
–Hadoop allows users to quickly write efficient parallel code.
Secondary NameNode (SNN)
•The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS.
•Like the NameNode, there is one SNN per cluster.
•Typically, no DataNode or TaskTracker daemons run on the same server.
•The SNN differs from the NameNode in that it doesn't receive or record any real-time changes to HDFS.
•Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration (a sample setting is sketched below).
•As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.
•Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
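•For illustration, a minimal sketch of tuning that checkpoint interval in core-site.xml (Hadoop 1.x property name; the value shown is the usual one-hour default, included only as an example):
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>Seconds between consecutive SNN checkpoints of the HDFS metadata.</description>
</property>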
JobTracker
•The JobTracker daemon is the liaison (mediator) between your application and Hadoop.
•Once you submit your code to your cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to different tasks, and monitors all tasks as they're running.
•Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries (a sample setting is sketched below).
•There is only one JobTracker daemon per Hadoop cluster.
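•For illustration, a minimal sketch of that retry limit in mapred-site.xml (Hadoop 1.x property names; 4 is the usual default, shown only as an example):
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
  <description>Attempts per map task before the job is marked failed.</description>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
  <description>Attempts per reduce task before the job is marked failed.</description>
</property>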
TaskTracker
•The JobTracker is the master overseeing the overall execution of a MapReduce job.
•TaskTrackers manage the execution of individual tasks on each slave node.
•Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.
•Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel (a sample slot setting is sketched below).
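•For illustration, a minimal sketch of capping that per-node parallelism in mapred-site.xml (Hadoop 1.x property names; 2 is the usual default per slot type, shown only as an example):
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>Map tasks a TaskTracker may run simultaneously.</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>Reduce tasks a TaskTracker may run simultaneously.</description>
</property>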
Hadoop Master/Slave Architecture
Hadoop Configuration Modes
•Local (standalone) mode
•The standalone mode is the default mode for Hadoop.
•Hadoop chooses to be conservative and assumes a minimal configuration. All XML (configuration) files are empty under this default mode (an example of such an empty file appears below).
•With empty configuration files, Hadoop runs completely on the local machine.
•Because there's no need to communicate with other nodes, the standalone mode doesn't use HDFS, nor does it launch any of the Hadoop daemons.
•Its primary use is for developing and debugging the application logic of a MapReduce program.
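•For reference, an "empty" configuration file in this mode contains only the standard wrapper shown on the configuration slide below, with no properties:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>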
Hadoop Configuration Modes
•Pseudo-distributed mode
•The pseudo-distributed mode is running Hadoop in a "cluster of one," with all daemons running on a single machine.
•This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
•It requires configuring the XML files under hadoop/conf/, as shown below.
Hadoop Configuration Modes
•Configuration
•core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>
•mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>
•hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when the file is created.</description>
  </property>
</configuration>
Hadoop Configuration Modes
•Fully distributed mode
•Provides the benefits of distributed storage and distributed computation (a sample configuration is sketched below).
•master—The master node of the cluster and host of the NameNode and JobTracker daemons.
•backup—The server that hosts the Secondary NameNode daemon.
•hadoop1, hadoop2, hadoop3, ...—The slave boxes of the cluster, running both DataNode and TaskTracker daemons.
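•For illustration, a minimal sketch of the corresponding site files, assuming the master host is named master as in the list above (the hostname and the replication factor of 3 are illustrative, not prescribed):
<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>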
Working with files in HDFS
Assumptions and Goals
•Hardware Failure
•Hardware failure is the norm rather than the exception.
•An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
•The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional.
•Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.