UNIT-II
Hadoop
Scalability
The primary benefit of Hadoop is its scalability: one can easily scale the cluster by adding more nodes. There are two types of scalability in Hadoop:
• Vertical
• Horizontal
Vertical scalability
It is also referred to as “scale up”. In vertical scaling, you increase the hardware capacity of an individual machine: for example, you add more RAM or CPU to your existing system to make it more robust and powerful.
Horizontal scalability
It is also referred to as “scale out” and is basically the addition of more machines, i.e. growing the cluster. In horizontal scaling, instead of increasing the hardware capacity of individual machines, you add more nodes to the existing cluster, and most importantly, you can add more machines without stopping the system. Therefore there is no downtime of any kind while scaling out; in the end you simply have more machines working in parallel to meet your requirements.
Hadoop Streaming
The Hadoop MapReduce framework is written in Java and natively supports writing map/reduce programs in Java only. However, Hadoop also provides an API for writing MapReduce programs in languages other than Java.
• Hadoop Streaming is the utility that allows us to create and run MapReduce jobs with any script or executable as the mapper or the reducer.
• It uses Unix streams as the interface between Hadoop and our MapReduce program, so any language that can read from standard input and write to standard output can be used to write the MapReduce program.
• Hadoop Streaming supports running both Java and non-Java MapReduce jobs on the Hadoop cluster. It supports the Python, Perl, R, PHP, and C++ programming languages.
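For illustration, a minimal Python mapper for a streaming word-count job might look like the sketch below. This example is not from the original notes; the file name mapper.py and the whitespace tokenization are assumptions.

    # mapper.py - minimal word-count mapper sketch for Hadoop Streaming
    # Reads raw text lines from standard input and emits "word<TAB>1" lines
    # on standard output, which Streaming turns back into (key, value) pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")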
Syntax for Hadoop Streaming
You can use the syntax below to run MapReduce code written in a language other than Java to process data with the Hadoop MapReduce framework.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
Parameter               Description
-input myInputDirs      Input location for the mapper
-output myOutputDir     Output location for the reducer
-mapper /bin/cat        Mapper executable
-reducer /usr/bin/wc    Reducer executable
How Streaming Works
Let us now see how Hadoop Streaming works.
• The mapper and the reducer (in the above example) are the
scripts that read the input line-by-line from stdin and emit the
output to stdout.
• The utility creates a Map/Reduce job and submits the job to an
appropriate cluster and monitors the job progress until its
completion.
• When a script is specified for mappers, then each mapper task
launches the script as a separate process when the mapper is
initialized.
• The mapper task converts its inputs (key, value pairs) into lines and pushes the lines to the standard input of the process. Meanwhile, the mapper collects the line-oriented outputs from the standard output of the process and converts each line into a (key, value) pair, which is collected as the result of the mapper.
• When a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized.
• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input of the process. Meanwhile, the reducer gathers the line-oriented outputs from the stdout of the process and converts each line collected into a key/value pair, which is then collected as the result of the reducer.
• For both the mapper and the reducer, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is considered the key and the value is null. This can be customized by setting the -inputformat command option for the mapper and the -outputformat option for the reducer.
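To make the tab-separated key/value convention concrete, a matching word-count reducer sketch in Python could look as follows. Again, this is an illustrative example rather than part of the original notes; it assumes the mapper emits "word<TAB>1" lines and that the framework has already sorted the mapper output by key.

    # reducer.py - minimal word-count reducer sketch for Hadoop Streaming
    # Input lines arrive as "word<TAB>count", grouped and sorted by key,
    # so counts for the same word can be summed in a single pass.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        count = int(count) if count else 1
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word = word
            current_count = count

    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Such scripts would be passed to the streaming jar with the -mapper and -reducer options (and typically shipped to the cluster with the -file option), in the same way as /bin/cat and /usr/bin/wc in the syntax example above.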
Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to
Hadoop MapReduce.
• Unlike Streaming, Pipes does not use standard input and output to communicate with the map and reduce code.
• Instead, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
In many ways, the approach is similar to Hadoop Streaming, but Pipes uses Writable serialization to convert the types into bytes that are sent to the process via the socket.