Hadoop Module 1

Hadoop Distributed File System Basics

Hadoop Distributed File System Design Features

• The Hadoop Distributed File System (HDFS) was designed for Big
Data processing.
• Although capable of supporting many users simultaneously, HDFS is not
designed as a true parallel file system.
• The design assumes a large-file, write-once/read-many model.
• For instance, HDFS rigorously restricts data writing to one user at a
time. All additional writes are “append-only,” and there is no random
writing to HDFS files.
• Bytes are always appended to the end of a stream, and byte streams are
guaranteed to be stored in the order written.
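• As a quick illustration of the append-only model, new data can only be added to the end of an existing file, for example with the appendToFile command (the file names here are placeholders):
• $ hdfs dfs -appendToFile more-data.txt /user/hdfs/data.txt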
Features
• The design of HDFS is based on the design of the Google File System
(GFS).
• HDFS is designed for data streaming where large amounts of data are
read from disk in bulk.
• The HDFS block size is typically 64MB or 128MB. Thus, this approach
is entirely unsuitable for standard POSIX file system use.
• In addition, due to the sequential nature of the data, there is no local
caching mechanism.
• The large block and file sizes make it more efficient to reread data from
HDFS than to try to cache the data.
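• As a sketch of how these large blocks look in practice, the block count and locations of a file already stored in HDFS can be inspected with the fsck utility (the path is a placeholder):
• $ hdfs fsck /user/hdfs/war-and-peace.txt -files -blocks -locations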
HDFS Components
HDFS User Commands
• General HDFS Commands
• $ hdfs version
Hadoop 2.6.0.2.2.4.2-2
Subversion git@github.com:hortonworks/hadoop.git -r
22a563ebe448969d07902aed869ac13c652b2872
Compiled by jenkins on 2015-03-31T19:49Z
Compiled with protoc 2.5.0
From source with checksum b3481c2cdbe2d181f2621331926e267
This command was run using /usr/hdp/2.2.4.2-2/hadoop/hadoop-common-2.6.0.2.2.4.2-2.jar
List Files in HDFS
• To list the files in the root HDFS directory, enter the following command.
• $ hdfs dfs -ls /
Found 10 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps
drwxr-xr-x - hdfs hdfs 0 2015-05-14 10:53 /benchmarks
drwxr-xr-x - hdfs hdfs 0 2015-04-21 15:18 /hdp
drwxr-xr-x - mapred hdfs 0 2015-04-21 14:26 /mapred
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:26 /mr-history
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:27 /system
drwxrwxrwx - hdfs hdfs 0 2015-05-07 13:29 /tmp
drwxr-xr-x - hdfs hdfs 0 2015-04-27 16:00 /user
drwx-wx-wx - hdfs hdfs 0 2015-05-27 09:01 /var
• To list files in your home directory, enter the following command:
• $ hdfs dfs -ls
drwx------ - hdfs hdfs 0 2015-05-27 20:00 .Trash
drwx------ - hdfs hdfs 0 2015-05-26 15:43 .staging
drwxr-xr-x - hdfs hdfs 0 2015-05-28 13:03 DistributedShell
drwxr-xr-x - hdfs hdfs 0 2015-05-14 09:19 TeraGen-50GB
drwxr-xr-x - hdfs hdfs 0 2015-05-14 10:11 TeraSort-50GB
Make a Directory in HDFS
• $ hdfs dfs -mkdir stuff
Copy Files to HDFS
• $ hdfs dfs -put test stuff
Copy Files from HDFS
• $ hdfs dfs -get stuff/test test-local
Copy Files within HDFS
• $ hdfs dfs -cp stuff/test test.hdfs
Delete a File within HDFS
• $ hdfs dfs -rm test.hdfs
Delete a Directory in HDFS
• $ hdfs dfs -rm -r -skipTrash stuff
Deleted stuff
The MapReduce Model
• Apache Hadoop is often associated with MapReduce computing.
• The MapReduce computation model provides a very powerful tool for
many applications.
• There are two stages: a mapping stage and a reducing stage.
• In the mapping stage, a mapping procedure is applied to input data. The
map is usually some kind of filter or sorting process.
• For instance, assume you need to count how many times the name “Kutuzov”
appears in the novel War and Peace. One solution is to gather 20 friends and
give them each a section of the book to search. This step is the map stage. The
reduce phase happens when everyone is done counting and you sum the total as
your friends tell you their counts.
$ grep " Kutuzov " war-and-peace.txt
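• For comparison, the same count can be obtained serially on a single machine by piping the grep output into wc (assuming the novel is available locally as war-and-peace.txt):
• $ grep " Kutuzov " war-and-peace.txt | wc -l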
Simple Mapper Script
#!/bin/bash
# Emit a count of 1 for every occurrence of "Kutuzov" or "Petersburg"
while read line ; do
  for token in $line; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
Simple Reducer Script

#!/bin/bash
# Sum the per-occurrence counts emitted by the mapper
kcount=0
pcount=0
while read line ; do
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
MapReduce Parallel Data Flow
• Input Splits
• HDFS distributes and replicates data over multiple servers.
The default data chunk or block size is 64MB.
• Thus, a 500MB file would be broken into 8 blocks (500/64 rounds up to 8) and written to different machines in the cluster.
• The split size can be based on the number of
records in a file (if the data exist as records) or an actual
size in bytes.
• Splits are almost always smaller than the HDFS block size.
The number of splits corresponds to the number of
mapping processes used in the map stage.
Map Step.
• The mapping process is where the parallel nature of Hadoop comes into play. For large amounts of data, many mappers can be operating at the same time. The user provides the specific mapping process. MapReduce will try to execute the mapper on the machines where the block resides.
• Because the file is replicated in HDFS, the least busy node with the data will be chosen. If all nodes holding the data are too busy, MapReduce will try to pick a node that is closest to the node that holds the data.
Combiner Step.
• It is possible to provide an optimization or pre-reduction as part of the map
stage where key–value pairs are combined prior to the next stage. The
combiner stage is optional.
Shuffle Step.
• Before the parallel reduction stage can complete, all similar keys must be combined and counted by the same reducer process.
• Therefore, results of the map stage must be collected by key–value pairs and shuffled to the same reducer process. If only a single reducer process is used, the shuffle stage is not needed.
Reduce Step.
• The final step is the actual reduction. In this stage, the data reduction is performed as per the programmer's design. The reduce step is also optional. The results are written to HDFS.
Fault Tolerance
• One of the most interesting aspects of parallel MapReduce operation is the strict control of data flow throughout the execution of the program.
• Mapper processes do not exchange data with other mapper processes, and data can only go from mappers to reducers, not the other direction.
• This confined data flow enables MapReduce to operate in a fault-tolerant fashion.
• The design of MapReduce makes it possible to easily recover from the failure of one or many map processes.
• For example, should a server fail, the map tasks that were running on that machine could easily be restarted on another working server because there is no dependence on any other map task. In functional language terms, the map tasks “do not share state” with other mappers. In a similar fashion, failed reducers can be restarted; however, there may be additional work that has to be redone in such a case. Recall that a completed reduce task writes results to HDFS.
• If a node fails after this point, the data should still be available due to the redundancy in HDFS. If reduce tasks remain to be completed on a down node, the MapReduce ApplicationMaster will need to restart the reducer tasks. If the mapper output is not available for the newly restarted reducers, then those map tasks will need to be restarted. This process is totally transparent to the user and provides a fault-tolerant system to run applications.
Speculative Execution
• One of the challenges with many large clusters is the inability to predict or manage unexpected system bottlenecks or failures.
• In theory, it is possible to control and monitor resources so that network traffic and processor load can be evenly balanced; however, this problem represents a difficult challenge for large systems.
• Thus, it is possible that a congested network, slow disk controller, failing disk, high processor load, or some other similar problem might lead to slow performance without anyone noticing.
• When one part of a MapReduce process runs slowly, it ultimately slows down everything else because the application cannot complete until all processes are finished.
• The nature of the parallel MapReduce model provides an interesting solution to this problem: it is possible to start a copy of a running map process without disturbing any other running mapper processes.
• As most of the map tasks are coming to a close, the ApplicationMaster notices that some are still running and schedules redundant copies of the remaining jobs on less busy or free servers. Should the secondary processes finish first, the other processes are then terminated (or vice versa). This process is known as speculative execution.
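• Speculative execution can be tuned per job type; a minimal sketch of the standard Hadoop 2 properties in mapred-site.xml (these names come from mapred-default.xml, and in a stock configuration both default to true):
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>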
MapReduce Programming
• Compiling and Running the Hadoop WordCount Example
• Given two input files with contents Hello World Bye World and Hello Hadoop Goodbye Hadoop, the WordCount mapper will produce two maps:
• < Hello, 1>
• < World, 1>
• < Bye, 1>
• < World, 1>
• < Hello, 1>
• < Hadoop, 1>
• < Goodbye, 1>
• < Hadoop, 1>
WordCount sets a mapper
job.setMapperClass(TokenizerMapper.class);
a combiner
job.setCombinerClass(IntSumReducer.class);
and a reducer
job.setReducerClass(IntSumReducer.class);

• Hence, the output of each map is passed through the local combiner (which sums the values in the same way as the reducer) for local aggregation, and the data are then sent on to the final reducer.
• Thus, for the two maps above, the combiner performs the following pre-reductions:
• < Bye, 1>
• < Hello, 1>
• < World, 2>
• < Goodbye, 1>
• < Hadoop, 2>
• < Hello, 1>
• The reducer implementation, via the reduce method, simply sums the values, which are the occurrence counts for each key.
• The final output of the reducer is the following:
• < Bye, 1>
• < Goodbye, 1>
• < Hadoop, 2>
• < Hello, 2>
• < World, 2>
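• As a sketch, the example can be compiled and run following the standard Apache Hadoop WordCount tutorial, assuming JAVA_HOME is set, the mapper, combiner, and reducer classes live in a file named WordCount.java, and the input files are already in a placeholder HDFS directory wordcount/input:
• $ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
• $ hadoop com.sun.tools.javac.Main WordCount.java
• $ jar cf wc.jar WordCount*.class
• $ hadoop jar wc.jar WordCount wordcount/input wordcount/output
• $ hdfs dfs -cat wordcount/output/part-r-00000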
Using the Streaming Interface

• The Apache Hadoop streaming interface enables almost any program to use the MapReduce engine.
• The streaming interface will work with any program that can read and write to stdin and stdout.
• When working in the Hadoop streaming mode, only the mapper and the reducer are created by the user.
• This approach has the advantage that the mapper and the reducer can be easily tested from the command line.
Python Mapper Script (mapper.py)

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Python Reducer Script (reducer.py)
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
• The operation of the mapper.py script can be observed by running the command shown in the following (the output is tab-delimited and lowercase, exactly as emitted by the script):
• $ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
• The operation of the reducer.py script can be checked by piping the mapper output through sort and into the reducer in the following command pipeline:
• $ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2
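• To run the same two scripts at scale, they are submitted through the Hadoop streaming jar. A sketch is shown below; the jar location is typical of an Apache install and differs between distributions, and the HDFS input/output paths are placeholders:
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hdfs/war-and-peace.txt \
    -output /user/hdfs/streaming-output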
Using the Pipes Interface

• Pipes is a library that allows C++ source code to be used for mapper and reducer code. Applications that require high performance when crunching numbers may achieve better throughput if written in C++ and used through the Pipes interface.
• Both key and value inputs to Pipes programs are provided as STL strings (std::string).
• A program to use with Pipes is defined by writing classes extending Mapper and Reducer. Hadoop must then be informed as to which classes to use to run the job.
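• Once compiled, the C++ binary is typically copied into HDFS and the job is launched with the pipes runner. A sketch, assuming a hypothetical word-count binary has been uploaded to bin/wordcount in HDFS and that the input/output paths are placeholders:
$ mapred pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input wordcount/input \
    -output wordcount/pipes-output \
    -program bin/wordcount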
Compiling and Running the Hadoop Grep Chaining Example
• The Hadoop Grep example works differently from the *nix grep command in that it does not display the complete matching line, only the matching string. If matching lines are needed for the string foo, use .*foo.* as a regular expression.
• The program runs two map/reduce jobs in sequence and is an example of MapReduce chaining. The first job counts how many times a matching string occurs in the input, and the second job sorts matching strings by their frequency and stores the output in a single file.
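• A sketch of running the bundled Grep example against text already in HDFS; the examples jar location and the input/output paths are placeholders that vary by installation:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    grep war-and-peace-input grep-output ' Kutuzov '
$ hdfs dfs -cat grep-output/part-r-00000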
Debugging MapReduce

• The best advice for debugging parallel MapReduce applications is to avoid debugging on the full distributed system whenever possible.
• The key word here is parallel. Debugging on a distributed system is hard and should be avoided at all costs.
• The best approach is to make sure applications run on a simpler system (i.e., the HDP Sandbox or the pseudo-distributed single-machine install) with smaller data sets. Errors on these systems are much easier to locate and track. In addition, unit testing applications before running at scale is important. If applications can run successfully on a single system with a subset of real data, then running in parallel should be a simple task because the MapReduce algorithm is transparently scalable. Note that many higher-level tools (e.g., Pig and Hive) enable local mode development for this reason.
Listing, Killing, and Job Status
• Jobs can be managed using the mapred job command.
• The most important options are -list, -kill, and -status.
• In addition, the yarn application command can be used to control all applications running on the cluster.
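• For example, jobs and applications can be listed, inspected, and killed as follows (the IDs are placeholders reported by the -list options):
• $ mapred job -list
• $ mapred job -status <job-id>
• $ mapred job -kill <job-id>
• $ yarn application -list
• $ yarn application -kill <application-id>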
Hadoop Log Management

• The MapReduce logs provide a comprehensive listing of both mappers and reducers.
• The actual log output for the application consists of three files: stdout, stderr, and syslog (Hadoop system messages).
• There are two modes for log storage.
• The first (and best) method is to use log aggregation. In this mode, logs are aggregated in HDFS and can be displayed in the YARN ResourceManager user interface.
• If log aggregation is not enabled, the logs will be placed locally on the cluster nodes on which the mapper or reducer ran. The location of the unaggregated local logs is given by the yarn.nodemanager.log-dirs property in the yarn-site.xml file.
• Without log aggregation, the cluster nodes used by the job must be noted, and then the log files must be obtained directly from the nodes.
• Log aggregation is highly recommended.


Enabling YARN Log Aggregation

• If Apache Hadoop was installed from the official Apache Hadoop sources, the following settings will ensure log aggregation is turned on for the system.
• If you are using Ambari or some other management tool, change the setting using that tool and do not hand-modify the configuration files. For example, if you are using Apache Ambari, check under the YARN Service Configs tab for yarn.log-aggregation-enable. The default setting is enabled.
• To manually enable log aggregation, follow these steps:
• 1. As the HDFS superuser administrator (usually user hdfs), create the following directory in HDFS:
$ hdfs dfs -mkdir -p /yarn/logs
$ hdfs dfs -chown -R yarn:hadoop /yarn/logs
$ hdfs dfs -chmod -R g+rw /yarn/logs
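• The procedure normally continues by pointing YARN at this directory and turning aggregation on in yarn-site.xml. A minimal sketch using the standard YARN property names, with the directory value matching the step above (the YARN services must be restarted after editing the file):
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/yarn/logs</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>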
Web Interface Log View

• The most convenient way to view logs is to use the YARN ResourceManager web user interface. Each task has a link to the logs for that task. If log aggregation is enabled, clicking on the log link will show a window in which the contents of stdout, stderr, and syslog are displayed on a single page. If log aggregation is not enabled, a message stating that the logs are not available will be displayed.
Command-Line Log Viewing
• MapReduce logs can also be viewed from the command line. The yarn logs command enables the logs to be easily viewed together without having to hunt for individual log files on the cluster nodes. As before, log aggregation is required for use. The options to yarn logs are as follows:
• $ yarn logs
• Retrieve logs for completed YARN applications.
• usage: yarn logs -applicationId <application ID> [OPTIONS]
• general options are:
• -appOwner <Application Owner> AppOwner (assumed to be current user if
• not specified)
• -containerId <Container ID> ContainerId (must be specified if node
• address is specified)
• -nodeAddress <Node Address> NodeAddress in the format nodename:port
(must be specified if container id is specified)
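• For example, the aggregated logs for a completed application can be piped to a pager (substitute an application ID reported by yarn application -list or shown in the ResourceManager UI):
• $ yarn logs -applicationId <application ID> | more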
