
MODULE-1

• HDFS Basics
• Running example programs and benchmarks
• Hadoop MapReduce Framework
• MapReduce Programming
OUTLINE
 Compiling and Running the Hadoop WordCount Example
 Using the Streaming Interface
 Using the Pipes Interface
 Compiling and Running the Hadoop Grep Chaining Example
 Debugging MapReduce
MAPREDUCE PROGRAMMING
 WordCount.java
 To compile and run the program from the command line, perform the following steps:
1. Make a local wordcount_classes directory
$ mkdir wordcount_classes

2. Compile the WordCount.java program, using the `hadoop classpath` command to include all the available Hadoop class paths. Note that the quotes are backquotes (command substitution), not single quotes.
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java

3. The jar file can be created using the following command (note the trailing dot, which tells jar to include everything under wordcount_classes):
$ jar -cvf wordcount.jar -C wordcount_classes/ .
4. To run the example, create an input directory in HDFS and place a text file in the new directory.
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

5. Run the WordCount application using the following command:
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
 On success, the war-and-peace-output directory will contain the job output files: typically a _SUCCESS marker and one part-r-NNNNN file per reducer (here, part-r-00000).
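For instance, they can be listed with the standard HDFS listing command:
$ hdfs dfs -ls war-and-peace-output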

 The complete list of word counts can be copied from HDFS to the local file system as follows:
$ hdfs dfs -get war-and-peace-output/part-r-00000

 The output directory and all its contents can be removed with the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
USING THE STREAMING INTERFACE
 The Apache Hadoop streaming interface enables any program to use the MapReduce engine.
 The streaming interface works with any program that can read from stdin and write to stdout.
 In streaming mode, only the mapper and the reducer are created by the user.
 In this approach the mapper and reducer can be easily tested from the command line; a sketch of both scripts follows.
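A minimal pair of word-count streaming scripts might look like the following (a sketch only; the file names mapper.py and reducer.py match the run command used later, but the exact scripts are not part of the original slides):

mapper.py:
#!/usr/bin/env python
# Read lines from stdin and emit one <word TAB 1> pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:
#!/usr/bin/env python
# Sum the counts for each word. Hadoop delivers mapper output sorted
# by key, so all pairs for a given word arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Because both scripts simply read stdin and write stdout, they can be tested locally without Hadoop:
$ chmod +x mapper.py reducer.py
$ cat war-and-peace.txt | ./mapper.py | sort | ./reducer.py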
 To run this application using Hadoop, create a
directory and move the war-and-peace.txt input
file into HDFS
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

 Before running, the output directory must be removed:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

 Locate the hadoop-streaming.jar file in the Hadoop installation. The following command line uses mapper.py and reducer.py to do a word count on the input file.
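A typical streaming invocation looks like the following (a sketch; the hadoop-streaming jar path shown here is an assumption and varies by distribution and version):

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input war-and-peace-input \
  -output war-and-peace-output \
  -mapper mapper.py \
  -reducer reducer.py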
USING THE PIPES INTERFACE
 Pipes is a library that allows C++ source code to be used for the mapper and reducer.
 Both key and value inputs to Pipes programs are provided as strings.
 The program must define an instance of the mapper and an instance of the reducer.
 A program uses Pipes by extending the Pipes Mapper and Reducer classes.
 The executable must be placed in HDFS.
 The program can be compiled with the following line (assuming the Hadoop Pipes headers and libraries are available):
$ g++ wordcount.cpp -o wordcount

 Create the war-and-peace-input directory in HDFS and copy the input file:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

 The executable must be placed in HDFS, and the output directory must be removed before running:
$ hdfs dfs -put wordcount bin
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

 Run the program as shown below. On success, the output is placed in the war-and-peace-output directory.
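A typical run command looks like the following (a sketch using the standard Pipes options; the two -D properties tell Pipes to use the Java record reader and writer):

$ mapred pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input war-and-peace-input \
  -output war-and-peace-output \
  -program bin/wordcount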
COMPILING AND RUNNING THE HADOOP
GREP CHAINING EXAMPLE
 The Hadoop Grep.java example extracts matching
strings from text files and counts how many times
they occurred.

 It doesn't display the complete matching line, only the matching string, so to display lines matching "foo", use .*foo.* as the regular expression.
 The program runs two MapReduce jobs in sequence.
 The first job counts how many times a matching string occurred in the input, and the second job sorts the matching strings by their frequency and stores the output in a single output file.
 Each mapper of the first job takes a line as input and matches it against the user-provided regular expression. It extracts all matching strings and emits <matching string, 1> pairs.
 The RegexMapper class is used to perform this task.
 Each reducer sums the frequencies of each matching string. The output is sequence files containing the matching string and its count.
 The reduce phase is optimized by running a combiner that sums the frequencies of strings from the local map output. As a result, it reduces the amount of data that needs to be shipped to a reduce task.
 The LongSumReducer class is used to perform this task.
 The second job takes the output of the first job as its input, sorts the matching strings by their frequencies, and stores the output in a single file.
 The mapper is an inverse map that reverses (swaps) its input <key, value> pair to a <value, key> pair.
 The reducer is an IdentityReducer. The number of reducers is one, so the output is stored in one file, sorted by count in descending order.
 The output is a text file that contains a count and a string per line.
 Compile and run Grep.java
1. Create a directory
$ mkdir Grep_classes

2. Compile Grep.java using the following line (again, note the backquotes):
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java

3. Create a Java archive using the following command (note the trailing dot):
$ jar -cvf Grep.jar -C Grep_classes .

4. Create a directory in HDFS and move war-and-peace.txt into HDFS.
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Ensure that the output directory is removed before the run:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

6. The following command will run the program:
$ hadoop jar Grep.jar org.apache.hadoop.examples.Grep war-and-peace-input war-and-peace-output Kutuzov

7. The result can be found in the output file:
$ hdfs dfs -cat war-and-peace-output/part-r-00000
530 Kutuzov
DEBUGGING MAPREDUCE
 The best advice for debugging parallel MapReduce applications is: don't.
 Debugging on a distributed system is hard and should be avoided at all costs.
 The best approach is to make sure the application runs on a simpler system with a small data set (see the example after this list).
 Errors are much easier to locate and track there. Unit testing is important.
 If the application runs successfully on a single system with a subset of the data, it can then be run in parallel.
 Another approach is to use the application logs to inspect its progress.
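 A quick way to build such a small test set is to slice the real input (an illustrative command, not part of the original steps):
$ head -1000 war-and-peace.txt > war-and-peace-small.txt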
Listing, Killing, and Job Status
 Jobs can be managed using the mapred job command.
 The options are -list, -kill, and -status; see the examples below.
 The yarn application command can be used to control all applications running on the cluster.
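For instance (the job and application IDs are placeholders):
$ mapred job -list
$ mapred job -status <job-id>
$ mapred job -kill <job-id>
$ yarn application -list
$ yarn application -kill <application-id>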

Hadoop Log Management
 The MapReduce logs provide a listing for both mappers and reducers.
 The log output consists of 3 files: stdout, stderr, and syslog.
 There are 2 modes for log storage.
 The first method is log aggregation, in which the logs are aggregated in HDFS.
 The result is displayed using the yarn logs command.
 In the second method, if log aggregation is disabled, the logs are kept locally on the cluster nodes where the mappers and reducers ran.

 The location of the unaggregated local logs is given by the yarn.nodemanager.log-dirs property in yarn-site.xml.
 Log aggregation is highly recommended.

Enabling YARN Log Aggregation
 To manually enable log aggregation, follow these steps:
1. Create the following directory in HDFS:
$ hdfs dfs -mkdir -p /yarn/logs
$ hdfs dfs -chown -R yarn:hadoop /yarn/logs
$ hdfs dfs -chmod -R g+rw /yarn/logs

2. Add the following properties to yarn-site.xml and restart all YARN services.
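The two relevant properties enable aggregation and point it at the directory created above (a sketch; these are the standard YARN property names):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/yarn/logs</value>
</property>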
Command-Line Log Viewing
 MapReduce logs can also be viewed from the command line.
 The yarn logs command enables the logs to be viewed:
$ yarn logs
 The main options to yarn logs are -applicationId <application ID>, -appOwner <application owner>, -containerId <container ID>, and -nodeAddress <node address>.


Example
 Run the pi example program:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000

 Find the application ID using:
$ yarn application -list -appStates FINISHED

 Run the following command to produce a dump of all the logs for the application:
$ yarn logs -applicationId application_143266…_001 > AppOut

 The AppOut file can be inspected. The list of containers is found in the file, and a specific container can be examined, as shown below.
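For example, an individual container's logs can be dumped with a command like the following (a sketch; the container ID and node address are hypothetical placeholders that would be taken from the AppOut listing, and in Hadoop 2 the -containerId option must be paired with -nodeAddress):
$ yarn logs -applicationId application_143266…_001 \
  -containerId container_143266…_001_01_000001 \
  -nodeAddress node1:45454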
