
Developing a MapReduce Application
The Configuration API
• Components in Hadoop are configured using Hadoop's own configuration API.
• An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values.
• Each property is named by a String, and the type of a value may be one of several types, including Java primitives such as boolean, int, long, and float, and other useful types such as String, Class, java.io.File, and collections of Strings.
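A minimal usage sketch of the Configuration API (the resource name configuration-1.xml and the property names color, size, and breadth are assumptions for this illustration; the file itself is sketched under the next heading):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical resource on the classpath; see the simple configuration file below.
    conf.addResource("configuration-1.xml");

    // Properties are looked up by name with type-aware getters.
    String color = conf.get("color");              // returns the raw String value
    int size = conf.getInt("size", 0);             // 0 is the default if the property is unset
    String breadth = conf.get("breadth", "wide");  // default used when the property is absent

    System.out.printf("color=%s, size=%d, breadth=%s%n", color, size, breadth);
  }
}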
A Simple Configuration File
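The original listing is not reproduced on this slide; a minimal sketch of such a file, assuming the illustrative property names used above (saved as configuration-1.xml):

<?xml version="1.0"?>
<configuration>
<property>
<name>color</name>
<value>yellow</value>
<description>Color</description>
</property>
<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>
</configuration>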
Combining Resources
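Configuration can load several resources in turn; properties defined in later resources override earlier definitions, except for properties marked final. A minimal sketch, assuming a second illustrative file configuration-2.xml that overrides size:

import org.apache.hadoop.conf.Configuration;

public class CombiningResourcesExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Both resource names are assumptions for this illustration.
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");

    // If configuration-2.xml redefines size, its value wins here;
    // a property marked <final>true</final> in the first file could not be overridden.
    System.out.println(conf.getInt("size", 0));
  }
}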
Configuring the Development Environment

• The first step is to download the version of Hadoop that you plan to use and unpack it on your development machine.
• Using an IDE, create a new project and add all the JAR files from the top level of the unpacked distribution and from the lib directory to the classpath.
• You can then compile Java Hadoop programs and run them in local (standalone) mode within the IDE.
Managing Configuration
• When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster.
• Assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml (a sketch of hadoop-local.xml follows the cluster file below).
• hadoop-cluster.xml contains details of the cluster's namenode and YARN resource manager addresses.
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode/</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager:8032</value>
</property>
</configuration>
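For comparison, a minimal sketch of hadoop-local.xml, which selects the default (local) filesystem and the local job runner:

<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>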
GenericOptionsParser, Tool, and ToolRunner

• Hadoop comes with a few helper classes for making it easier to run jobs from the command line.
• GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
• It is usually more convenient to implement the Tool interface and run your application with ToolRunner, which uses GenericOptionsParser internally.
An example Tool implementation for printing the properties in a Configuration
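The listing is not reproduced on this slide; a sketch along these lines (it prints every property by iterating over the Configuration):

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  static {
    // Load the site files in addition to core-default.xml and core-site.xml.
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("yarn-default.xml");
    Configuration.addDefaultResource("yarn-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Configuration is Iterable over its key/value pairs.
    for (Entry<String, String> entry : conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}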
Contd

• ConfigurationPrinter is a subclass of Configured, which is an implementation of the Configurable interface.
• All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this.
• The run() method obtains the Configuration using Configurable's getConf() method and then iterates over it, printing each property to standard output.
• Picking up the properties specified in conf/hadoop-localhost.xml by
running the following commands:
% export HADOOP_CLASSPATH=target/classes/
% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \
  | grep yarn.resourcemanager.address=
yarn.resourcemanager.address=localhost:8032
• GenericOptionsParser also allows us to set individual properties. For example:
% hadoop ConfigurationPrinter -D color=yellow | grep color
color=yellow
GenericOptionsParser and ToolRunner options
Writing a Unit Test

• The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style.
• MRUnit is a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected.
The test for the mapper
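The listing is not reproduced on this slide; a minimal MRUnit sketch, assuming a MaxTemperatureMapper that reads the year at offset 15 and the signed temperature (in tenths of a degree) at offset 87 of a fixed-width record (the offsets and the synthetic input line are assumptions for the illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException {
    // Build a synthetic fixed-width line: year 1950 at offset 15,
    // temperature -0011 (i.e. -1.1 C) at offset 87.
    StringBuilder line = new StringBuilder();
    while (line.length() < 100) {
      line.append('0');
    }
    line.replace(15, 19, "1950");
    line.replace(87, 92, "-0011");

    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new MaxTemperatureMapper())
        .withInput(new LongWritable(0), new Text(line.toString()))
        .withOutput(new Text("1950"), new IntWritable(-11))
        .runTest();
  }
}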
The test for the reducer
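Likewise for the reducer, a sketch that checks the maximum of a list of temperatures is selected for a year:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MaxTemperatureReducerTest {

  @Test
  public void returnsMaximumIntegerInValues() throws IOException {
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
        .withReducer(new MaxTemperatureReducer())
        .withInput(new Text("1950"),
            Arrays.asList(new IntWritable(10), new IntWritable(5)))
        .withOutput(new Text("1950"), new IntWritable(10))
        .runTest();
  }
}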
Running Locally on Test Data
Running a Job in a Local Job Runner
Using the Tool interface introduced earlier in the
chapter, it’s easy to write a driver to run our
MapReduce job for finding the maximum temperature
by year.
Application to find the maximum temperature
public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
• To run in a local job runner
% export HADOOP_CLASSPATH=target/classes/
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
input/ncdc/micro output
or, equivalently:
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local \
  input/ncdc/micro output
Testing the Driver
• Two approaches:
• Use the local job runner and run the job against a test file on the local filesystem (a sketch of such a test follows this list)
• Run the driver using a "mini-" cluster
• The MiniDFSCluster and MiniMRCluster classes
• Create an in-process cluster for testing against the full HDFS and MapReduce machinery
• ClusterMapReduceTestCase
• A useful base class for writing such a test
• Handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods
• Generates a suitable JobConf object that is configured to work with the clusters
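A sketch of the first approach, running the driver through the local job runner against a small local test file (the input path and the sort-buffer setting are assumptions for this illustration):

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class MaxTemperatureDriverTest {

  @Test
  public void runsAgainstLocalTestData() throws Exception {
    Configuration conf = new Configuration();
    // Force the local filesystem and the local (in-process) job runner.
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");
    conf.setInt("mapreduce.task.io.sort.mb", 1);

    Path input = new Path("input/ncdc/micro");  // assumed location of the test data
    Path output = new Path("output");

    // Delete old output so the job does not fail on an existing directory.
    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true);

    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);

    int exitCode = driver.run(new String[] { input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    // Further assertions would compare the generated part files with expected output.
  }
}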
Running on a Cluster
• The local job runner uses a single JVM to run a job, so as long as all
the classes that our job needs are on its classpath, then things will
just work.
• In a distributed setting, things are a little more complex. For a start, a
job’s classes must be packaged into a job JAR file to send to the
cluster.
• Packaging a Job
• The following Maven command will create a JAR file called hadoop-examples.jar in the project directory containing all of the compiled classes:
% mvn package -DskipTests
• The client classpath
The user’s client-side classpath set by hadoop jar <jar> is made up of:
• The job JAR file
• Any JAR files in the lib directory of the job JAR file, and the classes directory (if
present)
• The classpath defined by HADOOP_CLASSPATH, if set
The task classpath
• On a cluster (and this includes pseudodistributed mode), map and reduce
tasks run in separate JVMs, and their classpaths are not controlled by
HADOOP_CLASSPATH.
• HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the
driver JVM, which submits the job.
• Instead, the user's task classpath comprises the following:
• The job JAR file
• Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)
• Any files added to the distributed cache using the -libjars option, the addFileToClassPath() method on DistributedCache (old API), or Job (new API)
Packaging dependencies
• Given these different ways of controlling what is on the client and task
classpaths, there are corresponding options for including library
dependencies for a job:
• Unpack the libraries and repackage them in the job JAR.
• Package the libraries in the lib directory of the job JAR.
• Keep the libraries separate from the job JAR, and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.
Launching a Job
• To launch the job, we need to run the driver, specifying the cluster
that we want to run the job on with the -conf option (we equally
could have used the -fs and -jt options):
% unset HADOOP_CLASSPATH
% hadoop jar hadoop-examples.jar v2.MaxTemperatureDriver \
-conf conf/hadoop-cluster.xml input/ncdc/all max-temp
Retrieving the Results
• Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.
• This job produces a very small amount of output, so it is convenient to
copy it from HDFS to our development machine. The -getmerge
option to the hadoop fs command is useful
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
• Another way of retrieving the output, if it is small, is to use the -cat option to print the output files to the console:
% hadoop fs -cat max-temp/*
Debugging a Job
• Via print statements
• Difficult to examine the output, which may be scattered across the nodes
• Using Hadoop features (a sketch follows this list)
• Task's status message
• To prompt us to look in the error log
• Custom counter
• To count the total number of records with implausible data
• If the amount of log data is large
• Write the information to the map's output rather than to standard error, for analysis and aggregation by the reducer
• Write a program to analyze the logs
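A sketch of the "Using Hadoop features" approach inside the mapper, assuming an enum for the custom counter and the fixed-width NCDC field offsets used in the running example (the offsets and the implausibility threshold are assumptions for this illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);

    int airTemperature;
    if (line.charAt(87) == '+') { // strip a leading plus sign before parsing
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }

    if (airTemperature > 1000) {
      // Temperatures are stored in tenths of a degree, so over 100 C is implausible:
      // log it, update the task status, and bump a custom counter.
      System.err.println("Temperature over 100 degrees for input: " + value);
      context.setStatus("Detected possibly corrupt record: see logs.");
      context.getCounter(Temperature.OVER_100).increment(1);
    } else {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}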
