Developing a MapReduce Application

The Configuration API
• Components in Hadoop are configured using Hadoop’s own configuration API.
• An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values.
• Each property is named by a String, and the type of a value may be one of several types, including Java primitives such as boolean, int, long, and float, and other useful types such as String, Class, java.io.File, and collections of Strings.

A simple configuration file / Combining Resources
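The slide content for these two topics is not reproduced above, so the following is only a minimal sketch of how they fit together: it assumes two hypothetical resource files, configuration-1.xml and configuration-2.xml, on the classpath, each a standard <configuration> document of <property> name/value pairs. Properties are read with typed getters, and when more than one resource is added, properties defined later override earlier ones.

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // configuration-1.xml and configuration-2.xml are hypothetical files on
        // the classpath used only to illustrate combining resources.
        conf.addResource("configuration-1.xml");
        conf.addResource("configuration-2.xml");

        // Typed getters; the second argument is the default used when the
        // property is not set in any resource.
        String color = conf.get("color", "unknown");
        int size = conf.getInt("size", 0);

        // If both files define the same property, the value from the resource
        // added later (configuration-2.xml) wins, unless the earlier definition
        // is marked <final>true</final>.
        System.out.println("color=" + color + ", size=" + size);
    }
}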
Configuring the Development Environment
• The first step is to download the version of Hadoop that you plan to use and unpack it on your development machine.
• Using an IDE, create a new project and add all the JAR files from the top level of the unpacked distribution and from its lib directory to the classpath.
• You can then compile Java Hadoop programs and run them in local (standalone) mode within the IDE.

Managing Configuration
• When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster.
• One way to accommodate this is to keep a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml.
• hadoop-cluster.xml contains details of the cluster’s namenode and YARN resource manager addresses:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode/</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8032</value>
  </property>
</configuration>
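The other two files are not shown on the slides. As a point of comparison, hadoop-local.xml would plausibly look something like the sketch below, pointing the default filesystem at the local filesystem and selecting the local MapReduce framework; the property names are the standard ones, but the exact contents here are an assumption.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>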
GenericOptionsParser, Tool, and ToolRunner
• Hadoop comes with a few helper classes for making it easier to run jobs from the command line.
• GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
• It is usually more convenient to implement the Tool interface and run your application with ToolRunner, which uses GenericOptionsParser internally.

An example Tool implementation for printing the properties in a Configuration
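The slide’s code listing is not reproduced above; the sketch below is consistent with the description that follows. The static block that registers the HDFS, YARN, and MapReduce site files as default resources is an assumption made here so that their properties show up in the output.

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  static {
    // Assumed here: load the HDFS, YARN, and MapReduce resources so their
    // properties are included alongside core-default.xml and core-site.xml.
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("yarn-default.xml");
    Configuration.addDefaultResource("yarn-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Configuration is Iterable over its property entries.
    for (Entry<String, String> entry : conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}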
• ConfigurationPrinter is a subclass of Configured, which is an implementation of the Configurable interface.
• All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this.
• The run() method obtains the Configuration using Configurable’s getConf() method and then iterates over it, printing each property to standard output.
• We can pick up the properties specified in conf/hadoop-localhost.xml by running the following commands:

% export HADOOP_CLASSPATH=target/classes/
% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \
  | grep yarn.resourcemanager.address=
yarn.resourcemanager.address=localhost:8032

• GenericOptionsParser also allows us to set individual properties. For example:

% hadoop ConfigurationPrinter -D color=yellow | grep color
color=yellow

GenericOptionsParser and ToolRunner options
Writing a Unit Test
• The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style.
• MRUnit is a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected (see the test sketch below).

The test for the mapper / Test on reducer
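The slides’ test listings are not reproduced above. Below is a minimal sketch of what an MRUnit test for the mapper might look like, assuming a MaxTemperatureMapper that parses a fixed-width NCDC record line into a (year, temperature) pair; the sample record and the expected (1950, -11) output are illustrative, not taken from the slides. A reducer test is analogous, using ReduceDriver.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws Exception {
    // Illustrative NCDC record; only the year and temperature fields matter
    // to the mapper, so treat the exact contents as an assumption.
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");

    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new MaxTemperatureMapper())
        .withInput(new LongWritable(0), value)
        .withOutput(new Text("1950"), new IntWritable(-11))
        .runTest();
  }
}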
Running Locally on Test Data

Running a Job in a Local Job Runner
• Using the Tool interface introduced earlier in the chapter, it is easy to write a driver to run our MapReduce job for finding the maximum temperature by year.

Application to find the maximum temperature

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}

• To run in the local job runner:

% export HADOOP_CLASSPATH=target/classes/
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
  input/ncdc/micro output

or, equivalently:

% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output

Testing the Driver
• Two approaches:
  • Use the local job runner and run the job against a test file on the local filesystem (see the sketch below).
  • Run the driver using a “mini-” cluster.
• MiniDFSCluster and MiniMRCluster classes
  • Create an in-process cluster for testing against the full HDFS and MapReduce machinery.
• ClusterMapReduceTestCase
  • A useful base class for writing such a test.
  • Handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods.
  • Generates a suitable JobConf object that is configured to work with the clusters.
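A minimal sketch of the first approach, testing the driver with the local job runner. The test class name, input path, and JUnit assertion are illustrative assumptions; the configuration keys are the standard ones for forcing the local filesystem and the local MapReduce framework.

import static org.junit.Assert.assertTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class MaxTemperatureDriverTest {

  @Test
  public void testOnLocalJobRunner() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");          // use the local filesystem
    conf.set("mapreduce.framework.name", "local"); // use the local job runner
    conf.setInt("mapreduce.task.io.sort.mb", 1);   // keep memory usage small for tests

    // Illustrative paths: point the job at a small sample file on the local filesystem.
    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");

    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // delete old output, if any

    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);

    int exitCode = driver.run(new String[] { input.toString(), output.toString() });
    assertTrue("Job failed", exitCode == 0);
  }
}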
Running on a Cluster
• The local job runner uses a single JVM to run a job, so as long as all the classes that the job needs are on its classpath, things just work.
• In a distributed setting, things are a little more complex. For a start, a job’s classes must be packaged into a job JAR file to send to the cluster.

Packaging a Job
• The following Maven command creates a JAR file called hadoop-examples.jar in the project directory containing all of the compiled classes:

% mvn package -DskipTests

The client classpath
• The user’s client-side classpath set by hadoop jar <jar> is made up of:
  • The job JAR file
  • Any JAR files in the lib directory of the job JAR file, and the classes directory (if present)
  • The classpath defined by HADOOP_CLASSPATH, if set

The task classpath
• On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH.
• HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver JVM, which submits the job.
• Instead, the user’s task classpath is made up of the following:
  • The job JAR file
  • Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)
  • Any files added to the distributed cache using the -libjars option, or the addFileToClassPath() method on DistributedCache (old API) or Job

Packaging dependencies
• Given these different ways of controlling what is on the client and task classpaths, there are corresponding options for including library dependencies for a job:
  • Unpack the libraries and repackage them in the job JAR.
  • Package the libraries in the lib directory of the job JAR.
  • Keep the libraries separate from the job JAR, and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.

Launching a Job
• To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option (we could equally have used the -fs and -jt options):

% unset HADOOP_CLASSPATH
% hadoop jar hadoop-examples.jar v2.MaxTemperatureDriver \
  -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

Retrieving the Results
• Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.
• This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here:

% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail

• Another way of retrieving the output if it is small is to use the -cat option to print the output files to the console:

% hadoop fs -cat max-temp/*

Debugging a Job
• Via print statements
  • Difficult to examine the output, which may be scattered across the nodes
• Using Hadoop features
  • Task’s status message, to prompt us to look in the error log
  • Custom counter, to count the total number of records with implausible data (see the sketch below)
• If the amount of log data is large:
  • Write the information to the map’s output rather than to standard error, for analysis and aggregation by the reduce
  • Write a program to analyze the logs
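A minimal sketch of the status-message and custom-counter techniques, using a debugging variant of the max-temperature mapper. The class name, the NCDC field offsets, and the choice of “over 100°C counts as implausible” are assumptions made for illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapperWithDebugging
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Custom counter; its total appears alongside the job's built-in counters.
  enum Temperature { OVER_100 }

  private static final int MISSING = 9999;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                 // assumed NCDC layout
    int airTemperature = Integer.parseInt(
        line.charAt(87) == '+' ? line.substring(88, 92) : line.substring(87, 92));

    if (airTemperature > 1000) { // temperatures are in tenths of a degree Celsius
      // Flag the suspect record: write to the task log, update the task's
      // status message, and bump the custom counter.
      System.err.println("Temperature over 100 degrees for input: " + value);
      context.setStatus("Detected possibly corrupt record: see logs.");
      context.getCounter(Temperature.OVER_100).increment(1);
    }
    if (airTemperature != MISSING) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}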