Extracting Real Value From Your Data With Apache Hadoop
Sarah Sproehnle
Hadoop Instructor, Cloudera
Glynn Durham
Hadoop Instructor, Cloudera
Daniel Templeton
Product Manager, Oracle Grid Engine
Michal Bachorik
Member Technical Staff, Oracle Grid Engine
JavaOne|Oracle Develop
Prerequisites
This hands-on lab assumes you have basic experience with the following technologies: Java SE Platform
Lab Exercises
1. Getting Familiar With Hadoop
2. Developing Map/Reduce Applications
3. Using a Custom File Format
4. Getting Familiar with Hive
BONUS
Solutions to the lab exercises can be found in the /home/lab314413/solutions directory. Please feel free to seek assistance from the instructor or Oracle Demo staff at any point during the lab.
Background Information
In addition to being an API against which Map/Reduce applications are written, Hadoop is also a complex execution framework and distributed file system. The primary interface to both is the <hadoop_home>/bin/hadoop command. Using the hadoop command, it's possible to transfer data in and out of HDFS and launch and monitor Map/Reduce jobs. (A job is a running instance of a Map/Reduce application.) The hadoop command is subdivided into several subsystems. For example, there's a subsystem for working with HDFS and another for launching and managing jobs. To find out more about the subsystems available, run hadoop. To find out more about a specific subsystem, run hadoop <subsystem>, e.g. hadoop fs.
Steps To Follow
Step 1: Exploring HDFS
The first thing to do is to use the Hadoop Distributed File System, HDFS. This is where data is stored in a way that makes it accessible to Hadoop MapReduce programs. The subsystem associated with the wrapper program is fs.

1. Open up a terminal window by selecting Applications > Accessories > Terminal from the desktop menu or by right-clicking on the desktop background and selecting Open Terminal.
2. In the terminal window, run:

   % hadoop fs

   You'll see a help message describing all the commands associated with this subsystem.
3. Run:

   % hadoop fs -ls /

   This command will show you the contents of the root directory. There should be three entries. One is /user. Individual users have a home directory under this path, named for their usernames.
4. Try viewing the contents of the /user directory by running:

   % hadoop fs -ls /user

   You should see your home directory in the directory listing.
5. Try listing the contents of your own home directory by running:

   % hadoop fs -ls /user/lab314413

   Because there are currently no files in your home directory, the command silently exits. Note that this behavior is different from what would happen if you tried to refer to a directory that doesn't exist, such as:

   % hadoop fs -ls /foo

   Note that the directory structure within HDFS has nothing to do with the directory structure of the local file system; they're completely separate namespaces.
3. Now try the same fs -ls command without a path argument:

   % hadoop fs -ls
You should see the same results. If you don't include a path argument to the fs -ls command, it assumes you mean your home directory, e.g. /user/lab314413.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they will implicitly be relative to your home directory. For example, you can see the contents of the uploaded input directory by running:
hadoop fs -ls input
You also could have uploaded the Shakespeare files into HDFS by running:
hadoop fs -put input input
1. Enter:

   % hadoop fs -cat input/all-shakespeare | head -20

   This command will print the first 20 lines of Henry IV to the terminal. This command is useful for viewing the output of a MapReduce program. Very often an individual output file of a MapReduce program will be very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.

   Note that when piping the output of the fs -cat command to a local UNIX command, the full contents of the file are still being extracted from HDFS and sent to your local machine. Once on your local machine, the file contents are then modified before being displayed.

2. If you want to download a file and manipulate it in the local filesystem (e.g., with a text editor), the fs -get command takes two arguments, an HDFS path and a local path, and copies the HDFS contents into the local file system:

   % hadoop fs -get input/all-shakespeare ~/spk.txt
   % head -20 ~/spk.txt
Other Commands
There are several other commands associated with the FsShell subsystem; these can perform most common filesystem manipulations (rm, mv, cp, mkdir, etc.). Try playing around with a few of these if you'd like.
1. Run the grep example program that ships with Hadoop:

   % hadoop jar $HADOOP_HOME/*examples*.jar grep input grep_output "[Ww]herefore"

   This program will search for all instances of the word "wherefore" in the Shakespeare corpus, regardless of initial capitalization, and report back the number of occurrences of each matching string. The strings which matched the regular expression argument, as well as their frequency of occurrence, will be written to files in the output directory, called grep_output. Note that this behavior is different than what the UNIX grep tool typically does.

2. Now, use fs -ls on the output directory to see the list of output files:

   % hadoop fs -ls grep_output
3. Use fs -cat to view individual output files. (There should be one, named part-00000.)

   % hadoop fs -cat grep_output/part-00000
4. Remove the output directory since we won't be needing it anymore:

   % hadoop fs -rmr grep_output
If you'd like, try to run another MapReduce job, substituting a different regular expression for "[Ww]herefore" in the command string. If you would like to send your results to a different output directory, substitute that output directory name for grep_output above. If you want to send your output to the grep_output directory again, you'll need to remove the existing directory named grep_output first. (Hadoop will refuse to run the job and stop with an error message if its output directory already exists.)

Status pages
Hadoop also provides a web-based status tool to see what's occurring in your cluster. If you'd like to see what jobs have been run, open the web browser (Firefox) and enter the URL: http://localhost/. Try starting a job and, while it's still running, refresh the status page. You'll see the running job listed; clicking on it will show you more intermediate status information.
2. In the first terminal, start a long-running job:

   % hadoop jar $HADOOP_HOME/*examples*.jar grep input grep_output "[Ww]herefore"

3. In the second terminal, run:

   % hadoop job -list

   This will list the job IDs of all running jobs. A job ID looks something like job_200902131742_0002.
4. Copy the job ID, and then kill the job by running:

   % hadoop job -kill <jobid>
The JobTracker will kill the job, and the program running in the original terminal will inform you that the job has failed.
If you need to cancel a job and restart it (e.g., because you immediately thought of a bugfix after launching a job), make sure to properly kill the old job first.
Summary
In this exercise you learned about how to use the hadoop command and two of the command subsystems. You should now be able to move data into and out of HDFS as well as start and stop MapReduce programs. In the next exercise we'll look at how to write your own MapReduce program.
Background Information
Functional Programming Concepts
MapReduce programs are designed to process large volumes of data in parallel. This requires dividing the workload across a large number of machines. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale. Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If in a mapping task you change an input (key, value) pair, the change does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs, which are then forwarded by the Hadoop system into the next phase of execution.
List Processing
Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program will do this twice, using two different list processing idioms: map, and reduce. These terms are taken from several list processing languages such as LISP, Scheme, or ML.
Mapping Lists
The first phase of a MapReduce program is called mapping. A list of data elements are provided, one at a time, to a function called the Mapper, which transforms each element individually to an output data element.
Figure 1: Mapping creates a new output list by applying a function to individual elements of an input list.
As an example of the utility of map, suppose you had a function toUpper(str) which returns an uppercase version of its input string. You could use this function with map to turn a list of strings into a list of uppercase strings. Note that we are not modifying the input string here: we are returning a new string that will form part of a new output list.
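As a small illustration of this idiom (plain Java rather than the Hadoop API; the class and helper names here are hypothetical, not part of the lab code), a list of strings can be turned into a new list of uppercase strings like this:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapIdiomExample {
    // Returns an uppercase version of its input string; the input is never modified.
    static String toUpper(String str) {
        return str.toUpperCase();
    }

    // Applies toUpper to each element of the input list, producing a new output list.
    static List<String> mapToUpper(List<String> input) {
        List<String> output = new ArrayList<String>();

        for (String s : input) {
            output.add(toUpper(s));
        }

        return output;
    }

    public static void main(String[] args) {
        // Prints: [TO, BE, OR, NOT]
        System.out.println(mapToUpper(Arrays.asList("to", "be", "or", "not")));
    }
}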
Reducing Lists
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Figure 2: Reducing a list iterates over the input values to produce an aggregate value as output.
Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function, to return the sum of a list of input values.
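A comparable sketch of the reducing idiom, again in plain Java with hypothetical names, combines a list of input values with "+" into a single output value:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceIdiomExample {
    // Iterates over the input values and combines them with "+",
    // returning a single output value.
    static int sum(Iterator<Integer> values) {
        int total = 0;

        while (values.hasNext()) {
            total += values.next();
        }

        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4);

        // Prints: 10
        System.out.println(sum(values.iterator()));
    }
}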
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow. MapReduce is also less strict than other languages about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs.

Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output values. In MapReduce, all of the output values are not usually reduced together. All of the values with the same key are presented to a single reducer together. This is performed independently of any reduce operations occurring on other lists of values, with different keys attached.
Figure 3: Different colors represent different keys. All values with the same key are presented to a single reduce task.
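To make that grouping concrete, here is a minimal plain-Java sketch (not the Hadoop API; the names are hypothetical) in which intermediate (key, value) pairs are first collected by key, and each key's list of values is then reduced independently:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeyExample {
    public static void main(String[] args) {
        // Intermediate (key, value) pairs, as mappers might emit them.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<String, Integer>("cat", 1),
                new AbstractMap.SimpleEntry<String, Integer>("dog", 1),
                new AbstractMap.SimpleEntry<String, Integer>("cat", 1));

        // Group the values by key, mimicking the shuffle and sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();

        for (Map.Entry<String, Integer> pair : pairs) {
            if (!grouped.containsKey(pair.getKey())) {
                grouped.put(pair.getKey(), new ArrayList<Integer>());
            }

            grouped.get(pair.getKey()).add(pair.getValue());
        }

        // Reduce each key's value list independently, here by summing.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int total = 0;

            for (int v : entry.getValue()) {
                total += v;
            }

            System.out.println(entry.getKey() + "\t" + total); // cat 2, dog 1
        }
    }
}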
Steps To Follow
This project is designed to get you familiar with the MapReduce environment by having you write a one-pass MapReduce program to calculate an inverted index over a set of documents. Given a body of input text, an offset indexer uses Hadoop to produce an index of all the words in the text. For each word, the index has a list of all the locations where the word appears, and optionally a text excerpt of each line where the word appears. Running the line indexer on the complete works of Shakespeare, the following input lines:

   lucrece.txt, offset 38624: To cipher what is writ in learned books,
   orlando.txt, offset 66001: ORLANDO Which I take to be either a fool or a cipher.

would produce the following output for the word "cipher":

   cipher lucrece.txt@38624,orlando.txt@66001,...

The goal of this exercise is to develop an inverted index program and run it in Hadoop on a set of documents. The idea will be to create a mapper class that outputs every word in its document as a key with the file and location as the value, and then to create a reducer class that aggregates the values of like keys.
3. Select InvertedIndex.
2. Double-click on the LineIndexMapper.java node. The Java source file will open in the editor window.
3. Look at the class definition on lines 22 and 23. Notice that the mapper class extends the MapReduceBase class, which is a helper class for developing MapReduce applications. Notice also that it implements the Mapper interface and uses generics to specify four class types. The first class is the type of the input key. In most cases, the key for the mapper is the byte offset into the file as a long. The second class is the type of the input values. In most cases the values are lines of text as strings. The third and fourth classes are the types of the keys and values emitted by this mapper class. Our inverted index mapper will be outputting text keys and text values. Notice that Hadoop uses its own set of classes for dealing with primitives (including strings) for performance reasons. Instead of a Long, the input key is an org.apache.hadoop.io.LongWritable. Instead of String, the other three types are org.apache.hadoop.io.Text.
4. Go to line 30 where you will find the INSERT CODE HERE comment. This comment is in the map() method and marks the place where the mapping should be done. Remember that the input key is the byte offset into the file of the first character of the line, and the input value is the line of text found at that byte offset, minus any trailing newlines and carriage returns.
5. The first thing to do is to find the file name for the input split. First, get the FileSplit object for the current input split with the Reporter.getInputSplit() method. You'll have to cast it to a FileSplit object explicitly. Next, get the file name from the FileSplit object with the FileSplit.getPath() method. This method returns a Path object, so you'll need to convert it to a String using the Path.getName() method:

   String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();

6. Next, build the output value by concatenating the file name and the given key with an '@':

   Text outputValue = new Text(fileName + "@" + key);

   Notice that you should use the org.apache.hadoop.io.Text class instead of java.lang.String.
7. Next, tokenize the input value so you can find all the words in it:

   StringTokenizer tokenizer = new StringTokenizer(value.toString());
8. Next, loop over the tokenizer to process each word:

   while (tokenizer.hasMoreTokens()) {
       String word = tokenizer.nextToken();
       ...
   }
9. Finally, for each word that you find, use the OutputCollector instance to output the word as a key with the value from above:

   output.collect(new Text(word), outputValue);

   Notice that you have to convert the String word into a Text instance.
Your mapper class should now look something like:

public class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public LineIndexMapper() {
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String fileName =
            ((FileSplit) reporter.getInputSplit()).getPath().getName();
        Text outputValue = new Text(fileName + "@" + key);
        StringTokenizer tokenizer = new StringTokenizer(value.toString());

        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();

            output.collect(new Text(word), outputValue);
        }
    }
}
Listing 1: LineIndexMapper.java
4. The first thing you'll need is a StringBuilder instance for building the value string:

   StringBuilder combinedValue = new StringBuilder();

5. Next, loop over the values in the values parameter and concatenate them into a comma-separated string:

   boolean first = true;

   while (values.hasNext()) {
       if (!first) {
           combinedValue.append(SEP);
       }

       combinedValue.append(values.next().toString());
       first = false;
   }

6. Finally, emit the combined value using the same key as in the input:

   output.collect(key, new Text(combinedValue.toString()));

   Notice that you have to convert the StringBuilder into a String and then into a Text instance.

Your reducer class should now look something like:
public class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private static final String SEP = ",";

    public LineIndexReducer() {
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder combinedValue = new StringBuilder();
        boolean first = true;

        while (values.hasNext()) {
            if (!first) {
                combinedValue.append(SEP);
            }

            combinedValue.append(values.next().toString());
            first = false;
        }

        output.collect(key, new Text(combinedValue.toString()));
    }
}
Listing 2: LineIndexReducer.java
4. First, create the JobConf object that describes the job:

   JobConf conf = new JobConf(getConf(), InvertedIndex.class);

   The InvertedIndex.class argument is given as the second parameter to help the JobConf instance locate the JAR file associated with the job. It will use the facilities provided by the Java platform to trace back from the class file to the JAR file. You could have specified any class in the project, e.g. the mapper or reducer class.

5. Next, set the input and output paths to the constant values defined above:

   FileInputFormat.addInputPath(conf, new Path(INPUT_PATH));
   FileOutputFormat.setOutputPath(conf, new Path(OUTPUT_PATH));

   The FileInputFormat and FileOutputFormat classes serve dual roles. They serve both as base classes for the classes that manage the ingestion of input files and the creation of output files, and as configuration points. They offer several static methods for configuring how input and output are managed. In this case, we use the static methods to add an input path and set the output path.

6. Next, set the mapper and reducer classes to run:

   conf.setMapperClass(LineIndexMapper.class);
   conf.setReducerClass(LineIndexReducer.class);

7. Next, the MapReduce framework needs to know the types of the keys and values emitted by the mapper and reducer. In this exercise, the mapper and reducer emit keys and values of the same types, so both can be set together with the JobConf.setOutputKeyClass() and JobConf.setOutputValueClass() methods. (If the mapper and reducer emitted different key and value types, you could override the key and value types for the mapper with the JobConf.setMapOutputKeyClass() and JobConf.setMapOutputValueClass() methods.) Set the types of the keys and values emitted from the reducer to Text:

   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(Text.class);

8. Finally, run the job:

   JobClient.runJob(conf);
public class InvertedIndex extends Configured implements Tool {
    // Where to put the data in HDFS when we're done
    private static final String OUTPUT_PATH = "output";
    // Where to read the data from
    private static final String INPUT_PATH = "input/all-shakespeare";

    /** Driver for the actual MapReduce process */
    public int run(String[] args) throws IOException {
        JobConf conf = new JobConf(getConf(), InvertedIndex.class);

        FileInputFormat.addInputPath(conf, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(conf, new Path(OUTPUT_PATH));

        conf.setMapperClass(LineIndexMapper.class);
        conf.setReducerClass(LineIndexReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new InvertedIndex(), args);

        System.exit(ret);
    }
}
Listing 3: InvertedIndex.java
2. From a terminal window (you should have two open; pick one), run:

   % cd <lab_root>/exercises/InvertedIndex
   % hadoop jar dist/InvertedIndex.jar

   This command will read the Shakespeare compendium in the input directory in HDFS and compute an inverted index. It will be written to a directory named output. (You hard-coded that path into the driver class in the previous step.)
3. View your results by running:

   % hadoop fs -cat output/part-00000 | more

   To stop viewing the file contents, press 'q' or CTRL-C.
Before You Rerun
If you need to rerun your program for any reason, you'll first need to remove the output directory that was created in the previous run:

   % hadoop fs -rmr output

If you don't, Hadoop will refuse to run the job and print an error message saying the output directory already exists.
Summary
In this exercise you wrote your first complete MapReduce program. You learned the basic structure of the mapper class, the reducer class, and the driver class. You also were introduced to the Hadoop classes for managing primitives and strings. In the next exercise, you'll learn more about writing MapReduce programs.
Background Information
MapReduce Data Flow
Now that you have seen the components that make up a basic MapReduce job, let's see how everything works together at a higher level.

Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.
A Closer Look
The previous figure described the high-level view of Hadoop MapReduce. From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail. Figure 5 shows the pipeline with more of its mechanics exposed. While only two nodes are depicted, the same pipeline can be replicated across a large number of nodes.
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS. The format of these files is arbitrary; while line-based log files can be used, we could also use a binary format, multi-line input records, or something else entirely. It is typical for these input files to be very large -- tens of gigabytes or more.

InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
- Selects the files or other objects that should be used for input
- Defines the InputSplits that break a file into tasks
- Provides a factory for RecordReader objects that read the file

Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class. When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each. You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job.

InputSplits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.

RecordReader: The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.

Partition & Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper produced them. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data. The Partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result.

Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

OutputFormat: The (key, value) pairs emitted by the Reducer are then written to output files. The way they are written is governed by the OutputFormat. The OutputFormat functions much like the InputFormat class described earlier. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS; they all inherit from a common FileOutputFormat. Each Reducer writes a separate file in a common output directory. These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set by the FileOutputFormat.setOutputPath() method. You can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job. Common OutputFormats provided by Hadoop include TextOutputFormat (the default, which writes lines in "key [tab] value" form), SequenceFileOutputFormat (which writes binary files suitable for reading into subsequent MapReduce jobs), and NullOutputFormat (which discards its output).
RecordWriter: Much like how the InputFormat actually reads individual records through the RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files as directed by the OutputFormat. The output files written by the Reducers are then left in HDFS for your use, either by another MapReduce job, by a separate program, or for human inspection.
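Returning to the Partitioner described under Partition & Shuffle above, the sketch below shows what a hash-style partitioner looks like in the older org.apache.hadoop.mapred API used throughout this lab. The class name WordPartitioner is hypothetical and is not part of the lab code; the logic simply mirrors the modulo-of-hash-code approach of the default partitioner.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {
        // No per-job configuration is needed for this simple partitioner.
    }

    public int getPartition(Text key, Text value, int numPartitions) {
        // Mask off the sign bit so the partition index is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A job would opt into it with conf.setPartitionerClass(WordPartitioner.class); if no partitioner is set, Hadoop falls back on its default hash partitioner.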
Steps To Follow
In this exercise you'll develop a custom InputFormat implementation as well as an associated record reader to read email messages for processing by a provided MapReduce application. To avoid privacy issues, rather than using real email messages, you'll be working with messages from a Twitter ATOM feed of Justin Bieber tweets. The message format is the same and has all the same properties.
2. Select MailReader.
2. Double-click on the MailRecordInputFormat.java node. The Java source file will open in the editor window.
3. Look at the class definition on line 16. Notice that the class extends the FileInputFormat class and uses generics to specify two class types. The first class is the type of the data keys, and the second class is the type of the data values. In this case both are org.apache.hadoop.io.Text.
4. Go to line 20 where you will find the INSERT CODE HERE comment. This comment is in the getRecordReader() method and marks the place where the reader should be created and returned.
5. First, log that the input format is being applied to this input split:

   reporter.setStatus(input.toString());

6. Next, create and return an instance of the MailRecordReader class. You'll develop this class in the next section, but the stub provided includes a constructor that accepts a JobConf object and a FileSplit object. You can simply cast the InputSplit parameter to a FileSplit object in this case. Call the constructor and return the result:

   return new MailRecordReader(job, (FileSplit) input);

Your input format class should now look something like:
public class MailRecordInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());

        return new MailRecordReader(job, (FileSplit) input);
    }
}
Listing 4: MailRecordInputFormat.java
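The driver for this exercise is provided with the lab and is not shown here. Assuming it follows the same JobConf pattern as the InvertedIndex driver from the previous exercise (the driver class name MailReader below is an assumption), wiring in the custom format would look roughly like this:

// Inside the provided driver's run() method (a sketch, not the lab's actual code).
JobConf conf = new JobConf(getConf(), MailReader.class);

// Read input with the custom format instead of the default TextInputFormat.
conf.setInputFormat(MailRecordInputFormat.class);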
7. The last thing to do in this method is to set the pos counter to the value of the start counter, after the call to the initializeLineReader() method:

   pos = start;

Your constructor should now look something like:
public MailRecordReader(Configuration job, FileSplit split) throws IOException {
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
            Integer.MAX_VALUE);

    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());

    start = split.getStart();
    end = start + split.getLength();

    initializeLineReader(job, fileIn);

    pos = start;
}
Listing 5: MailRecordReader.java MailRecordReader()
8. Next, find the next() method and the INSERT CODE HERE comment in it. This comment marks the place where a new mail record should be read using the prepared LineReader object.
9. Before you do anything else, you'll want to set the values of the key and value arguments. Because Hadoop uses mutable data types passed as arguments to return values, it's possible that these arguments may contain some leftover data. Set the key to "unknown" and just clear the value:

   key.set("unknown");
   value.clear();
10. Because email messages are by definition of unpredictable length, there's no way to know a priori how many lines will need to be read to get a complete record. Instead, you'll just read until you find a line with only a "." on it. The LineReader.readLine() method reads in a line of data and returns the number of bytes read. This number is useful both to track your progress through the input split and to tell when you've hit the end of the file. If the number of bytes read is 0, then you've reached the end of the file.

    The parameters to the LineReader.readLine() method are a buffer Text object in which to store the read bytes, and two limits. The first limit is the maximum number of bytes to read without finding a newline character before giving up. The second is the maximum number of bytes to read in total. The reason for having two limits has to do with input splits and how they're handled. Remember that a MapReduce task will be assigned an input split that is local to it (or as close as possible), but that the HDFS interface works at the file level. If you read past the end of your input split, HDFS will just transfer whatever data you need from the remote node hosting it. While that works, it's something you want to avoid for performance reasons. The reason for the line length limit is to keep a bad data file from accidentally being read completely as one line, initiating large data transfers in the process. The reason for the overall read limit is to be able to limit the read to the boundaries of the input split.

    There's a catch, however. While you want to limit the read to the boundaries of the input split, you also want to read in complete records. To get a complete record, it's OK to read past the end of the input split. So, instead of setting the overall read limit to the number of bytes remaining in the input split, a convenience method, maxToConsume(), is provided that returns a limit based on the number of bytes remaining in the input split and the maximum expected line length.

    Let's start with a loop to read a line per iteration until the current position (pos) is greater than the end of the split (end). If the number of bytes read was 0, break out of the loop:

    while (pos < end) {
        int newSize = in.readLine(buffer, maxLineLength, maxToConsume());

        if (newSize == 0) {
            break; // EOF
        }
    }

11. Next, in the loop, increment the current position in the input split by the number of bytes read:

    pos += newSize;

12. Next, let's look at the data. If the contents of the buffer are "." then we've read a complete record. The EOM constant is provided to make the buffer comparison easier. In the loop, compare the buffer to EOM, and if they match, then set recordRead to true and break out of the loop:

    if (buffer.compareTo(EOM) == 0) {
        recordRead = true;

        break; // EOM
    }

13. Assuming the buffer isn't the EOM marker, then next in the loop you should append its contents to the value object. Because the LineReader.readLine() method strips off any trailing newline characters, you'll also want to add a newline character to the end. The NEWLINE constant is provided to make that append easier. Use the Text.append() method to do the append:

    value.append(buffer.getBytes(), 0, buffer.getLength());
    value.append(NEWLINE, 0, NEWLINE.length);

14. The data key for each mail record will come from a message-id header in the message. The last thing to do is test whether the buffer starts with "message-id: ", and if so, use the remainder of the line as the value of the key. Because the name of the header might be Message-ID or MESSAGE-ID or message-id or any other combination of cases, you'll first want to convert the buffer into a string and the string into lower (or upper) case before making the comparison. Use the Text.set() method to set the value of the key to the substring of the buffer after the header:

    if (buffer.toString().toLowerCase().startsWith("message-id: ")) {
        key.set(buffer.getBytes(), 12, buffer.getLength() - 12);
    }

Your next() method should now look something like:
public boolean next(Text key, Text value) throws IOException {
    boolean recordRead = false;
    Text buffer = new Text();

    key.set("unknown");
    value.clear();

    while (pos < end) {
        int newSize = in.readLine(buffer, maxLineLength, maxToConsume());

        if (newSize == 0) {
            break; // EOF
        }

        pos += newSize;

        if (buffer.compareTo(EOM) == 0) {
            recordRead = true;

            break; // EOM
        }

        value.append(buffer.getBytes(), 0, buffer.getLength());
        value.append(NEWLINE, 0, NEWLINE.length);

        if (buffer.toString().toLowerCase().startsWith("message-id: ")) {
            key.set(buffer.getBytes(), 12, buffer.getLength() - 12);
        }
    }

    return recordRead;
}
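For context, the RecordReader interface also requires a few housekeeping methods: createKey(), createValue(), getPos(), getProgress(), and close(). The lab's stub presumably supplies these already; the sketch below shows typical implementations (an assumption, not the lab's actual code), reusing the pos, start, end, and in fields seen above.

public Text createKey() {
    return new Text();
}

public Text createValue() {
    return new Text();
}

public long getPos() throws IOException {
    // Current byte offset within the file.
    return pos;
}

public float getProgress() throws IOException {
    // Fraction of the input split consumed so far.
    if (start == end) {
        return 0.0f;
    }

    return Math.min(1.0f, (pos - start) / (float) (end - start));
}

public void close() throws IOException {
    if (in != null) {
        in.close();
    }
}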
2. From a terminal window (you should have two open; pick one), run:

   % hadoop fs -rmr output
   % cd <lab_root>/exercises/MailReader
   % hadoop jar dist/MailReader.jar input/mail.txt output

   This command will read through the mail.txt file and calculate which users are spamming Twitter with the most Justin Bieber content. It will output the worst offender and the number of messages that person sent.
4. View your results by running:

   % hadoop fs -cat output/part-00000

   To stop viewing the output file, press 'q' or CTRL-C.
Before You Rerun
If you need to rerun your program for any reason, you'll first need to remove both output directories that were created in the previous run:

   % hadoop fs -rmr output_0
   % hadoop fs -rmr output

If you don't, Hadoop will refuse to run the job and print an error message saying the output directory already exists.
Summary
In this exercise you learned how to control the way that MapReduce reads in data files and how data is managed in a Hadoop cluster. You should now be ready to begin writing your own Hadoop applications. In the last exercise, we'll look at a higher-level abstraction on top of MapReduce called Hive.
Background Information
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in HDFS files. It provides a mechanism to apply structure to this data, along with a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query the data. At the same time, this language also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Hive Structure
Hive structures data into well-understood database concepts like tables, columns, rows, and partitions. It supports all the major primitive types (integers, floats, doubles, and strings) as well as complex types such as maps, lists, and structs. The latter can be nested arbitrarily to construct more complex types. In addition, Hive allows users to extend the system with their own types and functions. The query language is very similar to SQL and therefore can be easily understood by anyone familiar with SQL. There are some nuances in the data model, type system, and HiveQL that are different from traditional databases and that have been motivated by the experiences gained at Facebook. We will highlight these and other details in this exercise.
Steps To Follow
Step 1: Starting Hive
The first thing to do is to start the Hive shell:

1. If you don't already have a terminal window open, open up a terminal window by selecting Applications > Accessories > Terminal from the desktop menu or by right-clicking on the desktop background and selecting Open Terminal.
2. In the terminal window, run:
   % hive

   You should see something like:

   Hive history file=/tmp/s314413/hive_job_log_s314413_201009231410_542324277.txt
   hive>
   % hadoop fs -cat /user/hive/warehouse/bible_freq/bible_freq.txt | head

   Notice that the Hive data file has the same contents as the file you uploaded to the input directory. (Actually, it is the file you uploaded to the input directory. The LOAD command caused the file in HDFS to be moved into the /user/hive/warehouse directory.)
   Notice that this query resulted in a two-stage MapReduce job, making it a very slow query.

2. Next, try finding the most commonly occurring frequency among words in the Shakespeare corpus (i.e. the mode of the frequencies) using the COUNT() function and the GROUP BY clause:

   hive> select freq, count('*') as f2 from shake_freq group by freq order by f2 desc limit 1;

   (Not surprisingly, 1 is the most commonly occurring frequency.)
Step 5: Subqueries
To finish calculating the average frequencies, you're now going to calculate the median frequency, which is to say that if there are n rows in the table in sorted order, you're looking for the frequency in row n/2.

1. First, you'll have to calculate the median row number using the COUNT() function:

   hive> select (count('*') / 2) from shake_freq;

   Remember this value.

2. Next, to get the median frequency, you'll want to select the top n/2 rows and then select the bottom 1 row from that. You can do that using a query embedded in a query:

   hive> select freq from (select freq from shake_freq order by freq asc limit 15904) f1 order by freq desc limit 1;

   Note that subqueries can only be used in the FROM clause, which is why you had to calculate the median row to use as a limit in a separate query. Also note that the results of the subquery must be assigned a name. We used f1 in this case.
   where s.freq >= 1 and b.freq >= 1;

   This join excludes words that don't appear in both texts.

3. Now you can run queries against the merged table. First, check that the merged data looks reasonable by selecting the first 20 rows:

   hive> select * from merged limit 20;

4. Next, find the word that appears most frequently in the combined texts:

   hive> select word, sfreq, bfreq, (sfreq + bfreq) as tfreq from merged order by tfreq desc limit 1;

   (Not surprisingly, the most common word is 'the'.) You could have combined steps 2 and 4 into a single statement (select s.word, s.freq, b.freq, (s.freq + b.freq) as tfreq from shake_freq s join bible_freq b on (s.word = b.word) where s.freq >= 1 and b.freq >= 1 order by tfreq desc limit 1;) if there were no need to run any additional queries on the combined tables.

5. Quit Hive:

   hive> quit;
Summary
In this exercise, you've been introduced to Hive and how it can be used to perform ad hoc queries much more simply than by writing MapReduce programs directly. For straightforward data mining tasks, Hive is often the best tool for the job. For complex tasks, however, it is often necessary to fall back on the power and flexibility offered by a MapReduce program.
Summary
Congratulations! You have successfully completed LAB S314413: Extracting Real Value from Your Data With Apache Hadoop. In this lab you learned about writing MapReduce programs, how to customize the MapReduce process, and how to use the Hive data warehousing tool. For additional information about the technologies used in this lab, please see the following links:

   http://hadoop.apache.org/
   http://www.cloudera.com/hadoop-tutorial/

You can also contact the lab authors directly at:

   [email protected]
   [email protected]
   [email protected]
   [email protected]

Thank you for participating!