Extreme Computing Lab Exercises
Session one
Miles Osborne (original: Sasa Petrovic)
1 Getting started
First you need to access the machine where you will be doing all the work. Do this by typing:
ssh namenode
If this machine is busy, try another one (for example bw1425n13, bw1425n14, or any other node
up to bw1425n24).
Hadoop is installed on DICE under /opt/hadoop/hadoop-0.20.2 (this will also be referred to as the
Hadoop home directory). The hadoop command can be found in the /opt/hadoop/hadoop-0.20.2/bin
directory. To make sure you can run Hadoop commands without having to be in that directory,
use your favorite editor (e.g. emacs) to edit the ~/.benv file and add the following line:
PATH=$PATH:/opt/hadoop/hadoop-0.20.2/bin/
Don't forget to save the changes! After that, run the following command:
source ~/.benv
to make sure the new changes take effect right away. After you do this, check that the hadoop
command can now be found from any directory.
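One quick way to check (the version subcommand simply prints the Hadoop version and exits):
hadoop version
If the PATH was set correctly, this should work no matter which directory you run it from.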
2 HDFS
Hadoop consists of two major parts: a distributed filesystem (HDFS, short for Hadoop Distributed
File System) and the mechanism for running jobs. Files on HDFS are distributed across the network
and replicated on different nodes for reliability and speed. When a job requires a particular file, it is
fetched from the machine/rack nearest to the machines that are actually executing the job.
You will now proceed to run some simple HDFS commands. REMEMBER: all the commands
mentioned here refer to HDFS shell commands, NOT to UNIX commands which may have the same
name. You should keep all your source and data within a directory named /user/<your_matriculation_number>,
which should already have been created for you.
You can list the contents of this directory by typing:
hadoop dfs -ls /user/sXXXXXXX
For example, if your matriculation number is s0123456, you should see something like:
Found 2 items
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:55 /user/s0123456/data
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:54 /user/s0123456/source
• copyFromLocal copies a single src, or multiple srcs, from the local file system to the destination
filesystem. The source has to be a local file reference.
Example:
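For instance, to copy a local file into your HDFS data directory (the file name example.txt is only illustrative, and sXXXXXXX stands for your matriculation number):
hadoop dfs -copyFromLocal example.txt /user/sXXXXXXX/data/example.txt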
• copyToLocal copies files to the local file system. Files that fail the CRC check may be copied
with the -ignorecrc option. Files and CRCs may be copied using the -crc option. The destination
must be a local file reference.
Example:
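For instance, to copy that file from HDFS back into the current local directory (again, the names are only illustrative):
hadoop dfs -copyToLocal /user/sXXXXXXX/data/example.txt example.txt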
• cp copies files from source to destination. This command allows multiple sources as well, in
which case the destination must be a directory. Similar to the UNIX cp command.
Example:
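For instance, to copy a file from your HDFS data directory to your HDFS source directory (paths are only illustrative):
hadoop dfs -cp /user/sXXXXXXX/data/example.txt /user/sXXXXXXX/source/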
3 Running jobs
3.1 Computing π
NOTE: in this example we use the hadoop-0.20.2-examples.jar file. This file can be found in
/opt/hadoop/hadoop-0.20.2/, so make sure to use the full path to that file if you are not running
the example from the /opt/hadoop/hadoop-0.20.2/ directory.
This example estimates the mathematical constant π to within some error. The error depends on the
number of samples used (more samples = more accurate estimate). Run the example as follows:
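For example (the two arguments are the number of map tasks and the number of samples per map; the values 10 and 1000 are just a starting point to experiment with):
hadoop jar /opt/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar pi 10 1000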
Do the results match your expectations? How many samples are needed to approximate the third
digit after the decimal point correctly?
Hadoop comes with a number of demo applications, and here we will look at the canonical task of
word counting.
Task B We will count the number of times each word appears in a document. For this purpose, we will
use the /user/sasa/data/example3 file, so first copy that file to your input directory. Second, make
sure you delete your output directory before running the job, or the job will fail. We run the wordcount
example by typing:
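For example, assuming you copied example3 into /user/sXXXXXXX/data and want the output to go to /user/sXXXXXXX/output (remember that the output directory must not exist yet):
hadoop jar /opt/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount /user/sXXXXXXX/data/example3 /user/sXXXXXXX/output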
Hadoop streaming is a utility that allows you to create and run map/reduce jobs with any executable
or script as the mapper and/or the reducer. The way it works is very simple: input is converted into
lines which are fed to the stdin of the mapper process. The mapper processes this data and writes to
stdout. Lines from the stdout of the mapper process are converted into key/value pairs by splitting
them on the first tab character. The key/value pairs are fed to the stdin of the reducer process, which
collects and processes them. Finally, the reducer writes to stdout, which is the final output of the
program. Everything will become much clearer through examples later.
It is important to note that with Hadoop streaming mappers and reducers can be any programs
that read from stdin and write to stdout, so the choice of the programming language is left to the
programmer. Here, we will use Python.
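For concreteness, here is a minimal sketch of what a streaming word-count mapper and reducer could look like in Python. These are illustrative scripts, not the exact ones used in the lab; make them executable (chmod +x) before submitting them.

#!/usr/bin/python
# mapper.py (sketch): emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/python
# reducer.py (sketch): sum the counts for each word; Hadoop delivers the
# mapper output sorted by key, so identical words arrive one after another
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))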
How to actually run the job
Suppose you have your mapper, mapper.py, and your reducer, reducer.py, and the input is in
/user/hadoop/input/. How do you run your streaming job? It's similar to running the DFS examples
from the previous section, with minor differences:
hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input /user/hadoop/input \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py
Note the differences: we always have to specify contrib/streaming/hadoop-0.20.2-streaming.jar
as the jar to run (this file can be found in the /opt/hadoop/hadoop-0.20.2/ directory, so you either
have to be in this directory when running the job or specify the full path to the file), and the particular
mapper and reducer we use are specified through the -mapper and -reducer options.
In case the mapper and/or reducer are not already present on the remote machines (which
will often be the case), we also have to package the actual files in the job submission. Assuming
that neither mapper.py nor reducer.py were present on the machines in the cluster, the previous job
would be run as shown below.
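A plausible form of that command, using streaming's -file option to ship each script with the job submission (assuming mapper.py and reducer.py are in the current local directory):

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input /user/hadoop/input \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py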
Task C What happens when, instead of using mapper.py, you use /bin/cat as a mapper? What happens
when you use /bin/cat as a mapper AND as a reducer?
Setting job configuration
Various job options can be specified on the command line; we will cover the most commonly used ones
in this section. The general syntax for specifying additional configuration variables is
-jobconf <name>=<value>
To avoid having your job named something like streamjob5025479419610622742.jar, you can
specify an alternative name through the mapred.job.name variable. For example,
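In a full streaming command this option simply goes alongside the others, for example (paths and file names as in the earlier streaming example):

hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input /user/hadoop/input \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-jobconf mapred.job.name="My job"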
Task C Run the random-mapper.py example again, this time naming your job Random job <matriculation_number>,
where <matriculation_number> is your matriculation number. After you run the
job (and preferably before it finishes), open the browser and go to http://hcrc1425n01.inf.ed.ac.
uk:50030/. In the list of running jobs, look for the job with the name you gave it and click on it.
You can see various statistics about your job; try to find the number of reducers used. How many
reducers did you use? If your job finished before you had a chance to open the browser, it will be in
the list of finished jobs, not the list of running jobs, but you can still see all the same information by
clicking on it.
3.4 Secondary sorting
As was mentioned earlier, the key/value pairs are obtained by splitting the mapper output on the
first tab character in the line. This can be changed using the stream.map.output.field.separator and
stream.num.map.output.key.fields variables. For example, if I want the key to be everything up
to the second - character in the line, I would add the following:
-jobconf stream.map.output.field.separator=- \
-jobconf stream.num.map.output.key.fields=2
Partitioning can be controlled in a similar way, using the org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
class (used in the example below). Suppose, for example, that your input consists of IP addresses:
192.168.2.1
190.191.34.38
161.53.72.111
192.168.1.1
161.53.72.23
You want to partition the data so that addresses with the same first 16 bits are processed by the same
reducer. However, you also want each reducer to see the data sorted according to the first 24 bits of
the address. Using the mentioned partitioner class, you can tell Hadoop how to group the data to be
processed by the reducers. You do this using the following options:
-jobconf map.output.key.field.separator=.
-jobconf num.key.fields.for.partition=2
The first option tells Hadoop what character to use as a separator (just like in the previous example),
and the second one tells it how many fields from the key to use for partitioning. Knowing this, here
is how we would solve the IP address example (assuming that the addresses are in /user/hadoop/in-
put):
hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
-input /user/hadoop/input \
-output /user/hadoop/output \
-mapper cat \
-reducer cat \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf stream.map.output.field.separator=. \
-jobconf stream.num.map.output.key.fields=3 \
-jobconf map.output.key.field.separator=. \
-jobconf num.key.fields.for.partition=2
The line with -jobconf num.key.fields.for.partition=2 tells Hadoop to partition IPs based
on the first 16 bits (the first two numbers), and -jobconf stream.num.map.output.key.fields=3 tells
it to sort the IPs according to everything before the third separator (the dot in this case); this
corresponds to the first 24 bits of the address.
Task B Copy the file /user/sasa/data/secondary to your input directory. Lines in this file have the
following format:
LastName.FirstName.Address.PhoneNo
1. Partition the data so that all people with the same last name go to the same reducer.
2. Partition the data so that all people with the same last name go to the same reducer, and also
make sure that the lines are sorted according to first name.
3. Partition the data so that all the people with the same first and last name go to the same
reducer, and that they are sorted according to address.