Demonstration: Understanding Pig: HDP Developer: Apache Pig and Hive
1.3. Notice that the output includes where the logging for your Pig session will go as well
as a statement about connecting to your Hadoop filesystem:
[main] INFO org.apache.pig.Main - Logging error messages to:
/root/devph/labs/demos/pig_1377892197767.log
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: hdfs://sandbox.hortonworks.com:8020
2.3. Use copyFromLocal to copy the pigdemo.txt file into the demos folder:
grunt> copyFromLocal /root/devph/labs/demos/pigdemo.txt demos/
grunt> pwd
hdfs://sandbox.hortonworks.com:8020/user/root/demos
3.2. Demonstrate the describe command, which describes what a relation looks like:
grunt> describe employees;
employees: {state: bytearray,name: bytearray}
NOTE: Fields have a data type, and we will discuss data types later in
this unit. Notice that the default data type of a field (if you do not
specify one) is bytearray.
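If a schema with explicit types is declared when the relation is loaded, describe reports those types instead of bytearray. The following is a hypothetical sketch (the LOAD statement actually used in this demonstration is not reproduced here):
grunt> employees = LOAD 'demos/pigdemo.txt' AS (state:chararray, name:chararray);
grunt> describe employees;
With this schema, describe would report both fields as chararray.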
Notice that dumping the employees relation requires a MapReduce job to execute, and the result is a collection of tuples:
(SD,Rich)
(NV,Barry)
(CO,George)
(CA,Ulf)
(IL,Danielle)
(OH,Tom)
(CA,manish)
(CA,Brian)
(CO,Mark)
4.2. The output is still tuples, but only the records that match the filter appear:
(CA,Ulf)
(CA,manish)
(CA,Brian)
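The filtered output above is produced by a FILTER statement defined in the previous step (not reproduced in this excerpt). A minimal sketch, using emp_ca as an example relation name:
grunt> emp_ca = FILTER employees BY state == 'CA';
grunt> DUMP emp_ca;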
5.3. All records with the same state will be grouped together, as shown by the output of
the emp_group relation:
grunt> DUMP emp_group;
The output is:
(CA,{(CA,Ulf),(CA,manish),(CA,Brian)})
(CO,{(CO,George),(CO,Mark)})
(IL,{(IL,Danielle)})
(NV,{(NV,Barry)})
(OH,{(OH,Tom)})
(SD,{(SD,Rich)})
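The emp_group relation itself comes from a GROUP statement in an earlier step (not reproduced in this excerpt); a minimal sketch:
grunt> emp_group = GROUP employees BY state;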
When the emp_group relation is stored with the default storage and the output file is viewed with cat, each record appears as the state field followed by a bag:
CA {(CA,Ulf),(CA,manish),(CA,Brian)}
CO {(CO,George),(CO,Mark)}
IL {(IL,Danielle)}
NV {(NV,Barry)}
OH {(OH,Tom)}
SD {(SD,Rich)}
Notice that the fields of each record (in this case, the state field followed by
a bag) are separated by a tab character, which is the default delimiter in Pig. Use
the PigStorage function to specify a different delimiter:
grunt> STORE emp_group INTO 'emp_group_csv' USING PigStorage(',');
grunt> ls
There will be a couple of additional numeric aliases created by the system for
internal use. Please ignore them.
Step 8: Monitor the Pig Jobs
8.1. Point your browser to the JobHistory UI at http://sandbox:19888/.
8.2. View the list of jobs, which should contain the MapReduce jobs that were executed
from your Pig Latin code in the Grunt shell.
8.3. Notice you can view the log files of the ApplicationMaster and also each map and
reduce task.
Successful outcome: You will have a couple of Pig programs that load the White House
visitors’ data, with and without a schema, and store the output of
a relation into a folder in HDFS.
Before you begin: Your HDP 2.2 cluster should be up and running within your VM.
1.2. Unzip the archive in the /root/devph/labs/Lab5.1 folder, which contains a file
named whitehouse_visits.txt that is quite large:
# unzip whitehouse_visits.zip
2.2. From the Grunt shell, make a new directory in HDFS named whitehouse:
grunt> mkdir whitehouse
2.3. Use the copyFromLocal command in the Grunt shell to copy the
whitehouse_visits.txt file to the whitehouse folder in HDFS, renaming the file
visits.txt. (Be sure to enter this command on a single line):
grunt> copyFromLocal /root/devph/labs/Lab5.1/whitehouse_visits.txt
whitehouse/visits.txt
2.4. Use the ls command to verify that the file was uploaded successfully:
grunt> ls whitehouse
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/visits.txt<r 3> 183292235
NOTE: TextLoader simply creates a tuple for each line of text, and it
uses a single chararray field that contains the entire line. It allows you
to load lines of text and not worry about the format or schema yet.
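The relation A referenced below is loaded with TextLoader in a step that is not reproduced in this excerpt. A minimal sketch, assuming the whitehouse folder created above:
grunt> A = LOAD '/user/root/whitehouse/' USING TextLoader();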
3.3. We want to get a sense of what this data looks like. Use the LIMIT operator to define
a new relation named A_limit that is limited to 10 records of A.
grunt> A_limit = LIMIT A 10;
3.4. Use the DUMP operator to view the A_limit relation. Each row of output is a single-field tuple containing one entire line of visits.txt, and you should see 10 arbitrary rows from the file:
grunt> DUMP A_limit;
5.4. Each record should contain seven fields. What happened to the rest of the fields
from the raw data that was loaded from whitehouse/visits.txt?
_________________________________________________________________
Answer: They were simply ignored when each record was read in from HDFS.
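The B relation referenced in the next step is loaded in Step 5 (not fully reproduced here) with an explicit seven-field schema. A minimal sketch, with field names inferred from the JSON output shown below:
grunt> B = LOAD 'whitehouse/visits.txt' USING PigStorage(',') AS (lname:chararray, fname:chararray, mname:chararray, id:chararray, status:chararray, state:chararray, arrival:chararray);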
Step 6: Use a Different Storer
6.1. In the previous step, you stored a relation using PigStorage with a tab delimiter.
Enter the following command, which stores the same relation but in a JSON format:
grunt> store B into 'whouse_json' using JsonStorage();
Notice that the schema you defined for the B relation was used to create the
format of each JSON entry:
{"lname":"MATTHEWMAN","fname":"ROBIN","mname":"H","id":"U81961","status":"735
74","state":"VA","arrival":"2/10/2011 11:14"}
{"lname":"MCALPINEDILEM","fname":"JENNIFER","mname":"J","id":"U81961","status
":"78586","state":"VA","arrival":"2/10/2011 10:49"}
Result: You have now seen how to execute some basic Pig commands, load data into a
relation, and store a relation into a folder in HDFS using different formats.
Successful outcome: You will have written several Pig scripts that analyze and query
the White House visitors’ data, including a list of people who
visited the President.
Before you begin: At a minimum, complete steps 1 and 2 of the Getting Started with
Pig lab.
2.6. Use DUMP on A_count to view the result. The output should look like:
grunt> DUMP A_count;
(rowcount,447598)
We can now conclude that there are 447,598 rows of text in visits.txt.
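The A_count relation is built in earlier steps of this lab (not reproduced in this excerpt) by grouping the entire relation and counting its records. A minimal sketch, assuming A was loaded with TextLoader as in the previous lab:
grunt> A_group = GROUP A ALL;
grunt> A_count = FOREACH A_group GENERATE 'rowcount', COUNT(A);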
Step 3: Analyze the Data’s Contents
3.1. We now know how many records are in the data, but we still do not have a clear
picture of what the records look like. Let’s start by looking at the fields of each record.
Load the data using PigStorage(',') instead of TextLoader():
grunt> visits = LOAD '/user/root/whitehouse/' USING PigStorage(',');
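The firstten relation used in the next step is a projection of the first 10 fields of visits, defined in a step not reproduced in this excerpt. A minimal sketch:
grunt> firstten = FOREACH visits GENERATE $0..$9;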
3.3. Use LIMIT to display only 50 records, then DUMP the result. The output should be 50
tuples, each containing the first 10 fields of visits:
grunt> firstten_limit = LIMIT firstten 50;
grunt> DUMP firstten_limit;
(PARK,ANNE,C,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,RYAN,C,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,MAGGIE,E,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,SIDNEY,R,U51510,0,VA,10/24/2010 14:53,B0402,,)
(RYAN,MARGUERITE,,U82926,0,VA,2/13/2011 17:14,B0402,,)
(WILE,DAVID,J,U44328,,VA,,,,)
(YANG,EILENE,D,U82921,,VA,,,,)
(ADAMS,SCHUYLER,N,U51772,,VA,,,,)
(ADAMS,CHRISTINE,M,U51772,,VA,,,,)
(BERRY,STACEY,,U49494,79029,VA,10/15/2010 12:24,D0101,10/15/2010 14:06,D1S)
Notice from the output that the first three fields are the person’s name. The next
seven fields are a unique ID, badge number, access type, time of arrival, post of
arrival, time of departure, and post of departure.
Step 4: Locate the POTUS (President of the United States of America)
4.1. There are 26 fields in each record, and one of them represents the visitee (the person
being visited in the White House). Your goal now is to locate this column and determine
who has visited the President of the United States. Define a relation that is a projection
of the last seven fields ($19 to $25) of visits. Use LIMIT to only output 500 records. The
output should look like:
grunt> lastfields = FOREACH visits GENERATE $19..$25;
grunt> lastfields_limit = LIMIT lastfields 500;
grunt> DUMP lastfields_limit;
It is not necessarily obvious from the output, but field $19 in the visits relation
represents the visitee. Even though you selected 500 records in the previous step,
you may or may not see POTUS in the output above. (The White House has
thousands of visitors each day, but only a few meet the President.)
4.2. Use FILTER to define a relation that only contains records of visits where field $19
matches POTUS. Limit the output to 500 records. The output should include only visitors
who met with the President. For example:
grunt> potus = FILTER visits BY $19 MATCHES 'POTUS';
grunt> potus_limit = LIMIT potus 500;
grunt> DUMP potus_limit;
6.3. Define a projection of the potus relation that contains the name, time of arrival,
and visitee for each visitor:
grunt> potus_details = FOREACH potus GENERATE
(chararray) $0 AS lname:chararray,
(chararray) $1 AS fname:chararray,
(chararray) $6 AS arrival_time:chararray,
(chararray) $19 AS visitee:chararray;
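The potus_details_ordered relation referenced in the next step is defined in step 6.4 (not reproduced in this excerpt). A minimal sketch, assuming the ordering is by last name, which matches the sample output shown later:
grunt> potus_details_ordered = ORDER potus_details BY lname;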
6.5. Store the records of potus_details_ordered into a folder named potus using
a comma delimiter:
grunt> STORE potus_details_ordered INTO 'potus' USING PigStorage(',');
6.7. Notice that there is a single output file, so the Pig job was executed with one
reducer. View the contents of the output file using cat:
grunt> cat potus/part-r-00000
The output should be in a comma-delimited format and should contain the last
name, first name, time of arrival (if available), and the string POTUS:
CLINTON,WILLIAM,,POTUS
CLINTON,HILLARY,,POTUS
CLINTON,HILLARY,,POTUS
CLINTON,HILLARY,,POTUS
CLONAN,JEANETTE,,POTUS
CLOOBECK,STEPHEN,,POTUS
CLOOBECK,CHANTAL,,POTUS
CLOOBECK,STEPHEN,,POTUS
CLOONEY,GEORGE,10/12/2010 14:47,POTUS
7.2. Click on the job’s ID to view the details of the job and its log files.
Result: You have written several Pig scripts to analyze and query the data in the White
House visitors’ log. You should now be comfortable with writing Pig scripts with the Grunt
shell and using common Pig commands like LOAD, GROUP, FOREACH, FILTER, LIMIT,
DUMP, and STORE.
Objective: Research the White House visitor data and look for members of
Congress.
Before you begin: You should have the White House visitor data in HDFS in
/user/root/whitehouse/visits.txt.
grunt> cd whitehouse
grunt> visits = LOAD 'visits.txt' USING PigStorage(',');
1.2. Field $25 is the comments. Filter out all records where field $25 is null:
grunt> not_null_25 = FILTER visits BY ($25 IS NOT NULL);
1.3. Now define a new relation that is a projection of only column $25:
grunt> comments = FOREACH not_null_25 GENERATE $25 AS comment;
1.4. View the schema of comments and make sure you understand how this relation
ended up as a tuple with one field:
grunt> describe comments;
comments: {comment: bytearray}
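The comments_sample relation dumped in the next step is defined in a step not reproduced in this excerpt, typically with the SAMPLE operator. A minimal sketch; the sampling fraction shown is illustrative only:
grunt> comments_sample = SAMPLE comments 0.001;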
2.2. Now DUMP the comments_sample relation. The output should be non-null comments
about visitors to the White House, similar to:
grunt> DUMP comments_sample;
NOTE: Our end goal is to find visitors to the White House who are also
members of Congress. We could run our MapReduce job on the
entire visits.txt dataset, but it is common in Hadoop to split data into
smaller input files for specific tasks, which can greatly improve the
performance of your MapReduce applications. In this step, you will
split visits.txt into two separate datasets.
4.1. In this step, you will split visits.txt into two datasets: those that contain “CONGRESS”
in the comments field, and those that do not.
4.2. Use the SPLIT command to split the visits relation into two new relations named
congress and not_congress:
grunt> SPLIT visits INTO congress IF($25 MATCHES
'.* CONGRESS .*'), not_congress IF (NOT($25 MATCHES
'.* CONGRESS .*'));
4.3. Store the congress relation into a folder named 'congress':
grunt> STORE congress INTO 'congress';
4.5. View the output folders using ls. The file sizes should be equivalent to the following:
grunt> ls congress
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/_SUCCESS<r 3> 0
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/part-m-00000<r 3> 45618
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/part-m-00001<r 3> 0
grunt> ls not_congress
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/_SUCCESS<r 3> 0
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/part-m-00000<r 3> 90741587
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/part-m-00001<r 3> 272381
4.6. View one of the output files in congress and make sure the string “CONGRESS”
appears in the comment field:
grunt> cat congress/part-m-00000
NOTE: You now have two datasets: one in ‘congress,’ with 102
records, and the remaining records in the ‘not_congress’ folder.
These records are still in their original, raw format.
Solution:
grunt> congress_grp = GROUP congress ALL;
grunt> congress_count = FOREACH congress_grp GENERATE COUNT(congress);
grunt> DUMP congress_count;
Result: You have just split ‘visits.txt’ into two datasets, and you have also discovered that
102 visitors to the White House had the word “CONGRESS” in their comments field. We will
further explore these visitors in the next lab as we perform a join with a dataset containing
the names of members of Congress.
Successful outcome: A file of members of Congress who have visited the White House.
Before you begin: If you are in the Grunt shell, exit it using the quit command. In
this lab, you will write a Pig script in a text file.
1.2. Use the hadoop fs -ls command to verify that the congress.txt file is in
whitehouse, and use hadoop fs -cat to view its contents. The file contains the names
of the members of the U.S. Congress, along with other information about them.
# hadoop fs -ls whitehouse
2.3. At the top of the file, add a comment:
--join.pig: joins congress.txt and visits.txt
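The intervening steps (not reproduced in this excerpt) add LOAD statements to join.pig that define the visitors and congress relations used below. The following is a purely illustrative sketch: the field names come from the later steps, but the field order and the delimiter of congress.txt are assumptions and may differ from your lab files:
visitors = LOAD 'whitehouse/visits.txt' USING PigStorage(',') AS (lname:chararray, fname:chararray);
congress = LOAD 'whitehouse/congress.txt' USING PigStorage() AS (fname:chararray, lname:chararray, party:chararray, district:chararray);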
4.2. The names in visits.txt are all uppercase, but the names in congress.txt are not.
Define a projection of the congress relation that consists of the following fields:
congress_data = FOREACH congress GENERATE
district,
UPPER(lname) AS lname,
UPPER(fname) AS fname,
party;
5.2. Use the STORE command to store the result of join_contact_congress into a
directory named ‘joinresult’.
Solution:
join_contact_congress = JOIN visitors BY (lname,fname),
congress_data BY (lname,fname);
STORE join_contact_congress INTO 'joinresult';
6.4. Wait for the MapReduce job to execute. When it is finished, write down the number
of seconds it took for the job to complete (by subtracting the StartedAt time from the
FinishedAt time) and write down the result: ___________________
6.5. The type of join used is also output in the job statistics. Notice the statistics output
has “HASH_JOIN” underneath the “Features” column, which means a hash join was
used to join the two datasets.
Step 7: View the Results
7.1. The output will be in the joinresult folder in HDFS. Verify that the folder was
created:
# hadoop fs -ls -R joinresult
-rw-r--r-- 3 root root 0 joinresult/_SUCCESS
-rw-r--r-- 3 root root 40892 joinresult/part-r-00000
8.2. Modify your JOIN statement in join.pig so that it uses replication. It should look
like this:
join_contact_congress = JOIN visitors BY (lname,fname),
congress_data BY (lname,fname) USING 'replicated';
8.3. Save your changes to join.pig and run the script again.
# pig join.pig
8.4. Notice this time that the statistics output shows Pig used a “REPLICATED_JOIN”
instead of a “HASH_JOIN”.
8.5. Compare the execution time of the REPLICATED_JOIN vs. the HASH_JOIN. Did you
have any improvement or decrease in performance?
NOTE: Using a replicated join does not necessarily improve the join time.
Too many factors are involved, and this example uses small datasets. The
point is that you should try both techniques (if one dataset is small
enough to fit in memory) and determine which join algorithm is faster for
your particular dataset and use case.
You have already saved the output of the JOIN, so there is no need to perform the
STORE command again.
9.2. Notice in the output of your join.pig script that we know which party the visitor
belongs to: Democrat, Republican, or Independent. Using the join_contact_congress
relation as a starting point, see if you can figure out how to output the number of
Democrat, Republican, and Independent members of Congress that visited the White
House. Name the relation counters and use the DUMP command to output the results:
A possible solution (fields of a joined relation are referenced with the prefix of the relation they came from; the alias party_grp is just an example name):
party_grp = GROUP join_contact_congress BY congress_data::party;
counters = FOREACH party_grp GENERATE group, COUNT(join_contact_congress);
DUMP counters;
Successful outcome: The resulting Pig script stores a projection of visits.txt in a folder
in the Hive warehouse named wh_visits.
Before you begin: You should have visits.txt in a folder named whitehouse in HDFS.
1.3. Notice that all White House visitors who met with the President are in the potus
relation.
1.4. Notice that the project_potus relation is a projection of the last name, first name,
time of arrival, location, and comments from the visit.
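The wh_visits.pig script is provided in your lab files and is not reproduced in this excerpt. A minimal sketch of the two relations described above; the positions for last name, first name, arrival time, visitee, and comments follow the earlier labs, while the position used for the location field is a hypothetical placeholder:
visits = LOAD '/user/root/whitehouse/' USING PigStorage(',');
potus = FILTER visits BY $19 MATCHES 'POTUS';
-- $16 below is a hypothetical position for the location field; check the actual script
project_potus = FOREACH potus GENERATE $0 AS lname, $1 AS fname, $6 AS arrival_time, $16 AS location, $25 AS comment;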
Step 2: Store the Projection in the Hive Warehouse
2.1. Open wh_visits.pig with the gedit text editor.
2.2. Add the following command at the bottom of the file, which stores the
project_potus relation into a very specific folder in the Hive warehouse:
STORE project_potus INTO '/apps/hive/warehouse/wh_visits/';
4.1. The wh_visits.pig script creates a directory in the Hive warehouse named
wh_visits. Use ls to view its contents:
# hadoop fs -ls /apps/hive/warehouse/wh_visits/
-rw-r--r-- 3 root hdfs 0 /apps/hive/warehouse/wh_visits/_SUCCESS
-rw-r--r-- 3 root hdfs 971339 /apps/hive/warehouse/wh_visits/part-m-00000
-rw-r--r-- 3 root hdfs 142850 /apps/hive/warehouse/wh_visits/part-m-00001
4.2. View the contents of one of the result files. It should look like the following:
hadoop fs -cat /apps/hive/warehouse/wh_visits/part-m-00000
...
FRIEDMAN THOMAS 10/12/2010 12:08 WH PRIVATE LUNCH
BASS EDWIN 10/18/2010 15:01 WH
BLAKE CHARLES 10/18/2010 15:00 WH
OGLETREE CHARLES 10/18/2010 15:01 WH
RIVERS EUGENE 10/18/2010 15:01 WH
Result: You now have a folder in the Hive warehouse named wh_visits that contains a
projection of the data in visits.txt. We will use this file in an upcoming Hive lab.