
Course Material of: Dr A R Singla

HDP Developer: Apache Pig and Hive

Demonstration: Understanding Pig


This lab explores Pig scripts and relations.

Table 8. About this Lab

Objective: To understand Pig scripts and relations.

During this Demonstration: Watch as your instructor performs the following steps.

Related lesson: Introduction to Pig

Perform the following steps:


Step 1: Start the Grunt Shell
1.1. Review the contents of the file pigdemo.txt located in /root/devph/labs/demos.
# more /root/devph/labs/demos/pigdemo.txt

1.2. Start the Grunt shell:


# pig

1.3. Notice that the output includes where the logging for your Pig session will go as well
as a statement about connecting to your Hadoop filesystem:
[main] INFO org.apache.pig.Main - Logging error messages to:
/root/devph/labs/demos/pig_1377892197767.log
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: hdfs://sandbox.hortonworks.com:8020

Step 2: Make a New Directory


2.1. Notice you can run HDFS commands easily from the Grunt shell. For example, run
the ls command:
grunt> ls

2.2. Make a new directory named demos:


grunt> mkdir demos

2.3. Use copyFromLocal to copy the pigdemo.txt file into the demos folder:
grunt> copyFromLocal /root/devph/labs/demos/pigdemo.txt demos/

2.4. Verify the file was uploaded successfully:


grunt> ls demos
hdfs://sandbox.hortonworks.com:8020/user/root/demos/pigdemo.txt<r 3> 89

2.5. Change the present working directory to demos:


grunt> cd demos


grunt> pwd
hdfs://sandbox.hortonworks.com:8020/user/root/demos

2.6. View the contents using the cat command:


grunt> cat pigdemo.txt
SD Rich
NV Barry
CO George
CA Ulf
IL Danielle
OH Tom
CA manish
CA Brian
CO Mark

Step 3: Define a Relation


3.1. Define the employees relation, using a schema:
grunt> employees = LOAD 'pigdemo.txt' AS (state, name);

3.2. Demonstrate the describe command, which describes what a relation looks like:
grunt> describe employees;
employees: {state: bytearray,name: bytearray}

NOTE: Fields have a data type, and we will discuss data types later in
this unit. Notice that the default data type of a field (if you do not
specify one) is bytearray.
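
As an optional aside (a sketch, not part of the original demonstration; the relation name employees_typed is illustrative), you could declare explicit field types in the schema instead of relying on the bytearray default:

grunt> employees_typed = LOAD 'pigdemo.txt' AS (state:chararray, name:chararray);
grunt> describe employees_typed;
employees_typed: {state: chararray,name: chararray}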

3.3. Let’s view the records in the employees relation:


grunt> DUMP employees;

Notice this requires a MapReduce job to execute, and the result is a collection of
tuples:
(SD,Rich)
(NV,Barry)
(CO,George)
(CA,Ulf)
(IL,Danielle)
(OH,Tom)
(CA,manish)
(CA,Brian)
(CO,Mark)

Step 4: Filter the Relation by a Field


4.1. Let’s filter the employees whose state field equals CA:
grunt> ca_only = FILTER employees BY (state=='CA');


grunt> DUMP ca_only;

4.2. The output is still tuples, but only the records that match the filter appear:
(CA,Ulf)
(CA,manish)
(CA,Brian)

Step 5: Create a Group


5.1. Define a relation that groups the employees by the state field:
grunt> emp_group = GROUP employees BY state;

5.2. Bags represent groups in Pig. A bag is an unordered collection of tuples:


grunt> describe emp_group;
emp_group: {group: bytearray,employees: {(state: bytearray,name: bytearray)}}

5.3. All records with the same state will be grouped together, as shown by the output of
the emp_group relation:
grunt> DUMP emp_group;
The output is:
(CA,{(CA,Ulf),(CA,manish),(CA,Brian)})
(CO,{(CO,George),(CO,Mark)})
(IL,{(IL,Danielle)})
(NV,{(NV,Barry)})
(OH,{(OH,Tom)})
(SD,{(SD,Rich)})

NOTE: Tuples are displayed in parentheses. Curly braces represent bags.
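
As an optional aside (a sketch, not part of the original demonstration), a FOREACH over the grouped relation can aggregate each bag. For example, counting the employees per state should produce output similar to the following:

grunt> state_counts = FOREACH emp_group GENERATE group, COUNT(employees);
grunt> DUMP state_counts;
(CA,3)
(CO,2)
(IL,1)
(NV,1)
(OH,1)
(SD,1)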

Step 6: The STORE Command


6.1. The DUMP command dumps the contents of a relation to the console. The STORE
command sends the output to a folder in HDFS. For example:
grunt> STORE emp_group INTO 'emp_group';
Notice at the end of the MapReduce job that no records are output to the console.
6.2. Verify that a new folder is created:
grunt> ls
hdfs://sandbox.hortonworks.com:8020/user/root/demos/emp_group <dir>
hdfs://sandbox.hortonworks.com:8020/user/root/demos/pigdemo.txt<r 3> 89

6.3. View the contents of the output file:


grunt> cat emp_group/part-r-00000
CA {(CA,Ulf),(CA,manish),(CA,Brian)}
CO {(CO,George),(CO,Mark)}
IL {(IL,Danielle)}
NV {(NV,Barry)}
OH {(OH,Tom)}
SD {(SD,Rich)}
Notice that the fields of the records (which in this case is the state field followed by
a bag) are separated by a tab character, which is the default delimiter in Pig. Use
the PigStorage object to specify a different delimiter:
grunt> STORE emp_group INTO 'emp_group_csv' USING PigStorage(',');

To view the results:

grunt> ls

grunt> cat emp_group_csv/part-r-00000

Step 7: Show All Aliases


7.1. The aliases command shows a list of currently defined aliases:
grunt> aliases;
aliases: [ca_only, emp_group, employees]

There will be a couple of additional numeric aliases created by the system for
internal use. Please ignore them.
Step 8: Monitor the Pig Jobs
8.1. Point your browser to the JobHistory UI at http://sandbox:19888/.
8.2. View the list of jobs, which should contain the MapReduce jobs that were executed
from your Pig Latin code in the Grunt shell.
8.3. Notice you can view the log files of the ApplicationMaster and also each map and
reduce task.

NOTE: Three commands trigger a logical plan to be converted to a
physical plan and executed as a MapReduce job: STORE, DUMP, and
ILLUSTRATE.
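
As a hedged example (not part of the original demonstration), ILLUSTRATE samples a few records and shows how they pass through each statement, which is useful for checking a script's logic before running it on a full dataset:

grunt> ILLUSTRATE ca_only;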


Lab: Getting Started with Pig


This lab explores using Pig to navigate through HDFS and explore a dataset.

Table 9. About this Lab

Objective: Use Pig to navigate through HDFS and explore a dataset.

File locations: /root/devph/labs/Lab5.1

Successful outcome: You will have a couple of Pig programs that load the White House
visitors’ data, with and without a schema, and store the output of
a relation into a folder in HDFS.

Before you begin: Your HDP 2.2 cluster should be up and running within your VM.

Related lesson: Introduction to Pig

Perform the following steps:


Step 1: View the Raw Data
1.1. Change directories to the /root/devph/labs/Lab5.1 folder:
# cd ~/devph/labs/Lab5.1

1.2. Unzip the archive in the /root/devph/labs/Lab5.1 folder, which contains a file
named whitehouse_visits.txt that is quite large:
# unzip whitehouse_visits.zip

1.3. View the contents of this file:


# tail whitehouse_visits.txt
This publicly available data contains records of visitors to the White House in
Washington, D.C.
Step 2: Load the Data into HDFS
2.1. Start the Grunt shell:
# pig

2.2. From the Grunt shell, make a new directory in HDFS named whitehouse:
grunt> mkdir whitehouse

2.3. Use the copyFromLocal command in the Grunt shell to copy the
whitehouse_visits.txt file to the whitehouse folder in HDFS, renaming the file
visits.txt. (Be sure to enter this command on a single line):
grunt> copyFromLocal /root/devph/labs/Lab5.1/whitehouse_visits.txt
whitehouse/visits.txt

2.4. Use the ls command to verify that the file was uploaded successfully:
grunt> ls whitehouse
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/visits.txt<r 3>
183292235

Step 3: Define a Relation


3.1. You will use the TextLoader to load the visits.txt file.

NOTE: TextLoader simply creates a tuple for each line of text, and it
uses a single chararray field that contains the entire line. It allows you
to load lines of text and not worry about the format or schema yet.

Define the following LOAD relation:


grunt> A = LOAD '/user/root/whitehouse/' USING TextLoader();

3.2. Use DESCRIBE to notice that A does not have a schema:


grunt> DESCRIBE A;
Schema for A unknown.
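
As an optional variation (a sketch, not part of the original lab; the relation name A2 is illustrative), you could give the single TextLoader field a name and type, in which case the relation would have a one-field schema:

grunt> A2 = LOAD '/user/root/whitehouse/' USING TextLoader() AS (line:chararray);
grunt> DESCRIBE A2;
A2: {line: chararray}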

3.3. We want to get a sense of what this data looks like. Use the LIMIT operator to define
a new relation named A_limit that is limited to 10 records of A:
grunt> A_limit = LIMIT A 10;

3.4. Use the DUMP operator to view the A_limit relation. Each row in the output will look
similar to the following and should be 10 arbitrary rows from visits.txt:
grunt> DUMP A_limit;

(WHITLEY,KRISTY,J,U45880,,VA,,,,,10/7/2010 5:51,10/9/2010 10:30,10/9/2010
23:59,,294,B3,WIN,10/7/2010
5:51,B3,OFFICE,VISITORS,WH,RES,OFFICE,VISITORS,GROUP TOUR
,1/28/2011,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,)

Step 4: Define a Schema


4.1. Load the White House data again, but this time use the PigStorage loader and also
define a partial schema:
grunt> B = LOAD '/user/root/whitehouse/visits.txt' USING PigStorage(',') AS (
lname:chararray,
fname:chararray,
mname:chararray,
id:chararray,
status:chararray,
state:chararray,
arrival:chararray
);

4.2. Use the DESCRIBE command to view the schema:


grunt> describe B;
B: {lname: chararray,fname: chararray,mname: chararray,id: chararray,status:
chararray,state: chararray,arrival: chararray}

Step 5: The STORE Command


5.1. Enter the following STORE command, which stores the B relation into a folder named
whouse_tab and separates the fields of each record with tabs:
grunt> store B into 'whouse_tab' using PigStorage('\t');

5.2. Verify that the whouse_tab folder was created:


grunt> ls whouse_tab;

You should see two map output files.


5.3. View one of the output files to verify they contain the B relation in a tab-delimited
format:
grunt> fs -tail whouse_tab/part-m-00000;

5.4. Each record should contain seven fields. What happened to the rest of the fields
from the raw data that was loaded from whitehouse/visits.txt?
_________________________________________________________________
Answer: They were simply ignored when each record was read in from HDFS.
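As a quick sanity check (a sketch, not part of the original lab; the relation names are illustrative), you can load one stored record back and count its fields, which should confirm that only the seven schema fields were written:

grunt> stored = LOAD 'whouse_tab' USING PigStorage('\t');
grunt> stored_one = LIMIT stored 1;
grunt> field_count = FOREACH stored_one GENERATE SIZE(TOTUPLE(*));
grunt> DUMP field_count;
(7)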
Step 6: Use a Different Storer
6.1. In the previous step, you stored a relation using PigStorage with a tab delimiter.
Enter the following command, which stores the same relation but in a JSON format:
grunt> store B into 'whouse_json' using JsonStorage();

6.2. Verify that the whouse_json folder was created:


grunt> ls whouse_json;

6.3. View one of the output files:


grunt> fs -tail whouse_json/part-m-00000;

Notice that the schema you defined for the B relation was used to create the
format of each JSON entry:
{"lname":"MATTHEWMAN","fname":"ROBIN","mname":"H","id":"U81961","status":"735
74","state":"VA","arrival":"2/10/2011 11:14"}
{"lname":"MCALPINEDILEM","fname":"JENNIFER","mname":"J","id":"U81961","status
":"78586","state":"VA","arrival":"2/10/2011 10:49"}

Result: You have now seen how to execute some basic Pig commands, load data into a
relation, and store a relation into a folder in HDFS using different formats.


Lab: Exploring Data with Pig


This lab explores using Pig to navigate through HDFS and explore a dataset.

Table 10. About this Lab

Objective: Use Pig to navigate through HDFS and explore a dataset.

File locations: whitehouse/visits.txt in HDFS

Successful outcome: You will have written several Pig scripts that analyze and query
the White House visitors’ data, including a list of people who
visited the President.

Before you begin: At a minimum, complete steps 1 and 2 of the Getting Started with
Pig lab.

Related lesson: Introduction to Pig

Perform the following steps:


Step 1: Load the White House Visitor Data
1.1. You will use the TextLoader to load the visits.txt file. From the Pig Grunt shell, define
the following LOAD relation:
# pig

grunt> A = LOAD '/user/root/whitehouse/' USING TextLoader();

Step 2: Count the Number of Lines


2.1. Define a new relation named B that is a group of all the records in A:
grunt> B = GROUP A ALL;

2.2. Use DESCRIBE to view the schema of B.


grunt> DESCRIBE B;

What is the datatype of the group field? _____________________


Where did this datatype come from?
_______________________________________________________
Answer: The group field is a chararray because it is just the string “all” and is a result of
performing a GROUP ALL.
2.3. Why does the A field of B contain no schema? ______________________
Answer: The A field of B contains no schema because the A relation has no
schema.


2.4. How many groups are in the relation B? ______________


Answer: The B relation can only contain one group because it is a grouping of every
single record. Note that the A field is a bag, and A will contain any number of
tuples.
2.5. The A field of the B tuple is a bag of all of the records in visits.txt. Use the COUNT
function on this bag to determine how many lines of text are in visits.txt:
grunt> A_count = FOREACH B GENERATE 'rowcount', COUNT(A);

NOTE: The 'rowcount' string in the FOREACH statement is simply to
demonstrate that you can have constant values in a GENERATE clause.
It is certainly not necessary; it just makes the output nicer to read.

2.6. Use DUMP on A_count to view the result. The output should look like:
grunt> DUMP A_count;

(rowcount,447598)

We can now conclude that there are 447,598 rows of text in visits.txt.
Step 3: Analyze the Data’s Contents
3.1. We now know how many records are in the data, but we still do not have a clear
picture of what the records look like. Let’s start by looking at the fields of each record.
Load the data using PigStorage(',') instead of TextLoader():
grunt> visits = LOAD '/user/root/whitehouse/' USING PigStorage(',');

This will split up the fields by comma.


3.2. Use a FOREACH...GENERATE command to define a relation that is a projection of the
first 10 fields of the visits relation.
grunt> firstten = FOREACH visits GENERATE $0..$9;

3.3. Use LIMIT to display only 50 records then DUMP the result. The output should be 50
tuples that represent the first 10 fields of visits:
grunt> firstten_limit = LIMIT firstten 50;
grunt> DUMP firstten_limit;

(PARK,ANNE,C,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,RYAN,C,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,MAGGIE,E,U51510,0,VA,10/24/2010 14:53,B0402,,)
(PARK,SIDNEY,R,U51510,0,VA,10/24/2010 14:53,B0402,,)
(RYAN,MARGUERITE,,U82926,0,VA,2/13/2011 17:14,B0402,,)
(WILE,DAVID,J,U44328,,VA,,,,)
(YANG,EILENE,D,U82921,,VA,,,,)
(ADAMS,SCHUYLER,N,U51772,,VA,,,,)
(ADAMS,CHRISTINE,M,U51772,,VA,,,,)
(BERRY,STACEY,,U49494,79029,VA,10/15/2010 12:24,D0101,10/15/2010 14:06,D1S)

NOTE: Because LIMIT uses an arbitrary sample of the data, your
output will contain different names, but the format should look similar.
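
As an optional aside (an assumption, not part of the original lab), you can make the sample deterministic by sorting before limiting, since a LIMIT that follows an ORDER returns the first n records of the ordered relation:

grunt> firstten_sorted = ORDER firstten BY $0;
grunt> firstten_limit = LIMIT firstten_sorted 50;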

Notice from the output that the first three fields are the person’s name. The next
seven fields are a unique ID, badge number, access type, time of arrival, post of
arrival, time of departure, and post of departure.
Step 4: Locate the POTUS (President of the United States of America)
4.1. There are 26 fields in each record, and one of them represents the visitee (the person
being visited in the White House). Your goal now is to locate this column and determine
who has visited the President of the United States. Define a relation that is a projection
of the last seven fields ($19 to $25) of visits. Use LIMIT to only output 500 records. The
output should look like:
grunt> lastfields = FOREACH visits GENERATE $19..$25;
grunt> lastfields_limit = LIMIT lastfields 500;
grunt> DUMP lastfields_limit;

(OFFICE,VISITORS,WH,RESIDENCE,OFFICE,VISITORS,HOLIDAY OPEN HOUSE/)
(OFFICE,VISITORS,WH,RESIDENCE,OFFICE,VISITORS,HOLIDAY OPEN HOUSES/)
(OFFICE,VISITORS,WH,RESIDENCE,OFFICE,VISITORS,HOLIDAY OPEN HOUSE/)
(CARNEY,FRANCIS,WH,WW,ALAM,SYED,WW TOUR)
(CARNEY,FRANCIS,WH,WW,ALAM,SYED,WW TOUR)
(CARNEY,FRANCIS,WH,WW,ALAM,SYED,WW TOUR)
(CHANDLER,DANIEL,NEOB,6104,AGCAOILI,KARL,)

It is not necessarily obvious from the output, but field $19 in the visits relation
represents the visitee. Even though you selected 500 records in the previous step,
you may or may not see POTUS in the output above. (The White House has
thousands of visitors each day, but only a few meet the President.)
4.2. Use FILTER to define a relation that only contains records of visits where field $19
matches POTUS. Limit the output to 500 records. The output should include only visitors
who met with the President. For example:
grunt> potus = FILTER visits BY $19 MATCHES 'POTUS';
grunt> potus_limit = LIMIT potus 500;
grunt> DUMP potus_limit;

(ARGOW,KEITH,A,U83268,,VA,,,,,2/14/2011 18:42,2/16/2011 16:00,2/16/2011
23:59,,154,LC,WIN,2/14/2011 18:42,LC,POTUS,,WH,EAST
ROOM,THOMPSON,MARGRETTE,,AMERICA'S GREAT OUTDOORS ROLLOUT EVENT
,5/27/2011,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,)
(AYERS,JOHNATHAN,T,U84307,,VA,,,,,2/18/2011 19:11,2/25/2011 17:00,2/25/2011
23:59,,619,SL,WIN,2/18/2011 19:11,SL,POTUS,,WH,STATE
FLOO,GALLAGHER,CLARE,,RECEPTION
,5/27/2011,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,)

Step 5: Count the POTUS Visitors


5.1. Let’s discover how many people have visited the President. To do this, we need to
count the number of records in visits where field $19 matches POTUS. See if you can
write a Pig script to accomplish this. Use the potus relation from the previous step as a
starting point. You will need to use GROUP ALL and then a FOREACH projection that uses
the COUNT function.
If successful, you should get 21,819 as the number of visitors to the White House who
visited the President.
Solution:
grunt> potus = FILTER visits BY $19 MATCHES 'POTUS';
grunt> potus_group = GROUP potus ALL;
grunt> potus_count = FOREACH potus_group GENERATE COUNT(potus);
grunt> DUMP potus_count;

Step 6: Finding People Who Visited the President


6.1. So far you have used DUMP to view the results of your Pig scripts. In this step, you
will save the output to a file using the STORE command.
6.2. Now FILTER the relation by visitors who met with the President:
grunt> potus = FILTER visits BY $19 MATCHES 'POTUS';

6.3. Define a projection of the potus relationship that contains the name and time of
arrival of the visitor:
grunt> potus_details = FOREACH potus GENERATE
(chararray) $0 AS lname:chararray,
(chararray) $1 AS fname:chararray,
(chararray) $6 AS arrival_time:chararray,
(chararray) $19 AS visitee:chararray;

6.4. Order the potus_details projection by last name:


grunt> potus_details_ordered = ORDER potus_details BY lname ASC;


6.5. Store the records of potus_details_ordered into a folder named potus, using
a comma delimiter:
grunt> STORE potus_details_ordered INTO 'potus' USING PigStorage(',');

6.6. View the contents of the potus folder:


grunt> ls potus
hdfs://sandbox.hortonworks.com:8020/user/root/potus/_SUCCESS<r 3> 0
hdfs://sandbox.hortonworks.com:8020/user/root/potus/part-r-00000<r 3>
501378

6.7. Notice that there is a single output file, so the Pig job was executed with one
reducer. View the contents of the output file using cat:
grunt> cat potus/part-r-00000

The output should be in a comma-delimited format and should contain the last
name, first name, time of arrival (if available), and the string POTUS:
CLINTON,WILLIAM,,POTUS
CLINTON,HILLARY,,POTUS
CLINTON,HILLARY,,POTUS
CLINTON,HILLARY,,POTUS
CLONAN,JEANETTE,,POTUS
CLOOBECK,STEPHEN,,POTUS
CLOOBECK,CHANTAL,,POTUS
CLOOBECK,STEPHEN,,POTUS
CLOONEY,GEORGE,10/12/2010 14:47,POTUS

Step 7: View the Pig Log Files


7.1. Each time you execute a DUMP or STORE command, a MapReduce job is executed
on your cluster. You can view the log files of these jobs in the JobHistory UI. Point your
browser to http://sandbox:19888/.
7.2. Click on the job’s ID to view the details of the job and its log files.


Result: You have written several Pig scripts to analyze and query the data in the White
House visitors’ log. You should now be comfortable with writing Pig scripts with the Grunt
shell and using common Pig commands like LOAD, GROUP, FOREACH, FILTER, LIMIT,
DUMP, and STORE.


Lab: Splitting a Dataset


This lab explores splitting a dataset, using White House visitor data, and looking for members
of Congress.

Table 11. About this Lab

Objective: Research the White House visitor data and look for members of
Congress.

File locations: n/a

Successful outcome: Two folders in HDFS, congress and not_congress, containing a
split of the White House visitor data.

Before you begin: You should have the White House visitor data in HDFS in
/user/root/whitehouse/visits.txt.

Related lesson: Advanced Pig Programming

Perform the following steps:


Step 1: Explore the Comments Field
1.1. In this step, you will explore the comments field of the White House visitor data.
From the Pig Grunt shell, start by loading visits.txt:
# pig

grunt> cd whitehouse
grunt> visits = LOAD 'visits.txt' USING PigStorage(',');

1.2. Field $25 is the comments. Filter out all records where field $25 is null:
grunt> not_null_25 = FILTER visits BY ($25 IS NOT NULL);

1.3. Now define a new relation that is a projection of only column $25:
grunt> comments = FOREACH not_null_25 GENERATE $25 AS comment;

1.4. View the schema of comments and make sure you understand how this relation
ended up as a tuple with one field:
grunt> describe comments;
comments: {comment: bytearray}

Step 2: Test the Relation


2.1. A common Pig task is to test a relation to make sure it is consistent with what you
are intending it to be. But using DUMP on a big data relation might take too long or not be
practical, so define a SAMPLE of comments:
grunt> comments_sample = SAMPLE comments 0.001;

2.2. Now DUMP the comments_sample relation. The output should be non-null comments
about visitors to the White House, similar to:
grunt> DUMP comments_sample;

(ATTENDEES VISITING FOR A MEETING)
(FORUM ON IT MANAGEMENT REFORM/)
(FORUM ON IT MANAGEMENT REFORM/)
(HEALTH REFORM MEETING)
(DRIVER TO REMAIN WITH VEHICLE)

Step 3: Count the Number of Comments


3.1. The comments relation represents all non-null comments from visits.txt. Write Pig
statements that output the number of records in the comments relation. The correct
result is 222,839 records.
Solution:
comments_all = GROUP comments ALL;
comments_count = FOREACH comments_all GENERATE COUNT(comments);
DUMP comments_count;

Step 4: Split the Dataset

NOTE: Our end goal is to find visitors to the White House who are also
members of Congress. We could run our MapReduce job on the
entire visits.txt dataset, but it is common in Hadoop to split data into
smaller input files for specific tasks, which can greatly improve the
performance of your MapReduce applications. In this step, you will
split visits.txt into two separate datasets.

4.1. In this step, you will split visits.txt into two datasets: those that contain “CONGRESS”
in the comments field, and those that do not.
4.2. Use the SPLIT command to split the visits relation into two new relations named
congress and not_congress:
grunt> SPLIT visits INTO congress IF ($25 MATCHES '.* CONGRESS .*'),
not_congress IF (NOT ($25 MATCHES '.* CONGRESS .*'));

4.3. Store the congress relation into a folder named 'congress':
grunt> STORE congress INTO 'congress';

4.4. Similarly, STORE the not_congress relation into a folder named 'not_congress':


grunt> STORE not_congress INTO 'not_congress';

4.5. View the output folders using ls. The file sizes should be equivalent to the following:
grunt> ls congress

hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/_SUCCESS<r
3> 0
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/part-m-
00000<r 3> 45618
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/congress/part-m-
00001<r 3> 0
grunt> ls not_congress
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/_SUCCES
S<r 3> 0
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/part-m-
00000<r 3> 90741587
hdfs://sandbox.hortonworks.com:8020/user/root/whitehouse/not_congress/part-m-
00001<r 3> 272381

4.6. View one of the output files in congress and make sure the string “CONGRESS”
appears in the comment field:
grunt> cat congress/part-m-00000

Step 5: Count the Results


5.1. Write Pig statements that output the number of records in the congress relation. This
will tell us how many visitors to the White House have “CONGRESS” in the comments of
their visit log. The correct result is 102.

NOTE: You now have two datasets: one in 'congress', with 102
records, and the remaining records in the 'not_congress' folder.
These records are still in their original, raw format.

Solution:
grunt> congress_grp = GROUP congress ALL;
grunt> congress_count = FOREACH congress_grp GENERATE COUNT(congress);
grunt> DUMP congress_count;

Result: You have just split ‘visits.txt’ into two datasets, and you have also discovered that
102 visitors to the White House had the word “CONGRESS” in their comments field. We will
further explore these visitors in the next lab as we perform a join with a dataset containing
the names of members of Congress.


Lab: Joining Datasets


This lab explores joining two datasets in Pig.

Table 12. About this Lab

Objective: Join two datasets in Pig.

File locations: /root/devph/labs/Lab6.2

Successful outcome: A file of members of Congress who have visited the White House.

Before you begin: If you are in the Grunt shell, exit it using the quit command. In
this lab, you will write a Pig script in a text file.

Related lesson: Advanced Pig Programming

Perform the following steps:


Step 1: Upload the Congress Data
1.1. Put the file /root/devph/labs/Lab6.2/congress.txt into the whitehouse
directory in HDFS.
# hadoop fs -put /root/devph/labs/Lab6.2/congress.txt whitehouse

1.2. Use the hadoop fs -ls command to verify that the congress.txt file is in
whitehouse, and use hadoop fs -cat to view its contents. The file contains the names
of and other information about the members of the U.S. Congress.
# hadoop fs -ls whitehouse

# hadoop fs -cat whitehouse/congress.txt

Step 2: Create a Pig Script File


2.1. In this lab, you will not use the Grunt shell to enter commands. Instead, you will enter
your script in a text file. Start by opening the gedit text editor using the shortcut
provided on the left-hand toolbar of your VM.
2.2. Click the Save button and save the new, empty file as join.pig in the
devph/labs/Lab6.2 folder.

2.3. At the top of the file, add a comment:
--join.pig: joins congress.txt and visits.txt

Step 3: Load the White House Visitors


3.1. Define the following visitors relation, which will contain the first and last names of all
White House visitors:
visitors = LOAD 'whitehouse/visits.txt' USING PigStorage(',') AS
(lname:chararray, fname:chararray);

That is the only data we are going to use from visits.txt.


Step 4: Define a Projection of the Congress Data
4.1. Add the following load command that loads the ‘congress.txt’ file into a relation
named congress. The data is tab-delimited, so no special Pig loader is needed:
congress = LOAD 'whitehouse/congress.txt' AS (
full_title:chararray,
district:chararray,
title:chararray,
fname:chararray,
lname:chararray,
party:chararray
);

4.2. The names in visits.txt are all uppercase, but the names in congress.txt are not.
Define a projection of the congress relation that consists of the following fields:
congress_data = FOREACH congress GENERATE
district,
UPPER(lname) AS lname,
UPPER(fname) AS fname,
party;

Step 5: Join the Two Datasets


5.1. Define a new relation named join_contact_congress that is a JOIN of visitors and
congress_data. Perform the join on both the first and last names.

5.2. Use the STORE command to store the result of join_contact_congress into a
directory named ‘joinresult’.
Solution:
join_contact_congress = JOIN visitors BY (lname,fname),
congress_data BY (lname,fname);
STORE join_contact_congress INTO 'joinresult';

Step 6: Run the Pig Script


6.1. Save your changes to join.pig.
6.2. Open a Terminal window and change directories to the Joining Datasets lab folder:
# cd ~/devph/labs/Lab6.2

6.3. Run the script using the following command:


# pig join.pig

6.4. Wait for the MapReduce job to execute. When it is finished, write down the number
of seconds it took for the job to complete (by subtracting the StartedAt time from the
FinishedAt time) and write down the result: ___________________

6.5. The type of join used is also output in the job statistics. Notice the statistics output
has “HASH_JOIN” underneath the “Features” column, which means a hash join was
used to join the two datasets.
Step 7: View the Results
7.1. The output will be in the joinresult folder in HDFS. Verify that the folder was
created:
# hadoop fs -ls -R joinresult
-rw-r--r-- 3 root root 0 joinresult/_SUCCESS
-rw-r--r-- 3 root root 40892 joinresult/part-r-00000

7.2. View the resulting file:


# hadoop fs -cat joinresult/part-r-00000
The output should look like the following:
DUFFY SEAN WI07 DUFFY SEAN Republican
JONES WALTER NC03 JONES WALTER Republican
SMITH ADAM WA09 SMITH ADAM Democrat
CAMPBELL JOHN CA45 CAMPBELL JOHN Republican
CAMPBELL JOHN CA45 CAMPBELL JOHN Republican
SMITH ADAM WA09 SMITH ADAM Democrat

Step 8: Try Using Replicated on the Join


8.1. Delete the joinresult directory in HDFS:
# hadoop fs -rm -R joinresult

8.2. Modify your JOIN statement in join.pig so that it uses a replicated join. It should
look like this:
join_contact_congress = JOIN visitors BY (lname,fname),
congress_data BY (lname,fname) USING 'replicated';

8.3. Save your changes to join.pig and run the script again.
# pig join.pig

8.4. Notice this time that the statistics output shows Pig used a “REPLICATED_JOIN”
instead of a “HASH_JOIN”.
8.5. Compare the execution time of the REPLICATED_JOIN vs. the HASH_JOIN. Did you
have any improvement or decrease in performance?

NOTE: Using 'replicated' does not necessarily decrease the join time.
There are way too many factors involved, and this example is using
small datasets. The point is that you should try both techniques (if one
dataset is small enough to fit in memory) and determine which join
algorithm is faster for your particular dataset and use case.

Step 9: Count the Results


9.1. In join.pig, comment out the STORE command:
--STORE join_contact_congress INTO 'joinresult';

You have already saved the output of the JOIN, so there is no need to perform the
STORE command again.

9.2. Notice in the output of your join.pig script that we know which party the visitor
belongs to: Democrat, Republican, or Independent. Using the join_contact_congress
relation as a starting point, see if you can figure out how to output the number of
Democrat, Republican, and Independent members of Congress that visited the White
House. Name the relation counters and use the DUMP command to output the results:

join_group = GROUP join_contact_congress BY congress_data::party;
counters = FOREACH join_group GENERATE group, COUNT(join_contact_congress);
DUMP counters;

TIP: When you group the join_contact_congress relation, group it
by the party field of congress_data. You will need to use the ::
operator in the BY clause. It will look like: congress_data::party

9.3. The correct results are shown here:


(Democrat,637)
(Republican,351)
(Independent,2)

Step 10: Use the EXPLAIN Command


10.1. At the end of join.pig, add the following statement:
EXPLAIN counters;

If you do not have a counters relation, then use join_contact_congress instead.


10.2. Run the script again. The Logical, Physical, and MapReduce plans should display
at the end of the output.
10.3. How many MapReduce jobs did it take to run this job? _____________
Answer: Three MapReduce jobs: the first two jobs only require a map phase, and
the third job has both a map and a reduce phase.
Result: You should have a folder in HDFS named joinresult that contains a list of
members of Congress who have visited the White House (within the timeframe of the
historical data in visits.txt).


Lab: Preparing Data for Hive


This lab explores transforming and exporting a dataset for use with Hive.

Table 13. About this Lab

Objective: Transform and export a dataset for use with Hive.

File locations: /root/devph/labs/Lab6.3

Successful outcome: The resulting Pig script stores a projection of visits.txt in a folder
in the Hive warehouse named wh_visits.

Before you begin: You should have visits.txt in a folder named whitehouse in HDFS.

Related lesson: Advanced Pig Programming

Perform the following steps:


Step 1: Review the Pig Script
1.1. From a command prompt, change directories to the Preparing Data for Hive lab
folder:
# cd ~/devph/labs/Lab6.3/

1.2. View the contents of wh_visits.pig:


# more wh_visits.pig

1.3. Notice that all White House visitors who met with the President are in the potus
relation.
1.4. Notice that the project_potus relation is a projection of the last name, first name,
time of arrival, location, and comments from the visit.
Step 2: Store the Projection in the Hive Warehouse
2.1. Open wh_visits.pig with the gedit text editor.
2.2. Add the following command at the bottom of the file, which stores the
project_potus relation into a very specific folder in the Hive warehouse:
STORE project_potus INTO '/apps/hive/warehouse/wh_visits/';
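
As an aside (an assumption, not part of this lab): storing directly into the warehouse path works here because a Hive table will later be defined over that folder. If HCatalog is available and the Hive table already exists, a storer such as the following could write to the table by name instead (run pig with the -useHCatalog flag); this is a sketch, not a required step:

STORE project_potus INTO 'wh_visits' USING org.apache.hive.hcatalog.pig.HCatStorer();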

Step 3: Run the Pig Script


3.1. Save your changes to wh_visits.pig.
3.2. Run the script from the command line:
# pig wh_visits.pig

Step 4: View the Results


4.1. The wh_visits.pig script creates a directory in the Hive warehouse named
wh_visits. Use ls to view its contents:
# hadoop fs -ls /apps/hive/warehouse/wh_visits/
-rw-r--r-- 3 root hdfs 0 /apps/hive/warehouse/wh_visits/_SUCCESS
-rw-r--r-- 3 root hdfs 971339 /apps/hive/warehouse/wh_visits/part-m-
00000
-rw-r--r-- 3 root hdfs 142850 /apps/hive/warehouse/wh_visits/part-m-
00001

4.2. View the contents of one of the result files. It should look like the following:
# hadoop fs -cat /apps/hive/warehouse/wh_visits/part-m-00000
...
FRIEDMAN THOMAS 10/12/2010 12:08 WH PRIVATE LUNCH
BASS EDWIN 10/18/2010 15:01 WH
BLAKE CHARLES 10/18/2010 15:00 WH
OGLETREE CHARLES 10/18/2010 15:01 WH
RIVERS EUGENE 10/18/2010 15:01 WH

Result: You now have a folder in the Hive warehouse named wh_visits that contains a
projection of the data in visits.txt. We will use this file in an upcoming Hive lab.

Copyright © 2015, Hortonworks, Inc. All rights reserved.