Module 4 - Pig
ANALYTICS
Part – 4
“PIG”
By: Dr. Rashmi L Malghan & Ms. Shavantrevva S Bilakeri
Agenda:
Apache PIG
Pig was introduced by Yahoo and was later made fully open source under the Apache
Software Foundation.
It also provides a bridge for querying data on Hadoop clusters but, unlike Hive, it
uses a scripting approach to make Hadoop data accessible to developers
and business users.
Apache Pig provides a high-level programming platform for developers to process
and analyze Big Data using user-defined functions and programming effort.
In January 2013 Apache released Pig 0.10.1, which works with Hadoop 0.20.X, 1.0.X
and 0.23.X.
MapReduce Way: Apache PIG:
• Focus on the data transformations rather than the underlying MapReduce implementation.
Y = GROUP ….;
……..
Z = FILTER …;
…..
DUMP Y; -- view result on screen
……..
STORE Z INTO 'temp';
Example Pig Latin Script: find the total distance travelled by a flight
-- Load the flight data
flight_data = LOAD 'path/to/flight_data' USING PigStorage(',') AS (date:chararray, distance:int);
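The slide shows only the load step. As a rough illustration of what the rest of such a script has to compute, the sketch below emulates the load and a whole-relation sum in plain Python. It is not Pig Latin; the sample rows and the name `total_distance` are made up for the example. In Pig Latin the same result is usually obtained with GROUP ... ALL followed by the SUM eval function.

```python
import csv
import io

# Hypothetical sample rows in the (date, distance) layout declared by the LOAD statement.
SAMPLE = "2024-01-01,500\n2024-01-02,750\n2024-01-03,250\n"


def total_distance(text):
    """Parse comma-separated (date, distance) rows and sum the distance column."""
    rows = csv.reader(io.StringIO(text))
    return sum(int(distance) for _date, distance in rows)


print(total_distance(SAMPLE))
```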
-- Load student data
student_data = LOAD 'path/to/student_data' USING PigStorage(',')
AS (rollno:chararray, name:chararray, gpa:float);
-- Keep only the students with gpa above 5.0
A = FILTER student_data BY gpa > 5.0;
From the filtered data, find the student name with the highest gpa.
Note: when no schema is declared with AS, fields are referenced by position as $0, $1, $2, ...
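The filter-then-pick-the-maximum logic can be sketched in plain Python as follows. This is only an emulation of the idea, not Pig Latin; the student rows and the function name `top_student` are hypothetical.

```python
# Hypothetical (rollno, name, gpa) tuples matching the schema in the LOAD statement.
students = [
    ("r01", "Asha", 8.2),
    ("r02", "Ravi", 4.5),
    ("r03", "Meena", 9.1),
]


def top_student(rows, threshold=5.0):
    """Keep rows with gpa above the threshold, then return the name with the highest gpa."""
    filtered = [row for row in rows if row[2] > threshold]
    if not filtered:
        return None  # nothing passed the filter
    return max(filtered, key=lambda row: row[2])[1]


print(top_student(students))
```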
Import/Export - Sqoop -
transfers data from an
RDBMS to HDFS (import) or from
HDFS to an RDBMS (export)
1. Dump operator
2. Describe operator
3. Explain operator
4. Illustrate operator
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results on the
screen. It is generally used for debugging purposes.
Example
grunt> describe student;
student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
ILLUSTRATE OPERATOR
The illustrate operator gives you the step-by-step execution
of a sequence of statements.
Syntax
grunt> illustrate Relation_name;
GROUP operator
The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.
Syntax
grunt> Group_data = GROUP Relation_name BY age;
Example
grunt> group_data = GROUP student_details by age;
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
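To make the shape of the grouped result concrete, here is a minimal plain-Python sketch of the same idea: each key is paired with a bag holding the complete tuples that share it. It reuses four of the rows shown above; the helper name `group_by` is made up, and the order inside a bag is not guaranteed in Pig itself.

```python
from collections import defaultdict

# Four of the student_details rows from the output above.
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
]


def group_by(rows, key_index):
    """Return (key, bag) pairs, where each bag holds the full tuples sharing that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return sorted(bags.items())


for age, bag in group_by(student_details, 3):
    print(age, bag)
```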
• The COGROUP operator works more or less in the same way as the GROUP operator.
• The only difference between the two operators is that the group operator is normally used
with one relation, while the cogroup operator is used in statements involving two or more
relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the
HDFS directory /pig_data/ .
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
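A small plain-Python sketch may help show how the result differs from GROUP: every key yields one bag per input relation, and a bag is empty when that relation has no tuple for the key. The helper name `cogroup`, the reduced (id, name, age) layout, and the employee rows are assumptions made for the example.

```python
def cogroup(left, left_key, right, right_key):
    """For every key in either input, return (key, left_bag, right_bag); a bag may be empty."""
    keys = sorted({row[left_key] for row in left} | {row[right_key] for row in right})
    return [
        (
            key,
            [row for row in left if row[left_key] == key],
            [row for row in right if row[right_key] == key],
        )
        for key in keys
    ]


# Hypothetical (id, name, age) rows standing in for the two files.
students = [(1, "Rajiv", 21), (2, "siddarth", 22)]
employees = [(1, "Robin", 22), (2, "BOB", 23)]

for key, student_bag, employee_bag in cogroup(students, 2, employees, 2):
    print(key, student_bag, employee_bag)
```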
The UNION operator of Pig Latin is used to merge the content of two relations.
To perform UNION operation on two relations, their columns and domains must
be identical.
Syntax: grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
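As a rough plain-Python sketch, a union simply appends the tuples of one relation to the other. Duplicates are kept, since Pig's UNION does not eliminate them. The column-count check mirrors the requirement stated above; the helper name `union` and the sample rows are hypothetical.

```python
def union(first, second):
    """Concatenate two relations (lists of tuples) after checking their column counts match."""
    arities = {len(row) for row in first + second}
    if len(arities) > 1:
        raise ValueError("relations must have the same number of columns")
    return first + second  # duplicates are kept on purpose


# Hypothetical (id, name) rows.
a = [(1, "Rajiv")]
b = [(1, "Rajiv"), (2, "Komal")]
print(union(a, b))
```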
The SPLIT operator is used to split a relation into two or more relations.
Syntax: grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Output
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
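The routing behaviour of SPLIT can be sketched in plain Python as below. Each tuple is tested against every condition, so it may land in more than one output relation, or in none. The helper name `split_relation`, the reduced (id, firstname, age) layout, and the two conditions are assumptions made for the example.

```python
def split_relation(rows, **conditions):
    """Route each row into every named output relation whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for row in rows:
        for name, condition in conditions.items():
            if condition(row):
                outputs[name].append(row)
    return outputs


# Hypothetical (id, firstname, age) rows based on the student names used in this module.
rows = [(4, "Preethi", 21), (6, "Archana", 23), (8, "Bharathi", 24)]

parts = split_relation(
    rows,
    under_24=lambda row: row[2] < 24,
    over_22=lambda row: row[2] > 22,
)
print(parts["under_24"])
print(parts["over_22"])
```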
The FOREACH operator is used to generate specified data transformations based on the
column data.
Syntax: grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
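In plain-Python terms, FOREACH ... GENERATE behaves like a per-tuple transformation. The sketch below projects three columns from each row; the helper name `foreach_generate` and the choice of columns are assumptions made for illustration.

```python
def foreach_generate(rows, generate):
    """Apply the generate function to every tuple and collect the resulting tuples."""
    return [generate(row) for row in rows]


# Two of the student_details rows used elsewhere in this module.
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
]

# Keep only (id, age, city) from each tuple.
projected = foreach_generate(student_details, lambda row: (row[0], row[3], row[5]))
print(projected)
```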
Example:
grunt> limit_data = LIMIT student_details 4;
Apache Pig - Eval Functions
Apache Pig provides various built-in functions, namely eval, load, store, math, string,
bag and tuple functions.
Apache Pig - Load & Store Functions
Apache Pig - Bag & Tuple Functions
Apache Pig - String Functions
Apache Pig - Date-time Functions
Apache Pig - Components
2. Batch mode: Create a Pig script to run in batch mode. Write the Pig Latin
statements in a file and save it with the .pig extension.
Executing Pig Script in Batch mode
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in the HDFS.
• Apache Pig provides extensive support for User Defined Functions (UDF’s).
• The UDF support is provided in six programming languages, namely, Java, Jython,
Python, JavaScript, Ruby and Groovy.
• For writing UDF’s, complete support is provided in Java and limited support is provided
in all the remaining languages.
• Since Apache Pig has been written in Java, the UDF’s written using Java language work
efficiently compared to other languages.
• In Apache Pig, we also have a Java repository for UDF’s named Piggybank.
• Users can call Piggybank functions in their Pig Latin scripts and
can contribute their own functions to Piggybank.
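As a minimal sketch of what a Python UDF can look like, the function below upper-cases a string and declares its output schema with the `outputSchema` decorator. The function name `to_upper` and the schema field name are made up. The fallback decorator is an assumption added only so that the file also runs outside Pig, where the `pig_util` module is not available.

```python
try:
    # pig_util is provided by Pig when the file runs as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback for running or testing the function outside Pig (an assumption of this sketch).
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("upper_name:chararray")
def to_upper(name):
    """Return the input string in upper case, passing nulls through unchanged."""
    if name is None:
        return None
    return name.upper()


print(to_upper("rajiv"))
```

In a Pig script, such a file would be made available with a REGISTER statement before the function is called inside FOREACH ... GENERATE.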
PIG EXECUTION : Load and Store data locally and on Hadoop
Step 1: Create a text file (input.txt) and add some content to it
Step 2: Open a .pig file and enter the following script in it
Step 8: grunt> ls
Output:
hdfs://192.168.159.101:9000/pig1/output/_SUCCESS<r 2> 0
hdfs://192.168.159.101:9000/pig1/output/part-r-00000<r 2> 127
Step 9: grunt> cat part-r-00000
Output:
2 i
2 am
1 hi
2 in
1 are
2 how
1 too
2 you
1 Data
2 good
1 hope
1 you.
1 Btech
1 about
1 doing
1 manipal
1 science.
1 studying
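The script itself is not reproduced in this section, but the output above has the shape of a word count: each line pairs a count with a token, and "you" and "you." are counted separately, which is consistent with splitting on whitespace only. Under that assumption, the computation can be sketched in plain Python as follows; the sample text and the name `word_count` are hypothetical.

```python
from collections import Counter


def word_count(text):
    """Split text on whitespace and count how often each token occurs."""
    return Counter(text.split())


# Hypothetical input text, not the input.txt used in the steps above.
sample = "hi how are you\ni am good how about you"

for word, count in word_count(sample).items():
    print(count, word, sep="\t")
```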