Unit-V Pig Programming
PIG: Hadoop Programming Made Easier
Syllabus
Pig: Hadoop Programming Made Easier
• Admiring the Pig Architecture
• Going with the Pig Latin Application Flow
• Working through the ABCs of Pig Latin
• Pig Data Types and Operators
• Local and MapReduce (Distributed) Modes of Running Pig Scripts
Apache Pig Components
As shown in the figure, there are various components
in the Apache Pig framework. Let us take a look at
the major components.
Parser
Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks.
The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes, and the data flows are represented as edges.
Optimizer
After the output from the parser is retrieved, the logical plan (the DAG) is passed to the logical optimizer. The optimizer is responsible for carrying out logical optimizations such as projection pushdown and filter pushdown (applying projections and filters as early as possible).
Compiler
The compiler compiles the optimized logical plan produced by the optimizer into a series of MapReduce jobs.
Execution engine
After the logical plan is converted to MapReduce
jobs, these jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
Grunt shell
• The Grunt shell is Apache Pig's interactive shell.
• The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. A Pig script can be executed with the Grunt shell, which is the native shell provided by Apache Pig to execute Pig queries.
• We can also invoke shell commands from within Grunt using sh and fs.
Shell Commands:
sh Command
• We can invoke any shell command from the Grunt shell using the sh command. But make sure, we cannot execute the commands that are a part of the shell environment (ex − cd) using the sh command.
Syntax
The syntax of the sh command is:
grunt> sh shell command parameters
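For example, listing the files in the local working directory from Grunt (a minimal illustration):
grunt> sh ls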
fs Command
• Moreover, we can invoke any fs Shell
commands from the Grunt shell by using
the fs command.
• The syntax of the fs command is:
grunt> fs File System command parameters
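For example, listing the contents of the HDFS root directory (a minimal illustration):
grunt> fs -ls /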
File System Commands:
Command − Description
cat − Prints the contents of one or more files
cd − Changes the current directory
copyFromLocal − Copies a local file to the Hadoop file system
copyToLocal − Copies a file or directory from HDFS to the local file system
cp − Copies a file or a directory to another directory
fs − Accesses Hadoop's file system shell
ls − Lists files
mkdir − Creates a new directory
mv − Moves a file/directory to another directory
pwd − Prints the path of the current working directory
rm − Deletes a file or a directory
Utility Commands:
The Grunt shell also provides utility commands such as clear, help, history, quit, and set, as well as exec, kill, and run for controlling Pig scripts from the shell.
Pig Latin Data Model:
Tuple:
A tuple is a record that consists of an ordered sequence of fields, where the fields can be of any type. It is similar to a row in an RDBMS table.
Example − (raju, 30)
Relation
Pig Latin statements work with relations. A relation is the outermost structure of the data model, and it is defined as a bag of tuples.
The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
• A relation is a bag
• A bag is a collection of tuples
• A tuple is an ordered set of fields
Working through the ABCs of Pig Latin
• Pig Latin is the language for Pig programs.
• Pig translates the Pig Latin script into MapReduce jobs that can be executed within the Hadoop cluster.
• The Pig Latin development team followed three key design principles:
1. Keep it simple
2. Make it smart
3. Don't limit development
Keep it Simple
• Pig Latin is an abstraction for MapReduce that
simplifies the creation of parallel programs on the
Hadoop cluster for data flows and analysis.
• Complex tasks may require a series of
interrelated data transformations — such series
are encoded as data flow sequences.
• Writing Pig Latin scripts instead of Java
MapReduce programs makes these programs
easier to write, understand, and maintain
because
a) you don't have to write the job in Java,
b) you don't have to think in terms of MapReduce, and
c) you don't need to come up with custom code to support rich data types.
Make it smart
• The Pig Latin compiler transforms a Pig Latin program into a series of Java MapReduce jobs.
• The compiler can optimize the execution of these
Java MapReduce jobs automatically, allowing the
user to focus on semantics rather than on how to
optimize and access the data.
• For example, SQL is set up as a declarative query
that you use to access structured data stored in
an RDBMS. The RDBMS engine first translates the
query to a data access method and then looks at
the statistics and generates a series of data
access approaches. The cost-based optimizer
chooses the most efficient approach for execution
Don’t limit development
• Make Pig extensible so that developers can add
functions to address their particular business
problems.
• Traditional RDBMS data warehouses make use of
the ETL data processing pattern, where you extract
data from outside sources, transform it to fit your
operational needs, and then load it into the end
target, whether it’s an operational data store, a
data warehouse, or another variant of database.
• With big data, the language for Pig data flows goes
with ELT instead: Extract the data from your
various sources, load it into HDFS, and then
transform it as necessary to prepare the data for
further analysis
Going with the Pig Latin Application Flow
Pig Latin is a dataflow language, where you define
a data stream and a series of transformations
that are applied to the data as it flows through
your application
This is in contrast to a control flow language (like
C or Java), where you write a series of
instructions.
In control flow languages, we use constructs like
loops and conditional logic (like an if statement).
You won't find loops and if statements in Pig Latin, which helps to make working with Pig significantly easier.
Working of Pig
The basic flow of a Pig program is:
1. LOAD: For a Pig program to access the data, you first tell Pig what file or files to use; once the data is accessible to Pig, you read it in with the LOAD operator.
2. TRANSFORM: You run the data through a set of transformations (for example, FILTER, GROUP, and FOREACH).
3. DUMP or STORE: You dump the results to the screen when you debug your programs. When your program goes into production, you simply change the DUMP call to a STORE call so that any results from running your programs are stored in a file for further processing.
PigStorage() function:
It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it uses tab ('\t') as the delimiter.
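A minimal sketch of this LOAD-TRANSFORM-DUMP/STORE flow, assuming a comma-delimited file student_data.txt with the fields used elsewhere in this unit:
grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FILTER A BY city == 'chennai';
DUMP B;
-- in production, replace DUMP with:
-- STORE B INTO 'output_dir' USING PigStorage(',');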
Pig Data Types
• Pig data types define the data model, i.e., how Pig thinks of the structure of the data it is processing.
• In Pig, the data model gets defined when the data is loaded, and it has a particular schema and structure.
• The Pig model is rich enough to handle most structures, such as hierarchical data and table-like structures.
• Pig data types are broken into two categories:
1. Scalar types
2. Complex types
• Scalar types contain a single value, whereas complex types contain other types (tuple, bag, and map).
Scalar Types
int - Represents a signed 32-bit integer. Example : 8
long - Represents a signed 64-bit integer. Example : 5L
float - Represents a signed 32-bit floating point. Example : 5.5F
double - Represents a 64-bit floating point. Example : 10.5
chararray - Represents a character array (string) in UTF-8 format. Example : 'hadoop'
bytearray - Represents a byte array (blob).
boolean - Represents a Boolean value. Example : true/false
datetime - Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00
biginteger - Represents a Java BigInteger. Example : 60708090709
bigdecimal - Represents a Java BigDecimal. Example : 185.98376256272893883
Null Values
Values for all the above data types can be NULL.
Apache Pig treats null values in a similar way as
SQL does.
A null can be an unknown value or a non-existent
value.
It is used as a placeholder for optional values.
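For example, null checks can be used directly in a filter (a small sketch, reusing the student relation A loaded earlier):
B = FILTER A BY city IS NOT NULL;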
Bag:
A bag is a collection of tuples, which are unordered. Bag constants are constructed using braces, with the tuples in the bag separated by commas.
Syntax: Inner bag
{ tuple [, tuple …] }
Terms:
{ } - An inner bag is enclosed in curly brackets { }
tuple - A tuple
Example :
• {(raju,30),(Mohhammad,45)}
Key Points about Bags:
• A bag can have duplicate tuples.
• A bag can have tuples with differing numbers of fields. However, if
Pig tries to access a field that does not exist, a null value is
substituted.
• A bag can have tuples with fields that have different data types.
However, for Pig to effectively process bags, the schemas of the
tuples within those bags should be the same. For example, if half of
the tuples include chararray fields while the other half include
float fields, only half of the tuples will participate in any kind of
computation because the chararray fields will be converted to null.
• Bags have two forms: outer bag (or relation) and inner bag
Example: Outer Bag
In this example A is a relation or bag of tuples. You can think of this bag as an
outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
Comparison Operators:
== Equal − Checks if the values of two operands are equal. If yes, then the condition becomes true.
(a == b) is not true.
!= Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true.
(a != b) is true.
> Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true.
(a > b) is not true.
Load and Store Operators:
LOAD and STORE are used to determine how the data goes into and comes out of Pig.
Filtering Operators:
• FILTER
• DISTINCT
• FOREACH, GENERATE
• STREAM
FILTER
It is used to select required rows from a relation based on a
condition
Syntax:- relation-name1 = FILTER relation-name2 BY condition;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FILTER A BY city == 'chennai';
DUMP B;
DISTINCT
• To remove duplicate rows from a relation.
Syntax:- relation-name1 = DISTINCT relation-name2;
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = DISTINCT A;
DUMP B;
FOREACH, GENERATE
To generate data transformations based on columns
of data.
Syntax:
relation-name2 = FOREACH relation-name1 GENERATE (required data);
Example:
• grunt>
A = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
B = FOREACH A GENERATE id, firstname, city;
DUMP B;
STREAM
• The stream operator allows transforming data in a relation
using an external program or script.
• This is possible because Hadoop MapReduce supports "streaming".
Example:
C = STREAM A THROUGH 'cut -f 2';
This uses the Unix cut command to extract the second field of each tuple in A.
Grouping and Joining
Group:
The GROUP operator is used to group the data in one relation. It collects the data having the same key.
Syntax:
group_data = GROUP relation-name BY key;
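A small sketch of GROUP, reusing the student relation A loaded earlier; grouping by city collects all tuples that share the same city value:
• grunt>
G = GROUP A BY city;
DUMP G;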
ILLUSTRATE −
To view the step-by-step execution of a sequence of Pig statements.
If we need to test a script with a small sample of data, then we use it.
Syntax: ILLUSTRATE relation-name
Example : ILLUSTRATE student;
Boolean Operators
and − AND operation
or − OR operation
not − NOT operation
Pig does not support a boolean data type. However, the result
of a boolean expression (an expression that includes boolean
and comparison operators) is always of type boolean (true or
false).
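For example, a boolean expression built from comparison operators (a minimal sketch, reusing the student relation A):
B = FILTER A BY city == 'chennai' AND id > 5;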
Cast Operators
Cast operators enable you to cast or convert data from one type to another, as long as the conversion is supported.
For example, suppose you have an integer field, myint, which you want to convert to a chararray; you write the target type in parentheses before the field: (chararray)myint.
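A minimal sketch of casting (hypothetical single-column relation):
intdata = LOAD 'data' AS (myint:int);
casted = FOREACH intdata GENERATE (chararray)myint AS mystring;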
Example: GROUP with DESCRIBE
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
B = GROUP A BY f1;
DUMP B;
DESCRIBE B;
O/P: B: {group: int, A: {f1: int, f2: int, f3: int}}
Note on ambiguous field names after a JOIN: if relations A and B each have fields x and y, then after
C = JOIN A BY x, B BY x;
D = FOREACH C GENERATE y; -- which y?
the reference to y is ambiguous; Pig requires the disambiguated form A::y or B::y.
Flatten Operator
The FLATTEN operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and
bags in a way that a UDF cannot. Flatten un-nests tuples as
well as bags. The idea is the same, but the operation and
result is different for each type of structure.
For tuples, flatten substitutes the fields of a tuple in place of the
tuple. For example, consider a relation that has a tuple of the
form (a, (b, c)). The expression GENERATE $0, flatten($1), will
cause that tuple to become (a, b, c).
grunt> cat empty.bag
grunt> A = LOAD 'empty.bag' AS (b:bag{}, i:int);
grunt> B = FOREACH A GENERATE FLATTEN(b), i;
grunt> DUMP B;
grunt>
Note that flattening an empty bag eliminates the entire row, so DUMP B produces no output here.
Built-in Functions
• AVG:- Computes the average of numeric values in a single-column bag.
Syntax:- AVG(expression)
• COUNT:- Compute the Number of elements in a bag
Syntax:- COUNT(expression)
• DIFF:- Compare two fields in a table
Syntax:- DIFF(expression, expression)
• CONCAT:- Concatenates two expressions of identical types
Syntax:- CONCAT(expression, expression)
Example:-FOREACH A GENERATE concat(first_name,
second_name)
• MAX:- Computes the maximum of the numeric values or chararrays in a single-column bag. It requires a preceding GROUP statement for group maximums.
Syntax:- MAX(expression)
• MIN:- Computes the minimum of the numeric values or chararrays in a
single column bag.
Syntax:- MIN(expression)
• SIZE:- Used to compute the number of elements based on the data type.
It includes
null values also.
Syntax:- SIZE(expression)
• SUM:- Computes the sum of the numeric values in a single-column bag.
Syntax:- SUM(expression)
• TOKENIZE:- Splits a string in a single tuple (which contains a group of words) and outputs a bag of words.
Syntax:- TOKENIZE(expression)
Example: tokenizing the text "This is a hadoop class hadoop is a bigdata technology" yields the words:
(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Complete Pig Script for WordCount Program
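A typical word count in Pig Latin looks like this (a minimal sketch, assuming the input file is named input.txt):
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;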
Note:
You can see that with just 5 lines of Pig program, we have solved the word count problem very easily.
Calculate maximum recorded temperature by year
for weather dataset in Pig Latin:
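A minimal sketch, assuming a tab-delimited file sample.txt with (year, temperature, quality) fields as in the classic NCDC weather example:
-- tab is PigStorage's default delimiter, so no delimiter argument is needed
records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);
-- drop missing readings (9999) and bad quality codes (IN requires Pig 0.12+)
filtered = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
grouped = GROUP filtered BY year;
max_temp = FOREACH grouped GENERATE group, MAX(filtered.temperature);
DUMP max_temp;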
Local Mode:
In this mode, all the files are installed and run from your local host and local file system.
It executes in a single JVM.
To start Pig in local mode:
~$ pig -x local
MapReduce Mode (or) Distributed Mode
• In this mode, Apache Pig takes its input from HDFS paths only, and after processing the data it puts the output files on top of HDFS.
• In MapReduce mode of execution, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
• In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS.
• MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.
Syntax: To start Pig in MapReduce mode, the following command is used:
~$ pig -x mapreduce (or) ~$ pig
Apache Pig Execution Mechanisms (Pig Script Interfaces)
Apache Pig scripts can be executed in three ways, namely: interactive mode (typing statements at the Grunt shell), batch mode (running a .pig script file), and embedded mode (defining our own functions, i.e., UDFs, in a language such as Java).
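For example, the first two mechanisms from the command line (a small sketch, assuming a script file wordcount.pig):
~$ pig -x local                  # interactive: opens the Grunt shell
~$ pig -x local wordcount.pig    # batch: runs the whole script and exits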
Example Pig Script-1: List customer count and total sales per dept. store
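A minimal sketch that produces the output below, assuming the stores.txt file listed later in this section:
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grouped = GROUP data BY deptName;
result = FOREACH grouped GENERATE group, COUNT(data), SUM(data.purchaseAmount);
DUMP result;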
Output
• The output has dept. store, customer count,
total sales.
( R,1,1.1)
( S,3,9.2)
( T,2,8.9)
( Z,1,1.1)
List total Sales per customer
• grunt>
data = LOAD 'Documents/stores.txt' USING PigStorage(',') AS (customerName:chararray, deptName:chararray, purchaseAmount:float);
grouped = GROUP data BY customerName;
result = FOREACH grouped GENERATE group, COUNT(data), SUM(data.purchaseAmount);
DUMP result;
Output
• The output has customer id, total transactions
per customer, total sales.
(A,2,8.0)
(B,2,4.6000004)
(C,2,6.6)
(D,1,1.1)
Example: stores.txt
• cust_Id dstore spent
A S 3.3
A S 4.7
B S 1.2
B T 3.4
C Z 1.1
C T 5.5
D R 1.1
Example Pig Script-2
Consider the student data file (st.txt), with data in the following format: SID, Name, District, Age, Gender.
i) Write a PIG script to Display Names of all
female students
ii) Write a PIG script to find the number of
Students from Prakasham District
iii) Write a PIG script to Display District wise
count of all male students.
st.txt
SID Name District Age Gender
1 Raju Guntur 28 Male
2 Prudhvi West Godavari 29 Male
3 Indra Prakasham 28 Male
4 Ramana Prakasham 27 Male
5 Nagarjuna Nellore 29 Male
6 Ravindra Krishna 30 Male
7 Jyothi Guntur 27 Female
8 Lahari West Godavari 26 Female
9 Hema Prakasham 27 Female
Write a PIG script to Display Names of all female students
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
females = FILTER data BY gender == 'Female';
result = FOREACH females GENERATE name;
DUMP result;
Write a PIG script to find the number of Students from Prakasham District
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
prakasham = FILTER data BY district == 'Prakasham';
grouped = GROUP prakasham ALL;
std_count = FOREACH grouped GENERATE COUNT(prakasham);
DUMP std_count;
Write a PIG script to Display District wise count of all male students.
• grunt>
data = LOAD 'Documents/st.txt' USING PigStorage(',') AS (sid:int, name:chararray, district:chararray, age:int, gender:chararray);
males = FILTER data BY gender == 'Male';
grouped = GROUP males BY district;
std_count = FOREACH grouped GENERATE group, COUNT(males);
DUMP std_count;
Using JOIN Operation
customers = LOAD 'customer' USING PigStorage(',') AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'orders' USING PigStorage(',') AS (oid:int, date:chararray, customer_id:int, amount:int);
customer_orders = JOIN customers BY id, orders BY customer_id;
DUMP customer_orders;
To exercise more problems on Pig scripts, go to the link below:
• http://howt2talkt2apachepig.blogspot.com/